The problem
Importance, efficient solvability
Superset of LP. Many problems in nature are convex optimization problems. Many non-convex problems have convex equivalents: see the section on modelling/specifying optimization problems.
(Nesterov, Nemirovski) For any convex optimization problem, there exist self-concordant barrier functions; so interior point methods and barrier methods can be made applicable. This is a non-constructive proof: so there exists a polynomial time algorithm for every convex optimization problem, but you will find it with certainty only if you can find these self-concordant barrier functions.
Standard form
All
Convexity of feasible region X
Optimization fn is convex, constraint sets are convex sets. So, X is convex.
Geometry
\(\gradient f_0(x^{*})\), if
Identifying convex opt problems
Check convexity of feasible regions
Maybe compare with known convex sets (see vector spaces survey).
Any equality constraints should be linear.
Eg:
Properties
Local optimum is the global optimum
By contradiction: take a local optimum x with radius R; suppose there were a y more than R away from
Lagrangian dual functional
Can easily get the Lagrangian dual functional:
Certificate of optimality with strong duality
Aka KKT certificate. If feasible
Pf: From complementary slackness:
Bound norm of solution
All sublevel sets of
Eg: This technique was used by Ravikumar et al.\ in bounding the deviation of the solution of an l1-regularized logistic regression problem from the actual parameters defining an Ising model.
Dual problem
Strong Duality usually holds. ‘Constraint qualifications’ can tell you whether strong duality holds.
Strict feasibility Constraint qualification
(Slater) If there exists some strictly (primal) feasible
Actually, can relax constraint qualification to:
\exclaim{Applies to convex conic inequality constraints too!} Also, this is a way to get strong duality without using the global optimality criteria perspective.
From supporting hyperplanes view
Consider the supporting hyperplanes view of the dual problem (see dual problem section). For convex optimization problems, convexity of
When the primal is strictly feasible, the supporting hyperplane at
Unconstrained problems: algorithms
Problem, algorithm framework
Problem, Assumptions
Problem:
Algorithm framework
Produce sequence
Alternate notation:
As Iteratively solving gradient eqns
Can be interpreted as iterative methods for solving the optimality condition
Common assumptions
Assumptions about f
f is twice continuously differentiable, so dom(f) is open.
Strong convexity
If f is strongly convex,
Also guarantees that sublevel sets are bounded:
Initial point: Assumptions
Sublevel set
Descent methods
Algorithm
This is guaranteed to eventually come arbitrarily close to the minimum.
Descent direction
Find search direction
\label{descent:Search Direction} See later sections.
Search direction from optimality conditions of approximation
Aka Newton equations in some cases.
Maybe model f(x) with
For examples, see section on 2nd order approximation descent.
Line search for t
Restriction to a slice
Visualization
Plot
Exact line search
Backtracking
Parameters:
Start with t=1 (or small enough
Can be rewritten as
Guaranteed reduction in objective
As
So, when
Such a
As secant in epigraph of g(t)
Take
x changes along the search direction only, but you change
As
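The backtracking rule above can be sketched as follows; a minimal illustration with hypothetical names, where `alpha` and `beta` are the usual backtracking parameters:

```python
def backtracking(f, grad_fx, x, dx, alpha=0.1, beta=0.7):
    """Backtracking line search: shrink t until
    f(x + t*dx) <= f(x) + alpha * t * <grad f(x), dx>."""
    fx = f(x)
    slope = sum(g * d for g, d in zip(grad_fx, dx))  # directional derivative, < 0
    t = 1.0
    while f([xi + t * di for xi, di in zip(x, dx)]) > fx + alpha * t * slope:
        t *= beta
    return t

# Example: f(x) = x1^2 + 4*x2^2, descending along the negative gradient.
f = lambda x: x[0] ** 2 + 4 * x[1] ** 2
x = [1.0, 1.0]
g = [2 * x[0], 8 * x[1]]       # gradient at x
dx = [-gi for gi in g]         # descent direction
t = backtracking(f, g, x, dx)  # some t in (0, 1] satisfying the Armijo condition
```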
Variants
Rather than ensure that
Arbitrary choice
Using the backtracking rule or doing exact search during line search guarantees reduction in the objective (and therefore convergence), but they can be expensive. An alternative is to just use 1 as the step length.
The cost of doing so is that convergence is no longer guaranteed: it is possible, for example, that by taking step length 1 repeatedly, one cycles between the same pair of points, between which lies the optimal point. However, in practice, convergence is usually observed.
This technique is used, for example, in ‘iteratively reweighted least squares’, where each step involves solving a least squares problem (which may correspond to the local 2nd order approximation).
Steepest descent
Aka Gradient descent.
As 1st order approximation minimization
As
Descent direction
For a given
Stopping criterion
Use
Convergence
\(f(x^{(k)}) - p^{*} \leq c^{k}(f(x^{(0)}) - p^{*})\). \why
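This linear convergence can be observed numerically. A minimal sketch on a quadratic \(f(x) = x_1^2 + \gamma x_2^2\), where exact line search has the closed form \(t = g^{T}g / (g^{T}Hg)\) (toy example, hypothetical names):

```python
def grad_descent_quadratic(gamma, x0, iters=40):
    """Gradient descent with exact line search on f(x) = x1^2 + gamma*x2^2.
    Hessian is diag(2, 2*gamma); the exact step is t = g.g / (g.H.g)."""
    x1, x2 = x0
    history = []
    for _ in range(iters):
        g1, g2 = 2 * x1, 2 * gamma * x2
        gg = g1 * g1 + g2 * g2
        if gg < 1e-30:
            break
        gHg = 2 * g1 * g1 + 2 * gamma * g2 * g2
        t = gg / gHg                       # exact line search step
        x1, x2 = x1 - t * g1, x2 - t * g2
        history.append(x1 * x1 + gamma * x2 * x2)
    return (x1, x2), history

# gamma far from 1 => elongated level sets => zig-zag path, slow linear rate
_, hist = grad_descent_quadratic(gamma=10.0, x0=(10.0, 1.0))
```

Each iteration reduces the objective by a fixed factor, so the residue decreases geometrically, as the bound above predicts.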
Geometric view
2-dim functional example
Consider contours/ level sets of
Goodness for circular level set case
If the level set contours are circular, then
Zig-zagness of path towards optimum
In this case, despite finding the best step size, the sequence of points produced is not a direct line towards
Implications of choice of norm
If you choose a good
But, if you pick a bad norm, Eg: \exclaim{
2nd order approximation descent
Aka Newton method (reason described below); but not the Gauss-Newton method, which is a further approximation and is specific to least squares problems.
General search direction
Take
Search direction from Hessian
This is aka Newton’s method, as it corresponds to solving for the root of the optimality condition
Solving Newton equations fast
Solving the linear system of equations:
Correctly computing gradient and Hessian
See comments from the gradient descent case.
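A minimal 1-D sketch of the Newton iteration (the linear system degenerates to a scalar division here; the stopping criterion uses the Newton decrement; toy example with hypothetical names):

```python
import math

def newton_1d(fp, fpp, x, tol=1e-10, max_iter=50):
    """Newton's method in 1-D: each step solves the Newton equation
    f''(x) * dx = -f'(x); stop when the Newton decrement
    lambda^2 = f'(x)^2 / f''(x) is small."""
    for _ in range(max_iter):
        g, h = fp(x), fpp(x)
        lam2 = g * g / h      # Newton decrement squared
        if lam2 / 2 < tol:
            break
        x = x - g / h         # full (undamped) Newton step
    return x

# f(x) = e^x + e^{-x} is strictly convex with minimum at x = 0.
xstar = newton_1d(lambda x: math.exp(x) - math.exp(-x),
                  lambda x: math.exp(x) + math.exp(-x),
                  x=1.0)
```

The quadratic convergence phase is visible: once close, the error roughly squares each iteration.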
Geometric view
Consider the 2-dimensional example described in the ‘geometric view of steepest descent’ section.
Local approximation by ellipses
Note: picking
2nd order approximation ellipses
Choosing
Affine invariance
The Newton method is invariant to affine transformations of the input: can confirm by computing
Stopping criterion: Affine invariant
Take
Estimates proximity to
Equals length of
Also measures the
This is also affine invariant, unlike
Speed: comparison with 1st order methods
Computing the search direction makes 2nd order methods slow in general; but if some structure in
Each iteration of a 1st order method is usually much faster, but many more iterations are required.
Convergence: classical bounds
Assumptions
f strongly convex with constant m.
Linear decrease phase
Aka Damped Newton Phase.
This phase ends after at most
Most iterations require backtracking search.
Quadratically convergent phase
So, for
All iterations use step size
\exclaim{Observing these on a (maybe log) residue vs iteration and step size vs iteration plots is a good way of verifying correct implementation of gradients etc..!}
Overall bounds, defects
To achieve \(f(x) - p^{*} \leq \eps\): \(\frac{f(x^{(0)}) - p^{*}}{\gamma} + \log \log (\eps_0/\eps)\) iterations needed, where
Provides qualitative insight into behaviour of 2nd order approx descent.
But, constants m, L are usually unknown: so cannot say beforehand when convergence will happen.
Bounds are not affine invariant, even though the 2nd order approx descent method is.
Convergence for self concordant functions
(Nesterov, Nemirovski) Get better bounds, which don’t suffer from the defects of the classical analysis.
Alternating minimization
Consider f(x, y); repeat these steps: \
When minimization is done one coordinate at a time, it is called coordinate descent; if it is done one coordinate-set at a time, it is called block-coordinate descent.
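A minimal sketch of alternating minimization on a jointly convex function where each coordinate minimization has a closed form (toy example):

```python
def alternating_min(iters=60):
    """Alternating minimization of f(x, y) = (x-1)^2 + (y-2)^2 + x*y.
    Setting each partial derivative to zero gives the coordinate updates:
      argmin_x f = 1 - y/2,   argmin_y f = 2 - x/2.
    The joint minimum is (x, y) = (0, 2)."""
    x, y = 5.0, -5.0
    for _ in range(iters):
        x = 1 - y / 2   # minimize over x with y fixed
        y = 2 - x / 2   # minimize over y with x fixed
    return x, y

x, y = alternating_min()
```

Each sweep is a contraction here, so the iterates converge geometrically to the joint minimizer.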
Diagnosing error in code
Incorrect gradient computation: symptoms
Often algebraic mistakes happen while computing the gradient.
To detect such a case, compare with the numerically computed gradient: See numerical analysis survey.
Or, observe that the step size found in the direction of the gradient is very small.
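A minimal sketch of the numerical comparison: a central-difference estimate of the gradient, checked against the analytic one (hypothetical example function):

```python
def numerical_grad(f, x, h=1e-6):
    """Central-difference estimate of the gradient of f at x."""
    g = []
    for i in range(len(x)):
        xp = list(x); xp[i] += h
        xm = list(x); xm[i] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

# Compare an analytic gradient against the numerical estimate.
f = lambda x: x[0] ** 2 + 3 * x[0] * x[1]
x = [1.5, -2.0]
analytic = [2 * x[0] + 3 * x[1], 3 * x[0]]
numeric = numerical_grad(f, x)
```

A large discrepancy between the two typically indicates an algebraic mistake in the hand-derived gradient.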
Equality constrained problems
Problem, assumptions
Common Assumptions
f is twice differentiable, A has full row rank: can always get an equivalent problem which satisfies this.
Optimum
Solution strategies
Reduction to unconstrained optimization
Search direction from optimality conditions of approximation
See
For examples, see section on 2nd order approximation descent.
Local approximation by ellipsoid
Minimize
2nd order approximation descent
Aka Newton method. Minimize
Search direction
From optimality conditions for minimum of this approximation, get search direction:
Solving linear equation fast
Solving this linear equation can be slow in general:
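A minimal sketch of assembling and solving the KKT system \(\mat{H & A^{T} \\ A & 0}\mat{\Delta x \\ w} = \mat{-g \\ 0}\) for the feasible-start Newton step, using a small hand-rolled Gaussian elimination (toy example; in practice one exploits structure):

```python
def solve_linear(M, b):
    """Gaussian elimination with partial pivoting, for small dense systems."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            fac = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= fac * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def newton_step_eq(H, A, g):
    """Newton step for min f(x) s.t. Ax = b from a feasible point:
    solve [[H, A^T], [A, 0]] [dx; w] = [-g; 0]."""
    n, p = len(H), len(A)
    KKT = [[0.0] * (n + p) for _ in range(n + p)]
    rhs = [-gi for gi in g] + [0.0] * p
    for i in range(n):
        for j in range(n):
            KKT[i][j] = H[i][j]
        for k in range(p):
            KKT[i][n + k] = A[k][i]   # A^T block
            KKT[n + k][i] = A[k][i]   # A block
    sol = solve_linear(KKT, rhs)
    return sol[:n], sol[n:]

# min x1^2 + x2^2 s.t. x1 + x2 = 2, from the feasible point (2, 0):
# since f is quadratic, one Newton step lands on the optimum (1, 1).
dx, w = newton_step_eq(H=[[2.0, 0.0], [0.0, 2.0]], A=[[1.0, 1.0]], g=[4.0, 0.0])
```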
Line search maintains feasibility
Affine invariant stopping criterion
Use
Justification for use of
Analysis: Use Newton method on equiv unconstrained problem
Take
Using infeasible start
Primal, dual variable update view
Take residue
Compare with search direction in feasible start case, note the change in last element of RHS.
\exclaim{Not a descent alg:
Line search
Backtracking line search is conducted on
Stopping criterion
Ax = b and
Switch to feasible start alg
As soon as you get
Inequality constrained problems
Barrier methods
Make inequality constraints implicit
Consider constraint
Motivation using indicator fn
The ideal barrier to use is actually the indicator function
Logarithmic barriers
As
Optimize both distance to log barrier and f0
Take problem in standard form. Solve the vector optimization problem: \
Tradeoff: minimizing f0 and repulsion from barrier
Scalarize this to get the objective
For
Barrier algorithm
Start with some t. Solve the centering problem specified by t. Grow
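A minimal sketch of the outer loop on a 1-D toy problem where the centering step has a closed form (in general it is an inner Newton solve; `mu` is the growth factor for t):

```python
def barrier_method(t0=1.0, mu=10.0, eps=1e-8):
    """Barrier method sketch for: minimize x subject to x >= 1.
    The centering problem min_x t*x - log(x - 1) has the closed form
    x*(t) = 1 + 1/t. With m = 1 inequality constraint, the duality
    gap at x*(t) is m/t; stop once it is below eps."""
    t = t0
    x = 1.0 + 1.0 / t
    while 1.0 / t >= eps:      # duality gap m/t still too large
        t *= mu                # grow t; x*(t) warm-starts the next centering
        x = 1.0 + 1.0 / t
    return x

x = barrier_method()           # approaches the optimum p* = 1 from inside
```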
Interpretation using Lagrangian
Take the optimality condition: \ \(\gradient f_0(x^{*}) + t^{-1}\sum_i (-f_i(x^{*}))^{-1}\gradient f_i(x^{*}) + t^{-1}A^{T}m^{*} = 0\).
This can be seen as the minimum of a Lagrangian-like function:
So, \(p^{*} \geq g(l^{*}(t), m^{*}(t)) = L(x^{*}(t), l^{*}(t), m^{*}(t)) = f_0(x^{*}(t)) - m/t\). m/t is the duality gap, goes to 0 as
Primal, dual points on the central path
\((x^{*}(t))\) are primal points on the central path. \((l^{*}(t), m^{*}(t))\) are dual points on the central path.
Ideas for faster solution
To reduce the time taken to solve the convex optimization problem, you reduce: a] the time taken per iteration; b] the number of iterations, by using a clever initialization point.
Solving using the dual function
Take primal
Make extra inferences using KKT conditions.
Warm start
Can use the solution of a closely related optimization problem to solve the current problem. This idea is used in barrier method!
Sometimes this gives a better solution: a bad initialization point would have returned a relatively worse solution due to the number of iterations exceeding the limit specified in the maxiter parameter passed to the solver.
\part{Classes solved with convex programming}
Quasiconvex optimization problem
This can have locally optimal points which are not globally optimal: there can be plateaus: from properties of quasiconvex fn.
Generality of constraints
Can replace all quasi-convex sublevel sets, which are convex sets, with sublevel sets of convex functions. Eg: Consider the case of the linear fractional programs.
Solution using bisection
Take the epigraph form, fix t. Can solve this using bisection (see another section), with each optimization problem being solved using convex programming.
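A minimal sketch of the bisection loop; `feasible(t)` stands for an oracle that solves the convex feasibility problem "is there a feasible x with \(f_0(x) \leq t\)?" (here replaced by a trivial toy oracle):

```python
def quasiconvex_bisection(feasible, lo, hi, tol=1e-8):
    """Bisection on the optimal value t. Invariant: the problem is
    infeasible at lo and feasible at hi, so p* lies in [lo, hi]."""
    while hi - lo > tol:
        t = (lo + hi) / 2
        if feasible(t):
            hi = t
        else:
            lo = t
    return hi

# Toy oracle: minimizing f0(x) = |x - 3| over x in [0, 10];
# 'exists x with |x - 3| <= t' holds iff t >= 0, so p* = 0.
p = quasiconvex_bisection(lambda t: t >= 0, lo=-1.0, hi=10.0)
```

Each iteration halves the interval, so roughly \(\log_2((hi - lo)/tol)\) convex feasibility problems are solved.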
Second order cone program (SOCP)
Minimize a linear functional:
Eg: used in robust linear programming.
Second order cone constraints
Generality
Generalizes LP: Take
When
Conic inequalities convex program
Consider proper cones
Strong duality holds if Slater’s constraint qualification holds.
Conic form problem
Extends LP using conic inequalities:
Semidefinite programming (SDP)
Minimize linear fn
LMI’s define convex sets. Can also be viewed as a convex optimization problem specified using conic inequalities.
Generality
SOCP, which includes LP, can be reduced to SDP. Can replace the second order cone constraint with $\mat{(c_i^{T}x + d_i)I & A_ix + b_i\ (A_ix + b_i)^{T} & (c_i^{T}x + d_i)} \succeq 0$. \why
So, can flexibly specify SDP with semidefinite cone, quadratic cone, linear inequalities’ constraints; so called SQLP.
Recognizing SDP’s
First, can try manipulating objective to be a linear functional, perhaps by taking the epigraph form. Then,
Examples
Eigenvalue minimization, matrix norm minimization.
Dual SDP
Lagrangian
Dual of the dual is the primal! Use
Geometric programming
Conversion to convex form: use
Eg: used in finding the Perron-Frobenius eigenvalue:
\part{Non convex optimization}
Discrete optimization problems
Difficulty
The difficulty here arises from the fact that a huge (often exponentially large in the number of variables due to combinatorial explosion) number of assignments to the discrete variables should be considered in order to find the optimum.
General Strategies
Use exhaustive search.
Relaxation to allow continuous values
Constraints in an Integer program can be relaxed to allow variables to take on real values.
Graph based problems
Max flow, min cut problem. See graph theory survey.
Resource allocation
Often modelled with graphs. Edges indicate resource constraints or conflicts.
Maximum weight matching
Find the heaviest set of disjoint edges.
For bipartite graphs: if
Continuous variables: general strategies
The main difficulty here arises from the presence of a large number of local optima.
Or use local optimization attempts with random choices of initial points.
Or use a relaxation.
Convexification/ smoothing
Turn it into a convex optimization problem: eg change objective fn to
Dual of the dual
Dual of the dual is sometimes a convex relaxation to the original problem, besides being a lower bound to it.
Local approximation
Or approximate
Smoothing
Smoothing reduces irregularity of the function output, ie it reduces the depth of local minima.
Gaussian smoothing is frequently used:
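A minimal Monte Carlo sketch of Gaussian smoothing, \(f_\sigma(x) = \E_z[f(x + \sigma z)]\) with \(z \sim N(0,1)\) (toy wiggly objective; names are illustrative):

```python
import math, random

def smoothed(f, x, sigma=0.3, n=4000, seed=0):
    """Monte Carlo estimate of the Gaussian-smoothed value
    f_sigma(x) = E_z[f(x + sigma * z)], z ~ N(0, 1)."""
    rng = random.Random(seed)
    return sum(f(x + sigma * rng.gauss(0, 1)) for _ in range(n)) / n

# A wiggly objective: smoothing damps the sin ripples (their expectation
# under symmetric z is 0), leaving roughly x^2 + sigma^2 at x = 0.
f = lambda x: x * x + 0.5 * math.sin(20 * x)
val = smoothed(f, x=0.0)
```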
As sampling
Global optimization can be seen as sampling from the feasible set, using anything monotonic with
In exploring the feasible set with special attention towards finding an optimum, one would want most updates to improve the objective, while the remaining updates help get out of local minima.
Stochastic gradient descent
Often the objective function can be decomposed as follows:
Here, for each iteration, one chooses a data-point
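A minimal sketch on a 1-D least squares objective, where each update uses the gradient of a single randomly chosen term (toy data; names are illustrative):

```python
import random

def sgd_least_squares(data, lr=0.05, epochs=200, seed=0):
    """SGD on f(theta) = (1/n) * sum_i (theta*x_i - y_i)^2:
    each iteration picks one data point and steps along the
    (negative) gradient of that single term."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(epochs):
        for _ in range(len(data)):
            x, y = data[rng.randrange(len(data))]
            theta -= lr * 2 * (theta * x - y) * x   # gradient of one term
    return theta

# Points exactly on the line y = 2x, so the minimizer is theta = 2.
data = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0, -1.0)]
theta = sgd_least_squares(data)
```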
Comparison with stochastic gradient descent
Unlike gradient descent, which, using
Damping jitters
Sampling techniques may sometimes result in excessive oscillations in the value of
One may use a momentum parameter to dampen the abrupt changes in direction:
This is beneficial for the reason noted above: one wants most updates to improve the objective, while the remaining updates help get out of local minima.
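A minimal sketch of the momentum update: the velocity accumulates past gradients, damping abrupt changes of direction (toy quadratic; parameter names are illustrative):

```python
def momentum_descent(grad, x0, lr=0.02, beta=0.9, iters=500):
    """Gradient updates with a momentum (velocity) term:
      v <- beta * v - lr * grad(x);   x <- x + v."""
    x, v = x0, 0.0
    for _ in range(iters):
        v = beta * v - lr * grad(x)
        x = x + v
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2*(x - 3).
x = momentum_descent(lambda x: 2 * (x - 3.0), x0=0.0)
```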
Using distribution sampling techniques
See section on sampling from a distribution in randomized algorithms survey. Note that in sampling for optimization, algorithms may pay attention towards finding optima.
Local optimization
Result
The result is often highly sensitive to the initial value of
Techniques
Can use any convex optimization technique, like gradient descent or alternating minimization.
\part{Discrete and Combinatorial optimization}
Integer programming (IP)
An LP problem where variables can only take integer values. It is NP-hard to find a solution: you have combinatorial hardness.
Approximate with LP; solve it; round the solutions. \tbc
Randomized rounding
Round
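A minimal sketch: round each fractional value \(x_i \in [0,1]\) to 1 with probability \(x_i\), so the rounded solution matches the fractional one in expectation:

```python
import random

def randomized_round(x, rng):
    """Round each x_i in [0, 1] to 1 with probability x_i,
    so E[rounded_i] = x_i."""
    return [1 if rng.random() < xi else 0 for xi in x]

# The empirical mean over many roundings approaches the fractional solution.
rng = random.Random(0)
x = [0.2, 0.9, 0.5]
trials = [randomized_round(x, rng) for _ in range(20000)]
means = [sum(t[i] for t in trials) / len(trials) for i in range(3)]
```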
Optimal substructure problems
Aka Dynamic programming.
Applicability: Decision tree view
The problem can be cast as one of taking a sequence of decisions, and one wants to find the optimal sequence of decisions. So, essentially, one tries to find the optimal path through a decision tree. The number of decisions one needs to take is bounded by
Problems exhibit the ‘optimal substructure’ property, and also often the ‘overlapping subproblems’ property.
Optimal substructure
Optimal solutions of simpler subproblems can be compared in some way to find the overall optimal solution.
A problem corresponds to a decision tree
Eg: In case of shortest path problem:
Remembering subproblems used
Overlapping subproblems
The subproblems solved are repeated. This corresponds to the case where decision sub-paths to various leaves are actually identical. So it is profitable to remember solutions to subproblems.
Top down vs Bottom up
Top down solution
A top down solution can be easily expressed in terms of a recursive function
In doing this, if the ‘overlapping subproblem’ property holds, the algorithm memoizes, ie remembers optimal solutions to these subproblems whenever they are solved.
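A minimal top-down sketch on the classic rod-cutting problem (the decision at each step is the length of the first piece to cut; `lru_cache` does the memoization):

```python
from functools import lru_cache

# price[i] is the price of a rod piece of length i (toy price table).
price = [0, 1, 5, 8, 9, 10, 17, 17, 20]

@lru_cache(maxsize=None)
def best_revenue(n):
    """Optimal substructure: r(n) = max_{1<=i<=n} price[i] + r(n-i).
    The recursion re-solves the same subproblems many times, so
    memoizing them collapses the cost to O(n^2)."""
    if n == 0:
        return 0
    return max(price[i] + best_revenue(n - i) for i in range(1, n + 1))

r = best_revenue(8)   # optimal revenue for a rod of length 8
```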
Bottom up
This solution is only applicable when the ‘overlapping subproblem’ property holds.
The algorithm solves decision trees of the smallest depth, records their results and builds solutions to progressively larger decision trees. So, one goes from level
Tabular view
Suppose that any node in the decision tree has at most
First, one constructs a list or column corresponding to the consequences of
Then, one constructs a list corresponding to the consequences of
One does this inductively until one covers all decisions up to level
Time complexity
From description of the bottom up solution, it is clear that time/ space required is
Examples
Shortest path algorithm can be formulated as dynamic program - see graph theory survey.
FFT: See functional analysis ref.
Determining the most likely state sequence in the case of a HMM.
Branch and bound
Systematic enumeration of all candidate solutions, where large subsets of fruitless candidates are discarded en masse, by using upper and lower estimated bounds of the quantity being optimized.
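A minimal sketch on 0/1 knapsack: depth-first enumeration of include/exclude decisions, pruning any subtree whose upper bound (the fractional-knapsack relaxation) cannot beat the best value found so far (toy instance; names are illustrative):

```python
def knapsack_bb(values, weights, cap):
    """Branch and bound for 0/1 knapsack. The LP relaxation (allow
    fractional items, greedily by value/weight ratio) gives an upper
    bound used to discard fruitless subtrees en masse."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i] / weights[i], reverse=True)
    best = 0

    def bound(idx, cap_left, val):
        # Upper bound: fill remaining capacity greedily, allowing a fraction.
        b = val
        for i in order[idx:]:
            if weights[i] <= cap_left:
                cap_left -= weights[i]
                b += values[i]
            else:
                b += values[i] * cap_left / weights[i]
                break
        return b

    def dfs(idx, cap_left, val):
        nonlocal best
        best = max(best, val)
        if idx == n or bound(idx, cap_left, val) <= best:
            return                      # prune: subtree cannot improve best
        i = order[idx]
        if weights[i] <= cap_left:      # branch 1: include item i
            dfs(idx + 1, cap_left - weights[i], val + values[i])
        dfs(idx + 1, cap_left, val)     # branch 2: exclude item i

    dfs(0, cap, 0)
    return best

opt = knapsack_bb(values=[60, 100, 120], weights=[10, 20, 30], cap=50)
```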
With belief propagation
Rewrite as a problem of finding the mode of a distribution:
This is useful when
Used in combinatorial optimization.