2 General

The problem

Importance, efficient solvability

Superset of LP. Many problems in nature are convex optimization problems. Many non-convex problems have convex equivalents: see the section on modelling/ specifying optimization problems.

(Nesterov, Nemirovski) For any convex optimization problem, there exist self-concordant barrier functions; so interior point methods and barrier methods can be made applicable. This is a non-constructive proof: a polynomial time algorithm exists for every convex optimization problem, but you can construct it only if you can find such self-concordant barrier functions.

Standard form

\(\min f_0(x): f_i(x) \leq 0, Ax = b\).

All \(f_i(x)\) are convex.

Convexity of feasible region X

The objective fn is convex; the constraint sets are convex sets; their intersection X is therefore convex.

Geometry

\(\gradient f_0(x^{*})\), if nonzero, defines a supporting hyperplane for \(X\) at \(x^{*}\). Imagine contours of \(f_0\), with the unconstrained minimum outside \(X\), colliding with \(X\) at some point \(x^{*}\).

Identifying convex opt problems

Check convexity of feasible regions

Maybe compare with known convex sets (see vector spaces survey).

Any equality constraints should be linear.

Eg: \(\set{x: h_i(x) = b}\) is not a convex set when \(h_i(x)\) is a quadratic fn. In the standard form, such an equality could be replaced by 2 inequality constraints, both of which would need to represent convex sets.

Properties

Local optimum is the global optimum

By contradiction: take a local optimum \(x\) over a ball of radius \(R\); suppose there were \(y\) more than \(R\) away from \(x\) with \(f_0(y) < f_0(x)\); then one can conjure a point \(z\), a convex combination of \(y\) and \(x\), which is less than \(R\) away from \(x\) and has \(f_0(z) < f_0(x)\): a contradiction.

Lagrangian dual functional

Can easily get the Lagrangian dual functional \(g(l, m) = \inf_x L(x, l, m)\) by setting \(\gradient_x L(x, l, m) = 0\) and eliminating \(x\) from \(L(x, l, m)\).

Certificate of optimality with strong duality

Aka KKT certificate. If feasible \((x, l, m)\) satisfy the KKT conditions, they are optimal. This is a good way of solving a convex optimization problem.

Pf: From complementary slackness, \(f_0(x) = L(x, l, m)\); from the optimality condition and convexity of \(f_i, f_0\): \(g(l, m) = \inf_{x'} L(x', l, m) = L(x, l, m) = f_0(x)\). So, \(x\) is optimal.

Bound norm of solution

All sublevel sets of \(f_0\) are convex. So, find a ball of radius \(B\) around 0 which includes a sublevel set \(\set{x: f_0(x) \leq c}\) for some \(c > p^{*}\); then we can say that \(\|x^{*}\|_2 \leq B\).

Eg: This technique was used by Pradeep et al in bounding the deviation of the solution of an \(l_1\)-regularized logistic regression problem from the actual parameters defining an Ising model.

Dual problem

Strong Duality usually holds. ‘Constraint qualifications’ can tell you whether strong duality holds.

Strict feasibility Constraint qualification

(Slater) If there exists some strictly (primal) feasible x, strong duality holds. \exclaim{This is very general: some strictly feasible x should exist!} Generalizable to convex programs specified using conic inequalities.

Actually, can relax the constraint qualification to: \(\exists x: x \in relint(D)\), the relative interior being taken wrt the affine plane \(Ax - b = 0\).

\exclaim{Applies to convex conic inequality constraints too!} Also, this is a way to get strong duality without using the global optimality criteria perspective.

From supporting hyperplanes view

Consider the supporting hyperplanes view of the dual problem (see dual problem section). For convex optimization problems, convexity of \(f_i, f_0\) ensures that \(G = \set{(u, t)| \exists x: f_0(x) \leq t, f(x) \leq u}\) is convex. So, the supporting hyperplanes at \((0, p^{*})\) for both \(G\) and \(G' = \set{(u, t)| \exists x: f_0(x) \leq t, f(x) \leq 0}\) are the same.

When the primal is strictly feasible, the supporting hyperplane at \((0, p^{*})\) is non-vertical; so the intercept with the \(f_0\) axis is well defined, and the dual problem is feasible. Thence, Slater's condition.

Unconstrained problems: algorithms

Problem, algorithm framework

Problem, Assumptions

Problem: \(\min_x f(x)\), \(f\) convex.

Algorithm framework

Produce a sequence \(x^{(k)} \in dom(f), k \in Z_{+}\), with \(f(x^{(k)}) \to p^{*}\). Input to the algorithm: starting point \(x^{(0)}\).

Alternate notation: \(x^{+}\) for the updated iterate.

As Iteratively solving gradient eqns

Can be interpreted as iterative methods for solving the optimality condition \(\gradient f(x) = 0\): a set of non-linear equations.

Common assumptions

Assumptions about f

f is twice continuously differentiable, so dom(f) is open.

\(p^{*}\) is attained.

Strong convexity

If \(f\) is strongly convex, \(f(x) - p^{*} \leq (2m)^{-1}\|\gradient f(x)\|_2^{2}\); so the RHS can be used as a stopping criterion.

Also guarantees that sublevel sets are bounded: \(f(y) \geq f(x) + \gradient f(x)^{T}(y - x) + \frac{m}{2}\|y - x\|_2^{2}\); this ensures faster convergence.

Initial point: Assumptions

\(x^{(0)} \in dom(f)\).

Sublevel set \(S = \set{x: f(x) \leq f(x^{(0)})}\) is closed. This is usually hard to verify, except when all sublevel sets are closed, ie when the epigraph of \(f\) is closed. This is needed because some methods try to draw secants in the epigraph of \(f(x)\).

Descent methods

Algorithm

\(x^{(k+1)} = x^{(k)} + t^{(k)}\Delta x^{(k)}\) with \(f(x^{(k+1)}) < f(x^{(k)})\); do this repeatedly until the stopping criterion is met; in each iteration: find the step or search direction \(\Delta x\), then the step size/ length \(t\).

This is guaranteed to eventually come arbitrarily close to the minimum.

Descent direction

\(\Delta x\) is a descent direction if \(\gradient f(x)^{T}\Delta x < 0\).

Find search direction

\label{descent:Search Direction} See later sections.

Search direction from optimality conditions of approximation

Aka Newton equations in some cases.

Maybe model \(f(x)\) with \(\hat{f}(x)\) to find a search direction: find the optimality conditions for minimizing \(\hat{f}(x)\) or \(f(x)\), and find \(\Delta x\) which satisfies them exactly or approximately.

For examples, see section on 2nd order approximation descent.

Line search for t

Restriction to a slice

\(g(t) = f(x + t\Delta x)\) is \(f\) restricted to a slice along \(\Delta x\). This is convex as \(f\) is convex.

Visualization

Plot g(t) vs t, try to find t which minimizes this.

\(t = \arg\min_{t > 0} g(t)\): exact line search. This is usually expensive.

Backtracking

Parameters: \(a \in (0, 1/2), b \in (0, 1)\); \(b\) is the shrinkage parameter.

Start with \(t = 1\) (or a small enough \(t\) to guarantee that \(x + t\Delta x \in dom(f)\)); repeat \(t := bt\) until the stopping criterion is met: \(f(x + t\Delta x) < f(x) + at\gradient f(x)^{T}\Delta x\). With shrinkage of \(t\), the decrease required by the RHS becomes smaller. This is aka Armijo's rule.

Can be rewritten as \(f(x + t\Delta x) - f(x) < a h(t)\), where \(h(t) = t\gradient f(x)^{T}\Delta x\) is the change in \(f\) for step \(t\) according to a linear approximation of \(f\). This form is useful when \(f()\) is composed of differentiable and non-differentiable parts.
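
As a concrete illustration, a minimal Python sketch of this backtracking rule (the names f, grad_f and the default parameter values are assumptions, not from the text; f is assumed to return np.inf outside dom(f)):

    import numpy as np

    def backtracking_line_search(f, grad_f, x, dx, a=0.3, b=0.8):
        """Return t with f(x + t*dx) < f(x) + a*t*grad_f(x)^T dx (Armijo's rule)."""
        t = 1.0
        slope = grad_f(x) @ dx   # must be < 0 if dx is a descent direction
        while f(x + t * dx) >= f(x) + a * t * slope:
            t *= b               # shrink the step until sufficient decrease holds
        return t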

Guaranteed reduction in objective

As \(\Delta x\) is a descent direction, \(\gradient f(x)^{T}\Delta x < 0\).

So, when \(f(x + t\Delta x) < f(x) + at\gradient f(x)^{T}\Delta x\), we have \(f(x + t\Delta x) < f(x)\); so it is a step size which definitely reduces \(f()\).

Such a t will always exist: you keep decreasing t until that happens.

As secant in epigraph of g(t)

Take \(g(t): R \to R\): a one-dim fn. Consider \(\partial_t g(t)\) at \(t = 0\): \(\partial_t g(0) = \gradient_x f(x)^{T}\Delta x \leq 0\). \(g(0) + at\partial_t g(0) = f(x) + at\gradient f(x)^{T}\Delta x\): when \(a = 1\), this is the tangent to \(g(t)\) at \(t = 0\), and therefore to \(f\) along the slice. For \(a < 1\), \(f(x) + at\gradient f(x)^{T}\Delta x\) is a secant: you are making the slope less negative.

\(x\) changes along the search direction only, but you change \(t \in [0, t_0]\), to get \(t\) which is close (from below) to the intersection of a secant in the epigraph of \(g(t)\) with \(g(t)\).

As \(f(x) + at\gradient f(x)^{T}\Delta x\) is a secant, the stopping condition should be met for some \(t\) if \(\Delta x\) is indeed a descent direction. This ensures that \(t\) is such that \(f(x + t\Delta x) - f(x) \leq at\gradient f(x)^{T}\Delta x\), the improvement achieved by using a secant.

Variants

Rather than ensure that \(t\) is such that \(f(x + t\Delta x) - f(x) \leq a h(t)\), the improvement achieved by using a secant, one can use a second order approximation instead of a linear approximation to define \(h\): this perspective is different from the 'secant view'.

Arbitrary choice

Using backtracking rule or doing exact search during line search guarantee reduction in the objective (and therefore convergence), but they can be expensive. An alternative can be to just use 1 as the step length.

The cost of doing so is that convergence is no longer guaranteed: it is possible for example that, taking the step 1 repeatedly, one cycles between the same pair of points between which lies the optimal point. However, in practice, convergence is usually observed.

This technique is used, for example, in 'iteratively reweighted least squares', where each step involves solving a least squares problem (which may correspond to the local 2nd order approximation).

Steepest descent

Aka Gradient descent.

As 1st order approximation minimization

As \(f(x + v) \approx f(x) + \langle\gradient f(x), v\rangle\), minimize \(\langle\gradient f(x), v\rangle\). Also see the geometric view section.

Descent direction

For a given norm \(\|.\|\), \(\Delta x = \arg\max_{v: \|v\| = 1}\langle -\gradient f(x), v\rangle\).

\(\langle\gradient f(x), \Delta x\rangle = -\max_{\|v\| = 1}\langle -\gradient f(x), v\rangle = -\|\gradient f(x)\|_{*} < 0\) (the dual norm). \exclaim{So this is a descent direction!}

Stopping criterion

Use \(\|\gradient f(x)\|_2 \leq \epsilon\) as the stopping criterion.

Convergence

\(f(x^{(k)}) - p^{*} \leq c^{k}(f(x^{(0)}) - p^{*})\), for some \(c < 1\). \why
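
A sketch of the full steepest descent loop in the Euclidean norm, reusing the backtracking routine sketched earlier; the tolerance eps on the gradient norm is an assumed parameter:

    import numpy as np

    def gradient_descent(f, grad_f, x0, eps=1e-6, max_iter=1000):
        """Gradient descent with backtracking line search."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad_f(x)
            if np.linalg.norm(g) <= eps:   # stopping criterion ||grad f(x)||_2 <= eps
                break
            dx = -g                        # steepest descent direction (l2 norm)
            t = backtracking_line_search(f, grad_f, x, dx)
            x = x + t * dx
        return x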

Geometric view

2-dim functional example

Consider contours/ level sets of \(f: R^{2} \to R\). The level set \(\set{x: f(x) = t}\) surrounds \(\set{x: f(x) = t - 1}\), etc. At \(x: f(x) = t\), \(\gradient f(x)\) is perpendicular to the level set \(\set{x: f(x) = t}\), pointing away from \(\set{x: f(x) = t - 1}\); the gradient descent direction \(-\gradient f(x)\) is also perpendicular, but it points towards \(\set{x: f(x) = t - 1}\).

Goodness for circular level set case

If the level set contours are circular, then \(-\gradient f(x)\) points straight towards the minimum level set. But consider ellipsoidal contours: then \(-\gradient f(x)\) passes through a few level sets \(\set{x: f(x) = t' < t}\), but it misses many smaller level sets. Eg: imagine level sets which are shaped like concentric rounded triangles.

Zig-zagness of path towards optimum

In this case, despite finding the best step size, the sequence of points produced is not a direct line towards \(x^{*}\), but forms a zig-zag path.

Implications of choice of norm

If you choose a good norm \(\|.\|\), the shape of whose unit ball approximates the shape of the level sets/ contours, \(\Delta x\) will be such that it points towards the minimal contour of \(f\), while still having a large enough inner product with \(-\gradient f(x)\) to guarantee that it is a descent direction. So, you can get to \(x^{*}\) more directly.

But, if you pick a bad norm, \exclaim{eg \(\|.\|_2\), which results in a reduction to gradient descent}, the path to \(x^{*}\) is longer.

2nd order approximation descent

Aka Newton method (reason described below), but not the Gauss-Newton method, which is a further approximation and is specific to the least squares problem.

General search direction

Take \(f_0(x + d) \approx f_0(x) + d^{T}\gradient f_0(x) + \frac{1}{2}d^{T}Hd\) for some \(H \succ 0\). If \(H\) is the 2nd order derivative, this is the Newton method described later. Often \(H\) is an easily computed approximation of \(\gradient^{2} f_0(x)\).

Search direction from Hessian

\(\Delta x = -\gradient^{2} f(x)^{-1}\gradient f(x)\). This minimizes the 2nd order approximation \(\hat{f}(x + v) = f(x) + \langle\gradient f(x), v\rangle + \frac{1}{2}v^{T}\gradient^{2} f(x)v\). So, \(f(x)\) is approximated with a quadratic curve!

This is aka Newton's method, as it corresponds to solving for the root of the optimality condition \(\gradient f(x) = 0\).

Solving Newton equations fast

Solving the linear system of equations \(\gradient^{2} f(x)\Delta x = -\gradient f(x)\) can be expensive in general: \(O(n^{3})\). But, in practice, one can often exploit structure in the \(\gradient^{2} f(x)\) matrix to find the (possibly approximate) search direction fast - this is very important. Can often use iterative methods to solve this.
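
A dense-algebra sketch of the damped Newton iteration (hess_f is an assumed Hessian routine; backtracking is reused from the earlier sketch; the \(\lambda(x)^{2}/2\) stopping test is discussed further below):

    import numpy as np

    def newton_method(f, grad_f, hess_f, x0, eps=1e-8, max_iter=50):
        """2nd order approximation descent (Newton's method) with backtracking."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad_f(x)
            dx = np.linalg.solve(hess_f(x), -g)   # Newton step: hess * dx = -grad
            lam_sq = -g @ dx                      # lambda(x)^2, see stopping criterion below
            if lam_sq / 2.0 <= eps:
                break
            t = backtracking_line_search(f, grad_f, x, dx)
            x = x + t * dx
        return x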

Correctly computing gradient and Hessian

See comments from the gradient descent case.

Geometric view

Consider the 2-dimensional example described in the ‘geometric view of steepest descent’ section.

Local approximation by ellipses

Note: picking \(\|x\| = (x^{T}Px)^{1/2}\) for \(P \succ 0\) yields ellipsoidal unit balls. This can be thought of as taking \(\hat{f}(x + v) = f(x) + \langle\gradient f(x), v\rangle + \frac{1}{2}v^{T}Pv\) and using \(\arg\inf_v \hat{f}(x + v)\) as the search direction.

2nd order approximation ellipses

Choosing \(\|x\| = (x^{T}\gradient^{2} f(x)x)^{1/2}\), the local Hessian norm, results in a reduction to 2nd order approximation descent. To see this, consider the 'constraints in the objective' form of \(\Delta x = \arg\max_{v: \|v\| = 1}\langle -\gradient f(x), v\rangle\).

Affine invariance

The Newton method is invariant to affine transformations of the input: can confirm by computing \(\gradient\) and \(\gradient^{2}\) for \(f(Ax' + b)\) and \(f(x)\), where \(Ax' + b = x\).

Stopping criterion: Affine invariant

Take \(\lambda(x) = (\gradient f(x)^{T}\gradient^{2} f(x)^{-1}\gradient f(x))^{1/2} = (-\gradient f(x)^{T}\Delta x)^{1/2}\); use \(\lambda(x)^{2}/2 \leq \epsilon\).

Estimates proximity to \(p^{*}\): \(f(x) - \inf_y \hat{f}(y) = \frac{1}{2}\lambda(x)^{2}\): from simple substitution.

Equals the length of \(\Delta x\) in the local Hessian norm: see the steepest descent connection.

Also measures the directional derivative along \(\Delta x\): \(\gradient f(x)^{T}\Delta x = -\lambda(x)^{2}\). So, it measures how close \(\gradient f(x)\) is to being 0, as it is at \(x^{*}\).

This is also affine invariant, unlike \(\|\gradient f(x)\|\). Pf: Consider \(x = Ax' + b\). Then \(D_{x'}f(Ax' + b) = D_{Ax'+b}f(Ax' + b)A = D_{x}f(x)A\); thence get \(\gradient_{x'}f(Ax' + b)\), and thence \(\gradient^{2}_{x'}f(Ax' + b) = D_{x'}\gradient_{x'}f(Ax' + b)\).

Speed: comparison with 1st order methods

Computing the search direction makes 2nd order methods slow in general; but if some structure in \(\gradient^{2} f(x)\) is exploited to make this faster, it becomes very fast.

Each iteration of a 1st order method is usually much faster, but many more iterations are required.

Convergence: classical bounds

Assumptions

f strongly convex with constant m.

\(\gradient^{2} f\) is Lipschitz continuous with constant \(L\). This measures how well \(f(y)\) can be approximated by the quadratic functional \(f(x) + \gradient f(x)^{T}(y - x) + \frac{1}{2}(y - x)^{T}\gradient^{2} f(x)(y - x)\).

\(\eta \in (0, m^{2}/L], \gamma > 0\).

Linear decrease phase

Aka Damped Newton Phase. \(\|\gradient f(x^{(k)})\|_2 \geq \eta \Rightarrow f(x^{(k+1)}) - f(x^{(k)}) \leq -\gamma\). Linear decrease in objective value.

This phase ends after at most \((f(x^{(0)}) - p^{*})/\gamma\) iterations.

Most iterations require backtracking search.

Quadratically convergent phase

\(\|\gradient f(x^{(k)})\|_2 < \eta \Rightarrow \frac{L}{2m^{2}}\|\gradient f(x^{(k+1)})\|_2 \leq (\frac{L}{2m^{2}}\|\gradient f(x^{(k)})\|_2)^{2}\). Quadratic decrease in gradient size.

So, for \(l > k\): \(\frac{L}{2m^{2}}\|\gradient f(x^{(l)})\|_2 \leq 2^{-2^{(l-k)}}\).

All iterations use step size t=1! \why

\exclaim{Observing these on a (maybe log) residue vs iteration and step size vs iteration plots is a good way of verifying correct implementation of gradients etc..!}

Overall bounds, defects

To achieve \(f(x) - p^{*} \leq \eps\): \(\frac{f(x^{(0)}) - p^{*}}{\gamma} + \log_2\log_2(\eps_0/ \eps)\) iterations are needed, where \(\eps_0\) is a function of \(m, L\) too.

Provides qualitative insight into behaviour of 2nd order approx descent.

But, constants m, L are usually unknown: so cannot say beforehand when convergence will happen.

Bounds are not affine invariant, even though the 2nd order approx descent method is.

Convergence for self concordant functions

(Nesterov, Nemirovski) Get better bounds, which don't suffer from the defects of the classical analysis.

Alternating minimization

Consider \(f_0(x, y)\); repeat these steps: \(x := \arg\min_x f_0(x, y)\), \(y := \arg\min_y f_0(x, y)\). Guarantees that the objective does not increase in any iteration.

When minimization is done one coordinate at a time, it is called coordinate descent; if it is done one coordinate-set at a time, it is called block-coordinate descent.
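
A minimal sketch of the two-block case, assuming the caller supplies the two exact partial minimizers (these names are hypothetical):

    def alternating_minimization(argmin_x, argmin_y, x0, y0, num_iters=100):
        """Alternate exact minimization over x (y fixed) and y (x fixed)."""
        x, y = x0, y0
        for _ in range(num_iters):
            x = argmin_x(y)   # x := argmin_x f0(x, y)
            y = argmin_y(x)   # y := argmin_y f0(x, y)
        return x, y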

Diagnosing error in code

Incorrect gradient computation: symptoms

Often algebraic mistakes happen while computing the gradient.

To detect such a case, compare with the numerically computed gradient: See numerical analysis survey.

Or, observe that the step size found in the direction of the gradient is very small.
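
A sketch of the numerical comparison, using central finite differences (the step h and tolerance tol are ad hoc assumptions):

    import numpy as np

    def check_gradient(f, grad_f, x, h=1e-6, tol=1e-4):
        """Compare an analytic gradient against central finite differences at x."""
        x = np.asarray(x, dtype=float)
        g_num = np.zeros_like(x)
        for i in range(x.size):
            e = np.zeros_like(x)
            e[i] = h
            g_num[i] = (f(x + e) - f(x - e)) / (2 * h)
        err = np.linalg.norm(g_num - grad_f(x)) / max(1.0, np.linalg.norm(g_num))
        return err <= tol, err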

Equality constrained problems

Problem, assumptions

\(\min f(x): Ax = b\). So, optimality conditions: \(\gradient f(x^{*}) + A^{T}v^{*} = 0\), \(Ax^{*} = b\).

Common Assumptions

f is twice differentiable, A has full row rank: can always get an equivalent problem which satisfies this.

Optimum \(p^{*}\) is attained.

Solution strategies

Reduction to unconstrained optimization

\(\min f(x): Ax = b \equiv \min_v f(Hv + \hat{x})\), where \(A\hat{x} = b\) and the columns of \(H\) span \(null(A)\).

Search direction from optimality conditions of approximation

See the 'Search direction from optimality conditions of approximation' discussion in the unconstrained case.

For examples, see section on 2nd order approximation descent.

Local approximation by ellipsoid

Minimize \(f(x + \Delta x) \approx \hat{f}(x + \Delta x) = f(x) + \gradient f(x)^{T}\Delta x + 0.5\Delta x^{T}P\Delta x\) subject to \(A(x + \Delta x) = b\), for \(P \succ 0\). From the optimality conditions for the minimum of this approximation, get the search direction: \(\mat{P & A^{T}\\ A & 0}\mat{\Delta x\\ w} = \mat{-\gradient f(x)\\ 0}\).

2nd order approximation descent

Aka Newton method. Minimize \(f(x + \Delta x) \approx \hat{f}(x + \Delta x) = f(x) + \gradient f(x)^{T}\Delta x + 0.5\Delta x^{T}\gradient^{2} f(x)\Delta x\) subject to \(A(x + \Delta x) = b\).

Search direction

From the optimality conditions for the minimum of this approximation, get the search direction: \(\mat{\gradient^{2} f(x) & A^{T}\\ A & 0}\mat{\Delta x\\ w} = \mat{-\gradient f(x)\\ 0}\).

Solving linear equation fast

Solving this linear equation can be slow in general: \(O(n^{3})\); but as in the unconstrained case, if you can exploit some special structure in the LHS matrix - maybe to find an approximate search direction - the search direction can be found fast: see the unconstrained case for details.
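
A dense sketch of one such step, forming and solving the KKT system above with generic linear algebra (a structure-exploiting or iterative solver would replace np.linalg.solve):

    import numpy as np

    def eq_newton_step(grad_f, hess_f, A, x):
        """Solve [[H, A^T], [A, 0]] [dx; w] = [-grad f(x); 0] at a feasible x (Ax = b)."""
        g, H = grad_f(x), hess_f(x)
        p, n = A.shape
        KKT = np.block([[H, A.T], [A, np.zeros((p, p))]])
        rhs = np.concatenate([-g, np.zeros(p)])
        sol = np.linalg.solve(KKT, rhs)
        return sol[:n], sol[n:]   # search direction dx, associated dual variable w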

Line search maintains feasibility

\(\Delta x \in null(A)\), so \(x + t\Delta x\) found during line search remains feasible. \exclaim{Feasibility always maintained!}

Affine invariant stopping criterion

Use \(\lambda(x) = (\Delta x^{T}\gradient^{2} f(x)\Delta x)^{1/2} = (-\gradient f(x)^{T}\Delta x)^{1/2}\). Use the last term to compare with the stopping criterion of the unconstrained minimization case. In general, it is not the same as \((\gradient f(x)^{T}\gradient^{2} f(x)^{-1}\gradient f(x))^{1/2}\).

Justification for the use of \(\lambda(x)^{2}/2\) as stopping criterion: just as in the unconstrained case, \(f(x) - p^{*} \approx f(x) - \inf_{Ay = b}\hat{f}(y) = \frac{1}{2}\lambda(x)^{2}\); and \(-\lambda(x)^{2}\) is also the directional derivative along \(\Delta x\).

Analysis: Use Newton method on equiv unconstrained problem

Take the \(\min f(x): Ax = b \equiv \min_v f(Hv + \hat{x})\) form. \(x^{(k)} = Hv^{(k)} + \hat{x}\). So, the analysis of convergence of the unconstrained problem applies here too!

Using infeasible start
Primal, dual variable update view

Take the residual \(r(x, m) = (\gradient f(x) + A^{T}m, Ax - b)\), and aim to drive it to 0 (eg \(\min_{x, m}\|r(x, m)\|\)): so, minimizing the objective while also getting close to feasibility. Try to get \(r(y + \Delta y) \approx r(y) + Dr(y)\Delta y = 0\). So, use the search direction from solving: \(\mat{\gradient^{2} f(x) & A^{T}\\ A & 0}\mat{\Delta x\\ w} = -\mat{\gradient f(x)\\ Ax - b}\), with \(w = m + \Delta m\), the new guess for the dual variable.

Compare with search direction in feasible start case, note the change in last element of RHS.

\exclaim{Not a descent alg: \(f(x^{+}) \geq f(x)\) possible!}

Backtracking line search is conducted on \(\|r(y)\|\). Stopping condition for the line search: \(\|r(y + t\Delta y)\| \leq (1 - \alpha t)\|r(y)\|\), for a certain \(\alpha\). So, the stopping criterion becomes stricter as \(t\) increases.

Stopping criterion

\(Ax = b\) and \(\|r(y)\| \leq \epsilon\).

Switch to feasible start alg

As soon as you get x(k):Ax(k)=b, switch to feasible start version of the algorithm: often ensures faster descent.

Inequality constrained problems

Barrier methods

Make inequality constraints implicit

Consider constraint \(f_i(x) \leq 0\). Can make this constraint implicit by including a barrier functional \(\phi(x)\) in the objective \(f_0(x)\): the constraint is then enforced by thus constraining \(dom(f_0(x) + \phi(x))\).

Motivation using indicator fn

The ideal barrier to use is actually the indicator function \(\phi(x) = I_{f_i(x) \leq 0}\), which is 0 if \(f_i(x) \leq 0\), \(\infty\) otherwise.

Logarithmic barriers

As \(t \to \infty\), \(-t^{-1}\sum_i\log(-f_i(x)) \to I_{f_i(x) \leq 0}\). So, take \(\phi(x) = -\sum_i\log(-f_i(x))\). This is a good barrier functional, as it is convex and twice continuously differentiable.

Optimize both distance to log barrier and f0

Take the problem in standard form. Solve the vector optimization problem: \(\min (f_0(x), \phi(x)): Ax = b\), where \(\phi(x) = -\sum_i\log(-f_i(x))\) is the log barrier, which is convex from composition rules.

Tradeoff: minimizing f0 and repulsion from barrier

Scalarize this to get the objective \(\min_x tf_0(x) + \phi(x)\): aka the centering problem. This is an equality constrained convex optimization problem. Bigger \(t\) tends to favor minimizing \(f_0(x)\), while allowing \(x\) to get closer to the barrier \(f_i(x) = 0\).

For \(t\) large enough, letting \(x\) get closer to the barrier does not affect the minimization of \(f_0(x)\): maybe the minimum is achieved at a point where some constraints are inactive.

Barrier algorithm

Start with some \(t\). Solve the centering problem specified by \(t\). Grow \(t\) by a factor \(\mu \approx 20\). Repeat until the stopping criterion is satisfied.
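
A sketch of this outer loop, assuming a centering_step routine (eg the equality constrained Newton method above) and m inequality constraints; m/t is used as the stopping criterion, as derived just below:

    def barrier_method(centering_step, x0, m, t0=1.0, mu=20.0, eps=1e-6):
        """Outer loop of the barrier method: a sequence of centering problems."""
        x, t = x0, t0
        while m / t > eps:            # duality-gap bound m/t as stopping criterion
            x = centering_step(x, t)  # minimize t*f0(x) + phi(x) s.t. Ax = b,
                                      # warm-started at the previous central point
            t *= mu                   # grow t by the factor mu
        return x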

Interpretation using Lagrangian

Take the optimality condition of the centering problem: \(\gradient f_0(x^{*}) + t^{-1}\sum_i (-f_i(x^{*}))^{-1}\gradient f_i(x^{*}) + t^{-1}A^{T}m^{*} = 0\).

This can be seen as the minimum over \(x\) of a Lagrangian-like function \(L(x, l, m) = f_0(x) + \sum_i l_if_i(x) + m^{T}(Ax - b)\), with \(l \geq 0\). The corresponding values are \(l_i^{*} = t^{-1}(-f_i(x^{*}))^{-1} \geq 0\), with \(t^{-1}m^{*}\) as the equality multiplier.

So, \(p^{*} \geq g(l^{*}(t), m^{*}(t)) = L(x^{*}(t), l^{*}(t), m^{*}(t)) = f_0(x^{*}(t)) - m/t\), where \(m\) here is the number of inequality constraints. \(m/t\) bounds the duality gap and goes to 0 as \(t \to \infty\). So, \(x^{*}(t) \to x^{*}\) as \(t \to \infty\). \exclaim{Can use \(m/t\) as stopping criterion!}

Primal, dual points on the central path

\(x^{*}(t)\) are primal points on the central path. \((l^{*}(t), m^{*}(t))\) are dual points on the central path.

Ideas for faster solution

To reduce the time taken to solve the convex optimization problem, you reduce: a] the time taken per iteration, b] the number of iterations, by using a clever initialization point.

Solving using the dual function

Take the primal \(\min f_0(x): \set{f_i(x) \leq 0}, \set{h_i(x) = 0}\); get the Lagrangian \(L(x, l, m)\); get \(g(l, m) = \inf_x L(x, l, m)\); solve \(\max_{l, m} g(l, m)\); derive \(x^{*}\) from \(l^{*}, m^{*}\).

Make extra inferences using KKT conditions.

Warm start

Can use the solution of a closely related optimization problem to solve the current problem. This idea is used in barrier method!

Sometimes this gives a better solution, as using a bad initialization point would have returned a relatively worse solution due to the number of iterations exceeding the limit specified in the maxiter parameter passed to the solver.

\part{Classes solved with convex programming}

Quasiconvex optimization problem

\(f_0\) is quasi-convex, \(f_i(x)\) are convex. Epigraph form: \(\min t: f_0(x) \leq t; f_i(x) \leq 0, Ax = b\).

This can have locally optimal points which are not globally optimal: there can be plateaus: from properties of quasiconvex fn.

Generality of constraints

Can replace all quasi-convex sublevel sets, which are convex sets, with sublevel sets of convex functions. Eg: Consider the case of the linear fractional programs.

Solution using bisection

Take the epigraph form and fix \(t\): checking whether \(\set{x: f_0(x) \leq t, f_i(x) \leq 0, Ax = b}\) is non-empty is a convex feasibility problem. Can solve the overall problem using bisection on \(t\) (see another section), with each feasibility problem being solved using convex programming.
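
A sketch of the bisection loop, assuming is_feasible(t) solves the convex feasibility problem for a fixed t, and that the problem is infeasible at t_low and feasible at t_high:

    def quasiconvex_bisection(is_feasible, t_low, t_high, eps=1e-6):
        """Bisection on t in the epigraph form of a quasiconvex problem."""
        while t_high - t_low > eps:
            t = (t_low + t_high) / 2.0
            if is_feasible(t):        # convex feasibility problem for fixed t
                t_high = t            # optimum p* <= t
            else:
                t_low = t             # optimum p* > t
        return t_high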

Second order cone program (SOCP)

Minimize a linear functional \(f^{T}x\) subject to \(Fx = g\) and second order cone constraints.

Eg: used in robust linear programming.

Second order cone constraints

\(\|A_ix - b_i\|_2 - (c_i^{T}x + d_i) \leq 0: A_i \in R^{n_i \times n}\). This is obviously a convex constraint, as the LHS is a sum of a convex function and a linear function. But squaring both sides is not a good way of showing convexity of the feasible set.

Equivalently, \((A_ix - b_i, c_i^{T}x + d_i)\) is constrained to lie in a second order cone in \(R^{n_i + 1}\).

Generality

Generalizes LP: take \(n_i = 0\).

When \(c_i = 0\), becomes QCQP. Does it generalize all QCQPs? \chk

Conic inequalities convex program

Consider proper cones \(K_i\). Then, let the objective \(f_0\) remain convex, keep the equality constraints \(Ax = b\), but specify constraints using generalized convex functions: \(f_i(x) \preceq_{K_i} 0\). This is still a convex program, as sublevel sets of the above continue to be convex.

Strong duality holds if Slater’s constraint qualification holds.

Conic form problem

Extends LP using conic inequalities: \(Fx + g \preceq_{K} 0\).

Semidefinite programming (SDP)

Minimize a linear fn \(c^{T}x\) subject to linear matrix inequalities (LMIs) involving symmetric matrices (see matrix algebra survey). Can collapse \(n\) LMIs into a single LMI \(A_0 + \sum_i x_iA_i \preceq 0\) by increasing the matrix size \(n\) times (block-diagonal stacking).

LMI’s define convex sets. Can also be viewed as a convex optimization problem specified using conic inequalities.

Generality

SOCP, which includes LP, can be reduced to SDP. Can replace the second order cone constraint with $\mat{(c_i^{T}x + d_i)I & A_ix + b_i\\ (A_ix + b_i)^{T} & (c_i^{T}x + d_i)} \succeq 0$. \why

So, can flexibly specify SDP with semidefinite cone, quadratic cone, linear inequalities’ constraints; so called SQLP.

Recognizing SDP’s

First, can try manipulating the objective to be a linear functional, perhaps by taking the epigraph form. Then, LMIs of the form \(A \succeq 0\) with \(A_{i,j} = a_{ij}^{T}x + b_{ij}\) are common. Also, the Schur complement is often useful in forming the constraint.

Examples

Eigenvalue (ew) minimization, matrix norm minimization.

Dual SDP

Lagrangian \(L(x, Z) = c^{T}x + \langle A_0 + \sum_i x_iA_i, Z\rangle = c^{T}x + \langle A_0, Z\rangle + \sum_i x_i\langle A_i, Z\rangle\), with \(Z \succeq 0\).

\(g(Z) = \inf_x L(x, Z) = \langle A_0, Z\rangle\) if \(\langle A_i, Z\rangle + c_i = 0 \ \forall i\), else \(g(Z) = -\infty\). So, the dual problem: \(\max\langle A_0, Z\rangle: Z \succeq 0, \langle A_i, Z\rangle + c_i = 0 \ \forall i\). (Note: used self-duality of \(S_{+}^{n}\) in writing \(Z \succeq 0\).) \exclaim{Also an SDP!}

Dual of the dual is the primal!: Use \(L(Z, m, \Lambda) = \langle G, Z\rangle + \sum_i m_i\langle F_i, Z\rangle + \langle\Lambda, Z\rangle + c^{T}m\).
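
For concreteness, a small sketch of an SDP of this flavour (linear objective in a PSD matrix variable with trace equality constraints), assuming the cvxpy package is available; the random data is constructed only so that the instance is feasible and bounded:

    import numpy as np
    import cvxpy as cp

    n, p = 4, 3
    rng = np.random.default_rng(0)
    C = rng.standard_normal((n, n)); C = C @ C.T + np.eye(n)   # PD objective => bounded below
    Z0 = rng.standard_normal((n, n)); Z0 = Z0 @ Z0.T           # a PSD point, keeps constraints feasible
    A, b = [], []
    for _ in range(p):
        S = rng.standard_normal((n, n))
        A.append((S + S.T) / 2)                                # symmetric constraint matrices
        b.append(float(np.trace(A[-1] @ Z0)))

    Z = cp.Variable((n, n), symmetric=True)
    constraints = [Z >> 0] + [cp.trace(A[i] @ Z) == b[i] for i in range(p)]
    prob = cp.Problem(cp.Minimize(cp.trace(C @ Z)), constraints)
    prob.solve()
    print(prob.value)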

Geometric programming

\(\min_x f_0(x): \set{f_i(x) \leq 1}, \set{h_i(x) = 1}\), with all \(f_i\) posynomials and \(h_i\) monomials (see vector functionals survey).

Conversion to convex form: use \(y_i = \log x_i\), take logs of the objective and constraints, and restate the problem.

Eg: used in finding the Perron-Frobenius ew \(|\lambda_{max}(A)|\).

\part{Non convex optimization}

Discrete optimization problems

Difficulty

The difficulty here arises from the fact that a huge (often exponentially large in the number of variables due to combinatorial explosion) number of assignments to the discrete variables should be considered in order to find the optimum.

General Strategies

Use exhaustive search.

Relaxation to allow continuous values

Constraints in an Integer program can be relaxed to allow variables to take on real values.

Graph based problems

Max flow, min cut problem. See graph theory survey.

Resource allocation

Often modelled with graphs. Edges indicate resource constraints or conflicts.

Maximum weight matching

Find the heaviest set of disjoint edges.

For bipartite graphs: if the maximum weight matching is unique, loopy belief propagation will find it.

Continuous variables: general strategies

The main difficulty here arises from the presence of a large number of local optima.

Use local optimization attempts with random choices of initial points (random restarts).

Or use a relaxation.

Convexification/ smoothing

Turn it into a convex optimization problem: eg, change the objective fn (eg to \(e^{x}\)); or maybe strong duality holds, so the dual problem can be solved: see the section on modelling/ formulating problems too.

Dual of the dual

Dual of the dual is sometimes a convex relaxation to the original problem, besides being a lower bound to it.

Local approximation

Or approximate \(f_0\) (maybe locally) by a convex function or at least a smoother function. Eg: trust region methods.

Smoothing

Smoothing reduces irregularity of the function output - ie it reduces the depth of local minima.

Gaussian smoothing is frequently used: \(g(x_0) = \int_{x \in [-\infty, \infty]} e^{-\lambda(x - x_0)^{2}}f(x)dx\).
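
A Monte Carlo sketch of such smoothing: estimate the smoothed value at x0 by averaging f over Gaussian perturbations of x0 (sigma corresponds, up to normalization, to \(1/\sqrt{2\lambda}\); the parameter values are arbitrary):

    import numpy as np

    def gaussian_smooth(f, x0, sigma=0.5, num_samples=1000, seed=0):
        """Monte Carlo estimate of the Gaussian-smoothed objective at x0."""
        rng = np.random.default_rng(seed)
        x0 = np.asarray(x0, dtype=float)
        samples = x0 + sigma * rng.standard_normal((num_samples,) + x0.shape)
        return float(np.mean([f(x) for x in samples]))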

As sampling

Global optimization can be seen as sampling from the feasible set, using anything monotonic with \(f_0(w) = E(w)\) as a measure of energy/ improbability - albeit with the intention of finding the minimum.

In exploring the feasible set with special attention towards finding an optimum, one would want most updates to improve the objective, while the remaining updates help get out of local minima.

Stochastic gradient descent

Often the objective function can be decomposed as follows: \(E(w) = \sum_{i=1:n} E_i(w)\) - for example, in terms of various data-points in the case of maximum likelihood estimation.

Here, in each iteration, one chooses a term \(E_i(w)\) at random or in a sequence, and uses \(-\gradient E_i(w)\) as a descent direction.

Comparison with gradient descent

Unlike gradient descent, which, using \(-\gradient E(w)\) as the direction, can consistently reduce \(E(w)\) if the step size is appropriately chosen, stochastic gradient descent does not make such a guarantee. The advantage of stochastic gradient descent lies in being less prone to getting stuck at local optima, and in the greater speed with which \(\gradient E_i(w)\) can be evaluated.
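
A sketch of the update, assuming grad_Ei(w, i) returns the gradient of the term \(E_i\) at w and that a fixed step size alpha is acceptable (both are assumptions):

    import numpy as np

    def sgd(grad_Ei, w0, n, alpha=0.01, num_epochs=10, seed=0):
        """Stochastic gradient descent over an objective E(w) = sum_i E_i(w)."""
        rng = np.random.default_rng(seed)
        w = np.asarray(w0, dtype=float)
        for _ in range(num_epochs):
            for i in rng.permutation(n):       # visit the n terms in random order
                w = w - alpha * grad_Ei(w, i)  # use -grad E_i(w) as the direction
        return w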

Damping jitters

Sampling techniques may sometimes result in excessive oscillations in the value of \(E(w)\) after each variable update \(w_i = w_{i-1} + \Delta w_{i-1}\), where \(\Delta w_{i-1} = \alpha t_{i-1}\), using step size \(\alpha\) and direction \(t_{i-1}\).

One may use a momentum parameter \(\eta\) to dampen the abrupt changes in direction: \(\Delta w_{i-1} = \eta\Delta w_{i-2} + \alpha t_{i-1}\).

This is beneficial because most updates should still improve the objective, while the remaining (momentum-carried) updates help get out of local minima.

Using distribution sampling techniques

See the section on sampling from a distribution in the randomized algorithms survey. Note that, in sampling for optimization, algorithms may bias their exploration towards finding optima.

Local optimization

Result

The result is often highly sensitive to the initial value of \(x\). Also, one cannot guarantee that one will reach the minimum closest to the initial point: for example, when gradient descent is used, the angle of the gradient may lead one to a point which then leads into a different basin.

Techniques

Can use any convex optimization technique, like gradient descent or alternating minimization.

\part{Discrete and Combinatorial optimization}

Integer programming (IP)

LP problem where variables can only take integer values. It is NP-hard to find a solution: you have combinatorial hardness.

Approximate with LP; solve it; round the solutions. \tbc

Randomized rounding

Round \(x\) up to \(\lceil x\rceil\) with prob \(x - \lfloor x\rfloor\), and down to \(\lfloor x\rfloor\) otherwise.
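
A sketch of this rounding applied componentwise to a fractional LP solution:

    import numpy as np

    def randomized_round(x, seed=0):
        """Round each x_i up with probability equal to its fractional part."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x, dtype=float)
        frac = x - np.floor(x)
        return np.floor(x) + (rng.random(x.shape) < frac)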

Optimal substructure problems

Aka Dynamic programming.

Applicability: Decision tree view

The problem can be cast as one of taking a sequence of decisions, and one wants to find the optimal sequence of decisions. So, essentially, one tries to find the optimal path through a decision tree. The number of decisions one needs to take is bounded by N.

Problems exhibit the ‘optimal substructure’ property, and also often the ‘overlapping subproblems’ property.

Optimal substructure

Optimal solutions of simpler subproblems can be compared in some way to find the overall optimal solution.

A problem corresponds to a decision tree \(D_l\) at level \(l\). Each subproblem corresponds to finding the optimal path \(p_i\) through a different decision subtree \(D_i\), which one would arrive at by fixing the first decision to be \(e_i\). One constructs the optimal decision path \(p = \arg\min_i f(p_i + e_i)\).

Eg: in the case of the shortest path problem: \(d(s, e) = \min_{v \in \Gamma(s)}[d(s, v) + d(v, e)]\).

Remembering subproblems used

\(p\) is a sequence of decisions \(d_{l,i}\) - each corresponding to making decision \(i\) at level \(l\) of the decision tree. One needs to remember the decision taken at level \(l\) - the optimal subpath augmented. Eg: in the example above, in order to reconstruct the shortest path from \(s\) to \(e\), one needs to remember which \(v \in \Gamma(s)\) was used.

Overlapping subproblems

The subproblems solved are repeated. This corresponds to the case where decision sub-paths to various leaves are actually identical. So it is profitable to remember solutions to subproblems.

Top down vs Bottom up

Top down solution

A top down solution can be easily expressed in terms of a recursive function f(D) which acts on a certain decision tree and returns a] the optimal decision path p and b] its cost.

In doing this, if the ‘overlapping subproblem’ property holds, the algorithm memoizes : ie remembers optimal solutions to these subproblems whenever they are solved.

Bottom up

This solution is only applicable when the ‘overlapping subproblem’ property holds.

The algorithm solves decision trees of the smallest depth, records their results and builds solutions to progressively larger decision trees. So, one goes from level N and works one’s way up to level 1.

Tabular view

Suppose that any node in the decision tree has at most \(M\) children. This process can be viewed by means of an \(N \times M\) table or a list of \(N\) lists.

First, one constructs a list or column corresponding to the consequences of M different decisions at level N.

Then, one constructs a list corresponding to the consequences of \(M\) decisions at level \(N-1\), and also a list of 'backpointers' specifying the ideal decision at level \(N\) if one were to fix decision \(d\) at level \(N-1\).

One does this inductively until one covers all decisions up to level 1.

Time complexity

From the description of the bottom up solution, it is clear that the time/ space required is \(O(MN)\) - unlike \(O(M^{N})\) in case all paths in the entire decision tree are to be considered (true in case the 'overlapping subproblems' property does not hold).
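
A bottom-up sketch for a layered decision graph (the shortest path example above), with a backpointer table to reconstruct the optimal decision sequence; the cost[l][i][j] layout is an assumed encoding of the edge costs:

    def layered_shortest_path(cost):
        """Bottom-up DP. cost[l][i][j]: cost of going from node i at level l to node j at level l+1.
        Returns the optimal value from node 0 at level 0, and the chosen node at each level."""
        N = len(cost)                           # number of decision levels
        best = [0.0] * len(cost[-1][0])         # value-to-go at the final level
        back = []
        for l in range(N - 1, -1, -1):          # work from level N-1 up to level 0
            new_best, choices = [], []
            for i in range(len(cost[l])):
                vals = [cost[l][i][j] + best[j] for j in range(len(cost[l][i]))]
                j_star = min(range(len(vals)), key=vals.__getitem__)
                new_best.append(vals[j_star])   # optimal substructure: reuse best[j]
                choices.append(j_star)          # backpointer: decision taken at level l
            best, back = new_best, [choices] + back
        path, node = [], 0                      # reconstruct decisions from level 0 down
        for l in range(N):
            node = back[l][node]
            path.append(node)
        return best[0], path

    # Eg: layered_shortest_path([[[1.0, 4.0]], [[2.0, 0.5], [1.0, 3.0]]])
    # returns (1.5, [0, 1]).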

Examples

Shortest path algorithm can be formulated as dynamic program - see graph theory survey.

FFT: See functional analysis ref.

Determining the most likely state sequence in the case of a HMM.

Branch and bound

Systematic enumeration of all candidate solutions, where large subsets of fruitless candidates are discarded en masse, by using upper and lower estimated bounds of the quantity being optimized.

With belief propagation

Rewrite as the problem of finding the mode of a distribution: \(\max_x Pr(x)\), where \(Pr(x) \propto 1_{\set{f(x) \leq 0, h(x) = 0}}e^{-f_0(x)}\): the exponentiation is to ensure non-negativity.

This is useful when \(f_0, f, h\) are decomposable into functionals over cliques: then one can take advantage of the factorization.

Used in combinatorial optimization.