For ideas about its importance, both in solving general convex optimization problems and in other domains, see the ‘Least squares’ chapter.
Linearly Constrained QP
Aka quadratic programming: minimize a convex quadratic functional over a polyhedron (see vector spaces survey). Generalizes linear programming; a special case of QCQP.
Visualization
If the optimal value occurs when a constraint is active: think of contours of the objective hitting an edge of the polyhedron. Slightly harder than in the LP case, where the minimum always occurred at a vertex.
Dual problem
This is also a quadratic program; strong duality always holds! \chk
Quadratically constrained QP (QCQP)
Minimize a convex quadratic functional, subject to convex quadratic inequality constraints; any equality constraints must be affine (specified by linear equations). The feasible set is an intersection of hyperellipsoids and a polyhedron.
Least squared error problem
Problem
Aka least squares.
Unregularized problem
\(\min \norm{Aw - b}^{2}\), \(A \in R^{m \times n}\), \(m>n\), \(rank(A) = n\) (full column rank).
The weighted least squares problem
Give more weight to some observations, or account for correlated observations:\ use a weighted inner product to find the projection.\ \(WAx \approx Wb\); \(\min \norm{WAx - Wb} = \min \norm{Ax-b}_{M}\) for \(M=W^{*}W\).
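A minimal numerical sketch of this, assuming a diagonal weight matrix \(W\) (all variable names here are illustrative):

```python
import numpy as np

# Illustrative sizes; any m > n with full column rank works.
rng = np.random.default_rng(0)
m, n = 20, 3
A = rng.normal(size=(m, n))
b = rng.normal(size=m)

# Diagonal weights: a larger weight means that observation matters more.
w_diag = rng.uniform(0.5, 2.0, size=m)
W = np.diag(w_diag)

# Weighted least squares: minimize ||W(Ax - b)||_2, i.e. ||Ax - b||_M with M = W^T W.
x_weighted, *_ = np.linalg.lstsq(W @ A, W @ b, rcond=None)

# Equivalently, solve the weighted normal equations A^T M A x = A^T M b.
M = W.T @ W
x_normal = np.linalg.solve(A.T @ M @ A, A.T @ M @ b)

assert np.allclose(x_weighted, x_normal)
```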
Solution
See numerical analysis survey for geometry and solution.
The regularized problem
Aka weight decay.
Want to balance love of sparsity or smallness with desire for a good fit. Pick a regularizer which is small for good w, and large for bad w.
Take \(\min f(w) + l\norm{w}_{q}\). As \(q\) decreases, the feasible set (for a given penalty budget) shrinks: consider the progression of unit balls from \(\norm{\cdot}_{\infty}\) to \(\norm{\cdot}_{0}\).
Application
Convex Optimization
A common technique for solving convex optimization problems is to approximate the objective with a quadratic function and solve the resulting quadratic problem, often repeatedly (as in Newton-type methods).
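A sketch of this idea as a damped Newton iteration on a smooth toy objective (the function, names, and damping scheme here are illustrative, not tied to any particular solver):

```python
import numpy as np

def newton_minimize(grad, hess, w0, steps=50, damping=1.0):
    """At each iterate, minimize the local quadratic model
    f(w) + g^T d + 0.5 d^T H d, i.e. take the Newton step d = -H^{-1} g."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        g, H = grad(w), hess(w)
        if np.linalg.norm(g) < 1e-10:
            break
        d = np.linalg.solve(H, -g)   # minimizer of the local quadratic model
        w = w + damping * d
    return w

# Toy objective f(w) = sum(log(cosh(w_i))): smooth and convex, minimized at 0.
grad = lambda w: np.tanh(w)
hess = lambda w: np.diag(1.0 - np.tanh(w) ** 2)
print(newton_minimize(grad, hess, [1.0, -0.5]))   # approximately [0, 0]
```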
Other domains
See the regression problem in the statistics survey for motivation. Same as finding the maximum likelihood solution while fitting a linear model with Gaussian noise.
Interpolation of functions linear in parameters is also a specific case: see numerical analysis survey.
Regularized problems
If solving a regression problem, the regularized objective corresponds to maximum a posteriori estimation: a prior probability distribution is imposed on \(w\) to favor smaller \(w\) values.
\(w\) could be the coefficients of a polynomial used to fit the data in \(A\), \(b\). To avoid overfitting, we want \(w\) to be small.
Penalties and priors
Different regularizers (usually norms) correspond to different priors on \(w\). The level sets of the prior distribution look like the unit spheres of the corresponding norms: sphere (Gaussian prior), diamond (Laplacian prior), etc.
An \(\norm{\cdot}_{q}\) penalty for \(q > 2\) is usually not worth the effort. Sometimes an \(\norm{\cdot}_{q}\) penalty with \(q\) slightly above 1 is used as a compromise between the lasso and quadratic regularizers.
Standard formulations
Normalizing columns of A
Take a diagonal matrix \(D\) that normalizes the columns of \(A\): with \(A’ = AD\) and \(w’ = D^{-1}w\), solving \(A’w’ \approx b\) solves \(ADD^{-1}w \approx b\), i.e. the original problem.
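A sketch of this column rescaling (variable names illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(10, 4)) * np.array([1.0, 10.0, 0.1, 5.0])  # badly scaled columns
b = rng.normal(size=10)

# D scales each column of A to unit 2-norm: A' = A D has normalized columns.
D = np.diag(1.0 / np.linalg.norm(A, axis=0))
A_prime = A @ D

# Solve the normalized problem A' w' ~ b, then map back: w = D w'.
w_prime, *_ = np.linalg.lstsq(A_prime, b, rcond=None)
w = D @ w_prime

# Same solution as solving the original problem directly.
w_direct, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(w, w_direct)
```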
Quadratic regularizer
Aka ridge regression. \ Solve \(\min f(w) + lw^{T}I’w = \min \norm{Aw-b}_{2}^{2} + lw^{T}I’w\) instead, to also control the size of \(w\).
This objective is the Lagrangian function corresponding to the problem of finding \(\min f(w) = \min \norm{Aw-b}_{2}^{2}\) subject to \(w^{T}I’w \leq c\).
The solution form
\((A^{T}A + lI)\hat{w} = A^{T}b\): obtained by setting \(\gradient (f(w) + l\norm{w}_{2}^{2}) = 0\). Taking the SVD \(A = U\SW V^{*}\), the fitted values are \(A\hat{w} = U \SW (\SW^{2} + lI)^{-1} \SW U^{T}b = UDU^{T}b\), where \(D_{i,i} = \sw_{i}’ = \frac{\sw_{i}^{2}}{\sw_{i}^{2} + l}\).
So \(A\hat{w} = \sum_i \sw_{i}’ u_{i} u_{i}^{T} b\); observe the action of \(l \in [0,\infty)\) through \(\sw_{i}’\): \(l\) is the shrinkage parameter (equivalently, it sets the strength of the prior distribution on \(w\)).
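A sketch verifying the shrinkage-factor form of the fit numerically (names illustrative; `l` stands for the regularization strength):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, l = 50, 5, 0.5          # l is the regularization strength (lambda)
A = rng.normal(size=(m, n))
b = rng.normal(size=m)

# Direct solution of the regularized normal equations (A^T A + l I) w = A^T b.
w_ridge = np.linalg.solve(A.T @ A + l * np.eye(n), A.T @ b)

# The same fit via the SVD A = U diag(s) V^T: shrinkage factors s_i^2 / (s_i^2 + l).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
shrink = s**2 / (s**2 + l)
fitted_svd = U @ (shrink * (U.T @ b))          # A w_ridge = U D U^T b
assert np.allclose(A @ w_ridge, fitted_svd)
```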
The geometry
Consider the ball \(\norm{w}_{2}^{2} \leq c\) and the paraboloid \(f(w)\). The objective is to find the \(w\) within the ball having the minimum \(f(w)\) value. Visualize using the contour map of \(f(w)\): successively larger ellipsoids. The optimal \(w\) is where the minimal contour touches the boundary of the ball, when the unconstrained minimizer of \(f(w)\) lies outside the ball.
Compare with geometry of \(\min f(w)\) problem: see linear alg survey.
Effect
Shrinks the components of \(w\) towards 0 and, for correlated features, towards each other in magnitude; so it also takes care of correlated features.
Barely penalizes small \(w_i\) (the quadratic penalty is nearly flat near 0), so it does not drive them exactly to 0: no sparsity.
l2 and linf regularizers
l2 regularizer
Problem
Solve \(\min_w \norm{Aw - b}^{2} + \gl \norm{w}_2\), with \(\gl > 0\). An alternate formulation is \(\min_w w^{T}Hw + v^{T}w + \gl \norm{w}_2\), with \(H \succeq 0\).
Solution
The optimality condition involves taking the subdifferential of \(\norm{w}_2\). For diagonal \(H\), this has a closed form solution. The solution technique involves some nifty variable substitution.
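For the simplest instance, \(H = I\) (i.e. minimizing \(\norm{w - z}^{2} + \gl\norm{w}_2\) with \(z = -v/2\)), the substitution yields the block soft-thresholding formula; a sketch under that assumption:

```python
import numpy as np

def prox_l2(z, lam):
    """Minimizer of ||w - z||_2^2 + lam * ||w||_2 (block soft-thresholding)."""
    norm_z = np.linalg.norm(z)
    if norm_z <= lam / 2.0:
        return np.zeros_like(z)          # subgradient condition: solution is exactly 0
    return (1.0 - lam / (2.0 * norm_z)) * z

z = np.array([3.0, -4.0])       # ||z|| = 5
print(prox_l2(z, lam=2.0))      # scales z by 1 - 2/(2*5) = 0.8 -> [2.4, -3.2]
print(prox_l2(z, lam=12.0))     # lam/2 > ||z||, so the solution is exactly 0
```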
Relation with Quadratic regularizer
Rewriting it as \(\min_w \norm{Aw - b}^{2} : \norm{w}_2 \leq C(\gl)\), we see that this is equivalent to using the quadratic regularizer and solving \(\min_w \norm{Aw - b}^{2} : \norm{w}_2^{2} \leq C(\gl)^{2}\). So the tradeoff curve is, for practical purposes, identical.
The latter formulation has a simpler solution, since no subgradient calculations are needed to state the optimality conditions analytically.
But sometimes we are given a problem where an \(l_2\) regularizer is used. Such problems occur, for example, in block coordinate descent for the multi-task lasso, while finding the search direction using a quadratic approximation of the objective. In such cases, we can either solve the specified problem directly or reformulate it using the quadratic regularizer; the former is quite simple.
lInf regularizer
Here, we solve \(\min_w \norm{Aw - b}^{2} + \gl \norm{w}_{\infty}\). The optimality condition involves taking the subdifferential of \(\norm{w}_{\infty}\). A closed form solution can be derived using a proof by cases: e.g., see a paper on multi-task learning by Han Liu.
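One way to evaluate the resulting proximal step numerically, via Moreau decomposition and a standard sort-based projection onto the \(l_1\) ball; this is a sketch of an alternative to the case analysis, not the cited paper's method:

```python
import numpy as np

def project_l1_ball(z, radius):
    """Euclidean projection of z onto {u : ||u||_1 <= radius} (sort-based)."""
    if np.sum(np.abs(z)) <= radius:
        return z.copy()
    u = np.sort(np.abs(z))[::-1]
    cumsum = np.cumsum(u)
    k = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * k > cumsum - radius)[0][-1]
    theta = (cumsum[rho] - radius) / (rho + 1.0)
    return np.sign(z) * np.maximum(np.abs(z) - theta, 0.0)

def prox_linf(z, lam):
    """argmin_w 0.5*||w - z||^2 + lam*||w||_inf, via Moreau decomposition."""
    return z - project_l1_ball(z, lam)

z = np.array([4.0, -1.0, 0.5])
print(prox_linf(z, lam=2.0))   # -> [2., -1., 0.5]: only the largest-magnitude entry is shrunk here
```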
Forward stepwise regression for sparsity
Trying to solve \(Aw = y\). Take \(F\) = the active set of features used in approximating \(y\). Start with \(F = \nullSet\), \(w = 0\). At step \(t\), find the residual \(r = y - Aw_t\), find the feature \(i\) most correlated with \(r\); add \(i\) to \(F\) and set \(w_{i}\) using the projection of \(r\) onto \(a_i\); continue. \exclaim{Observe how the residual changes at each step!}
This is a greedy algorithm, so suboptimal. But note that this is not the same as simply finding the optimal \(w\) without any sparsity constraint and then dropping some small elements.
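A sketch of one common variant (it refits all active coefficients after each selection, as in orthogonal matching pursuit; names illustrative):

```python
import numpy as np

def forward_stepwise(A, y, s):
    """Greedy feature selection: pick the column most correlated with the
    residual, add it to the active set F, refit on F, repeat s times."""
    m, n = A.shape
    F, w = [], np.zeros(n)
    for _ in range(s):
        r = y - A @ w                              # current residual
        corr = np.abs(A.T @ r)
        corr[F] = -np.inf                          # do not reselect active features
        i = int(np.argmax(corr))
        F.append(i)
        w_F, *_ = np.linalg.lstsq(A[:, F], y, rcond=None)
        w = np.zeros(n)
        w[F] = w_F                                 # refit coefficients on the active set
    return w, F

rng = np.random.default_rng(3)
A = rng.normal(size=(30, 10))
A /= np.linalg.norm(A, axis=0)                     # normalize columns
w_true = np.zeros(10); w_true[[2, 7]] = [1.5, -2.0]
y = A @ w_true + 0.01 * rng.normal(size=30)
print(forward_stepwise(A, y, s=2))
```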
Forward stagewise regression
A less greedy modification of forward stepwise. Find the feature set \(B\) with maximum correlation with the residual; keep increasing \(w_{B}\) until some other feature(s) attain equal correlation with the residual; then add these feature(s) to \(B\); repeat.
At any time, \(B\) is the set of features which form the \exclaim{least angle} with the residual. The coefficients increase such that \(Aw_B\) moves exactly along the projection of the residual onto the subspace spanned by the features in \(B\).
Also, if at some point a feature correlated with some feature already in \(B\) gets added to \(B\), the coefficient of that existing feature may start to decrease.
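A sketch of the incremental (small-step) version; the "increase until another feature catches up" behaviour described above emerges as the step size goes to 0:

```python
import numpy as np

def forward_stagewise(A, y, eps=0.01, steps=5000):
    """Incremental forward stagewise: take many tiny steps, each time nudging
    the coefficient of the feature currently most correlated with the residual."""
    n = A.shape[1]
    w = np.zeros(n)
    r = y.copy()
    for _ in range(steps):
        corr = A.T @ r
        i = int(np.argmax(np.abs(corr)))
        if np.abs(corr[i]) < 1e-8:                 # residual (nearly) uncorrelated with all features
            break
        delta = eps * np.sign(corr[i])
        w[i] += delta
        r -= delta * A[:, i]                       # update the residual incrementally
    return w

rng = np.random.default_rng(3)
A = rng.normal(size=(30, 6)); A /= np.linalg.norm(A, axis=0)
y = A @ np.array([2.0, 0, 0, -1.0, 0, 0]) + 0.01 * rng.normal(size=30)
print(forward_stagewise(A, y).round(2))            # roughly recovers [2, 0, 0, -1, 0, 0]
```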
Least Angle Regression
A modification of forward stagewise regression, so that \(w_{B}\) is increased in one shot rather than gradually.
Lasso solving modification
Lasso just specifies an objective, which may be minimized in many ways, including using forward regression.
Can turn LAR into a lasso solver: whenever \(w_{i}\) for some \(i\) in \(B\) hits 0, drop it out of \(B\).
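If scikit-learn is available, its LARS-based lasso solver implements exactly this modification; a usage sketch (the data here is illustrative):

```python
import numpy as np
from sklearn.linear_model import LassoLars, lars_path

rng = np.random.default_rng(4)
A = rng.normal(size=(50, 8))
w_true = np.zeros(8); w_true[[1, 5]] = [2.0, -1.0]
b = A @ w_true + 0.1 * rng.normal(size=50)

# Full LAR/lasso path: features can enter and (with the lasso modification) leave the active set.
alphas, active, coefs = lars_path(A, b, method="lasso")

# Single fit at a fixed regularization strength.
model = LassoLars(alpha=0.05).fit(A, b)
print(model.coef_)
```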
Reason for the fix
Compare the conditions on the lasso solution obtained by setting the gradient to 0 with the conditions holding for LAR at any time: the correlation of the residual with each feature in \(B\) is equal (in magnitude), and the correlation with other features is lower. They are identical as long as \(sgn(w_j) = \) the sign of the correlation of the residual with feature \(j\).
0 norm regularizer: Compressed sensing
Problem scenario
X is short and fat, so \(Xw = y\) has many solutions for \(w\); we want to find \(w\) with \(\norm{w}_0\leq s\), \(Xw \approx y\). If X were tall and thin, the solution of a similar problem would be easy: solve \(\min \norm{Xw - y}^{2}\) and pick the \(s\) most important components of \(w\).
Consider the equivalent problem, where we assume columns of X are normalized.
Finding support: Combinatorial hardness
The difficulty comes from finding the support (the non-zero coordinates) of \(w\). Once the support of \(w\) is found, it is easy to find the optimal linear combination of these components to get close to \(y\): just solve \(\min \norm{X’w’ - y}\), where the \(m \times s\) matrix \(X’\) is derived from \(X\) by keeping only the support columns.
There are \(\binom{n}{s}\) possible supports of size \(s\), so there are combinatorially many ways to form \(w\) with \(\norm{w}_0\leq s\).
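A brute-force illustration of the combinatorics, feasible only for tiny \(n\) and \(s\) (names illustrative):

```python
import itertools
import numpy as np

def best_s_sparse(X, y, s):
    """Exhaustive search over all supports of size s: solve a small least
    squares problem for each candidate support and keep the best."""
    m, n = X.shape
    best = (np.inf, None, None)
    for support in itertools.combinations(range(n), s):   # C(n, s) candidates
        cols = list(support)
        w_s, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        err = np.linalg.norm(X[:, cols] @ w_s - y)
        if err < best[0]:
            best = (err, cols, w_s)
    return best

rng = np.random.default_rng(5)
X = rng.normal(size=(10, 20))          # short and fat
w_true = np.zeros(20); w_true[[3, 11]] = [1.0, -2.0]
y = X @ w_true
print(best_s_sparse(X, y, s=2)[1])     # expected to recover the support [3, 11]
```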
Finding support: Target optimization problems
Maybe we want to limit \(w\) to a certain number of non-zeros. Solve \(\min f(w) + l \norm{w}_{0}\), where \(f(w) = \norm{Xw - y}_2^{2}\).
This is the same as \(\min f(w): \norm{w}_{0} \leq s\). The feasible set is not convex: consider \(\{w: \norm{w}_0 \leq 1\}\), the union of the coordinate axes, for example.
A stricter version of the problem is: \(\min \norm{w}_0 : Xw = y\), as this does not allow \(Xw \approx y\).
Solution using 1 norm minimization
Just solve the linear program \(\min \norm{w}_1 : Xw = y\).
Arrival at the sparse solution is guaranteed when certain restricted isometry and incoherence properties hold for \(X\).
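A sketch of the LP reformulation, splitting \(w = w^{+} - w^{-}\) with \(w^{+}, w^{-} \geq 0\) and using scipy's linprog (the data and sizes are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(6)
m, n, s = 20, 40, 3
X = rng.normal(size=(m, n)) / np.sqrt(m)
w_true = np.zeros(n)
w_true[rng.choice(n, size=s, replace=False)] = rng.normal(size=s)
y = X @ w_true

# min ||w||_1 s.t. Xw = y, as an LP in (w+, w-) >= 0 with w = w+ - w-.
c = np.ones(2 * n)                       # sum(w+) + sum(w-) equals ||w||_1 at the optimum
A_eq = np.hstack([X, -X])                # X (w+ - w-) = y
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * n), method="highs")
w_hat = res.x[:n] - res.x[n:]
print(np.max(np.abs(w_hat - w_true)))    # typically ~0: exact recovery in this regime
```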
Restricted isometry constant for s
\(1-\gd_{s} \leq \frac{\norm{Xw}^{2}}{\norm{w}^{2}} \leq 1+\gd_s, \forall w: \norm{w}_0 \leq s\).
Incoherence/ almost orthogonality
\(\abs{\dprod{Xw, Xw’}} \leq T_{s, s’}\norm{w}\norm{w’}\) for \(w, w’\) such that \(supp(w) \inters supp(w’) = \emptyset, \norm{w}_0 \leq s, \norm{w’}_0 \leq s’\).
1 norm regularizer: Lasso
\(\min f(w) + l \norm{w}_{1} = \min \norm{Aw-b}_{2}^{2} + l \norm{w}_{1}\). Allegedly aka least absolute shrinkage and selection operator.
Same as minimizing \(f(w)\) subject to \(\norm{w}_{1} \leq c\). As you increase \(c\), you get less and less sparse solutions. When \(c\geq \norm{\hat{w}}_{1}\), where \(\hat{w}\) is the unregularized least squares solution, you get the least squares solution.
We want the optimality condition \(\gradient (f(w) + l \norm{w}_{1}) = 0\), but \(\norm{w}_1\) is not differentiable everywhere; so write \(l \norm{w}_{1} = l \sum_i sgn(w_i) w_i\) and assume that \(sgn(w_i’)\) around the solution \(w’\) is known.
This gives conditions useful in the lasso-solving modification of least angle regression (with the \(f(w) = \frac{1}{2}\norm{Aw-b}^{2}\) scaling; otherwise replace \(\gl\) by \(\gl/2\)): let \(B = \{i: w_i’ \neq 0\}\), the basis/support set. \(\forall j\in B: a_{j}^{T}(b-Aw) = \gl\, sgn(w_j)\); \(\forall j\notin B: \abs{a_{j}^{T}(b-Aw)} \leq \gl\). If \(A\)'s columns have norm 1, we get a geometric meaning for \(\gl\): an upper bound on the correlation of the residual with any feature!
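A sketch that solves the lasso by proximal gradient descent (iterative soft-thresholding, a standard alternative to the LAR route) and then checks the conditions above numerically; the \(\frac{1}{2}\norm{Aw-b}^{2}\) scaling is used so the conditions appear with \(\gl\) exactly as written:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(A, b, lam, iters=5000):
    """Proximal gradient (ISTA) for min_w 0.5*||Aw - b||^2 + lam*||w||_1."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2      # 1/L, L = largest eigenvalue of A^T A
    w = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ w - b)
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(7)
A = rng.normal(size=(40, 10)); A /= np.linalg.norm(A, axis=0)
b = rng.normal(size=40)
lam = 0.5
w = lasso_ista(A, b, lam)

# Check the optimality conditions: on the support the correlation equals lam * sign(w_j);
# off the support its magnitude is at most lam.
corr = A.T @ (b - A @ w)
B = np.abs(w) > 1e-10
print(np.allclose(corr[B], lam * np.sign(w[B]), atol=1e-6))
print(np.all(np.abs(corr[~B]) <= lam + 1e-6))
```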
Importance
Often yields sparse solutions: from geometry.
\(\norm{w}_q\) for \(q \in (0, 1)\) would also yield sparse solutions: see the geometric interpretation for justification. But these correspond to imposing the constraint \(\norm{w}_q \leq c\), for which the feasible region is non-convex. The 1-norm is the smallest \(q\) which, besides leading to a sparse solution, is also convex.
For importance of finding sparse models, see statistics survey.
The geometry
\(\norm{w}_{1} \leq c\) is a hyper-rhombus (cross-polytope); \(f(w)\) is a paraboloid, each of whose level sets is an ellipsoid. In general, the \(w\) minimizing \(f(w)\) will lie outside the hyper-rhombus. So the optimal \(w\) is the point where the boundary of the hyper-rhombus meets the level set \(f(w) = t\) for the smallest such \(t\). Usually this happens at a vertex or where edges meet, i.e. where many \(w_{i}\) are 0: visualize in 2D; hence sparsity.
Compare with ridge regression.