For ideas about its importance, both in solving general convex optimization problems and in other domains, see the ‘Least squares’ chapter.
Linear Constrained QP
Aka quadratic programming (QP). Minimize a convex quadratic functional over a polyhedron (see vector spaces survey). Generalizes linear programming; a special case of QCQP.
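For concreteness, a sketch of the standard form (generic notation, not tied to this survey):
\[
\min_{x}\; \tfrac{1}{2}x^{T}Px + q^{T}x
\quad\text{subject to}\quad Gx \preceq h,\; Ax = b,
\]
with \(P \succeq 0\); the affine constraints define the polyhedron.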
Visualization
If the optimal value occurs when a constraint is active: think of contours of the objective hitting an edge of the polyhedron. Slightly harder than in the LP case, where the minimum always occurred at a vertex.
Dual problem
This is also a quadratic program; since the constraints are affine, strong duality holds whenever the primal is feasible (refined Slater condition).
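A sketch of why, for the inequality-form QP \(\min_{x}\, \tfrac{1}{2}x^{T}Px + q^{T}x\) subject to \(Gx \preceq h\), assuming \(P \succ 0\) for simplicity so it is invertible: minimizing the Lagrangian over x gives \(x = -P^{-1}(q + G^{T}\mu)\), so the dual is
\[
\max_{\mu \succeq 0}\; -\tfrac{1}{2}(q + G^{T}\mu)^{T}P^{-1}(q + G^{T}\mu) - h^{T}\mu,
\]
again a concave quadratic objective with linear constraints, i.e., a QP.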
Quadratically constrained QP (QCQP)
Minimize a convex quadratic functional subject to convex quadratic inequality constraints; any equality constraints must be affine. The feasible set is an intersection of ellipsoids (possibly degenerate) and a polyhedron.
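A sketch of the standard form, with all \(P_{i} \succeq 0\):
\[
\min_{x}\; \tfrac{1}{2}x^{T}P_{0}x + q_{0}^{T}x
\quad\text{subject to}\quad
\tfrac{1}{2}x^{T}P_{i}x + q_{i}^{T}x + r_{i} \leq 0 \;(i = 1,\dots,m),\quad Ax = b.
\]
Each quadratic inequality contributes an ellipsoidal (possibly degenerate) region; the affine/linear constraints contribute the polyhedral part.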
Least squared error problem
Problem
Aka least squares.
Unregularized problem
For the weighted least squares
More weight to one observation, or correlated observations: use a weighted inner product to find the projection.
Solution
See numerical analysis survey for geometry and solution.
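For reference, a sketch of the corresponding normal equations (assuming A has full column rank; W symmetric positive definite):
\[
\min_{w}\norm{Aw-b}{2}^{2}:\; A^{T}Aw = A^{T}b,
\qquad
\min_{w}(Aw-b)^{T}W(Aw-b):\; A^{T}WAw = A^{T}Wb.
\]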
The regularized problem
Aka weight decay (the usual name when the regularizer is \(\norm{w}{2}^{2}\)).
Want to balance love of sparsity or smallness with desire for a good fit. Pick a regularizer which is small for good w, and large for bad w.
Take \(\min_{w} f(w) + \gl\norm{w}{n}\). As n decreases, the constraint ball \(\norm{w}{n} \leq c\) shrinks: consider the progression of unit balls from \(\norm{\cdot}{\infty}\) through \(\norm{\cdot}{2}\) to \(\norm{\cdot}{1}\).
Application
Convex Optimization
A common technique for solving convex optimization problems is to repeatedly approximate the objective with a quadratic function and minimize that approximation.
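For instance, a second-order Taylor approximation around the current iterate x gives the Newton step (a sketch, assuming \(\nabla^{2}f(x) \succ 0\)):
\[
f(x + \Delta x) \approx f(x) + \nabla f(x)^{T}\Delta x + \tfrac{1}{2}\Delta x^{T}\nabla^{2}f(x)\Delta x
\;\Rightarrow\;
\Delta x = -\nabla^{2}f(x)^{-1}\nabla f(x),
\]
an unconstrained quadratic minimization solved at every iteration.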
Other domains
See the regression problem in the statistics survey for motivation. Same as finding the maximum likelihood solution when fitting a linear model with Gaussian noise.
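A one-line sketch of that equivalence: if \(b = Aw + \epsilon\) with \(\epsilon \sim N(0, \sigma^{2}I)\), then
\[
-\log p(b \mid w) = \frac{1}{2\sigma^{2}}\norm{Aw - b}{2}^{2} + \text{const},
\]
so maximizing the likelihood over w is exactly least squares.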
Interpolation of functions linear in parameters is also a specific case: see numerical analysis survey.
Regularized problems
If solving a regression problem, this amounts to the solution (MAP estimate) obtained when a prior probability distribution favoring smaller w values is imposed on w.
Penalties and priors
Different regularizers (usually norms) correspond to different priors on w. The level sets of the prior density look like the unit balls of the norms used: sphere, diamond, etc.
An \(\norm{w}{q}^{q}\) penalty corresponds to a prior with density proportional to \(\exp(-\gl\norm{w}{q}^{q})\).
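A sketch of the correspondence, using the Gaussian noise model above with a prior \(p(w) \propto \exp(-\gl\norm{w}{q}^{q})\):
\[
-\log p(w \mid b) = \frac{1}{2\sigma^{2}}\norm{Aw-b}{2}^{2} + \gl\norm{w}{q}^{q} + \text{const},
\]
so the MAP estimate solves a regularized least-squares problem of this form (with penalty weight \(2\sigma^{2}\gl\)); q = 2 is the Gaussian prior (ridge), q = 1 the Laplacian prior (lasso).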
Standard formulations
Normalizing columns of A
By solving the equivalent problem with each column of A rescaled to unit norm (and rescaling the resulting coefficients back), we may assume the penalty treats all coefficients on a comparable scale.
Quadratic regularizer
Aka ridge regression: \(\min_{w} \norm{Aw-b}{2}^{2} + \gl\norm{w}{2}^{2}\).
This objective is the Lagrangian corresponding to the problem of finding \(\min_{w} \norm{Aw-b}{2}^{2}\) subject to \(\norm{w}{2}^{2} \leq c\).
The solution form
So, setting the gradient to 0: \(2A^{T}(Aw - b) + 2\gl w = 0\), giving \(w^{*} = (A^{T}A + \gl I)^{-1}A^{T}b\).
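A minimal numpy sketch of this closed form (variable names are mine), checked against the equivalent augmented least-squares system \([A; \sqrt{\gl}I]w \approx [b; 0]\):

import numpy as np

rng = np.random.default_rng(0)
A, b, lam = rng.standard_normal((50, 10)), rng.standard_normal(50), 0.5

# Closed form: w = (A^T A + lam I)^{-1} A^T b
w_closed = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)

# Same minimizer via the augmented least-squares system [A; sqrt(lam) I] w ~ [b; 0]
A_aug = np.vstack([A, np.sqrt(lam) * np.eye(A.shape[1])])
b_aug = np.concatenate([b, np.zeros(A.shape[1])])
w_aug = np.linalg.lstsq(A_aug, b_aug, rcond=None)[0]

assert np.allclose(w_closed, w_aug)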
The geometry
Consider the hypersphere \(\norm{w}{2} \leq c\): the optimum is where the smallest level-set ellipsoid of \(\norm{Aw-b}{2}^{2}\) touches this ball; generically the contact point has no zero coordinates.
Compare with the geometry of the \(\norm{\cdot}{1}\)-constrained (lasso) problem below.
Effect
Shrinks the components of w toward 0, but does not typically set them exactly to 0.
Does not penalize small components much, since the penalty is quadratic in each \(w_{i}\).
l2 and linf regularizers
l2 regularizer
Problem
Solve \(\min_{w} \norm{Aw-b}{2}^{2} + \gl\norm{w}{2}\) (the norm itself, not its square).
Solution
The optimality condition involves taking the subdifferential of \(\norm{w}{2}\), which is not differentiable at w = 0.
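A sketch of the resulting condition, using the standard subdifferential of the Euclidean norm:
\[
0 \in 2A^{T}(Aw - b) + \gl\,\partial\norm{w}{2},
\qquad
\partial\norm{w}{2} = \{w/\norm{w}{2}\} \text{ if } w \neq 0,
\quad \{g : \norm{g}{2} \leq 1\} \text{ if } w = 0.
\]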
Relation with Quadratic regularizer
Rewriting as \(\min_{w} \norm{Aw-b}{2}^{2}\) subject to \(\norm{w}{2} \leq c\): this constraint is the same as \(\norm{w}{2}^{2} \leq c^{2}\), so the solutions coincide with those of the quadratic regularizer for a suitable penalty weight.
The quadratic-regularizer formulation is simpler to solve analytically, since its optimality conditions do not involve subgradients.
But sometimes we are given the problem with the unsquared \(\norm{w}{2}\) penalty and must handle it directly.
lInf regularizer
Here, we solve \(\min_{w} \norm{Aw - b}{2}^{2} + \gl \norm{w}{\infty}\). The optimality condition involves taking the subdifferential of \(\norm{w}{\infty}\). A closed-form solution can be derived by a proof by cases: e.g., see a paper on multi-task learning by Han Liu.
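For reference, the subdifferential involved is the standard dual-norm characterization:
\[
\partial\norm{w}{\infty} = \{g : \norm{g}{1} \leq 1,\; g^{T}w = \norm{w}{\infty}\},
\]
which for \(w \neq 0\) is the convex hull of \(\{\operatorname{sign}(w_{i})e_{i} : |w_{i}| = \norm{w}{\infty}\}\).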
Forward stepwise regression for sparsity
Trying to solve Aw = y. Take F = the active set of features used in approximating y. Start with F empty and the residual r = y; repeatedly add to F the feature most correlated with r, refit w on F by least squares, and recompute r (a sketch is given below).
This is a greedy algorithm, so suboptimal. But note that this is not the same as simply finding the optimal w without any sparsity constraint and then dropping some small elements.
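A minimal sketch of the greedy loop described above (function and variable names are mine):

import numpy as np

def forward_stepwise(A, y, k):
    """Greedily pick k columns of A, refitting w on the active set by least squares."""
    n = A.shape[1]
    F = []                                   # active set of feature indices
    w = np.zeros(n)
    for _ in range(k):
        r = y - A @ w                        # current residual
        j = int(np.argmax(np.abs(A.T @ r)))  # feature most correlated with the residual
        if j not in F:
            F.append(j)
        w = np.zeros(n)
        w[F] = np.linalg.lstsq(A[:, F], y, rcond=None)[0]  # refit on the active set
    return w, F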
Forward stagewise regression
A non-greedy modification of forward stepwise. Find the feature set B with maximum correlation with the residual; for each such i, keep increasing \(w_{i}\) by small steps \(\epsilon \cdot \operatorname{sign}(\text{correlation})\), recomputing the residual as you go (see the sketch below).
At any time, B is the set of features which form the \exclaim{least angle} with the residual. The coefficients increase such that the correlations of the features in B with the residual stay tied and decrease together.
Also, if at some point a feature correlated with some feature already in B gets added to B, the coefficients of features in B may change direction.
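A sketch of the \(\epsilon\)-step loop, here nudging a single most-correlated feature per step (names are mine):

import numpy as np

def forward_stagewise(A, y, eps=0.01, n_steps=10000):
    """Tiny-step updates toward the feature currently most correlated with the residual."""
    w = np.zeros(A.shape[1])
    for _ in range(n_steps):
        r = y - A @ w                        # residual
        corr = A.T @ r
        j = int(np.argmax(np.abs(corr)))     # feature forming the least angle with r
        w[j] += eps * np.sign(corr[j])       # nudge its coefficient a little
    return w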
Least Angle Regression
A modification of forward stagewise regression, so that instead of taking many tiny \(\epsilon\) steps, the coefficients of the features in B are moved jointly along the equiangular (least-angle) direction until some new feature ties them in correlation with the residual.
Lasso solving modification
Lasso just specifies an objective, which may be minimized in many ways, including using forward regression.
Can turn LAR into a Lasso solver: whenever the coefficient of an active feature hits 0, drop that feature from B and recompute the direction.
Reason for the fix
Compare the conditions you get on the lasso solution by setting the (sub)gradient to 0 with the conditions holding for LAR at any time: the correlation of the residual with each feature in B is equal in magnitude, and the correlation with the other features is lower. They are identical as long as the sign of each active coefficient matches the sign of its correlation with the residual; dropping a feature from B when its coefficient hits 0 is what maintains this.
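If scikit-learn is available (an assumption; it is not referenced in these notes), both variants are exposed directly:

import numpy as np
from sklearn.linear_model import Lars, LassoLars

rng = np.random.default_rng(0)
X, y = rng.standard_normal((100, 20)), rng.standard_normal(100)

lar = Lars(n_nonzero_coefs=5).fit(X, y)    # plain least angle regression
lasso = LassoLars(alpha=0.1).fit(X, y)     # LAR with the drop-at-zero modification
print(lar.coef_, lasso.coef_)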
0 norm regularizer: Compressed sensing
Problem scenario
X is short and fat, so Xw = y has many solutions for w; want to find the w with the fewest non-zero entries, i.e., with minimum \(\norm{w}{0}\).
Consider the equivalent problem, where we assume columns of X are normalized.
Finding support: Combinatorial hardness
The difficulty comes from finding the support (non-zero coordinates) of w. Once the support of w is found, it is easy to find the optimal linear combination of these components to get close to y: just solve the least-squares problem restricted to the corresponding columns of X.
There are many possible ways to form w with a given number of non-zeros (\(\binom{n}{s}\) possible supports of size s), so searching over supports is combinatorially hard.
Finding support: Target optimization problems
Maybe we want to limit w to a certain number of non-zeros. Solve \(\min_{w} \norm{Xw - y}{2}^{2}\) subject to \(\norm{w}{0} \leq s\).
This is the same as searching over all supports of size at most s and solving the restricted least-squares problem for each.
A stricter version of the problem is: \(\min_{w} \norm{w}{0}\) subject to \(Xw = y\).
Solution using 1 norm minimization
Just solve \(\min_{w} \norm{w}{1}\) subject to \(Xw = y\) (basis pursuit), which can be written as a linear program (see the sketch below).
Recovery of the sparse solution is guaranteed when certain restricted isometry and incoherence properties hold for X.
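A sketch of the LP formulation, splitting \(w = u - v\) with \(u, v \geq 0\) so that \(\norm{w}{1} = \mathbf{1}^{T}(u + v)\) at the optimum (assumes scipy; names are mine):

import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    """min ||w||_1 s.t. Xw = y, as the LP: min 1^T(u+v) s.t. X(u-v) = y, u, v >= 0."""
    n = X.shape[1]
    c = np.ones(2 * n)                      # objective: sum of u and v
    A_eq = np.hstack([X, -X])               # X u - X v = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * n))
    u, v = res.x[:n], res.x[n:]
    return u - v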
Restricted isometry constant for s-sparse vectors
Incoherence/ almost orthogonality
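For reference, sketches of the standard definitions (assuming unit-norm columns \(x_{i}\) of X):
\[
(1 - \delta_{s})\norm{w}{2}^{2} \leq \norm{Xw}{2}^{2} \leq (1 + \delta_{s})\norm{w}{2}^{2}
\quad\text{for all } s\text{-sparse } w,
\]
with \(\delta_{s}\) the smallest such constant, and the coherence \(\mu(X) = \max_{i \neq j}|x_{i}^{T}x_{j}|\); recovery results ask for these to be small.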
1 norm regularizer: Lasso
\(\min_{w} f(w) + \gl \norm{w}{1} = \min_{w} \norm{Aw-b}{2}^{2} + \gl \norm{w}{1}\). Aka the least absolute shrinkage and selection operator.
Same as minimizing f(w) subject to \(\norm{w}{1} \leq c\). As you increase c, you get less and less sparse solutions. When \(c \geq \norm{\hat{w}}{1}\), where \(\hat{w}\) is the unconstrained minimizer of f(w), the constraint is inactive and the solution is simply \(\hat{w}\).
Want to get the optimality condition by setting the subgradient of the objective to 0.
Get conditions useful in the Lasso-solving modification of least angle regression: let B be the support (active set) of w; the conditions are sketched below.
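A sketch of those conditions for \(\min_{w}\norm{Aw-b}{2}^{2} + \gl\norm{w}{1}\), from standard subgradient calculus (\(a_{j}\) = column j of A):
\[
a_{j}^{T}(b - Aw) = \tfrac{\gl}{2}\operatorname{sign}(w_{j}) \;\;(j \in B),
\qquad
|a_{j}^{T}(b - Aw)| \leq \tfrac{\gl}{2} \;\;(j \notin B):
\]
active features have equal absolute correlation with the residual, with signs matching their coefficients; inactive features have no larger correlation. These are the LAR conditions plus the sign requirement.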
Importance
Often yields sparse solutions: from geometry.
For importance of finding sparse models, see statistics survey.
The geometry
\(\norm{w}{1} \leq c\) is a hyper-rhombus (cross-polytope); f(w) is a paraboloid, each of whose level sets is an ellipsoid. In general, the w minimizing f(w) will lie outside the hyper-rhombus. So the optimal w is the point where the boundary of the hyper-rhombus meets the level set f(w) = t for the smallest feasible t. Usually this happens where edges meet, i.e., where many \(w_{i}\) are 0: visualize in 2D; hence sparsity.
Compare with ridge regression.