Regression

General problem

Given many (x, y) pairs (observations); the form of h(x, w) (eg: the degree of a polynomial) is known, but the parameter vector \(w\) in h(x, w) is unknown. Want to find \(w\). range(h) is not finite; it is usually continuous.

y is the response variable, and \(w\) is called the regression vector. The matrix formed by the \(x\) observations is often called the design matrix.

Many such continuous valued models are described in the probabilistic models reference.

Linear regression

The problem

h(x, w) is linear in \(w\) (eg: \(h(x, w) = w_1 x^{3} + w_2 x + w_3\), using the features \(\ftr_{i}(x) = x^{i}\)).

Make the matrix A with each row being the feature vector \(\ftr(x)\) at a data point \(x\); stack the responses observed at those points into the vector b; take the coefficients as the variable vector \(w\). The model's predictions at all data points are then \(Aw\).
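A minimal sketch of building A and b for the polynomial example above, assuming numpy; the data arrays are hypothetical.

```python
import numpy as np

# Hypothetical observations (x_data, y_data); any 1-d data would do.
x_data = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y_data = np.array([3.1, 3.3, 4.9, 9.2, 18.8])

# Features ftr_i(x) = x^i for i = 0..3, so h(x, w) = sum_i w_i x^i is linear in w.
degree = 3
A = np.vander(x_data, N=degree + 1, increasing=True)  # row j is (1, x_j, x_j^2, x_j^3)
b = y_data                                            # responses stacked into b
```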

The solution

Want to tune \(w\) so that \(Aw\) deviates as little as possible from b. This deviation is measured using various loss functions.

Quadratic loss function

Get the least squares problem \(\min_w e(w) = \min_w \norm{Aw - b}_{2}^{2}\). This loss is symmetric about 0, but is sensitive to outliers.
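A sketch of solving this with numpy, continuing the A, b construction above (the data remain hypothetical):

```python
import numpy as np

# Design matrix A and responses b as built earlier; hypothetical data.
A = np.vander(np.array([0.0, 0.5, 1.0, 1.5, 2.0]), N=4, increasing=True)
b = np.array([3.1, 3.3, 4.9, 9.2, 18.8])

# Minimize ||Aw - b||_2^2; lstsq also handles rank-deficient A.
w_hat, residuals, rank, sing_vals = np.linalg.lstsq(A, b, rcond=None)
print(w_hat)
```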

Maximum likelihood estimate with Gaussian noise

If you view y as h(x, w) + Gaussian noise n, the least squares solution is also the maximum likelihood solution.
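A one-line check of this claim, assuming the noise entries are iid \(N(0, \sigma^{2})\):

```latex
% Log likelihood under b_i = (Aw)_i + n_i with n_i iid N(0, sigma^2):
\log p(b \mid w)
  = \sum_{i} \log\left( \frac{1}{\sqrt{2\pi}\,\sigma}
      e^{-\frac{(b_{i} - (Aw)_{i})^{2}}{2\sigma^{2}}} \right)
  = -\frac{1}{2\sigma^{2}} \norm{Aw - b}_{2}^{2} + \text{const},
% so maximizing the likelihood over w is exactly minimizing the squared error.
```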

Symmetry of the noise distribution about the mean is not sufficient to lead to the least squares solution: \(\min_w e(w) = \min_w \norm{Aw - b}_{4}\) penalizes deviations from the mean symmetrically just as well.

Imposing prior distributions on w

The solutions below assume a quadratic loss function to measure deviation from b. Priors on w are implied by the regularizers in \(\min_w e(w) = \min_w \norm{Aw - b}_{2}^{2} + p(w)\), where p is some penalty function. Usually \(p(w) = \norm{w}_{k}\).
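Why regularizers correspond to priors, as a sketch: write \(\pi(w)\) for the prior density and assume the Gaussian noise model above; Bayes' rule then gives the negative log posterior as loss plus penalty.

```latex
% Negative log posterior under b = Aw + n, n ~ N(0, sigma^2 I), prior density pi(w):
-\log p(w \mid b)
  = \frac{1}{2\sigma^{2}} \norm{Aw - b}_{2}^{2} - \log \pi(w) + \text{const},
% so the MAP estimate solves a regularized least squares problem with
% p(w) = -log pi(w), up to additive and multiplicative constants.
```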

Quadratic regularizer

Assuming Gaussian noise and a Gaussian prior on w (ie a quadratic penalty \(p(w) \propto \norm{w}_{2}^{2}\)), the maximum a-posteriori solution yields the ridge regression problem.
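A minimal numpy sketch of the ridge problem \(\min_w \norm{Aw - b}_{2}^{2} + \lambda \norm{w}_{2}^{2}\) via its closed-form solution; the data and the value of lam are hypothetical.

```python
import numpy as np

# Hypothetical design matrix A, responses b, and regularization weight lam.
A = np.vander(np.array([0.0, 0.5, 1.0, 1.5, 2.0]), N=4, increasing=True)
b = np.array([3.1, 3.3, 4.9, 9.2, 18.8])
lam = 0.1

# Closed form for min_w ||Aw - b||_2^2 + lam * ||w||_2^2:
# w = (A^T A + lam I)^{-1} A^T b; solve() avoids forming the inverse explicitly.
d = A.shape[1]
w_ridge = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)
print(w_ridge)
```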

Priors which prefer sparse w

Can use the lasso (\(\norm{w}_{1}\) penalty, corresponding to a Laplace prior on w), or compressed sensing. See optimization ref.
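A small sketch of the lasso, assuming scikit-learn is available; the data, the sparsity pattern, and the value of alpha are all hypothetical. It illustrates how the \(\norm{w}_{1}\) penalty drives most coefficients exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical sparse problem: only 2 of the 20 true coefficients are nonzero.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 20))
w_true = np.zeros(20)
w_true[[3, 11]] = [2.0, -1.5]
b = A @ w_true + 0.1 * rng.normal(size=50)

# l1 penalty (Laplace prior on w) yields a sparse estimate.
model = Lasso(alpha=0.1, fit_intercept=False)
model.fit(A, b)
print(np.nonzero(model.coef_)[0])  # indices of the recovered nonzero coefficients
```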

Statistical efficiency

N samples, \(d\) dimensions, true parameter \(\gth^{*}\) with at most \(s\) nonzero entries. Then, under suitable conditions on the design matrix and up to constants, \(E[\norm{\hat{\gth} - \gth^{*}}_2] \leq \sqrt{\frac{s \log d}{N}}\).

Solution

See optimization ref.