Regression

General problem

Many (x, y) pairs (observations) are given; the form of h(x, w) (eg: the degree of a polynomial) is known, but the parameters w in h(x, w) are unknown. Want to find w. range(h) is not finite; it is usually continuous.

y is the response variable, and w is called the regression vector. The matrix formed from the x's is often called the design matrix.

Many such continuous valued models are described in the probabilistic models reference.

Linear regression

The problem

h(x, w) is linear in w (eg: \(h(x, w) = w_1 x^3 + w_2 x + w_3\), with basis functions \(\phi_i(x) = x^i\)).

Make the matrix A with each row being the feature vector \(\phi(x)\) at one data point x; take b to be the vector of responses y at those points; the coefficients form the variable vector w.
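Concretely (a standard construction; the symbols \(N\), \(d\), \(x_j\), \(y_j\) are notation assumed here): with \(N\) observations \((x_j, y_j)\) and \(d\) basis functions,

\[ A = \begin{pmatrix} \phi_1(x_1) & \cdots & \phi_d(x_1) \\ \vdots & & \vdots \\ \phi_1(x_N) & \cdots & \phi_d(x_N) \end{pmatrix}, \qquad b = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}, \]

so that \(Aw \approx b\) when \(h(x, w) = \sum_i w_i \phi_i(x)\) fits the data well.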

The solution

Want to tune w so that it yields the least deviation from b. This deviation is measured using various loss functions.

Quadratic loss function

Get the least squares problem \(\min e(w) = \min \norm{Aw - b}{2}\). This penalizes positive and negative deviations symmetrically, but is sensitive to outliers.
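A minimal numerical sketch of solving this problem (numpy assumed; the quadratic basis and the data are illustrative, not from the source):

    import numpy as np

    # Illustrative data (assumed): y is roughly 1 + 2x plus small noise.
    rng = np.random.default_rng(0)
    x = np.linspace(-1.0, 1.0, 50)
    y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(x.shape)

    # Design matrix A: each row is the feature vector (1, x, x^2) at one data point.
    A = np.column_stack([np.ones_like(x), x, x ** 2])

    # Least squares: minimize ||A w - y||_2 over w.
    w, residuals, rank, svals = np.linalg.lstsq(A, y, rcond=None)
    print(w)  # approximately [1, 2, 0]

A single large outlier in y would pull the fitted w noticeably; that is the sensitivity mentioned above.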

Maximum likelihood estimate with Gaussian noise

If you view y as h(x, w) + Gaussian noise n, the least squares solution is also the maximum likelihood solution.
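To spell this out (a standard derivation; \(\sigma\) denotes the noise standard deviation, assumed constant across observations): if \(b = Aw + n\) with \(n \sim N(0, \sigma^2 I)\), then

\[ -\log p(b \mid w) = \frac{1}{2\sigma^2} \norm{Aw - b}{2}^2 + \text{const}, \]

so maximizing the likelihood over w is exactly minimizing the quadratic loss.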

A noise distribution symmetric about the mean is not sufficient to lead to the least squares solution: \(\min e(w) = \min \norm{Aw - b}{4}\) penalizes deviations from the mean just as symmetrically.

Imposing prior distributions on w

The solutions below assume the quadratic loss function to measure deviation from b. Priors are implied by regularizers in \(\min e(w) = \min \norm{Aw - b}{2} + p(w)\), where p is some penalty function. Usually \(p(w) = \norm{w}{k}\).
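To make the correspondence explicit (a standard MAP argument; \(\pi\) denotes the prior density, notation assumed here): with the Gaussian noise model above and a prior \(\pi(w) \propto e^{-p(w)}\),

\[ \arg\max_w \pi(w \mid b) = \arg\min_w \left[ -\log p(b \mid w) - \log \pi(w) \right] = \arg\min_w \left[ \frac{1}{2\sigma^2} \norm{Aw - b}{2}^2 + p(w) \right], \]

so the penalty is, up to scaling and additive constants, the negative log prior.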

Quadratic regularizer

With a Gaussian prior on w (in addition to the Gaussian noise model above), the maximum a posteriori solution yields the ridge regression problem.
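For instance (a standard result; \(\tau\) denotes the prior standard deviation, a symbol assumed here): a zero-mean Gaussian prior \(w \sim N(0, \tau^2 I)\) gives

\[ \min_w \norm{Aw - b}{2}^2 + \lambda \norm{w}{2}^2, \qquad \lambda = \frac{\sigma^2}{\tau^2}, \]

whose minimizer is \(w = (A^T A + \lambda I)^{-1} A^T b\).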

Priors which prefer sparse w

Can use lasso, or compressed sensing. See optimization ref.
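For reference (the standard lasso objective; \(\lambda\) is a regularization weight): the \(l_1\)-penalized problem

\[ \min_w \norm{Aw - b}{2}^2 + \lambda \norm{w}{1} \]

corresponds to a Laplacian prior on w and tends to produce sparse solutions.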

Statistical efficiency

N samples, d dimensions, s-sparse parameter vector: \(E[\norm{\hat{\theta} - \theta}{2}] \lesssim \sqrt{\frac{s \log d}{N}}\) (up to constants).

Solution

See optimization ref.