With fully labelled data

Goals and formulations are presented elsewhere.

Binary classification

See the many hypothesis classes/parametrized model families in the colt ref and the boolean functions ref.

Non-parametric methods

k nearest neighbors

A discriminative, non-parametric approach. Number of examples: N. Samples: \(S = (x_{1}, \dots, x_{N})\), with labels from \(K\) classes. To classify \(x\), find its \(k\) nearest neighbors in \(S\) and take their majority vote.

So one can never throw away data points: the whole training set must be stored.
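A minimal sketch in Python (the function name, the choice \(k = 5\), and the Euclidean metric are illustrative, not fixed by the notes above):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=5):
    """Classify x by majority vote among its k nearest training points.

    X_train: (N, d) array of samples S; y_train: (N,) array of class labels.
    """
    # Distances from x to every stored sample: all of S is needed at prediction time.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]             # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())  # count labels among the neighbors
    return votes.most_common(1)[0][0]           # majority label
```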

Linear models for discrete classification

Linear separability of data sets in feature space: decision surfaces are \((p-1)\)-dimensional hyperplanes in the \(p\)-dimensional feature space.

\(h(x, w) = f(w^{T}x + w_{0}) = f(v^{T}x')\) with \(v = (w, w_{0})\) and \(x' = (x, 1)\): \(f\) is the activation function or link function; \(w\) is the weight vector; \(w_{0}\) is the bias. So, without loss of generality, we can restrict ourselves to hyperplanes passing through the origin.

Decision surfaces are \(h(x, v) =\) constant, i.e. \(v^{T}x' =\) constant, so linear in \(x\). The model is not linear in \(w\) due to the possible non-linearity of \(f\): hence the name generalized linear model.
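As a concrete reading of the notation (a sketch; the names are mine, with \(f = sgn\) for binary classification):

```python
import numpy as np

def glm_predict(x, v, f=np.sign):
    """h(x, v) = f(v^T x'), with x' = (x, 1) so the bias w_0 is absorbed into v."""
    x_aug = np.append(x, 1.0)   # x' = (x, 1)
    return f(v @ x_aug)         # activation/link function applied to the linear score
```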

For binary classification, this becomes a halfspace: see boolean function ref and colt ref.

For geometric properties of separating hyperplane, see boolean function ref.

Arbitrary separator from fully separable training set

When the training set is fully separable, one can pick one of the many separating hyperplanes easily, for example using linear programming. For other such algorithms, see the computational learning theory ref.

To select the best among the candidate hyperplanes, one can use some sort of regularization, as in maximum margin classifiers.
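A sketch of the linear-programming route, assuming fully separable data; the normalization \(c(x_i)\,v^{T}x_i' \geq 1\) and the variable names are my choices. Any feasible point of the LP is an (arbitrary) separating hyperplane.

```python
import numpy as np
from scipy.optimize import linprog

def lp_separator(X, y):
    """Find some separating hyperplane v (bias absorbed) via an LP feasibility problem.

    X: (N, d) data; y: (N,) labels in {-1, +1}. Returns v of length d+1, or None.
    """
    N, d = X.shape
    X_aug = np.hstack([X, np.ones((N, 1))])          # x' = (x, 1)
    # Constraints y_i * v^T x'_i >= 1, written as A_ub @ v <= b_ub for linprog.
    A_ub = -y[:, None] * X_aug
    b_ub = -np.ones(N)
    # Zero objective: we only need feasibility.
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    return res.x if res.success else None
```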

Winnow: multiplicative update

Let the weight of \(c\) be \(W\). Set initial weights \(a_{i} = 1\). Multiplicative update rule, applied on a mistake: if \(x_{i}\) agrees with \(c(x)\), set \(a_{i} = a_{i}(1+\del)\); set \(a_{i} = a_{i}/(1+\del)\) for the others. Mistake bound \(mb = O(W^{2} \log n)\). \why Sometimes better than the additive update rule used in the perceptron algorithm.
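A sketch of the textbook promote/demote variant of winnow for Boolean attributes (threshold \(n\), update factor \(1+\del\)); it is close to, but not identical to, the rule stated above:

```python
import numpy as np

def winnow(X, y, delta=1.0):
    """Winnow with multiplicative promote/demote updates on mistakes.

    X: (T, n) array with entries in {0, 1}; y: (T,) labels in {-1, +1}.
    Predict +1 iff a . x >= n. Returns the learned weight vector a.
    """
    T, n = X.shape
    a = np.ones(n)                         # a_i = 1 initially
    for x, label in zip(X, y):
        pred = 1 if a @ x >= n else -1
        if pred != label:                  # update only on mistakes
            if label == 1:
                a[x == 1] *= (1 + delta)   # promote the active attributes
            else:
                a[x == 1] /= (1 + delta)   # demote the active attributes
    return a
```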

This is very similar to the 'panel of experts' algorithm; see also the note in the perceptron algorithm section comparing the two.

Perceptron learning algorithm for halfspaces

The problem

The classifier: a hyperplane \(c\) through the origin perfectly classifies the labeled data; a unit vector \(u \perp c\) defines \(c\); \(c(x_i) = sgn(\dprod{u, x_i})\).

The data: \(x \in R^{n}\); \(\norm{x_i} = 1\); \(S = \set{x_i}\). Geometric margin of \(S\) wrt \(u\): \(g = \min_{x \in S} |\dprod{u, x}|\); equivalently, \(sgn(\dprod{u, x_i})\dprod{u,x_{i}} \geq g\) for all \(i\). Note: \(g\) is a function of \(S\).

Want to find \(c\).

The algorithm

\(u_{0} = 0\). Additive update rule: if the \(t\)-th mistake is on \(x_i\), set \(u_{t+1} = u_{t} + sgn(\dprod{u, x_i})x_i = u_{t} + c(x_i)x_i\): the hyperplane orientation is tilted towards correcting the mistake.
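A sketch of the mistake-driven loop, cycling over \(S\) until there are no mistakes (the data layout and the epoch cap are my choices):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Perceptron with additive updates: u <- u + c(x) x on every mistake.

    X: (N, d) points (ideally normalized to unit length); y: (N,) labels in {-1, +1}.
    """
    u = np.zeros(X.shape[1])              # u_0 = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in zip(X, y):
            if label * (u @ x) <= 0:      # mistake (or exactly on the hyperplane)
                u += label * x            # tilt u towards correcting the mistake
                mistakes += 1
        if mistakes == 0:                 # converged: u separates S
            break
    return u
```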

Convergence to u

\(mb = O(g^{-2})\). In general, if \(\norm{x_i} \leq R\), then \(mb = O((\frac{R}{g})^{2})\).

By induction, using the update rule expression for \(u_t\): \(\norm{u_t} = \norm{u_t}\norm{u} \geq \dprod{u, u_t} \geq tg\). So, if the length of \(u_t\) is not increasing too fast, maybe the perceptron is getting closer to \(u\) as more mistakes are made.

Also by induction, using the update rule and the fact that \(u_{t-1}\) misclassified the point causing the \(t\)-th update: \(\norm{u_{t}}^{2} \leq t\).

So, \((tg)^{2} \leq t\); and \(t\leq g^{-2}\).
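Spelled out, the per-mistake steps behind the two inductions (using \(\norm{u} = 1\), \(\norm{x} = 1\), and \(c(x) = sgn(\dprod{u, x})\) for the point \(x\) causing the \(t\)-th update):

\begin{align*}
\dprod{u, u_{t}} &= \dprod{u, u_{t-1}} + c(x)\dprod{u, x} \geq \dprod{u, u_{t-1}} + g,\\
\norm{u_{t}}^{2} &= \norm{u_{t-1}}^{2} + 2c(x)\dprod{u_{t-1}, x} + \norm{x}^{2} \leq \norm{u_{t-1}}^{2} + 1,
\end{align*}

where the cross term is non-positive precisely because \(u_{t-1}\) made a mistake on \(x\).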

Comparison

The perceptron algorithm is usually faster than LP. It becomes exponential when \(g \leq 2^{-c}\): this is rare.

For a given \(g\), we can find a good-enough halfspace with mb \(O((\frac{R+D}{g})^{2})\). \chk Perhaps the winnow algorithm, which uses a multiplicative update, is more efficient.

Has connection to halfspace learning with noise. \why

Assumes that the data is perfectly separable, so it is often less preferable than soft-margin SVMs. But a soft-margin variant of the perceptron algorithm is known.

With panel of experts algorithm

Compare the perceptron algorithm \(B\) with the algorithm \(A\) described in the learning theory survey, which, for any given sequence of inputs, uses a panel of experts to achieve a mistake bound comparable with that of the best expert. In the case of halfspaces, every input bit \(x_i\) can be viewed as an 'expert'.

Upon making a mistake, \(A\) updates the weights of only the experts which made a mistake, whereas \(B\) updates the weight assigned to every expert.

The weights used in \(A\) were all positive, whereas the weights used in \(B\) can be negative: but this distinction is minor, as it can perhaps be accounted for in the 'experts algorithm' by introducing experts corresponding to \(-x_i\).

Maximum margin classifier

The problem

A discriminative, parametric approach. Number of examples: N. Samples: \((x_{1}, \dots, x_{N})\), label function \(c:X \to \set{\pm 1}\).

Suppose \(y(x) = w^{T} \ftr(x) + w_0\) with \(y(x)c(x) > 0 \ \forall x\), for some \(w, w_0, \ftr\). So we are finding a separating hyperplane (see halfspaces in the boolean function ref) in some feature space.

Hard margin

Primal

To maximize the margin, solve \(\max_{w,w_0}[\frac{\min_{n}[y(x_{n})c(x_{n})]}{\norm{w}}]\). Scale \(w, w_0\) so that \(\min_{n}[y(x_{n})c(x_{n})] = 1\); thence get the equivalent problem \(\min_{w,w_0}\frac{\norm{w}^{2}}{2}\) subject to \(y(x_{n})c(x_{n}) \geq 1\). Prediction: \(sgn(y(x))\).

Can solve using Quadratic programming (QP).
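A sketch of this primal QP with cvxpy (the library choice is mine; `Phi` holds the rows \(\ftr(x_{n})\), `y` the labels \(c(x_{n})\)):

```python
import cvxpy as cp
import numpy as np

def hard_margin_primal(Phi, y):
    """Solve min ||w||^2 / 2 subject to y_n (w^T phi(x_n) + w0) >= 1 for all n.

    Phi: (N, d) feature matrix; y: (N,) labels in {-1, +1}. Assumes separable data.
    """
    N, d = Phi.shape
    w = cp.Variable(d)
    w0 = cp.Variable()
    constraints = [cp.multiply(y, Phi @ w + w0) >= 1]
    prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
    prob.solve()
    return w.value, w0.value
```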

Dual

Get the Lagrangian \(L(w, w_0, a) = \frac{\norm{w}^{2}}{2} + \sum a_{n}[1-(w^{T} \ftr(x_n) + w_0)c(x_{n})]\); \(a_{n} \geq 0\). Get the dual: \(g(a) = \inf_{w, w_0} L(w, w_0, a)\); set \(\gradient_{w, w_0} L(w, w_0, a) = 0\): \(w = \sum_{n}a_{n}c(x_{n}) \ftr(x_{n}); 0 = \sum a_{n}c(x_{n})\). So, the dual problem is \(\max_a g(a) = \max \sum a_{n} - 2^{-1}\sum_{n}\sum_{m} a_{n}a_{m}c(x_{n})c(x_{m})k(x_{n}, x_{m})\), where \(k(x_{n}, x_{m}) = \ftr(x_{n})^{T}\ftr(x_{m})\), subject to \(a_{n} \geq 0; \sum a_{n}c(x_{n}) = 0\).

Can solve using QP too. This form is useful when \(\dim(\ftr(x)) \gg N\).
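A matching sketch of the dual QP, again with cvxpy; the linear-kernel default and the small jitter added for numerical positive semidefiniteness are my choices:

```python
import cvxpy as cp
import numpy as np

def hard_margin_dual(X, y, kernel=lambda A, B: A @ B.T):
    """max sum_n a_n - (1/2) sum_{n,m} a_n a_m c_n c_m k(x_n, x_m), a_n >= 0, sum_n a_n c_n = 0."""
    N = X.shape[0]
    K = kernel(X, X)                            # Gram matrix k(x_n, x_m)
    Q = np.outer(y, y) * K + 1e-9 * np.eye(N)   # c_n c_m k(x_n, x_m), jittered to stay PSD
    a = cp.Variable(N)
    objective = cp.Maximize(cp.sum(a) - 0.5 * cp.quad_form(a, Q))
    constraints = [a >= 0, y @ a == 0]
    cp.Problem(objective, constraints).solve()
    return a.value
```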

KKT conditions

Primal feasible: \(y(x_{n})c(x_{n}) \geq 1\). Dual feasible: \(a_{n} \geq 0; \sum a_{n}c(x_{n}) = 0\). Complementary slackness: \(a_{n}[1 - c(x_{n}) y(x_{n})] = 0\).

So, \(\forall n: a_{n} = 0 \lor c(x_{n})y(x_{n})= 1\); the points with \(a_{n} \neq 0\) are the support vectors. Hence the name Support Vector Machine (SVM). Let \(S\) be the number of support vectors.

Predictor

Substituting for \(w\), get: \(y(x) = \sum_{n}a_{n}c(x_{n})k(x_{n}, x) + w_0\); only the \(S\) terms with \(a_{n} \neq 0\) actually appear, so the SVM is fast to evaluate. For each support vector, \(c(x_{m})y(x_{m}) = 1\); averaging over the support vectors gives \(w_0 = \frac{1}{S}\sum_{m: a_{m} \neq 0} [c(x_{m}) - \sum_{n} a_{n}c(x_{n})k(x_{n}, x_{m})]\).
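Continuing the dual sketch above, the predictor and \(w_0\) recovered from the dual variables (the tolerance used to detect \(a_{n} \neq 0\) is a numerical choice):

```python
import numpy as np

def svm_predictor(a, X, y, kernel=lambda A, B: A @ B.T, tol=1e-6):
    """Build y(x) = sum_n a_n c(x_n) k(x_n, x) + w_0 from the dual variables a."""
    sv = a > tol                                  # support vectors: a_n appreciably > 0
    coef = a[sv] * y[sv]                          # a_n c(x_n) for the support vectors
    # w_0 averaged over support vectors, where c(x_m) y(x_m) = 1 holds.
    w0 = np.mean(y[sv] - kernel(X[sv], X[sv]) @ coef)
    def y_of(x):                                  # y(x); predict with sgn(y(x))
        return kernel(np.atleast_2d(x), X[sv]) @ coef + w0
    return y_of, w0
```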

Soft margins

Allow some points to be misclassified or to be below the margin, but linearly penalize such outliers.

Primal

So, use slack variables \(\gx_{n} \geq 0\); instead of \(y(x_{n})c(x_{n}) \geq 1\), use the constraint \(y(x_{n})c(x_{n}) + \gx_{n} \geq 1\).

\(\min C\sum_{n=1}^{N} \gx_{n} + \frac{\norm{w}^{2}}{2}\): \(C\) is the tradeoff between the penalty on \(\gx\) and the margin; it acts like a Lagrange multiplier for a constraint of the form \(\sum \gx_{n} \leq G\), so it controls model complexity. As \(C \to \infty\), we recover the hard-margin SVM.

Dual

Lagrangian: \(L(w, w_0, \gx, a, \gm) = \frac{\norm{w}^{2}}{2} + C\sum_{n=1}^{N} \gx_{n} + \sum a_{n}(1 - c(x_{n})y(x_{n}) - \gx_{n}) - \sum \gm_{n}\gx_{n}\): \(a \geq 0, \gm \geq 0\). Set \(\gradient_{w, w_0, \gx} L = 0\): \(w = \sum a_{n}c(x_{n})\ftr(x_{n})\), \(\sum a_{n}c(x_{n}) = 0\), \(a_{n} = C - \gm_{n}\). Thence get the dual \(g(a)\), with the same objective as in the hard-margin case, but with constraints \(0 \leq a_{n} \leq C\) (since \(\gm_{n} \geq 0\)) and \(\sum a_{n}c(x_{n}) = 0\).

KKT conditions

Primal feasible: \(1- c(x_{n})y(x_{n}) - \gx_{n} \leq 0\), \(\gx_{n} \geq 0\). Dual feasible: \(a_{n} \geq 0, \gm_{n} \geq 0\). Complementary slackness: \(a_{n}(1 - c(x_{n})y(x_{n}) - \gx_{n}) = 0, \gm_{n}\gx_{n} = 0\).

So, the support vectors are now points on or within a certain margin of the hyperplane. The predictor is the same as in the hard-margin case.
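In practice one hands this to an off-the-shelf solver. A sketch with scikit-learn's SVC, which solves this soft-margin dual; the RBF kernel, the toy data, and \(C = 1\) are illustrative choices:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data; C trades the slack penalty against the margin.
X = np.array([[0., 0.], [1., 1.], [1., 0.], [0., 1.],
              [2., 2.], [2., 3.], [3., 2.], [3., 3.]])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

clf = SVC(kernel="rbf", C=1.0)      # C -> infinity approaches the hard-margin SVM
clf.fit(X, y)
print(clf.support_vectors_)         # the x_n with a_n != 0
print(clf.predict([[1.5, 1.5]]))    # prediction is sgn(y(x))
```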