With fully labelled data

Goals and formulations are presented elsewhere.

Binary classification

See the many hypothesis classes / parametrized model families in the colt ref and the boolean functions ref.

Non-parametric methods

k-nearest neighbors

A discriminative, non-parametric approach. Number of examples: $N$. Samples: $S = (x_1, \dots, x_N)$, each labeled with one of several classes. To classify $x$, find its $k$ nearest neighbors in $S$; take the majority vote of their labels.

So, data points can never be thrown away: the whole of $S$ must be stored in order to make predictions.
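A minimal numpy sketch of this rule; the function name `knn_predict`, the Euclidean metric, and the toy data are illustrative choices, not from the notes:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest neighbors in X_train."""
    dists = np.linalg.norm(X_train - x, axis=1)      # Euclidean distances to all N samples
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points
    votes = Counter(y_train[nearest])                # count labels among the neighbors
    return votes.most_common(1)[0][0]                # majority label

# Tiny usage example with two classes.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.95, 1.0]), k=3))  # expected: 1
```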

Linear models for discrete classification

Linear separability of data sets in input or feature space: decision surfaces are $(p-1)$-dimensional hyperplanes in the $p$-dimensional feature space.

$h(x, w) = f(w^T x + w_0) = f(v^T \tilde{x})$ with $\tilde{x} = (x, 1)$ and $v = (w, w_0)$: $f$ is the activation function or link function; $w$ is the weight vector; $w_0$ is the bias. So, without loss of generality, we can restrict ourselves to considering only hyperplanes passing through the origin (in the augmented space).

Decision surfaces are $h(x, v) = \text{constant}$, i.e. $v^T x = \text{constant}$, so linear in $x$; but $h$ is not linear in the parameters $w$ due to the possible non-linearity of $f$: hence the name generalized linear model.
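A small numpy sketch of the bias-absorption trick above; `augment` and `decide` are hypothetical helper names. Appending a constant 1 to each input lets a hyperplane through the origin in the augmented space represent an affine hyperplane in the original space.

```python
import numpy as np

def augment(X):
    """Append a constant 1 feature so the bias w0 can be folded into the weight vector."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def decide(X, w, w0):
    """Generalized linear decision with sign activation: f(w^T x + w0), f = sgn."""
    v = np.append(w, w0)            # v = (w, w0)
    return np.sign(augment(X) @ v)  # same as np.sign(X @ w + w0)

X = np.array([[2.0, -1.0], [-0.5, 0.3]])
print(decide(X, w=np.array([1.0, 2.0]), w0=-0.5))
```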

For binary classification, this becomes a halfspace: see boolean function ref and colt ref.

For geometric properties of separating hyperplane, see boolean function ref.

Arbitrary separator from fully separable training set

When the training set is fully separable, one can easily pick one of the many separating hyperplanes, for example using linear programming (as sketched below). For other such algorithms, see the computational learning theory ref.
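A minimal sketch of the linear-programming route, assuming scipy is available; the function name and the unit scaling $c(x_i)\,v^T x_i \geq 1$ (used to turn the strict inequality into a feasibility LP) are choices made here, not from the notes:

```python
import numpy as np
from scipy.optimize import linprog

def separating_hyperplane(X, y):
    """Find v with y_i * (v . x_i) >= 1 for all i via an LP feasibility problem.

    X: (N, d) array of (bias-augmented) inputs; y: labels in {-1, +1}.
    Returns v, or None if no separator exists at this scaling.
    """
    N, d = X.shape
    # Constraints y_i * v.x_i >= 1 become -y_i * x_i . v <= -1 for linprog's A_ub v <= b_ub.
    A_ub = -(y[:, None] * X)
    b_ub = -np.ones(N)
    res = linprog(c=np.zeros(d), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * d)  # zero objective: any feasible point will do
    return res.x if res.success else None

X = np.array([[1.0, 2.0, 1.0], [2.0, 1.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
print(separating_hyperplane(X, y))
```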

To select the best among the candidate hyperplanes, one can use some sort of regularization, as in maximum margin classifiers.

Winnow: multiplicative update

Let the weight of $c$ be $W$. Initialize weights $a_i = 1$. Multiplicative update rule (applied upon a mistake): if $x_i$ agrees with $c(x)$, set $a_i \leftarrow a_i(1+\delta)$; set $a_i \leftarrow a_i/(1+\delta)$ for the others. Mistake bound: $mb = O(W^2 \log n)$. \why Sometimes better than the additive update rule used in the perceptron algorithm.
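A minimal sketch of this multiplicative update, assuming boolean inputs and updates applied only after a mistake; the weighted-vote prediction against half the total weight, the default $\delta = 0.5$, and the names `winnow_style_update`, `predict`, `run` are choices made here, not from the notes:

```python
import numpy as np

def winnow_style_update(a, x, label, delta=0.5):
    """Multiplicative rule from the notes, applied after a mistake:
    promote weights of bits that agree with the true label, demote the rest."""
    agrees = (x == label)
    a = a.copy()
    a[agrees] *= (1.0 + delta)
    a[~agrees] /= (1.0 + delta)
    return a

def predict(a, x):
    """Weighted vote of the bits: predict 1 iff the weight on bits set to 1
    reaches half of the total weight (an experts-style combination)."""
    return int(a @ x >= a.sum() / 2.0)

def run(stream, n, delta=0.5):
    """Online pass over (x, label) pairs with x in {0,1}^n, labels in {0,1}."""
    a = np.ones(n)
    mistakes = 0
    for x, label in stream:
        if predict(a, x) != label:
            mistakes += 1
            a = winnow_style_update(a, x, label, delta)
    return a, mistakes

# Toy target: label = x[0] (only the first bit is relevant).
rng = np.random.default_rng(0)
stream = [(x, int(x[0])) for x in rng.integers(0, 2, size=(200, 10))]
print(run(stream, n=10)[1])  # mistake count; should stay small
```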

This is very similar to the ‘panel of experts’ algorithm; also see the note in the perceptron section below comparing the two.

Perceptron learning algorithm for halfspaces

The problem

The classifier: a hyperplane $c$ through the origin perfectly classifies the labeled data; the unit vector $u$ normal to $c$ defines it; $c(x_i) = \operatorname{sgn}(\langle u, x_i \rangle)$.

The data: $x \in \mathbb{R}^n$; $\|x_i\| = 1$; $S = \{x_i\}$. Geometric margin of $S$ w.r.t. $u$: $g = \min_{x \in S} |\langle u, x \rangle|$; equivalently, $\operatorname{sgn}(\langle u, x_i \rangle)\,\langle u, x_i \rangle \geq g$ for all $i$. Note: $g$ is a function of $S$.

Want to find c.

The algorithm

$u_0 = 0$. Additive update rule: if a mistake is made on $x_t$, set $u_{t+1} = u_t + \operatorname{sgn}(\langle u, x_t \rangle)\, x_t$, i.e. add the example signed by its true label: the hyperplane orientation is tilted towards correcting the mistake.
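A minimal numpy sketch of this update rule, assuming labels in $\{\pm 1\}$ supply $\operatorname{sgn}(\langle u, x \rangle)$ and that training simply cycles over $S$ until no mistakes remain; the function name and the pass limit are illustrative:

```python
import numpy as np

def perceptron(X, y, max_passes=100):
    """Perceptron with additive updates: on a mistake, add the example times its label.

    X: (N, d) array with each row normalized to unit length; y: labels in {-1, +1}.
    Returns the weight vector and the total number of mistakes made.
    """
    u = np.zeros(X.shape[1])
    mistakes = 0
    for _ in range(max_passes):
        clean_pass = True
        for x, label in zip(X, y):
            if np.sign(u @ x) != label:      # mistake (sign(0) also counts as a mistake)
                u = u + label * x            # additive update: tilt u towards correcting it
                mistakes += 1
                clean_pass = False
        if clean_pass:
            break
    return u, mistakes

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
X = X / np.linalg.norm(X, axis=1, keepdims=True)   # enforce ||x_i|| = 1 as in the notes
y = np.array([1, 1, -1, -1])
print(perceptron(X, y))
```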

Convergence to $u$

$mb = O(g^{-2})$. In general, if $\|x_i\| \leq R$: $mb = O((R/g)^2)$.

By induction, using the update rule expression for $u_t$: $\|u_t\| = \|u_t\|\,\|u\| \geq \langle u, u_t \rangle \geq t g$. So, if the length of $u_t$ is not increasing too fast, maybe the perceptron is getting closer to $u$ as more mistakes are made.

Also, by induction, using the update rule and the fact that $u_{t-1}$ misclassified $x_{t-1}$ (which caused the update): $\|u_t\|^2 \leq t$.

So, $(tg)^2 \leq \|u_t\|^2 \leq t$; and $t \leq g^{-2}$.
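Putting the two inductions together (a restatement of the argument above; the cross term in the second line is non-positive precisely because $u_t$ misclassified $x_t$, and $\|u\| = \|x_t\| = 1$):

```latex
\begin{align*}
\langle u, u_{t+1}\rangle &= \langle u, u_t\rangle + \operatorname{sgn}(\langle u, x_t\rangle)\,\langle u, x_t\rangle
  \;\ge\; \langle u, u_t\rangle + g
  \quad\Rightarrow\quad \langle u, u_t\rangle \ge t g,\\
\|u_{t+1}\|^2 &= \|u_t\|^2 + 2\operatorname{sgn}(\langle u, x_t\rangle)\,\langle u_t, x_t\rangle + \|x_t\|^2
  \;\le\; \|u_t\|^2 + 1
  \quad\Rightarrow\quad \|u_t\|^2 \le t,\\
(t g)^2 &\le \langle u, u_t\rangle^2 \le \|u_t\|^2 \le t
  \quad\Rightarrow\quad t \le g^{-2}.
\end{align*}
```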

Comparison

The perceptron algorithm is usually faster than LP. The mistake bound $g^{-2}$ is exponential when the margin $g$ is exponentially small: this is rare.

For a given $g$, we can find a good enough halfspace with $mb = O(((R+D)/g)^2)$. \chk Perhaps the Winnow algorithm, which uses a multiplicative update, is more efficient.

Has connection to halfspace learning with noise. \why

Assumes that the data is perfectly separable. So, it is often less preferable than soft margin SVMs. But a soft margin variant of the perceptron algorithm is known.

With panel of experts algorithm

Compare the perceptron algorithm $B$ with the algorithm $A$ described in the learning theory survey, which, for any given sequence of inputs, uses a panel of experts to achieve a mistake bound comparable with that of the best expert. In the case of halfspaces, every input bit $x_i$ can be viewed as an ‘expert’.

Upon making a mistake, $A$ updates the weights of only the experts which made a mistake, whereas $B$ updates the weight assigned to every expert.

The weights used in $A$ were all positive, whereas the weights used in $B$ can be negative: but this distinction is minor, as it can perhaps be accounted for in the ‘experts algorithm’ by introducing experts corresponding to $-x_i$.

Maximum margin classifier

The problem

A discriminative, parametric approach. Number of examples: $N$. Samples: $(x_1, \dots, x_N)$, label function $c: X \to \{\pm 1\}$.

Suppose $y(x) = w^T \phi(x) + w_0$ with $y(x)\,c(x) > 0 \;\forall x$, for some $w, w_0, \phi$. So we are finding a separating hyperplane (see halfspaces in the boolean function ref) in some feature space.

Hard margin

Primal

To maximize the margin, solve: $\max_{w, w_0}\left[\frac{\min_n [\,y(x_n)\,c(x_n)\,]}{\|w\|}\right]$. Scale $w, w_0$ so that $\min_n[\,y(x_n)\,c(x_n)\,] = 1$; thence get the problem $\min_{w, w_0} \frac{\|w\|^2}{2}$ subject to $y(x_n)\,c(x_n) \geq 1$. Prediction: $\operatorname{sgn}(y(x))$.

Can solve using Quadratic programming (QP).
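A minimal sketch of the primal QP, assuming the cvxpy package and the identity feature map $\phi(x) = x$; the function name and the toy data are assumptions made here, not prescribed by the notes:

```python
import numpy as np
import cvxpy as cp

def hard_margin_svm(X, y):
    """Solve min ||w||^2 / 2 subject to c(x_n) * (w.x_n + w0) >= 1."""
    N, d = X.shape
    w = cp.Variable(d)
    w0 = cp.Variable()
    constraints = [cp.multiply(y, X @ w + w0) >= 1]
    prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
    prob.solve()
    return w.value, w0.value

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, w0 = hard_margin_svm(X, y)
print(np.sign(X @ w + w0))   # prediction: sgn(y(x))
```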

Dual

Get the Lagrangian $L(w, w_0, a) = \frac{\|w\|^2}{2} + \sum_n a_n\,[\,1 - (w^T\phi(x_n) + w_0)\,c(x_n)\,]$; $a_n \geq 0$. Get the dual: $g(a) = \inf_{w, w_0} L(w, w_0, a)$. Set $\nabla_{w, w_0} L(w, w_0, a) = 0$: $w = \sum_n a_n c(x_n)\phi(x_n)$; $0 = \sum_n a_n c(x_n)$. So, the dual problem: $\max_a g(a) = \max_a \sum_n a_n - \frac{1}{2}\sum_{n,m} a_n a_m c(x_n) c(x_m) k(x_n, x_m)$, subject to $a_n \geq 0$; $\sum_n a_n c(x_n) = 0$; here $k(x_n, x_m) = \phi(x_n)^T\phi(x_m)$.

Can solve using QP too. This form is useful where $\dim(\phi(x)) \gg N$.
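A minimal sketch of the dual QP, again assuming cvxpy; the Gram matrix $K_{nm} = k(x_n, x_m)$ is taken as given, and the small diagonal ridge is a numerical convenience added here, not part of the formulation:

```python
import numpy as np
import cvxpy as cp

def svm_dual(K, y):
    """Maximize sum(a) - 0.5 * a^T P a with P_nm = c(x_n) c(x_m) k(x_n, x_m),
    subject to a >= 0 and sum_n a_n c(x_n) = 0."""
    N = K.shape[0]
    P = np.outer(y, y) * K + 1e-8 * np.eye(N)   # PSD by construction; ridge for numerics
    a = cp.Variable(N)
    objective = cp.Maximize(cp.sum(a) - 0.5 * cp.quad_form(a, P))
    constraints = [a >= 0, y @ a == 0]
    cp.Problem(objective, constraints).solve()
    return a.value

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
a = svm_dual(X @ X.T, y)                        # linear kernel k(x, x') = x.x'
print(np.round(a, 3))                           # nonzero entries mark the support vectors
```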

KKT conditions

Primal feasible: $y(x_n)\,c(x_n) \geq 1$. Dual feasible: $a_n \geq 0$; $\sum_n a_n c(x_n) = 0$. Complementary slackness: $a_n\,[\,1 - c(x_n)\,y(x_n)\,] = 0$.

So, $\forall n$: either $a_n = 0$ or $c(x_n)\,y(x_n) = 1$: in the latter case, $x_n$ is a support vector. Hence aka Support Vector Machine (SVM). Take $N_S$: the number of support vectors.

Predictor

Substituting for $w$, we get: $y(x) = \sum_n a_n c(x_n) k(x_n, x) + w_0$: only $N_S$ terms, those with $a_n \neq 0$, actually appear: so the SVM is fast to evaluate. Using $c(x_m)\,y(x_m) = 1$ for support vectors $x_m$ and averaging over them: $w_0 = \frac{\sum_m \left[\,c(x_m) - \sum_n a_n c(x_n) k(x_n, x_m)\,\right]}{N_S}$.
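A small numpy sketch of this predictor, assuming the dual coefficients `a` have already been computed (e.g. by the dual QP sketch above) and that `kernel` is any kernel function supplied by the user; both names are illustrative:

```python
import numpy as np

def svm_predictor(a, X, y, kernel, tol=1e-6):
    """Build y(x) = sum_n a_n c(x_n) k(x_n, x) + w_0 from dual coefficients a."""
    support = a > tol                                   # only support vectors contribute
    Xs, ys, a_s = X[support], y[support], a[support]

    def decision(x):
        return np.sum(a_s * ys * np.array([kernel(xn, x) for xn in Xs]))

    # w0: average of c(x_m) - sum_n a_n c(x_n) k(x_n, x_m) over support vectors x_m.
    w0 = np.mean([ym - decision(xm) for xm, ym in zip(Xs, ys)])
    return lambda x: np.sign(decision(x) + w0)

linear = lambda u, v: u @ v
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
a = np.array([0.0625, 0.0, 0.0625, 0.0])                # dual coefficients for this toy set (hand-computed)
predict = svm_predictor(a, X, y, linear)
print(predict(np.array([1.0, 1.5])), predict(np.array([-1.0, -1.5])))
```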

Soft margins

Allow some points to be misclassified or to fall within the margin, but linearly penalize such violations.

Primal

So, use slack variables $\xi_n \geq 0$; instead of $y(x_n)\,c(x_n) \geq 1$, use the constraint $y(x_n)\,c(x_n) + \xi_n \geq 1$.

$\min\; C\sum_{n=1}^N \xi_n + \frac{\|w\|^2}{2}$: $C$ is the tradeoff between the penalty on $\xi$ and the margin; it acts like a multiplier for a constraint of the form $\sum_i \xi_i \leq G$, and so controls model complexity. As $C \to \infty$, we get the hard margin SVM.
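A minimal sketch of the soft margin primal, again assuming cvxpy and $\phi(x) = x$; here the slack variables are eliminated by writing $\xi_n = \max(0, 1 - y(x_n)c(x_n))$, an equivalent hinge-loss form rather than the literal constraint set above:

```python
import numpy as np
import cvxpy as cp

def soft_margin_svm(X, y, C=1.0):
    """Solve min C * sum_n max(0, 1 - c(x_n) y(x_n)) + ||w||^2 / 2."""
    N, d = X.shape
    w = cp.Variable(d)
    w0 = cp.Variable()
    slack = cp.pos(1 - cp.multiply(y, X @ w + w0))    # xi_n, eliminated via the hinge
    prob = cp.Problem(cp.Minimize(C * cp.sum(slack) + 0.5 * cp.sum_squares(w)))
    prob.solve()
    return w.value, w0.value

# One mislabeled point; with a finite C it is absorbed by its slack variable.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0], [2.5, 2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
print(soft_margin_svm(X, y, C=0.1))
```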

Dual

Lagrangian: $L(w, w_0, \xi, a, \mu) = \frac{\|w\|^2}{2} + C\sum_{n=1}^N \xi_n + \sum_n a_n\,(1 - c(x_n)\,y(x_n) - \xi_n) - \sum_n \mu_n \xi_n$: $a \geq 0$, $\mu \geq 0$. Set $\nabla_{w, w_0, \xi} L = 0$: $w = \sum_n a_n c(x_n)\phi(x_n)$, $\sum_n a_n c(x_n) = 0$, $a_n = C - \mu_n$. Thence get the dual $g(a)$, with the same objective function as the hard-margin case but with constraints $0 \leq a_n \leq C$ (as $\mu_n \geq 0$); $\sum_n a_n c(x_n) = 0$.
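In the dual QP sketch above (the hypothetical `svm_dual`), the soft margin therefore only changes the constraint set:

```python
constraints = [a >= 0, a <= C, y @ a == 0]   # box constraint 0 <= a_n <= C replaces a_n >= 0
```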

KKT conditions

Primal feasible: $1 - c(x_n)\,y(x_n) - \xi_n \leq 0$, $\xi_n \geq 0$. Dual feasible: $a_n \geq 0$, $\mu_n \geq 0$. Complementary slackness: $a_n\,(1 - c(x_n)\,y(x_n) - \xi_n) = 0$, $\mu_n \xi_n = 0$.

So, the support vectors are now the points on or within a certain margin of the hyperplane. The predictor is the same as in the hard margin case.