Goals and formulations are presented elsewhere.
Binary classification
See the many hypothesis classes / parametrized model families in the computational learning theory ref and the boolean functions ref.
Non-parametric methods
k nearest neighbors
A discriminative, non-parametric approach. Number of examples: N. Samples: $(x_i, y_i)$ for $i = 1, \dots, N$, with labels $y_i \in \{-1, +1\}$.
To classify a query point $x$: find the $k$ training points nearest to $x$ (under some distance) and predict the majority label among them. The model is the training set itself, so one can't ever throw away data points.
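A minimal sketch of this (the function name and the choice of Euclidean distance are illustrative, not fixed by anything above):

import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify query x by majority vote among its k nearest stored examples.
    X_train: (N, d) array; y_train: labels in {-1, +1}; x: (d,) query point."""
    dists = np.linalg.norm(X_train - x, axis=1)     # distance from x to every stored example
    nearest = np.argsort(dists)[:k]                 # indices of the k closest examples
    return 1 if y_train[nearest].sum() > 0 else -1  # majority vote (odd k avoids ties)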
Linear models for discrete classification
Linear separability of a data set in feature space: decision surfaces are $(p-1)$-dimensional hyperplanes in a $p$-dimensional feature space.
Decision surfaces are level sets $h(x, v) = \text{constant}$ of a linear score function $h(x, v) = v^{T}x + v_0$.
For binary classification, this becomes a halfspace classifier $\mathrm{sign}(v^{T}x + v_0)$: see the boolean functions ref and the computational learning theory ref.
For geometric properties of the separating hyperplane, see the boolean functions ref.
Arbitrary separator from a fully separable training set
When the training set is fully separable, one can pick one of the many separating hyperplanes easily, for example using linear programming. For other such algorithms, see the computational learning theory ref.
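A minimal sketch of the LP route, assuming scipy is available; normalizing the separation as $y_i(w^{T}x_i + b) \ge 1$ is just one convenient way to encode a strict separator as a linear feasibility problem:

import numpy as np
from scipy.optimize import linprog

def lp_separator(X, y):
    """Find any (w, b) with y_i (w . x_i + b) >= 1 for all i, as an LP
    feasibility problem (zero objective).  X: (N, d); y: labels in {-1, +1}."""
    N, d = X.shape
    # Variables z = (w_1, ..., w_d, b); constraints -y_i (x_i . w + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((N, 1))])
    b_ub = -np.ones(N)
    c = np.zeros(d + 1)                               # any feasible point is fine
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (d + 1))
    if not res.success:
        raise ValueError("training set is not linearly separable")
    return res.x[:d], res.x[d]                        # w, b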
To select the best among the candidate hyperplanes, one can use some sort of regularization, as in maximum margin classifiers.
Winnow: multiplicative update
Let the weight of feature $i$ be $w_i$, initialized to 1. Predict $+1$ if $w^{T}x \ge \theta$ (a threshold, e.g. $\theta = n$ for $n$ boolean features), else $-1$. On a mistake on a positive example, multiply the weight of every active feature by some $\alpha > 1$; on a mistake on a negative example, divide those weights by $\alpha$: a multiplicative update.
This is very similar to the ‘panel of experts’ algorithm; see also the note in the perceptron algorithm section comparing the two.
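A sketch with the common choices $\alpha = 2$ and $\theta = n$ (these constants, and running a fixed number of passes, are my assumptions):

import numpy as np

def winnow(X, y, alpha=2.0, epochs=10):
    """Winnow for boolean features.  X: (N, n) 0/1 matrix; y: labels in {-1, +1}.
    Predict +1 iff w . x >= theta; on a mistake, multiplicatively update the
    weights of the active features."""
    N, n = X.shape
    w, theta = np.ones(n), float(n)
    for _ in range(epochs):
        for x, label in zip(X, y):
            pred = 1 if w @ x >= theta else -1
            if pred != label:
                if label == 1:
                    w[x == 1] *= alpha      # promotion: missed a positive example
                else:
                    w[x == 1] /= alpha      # demotion: false positive
    return w, theta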
Perceptron learning algorithm for halfspaces
The problem
The classifier: a hyperplane through the origin, $x \mapsto \mathrm{sign}(u \cdot x)$ for a unit vector $u$, that perfectly classifies the labeled data.
The data: examples $(x_i, y_i)$ with $y_i \in \{-1, +1\}$ and $\|x_i\| \le 1$, separated by $u$ with margin $\gamma = \min_i y_i (u \cdot x_i) > 0$.
Want to find a vector $w$ with $y_i (w \cdot x_i) > 0$ for all $i$.
The algorithm
Start with $w = 0$; while some example $(x_i, y_i)$ is misclassified, i.e. $y_i (w \cdot x_i) \le 0$, update $w \leftarrow w + y_i x_i$.
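A sketch of the update loop (the stopping budget is my own addition):

import numpy as np

def perceptron(X, y, max_updates=100000):
    """Perceptron for halfspaces through the origin.  X: (N, d); y in {-1, +1}.
    Returns w with y_i (w . x_i) > 0 for all i, if found within max_updates."""
    w = np.zeros(X.shape[1])
    for _ in range(max_updates):
        mistake = next(((xi, yi) for xi, yi in zip(X, y) if yi * (w @ xi) <= 0), None)
        if mistake is None:
            return w                      # every example is classified correctly
        xi, yi = mistake
        w = w + yi * xi                   # additive update toward the misclassified point
    raise RuntimeError("no separator found within the update budget")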
Convergence to u
By induction, using the update rule, $u \cdot w \ge t\gamma$ after $t$ updates, since each update adds $y_i (u \cdot x_i) \ge \gamma$ to $u \cdot w$.
Also, by induction, the update rule and the fact that $\|x_i\| \le 1$ (and that updates happen only when $y_i (w \cdot x_i) \le 0$) give $\|w\|^{2} \le t$.
So, $t\gamma \le u \cdot w \le \|w\| \le \sqrt{t}$, hence $t \le 1/\gamma^{2}$: the algorithm makes at most $1/\gamma^{2}$ updates before finding a separator.
Comparison
The perceptron algorithm is usually faster than LP. Its running time is exponential when the margin $\gamma$ is exponentially small, by the $1/\gamma^{2}$ mistake bound.
For a given
Has connection to halfspace learning with noise. \why
Assumes that the data is perfectly separable, so it is often less preferable than soft-margin SVMs; but a soft-margin variant of the perceptron algorithm is known.
With the panel of experts algorithm
Compare the perceptron algorithm with the panel of experts algorithm. Upon making a mistake, the perceptron updates its weight vector additively (adding $y_i x_i$), whereas the panel of experts algorithm updates its weights multiplicatively, as in winnow. The weights used in the perceptron may be negative, whereas the panel of experts algorithm maintains non-negative weights over the experts.
Maximum margin classifier
The problem
A discriminative, parametric approach. Number of examples: N. Samples: $(x_i, y_i)$ for $i = 1, \dots, N$, with labels $y_i \in \{-1, +1\}$.
Suppose the classifier is a hyperplane $w^{T}x + b = 0$; its margin is the distance from the hyperplane to the closest training point, $\min_i y_i(w^{T}x_i + b)/\|w\|$. Want the separating hyperplane that maximizes this margin.
Hard margin
Primal
To maximize the margin, fix the scale so that $\min_i y_i(w^{T}x_i + b) = 1$ and solve: $\min_{w, b}\ \tfrac{1}{2}\|w\|^{2}$ subject to $y_i(w^{T}x_i + b) \ge 1$ for all $i$.
Can solve using Quadratic programming (QP).
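A minimal sketch of the primal QP using cvxpy (the library choice and function name are mine; any QP solver works):

import cvxpy as cp
import numpy as np

def hard_margin_svm(X, y):
    """Solve min 0.5 ||w||^2  s.t.  y_i (w . x_i + b) >= 1.
    X: (N, d) array; y: (N,) array with entries in {-1, +1}."""
    N, d = X.shape
    w, b = cp.Variable(d), cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) >= 1]   # margin constraints, one per example
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints).solve()
    return w.value, b.value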
Dual
Get the Lagrangian $L(w, b, a) = \tfrac{1}{2}\|w\|^{2} - \sum_i a_i [y_i(w^{T}x_i + b) - 1]$ with multipliers $a_i \ge 0$. Setting the derivatives w.r.t. $w$ and $b$ to zero gives $w = \sum_i a_i y_i x_i$ and $\sum_i a_i y_i = 0$; substituting back gives the dual: $\max_a \sum_i a_i - \tfrac{1}{2}\sum_{i,j} a_i a_j y_i y_j x_i^{T}x_j$ subject to $a_i \ge 0$ and $\sum_i a_i y_i = 0$.
Can solve using QP too. This form is useful where the number of features is large relative to the number of examples, and because the data appear only through the inner products $x_i^{T}x_j$, it allows kernels.
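A sketch of the dual in the same style; writing the quadratic term as $\|\sum_i a_i y_i x_i\|^{2}$ keeps it in a form cvxpy accepts directly (kernelizing would replace the inner products with kernel evaluations):

import cvxpy as cp
import numpy as np

def hard_margin_svm_dual(X, y):
    """Maximize sum_i a_i - 0.5 ||sum_i a_i y_i x_i||^2  s.t.  a >= 0, sum_i a_i y_i = 0."""
    N, d = X.shape
    a = cp.Variable(N)
    quad = cp.sum_squares(X.T @ cp.multiply(a, y))   # equals sum_ij a_i a_j y_i y_j x_i.x_j
    cp.Problem(cp.Maximize(cp.sum(a) - 0.5 * quad),
               [a >= 0, cp.sum(cp.multiply(a, y)) == 0]).solve()
    w = X.T @ (a.value * y)                          # recover w from stationarity
    sv = np.argmax(a.value)                          # index of a support vector (a_i > 0)
    b = y[sv] - w @ X[sv]                            # the margin constraint is tight there
    return w, b, a.value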
KKT conditions
Primal feasible: $y_i(w^{T}x_i + b) - 1 \ge 0$. Dual feasible: $a_i \ge 0$. Complementary slackness: $a_i [y_i(w^{T}x_i + b) - 1] = 0$.
So, for each $i$, either $a_i = 0$ or $y_i(w^{T}x_i + b) = 1$: the points with $a_i > 0$ lie exactly on the margin; these are the support vectors.
Predictor
Substituting for $w$, get: $h(x) = \mathrm{sign}\big(\sum_i a_i y_i x_i^{T}x + b\big)$, where only the support vectors ($a_i > 0$) contribute to the sum, and $b$ can be recovered from any support vector, for which $y_j(w^{T}x_j + b) = 1$.
Soft margins
Allow some points to be misclassified or to fall inside the margin, but penalize such violations linearly.
Primal
So, use slack variables $\xi_i \ge 0$: minimize $\tfrac{1}{2}\|w\|^{2} + C\sum_i \xi_i$ subject to $y_i(w^{T}x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$.
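The cvxpy sketch from the hard-margin case extends directly; $C$ is the penalty weight on the slacks:

import cvxpy as cp
import numpy as np

def soft_margin_svm(X, y, C=1.0):
    """Solve min 0.5 ||w||^2 + C sum_i xi_i  s.t.  y_i (w . x_i + b) >= 1 - xi_i, xi_i >= 0."""
    N, d = X.shape
    w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(N)
    constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)), constraints).solve()
    return w.value, b.value, xi.value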
Dual
Lagrangian: $L = \tfrac{1}{2}\|w\|^{2} + C\sum_i \xi_i - \sum_i a_i[y_i(w^{T}x_i + b) - 1 + \xi_i] - \sum_i \mu_i \xi_i$ with $a_i, \mu_i \ge 0$. Stationarity gives $w = \sum_i a_i y_i x_i$, $\sum_i a_i y_i = 0$, and $a_i = C - \mu_i$; the dual is the same as in the hard-margin case but with the box constraint $0 \le a_i \le C$.
KKT conditions
Primal feasible: $y_i(w^{T}x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$. Dual feasible: $0 \le a_i \le C$. Complementary slackness: $a_i[y_i(w^{T}x_i + b) - 1 + \xi_i] = 0$ and $(C - a_i)\xi_i = 0$.
So, the support vectors are now the points on the margin or violating it: $a_i > 0$ forces $y_i(w^{T}x_i + b) = 1 - \xi_i$, and if additionally $a_i < C$ then $\xi_i = 0$, so the point lies exactly on the margin. The predictor is the same as in the hard-margin case.