Support estimation

Estimate support of a distribution D

Find a set $S$ such that $\Pr(x \notin S) < p \in (0,1]$, given a sample from $D$. Can be solved by probability density estimation techniques, but is actually a simpler problem.

Visualization: take the input space; draw solid ovals around sampled points; the algorithm will draw a dotted oval around these, which will represent the support of the distribution.

With soft margin kernel hyperplane

Aka One Class SVM or OSVM.

Given $N$ examples $\{x_i\}$; project to some feature space associated with the kernel $k(x,y) = \phi(x)^T \phi(x)$, i.e. $k(x,y) = \phi(x)^T \phi(y)$; want to find a hyperplane $w^T \phi(x) = \rho$ such that all points in the support fall on one side of the hyperplane and outliers fall on the other side: support identifier $f = \operatorname{sgn}(w^T \phi(x) - \rho)$. So, allowing a soft margin, want to solve $\max_{\rho, w} \rho/\|w\| - C \sum_i \xi_i$ such that $w^T \phi(x_i) + \xi_i \geq \rho$, $\xi_i \geq 0$; objective function: $\min_{w, \xi, \rho} \|w\|^2/2 + \frac{1}{\nu N} \sum_i \xi_i - \rho$, for some coefficient $0 < \nu \leq 1$.

Thence get the Lagrangian: \\ $L(w, \xi, \rho, \alpha, \beta) = \|w\|^2/2 + \frac{1}{\nu N} \sum_i \xi_i - \rho - \sum_i \alpha_i (w^T \phi(x_i) + \xi_i - \rho) - \sum_i \beta_i \xi_i$ with $\alpha, \beta \geq 0$.

Set derivatives wrt the primal variables $w, \xi, \rho$ to 0 to get: $w = \sum_i \alpha_i \phi(x_i)$; $\alpha_i = \frac{1}{\nu N} - \beta_i \leq \frac{1}{\nu N}$; $\sum_i \alpha_i = 1$. Thence, the support identifier becomes \\ $f = \operatorname{sgn}\left(\sum_i \alpha_i k(x_i, x) - \rho\right)$; the dual optimization problem becomes \\ $\max_\alpha -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j)$ subject to $0 \leq \alpha_i \leq (\nu N)^{-1}$, $\sum_i \alpha_i = 1$. Solving this gives $w$; then recover $\rho$ using $\rho = w^T \phi(x_i)$ for an $x_i$ with $\alpha_i \neq 0$ and $\beta_i \neq 0$ (a support vector with $\beta_i > 0$, i.e. one lying exactly on the hyperplane); such $x_i$ exist since $\sum_i \alpha_i = 1$ forces some $\alpha_i \neq 0$.
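A minimal sketch of this estimator using scikit-learn's OneClassSVM, which implements this $\nu$-parameterized soft margin hyperplane; the data and parameter values below are made up for illustration.

\begin{verbatim}
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))       # sample from the "normal" class only

# nu plays the role of the coefficient above: it upper-bounds the fraction
# of training points treated as outliers and lower-bounds the fraction of
# support vectors.
clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.1).fit(X)

X_test = np.array([[0.1, -0.2],     # near the bulk of the sample
                   [4.0, 4.0]])     # far from it
print(clf.predict(X_test))          # +1 = inside estimated support, -1 = outlier
# decision_function is sum_i alpha_i k(x_i, x) - rho, before taking the sign
print(clf.decision_function(X_test))
\end{verbatim}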

Choosing kernel, tuning parameters

$\nu$ controls the softness of the margin and the number of support vectors, thence the runtime and the sensitivity to the appearance of novelty.
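A small sketch (scikit-learn's OneClassSVM again, on synthetic data) of how $\nu$ trades the number of support vectors off against the fraction of the training sample flagged as outliers:

\begin{verbatim}
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))

for nu in (0.01, 0.1, 0.5, 0.9):
    clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=nu).fit(X)
    frac_out = np.mean(clf.predict(X) == -1)   # training points called outliers
    print(f"nu={nu:<4}  support vectors={len(clf.support_)}  "
          f"outlier fraction={frac_out:.2f}")
\end{verbatim}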

With a Gaussian kernel, any data set is separable from the origin, as everything is mapped into the same orthant of feature space.

\oprob How to decide the width of the Gaussian kernel to use? Can you use information about the abnormal class in choosing the kernel?
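Not an answer to the open problem, but a common starting heuristic is to set the width from the scale of the data itself, e.g. the median pairwise distance; a sketch (the helper name and the choice of the median are mine):

\begin{verbatim}
import numpy as np
from scipy.spatial.distance import pdist

def median_heuristic_gamma(X):
    """Heuristic: gamma = 1 / (2 sigma^2), sigma = median pairwise distance."""
    sigma = np.median(pdist(X))
    return 1.0 / (2.0 * sigma ** 2)
\end{verbatim}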

Comparison with thresholded Kernel Density estimator

If $\nu = 1$, then $\alpha_i = 1/N$ and the support identifier $f = \operatorname{sgn}\left(\sum_i \alpha_i k(x_i, x) - \rho\right)$ is the same as one obtained by thresholding a kernel (Parzen) density estimator. What happens when $\nu < 1$?
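A sketch of the thresholded Parzen-window detector being compared against, using scikit-learn's KernelDensity; setting the threshold so that a fraction $\nu$ of the training sample falls outside is an illustrative choice:

\begin{verbatim}
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))

# Parzen / kernel density estimate with a Gaussian window
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)
log_density = kde.score_samples(X)

# threshold chosen so a fraction nu of the training sample falls outside
nu = 0.1
threshold = np.quantile(log_density, nu)

def is_in_support(x_new):
    # analogue of f = sgn(sum_i alpha_i k(x_i, x) - rho) with alpha_i = 1/N
    return np.sign(kde.score_samples(np.atleast_2d(x_new)) - threshold)

print(is_in_support([0.0, 0.0]), is_in_support([5.0, 5.0]))
\end{verbatim}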

Comparison with using soft margin hyperspheres

For homogeneous kernels, $k(x,x)$ is a constant, and the dual minimization problem \\ $\min_\alpha \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) - \sum_i \alpha_i k(x_i, x_i)$ and the support identifier \\ $f = \operatorname{sgn}\left(R^2 - \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) + 2\sum_i \alpha_i k(x_i, x) - k(x,x)\right)$ are equivalent to the minimization problem derived from the hyperplane formulation. So, all mapped patterns lie on a sphere in feature space; finding the smallest sphere containing them is equivalent to finding the segment of the sphere containing the data points, which reduces to finding the separating hyperplane.

Connection to binary classification

Hyperplane $(w, \rho = 0)$ \\ separates $\{(x_i, +1)\}$ from $\{(-x_i, -1)\}$ with margin $\rho/\|w\|$, and vice versa.

Using soft margin hyperspheres

Aka Support vector data description. Here one solves: $\min_{R, \xi, c} R^2 + \frac{1}{\nu N} \sum_i \xi_i$ subject to $\|\phi(x_i) - c\|^2 \leq R^2 + \xi_i$, $\xi_i \geq 0$.

After using the Lagrangian, finding the critical points and substituting, this leads to the dual $\min_\alpha \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) - \sum_i \alpha_i k(x_i, x_i)$ subject to $0 \leq \alpha_i \leq \frac{1}{\nu N}$, $\sum_i \alpha_i = 1$, and the solution $c = \sum_i \alpha_i \phi(x_i)$, corresponding to the support identifier $f = \operatorname{sgn}\left(R^2 - \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) + 2\sum_i \alpha_i k(x_i, x) - k(x,x)\right)$ \chk.
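A sketch of solving this dual directly with a generic constrained optimizer (scipy's SLSQP); the RBF kernel, the value of $\nu$, and recovering $R^2$ from the boundary support vectors are illustrative choices, not part of the derivation above.

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics.pairwise import rbf_kernel

def svdd_fit(X, nu=0.1, gamma=0.5):
    """Solve min_a a'Ka - a'diag(K) s.t. 0 <= a_i <= 1/(nu N), sum_i a_i = 1."""
    N = len(X)
    K = rbf_kernel(X, X, gamma=gamma)
    C = 1.0 / (nu * N)
    obj = lambda a: a @ K @ a - a @ np.diag(K)
    grad = lambda a: 2 * K @ a - np.diag(K)
    cons = ({"type": "eq", "fun": lambda a: a.sum() - 1.0},)
    res = minimize(obj, np.full(N, 1.0 / N), jac=grad, method="SLSQP",
                   bounds=[(0.0, C)] * N, constraints=cons)
    alpha = res.x
    # ||phi(x_i) - c||^2 for every training point, with c = sum_i alpha_i phi(x_i)
    d2 = np.diag(K) - 2 * K @ alpha + alpha @ K @ alpha
    # R^2 from support vectors on the boundary (0 < alpha_i < 1/(nu N))
    on_boundary = (alpha > 1e-6) & (alpha < C - 1e-6)
    R2 = d2[on_boundary].mean() if on_boundary.any() else d2[alpha > 1e-6].max()
    return alpha, R2

def svdd_predict(X_train, alpha, R2, X_new, gamma=0.5):
    """f = sgn(R^2 - ||phi(x) - c||^2): +1 inside the sphere, -1 outside."""
    k_xx = 1.0                               # k(x, x) = 1 for the RBF kernel
    K_nt = rbf_kernel(np.atleast_2d(X_new), X_train, gamma=gamma)
    K_tt = rbf_kernel(X_train, X_train, gamma=gamma)
    d2 = k_xx - 2 * K_nt @ alpha + alpha @ K_tt @ alpha
    return np.sign(R2 - d2)
\end{verbatim}

Since $k(x,x)$ is constant for the RBF kernel, this should describe the same family of regions as the hyperplane formulation above.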

Using Clustering

Cluster the sample, draw boundaries around the clusters. E.g., use $k$-means clustering.
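One hedged way to turn this into a detector: fit $k$-means and call a point an outlier if it is farther from its nearest centroid than a per-cluster radius; the number of clusters and the quantile used for the radius below are arbitrary choices.

\begin{verbatim}
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(6, 1, (200, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# per-cluster radius: e.g. the 95th percentile of training distances to the centroid
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
radius = np.array([np.quantile(dist[km.labels_ == c], 0.95) for c in range(2)])

def is_outlier(x_new):
    d = np.linalg.norm(km.cluster_centers_ - x_new, axis=1)
    nearest = np.argmin(d)
    return d[nearest] > radius[nearest]

print(is_outlier(np.array([0.0, 0.0])), is_outlier(np.array([3.0, 3.0])))
\end{verbatim}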

Novelty detection

Problem

Aka Outlier detection.

In general, we want to find outliers: data points that are unlikely according to the conditional distributions $f_{X_r \mid X_{\neg r}}$.

As One class classification

View this as a problem where there are multiple classes, but all training examples are from one class only.

Motivation

Outliers are detected either to focus attention on them or to remove them from consideration.

Using density estimation

Do density estimation; call apparently improbable data points novel.

Using support of the distribution

Find the support of the distribution; call anything outside the support an outlier.

Ransack

One learns a model $M$ (either $f_{X_r \mid X_{\neg r}}$ or its expectation $E[X_r \mid X_{\neg r}]$) using the data set $S$.

Then, one finds $S' \subseteq S$ for which $\mathrm{err}(M; x) > t \ \forall x \in S'$.

$S'$ is then added to the set of outliers.

Finally, one repeats the entire procedure till the set of outliers is stable.
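A sketch of this loop, with an ordinary least squares model standing in for $M$ and a fixed error threshold $t$ (both illustrative choices):

\begin{verbatim}
import numpy as np
from sklearn.linear_model import LinearRegression

def iterative_outlier_removal(X, y, t=2.0, max_iter=20):
    """Repeatedly fit, flag points with error > t, and refit on the rest."""
    outliers = np.zeros(len(X), dtype=bool)
    for _ in range(max_iter):
        model = LinearRegression().fit(X[~outliers], y[~outliers])
        err = np.abs(model.predict(X) - y)           # err(M; x) for every point
        new_outliers = err > t
        if np.array_equal(new_outliers, outliers):   # outlier set is stable
            break
        outliers = new_outliers
    return model, outliers
\end{verbatim}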

Boundary methods

K nearest neighbors

Estimate the local density of $x$ by taking the average distance to its $k$ nearest neighbors (larger average distance means lower density); similarly estimate the local density of each neighbor; call $x$ novel if its local density is much smaller than that of its neighbors.
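A sketch of this in the spirit of the Local Outlier Factor, reduced to a crude ratio of average $k$-NN distances; the score function and its parameters are mine:

\begin{verbatim}
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_novelty_score(X, x_new, k=5):
    """Ratio of x_new's average k-NN distance to that of its neighbours."""
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    # average distance from x_new to its k nearest neighbours (inverse-density proxy)
    dist_new, idx_new = nn.kneighbors(np.atleast_2d(x_new))
    d_new = dist_new.mean(axis=1)
    # same proxy for each of those neighbours (drop column 0 = the point itself)
    dist_nb, _ = nn.kneighbors(X[idx_new[0]], n_neighbors=k + 1)
    d_nb = dist_nb[:, 1:].mean()
    return float(d_new[0] / d_nb)   # >> 1 means x_new sits in a much sparser region

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 2))
print(knn_novelty_score(X, [0.0, 0.0]))   # close to 1
print(knn_novelty_score(X, [6.0, 6.0]))   # much larger than 1
\end{verbatim}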

Support vector data description

\tbc See the soft margin hypersphere formulation above.

PCA

Simplify the data using PCA. \tbc
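A hedged sketch of one common way to use PCA here: flag points whose reconstruction error from the leading principal components is unusually large; the number of components and the threshold quantile below are arbitrary choices.

\begin{verbatim}
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
# data lying near a 5-dimensional subspace of R^10, plus a little noise
X = (rng.normal(size=(300, 5)) @ rng.normal(size=(5, 10))
     + 0.1 * rng.normal(size=(300, 10)))

pca = PCA(n_components=5).fit(X)
recon = pca.inverse_transform(pca.transform(X))
train_err = np.linalg.norm(X - recon, axis=1)
threshold = np.quantile(train_err, 0.95)   # keep 95% of training points "normal"

def is_novel(x_new):
    x_new = np.atleast_2d(x_new)
    r = pca.inverse_transform(pca.transform(x_new))
    return np.linalg.norm(x_new - r, axis=1) > threshold

print(is_novel(X[0]), is_novel(np.full(10, 10.0)))   # second point usually flagged
\end{verbatim}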