Support estimation

Estimate support of a distribution D

Find a set $S$ such that $\Pr(x \notin S) < p \in (0,1]$, given a sample from $D$. Can be solved by probability density estimation techniques, but is actually a simpler problem.

Visualization: take the input space; draw solid ovals around sampled points; the algorithm will draw a dotted oval around these, which will represent the support of the distribution.

With soft margin kernel hyperplane

Aka One Class SVM or OSVM.

Given $N$ examples $\{x_i\}$; project to some feature space associated with the kernel $k(x,y) = \phi(x)^T \phi(x)$, i.e. $k(x,y) = \phi(x)^T \phi(y)$; want to find a hyperplane $w^T \phi(x) = \rho$ such that all points in the support fall on one side of the hyperplane and outliers fall on the other side: support identifier $f = \operatorname{sgn}(w^T \phi(x) - \rho)$. So, allowing a soft margin, want to solve $\max_{\rho, w} \rho/\|w\| - C \sum_i \xi_i$ such that $w^T \phi(x_i) + \xi_i \geq \rho$, $\xi_i \geq 0$; objective function: $\min_{w, \xi, \rho} \|w\|^2/2 + \frac{1}{\nu N} \sum_i \xi_i - \rho$, for some coefficient $0 < \nu \leq 1$.

Thence get the Lagrangian: \\ $L(w, \xi, \rho, \alpha, \beta) = \|w\|^2/2 + \frac{1}{\nu N} \sum_i \xi_i - \rho - \sum_i \alpha_i (w^T \phi(x_i) + \xi_i - \rho) - \sum_i \beta_i \xi_i$ with $\alpha, \beta \geq 0$.

Set derivatives wrt the primal variables $w, \xi, \rho$ to 0 to get: $w = \sum_i \alpha_i \phi(x_i)$; $\alpha_i = \frac{1}{\nu N} - \beta_i \leq \frac{1}{\nu N}$; $\sum_i \alpha_i = 1$. Thence, the support identifier becomes \\ $f = \operatorname{sgn}\left(\sum_i \alpha_i k(x_i, x) - \rho\right)$; the dual optimization problem becomes \\ $\max_\alpha -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j)$ subject to $0 \leq \alpha_i \leq (\nu N)^{-1}$, $\sum_i \alpha_i = 1$. Solving this gives $w$; then recover $\rho$ using $\rho = w^T \phi(x_i)$ for an $x_i$ with $\alpha_i \neq 0$ and $\beta_i \neq 0$ (a support vector with $\beta_i > 0$, i.e. one lying exactly on the hyperplane); such $x_i$ exist since $\sum_i \alpha_i = 1$ forces some $\alpha_i \neq 0$.
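A minimal sketch of this estimator using scikit-learn's OneClassSVM, which implements this $\nu$-parameterized soft margin hyperplane; the data and parameter values below are made up for illustration.

\begin{verbatim}
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))       # sample from the "normal" class only

# nu plays the role of the coefficient above: it upper-bounds the fraction
# of training points treated as outliers and lower-bounds the fraction of
# support vectors.
clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.1).fit(X)

X_test = np.array([[0.1, -0.2],     # near the bulk of the sample
                   [4.0, 4.0]])     # far from it
print(clf.predict(X_test))          # +1 = inside estimated support, -1 = outlier
# decision_function is sum_i alpha_i k(x_i, x) - rho, before taking the sign
print(clf.decision_function(X_test))
\end{verbatim}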

Choosing kernel, tuning parameters

$\nu$ controls the softness of the margin and the number of support vectors, thence the runtime and the sensitivity to the appearance of novelty.
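A small sketch (scikit-learn's OneClassSVM again, on synthetic data) of how $\nu$ trades the number of support vectors off against the fraction of the training sample flagged as outliers:

\begin{verbatim}
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))

for nu in (0.01, 0.1, 0.5, 0.9):
    clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=nu).fit(X)
    frac_out = np.mean(clf.predict(X) == -1)   # training points called outliers
    print(f"nu={nu:<4}  support vectors={len(clf.support_)}  "
          f"outlier fraction={frac_out:.2f}")
\end{verbatim}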

With a Gaussian kernel, any data set is separable from the origin, as everything is mapped into the same orthant of feature space.

\oprob How to decide the width of the Gaussian kernel to use? Can you use information about the abnormal class in choosing the kernel?
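Not an answer to the open problem, but a common starting heuristic is to set the width from the scale of the data itself, e.g. the median pairwise distance; a sketch (the helper name and the choice of the median are mine):

\begin{verbatim}
import numpy as np
from scipy.spatial.distance import pdist

def median_heuristic_gamma(X):
    """Heuristic: gamma = 1 / (2 sigma^2), sigma = median pairwise distance."""
    sigma = np.median(pdist(X))
    return 1.0 / (2.0 * sigma ** 2)
\end{verbatim}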

Comparison with thresholded Kernel Density estimator

If $\nu = 1$, then $\alpha_i = 1/N$ and the support identifier $f = \operatorname{sgn}\left(\sum_i \alpha_i k(x_i, x) - \rho\right)$ is the same as one obtained by thresholding a kernel (Parzen) density estimator. What happens when $\nu < 1$?
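A sketch of the thresholded Parzen-window detector being compared against, using scikit-learn's KernelDensity; setting the threshold so that a fraction $\nu$ of the training sample falls outside is an illustrative choice:

\begin{verbatim}
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))

# Parzen / kernel density estimate with a Gaussian window
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)
log_density = kde.score_samples(X)

# threshold chosen so a fraction nu of the training sample falls outside
nu = 0.1
threshold = np.quantile(log_density, nu)

def is_in_support(x_new):
    # analogue of f = sgn(sum_i alpha_i k(x_i, x) - rho) with alpha_i = 1/N
    return np.sign(kde.score_samples(np.atleast_2d(x_new)) - threshold)

print(is_in_support([0.0, 0.0]), is_in_support([5.0, 5.0]))
\end{verbatim}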

Comparison with using soft margin hyperspheres

For homogeneous kernels, $k(x,x)$ is a constant, and the dual minimization problem \\ $\min_\alpha \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) - \sum_i \alpha_i k(x_i, x_i)$ and the support identifier \\ $f = \operatorname{sgn}\left(R^2 - \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) + 2\sum_i \alpha_i k(x_i, x) - k(x,x)\right)$ are equivalent to the minimization problem derived from the hyperplane formulation. So, all mapped patterns lie on a sphere in feature space; finding the smallest sphere containing them is equivalent to finding the segment of the sphere containing the data points, which reduces to finding the separating hyperplane.

Connection to binary classification

Hyperplane $(w, \rho = 0)$ \\ separates $\{(x_i, +1)\}$ from $\{(-x_i, -1)\}$ with margin $\rho/\|w\|$, and vice versa.

Using soft margin hyperspheres

Aka Support vector data description. Here one solves: $\min_{R, \xi, c} R^2 + \frac{1}{\nu N} \sum_i \xi_i$ subject to $\|\phi(x_i) - c\|^2 \leq R^2 + \xi_i$, $\xi_i \geq 0$.

After using the Lagrangian, finding the critical points and substituting, this leads to the dual $\min_\alpha \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) - \sum_i \alpha_i k(x_i, x_i)$ subject to $0 \leq \alpha_i \leq \frac{1}{\nu N}$, $\sum_i \alpha_i = 1$, and the solution $c = \sum_i \alpha_i \phi(x_i)$, corresponding to the support identifier $f = \operatorname{sgn}\left(R^2 - \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) + 2\sum_i \alpha_i k(x_i, x) - k(x,x)\right)$ \chk.
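A sketch of solving this dual directly with a generic constrained optimizer (scipy's SLSQP); the RBF kernel, the value of $\nu$, and recovering $R^2$ from the boundary support vectors are illustrative choices, not part of the derivation above.

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics.pairwise import rbf_kernel

def svdd_fit(X, nu=0.1, gamma=0.5):
    """Solve min_a a'Ka - a'diag(K) s.t. 0 <= a_i <= 1/(nu N), sum_i a_i = 1."""
    N = len(X)
    K = rbf_kernel(X, X, gamma=gamma)
    C = 1.0 / (nu * N)
    obj = lambda a: a @ K @ a - a @ np.diag(K)
    grad = lambda a: 2 * K @ a - np.diag(K)
    cons = ({"type": "eq", "fun": lambda a: a.sum() - 1.0},)
    res = minimize(obj, np.full(N, 1.0 / N), jac=grad, method="SLSQP",
                   bounds=[(0.0, C)] * N, constraints=cons)
    alpha = res.x
    # ||phi(x_i) - c||^2 for every training point, with c = sum_i alpha_i phi(x_i)
    d2 = np.diag(K) - 2 * K @ alpha + alpha @ K @ alpha
    # R^2 from support vectors on the boundary (0 < alpha_i < 1/(nu N))
    on_boundary = (alpha > 1e-6) & (alpha < C - 1e-6)
    R2 = d2[on_boundary].mean() if on_boundary.any() else d2[alpha > 1e-6].max()
    return alpha, R2

def svdd_predict(X_train, alpha, R2, X_new, gamma=0.5):
    """f = sgn(R^2 - ||phi(x) - c||^2): +1 inside the sphere, -1 outside."""
    k_xx = 1.0                               # k(x, x) = 1 for the RBF kernel
    K_nt = rbf_kernel(np.atleast_2d(X_new), X_train, gamma=gamma)
    K_tt = rbf_kernel(X_train, X_train, gamma=gamma)
    d2 = k_xx - 2 * K_nt @ alpha + alpha @ K_tt @ alpha
    return np.sign(R2 - d2)
\end{verbatim}

Since $k(x,x)$ is constant for the RBF kernel, this should describe the same family of regions as the hyperplane formulation above.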

Using Clustering

Cluster the sample, draw boundaries around the clusters. E.g., use $k$-means clustering.
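One hedged way to turn this into a detector: fit $k$-means and call a point an outlier if it is farther from its nearest centroid than a per-cluster radius; the number of clusters and the quantile used for the radius below are arbitrary choices.

\begin{verbatim}
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(6, 1, (200, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# per-cluster radius: e.g. the 95th percentile of training distances to the centroid
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
radius = np.array([np.quantile(dist[km.labels_ == c], 0.95) for c in range(2)])

def is_outlier(x_new):
    d = np.linalg.norm(km.cluster_centers_ - x_new, axis=1)
    nearest = np.argmin(d)
    return d[nearest] > radius[nearest]

print(is_outlier(np.array([0.0, 0.0])), is_outlier(np.array([3.0, 3.0])))
\end{verbatim}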

Novelty detection

Problem

Aka Outlier detection.

In general, we want to find outliers: data points that are unlikely according to the conditional distributions $f_{X_r \mid X_{\neg r}}$.

As One class classification

View this as a problem where there are multiple classes, but all training examples are from one class only.

Motivation

Outliers are detected either to focus attention on them or to remove them from consideration.

Using density estimation

Do density estimation; call apparently improbable data points novel.

Using support of the distribution

Find the support of the distribution; call anything outside the support an outlier.

Ransack

One learns a model $M$ (either $f_{X_r \mid X_{\neg r}}$ or its expectation $E[X_r \mid X_{\neg r}]$) using the data set $S$.

Then, one finds $S' \subseteq S$ for which $\mathrm{err}(M; x) > t \ \forall x \in S'$.

$S'$ is then added to the set of outliers.

Finally, one repeats the entire procedure till the set of outliers is stable.
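A sketch of this loop, with an ordinary least squares model standing in for $M$ and a fixed error threshold $t$ (both illustrative choices):

\begin{verbatim}
import numpy as np
from sklearn.linear_model import LinearRegression

def iterative_outlier_removal(X, y, t=2.0, max_iter=20):
    """Repeatedly fit, flag points with error > t, and refit on the rest."""
    outliers = np.zeros(len(X), dtype=bool)
    for _ in range(max_iter):
        model = LinearRegression().fit(X[~outliers], y[~outliers])
        err = np.abs(model.predict(X) - y)           # err(M; x) for every point
        new_outliers = err > t
        if np.array_equal(new_outliers, outliers):   # outlier set is stable
            break
        outliers = new_outliers
    return model, outliers
\end{verbatim}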

Boundary methods

K nearest neighbors

Estimate the local density of $x$ by taking the average distance to its $k$ nearest neighbors (larger average distance means lower density); similarly estimate the local density of each neighbor; call $x$ novel if its local density is much smaller than that of its neighbors.
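A sketch of this in the spirit of the Local Outlier Factor, reduced to a crude ratio of average $k$-NN distances; the score function and its parameters are mine:

\begin{verbatim}
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_novelty_score(X, x_new, k=5):
    """Ratio of x_new's average k-NN distance to that of its neighbours."""
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    # average distance from x_new to its k nearest neighbours (inverse-density proxy)
    dist_new, idx_new = nn.kneighbors(np.atleast_2d(x_new))
    d_new = dist_new.mean(axis=1)
    # same proxy for each of those neighbours (drop column 0 = the point itself)
    dist_nb, _ = nn.kneighbors(X[idx_new[0]], n_neighbors=k + 1)
    d_nb = dist_nb[:, 1:].mean()
    return float(d_new[0] / d_nb)   # >> 1 means x_new sits in a much sparser region

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 2))
print(knn_novelty_score(X, [0.0, 0.0]))   # close to 1
print(knn_novelty_score(X, [6.0, 6.0]))   # much larger than 1
\end{verbatim}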

Support vector data description

\tbc See the soft margin hypersphere formulation above.

PCA

Simplify the data using PCA. \tbc
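A hedged sketch of one common way to use PCA here: flag points whose reconstruction error from the leading principal components is unusually large; the number of components and the threshold quantile below are arbitrary choices.

\begin{verbatim}
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
# data lying near a 5-dimensional subspace of R^10, plus a little noise
X = (rng.normal(size=(300, 5)) @ rng.normal(size=(5, 10))
     + 0.1 * rng.normal(size=(300, 10)))

pca = PCA(n_components=5).fit(X)
recon = pca.inverse_transform(pca.transform(X))
train_err = np.linalg.norm(X - recon, axis=1)
threshold = np.quantile(train_err, 0.95)   # keep 95% of training points "normal"

def is_novel(x_new):
    x_new = np.atleast_2d(x_new)
    r = pca.inverse_transform(pca.transform(x_new))
    return np.linalg.norm(x_new - r, axis=1) > threshold

print(is_novel(X[0]), is_novel(np.full(10, 10.0)))   # second point usually flagged
\end{verbatim}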