Feature extraction
Use some \(\ftr\): maps the d-dim space X formed by the input vars \(\to\) an m-dim feature space; \(\ftr\) is a vector function. The basis of the feature space is formed by the ‘basis functions’ \((\ftr_{i})\). \tbc
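A minimal numpy sketch of such a \(\ftr\) (here with d = 1); the monomial basis functions are just one illustrative choice, not prescribed by these notes.
\begin{verbatim}
import numpy as np

# A feature map phi: R -> R^m built from m basis functions.
# Here the basis functions are the monomials phi_i(x) = x^i.
def phi(x, m=4):
    """Map a scalar input to the m-dim feature vector (1, x, x^2, ...)."""
    return np.array([x ** i for i in range(m)])

print(phi(2.0))   # [1. 2. 4. 8.]
\end{verbatim}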
Dimensionality reduction
Remove irrelevant/ less relevant features, merge duplicate features. Eg: synonymous words in documents.
Importance
Treating duplicate features as if they were different harms the ability to classify or cluster data points. Irrelevant features also harm predictive ability, by acting as noise.
Data visualization
Project high dimensional data to 2D or 3D space. Also see graph drawing/ embedding.
Extent of dimensionality reduction
The best dimension should match the inherent dimensionality of the data.
If some features are not very useful, there will be a steep drop in the performance gained by including them, which marks a natural cutoff dimension. Often there is no such clear drop, and the choice is rather arbitrary.
Using a Kernel function to implicitly map data to a feature space
See vector spaces ref for info about kernels.
The Kernel trick
Aka Kernel substitution.
Can reformulate the hypothesis model, perhaps a predictor, and any associated optimization objective such that the input vector \(x\) enters only in terms of inner products. Then can substitute this product with a kernel function \(k(x, x')\), thereby implicitly using some \(\ftr(x)\); can try out various kernels to achieve good performance. Can also use a +ve semi-definite \(d \times d\) kernel matrix \(K\) to define an inner product in the feature space.
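A minimal sketch of the substitution, assuming numpy; the quadratic kernel \(k(x, z) = (x \cdot z)^{2}\) is chosen only because its explicit \(\ftr\) is easy to write down.
\begin{verbatim}
import numpy as np

# Explicit feature map for the quadratic kernel k(x, z) = (x . z)^2 in 2-D:
# phi(x) = (x1^2, sqrt(2) x1 x2, x2^2), so phi(x) . phi(z) = (x . z)^2.
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, z):
    return (x @ z) ** 2              # kernel: never forms phi explicitly

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(phi(x) @ phi(z), k(x, z))      # both (approximately) equal 1.0
\end{verbatim}
Either side gives the same inner product, but the kernel side never forms the feature vectors.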
Use
A learning problem, e.g. classification or clustering, may be easier to solve in the feature space induced by some kernel. Eg: Can't find a linear separator for 2 concentric rings of points in the original space, but can do so in a high dimensional feature space.
Save space: If features are high dimensional, may not want to explicitly form feature vectors. Just use the kernel function.
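A sketch of the concentric-rings case via a kernel perceptron with a Gaussian kernel, assuming numpy; the radii, \(\stddev\) and the number of passes are arbitrary illustrative choices.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def ring(n, radius, noise=0.1):
    """Sample n points scattered around a circle of the given radius."""
    angles = rng.uniform(0, 2 * np.pi, n)
    r = radius + noise * rng.standard_normal(n)
    return np.column_stack([r * np.cos(angles), r * np.sin(angles)])

# Two concentric rings: not linearly separable in the original 2-D space.
X = np.vstack([ring(50, 1.0), ring(50, 3.0)])
y = np.concatenate([-np.ones(50), np.ones(50)])

def rbf(A, B, sigma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-||A_i - B_j||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

# Kernel perceptron: the data enter only through the Gram matrix K.
K = rbf(X, X)
alpha = np.zeros(len(X))
for _ in range(20):                          # a few passes over the data
    for j in range(len(X)):
        if y[j] * ((alpha * y) @ K[:, j]) <= 0:
            alpha[j] += 1

pred = np.sign((alpha * y) @ K)
print("training accuracy:", (pred == y).mean())   # expect (nearly) 1.0
\end{verbatim}
The rings are separable in the feature space implicitly defined by the Gaussian kernel, so the kernelized perceptron finds a separator without ever forming feature vectors.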
Casting data into a graph
Take data \(\set{X_{i}}\); use a similarity measure \(S(X_{i}, X_{j})\); cast into a weighted similarity graph: set \(w_{u, v} = f(S(u, v))\). Casting data into a graph simplifies the clustering task: there, only the pairwise similarities/distances matter, not the coordinates of the points. Common constructions follow; see the sketch after them.
\htext{\(\eps\){eps} neighborhood graph}
\((u, v) \in E \equiv d(u, v) < \eps\), \ where \(d(u, v) = g(S(u, v))\). Usually left unweighted: the distances between connected points are all below \(\eps\), so edge weights would add little information.
k nearest neighbor directed graph
\((u, v) \in E \equiv v \in KNN(u)\).
Fully connected graph
Useful if \(S(u, v)\) itself models local neighborhoods, e.g. the Gaussian kernel: \(e^{-\norm{u-v}^{2}/(2\stddev^{2})}\).
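A sketch of the three constructions, assuming numpy; the values of \(\eps\), \(k\) and \(\stddev\) are arbitrary illustrative choices.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 2))             # 30 sample points in R^2

# Pairwise distances and Gaussian similarities S(u, v).
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
sigma = 1.0
S = np.exp(-D ** 2 / (2 * sigma ** 2))

# Fully connected graph: the weight matrix is the similarity matrix itself.
W_full = S - np.diag(np.diag(S))             # drop self-loops

# eps-neighborhood graph: unweighted; connect points closer than eps.
eps = 1.0
A_eps = (D < eps).astype(int)
np.fill_diagonal(A_eps, 0)

# k-nearest-neighbor directed graph: edge u -> v iff v is among u's k nearest.
k = 5
A_knn = np.zeros_like(A_eps)
nearest = np.argsort(D, axis=1)[:, 1:k + 1]  # column 0 is the point itself
for u, nbrs in enumerate(nearest):
    A_knn[u, nbrs] = 1

print(A_eps.sum() // 2, A_knn.sum())         # undirected and directed edge counts
\end{verbatim}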
Find similar features
With Clustering
Coclustering does this.
Find covariance of various features
Sample points and their mean
Each sample pt \(X^{(i)}\) is a \(d\)-dim vector. The mean is \(E[X] = \mean\); the sample mean is \(\bar{X} = n^{-1} \sum_i X^{(i)}\).
Sample Covariance matrix C
Estimates \(\covmatrix\). So, \(C_{i,j} = (n-1)^{-1}\sum_k (X_i^{(k)} - \bar{X_i})(X_j^{(k)} - \bar{X_j})\) estimates \(cov(X_i, X_j)\). In matrix form, treating each \(X^{(i)}\) as a row vector, \(C = (n-1)^{-1}\sum_i (X^{(i)} - \bar{X})^{T}(X^{(i)} - \bar{X})\): a sum of \(d \times d\) outer products.
So, \(C\) is symmetric and +ve semi-definite; and if \(n \le d\), \(C\) is singular: the centered samples sum to 0, so \(rank(C) \le n-1 < d\).
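A minimal numpy sketch of the estimator; the mixing matrix used to generate correlated data is an arbitrary illustrative choice.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
A = np.array([[1.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.0, 0.0, 1.0]])
X = rng.standard_normal((n, d)) @ A          # one sample point per row

Xbar = X.mean(axis=0)                        # sample mean, a d-vector
Xc = X - Xbar                                # centered data
C = Xc.T @ Xc / (n - 1)                      # sample covariance, d x d

print(np.allclose(C, np.cov(X, rowvar=False)))   # True: matches numpy's estimator
\end{verbatim}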
Identify a good metric
Generalized interpoint distance
Aka Mahalanobis distance. Take the covariance matrix \(C\); \(X, Y\) sample points with \(E[X] = E[Y] = \mean\). \(d(X, Y) = ((X-Y)^{T}C^{-1}(X-Y))^{1/2}\). Counteracts the bias, while finding distances, due to having correlated features which mostly say the same thing.
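A minimal numpy sketch, with correlated 2-D data as an arbitrary illustrative choice.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.0],
              [1.5, 0.5]])
X = rng.standard_normal((200, 2)) @ A        # two correlated features
C = np.cov(X, rowvar=False)                  # sample covariance matrix

def mahalanobis(x, y, C):
    """Generalized interpoint distance ((x - y)^T C^{-1} (x - y))^{1/2}."""
    diff = x - y
    return np.sqrt(diff @ np.linalg.solve(C, diff))

x, y = X[0], X[1]
print(mahalanobis(x, y, C), np.linalg.norm(x - y))   # vs. plain Euclidean distance
\end{verbatim}
The inverse covariance rescales directions in which the features co-vary, so near-duplicate features are not double-counted.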