01 Distribution models

Discrete L: Response probability: Discriminative models

Boolean valued functions

One can use boolean valued functions to get deterministic models of the form \(y = f(x)\). These functions are considered in the boolean functions survey and the computational learning theory survey.

Probability from regression models

Take any (continuous variable regression) model \(f: X \to [0, 1]\). Such a model can be interpreted as modeling the probability distribution \(f_{L}\) of the label; in the binary case, \(Pr(L = 1|x) = f(x)\).

Advantages of modeling probability

The classifier shouldn’t care whether \(C_{1}\) is called class 1 or class 100; modeling class probabilities respects this. So it is better than solving a regression problem with the numeric label y as the target.

Model numeric labels with regression models

One may use regression models together with an appropriate round-off function to model discrete numerical labels.

Dependence on choice of ran(Y)

For the same k-class classification problem, different choices of numeric labels Y corresponding to \(\set{L_{i}}\) can yield different classifiers. Ideally the classifier should be independent of the choice of labels; this is one reason logistic regression is preferred.

Eg: For the binary classification problem, picking \(L_{i} = \set{\pm 1}\) yields a different model from picking \(L_{i} = \set{\frac{N}{n_{1}}, - \frac{N}{n_{2}}}\), which yields Fisher’s linear discriminant!

y in 1 of k binary encoding format

Make matrix X with rows \([1\ x_{i}^{T}]\). Make Y with rows \(y_{i}^{T}\). Want to find parameters W such that \(XW \approx Y\). Can try \(\min_{W} \norm{XW-Y}_{F}^{2}\); the normal equations give the solution \((X^{T}X)\hat{W} = X^{T}Y\). But \(X\hat{W}\) can have negative entries (and entries above 1) while approximating Y, so its rows cannot be read as probabilities; not a very desirable technique.
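A minimal numpy sketch of this (synthetic data; all names here are illustrative) that exhibits the problem:

```python
import numpy as np

# Minimal sketch with synthetic data (names and data are illustrative).
rng = np.random.default_rng(0)
n, k = 200, 3
x = rng.normal(size=(n, 1))
X = np.hstack([np.ones((n, 1)), x])            # rows [1, x_i^T]
labels = np.digitize(x[:, 0], [-0.5, 0.5])     # 3 classes by thresholding x
Y = np.eye(k)[labels]                          # rows y_i^T, 1-of-k encoded

# Solve min_W ||XW - Y||_F^2 via the normal equations (X^T X) W = X^T Y.
W_hat = np.linalg.solve(X.T @ X, X.T @ Y)

P = X @ W_hat
print(P.min(), P.max())   # typically < 0 and > 1: rows are not probabilities
```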

Logistic model

Got k-class classification problem. Want to model class probabilities or log odds and make classification decision.

Log linear model for class probabilities

\(\forall i \in [1:k]: Pr(C = i|x) \propto e^{w_{i0} + w_i^{T}x}\). So, \(Pr(C = i|x) = \frac{e^{w_{i0} + w_i^{T}x}}{\sum_j e^{w_{j0} + w_j^{T}x}}\).

\exclaim{But this is overparametrized}: specifying \(Pr(C = i|x)\ \forall i \in [1:k-1]\) completely specifies the probability distribution, so the \(k\) weight vectors have more freedom than needed; indeed, adding the same vector to every \((w_{i0}, w_{i})\) leaves the model unchanged.

Equivalent form: model log odds

\(\forall i\in [1:k-1]: \log\frac{Pr(C = i|x)}{Pr(C = k|x)} = w_{i0} + w_{i}^{T}x\).

Get: \(Pr(C = i|x) = \frac{e^{w_{i0}+ w_{i}^{T}x}}{1 + \sum_{j \neq k} e^{w_{j0}+ w_{j}^{T}x}}\) for \(i \neq k\), and \(Pr(C = k|x) = \frac{1}{1 + \sum_{j\neq k} e^{w_{j0}+ w_{j}^{T}x}}\).

Same as the model described in the previous subsubsection, but with the normalization pinned down by setting \(w_k = 0\): then \(Pr(C = i|x) \propto e^{w_{i0} + w_i^{T}x}\) still holds, with \(Pr(C = k|x)\) as the proportionality constant. Thus taking care of the earlier overparametrization!
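A minimal numpy sketch (dimensions and weights hypothetical) of computing these probabilities with \(w_k = 0\) pinned:

```python
import numpy as np

def class_probs(x, W):
    """Class probabilities for the log-odds parametrization.

    W: (k-1, d+1) rows [w_{i0}, w_i]; the k-th class implicitly has w_k = 0.
    x: (d,) input vector.
    """
    x1 = np.concatenate([[1.0], x])           # prepend 1 for the intercept
    scores = W @ x1                            # w_{i0} + w_i^T x, i = 1..k-1
    scores = np.concatenate([scores, [0.0]])   # class k gets score 0
    e = np.exp(scores - scores.max())          # subtract max for stability
    return e / e.sum()
```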

Symmetric notation

Let \(x \gets (1, x), w_i \gets (w_{i0}, w_i)\). Then $$Pr(C = i|x) = \frac{e^{\sum_{c \in [1:k-1]} w_{c}^{T}x\, I[c=i]}}{1 + \sum_{j \neq k} e^{\sum_{c \in [1:k-1]} w_{c}^{T}x\, I[c=j]}}.$$

2-class case

For the 2-class case, these are logistic sigmoid functions, hence the name.
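Writing out the 2-class case (with \(w_2 = 0\), per the parametrization above):

$$Pr(C = 1|x) = \frac{e^{w_{10} + w_{1}^{T}x}}{1 + e^{w_{10} + w_{1}^{T}x}} = \frac{1}{1 + e^{-(w_{10} + w_{1}^{T}x)}} = \sigma(w_{10} + w_{1}^{T}x), \qquad Pr(C = 2|x) = 1 - Pr(C = 1|x).$$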

Risk factors interpretation

\(Pr(C_{i}|x)\) is modeled as a sigmoid function which \(\to 0\) as \(w_{i}^{T}x \to -\infty\) and \(\to 1\) as \(w_{i}^{T}x \to \infty\). So, \(w_{i}\) can be read as a vector of weights assigned to the features \(\set{x_{j}}\). The sign of each component of \(w_{i}\) usually indicates the type of correlation with \(C_{i}\), but it may be reversed in order to compensate for the weight given to other (correlated) features. Eg: \(C_{i}\) could be ‘has heart disease’, and the features may be liquor, fat and tobacco consumption levels.

As a linear discriminant

Consider the binary classification case. Here, \(\log \frac{Pr(C = 1|x)}{1 - Pr(C = 1|x)} = w_0 + w^{T}x\). So, \(w_0 + w^{T}x > 0 \equiv Pr(C=1|x) > Pr(C=0|x)\): the decision boundary is the hyperplane \(w_0 + w^{T}x = 0\).

Estimating parameters

Given observations \(\set{(x^{(i)}, c^{(i)})}\), find w to \(\max_w \prod_i Pr(c^{(i)}|x^{(i)}, w)\) (equivalently, maximize the sum of log probabilities): maximum likelihood estimation.
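A minimal sketch of this maximum likelihood estimation for the binary case (plain gradient ascent on the log likelihood; step size, iteration count and synthetic data are arbitrary illustrative choices):

```python
import numpy as np

def fit_logistic_mle(X, c, lr=0.1, iters=2000):
    """Maximize sum_i log Pr(c_i | x_i, w) for the binary logistic model.

    X: (n, d) inputs; c: (n,) labels in {0, 1}. Returns w of length d + 1
    (intercept first). Gradient of the log likelihood: X1^T (c - sigma(X1 w)).
    """
    X1 = np.hstack([np.ones((len(X), 1)), X])
    w = np.zeros(X1.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X1 @ w))   # Pr(c = 1 | x, w)
        w += lr * X1.T @ (c - p) / len(X)   # ascend the average log likelihood
    return w

# Hypothetical usage on synthetic data:
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
c = (X[:, 0] + 0.5 * X[:, 1] + 0.2 * rng.normal(size=500) > 0).astype(float)
w_hat = fit_logistic_mle(X, c)
```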

Sparsity of model parameters

Sometimes we want w to be sparse or group sparse. In this case, the lasso or group lasso penalty is used when learning the parameters, as in the objective below.
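For instance (binary case; \(\lambda\) and the grouping \(\set{g}\) are modeling choices, not fixed by the notes), the lasso-penalized estimate is

$$\hat{w} = \arg\min_{w} \left( -\sum_{i} \log Pr(c^{(i)}|x^{(i)}, w) + \lambda \norm{w}_{1} \right),$$

with \(\norm{w}_{1}\) replaced by \(\sum_{g} \norm{w_{g}}_{2}\) for the group lasso.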

Discrete L: Response probability: Generative models

Latent variable model

Assume that the parameter \(W = w\) actually generates lower dimensional \(L\), and that observation set \(X\) is generated from \(L\) using some stochastic transformation which is independent of \(w\).

\(L\) is called the latent variable.

Assume conditional independence of input variables

Aka Naive Bayes. \(Pr(L|\ftr(X)) \propto Pr(L) Pr(\ftr(X)|L) = Pr(L) \prod_{i} Pr(\ftr_{i}(X)|L)\); the assumption is \(Pr(\ftr(X)|L) = \prod_{i} Pr(\ftr_{i}(X)|L)\). Model parameters \(Pr(\ftr_{i}(X)|L)\) and \(Pr(L)\) are estimated from the training set \(\set{(X_{i}, L_{i})}\) (see the sketch below).

Co-clustering in a way recovers things lost due to the ‘independence of probability of occurrence of features’ assumption. \tbc

One can conceive of a version of this classifier for the case where \(L, \ftr(X)\) are continuous. \oprob
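A minimal sketch of the estimation and prediction steps for discrete features (the Laplace smoothing count alpha is an extra assumption, not part of these notes; all names hypothetical):

```python
import numpy as np

def fit_naive_bayes(F, L, n_vals, n_labels, alpha=1.0):
    """Estimate Pr(L) and Pr(ftr_i(X) = d | L) from training pairs.

    F: (n, m) integer feature matrix; L: (n,) integer labels.
    alpha: Laplace smoothing count, added to avoid zero probabilities.
    """
    prior = np.bincount(L, minlength=n_labels) + alpha
    prior = prior / prior.sum()
    cond = np.full((n_labels, F.shape[1], n_vals), alpha)
    for f, l in zip(F, L):
        cond[l, np.arange(F.shape[1]), f] += 1
    cond /= cond.sum(axis=2, keepdims=True)   # normalize over feature values
    return prior, cond

def predict(f, prior, cond):
    # log Pr(l) + sum_i log Pr(ftr_i(x) = f_i | l), then argmax over l
    logp = np.log(prior) + np.log(cond[:, np.arange(len(f)), f]).sum(axis=1)
    return int(np.argmax(logp))
```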

Linear separator in some feature space

The decision boundary can be specified by \(\log Pr(l_1) + \sum_{i} \log Pr(\ftr_{i}(x)|l_1) = \log Pr(l_2) + \sum_{i} \log Pr(\ftr_{i}(x)|l_2)\).

Apply the following mapping for variables: \(y_{i,d} = I[\ftr_i(x) = d]\); and create a new set of parameters: \(w_{i, d} = \log Pr(\ftr_{i}(X) = d|l_1) - \log Pr(\ftr_{i}(X) = d|l_2)\), and \(w_0 = \log Pr(l_1) - \log Pr(l_2)\). Now, the decision boundary is just \(w_0 + w^{T}y = 0\), which is a linear separator.

Success in practice

Often works well in practice. Eg: In document classification.

Discriminative counterpart

Its discriminative counterpart is the class of all linear classifiers in the feature space described above, which corresponds to logistic regression. That, in general, works better given a lot of samples.

Use exponential family models

Specification

For \(ran(Y) = \set{0, 1}\): Let \(Pr(x|Y=i) \propto exp(\dprod{w_i, \ftr(x)})\) with normalization constant \(Z(w_i)\), and \(Pr(Y=1) = p\).

So, the corresponding discriminative classifier has log odds \(\log \frac{Pr(Y=1|x)}{Pr(Y=0|x)} = \log\frac{p}{1-p} + \log\frac{Z(w_0)}{Z(w_1)} + \dprod{w_1 - w_0, \ftr(x)}\), which is a linear classifier.
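Spelled out (a short derivation via Bayes rule, using the normalizers \(Z(w_i)\) from above):

$$\frac{Pr(Y=1|x)}{Pr(Y=0|x)} = \frac{Pr(Y=1)\, Pr(x|Y=1)}{Pr(Y=0)\, Pr(x|Y=0)} = \frac{p}{1-p} \cdot \frac{Z(w_0)}{Z(w_1)}\, e^{\dprod{w_1 - w_0, \ftr(x)}},$$

so the log odds are affine in \(\ftr(x)\): exactly the logistic regression form.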

The corresponding discriminative classifier can be deduced directly using logistic regression.

Tree structure assumptions

In estimation, it is important to use the family of tree structured graphical models: we can’t tractably compute \(Z(w)\) otherwise. Estimating \(w_i\) can be done efficiently by computing the maximum-weight spanning tree of a graph among the nodes, with edges weighted by mutual information (the Chow–Liu algorithm); a sketch follows below.

Otherwise, mixtures of trees are also used.
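A minimal sketch of the Chow–Liu step (empirical pairwise mutual information plus a maximum-weight spanning tree via Prim's algorithm; function names and data layout are hypothetical):

```python
import numpy as np

def mutual_information(a, b):
    """Empirical MI between two discrete sample vectors of equal length."""
    mi = 0.0
    for u in np.unique(a):
        for v in np.unique(b):
            p_uv = np.mean((a == u) & (b == v))
            p_u, p_v = np.mean(a == u), np.mean(b == v)
            if p_uv > 0:
                mi += p_uv * np.log(p_uv / (p_u * p_v))
    return mi

def chow_liu_tree(X):
    """Return edges of the maximum-weight spanning tree under pairwise MI.

    X: (n, d) matrix of samples of d discrete variables.
    """
    d = X.shape[1]
    W = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            W[i, j] = W[j, i] = mutual_information(X[:, i], X[:, j])
    # Prim's algorithm, picking the heaviest edge crossing the cut each step.
    in_tree, edges = {0}, []
    while len(in_tree) < d:
        i, j = max(((i, j) for i in in_tree for j in range(d) if j not in in_tree),
                   key=lambda e: W[e])
        edges.append((i, j))
        in_tree.add(j)
    return edges
```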

Latent variable models: Expectation Maximization (EM) alg

Problem

We have an observation \(X=x\); the label \(Y\) is unobserved, and we want to estimate the parameter \(w\) (and thereby deduce \(Y\)).

Tough to optimize likelihood

We want to \(\max_w \log L(w|X=x) = \max_w \log \sum_y f_{X, Y|w}(x, y)\), but this expression often turns out to be hard to maximize due to non-convexity/ non-smoothness. Suppose that this is the case. Also suppose that the complete-data likelihood \(f_{X, Y|w}\) is easy to maximize over \(w\).

So, we resort to local optimization of a surrogate function starting from an initial guess of \(w\).

Examples

Maybe we want to find the parameter \(w\) giving weights to a set of fixed Gaussians. Here, \(Y\) can be the vector of ids of the Gaussians from which the observed data X comes.

A more common and important application is in estimating HMM parameters.

Iterative algorithm

Suppose that you are given \(w^{(i)}\). We want to obtain \(w^{(i+1)}\) such that \(L(w^{(i+1)}) \geq L(w^{(i)})\).

Intuition

Basic idea is to do the following repeatedly: at the current point \(w^{(i)}\), find a tractable approximate surrogate \(Q(w| w^{(i)})\) for \(L(w|X)\), and maximize it to get a ‘better’ \(w^{(i+1)}\).

Consider \(Q(w| w^{(i)})\) from the E-step below. \(Q(w|w^{(i)})\) is the expectation over \(Y\) (distributed according to the current belief under \(w^{(i)}\)) of the log likelihood of \(w\) given \((X, Y)\). This is a reasonable substitute for \(L(w|X)\).

E-step

Take \(Q(w | w^{(i)}) = E_{y \distr w^{(i)}}[\log f_{X, Y|w}(x, y)]\) to measure the goodness of \(w\) in producing X, averaged over the current belief about Y.

M-step

Set \(w^{(i+1)} = argmax_w Q(w | w^{(i)})\).
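A minimal sketch of these two steps for the earlier fixed-Gaussians example (1-d components with known means and scales; only the mixture weights \(w\) are learned; data and settings are illustrative):

```python
import numpy as np

# Fixed 1-d Gaussian components: means/scales assumed known, only weights learned.
means = np.array([-2.0, 0.0, 3.0])
scales = np.array([1.0, 1.0, 1.0])

def em_mixture_weights(x, w, iters=50):
    """EM for the mixture weights w; Y_i = id of the Gaussian that produced x_i."""
    # Component densities f(x_i | Y_i = j), shape (n, k); these never change.
    dens = np.exp(-(x[:, None] - means) ** 2 / (2 * scales ** 2)) \
           / (scales * np.sqrt(2 * np.pi))
    for _ in range(iters):
        # E-step: responsibilities r_{ij} = Pr(Y_i = j | x_i, w^{(t)}),
        # the distribution defining the expectation in Q(w | w^{(t)}).
        r = w * dens
        r = r / r.sum(axis=1, keepdims=True)
        # M-step: argmax_w Q(w | w^{(t)}) subject to sum(w) = 1
        # is the average responsibility per component.
        w = r.mean(axis=0)
    return w

# Hypothetical usage on synthetic data drawn mostly from components 0 and 2:
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])
w_hat = em_mixture_weights(x, np.ones(3) / 3)
```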

Analysis

Maximizing an approximation of the likelihood

Instead of maximizing \(\log L(w|X)\) directly, construct a function Q(w) which lower bounds it; then maximize Q to get \(w^{(i+1)}\); repeat.

Q(w) is a lower bound

Q(w) is a lower bound for \(\log L(w|x)\).

Proof

Regardless of how \(Y \distr w^{(i)}\) is distributed: for every \(y\), \(f_{X, Y|w}(x, y) \leq \sum_{y'} f_{X, Y|w}(x, y') = L(w|x)\), so \(\log L(w|x, y) \leq \log L(w|x)\) pointwise; taking the expectation, \(Q(w) = E_y[\log L(w|x, y)] \leq \log L(w|x)\).

Convergence

The lower-bound property by itself does not guarantee that maximizing Q() improves L(). The standard, stronger EM argument uses the fact that \(\log L(w|x) - Q(w|w^{(i)})\) is minimized at \(w = w^{(i)}\); together with \(Q(w^{(i+1)}|w^{(i)}) \geq Q(w^{(i)}|w^{(i)})\), this gives \(L(w^{(i+1)}) \geq L(w^{(i)})\), i.e. the likelihood increases monotonically. Even so, EM is only guaranteed to converge to a stationary point (typically a local maximum), not the global maximum.