01 Distribution models

Discrete L: Response probability: Discriminative models

Boolean valued functions

One can use boolean valued functions to get deterministic models of the form \(y = f(x)\). These functions are considered in the boolean functions survey and the computational learning theory survey.

Probability from regression models

Take any (continuous variable regression) model \(f: X \to [0, 1]\). Such a model can be interpreted as modeling the probability distribution \(f_{L}\) of the label; in the binary case, \(Pr(L = 1|x) = f(x)\).

Advantages of modeling probability

The classifier shouldn’t care whether \(C_{1}\) is called class 1 or class 100; modeling class probabilities respects this. So it is better than solving a regression problem with the numeric label y as the target.

Model numeric labels with regression models

One may use regression models together with an appropriate round-off function to model discrete numerical labels.

Dependence on choice of ran(Y)

For the same k-class classification problem, different choices of numeric labels Y corresponding to \(\set{L_{i}}\) can yield different classifiers. Ideally the classifier should be independent of the choice of labels; this is one reason logistic regression is preferred.

Eg: For the binary classification problem, picking \(L_{i} = \set{\pm 1}\) yields a different model from picking \(L_{i} = \set{\frac{N}{n_{1}}, - \frac{N}{n_{2}}}\), which yields Fisher’s linear discriminant!

y in 1 of k binary encoding format

Make matrix X with rows \([1\ x_{i}^{T}]\). Make Y with rows \(y_{i}^{T}\). Want to find parameters W such that \(XW \approx Y\). Can try \(\min_{W} \norm{XW-Y}_{F}^{2}\); the normal equations give the solution \((X^{T}X)\hat{W} = X^{T}Y\). But \(X\hat{W}\) can have negative entries (and entries above 1) while approximating Y, so its rows cannot be read as probabilities; not a very desirable technique.
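A minimal numpy sketch of this (synthetic data; all names here are illustrative) that exhibits the problem:

```python
import numpy as np

# Minimal sketch with synthetic data (names and data are illustrative).
rng = np.random.default_rng(0)
n, k = 200, 3
x = rng.normal(size=(n, 1))
X = np.hstack([np.ones((n, 1)), x])            # rows [1, x_i^T]
labels = np.digitize(x[:, 0], [-0.5, 0.5])     # 3 classes by thresholding x
Y = np.eye(k)[labels]                          # rows y_i^T, 1-of-k encoded

# Solve min_W ||XW - Y||_F^2 via the normal equations (X^T X) W = X^T Y.
W_hat = np.linalg.solve(X.T @ X, X.T @ Y)

P = X @ W_hat
print(P.min(), P.max())   # typically < 0 and > 1: rows are not probabilities
```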

Logistic model

Got k-class classification problem. Want to model class probabilities or log odds and make classification decision.

Log linear model for class probabilities

\(\forall i \in [1:k]: Pr(C = i|x) \propto e^{w_{i0} + w_i^{T}x}\). So, \(Pr(C = i|x) = \frac{e^{w_{i0} + w_i^{T}x}}{\sum_j e^{w_{j0} + w_j^{T}x}}\).

\exclaim{But this is overparametrized}: specifying \(Pr(C = i|x)\ \forall i \in [1:k-1]\) completely specifies the probability distribution, so the \(k\) weight vectors have more freedom than needed; indeed, adding the same vector to every \((w_{i0}, w_{i})\) leaves the model unchanged.

Equivalent form: model log odds

\(\forall i\in [1:k-1]: \log\frac{Pr(C = i|x)}{Pr(C = k|x)} = w_{i0} + w_{i}^{T}x\).

Get: \(Pr(C = i|x) = \frac{e^{w_{i0}+ w_{i}^{T}x}}{1 + \sum_{j \neq k} e^{w_{j0}+ w_{j}^{T}x}}\) for \(i \neq k\), and \(Pr(C = k|x) = \frac{1}{1 + \sum_{j\neq k} e^{w_{j0}+ w_{j}^{T}x}}\).

Same as the model described in the previous subsubsection, but with the normalization pinned down by setting \(w_k = 0\): then \(Pr(C = i|x) \propto e^{w_{i0} + w_i^{T}x}\) still holds, with \(Pr(C = k|x)\) as the proportionality constant. Thus taking care of the earlier overparametrization!
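A minimal numpy sketch (dimensions and weights hypothetical) of computing these probabilities with \(w_k = 0\) pinned:

```python
import numpy as np

def class_probs(x, W):
    """Class probabilities for the log-odds parametrization.

    W: (k-1, d+1) rows [w_{i0}, w_i]; the k-th class implicitly has w_k = 0.
    x: (d,) input vector.
    """
    x1 = np.concatenate([[1.0], x])           # prepend 1 for the intercept
    scores = W @ x1                            # w_{i0} + w_i^T x, i = 1..k-1
    scores = np.concatenate([scores, [0.0]])   # class k gets score 0
    e = np.exp(scores - scores.max())          # subtract max for stability
    return e / e.sum()
```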

Symmetric notation

Let \(x \gets (1, x), w_i \gets (w_{i0}, w_i)\). Then $$Pr(C = i|x) = \frac{e^{\sum_{c \in [1:k-1]} w_{c}^{T}x\, I[c=i]}}{1 + \sum_{j \neq k} e^{\sum_{c \in [1:k-1]} w_{c}^{T}x\, I[c=j]}}.$$

2-class case

For the 2-class case, these are logistic sigmoid functions, hence the name.
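Writing out the 2-class case (with \(w_2 = 0\), per the parametrization above):

$$Pr(C = 1|x) = \frac{e^{w_{10} + w_{1}^{T}x}}{1 + e^{w_{10} + w_{1}^{T}x}} = \frac{1}{1 + e^{-(w_{10} + w_{1}^{T}x)}} = \sigma(w_{10} + w_{1}^{T}x), \qquad Pr(C = 2|x) = 1 - Pr(C = 1|x).$$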

Risk factors interpretation

\(Pr(C_{i}|x)\) is modeled as a sigmoid function which \(\to 0\) as \(w_{i}^{T}x \to -\infty\) and \(\to 1\) as \(w_{i}^{T}x \to \infty\). So, \(w_{i}\) can be read as a vector of weights assigned to the features \(\set{x_{j}}\). The sign of each component of \(w_{i}\) usually indicates the type of correlation with \(C_{i}\), but it may be reversed in order to compensate for the weight given to other (correlated) features. Eg: \(C_{i}\) could be ‘has heart disease’, and the features may be liquor, fat and tobacco consumption levels.

As a linear discriminant

Consider the binary classification case. Here, \(\log \frac{Pr(C = 1|x)}{1 - Pr(C = 1|x)} = w_0 + w^{T}x\). So, \(w_0 + w^{T}x > 0 \equiv Pr(C=1|x) > Pr(C=0|x)\): the decision boundary is the hyperplane \(w_0 + w^{T}x = 0\).

Estimating parameters

Given observations \(\set{(x^{(i)}, c^{(i)})}\), find w to \(\max_w \prod_i Pr(c^{(i)}|x^{(i)}, w)\) (equivalently, maximize the sum of log probabilities): maximum likelihood estimation.
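A minimal sketch of this maximum likelihood estimation for the binary case (plain gradient ascent on the log likelihood; step size, iteration count and synthetic data are arbitrary illustrative choices):

```python
import numpy as np

def fit_logistic_mle(X, c, lr=0.1, iters=2000):
    """Maximize sum_i log Pr(c_i | x_i, w) for the binary logistic model.

    X: (n, d) inputs; c: (n,) labels in {0, 1}. Returns w of length d + 1
    (intercept first). Gradient of the log likelihood: X1^T (c - sigma(X1 w)).
    """
    X1 = np.hstack([np.ones((len(X), 1)), X])
    w = np.zeros(X1.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X1 @ w))   # Pr(c = 1 | x, w)
        w += lr * X1.T @ (c - p) / len(X)   # ascend the average log likelihood
    return w

# Hypothetical usage on synthetic data:
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
c = (X[:, 0] + 0.5 * X[:, 1] + 0.2 * rng.normal(size=500) > 0).astype(float)
w_hat = fit_logistic_mle(X, c)
```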

Sparsity of model parameters

Sometimes we want w to be sparse or group sparse. In this case, the lasso or group lasso penalty is used when learning the parameters, as in the objective below.
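For instance (binary case; \(\lambda\) and the grouping \(\set{g}\) are modeling choices, not fixed by the notes), the lasso-penalized estimate is

$$\hat{w} = \arg\min_{w} \left( -\sum_{i} \log Pr(c^{(i)}|x^{(i)}, w) + \lambda \norm{w}_{1} \right),$$

with \(\norm{w}_{1}\) replaced by \(\sum_{g} \norm{w_{g}}_{2}\) for the group lasso.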

Discrete L: Response probability: Generative models

Latent variable model

Assume that the parameter \(W = w\) actually generates lower dimensional \(L\), and that observation set \(X\) is generated from \(L\) using some stochastic transformation which is independent of \(w\).

\(L\) is called the latent variable.

Assume conditional independence of input variables

Aka Naive Bayes. \(Pr(L|\ftr(X)) \propto Pr(L) Pr(\ftr(X)|L) = Pr(L) \prod_{i} Pr(\ftr_{i}(X)|L)\); the assumption is \(Pr(\ftr(X)|L) = \prod_{i} Pr(\ftr_{i}(X)|L)\). Model parameters \(Pr(\ftr_{i}(X)|L)\) and \(Pr(L)\) are estimated from the training set \(\set{(X_{i}, L_{i})}\) (see the sketch below).

Co-clustering in a way recovers things lost due to the ‘independence of probability of occurrence of features’ assumption. \tbc

One can conceive of a version of this classifier for the case where \(L, \ftr(X)\) are continuous. \oprob
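A minimal sketch of the estimation and prediction steps for discrete features (the Laplace smoothing count alpha is an extra assumption, not part of these notes; all names hypothetical):

```python
import numpy as np

def fit_naive_bayes(F, L, n_vals, n_labels, alpha=1.0):
    """Estimate Pr(L) and Pr(ftr_i(X) = d | L) from training pairs.

    F: (n, m) integer feature matrix; L: (n,) integer labels.
    alpha: Laplace smoothing count, added to avoid zero probabilities.
    """
    prior = np.bincount(L, minlength=n_labels) + alpha
    prior = prior / prior.sum()
    cond = np.full((n_labels, F.shape[1], n_vals), alpha)
    for f, l in zip(F, L):
        cond[l, np.arange(F.shape[1]), f] += 1
    cond /= cond.sum(axis=2, keepdims=True)   # normalize over feature values
    return prior, cond

def predict(f, prior, cond):
    # log Pr(l) + sum_i log Pr(ftr_i(x) = f_i | l), then argmax over l
    logp = np.log(prior) + np.log(cond[:, np.arange(len(f)), f]).sum(axis=1)
    return int(np.argmax(logp))
```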

Linear separator in some feature space

The decision boundary can be specified by \(\log Pr(l_1) + \sum_{i} \log Pr(\ftr_{i}(x)|l_1) = \log Pr(l_2) + \sum_{i} \log Pr(\ftr_{i}(x)|l_2)\).

Apply the following mapping for variables: \(y_{i,d} = I[\ftr_i(x) = d]\); and create a new set of parameters: \(w_{i, d} = \log Pr(\ftr_{i}(X) = d|l_1) - \log Pr(\ftr_{i}(X) = d|l_2)\), and \(w_0 = \log Pr(l_1) - \log Pr(l_2)\). Now, the decision boundary is just \(w_0 + w^{T}y = 0\), which is a linear separator.

Success in practice

Often works well in practice. Eg: In document classification.

Discriminative counterpart

Its discriminative counterpart is the class of all linear classifiers in the feature space described above, which corresponds to logistic regression. That, in general, works better given a lot of samples.

Use exponential family models

Specification

For \(ran(Y) = \set{0, 1}\): Let \(Pr(x|Y=i) \propto exp(\dprod{w_i, \ftr(x)})\) with normalization constant \(Z(w_i)\), and \(Pr(Y=1) = p\).

So, the corresponding discriminative classifier has log odds \(\log \frac{Pr(Y=1|x)}{Pr(Y=0|x)} = \log\frac{p}{1-p} + \log\frac{Z(w_0)}{Z(w_1)} + \dprod{w_1 - w_0, \ftr(x)}\), which is a linear classifier.
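Spelled out (a short derivation via Bayes rule, using the normalizers \(Z(w_i)\) from above):

$$\frac{Pr(Y=1|x)}{Pr(Y=0|x)} = \frac{Pr(Y=1)\, Pr(x|Y=1)}{Pr(Y=0)\, Pr(x|Y=0)} = \frac{p}{1-p} \cdot \frac{Z(w_0)}{Z(w_1)}\, e^{\dprod{w_1 - w_0, \ftr(x)}},$$

so the log odds are affine in \(\ftr(x)\): exactly the logistic regression form.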

The corresponding discriminative classifier can be deduced directly using logistic regression.

Tree structure assumptions

In estimation, it is important to use the family of tree structured graphical models: we can’t tractably compute \(Z(w)\) otherwise. Estimating \(w_i\) can be done efficiently by computing the maximum-weight spanning tree of a graph among the nodes, with edges weighted by mutual information (the Chow–Liu algorithm); a sketch follows below.

Otherwise, mixtures of trees are also used.
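A minimal sketch of the Chow–Liu step (empirical pairwise mutual information plus a maximum-weight spanning tree via Prim's algorithm; function names and data layout are hypothetical):

```python
import numpy as np

def mutual_information(a, b):
    """Empirical MI between two discrete sample vectors of equal length."""
    mi = 0.0
    for u in np.unique(a):
        for v in np.unique(b):
            p_uv = np.mean((a == u) & (b == v))
            p_u, p_v = np.mean(a == u), np.mean(b == v)
            if p_uv > 0:
                mi += p_uv * np.log(p_uv / (p_u * p_v))
    return mi

def chow_liu_tree(X):
    """Return edges of the maximum-weight spanning tree under pairwise MI.

    X: (n, d) matrix of samples of d discrete variables.
    """
    d = X.shape[1]
    W = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            W[i, j] = W[j, i] = mutual_information(X[:, i], X[:, j])
    # Prim's algorithm, picking the heaviest edge crossing the cut each step.
    in_tree, edges = {0}, []
    while len(in_tree) < d:
        i, j = max(((i, j) for i in in_tree for j in range(d) if j not in in_tree),
                   key=lambda e: W[e])
        edges.append((i, j))
        in_tree.add(j)
    return edges
```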

Latent variable models: Expectation Maximization (EM) alg

Problem

We have an observation \(X=x\); the label \(Y\) is unobserved, and we want to estimate the parameter \(w\) (and thereby deduce \(Y\)).

Tough to optimize likelihood

We want to \(\max_w \log L(w|X=x) = \max_w \log \sum_y f_{X, Y|w}(x, y)\), but this expression often turns out to be hard to maximize due to non-convexity/ non-smoothness. Suppose that this is the case. Also suppose that the complete-data likelihood \(f_{X, Y|w}\) is easy to maximize over \(w\).

So, we resort to local optimization of a surrogate function starting from an initial guess of \(w\).

Examples

Maybe we want to find the parameter \(w\) giving weights to a set of fixed Gaussians. Here, \(Y\) can be the vector of ids of the Gaussians from which the observed data X comes.

A more common and important application is in estimating HMM parameters.

Iterative algorithm

Suppose that you are given \(w^{(i)}\). We want to obtain \(w^{(i+1)}\) such that \(L(w^{(i+1)}) \geq L(w^{(i)})\).

Intuition

Basic idea is to do the following repeatedly: at the current point \(w^{(i)}\), find a tractable approximate surrogate \(Q(w| w^{(i)})\) for \(L(w|X)\), and maximize it to get a ‘better’ \(w^{(i+1)}\).

Consider \(Q(w| w^{(i)})\) from the E-step below. \(Q(w|w^{(i)})\) is the expectation over \(Y\) (distributed according to the current belief under \(w^{(i)}\)) of the log likelihood of \(w\) given \((X, Y)\). This is a reasonable substitute for \(L(w|X)\).

E-step

Take \(Q(w | w^{(i)}) = E_{y \distr w^{(i)}}[\log f_{X, Y|w}(x, y)]\) to measure the goodness of \(w\) in producing X, averaged over the current belief about Y.

M-step

Set \(w^{(i+1)} = argmax_w Q(w | w^{(i)})\).
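A minimal sketch of these two steps for the earlier fixed-Gaussians example (1-d components with known means and scales; only the mixture weights \(w\) are learned; data and settings are illustrative):

```python
import numpy as np

# Fixed 1-d Gaussian components: means/scales assumed known, only weights learned.
means = np.array([-2.0, 0.0, 3.0])
scales = np.array([1.0, 1.0, 1.0])

def em_mixture_weights(x, w, iters=50):
    """EM for the mixture weights w; Y_i = id of the Gaussian that produced x_i."""
    # Component densities f(x_i | Y_i = j), shape (n, k); these never change.
    dens = np.exp(-(x[:, None] - means) ** 2 / (2 * scales ** 2)) \
           / (scales * np.sqrt(2 * np.pi))
    for _ in range(iters):
        # E-step: responsibilities r_{ij} = Pr(Y_i = j | x_i, w^{(t)}),
        # the distribution defining the expectation in Q(w | w^{(t)}).
        r = w * dens
        r = r / r.sum(axis=1, keepdims=True)
        # M-step: argmax_w Q(w | w^{(t)}) subject to sum(w) = 1
        # is the average responsibility per component.
        w = r.mean(axis=0)
    return w

# Hypothetical usage on synthetic data drawn mostly from components 0 and 2:
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])
w_hat = em_mixture_weights(x, np.ones(3) / 3)
```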

Analysis

Maximizing an approximation of the likelihood

Instead of maximizing \(\log L(w|X)\) directly, construct a function Q(w) which lower bounds it; then maximize Q to get \(w^{(i+1)}\); repeat.

Q(w) is a lower bound

Q(w) is a lower bound for \(\log L(w|x)\).

Proof

Regardless of how \(Y \distr w^{(i)}\) is distributed: for every \(y\), \(f_{X, Y|w}(x, y) \leq \sum_{y'} f_{X, Y|w}(x, y') = L(w|x)\), so \(\log L(w|x, y) \leq \log L(w|x)\) pointwise; taking the expectation, \(Q(w) = E_y[\log L(w|x, y)] \leq \log L(w|x)\).

Convergence

The lower-bound property by itself does not guarantee that maximizing Q() improves L(). The standard, stronger EM argument uses the fact that \(\log L(w|x) - Q(w|w^{(i)})\) is minimized at \(w = w^{(i)}\); together with \(Q(w^{(i+1)}|w^{(i)}) \geq Q(w^{(i)}|w^{(i)})\), this gives \(L(w^{(i+1)}) \geq L(w^{(i)})\), i.e. the likelihood increases monotonically. Even so, EM is only guaranteed to converge to a stationary point (typically a local maximum), not the global maximum.