01 Distribution models

Discrete L: Response probability: Discriminative models

Boolean valued functions

One can use boolean valued functions to get deterministic models of the form y=f(x). These functions are considered in the boolean functions survey and the computational learning theory survey.

Probability from regression models

Take any (continuous variable regression) model $f: X \to [0,1]$. Such a model can be interpreted as modeling the probability distribution $f_L$ of the label, e.g. $f(x) = Pr(L=1|x)$ in the binary case.

Advantages of modeling probability

The classifier doesn’t care whether $C_1$ is called class 1 or class 100. So, modeling the probability is better than solving a regression problem with the numeric label $y$ as the target.

Model numeric labels with regression models

One may use regression models together with an appropriate round-off function to model discrete numerical labels.

Dependence on choice of ran(Y)

For the same k-classification problem, different choices of numeric labels $Y$ corresponding to $\{L_i\}$ can yield different classifiers. Ideally, the classifier should be independent of the choice of labels. So, logistic regression is preferred.

Eg: For a binary classification problem, picking $L_i \in \{\pm 1\}$ yields a different model from picking $L_i \in \{N/N_1, -N/N_2\}$ (where $N_j$ is the number of samples in class $j$), and the latter yields Fisher's linear discriminant!

y in 1-of-k binary encoding format

Make the matrix $X$ with rows $[1\ x_i^T]$. Make $Y$ with rows $y_i^T$. Want to find parameters $W$ such that $XW \approx Y$. Can try $\min_W \|XW - Y\|_F^2$; get the solution $(X^TX)\hat{W} = X^TY$. But $X\hat{W}$ can have negative numbers which approximate $Y$; so not a very desirable technique.
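A small numeric illustration of this drawback: a sketch in numpy on made-up blob data, showing that the least-squares fit of a 1-of-k target matrix need not produce valid probabilities.

```python
# Sketch (numpy; made-up blob data): least-squares fit of a 1-of-k target matrix,
# showing that the fitted scores X @ W_hat need not be valid probabilities.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 120, 2, 3
labels = rng.integers(0, k, size=n)
X_raw = rng.normal(size=(n, d)) + 5.0 * labels[:, None]   # three separated blobs
X = np.hstack([np.ones((n, 1)), X_raw])                   # rows [1, x_i^T]
Y = np.eye(k)[labels]                                     # rows y_i^T (1-of-k)

# Solves min_W ||XW - Y||_F^2 (equivalently the normal equations X^T X W = X^T Y).
W_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

scores = X @ W_hat
print("min fitted score:", scores.min())       # typically negative
print("row sums:", scores.sum(axis=1)[:3])     # each ~1, but entries can lie outside [0,1]
```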

Logistic model

Got k-class classification problem. Want to model class probabilities or log odds and make classification decision.

Log linear model for class probabilities

$\forall i \in [1:k]: Pr(C=i|x) \propto e^{w_{i0} + w_i^Tx}$. So, $Pr(C=i|x) = \frac{e^{w_{i0}+w_i^Tx}}{\sum_j e^{w_{j0}+w_j^Tx}}$.

\exclaim{But this is overparametrized}: The choice of $w$ is constrained by the fact that specifying $Pr(C=i|x)$ for $i = 1:k-1$ completely specifies the probability distribution.
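A minimal sketch of this log-linear model as code (the parameter values below are illustrative, not estimated from anything):

```python
# Minimal sketch of the log-linear model above: Pr(C=i|x) via a softmax over
# per-class scores w_{i0} + w_i^T x. Parameter values are illustrative only.
import numpy as np

def class_probs(x, w0, W):
    """w0: shape (k,) intercepts; W: shape (k, d) weights. Returns Pr(C=i|x)."""
    scores = w0 + W @ x                 # w_{i0} + w_i^T x for each class i
    scores -= scores.max()              # numerical stability; probabilities unchanged
    e = np.exp(scores)
    return e / e.sum()

x = np.array([1.0, -2.0])
w0 = np.array([0.1, 0.0, -0.3])
W = np.array([[0.5, 1.0], [-0.2, 0.3], [0.0, -1.0]])
p = class_probs(x, w0, W)
print(p, p.sum())                       # k probabilities summing to 1
```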

Equivalent form: model log odds

$\forall i \in [1:k-1]: \log\frac{Pr(C=i|x)}{Pr(C=k|x)} = w_{i0} + w_i^Tx$.

Get: $Pr(C=i|x) = \frac{e^{w_{i0}+w_i^Tx}}{1+\sum_{j\neq k} e^{w_{j0}+w_j^Tx}}$, $Pr(C=k|x) = \frac{1}{1+\sum_{j\neq k} e^{w_{j0}+w_j^Tx}}$.

Same as the model described in the previous subsubsection, with all $Pr(C=i)$ scaled to ensure that $Pr(C=i|x) = e^{w_{i0}+w_i^Tx}\, Pr(C=k|x)$: done by ensuring that $w_k = 0$. Thus taking care of the earlier overparametrization!
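A quick numeric check of this reparametrization (illustrative parameters): subtracting the k-th class's scores from every class, so that $w_k = 0$, leaves the probabilities unchanged.

```python
# Quick numeric check of the reparametrization: subtracting the k-th class's
# parameters from every class (so that w_k = 0) leaves the probabilities unchanged.
# Parameter values are illustrative only.
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

x = np.array([0.7, -1.2])
w0 = np.array([0.4, -0.1, 0.2])
W = np.array([[1.0, 0.5], [0.3, -0.8], [-0.6, 0.1]])

scores = w0 + W @ x
scores_wk0 = (w0 - w0[-1]) + (W - W[-1]) @ x      # reparametrized so w_k = 0
print(np.allclose(softmax(scores), softmax(scores_wk0)))   # True
```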

Symmetric notation

Let $x \equiv (1, x)$, $w_i \equiv (w_{i0}, w_i)$. $$Pr(C=i|x)=\frac{e^{\sum_{c\in\{1,\dots,k-1\}} w_c^Tx\, I[c=i]}}{1+\sum_{j\neq k}e^{\sum_{c} w_c^Tx\, I[c=j]}}$$

2-class case

For the 2-class case, these are logistic sigmoid functions, hence the name.

Risk factors interpretation

$Pr(C_i|x)$ is modeled as a sigmoid function which $\to 0$ as $w_i^Tx \to -\infty$ and $\to 1$ as $w_i^Tx \to \infty$. So, one can consider $w_i$ as the vector of weights assigned to the features $\{x_j\}$. The sign of each weight usually indicates the type of correlation with $C_i$, but could be reversed in order to compensate for the weight given to other features. Eg: $C_i$ could be ‘has heart disease’, and the features may be liquor, fat and tobacco consumption levels.

As a linear discriminant

Consider the binary classification case. Here, $\log\frac{Pr(C=1|x)}{1-Pr(C=1|x)} = w_0 + w^Tx$. So, $w_0 + w^Tx > 0 \Leftrightarrow Pr(C=1|x) > Pr(C=0|x)$.
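A tiny illustration with made-up weights: the resulting decision rule is just a sign test on $w_0 + w^Tx$.

```python
# Tiny illustration (made-up weights): the logistic model's decision rule in the
# binary case is a sign test on the linear discriminant w0 + w^T x.
import numpy as np

def predict(x, w0, w):
    return int(w0 + w @ x > 0)          # 1 iff Pr(C=1|x) > Pr(C=0|x)

w0, w = -1.0, np.array([2.0, -0.5])
print(predict(np.array([1.0, 1.0]), w0, w))   # 2.0 - 0.5 - 1.0 = 0.5 > 0  -> class 1
print(predict(np.array([0.0, 1.0]), w0, w))   # -0.5 - 1.0 = -1.5 <= 0     -> class 0
```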

Estimating parameters

Given observations $(x^{(i)}, c^{(i)})$, find $w$ to $\max_w \prod_i Pr(c^{(i)}|x^{(i)}, w)$: maximum likelihood estimation.
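A sketch of such maximum-likelihood fitting for the binary case, using plain gradient ascent on the log likelihood (the notes do not prescribe a particular optimizer; this is just one simple choice, on made-up data):

```python
# Sketch of maximum-likelihood fitting for the binary case via plain gradient
# ascent on the log likelihood (one possible optimizer; not prescribed by the notes).
import numpy as np

def fit_logistic(X, c, lr=0.1, iters=2000):
    """X: (n, d) with a leading column of ones; c: (n,) labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))      # Pr(c=1 | x, w)
        grad = X.T @ (c - p)                  # gradient of sum_i log Pr(c_i | x_i, w)
        w += lr * grad / len(c)
    return w

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=(n, 1))
c = (x[:, 0] + 0.3 * rng.normal(size=n) > 0).astype(float)
X = np.hstack([np.ones((n, 1)), x])
print(fit_logistic(X, c))                     # learned [w_0, w_1]
```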

Sparsity of model parameters

Sometimes, we want $w$ to be sparse or group sparse. In this case, a lasso or group-lasso penalty is used when learning the parameters.
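A short sketch, assuming scikit-learn is available, of an $\ell_1$ (lasso-style) penalty producing a sparse $w$; the data here is synthetic.

```python
# Sketch, assuming scikit-learn is available: an l1 (lasso-style) penalty on
# logistic regression drives most coordinates of w to exactly zero.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]                 # only 3 informative features
y = (X @ w_true + 0.5 * rng.normal(size=300) > 0).astype(int)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
print(np.count_nonzero(clf.coef_))            # typically far fewer than 20 nonzero weights
```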

Discrete L: Response probability: Generative models

Latent variable model

Assume that the parameter $W=w$ actually generates a lower dimensional $L$, and that the observation set $X$ is generated from $L$ using some stochastic transformation which is independent of $w$.

L is called the latent variable.

Assume conditional independence of input variables

Aka Naive Bayes. $Pr(L|\phi(X)) \propto Pr(L)\, Pr(\phi(X)|L) = Pr(L)\prod_i Pr(\phi_i(X)|L)$. Here $Pr(\phi(X)|L) = \prod_i Pr(\phi_i(X)|L)$ is the assumption. The model parameters $Pr(\phi_i(X)|L)$ and $Pr(L)$ are estimated from the training set $\{(X_i, L_i)\}$.

Co-clustering in a way recovers things lost due to the ‘independence of probability of occurrence of features’ assumption. \tbc

One can conceive of a version of this classifier for the case where $L, \phi(X)$ are continuous. \oprob
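A minimal sketch of the estimation-by-counting described above, for discrete features; the toy data and the Laplace-style smoothing are illustrative choices, not from the notes.

```python
# Sketch of Naive Bayes estimation by counting for discrete features. The toy data
# and the Laplace-style smoothing are illustrative choices, not from the notes.
import numpy as np
from collections import defaultdict

def fit_naive_bayes(features, labels, alpha=1.0):
    """features: list of tuples of discrete values; labels: list of class labels."""
    class_counts = defaultdict(float)
    feat_counts = defaultdict(lambda: defaultdict(float))     # (i, label) -> value -> count
    for phi, l in zip(features, labels):
        class_counts[l] += 1
        for i, v in enumerate(phi):
            feat_counts[(i, l)][v] += 1
    prior = {l: c / len(labels) for l, c in class_counts.items()}
    def cond_prob(i, v, l):                                    # smoothed Pr(phi_i(X)=v | L=l)
        counts = feat_counts[(i, l)]
        return (counts.get(v, 0.0) + alpha) / (class_counts[l] + alpha * (len(counts) + 1))
    return prior, cond_prob

def predict(phi, prior, cond_prob):
    scores = {l: np.log(p) + sum(np.log(cond_prob(i, v, l)) for i, v in enumerate(phi))
              for l, p in prior.items()}
    return max(scores, key=scores.get)

features = [("sunny", "hot"), ("rainy", "cool"), ("sunny", "cool"), ("rainy", "hot")]
labels = ["out", "in", "out", "in"]
prior, cond = fit_naive_bayes(features, labels)
print(predict(("sunny", "hot"), prior, cond))                  # -> "out"
```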

Linear separator in some feature space

The decision boundary can be specified by $\log Pr(l_1) + \sum_i \log Pr(\phi_i(x)|l_1) = \log Pr(l_2) + \sum_i \log Pr(\phi_i(x)|l_2)$.

Apply the following mapping for variables: $y_{i,d} = I[\phi_i(x) = d]$; and create a new set of parameters: $w_{i,d} = \log Pr(\phi_i(X)=d|l_1) - \log Pr(\phi_i(X)=d|l_2)$, and $w_0 = \log Pr(l_1) - \log Pr(l_2)$. Now, the decision boundary is just $w_0 + w^Ty = 0$, which is a linear separator.
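A small numeric check of this conversion for two classes and two binary features; all parameter values below are made up.

```python
# Numeric check of the conversion for 2 classes and 2 binary features; all
# parameter values below are made up. cond[l][i][d] = Pr(phi_i(X)=d | l).
import numpy as np

prior = {1: 0.6, 2: 0.4}
cond = {1: [[0.7, 0.3], [0.2, 0.8]],
        2: [[0.4, 0.6], [0.5, 0.5]]}

w0 = np.log(prior[1]) - np.log(prior[2])
w = np.array([[np.log(cond[1][i][d]) - np.log(cond[2][i][d]) for d in range(2)]
              for i in range(2)]).ravel()                      # w_{i,d}

def to_indicators(phi):
    y = np.zeros((2, 2))
    for i, d in enumerate(phi):
        y[i, d] = 1.0                                          # y_{i,d} = I[phi_i(x) = d]
    return y.ravel()

phi = (0, 1)
linear_score = w0 + w @ to_indicators(phi)
nb_score = (np.log(prior[1]) + sum(np.log(cond[1][i][d]) for i, d in enumerate(phi))
            - np.log(prior[2]) - sum(np.log(cond[2][i][d]) for i, d in enumerate(phi)))
print(np.isclose(linear_score, nb_score))                      # True: same decision boundary
```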

Success in practice.

Often works well in practice. Eg: In document classification.

Discriminative counterpart

Its discriminative counterpart is the class of all linear classifiers in a certain feature space, which corresponds to logistic regression. That, in general, works better given a lot of samples.

Use exponential family models

Specification

For $ran(Y) = \{0,1\}$: Let $Pr(x|Y=i) \propto \exp(\langle w_i, \phi(x)\rangle)$ with normalizer $Z(w_i)$, and $Pr(Y=1)=p$.

So, the corresponding discriminative classifier is $\frac{Pr(Y=1|x)}{Pr(Y=0|x)} = \exp\left(\log\frac{p}{1-p}+\log\frac{Z(w_0)}{Z(w_1)}+\langle w_1-w_0,\phi(x)\rangle\right)$, which is a linear classifier.
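For completeness, a one-line Bayes-rule derivation of this form:

$$\frac{Pr(Y=1|x)}{Pr(Y=0|x)}=\frac{p\,e^{\langle w_1,\phi(x)\rangle}/Z(w_1)}{(1-p)\,e^{\langle w_0,\phi(x)\rangle}/Z(w_0)}=\exp\left(\log\frac{p}{1-p}+\log\frac{Z(w_0)}{Z(w_1)}+\langle w_1-w_0,\phi(x)\rangle\right)$$

So the log odds are affine in $\phi(x)$: exactly the logistic regression form.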

The corresponding discriminative classifier can be deduced directly using logistic regression.

Tree structure assumptions

In estimating, it is important to use the family of tree structured graphical models: we can't tractably compute $Z(w)$ otherwise. Estimating $w_i$ can be done efficiently by computing the maximum spanning tree of a graph over the feature nodes, with edges weighted by mutual information (the Chow-Liu algorithm).

Otherwise, mixtures of trees are also used.
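A rough sketch of the Chow-Liu step mentioned above, assuming scipy is available: build the pairwise empirical mutual-information matrix and take its maximum spanning tree (here via `minimum_spanning_tree` on negated weights; the data-generating chain is made up).

```python
# Rough sketch of the Chow-Liu step, assuming scipy is available: the tree is the
# maximum spanning tree of the pairwise empirical mutual-information graph
# (obtained here via minimum_spanning_tree on negated weights; exactly-zero MI
# values would be treated as missing edges by the dense-input convention).
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def mutual_information(a, b):
    """Empirical MI between two integer-coded discrete variables."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    for u, v in zip(a, b):
        joint[u, v] += 1
    joint /= joint.sum()
    pa, pb = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return (joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum()

rng = np.random.default_rng(3)
n, d = 500, 4
data = np.zeros((n, d), dtype=int)
data[:, 0] = rng.integers(0, 2, n)
for j in range(1, d):                                 # chain-structured ground truth
    flip = rng.random(n) < 0.1
    data[:, j] = np.where(flip, 1 - data[:, j - 1], data[:, j - 1])

mi = np.zeros((d, d))
for i in range(d):
    for j in range(i + 1, d):
        mi[i, j] = mutual_information(data[:, i], data[:, j])

tree = minimum_spanning_tree(-mi)                     # max spanning tree via negation
print(np.transpose(np.nonzero(tree.toarray())))       # recovered edges, ideally the chain
```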

Latent variable models: Expectation Maximization (EM) alg

Problem

We have an observation X=x and want to deduce the label Y.

Tough to optimize likelihood

We want to $\max_w \log L(w|X=x) = \max_w \log \sum_y f_{X,Y|w}(x,y)$, but this expression often turns out to be hard to maximize due to non-convexity / non-smoothness. Suppose that this is the case. Also suppose that the complete-data likelihood $L(w|X,Y) = f_{X,Y|w}(X,Y)$ is easy to maximize.

So, we resort to local optimization of a surrogate function starting from an initial guess of w.

Examples

Maybe we want to find a parameter $w$ giving weights to a set of fixed Gaussians. Here, $Y$ can be the vector of ids of the Gaussians from which the observed data $X$ comes.

A more common and important application is in estimating HMM parameters.

Iterative algorithm

Suppose that you are given $w^{(i)}$. We want to obtain $w^{(i+1)}$ such that $L(w^{(i+1)}) \geq L(w^{(i)})$.

Intuition

The basic idea is to do the following repeatedly: at the point $w^{(i)}$, find a tractable and approximate surrogate $Q(w|w^{(i)})$ for $L(w|X)$, and maximize it to get a ‘better’ $w^{(i+1)}$.

Consider $Q(w|w^{(i)})$ from the E-step below. $Q(w|w^{(i)})$ is the expectation, over $Y$ distributed according to $w^{(i)}$, of the log likelihood of $w$ given $(X,Y)$. This seems to be a reasonable substitute for $L(w|X)$.

E-step

Take $Q(w|w^{(i)}) = E_{y \sim w^{(i)}}[\log f_{X,Y|w}(x,y)]$ to measure the goodness of $w$ in producing $X$, under the current belief about $Y$.

M-step

Set $w^{(i+1)} = \arg\max_w Q(w|w^{(i)})$.
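A minimal sketch of these two steps for the earlier example (mixture weights over fixed Gaussians, with $Y$ the latent component ids); the component parameters and data below are made up, and scipy is assumed available.

```python
# Minimal sketch of the E and M steps for the earlier example: estimating mixture
# weights w over a set of *fixed* Gaussians, with Y the latent component ids.
# Component parameters and data below are made up.
import numpy as np
from scipy.stats import norm

means, stds = np.array([-2.0, 0.0, 3.0]), np.array([1.0, 0.5, 1.0])   # fixed components
true_w = np.array([0.2, 0.5, 0.3])

rng = np.random.default_rng(4)
y = rng.choice(3, size=2000, p=true_w)
x = rng.normal(means[y], stds[y])

w = np.ones(3) / 3                                    # initial guess w^(0)
for _ in range(100):
    # E-step: responsibilities Pr(Y=j | x_i, w^(i))
    lik = norm.pdf(x[:, None], means, stds)           # shape (n, 3)
    resp = w * lik
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: maximizing Q(w | w^(i)) over the mixture weights gives the mean responsibility
    w = resp.mean(axis=0)

print(w)                                              # close to true_w
```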

Analysis

Maximizing an approximation of the likelihood

Instead, construct a function $Q(w)$ which lower bounds $\log L(w|X)$; then maximize it to get $w^{(i+1)}$; repeat.

Q(w) is a lower bound

$Q(w)$ is a lower bound for $\log L(w|x)$.

Proof

Regardless of how $Y|w^{(i)}$ is distributed, $Q(w) = E_y[\log L(w|x,y)] \leq \log L(w|x)$, because $E_t[\log t] \leq \log \max_{t \in T} t \leq \log \sum_{t \in T} t$ (for nonnegative $t$).

Convergence

$Q(\cdot)$ lower bounds $\log L(\cdot)$, but we cannot guarantee that the $\max_w Q(\cdot)$ step does not lead us away from the local maximum. So, monotonic convergence is not guaranteed. \chk