01 Distribution models

Discrete L: Response probability: Discriminative models

Boolean valued functions

One can use boolean valued functions to get deterministic models of the form y=f(x). These functions are considered in the boolean functions survey and the computational learning theory survey.

Probability from regression models

Take any (continuous variable regression) model $f: X \to [0,1]$. Such a model can be interpreted as modeling the probability distribution $f_L$ of the label, e.g. $f(x) = Pr(L=1|x)$ in the binary case.

Advantages of modeling probability

The classifier doesn’t care whether $C_1$ is called class 1 or class 100. So, modeling the probability is better than solving a regression problem with the numeric label $y$ as the target.

Model numeric labels with regression models

One may use regression models together with an appropriate round-off function to model discrete numerical labels.

Dependence on choice of ran(Y)

For the same k-classification problem, different choices of numeric labels $Y$ corresponding to $\{L_i\}$ can yield different classifiers. Ideally, the classifier should be independent of the choice of labels. So, logistic regression is preferred.

Eg: For a binary classification problem, picking $L_i \in \{\pm 1\}$ yields a different model from picking $L_i \in \{N/N_1, -N/N_2\}$ (where $N_j$ is the number of samples in class $j$), and the latter yields Fisher's linear discriminant!

y in 1-of-k binary encoding format

Make the matrix $X$ with rows $[1\ x_i^T]$. Make $Y$ with rows $y_i^T$. Want to find parameters $W$ such that $XW \approx Y$. Can try $\min_W \|XW - Y\|_F^2$; get the solution $(X^TX)\hat{W} = X^TY$. But $X\hat{W}$ can have negative numbers which approximate $Y$; so not a very desirable technique.
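A small numeric illustration of this drawback: a sketch in numpy on made-up blob data, showing that the least-squares fit of a 1-of-k target matrix need not produce valid probabilities.

```python
# Sketch (numpy; made-up blob data): least-squares fit of a 1-of-k target matrix,
# showing that the fitted scores X @ W_hat need not be valid probabilities.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 120, 2, 3
labels = rng.integers(0, k, size=n)
X_raw = rng.normal(size=(n, d)) + 5.0 * labels[:, None]   # three separated blobs
X = np.hstack([np.ones((n, 1)), X_raw])                   # rows [1, x_i^T]
Y = np.eye(k)[labels]                                     # rows y_i^T (1-of-k)

# Solves min_W ||XW - Y||_F^2 (equivalently the normal equations X^T X W = X^T Y).
W_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

scores = X @ W_hat
print("min fitted score:", scores.min())       # typically negative
print("row sums:", scores.sum(axis=1)[:3])     # each ~1, but entries can lie outside [0,1]
```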

Logistic model

Got k-class classification problem. Want to model class probabilities or log odds and make classification decision.

Log linear model for class probabilities

$\forall i \in [1:k]: Pr(C=i|x) \propto e^{w_{i0} + w_i^Tx}$. So, $Pr(C=i|x) = \frac{e^{w_{i0}+w_i^Tx}}{\sum_j e^{w_{j0}+w_j^Tx}}$.

\exclaim{But this is overparametrized}: The choice of $w$ is constrained by the fact that specifying $Pr(C=i|x)$ for $i = 1:k-1$ completely specifies the probability distribution.
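A minimal sketch of this log-linear model as code (the parameter values below are illustrative, not estimated from anything):

```python
# Minimal sketch of the log-linear model above: Pr(C=i|x) via a softmax over
# per-class scores w_{i0} + w_i^T x. Parameter values are illustrative only.
import numpy as np

def class_probs(x, w0, W):
    """w0: shape (k,) intercepts; W: shape (k, d) weights. Returns Pr(C=i|x)."""
    scores = w0 + W @ x                 # w_{i0} + w_i^T x for each class i
    scores -= scores.max()              # numerical stability; probabilities unchanged
    e = np.exp(scores)
    return e / e.sum()

x = np.array([1.0, -2.0])
w0 = np.array([0.1, 0.0, -0.3])
W = np.array([[0.5, 1.0], [-0.2, 0.3], [0.0, -1.0]])
p = class_probs(x, w0, W)
print(p, p.sum())                       # k probabilities summing to 1
```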

Equivalent form: model log odds

$\forall i \in [1:k-1]: \log\frac{Pr(C=i|x)}{Pr(C=k|x)} = w_{i0} + w_i^Tx$.

Get: $Pr(C=i|x) = \frac{e^{w_{i0}+w_i^Tx}}{1+\sum_{j\neq k} e^{w_{j0}+w_j^Tx}}$, $Pr(C=k|x) = \frac{1}{1+\sum_{j\neq k} e^{w_{j0}+w_j^Tx}}$.

Same as the model described in the previous subsubsection, with all $Pr(C=i)$ scaled to ensure that $Pr(C=i|x) = e^{w_{i0}+w_i^Tx}\, Pr(C=k|x)$: done by ensuring that $w_k = 0$. Thus taking care of the earlier overparametrization!
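A quick numeric check of this reparametrization (illustrative parameters): subtracting the k-th class's scores from every class, so that $w_k = 0$, leaves the probabilities unchanged.

```python
# Quick numeric check of the reparametrization: subtracting the k-th class's
# parameters from every class (so that w_k = 0) leaves the probabilities unchanged.
# Parameter values are illustrative only.
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

x = np.array([0.7, -1.2])
w0 = np.array([0.4, -0.1, 0.2])
W = np.array([[1.0, 0.5], [0.3, -0.8], [-0.6, 0.1]])

scores = w0 + W @ x
scores_wk0 = (w0 - w0[-1]) + (W - W[-1]) @ x      # reparametrized so w_k = 0
print(np.allclose(softmax(scores), softmax(scores_wk0)))   # True
```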

Symmetric notation

Let $x \equiv (1, x)$, $w_i \equiv (w_{i0}, w_i)$. $$Pr(C=i|x)=\frac{e^{\sum_{c\in\{1,\dots,k-1\}} w_c^Tx\, I[c=i]}}{1+\sum_{j\neq k}e^{\sum_{c} w_c^Tx\, I[c=j]}}$$

2-class case

For the 2-class case, these are logistic sigmoid functions, hence the name.

Risk factors interpretation

$Pr(C_i|x)$ is modeled as a sigmoid function which $\to 0$ as $w_i^Tx \to -\infty$ and $\to 1$ as $w_i^Tx \to \infty$. So, one can consider $w_i$ as the vector of weights assigned to the features $\{x_j\}$. The sign of each weight usually indicates the type of correlation with $C_i$, but could be reversed in order to compensate for the weight given to other features. Eg: $C_i$ could be ‘has heart disease’, and the features may be liquor, fat and tobacco consumption levels.

As a linear discriminant

Consider the binary classification case. Here, $\log\frac{Pr(C=1|x)}{1-Pr(C=1|x)} = w_0 + w^Tx$. So, $w_0 + w^Tx > 0 \Leftrightarrow Pr(C=1|x) > Pr(C=0|x)$.
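A tiny illustration with made-up weights: the resulting decision rule is just a sign test on $w_0 + w^Tx$.

```python
# Tiny illustration (made-up weights): the logistic model's decision rule in the
# binary case is a sign test on the linear discriminant w0 + w^T x.
import numpy as np

def predict(x, w0, w):
    return int(w0 + w @ x > 0)          # 1 iff Pr(C=1|x) > Pr(C=0|x)

w0, w = -1.0, np.array([2.0, -0.5])
print(predict(np.array([1.0, 1.0]), w0, w))   # 2.0 - 0.5 - 1.0 = 0.5 > 0  -> class 1
print(predict(np.array([0.0, 1.0]), w0, w))   # -0.5 - 1.0 = -1.5 <= 0     -> class 0
```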

Estimating parameters

Given observations $(x^{(i)}, c^{(i)})$, find $w$ to $\max_w \prod_i Pr(c^{(i)}|x^{(i)}, w)$: maximum likelihood estimation.
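A sketch of such maximum-likelihood fitting for the binary case, using plain gradient ascent on the log likelihood (the notes do not prescribe a particular optimizer; this is just one simple choice, on made-up data):

```python
# Sketch of maximum-likelihood fitting for the binary case via plain gradient
# ascent on the log likelihood (one possible optimizer; not prescribed by the notes).
import numpy as np

def fit_logistic(X, c, lr=0.1, iters=2000):
    """X: (n, d) with a leading column of ones; c: (n,) labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))      # Pr(c=1 | x, w)
        grad = X.T @ (c - p)                  # gradient of sum_i log Pr(c_i | x_i, w)
        w += lr * grad / len(c)
    return w

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=(n, 1))
c = (x[:, 0] + 0.3 * rng.normal(size=n) > 0).astype(float)
X = np.hstack([np.ones((n, 1)), x])
print(fit_logistic(X, c))                     # learned [w_0, w_1]
```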

Sparsity of model parameters

Sometimes, we want $w$ to be sparse or group sparse. In this case, a lasso or group-lasso penalty is used when learning the parameters.
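A short sketch, assuming scikit-learn is available, of an $\ell_1$ (lasso-style) penalty producing a sparse $w$; the data here is synthetic.

```python
# Sketch, assuming scikit-learn is available: an l1 (lasso-style) penalty on
# logistic regression drives most coordinates of w to exactly zero.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]                 # only 3 informative features
y = (X @ w_true + 0.5 * rng.normal(size=300) > 0).astype(int)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
print(np.count_nonzero(clf.coef_))            # typically far fewer than 20 nonzero weights
```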

Discrete L: Response probability: Generative models

Latent variable model

Assume that the parameter $W=w$ actually generates a lower dimensional $L$, and that the observation set $X$ is generated from $L$ using some stochastic transformation which is independent of $w$.

L is called the latent variable.

Assume conditional independence of input variables

Aka Naive Bayes. $Pr(L|\phi(X)) \propto Pr(L)\, Pr(\phi(X)|L) = Pr(L)\prod_i Pr(\phi_i(X)|L)$. Here $Pr(\phi(X)|L) = \prod_i Pr(\phi_i(X)|L)$ is the assumption. The model parameters $Pr(\phi_i(X)|L)$ and $Pr(L)$ are estimated from the training set $\{(X_i, L_i)\}$.

Co-clustering in a way recovers things lost due to the ‘independence of probability of occurrence of features’ assumption. \tbc

One can conceive of a version of this classifier for the case where $L, \phi(X)$ are continuous. \oprob
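A minimal sketch of the estimation-by-counting described above, for discrete features; the toy data and the Laplace-style smoothing are illustrative choices, not from the notes.

```python
# Sketch of Naive Bayes estimation by counting for discrete features. The toy data
# and the Laplace-style smoothing are illustrative choices, not from the notes.
import numpy as np
from collections import defaultdict

def fit_naive_bayes(features, labels, alpha=1.0):
    """features: list of tuples of discrete values; labels: list of class labels."""
    class_counts = defaultdict(float)
    feat_counts = defaultdict(lambda: defaultdict(float))     # (i, label) -> value -> count
    for phi, l in zip(features, labels):
        class_counts[l] += 1
        for i, v in enumerate(phi):
            feat_counts[(i, l)][v] += 1
    prior = {l: c / len(labels) for l, c in class_counts.items()}
    def cond_prob(i, v, l):                                    # smoothed Pr(phi_i(X)=v | L=l)
        counts = feat_counts[(i, l)]
        return (counts.get(v, 0.0) + alpha) / (class_counts[l] + alpha * (len(counts) + 1))
    return prior, cond_prob

def predict(phi, prior, cond_prob):
    scores = {l: np.log(p) + sum(np.log(cond_prob(i, v, l)) for i, v in enumerate(phi))
              for l, p in prior.items()}
    return max(scores, key=scores.get)

features = [("sunny", "hot"), ("rainy", "cool"), ("sunny", "cool"), ("rainy", "hot")]
labels = ["out", "in", "out", "in"]
prior, cond = fit_naive_bayes(features, labels)
print(predict(("sunny", "hot"), prior, cond))                  # -> "out"
```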

Linear separator in some feature space

The decision boundary can be specified by $\log Pr(l_1) + \sum_i \log Pr(\phi_i(x)|l_1) = \log Pr(l_2) + \sum_i \log Pr(\phi_i(x)|l_2)$.

Apply the following mapping for variables: $y_{i,d} = I[\phi_i(x) = d]$; and create a new set of parameters: $w_{i,d} = \log Pr(\phi_i(X)=d|l_1) - \log Pr(\phi_i(X)=d|l_2)$, and $w_0 = \log Pr(l_1) - \log Pr(l_2)$. Now, the decision boundary is just $w_0 + w^Ty = 0$, which is a linear separator.
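A small numeric check of this conversion for two classes and two binary features; all parameter values below are made up.

```python
# Numeric check of the conversion for 2 classes and 2 binary features; all
# parameter values below are made up. cond[l][i][d] = Pr(phi_i(X)=d | l).
import numpy as np

prior = {1: 0.6, 2: 0.4}
cond = {1: [[0.7, 0.3], [0.2, 0.8]],
        2: [[0.4, 0.6], [0.5, 0.5]]}

w0 = np.log(prior[1]) - np.log(prior[2])
w = np.array([[np.log(cond[1][i][d]) - np.log(cond[2][i][d]) for d in range(2)]
              for i in range(2)]).ravel()                      # w_{i,d}

def to_indicators(phi):
    y = np.zeros((2, 2))
    for i, d in enumerate(phi):
        y[i, d] = 1.0                                          # y_{i,d} = I[phi_i(x) = d]
    return y.ravel()

phi = (0, 1)
linear_score = w0 + w @ to_indicators(phi)
nb_score = (np.log(prior[1]) + sum(np.log(cond[1][i][d]) for i, d in enumerate(phi))
            - np.log(prior[2]) - sum(np.log(cond[2][i][d]) for i, d in enumerate(phi)))
print(np.isclose(linear_score, nb_score))                      # True: same decision boundary
```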

Success in practice.

Often works well in practice. Eg: In document classification.

Discriminative counterpart

Its discriminative counterpart is the class of all linear classifiers in a certain feature space, which corresponds to logistic regression. That, in general, works better given a lot of samples.

Use exponential family models

Specification

For $ran(Y) = \{0,1\}$: Let $Pr(x|Y=i) \propto \exp(\langle w_i, \phi(x)\rangle)$ with normalizer $Z(w_i)$, and $Pr(Y=1)=p$.

So, the corresponding discriminative classifier is $\frac{Pr(Y=1|x)}{Pr(Y=0|x)} = \exp\left(\log\frac{p}{1-p}+\log\frac{Z(w_0)}{Z(w_1)}+\langle w_1-w_0,\phi(x)\rangle\right)$, which is a linear classifier.
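For completeness, a one-line Bayes-rule derivation of this form:

$$\frac{Pr(Y=1|x)}{Pr(Y=0|x)}=\frac{p\,e^{\langle w_1,\phi(x)\rangle}/Z(w_1)}{(1-p)\,e^{\langle w_0,\phi(x)\rangle}/Z(w_0)}=\exp\left(\log\frac{p}{1-p}+\log\frac{Z(w_0)}{Z(w_1)}+\langle w_1-w_0,\phi(x)\rangle\right)$$

So the log odds are affine in $\phi(x)$: exactly the logistic regression form.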

The corresponding discriminative classifier can be deduced directly using logistic regression.

Tree structure assumptions

In estimating, it is important to use the family of tree structured graphical models: we can't tractably compute $Z(w)$ otherwise. Estimating $w_i$ can be done efficiently by computing the maximum spanning tree of a graph over the feature nodes, with edges weighted by mutual information (the Chow-Liu algorithm).

Otherwise, mixtures of trees are also used.
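A rough sketch of the Chow-Liu step mentioned above, assuming scipy is available: build the pairwise empirical mutual-information matrix and take its maximum spanning tree (here via `minimum_spanning_tree` on negated weights; the data-generating chain is made up).

```python
# Rough sketch of the Chow-Liu step, assuming scipy is available: the tree is the
# maximum spanning tree of the pairwise empirical mutual-information graph
# (obtained here via minimum_spanning_tree on negated weights; exactly-zero MI
# values would be treated as missing edges by the dense-input convention).
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def mutual_information(a, b):
    """Empirical MI between two integer-coded discrete variables."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    for u, v in zip(a, b):
        joint[u, v] += 1
    joint /= joint.sum()
    pa, pb = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return (joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum()

rng = np.random.default_rng(3)
n, d = 500, 4
data = np.zeros((n, d), dtype=int)
data[:, 0] = rng.integers(0, 2, n)
for j in range(1, d):                                 # chain-structured ground truth
    flip = rng.random(n) < 0.1
    data[:, j] = np.where(flip, 1 - data[:, j - 1], data[:, j - 1])

mi = np.zeros((d, d))
for i in range(d):
    for j in range(i + 1, d):
        mi[i, j] = mutual_information(data[:, i], data[:, j])

tree = minimum_spanning_tree(-mi)                     # max spanning tree via negation
print(np.transpose(np.nonzero(tree.toarray())))       # recovered edges, ideally the chain
```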

Latent variable models: Expectation Maximization (EM) alg

Problem

We have an observation X=x and want to deduce the label Y.

Tough to optimize likelihood

We want to $\max_w \log L(w|X=x) = \max_w \log \sum_y f_{X,Y|w}(x,y)$, but this expression often turns out to be hard to maximize due to non-convexity / non-smoothness. Suppose that this is the case. Also suppose that the complete-data likelihood $L(w|X,Y) = f_{X,Y|w}(X,Y)$ is easy to maximize.

So, we resort to local optimization of a surrogate function starting from an initial guess of w.

Examples

Maybe we want to find a parameter $w$ giving weights to a set of fixed Gaussians. Here, $Y$ can be the vector of ids of the Gaussians from which the observed data $X$ comes.

A more common and important application is in estimating HMM parameters.

Iterative algorithm

Suppose that you are given $w^{(i)}$. We want to obtain $w^{(i+1)}$ such that $L(w^{(i+1)}) \geq L(w^{(i)})$.

Intuition

The basic idea is to do the following repeatedly: at the point $w^{(i)}$, find a tractable and approximate surrogate $Q(w|w^{(i)})$ for $L(w|X)$, and maximize it to get a ‘better’ $w^{(i+1)}$.

Consider $Q(w|w^{(i)})$ from the E-step below. $Q(w|w^{(i)})$ is the expectation, over $Y$ distributed according to $w^{(i)}$, of the log likelihood of $w$ given $(X,Y)$. This seems to be a reasonable substitute for $L(w|X)$.

E-step

Take $Q(w|w^{(i)}) = E_{y \sim w^{(i)}}[\log f_{X,Y|w}(x,y)]$ to measure the goodness of $w$ in producing $X$, under the current belief about $Y$.

M-step

Set $w^{(i+1)} = \arg\max_w Q(w|w^{(i)})$.
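A minimal sketch of these two steps for the earlier example (mixture weights over fixed Gaussians, with $Y$ the latent component ids); the component parameters and data below are made up, and scipy is assumed available.

```python
# Minimal sketch of the E and M steps for the earlier example: estimating mixture
# weights w over a set of *fixed* Gaussians, with Y the latent component ids.
# Component parameters and data below are made up.
import numpy as np
from scipy.stats import norm

means, stds = np.array([-2.0, 0.0, 3.0]), np.array([1.0, 0.5, 1.0])   # fixed components
true_w = np.array([0.2, 0.5, 0.3])

rng = np.random.default_rng(4)
y = rng.choice(3, size=2000, p=true_w)
x = rng.normal(means[y], stds[y])

w = np.ones(3) / 3                                    # initial guess w^(0)
for _ in range(100):
    # E-step: responsibilities Pr(Y=j | x_i, w^(i))
    lik = norm.pdf(x[:, None], means, stds)           # shape (n, 3)
    resp = w * lik
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: maximizing Q(w | w^(i)) over the mixture weights gives the mean responsibility
    w = resp.mean(axis=0)

print(w)                                              # close to true_w
```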

Analysis

Maximizing an approximation of the likelihood

Instead, construct a function $Q(w)$ which lower bounds $\log L(w|X)$; then maximize it to get $w^{(i+1)}$; repeat.

Q(w) is a lower bound

$Q(w)$ is a lower bound for $\log L(w|x)$.

Proof

Regardless of how $Y|w^{(i)}$ is distributed, $Q(w) = E_y[\log L(w|x,y)] \leq \log L(w|x)$, because $E_t[\log t] \leq \log \max_{t \in T} t \leq \log \sum_{t \in T} t$ (for nonnegative $t$).

Convergence

$Q(\cdot)$ lower bounds $\log L(\cdot)$, but we cannot guarantee that the $\max_w Q(\cdot)$ step does not lead us away from the local maximum. So, monotonic convergence is not guaranteed. \chk