Discrete L: Response probability: Discriminative models
Boolean valued functions
One can use boolean-valued functions to get deterministic models of the form $y = f(x)$, where $f: \mathrm{ran}(X) \to \mathrm{ran}(Y)$ maps each input directly to a class label.
Probability from regression models
Take any (continuous-variable regression) model and use it to model the response probability $\Pr(Y = y \mid X = x)$ directly, rather than the label itself.
Advantages of modeling probability
The classifier doesn't care whether the estimated probability is, say, 0.6 or 0.99: both yield the same decision. But the probability itself conveys the confidence of that decision, and it lets one shift the decision threshold when misclassification costs are asymmetric.
Model numeric labels with regression models
One may use regression models together with an appropriate round-off function to model discrete numerical labels.
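A minimal sketch of this round-off approach (NumPy only; the data, the admissible label range {0,...,3}, and the nearest-integer rounding rule are illustrative assumptions, not from the source):

import numpy as np

# Illustrative data: integer labels in {0,...,3} driven by a 2-d input.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.clip(np.round(X[:, 0] + 0.5 * X[:, 1] + 1.5), 0, 3)

# Ordinary least-squares regression on the numeric labels (intercept column added).
Xb = np.hstack([X, np.ones((X.shape[0], 1))])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# Round-off function: snap the continuous prediction to the nearest admissible label.
y_hat = np.clip(np.round(Xb @ w), 0, 3)
print("training accuracy:", np.mean(y_hat == y))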
Dependence on choice of ran(Y)
For the same k-class classification problem, different choices of ran(Y), i.e. different numeric encodings of the labels, correspond to different regression problems, and hence can yield different models.
Eg: For a binary classification problem, one may pick ran(Y) = {0, 1} and threshold the fitted regression value at 1/2.
y in 1-of-k binary encoding format
Make the matrix X with rows $x_i^T$ and the response matrix Y with rows $y_i^T$, each a 1-of-k indicator vector; fit a multi-output linear regression of Y on X and classify a new point to the class whose column has the largest fitted value, as in the sketch below.
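A minimal sketch of this construction (NumPy only; the three-cluster toy data and the argmax decision rule are my illustrative choices):

import numpy as np

def one_hot(labels, k):
    """Encode integer labels 0..k-1 as rows of a 1-of-k indicator matrix."""
    Y = np.zeros((len(labels), k))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

# Illustrative 3-class data: three clusters in the plane.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2)) + np.repeat(np.eye(3, 2) * 3, 50, axis=0)
labels = np.repeat(np.arange(3), 50)

Xb = np.hstack([X, np.ones((len(X), 1))])   # rows x_i^T, plus intercept
Y = one_hot(labels, k=3)                    # rows y_i^T in 1-of-k encoding

# Multi-output least squares: one column of W per class.
W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)

# Classify by the largest fitted indicator value.
pred = np.argmax(Xb @ W, axis=1)
print("training accuracy:", np.mean(pred == labels))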
Logistic model
Given a k-class classification problem; want to model the class probabilities (or log odds) and then make the classification decision.
Log linear model for class probabilities
\exclaim{But this is overparametrized}: The choice of the $w_c$ is constrained by the fact that specifying $k-1$ of the class probabilities determines the remaining one (they must sum to 1); so one weight vector, say $w_k$, can be fixed to $0$ without loss of generality.
Equivalent form: model log odds
Get: $\log \dfrac{\Pr(Y=c \mid x)}{\Pr(Y=k \mid x)} = w_c^T x$ for $c = 1, \dots, k-1$.
Same as the model described in the previous subsubsection, with all probabilities measured relative to the reference class $k$ (equivalently, with $w_k = 0$).
Symmetric notation
Let $\Pr(Y=c \mid x) = \dfrac{e^{w_c^T x}}{\sum_{j=1}^{k} e^{w_j^T x}}$ for every class $c$; this treats all classes symmetrically, at the cost of the overparametrization noted above.
2-class case
For the 2-class case these reduce to the logistic sigmoid function $\sigma(t) = 1/(1 + e^{-t})$ applied to $(w_1 - w_2)^T x$, whence the name.
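A small sketch of the symmetric (softmax) form and its two-class reduction to the sigmoid; the weight matrix and input are made-up numbers, and pinning the last weight vector to 0 illustrates the reference-class convention:

import numpy as np

def softmax(scores):
    """Class probabilities exp(w_c^T x) / sum_j exp(w_j^T x), computed stably."""
    s = scores - scores.max()            # subtracting a constant leaves the ratio unchanged
    e = np.exp(s)
    return e / e.sum()

def sigmoid(t):
    """2-class special case: probability of one class is 1 / (1 + exp(-t)),
    where t is the difference of the two classes' score functions."""
    return 1.0 / (1.0 + np.exp(-t))

x = np.array([1.0, 2.0, 1.0])                    # includes a constant feature
W = np.array([[0.5, -0.2, 0.1],                  # one weight vector per class
              [-0.3, 0.4, 0.0],
              [0.0, 0.0, 0.0]])                  # reference class pinned to 0
print(softmax(W @ x))                            # k-class probabilities
print(sigmoid((W[0] - W[1]) @ x))                # class 0 vs class 1, two-class view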
Risk factors interpretation
Each coefficient $w_{c,j}$ is the change in the log odds per unit change in feature $j$, so $e^{w_{c,j}}$ is the odds ratio associated with that 'risk factor'.
As a linear discriminant
Consider the binary classification case. Here the decision boundary $\Pr(Y=1 \mid x) = \Pr(Y=0 \mid x)$ is the hyperplane $w^T x = 0$, so the logistic model acts as a linear discriminant.
Estimating parameters
Given observations $\{(x_i, y_i)\}_{i=1}^n$, estimate $w$ by maximizing the conditional log likelihood $\sum_i \log \Pr(y_i \mid x_i; w)$. This objective is concave and is typically optimized by gradient ascent or Newton's method (iteratively reweighted least squares); a sketch follows.
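A minimal sketch of maximum-likelihood fitting for the binary case by plain gradient ascent (NumPy only; the learning rate, iteration count, and toy data are my assumptions — Newton/IRLS would be the more standard solver):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Maximize sum_i log P(y_i | x_i; w) for binary y in {0,1} by gradient ascent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        p = sigmoid(X @ w)          # current P(Y=1 | x_i)
        grad = X.T @ (y - p)        # gradient of the log likelihood
        w += lr * grad / n
    return w

# Illustrative data with an intercept column.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(100, 2)), np.ones((100, 1))])
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w = fit_logistic(X, y)
print("training accuracy:", np.mean((sigmoid(X @ w) > 0.5) == y))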
Sparsity of model parameters
Sometimes we want $w$ to be sparse or group sparse. In that case, an $\ell_1$ (lasso) or group-lasso penalty is added to the negative log likelihood when learning the parameters.
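A short sketch of the sparse case, assuming scikit-learn is available; the penalty strength C=0.1 and the toy data (only the first two of twenty features matter) are illustrative:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = (X[:, 0] - 2 * X[:, 1] + 0.3 * rng.normal(size=300) > 0).astype(int)

# L1 (lasso) penalty on the negative log likelihood drives most weights to exactly 0.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("nonzero weights:", np.flatnonzero(clf.coef_[0]))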
Discrete L: Response probability: Generative models
Latent variable model
Assume that the parameters of the distribution generating $X$ are selected by a latent variable, namely the class label: first $Y=c$ is drawn from a class prior, then $X \sim \Pr(X \mid Y=c)$. Classification uses the posterior $\Pr(Y=c \mid X)$ obtained from Bayes' rule.
Assume conditional independence of input variables
Aka Naive Bayes.
Co-clustering in a way recovers what is lost due to the 'independence of feature occurrence probabilities' assumption. \tbc
One can conceive of a version of this classifier for the case where the features are continuous, e.g. by modeling each class-conditional $\Pr(X_j \mid Y=c)$ as a Gaussian.
Linear separator in some feature space
The decision boundary can be specified by a hyperplane in a suitable feature space.
Apply the following mapping for the variables: send each input variable to its indicator (or count) features; the log posterior odds are then linear in the mapped variables, as made explicit below.
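One standard way to write this out for binary features (the notation $\theta_{jc} = \Pr(X_j = 1 \mid Y = c)$ and $\pi_c = \Pr(Y = c)$ is mine, not the source's):

\[
\log \frac{\Pr(Y=1 \mid x)}{\Pr(Y=0 \mid x)}
  = \log\frac{\pi_1}{\pi_0}
  + \sum_j \left[ x_j \log\frac{\theta_{j1}}{\theta_{j0}}
  + (1 - x_j)\log\frac{1-\theta_{j1}}{1-\theta_{j0}} \right]
  = w_0 + \sum_j w_j x_j ,
\]
\[
\text{where } w_j = \log\frac{\theta_{j1}(1-\theta_{j0})}{\theta_{j0}(1-\theta_{j1})},
\qquad
w_0 = \log\frac{\pi_1}{\pi_0} + \sum_j \log\frac{1-\theta_{j1}}{1-\theta_{j0}} .
\]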
Success in practice.
Often works well in practice, e.g. in document classification; a toy sketch follows.
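A toy document-classification sketch, assuming scikit-learn; the documents, labels, and test string are invented for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap pills buy now", "meeting agenda attached",
        "buy cheap watches now", "project meeting moved to friday"]
labels = [1, 0, 1, 0]                      # 1 = spam, 0 = not spam

vec = CountVectorizer()
X = vec.fit_transform(docs)                # bag-of-words counts
clf = MultinomialNB().fit(X, labels)       # per-class word distributions, features treated as independent
test = vec.transform(["cheap watches for the meeting"])
print(clf.predict(test), clf.predict_proba(test))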
Discriminative counterpart
Its discriminative counterpart is the class of all linear classifiers in that feature space, which corresponds to logistic regression; that, in general, works better when many samples are available.
Use exponential family models
Specification
For each class $c$, model the class-conditional density as a member of an exponential family with a shared sufficient statistic $T(x)$: $\Pr(x \mid Y=c) = h(x)\, e^{\eta_c^T T(x) - A(\eta_c)}$, with class prior $\pi_c$.
So, the corresponding discriminative classifier can be deduced directly: the posterior log odds are linear in $T(x)$, i.e. a logistic regression model in the feature space defined by $T(x)$; see the derivation below.
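The deduction, written out (my notation, following the specification above, for the two-class case):

\[
\log\frac{\Pr(Y=1\mid x)}{\Pr(Y=0\mid x)}
 = \log\frac{\pi_1 \, h(x)\, e^{\eta_1^T T(x) - A(\eta_1)}}
            {\pi_0 \, h(x)\, e^{\eta_0^T T(x) - A(\eta_0)}}
 = (\eta_1 - \eta_0)^T T(x) + \log\frac{\pi_1}{\pi_0} - A(\eta_1) + A(\eta_0),
\]
which is linear in the sufficient statistics $T(x)$: exactly a logistic-regression model in the feature space $T(x)$.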
Tree structure assumptions
In estimating the class-conditional distributions, it is important to restrict to the family of tree-structured graphical models: we can't tractably compute the likelihoods and marginals needed for estimation and inference with general graph structures, whereas for trees these computations are tractable.
Otherwise, mixtures of trees are also used.
Latent variable models: Expectation Maximization (EM) alg
Problem
We have an observation $x$ (or a set of observations), assumed to be generated by a model with parameters $w$ together with unobserved (hidden) variables $z$.
Tough to optimize likelihood
We want to maximize the (marginal) log likelihood $L(w) = \log \Pr(x \mid w) = \log \sum_z \Pr(x, z \mid w)$; the sum over hidden variables inside the log makes this hard to optimize directly.
So, we resort to local optimization of a surrogate function starting from an initial guess of the parameters, $w_0$.
Examples
Maybe we want to find the parameters of a mixture model (e.g. a mixture of Gaussians), where the component that generated each observation is hidden.
A more common and important application is in estimating HMM parameters.
Iterative algorithm
Suppose that you are given the current parameter estimate $w_t$; each iteration of the algorithm produces an improved estimate $w_{t+1}$, and this is repeated until convergence.
Intuition
The basic idea is to do the following repeatedly: at the current point $w_t$, build a surrogate function that lower-bounds the log likelihood, and maximize the surrogate to get $w_{t+1}$.
Consider the posterior over the hidden variables under the current parameters, $\Pr(z \mid x, w_t)$: this is what the surrogate is built from.
E-step
Take $Q(w) = \mathbb{E}_{z \sim \Pr(z \mid x, w_t)}\left[\log \Pr(x, z \mid w)\right]$: the expected complete-data log likelihood, with the expectation taken under the posterior computed from the current parameters $w_t$.
M-step
Set $w_{t+1} = \arg\max_w Q(w)$. A sketch of both steps for a Gaussian mixture follows.
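A compact sketch of the E- and M-steps for a one-dimensional mixture of Gaussians (NumPy only; the number of components, the initialization, and the toy data are my choices):

import numpy as np

def em_gmm_1d(x, k=2, n_iter=100):
    """EM for a 1-d mixture of k Gaussians; the hidden variable is which
    component generated each point.  Returns mixing weights, means, variances."""
    rng = np.random.default_rng(0)
    pi = np.full(k, 1.0 / k)
    mu = rng.choice(x, size=k, replace=False)
    var = np.full(k, np.var(x))
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point.
        logp = (-0.5 * (x[:, None] - mu) ** 2 / var
                - 0.5 * np.log(2 * np.pi * var) + np.log(pi))
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: maximize the expected complete-data log likelihood Q(w).
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

# Illustrative data: two well-separated Gaussian clusters.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
print(em_gmm_1d(x))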
Analysis
Maximizing an approximation of the likelihood
Instead, construct a function $Q(w)$ which lower-bounds $L(w)$ and is easier to maximize, and maximize $Q$ in place of $L$.
Q(w) is a lower bound
$Q(w)$ is a lower bound for $L(w)$: $Q(w) \le L(w)$ for every $w$.
Proof
Regardless of how the distribution $q(z)$ over the hidden variables is chosen, Jensen's inequality gives a lower bound on $L(w)$; choosing $q(z) = \Pr(z \mid x, w_t)$ makes that bound exactly $Q(w)$, as the derivation below shows.
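The derivation, under the assumption that the hidden variable $z$ is discrete (so that the entropy term is nonnegative):

\[
L(w) = \log \Pr(x \mid w)
     = \log \sum_{z} q(z)\,\frac{\Pr(x, z \mid w)}{q(z)}
     \;\ge\; \sum_{z} q(z)\,\log \frac{\Pr(x, z \mid w)}{q(z)}
     \quad \text{(Jensen's inequality; $\log$ is concave)}
\]
\[
     = \sum_{z} q(z)\,\log \Pr(x, z \mid w) + H(q)
     \;\ge\; \sum_{z} q(z)\,\log \Pr(x, z \mid w),
\]
since $H(q) \ge 0$ for discrete $q$.  Taking $q(z) = \Pr(z \mid x, w_t)$ makes the last expression exactly $Q(w)$; hence $L(w) \ge Q(w)$ for every $w$.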
Convergence
$Q()$ lower bounds $L()$, but we cannot guarantee that the iterates converge to a global maximum of $L$. One can show that the likelihood never decreases across iterations ($L(w_{t+1}) \ge L(w_t)$), so EM converges to a local maximum or stationary point, which depends on the initial guess $w_0$.