04 Affinity modeling

Problem

One wants to probabilistically model ‘affinities’ (joint and conditional probabilities) between entities of two or more types. Each entity type is modeled by a discrete random variable (say W and D).

Motivation

Besides common motivations for modeling joint distributions of random variables, one may want to model affinities probabilistically in order to get low dimensional representations of one or both of these entities (motivations for which are described in the dimensionality reduction chapter of the statistics survey).

Non-probabilistic ways

These are considered in the latent factor analysis section in the dimensionality reduction chapter of the statistics survey.

Eg: Latent Semantic Analysis (LSA), aka Latent Semantic Indexing (LSI): Use SVD to get factors for documents and words.
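As a rough illustration of the SVD step, the following sketch factors a small term-document count matrix with NumPy. The counts, the choice of k = 2 factors, and the names `word_factors` and `doc_factors` are assumptions made for this example; how the singular values are split between word and document factors is also just one convention.

```python
import numpy as np

# Toy term-document count matrix (rows: words, columns: documents).
# The counts are made up for illustration.
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 2],
], dtype=float)

k = 2  # number of latent factors to keep (an arbitrary choice here)

# Thin SVD: X = U @ diag(s) @ Vt, singular values in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

word_factors = U[:, :k] * s[:k]  # one k-dimensional row per word
doc_factors = Vt[:k, :].T        # one k-dimensional row per document

# Rank-k approximation of X rebuilt from the two sets of factors.
X_k = word_factors @ doc_factors.T
```

Each row of `doc_factors` is a low dimensional representation of a document, and each row of `word_factors` plays the same role for a word.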

pLSA

Probabilistic LSA.

Aspect model

Each document is a convex combination (mixture) of topics; each topic defines a distribution over words; each word is drawn from this mixture of distributions: $\Pr(w|d) = \sum_t \Pr(t|d)\Pr(w|t)$. So, $\Pr(w,d) = \Pr(d)\sum_t \Pr(t|d)\Pr(w|t) = \sum_t \Pr(t)\Pr(d|t)\Pr(w|t)$: observe the two factorizations.
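The two factorizations of the joint distribution can be checked numerically. The sketch below uses made-up parameters (2 topics, 3 words, 2 documents); all array names and values are assumptions for illustration, and Pr(t) and Pr(d|t) are derived from the other quantities with Bayes' rule.

```python
import numpy as np

# Hypothetical toy parameters: 2 topics, 3 words, 2 documents.
p_w_given_t = np.array([[0.7, 0.2, 0.1],    # Pr(w | t=0)
                        [0.1, 0.3, 0.6]])   # Pr(w | t=1)
p_t_given_d = np.array([[0.9, 0.1],         # Pr(t | d=0)
                        [0.4, 0.6]])        # Pr(t | d=1)
p_d = np.array([0.5, 0.5])                  # Pr(d)

# Pr(w | d) = sum_t Pr(t | d) Pr(w | t): each row mixes the topic rows.
p_w_given_d = p_t_given_d @ p_w_given_t     # shape (documents, words)

# First factorization: Pr(w, d) = Pr(d) sum_t Pr(t | d) Pr(w | t).
joint_1 = p_d[:, None] * p_w_given_d

# Second factorization: Pr(w, d) = sum_t Pr(t) Pr(d | t) Pr(w | t),
# with Pr(t) and Pr(d | t) obtained from Bayes' rule.
p_t = p_d @ p_t_given_d                                       # Pr(t)
p_d_given_t = (p_t_given_d * p_d[:, None]).T / p_t[:, None]   # shape (topics, documents)
joint_2 = np.einsum("t,td,tw->dw", p_t, p_d_given_t, p_w_given_t)
```

Both `joint_1` and `joint_2` hold the same table Pr(w, d), indexed by document and then by word. Each row of `p_t_given_d` is also the representation of a document in topic space.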


Modeling assumptions

Bag of words assumption: given the topic, words are chosen independently. Conditional independence: given a mixture of topics (d), $w_1|t \perp w_2|t$.

Dimensionality reduction

Each document, which was earlier a vector in the vocabulary space, is now a vector in the topic space.

Defects

It is unclear how to assign a probability to an unseen item (for example, a document that was not in the training corpus).

Latent Dirichlet Allocation (LDA)

LDA attempts to model observed bags of words at the corpus level. The documents in the corpus are viewed as having been generated by a process parametrized by a corpus-level constant a. A second corpus-level constant b is added as an extra parameter for generating words, given a topic.

\begin{figure}[!htb] \begin{tikzpicture}[node distance=1cm,>=stealth',bend angle=15,auto] \node (a)[gm-var-constant]{a}; \node (D)[gm-var-hidden, right of = a]{D}; \node (T)[gm-var-hidden, right of = D]{T}; \node (W)[gm-var-seen, right of = T]{W}; \node (b)[gm-var-constant, right of = W]{b} edge [->] (W); \path [->] (a) edge (D) (D) edge (T) (T) edge (W);

\node[gm-plate] (words) [fit = (T) (W)] {};

\node[gm-plate] (documents) [fit = (words) (D)] {};

\end{tikzpicture} \end{figure}
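The generative process in the graphical model (a → D → T → W ← b, with one plate over words and one over documents) can be sketched as a sampler. This is a minimal sketch that assumes the usual LDA reading: a is the parameter of a Dirichlet distribution over topic mixtures, and b is a matrix of per-topic word probabilities. The parameter values, vocabulary size, and the function name `sample_document` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical corpus-level constants (values chosen only for illustration).
a = np.array([0.5, 0.5])           # assumed Dirichlet parameter over 2 topics
b = np.array([[0.7, 0.2, 0.1],     # Pr(w | t=0) over a 3-word vocabulary
              [0.1, 0.3, 0.6]])    # Pr(w | t=1)

def sample_document(n_words):
    """Sample one document following a -> D -> T -> W <- b."""
    # D: the topic mixture of this document, drawn once per document.
    d = rng.dirichlet(a)
    # T: one topic per word position, drawn from the document's mixture.
    topics = rng.choice(len(a), size=n_words, p=d)
    # W: each observed word is drawn from the distribution of its topic.
    words = np.array([rng.choice(b.shape[1], p=b[t]) for t in topics])
    return d, topics, words

# Outer plate: repeat the per-document process for each document.
corpus = [sample_document(n_words=8) for _ in range(3)]
```

Only the `words` arrays correspond to observed variables; `d` and `topics` are hidden and would have to be inferred when fitting the model to real data.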