Graphical model G of distribution
The modeling problem
Given RV’s \(X = (X_{i})\) with joint probability density \(f_X(x)\). RV’s as nodes; edges represent dependencies.
Distribution structure/ sparsity
Seek to represent some factorization of the joint probability distribution concisely, thence conditional independence relationships too. In many cases, these factors involve small subsets of variables: sparsity in the dependency graph.
Eg: \(f_X(x) = Z^{-1}\prod_{C \subseteq V} \gf_{C}(x_C)\). Compare notation with exponential family distributions.
Graphical model family
A graph alone describes conditional independence relationships, which are satisfied by many distributions.
Uses
Any distribution can be represented by a (maybe complete) graphical model, but it becomes interesting only when the graph/ model is sparse.
Useful in representing causal relationships.
The factorization of Pr(x) lets ye store the joint probability distribution very concisely: without it, ye would usually need \(O(|ran(X_i)|^{n})\) space (see the sketch below).
Can do fast inference using graph theoretic algs.
Can characterize running time and inference error bound in terms of properties of the underlying graph.
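A quick Python sketch of this storage gap; the sizes n, m and the edge count below are assumed purely for illustration.
\begin{verbatim}
# Rough storage comparison: full joint table vs a pairwise factorization.
n, m, num_edges = 30, 5, 60   # hypothetical model sizes

full_table_entries = m ** n                  # one entry per joint configuration
pairwise_entries = n * m + num_edges * m**2  # node factors + an m-by-m table per edge

print(f"full joint table : {full_table_entries:.3e} entries")
print(f"pairwise factors : {pairwise_entries} entries")
\end{verbatim}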
Factor graphs
Factors of Pr(x)
Bipartite graph of shaded ovals (\(\set{i}\) for factors \(f_{i}(x_{\nbd(i)})\): any nonnegative fns) and ovals (RV’s \(\set{X_{i}}\)). \(f_X(x) = Z^{-1}\prod_i f_{i}(x_{\nbd(i)})\). This is a ‘hypergraph’ among \(\set{X_{i}}\), with generalized edges (hyperedges), each connecting a set of variables.
Also defines another graph, \(\nbd\) relationship amongst \(\set{X_{i}}\); and thence ‘path’ is defined.
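A minimal Python sketch of a factor graph held as data: each factor is a (scope, nonnegative function) pair, \(f_X\) is proportional to their product, and Z is computed by brute force. The variable names and factor tables are made up for illustration.
\begin{verbatim}
import itertools

domains = {"A": [0, 1], "B": [0, 1], "C": [0, 1]}

def phi_ab(a, b):   # arbitrary nonnegative compatibility function on (A, B)
    return 2.0 if a == b else 1.0

def phi_bc(b, c):   # arbitrary nonnegative compatibility function on (B, C)
    return 3.0 if b == c else 0.5

# Each factor: (scope = its neighbourhood in the bipartite graph, function).
factors = [(("A", "B"), phi_ab), (("B", "C"), phi_bc)]

def unnormalized(x):
    """Evaluate prod_i f_i(x_{nbd(i)}) at the assignment x (a dict)."""
    p = 1.0
    for scope, phi in factors:
        p *= phi(*(x[v] for v in scope))
    return p

# Brute-force partition function Z (only feasible for tiny models).
names = list(domains)
Z = sum(unnormalized(dict(zip(names, vals)))
        for vals in itertools.product(*(domains[v] for v in names)))
print("Z =", Z)
\end{verbatim}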
Connection with exponential families
Same as in the undirected model case.
Conditional independence
If every path between RV’s X, Y passes through Z, Z separates X, Y. If Z separates X, Y \(X\perp Y|Z\): Pf: See undirected model case.
So, can think of Z as an observed variable, Z blocks flow of information from X to Y. So, \(\nbd(X)\) is its Markov blanket.
Expressiveness
Can express any factorization. Eg: can design factor f(X, Y) to say that X and Y are \(\eps\) apart; so factors called compatibility functions.
Undirected graphical models
Aka Markov random field.
Factorization
\(f_X(x) = Z^{-1} \prod \gf_{C_j}(x_{C_{j}})\), where \(C_{j}\) are cliques of various sizes in G.
As an exponential distribution family
See section on exponential families.
Conditional independence properties
Aka Markov properties. Conditional independence properties, markov blanket same as in Factor graphs.
Global Markov
Take any A, B, Z. If Z separates A and B, \(A \perp B | Z\).
Proof
Factorization implies this. Take A, B, Z; expand A and B to get A’, B’ which include all nodes reachable from A and B without crossing Z; So, \(f_X(x) = f(x_{A’}, x_Z)f(x_{B’}, x_Z)\); so \(A’ \perp B’ |Z\).
Local Markov
\(X_i \perp X_{V - i - N(i)} | X_{N(i)}\). Global markov implies this.
Pairwise Markov
If \((i, j) \notin E\), \(X_i \perp X_j | X_{V-\set{i, j}}\). Implied by Local Markov, and thence by Global Markov.
Factorization from pairwise markov for many Pr(x)
(Hammersley-Clifford) If \(\forall x: f_X(x) > 0\), pairwise Markov implies factorization.
Tree structured case
Importance
It is easy to compute the partition function for this case. There exist efficient algorithms to do inference accurately on such models, and there are efficient algorithms to find the closest tree structured graphical model to any distribution.
Form and connections
\(f_X(x) \propto \prod_{(i, j) \in T} \gf_{i,j}(x_i, x_j)\).
As directed model
Now, consider any node, say \(x_1\), to be the root of the tree. \(f_X(x_1, x_{\nbd(1)}) \propto \prod_{j \in \nbd(1)} \gf_{1, j}(x_1, x_j)\). But, \(f_X(x_1, x_{\nbd(1)}) = f_{X_1}(x_1) \prod_{j \in \nbd(1)} f_{X_j|X_1}(x_j|x_1)\) from the conditional independence property of undirected graphical models. Applying this procedure recursively, one gets a directed graphical model.
In terms of marginals
Consider the corresponding directed model. \(f_{X_1, X_{\nbd(1)}}(x_1, x_{\nbd(1)}) = f_{X_1}(x_1) \prod_{j \in \nbd(1)} f_{X_j|X_1}(x_j|x_1) = f_{X_1}(x_1) \prod_{j \in \nbd(1)} f_{X_j}(x_j) \prod_{j \in \nbd(1)} \frac{f_{X_j, X_1}(x_j, x_1)}{f_{X_j}(x_j)f_{X_1}(x_1)}\).
Applying this procedure repeatedly, we get: \(f_X(x) = \prod f_{X_i}(x_i) \prod_{(i,j) \in E} \frac{f_{X_i, X_j}(x_i, x_j)}{f_{X_i}(x_i) f_{X_j}(x_j)}\).
Pairwise graphical model
A subclass. \(f_X(x) \propto \prod_i \gf_i(x_i)\prod_{(i,j) \in E} \gf_{i, j}(x_i, x_j)\).
Hierarchical models
A factor \(\gf_c(x_c)\) exists only if, for all \(s \subset c\), a factor \(\gf_s(x_s)\) exists.
Discrete models
\(\set{dom(X_i)}\) are discrete.
Pairwise-ification of discrete models
Can add some extra variables and rewrite with all \(C_j\) being pairwise: if a clique of size \(p’>2\) exists, collapse it into a single node and expand the state space (see the sketch below).
Note that, just because you know how to learn pairwise graphical models, you cannot simply construct a general discrete model learning algorithm: you don’t know which nodes to collapse.
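A Python sketch of this collapse for one size-3 clique over m-ary variables; the table psi is made up. The clique factor becomes a unary factor on the super-variable Y (state space of size \(m^{3}\)), plus pairwise consistency factors tying Y back to each original variable.
\begin{verbatim}
import itertools

m = 2
# Made-up factor over a 3-clique (x1, x2, x3).
psi = {xs: 1.0 + sum(xs) for xs in itertools.product(range(m), repeat=3)}

super_states = list(itertools.product(range(m), repeat=3))  # states of Y

def unary_Y(y):
    # Y's state encodes a full joint assignment of the collapsed clique.
    return psi[super_states[y]]

def consistency(i):
    # Pairwise factor between Y and original variable x_i:
    # 1 if Y's i-th component agrees with x_i, else 0.
    return lambda y, xi: 1.0 if super_states[y][i] == xi else 0.0

pairwise_factors = [consistency(i) for i in range(3)]
print("Y has", len(super_states), "states; clique factor is now unary + 3 pairwise links")
\end{verbatim}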
The general form
Consider exponential family attached to discrete graphical model G of n vars. Let \(|dom(X_i)| = |M| = m\). Can assume G is pairwise.
We can completely specify \(\ftr_{i,j}(x_i, x_j)\) by a parameter matrix \(T_{i, j}\) with \(T_{i,j,k,l} = \ftr_{i,j}(k, l)\). Then \(f_X(x) \propto e^{\sum_{(i, j) \in E} T_{i,j, x_i, x_j}} = e^{\sum_{(i, j) \in V^{2}} \sum_{(k, l) \in M^{2}}T_{i,j, k, l} I[x_i = k] I[x_j = l]}\). We can think of this as an exponential family distribution involving \(|V|^{2}m^{2}\) auxiliary features/ covariates \(y_{ij;kl} = I[x_i = k] I[x_j = l]\). But this distribution \(f_X(x)\) is now overparametrized, as the \(y_{ij;kl}\) are not linearly independent. [See section on minimal parametrization of exponential family distributions.]
Let \(M’ = M-\set{m}\). Using a minimal parametrization, we get an exponential family distribution involving only the features \(y_{ij;kl \in (M’)^{2}} = I[x_i = k] I[x_j = l]\) and \(y_{i;k \in M’} = I[x_i = k]\). So, \(Pr(X = x) \propto \exp(\sum t_{i;k} y_{i;k} + \sum t_{ij;kl} y_{ij;kl})\).
Ising model
\(Pr(X = x|t) \propto e^{\sum_i t_i x_i + \sum_{(i, j) \in V^{2}} t_{i,j} x_i x_j}\), with \(dom(X_i) = \set{\pm 1}\). Any binary undirected graphical model involving variables \(Y_i\) with range \(\set{1, 2}\) can be expressed like this: just consider the minimal parametrization of such a distribution using the auxiliary features described earlier.
For signed edge recovery for the class of Ising models given a few observations, see structure learning part in statistics ref. Originally used in physics to model electron spins’ interactions in the case of magnetism.
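A small Python sketch of the unnormalized Ising log-probability; the graph (a triangle) and the parameters t are made up for illustration.
\begin{verbatim}
import numpy as np

# log f(x) = sum_i t_i x_i + sum_{(i,j) in E} t_ij x_i x_j, spins in {-1, +1}.
edges = [(0, 1), (1, 2), (2, 0)]          # made-up graph
t_node = np.array([0.1, -0.2, 0.3])       # field terms t_i (assumed values)
t_edge = {e: 0.5 for e in edges}          # coupling terms t_ij (assumed values)

def ising_log_unnormalized(x):
    assert all(v in (-1, 1) for v in x)
    return float(t_node @ np.asarray(x)
                 + sum(t_edge[(i, j)] * x[i] * x[j] for (i, j) in edges))

print(ising_log_unnormalized([+1, -1, +1]))
\end{verbatim}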
Junction tree model
Take an undirected graph G and find a junction tree T for it (see graph theory ref). Belief propagation algorithms work well over trees; hence this. Like a factor graph, there are 2 types of nodes: nodes for the cliques C of G, connected to each other through separator nodes S (the variables shared between adjacent cliques).
Factorization
\(f_X(x) = \frac{\prod_{c \in C}f_{X_c}(x_c)}{\prod_{s \in S} f_{X_s}(x_s)^{|\nbd(s)| -1}}\).
Directed
Aka Bayesian networks; but needn’t be learned using Bayesian methods.
Extra notation
Shorthand for N nodes with identical parentage: a plate annotated by N, with a single node inside. Can represent deterministic variables with solid dots. Can represent observed variables as shaded nodes.
Factorization
Every \(X_i\) annotated with \(f_{X_i|par(X_i)}\) (aka factors). \(f_X(x) = \prod_{i} f_{X_{i}|par(X_{i})}(x_i|x_{par(i)})\): a product of factors, each over \((X_i, par(X_i))\) (see the sketch below).
There are many Bayesian networks that can represent \(f_X\), based on different decompositions: eg, \(X_1 = X_2 + X_3\). Not all are equally concise; they are concise when expressing causal relationships.
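A Python sketch of this factorization on the rain/ sprinkler/ grass-wet example; the CPT numbers are invented.
\begin{verbatim}
# Each variable carries a CPT f_{X_i | par(X_i)}; the joint is their product.
parents = {"R": (), "S": (), "G": ("R", "S")}   # rain, sprinkler, grass wet

cpt = {
    "R": {(): {0: 0.8, 1: 0.2}},
    "S": {(): {0: 0.7, 1: 0.3}},
    "G": {(0, 0): {0: 0.99, 1: 0.01},
          (0, 1): {0: 0.20, 1: 0.80},
          (1, 0): {0: 0.10, 1: 0.90},
          (1, 1): {0: 0.01, 1: 0.99}},
}

def joint(x):
    """f_X(x) = prod_i f_{X_i | par(X_i)}(x_i | x_{par(i)})."""
    p = 1.0
    for v, pars in parents.items():
        p *= cpt[v][tuple(x[u] for u in pars)][x[v]]
    return p

print(joint({"R": 1, "S": 0, "G": 1}))   # 0.2 * 0.7 * 0.9
\end{verbatim}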
Undirected ‘moralized’ graph
Make all edges undirected, but ‘marry off’ all unmarried parents: make cliques involving each child and its parents. Conditional independences read off this moralized graph also hold in the directed model, though moralization may hide some of them (eg, the marginal independence of co-parents).
Marginal independence
If X, Y don’t have a common ancestor: \(X \perp Y\). But conditional independence \(X \perp Y |E\) need not hold if E contains a common child of X and Y (or one of its descendants); Eg: \(X_1 = X_2 + X_3\).
Dependency seperation of X, Y by Z
Aka d-separation. Every undirected path (X, Y) is blocked at some node W. 2 types of blocking: \(\to W \gets\): neither W nor any descendant of W given; \(\to W \to\) or \(\gets W \to\): W given (\(W \in Z\)).
d-separation is graph-independent: even when multiple graphs model the same distribution, the conditional independence relationships deduced from any of them hold.
Global Markov property
Let A, B, Z be sets of variables. \(A \perp B |Z\) for every d-separating Z. Thence, the Markov boundary of \(X_i\) is \(\set{par(i), chi(i), par(chi(i))}\). Implied by factorization.
Also, if S separates A, B in the moralized graph, \(A \perp B | S\). When S does not separate A, B there, check separation in the moralized subgraph induced by the ancestors of \(A \cup B \cup S\).
Check d-separation
Use breadth-first-search to find unblocked paths. Aka Bayes ball algorithm.
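Instead of Bayes ball, here is a hedged Python sketch that tests d-separation via the moral-graph criterion above: A and B are d-separated by S iff S separates them in the moralized subgraph induced by the ancestors of A, B, S. The DAG used is the rain/ sprinkler/ grass-wet example, and B, S are assumed disjoint.
\begin{verbatim}
from collections import deque

def ancestors(dag, nodes):
    """dag: dict child -> set of parents. Returns nodes plus all their ancestors."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in dag.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p); stack.append(p)
    return seen

def d_separated(dag, A, B, S):
    keep = ancestors(dag, set(A) | set(B) | set(S))
    # Moralize the ancestral subgraph: undirect parent-child edges, marry co-parents.
    adj = {v: set() for v in keep}
    for child in keep:
        pars = [p for p in dag.get(child, ()) if p in keep]
        for p in pars:
            adj[p].add(child); adj[child].add(p)
        for i, p in enumerate(pars):
            for q in pars[i + 1:]:
                adj[p].add(q); adj[q].add(p)
    # BFS from A that may not pass through S; reaching B means not d-separated.
    blocked, frontier, reached = set(S), deque(set(A) - set(S)), set(A)
    while frontier:
        v = frontier.popleft()
        for w in adj[v]:
            if w in B:
                return False
            if w not in reached and w not in blocked:
                reached.add(w); frontier.append(w)
    return True

dag = {"R": set(), "S": set(), "G": {"R", "S"}}   # G has parents R, S
print(d_separated(dag, {"R"}, {"S"}, set()))      # True: marginally independent
print(d_separated(dag, {"R"}, {"S"}, {"G"}))      # False: explaining away
\end{verbatim}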
Other conditional independence properties
Local markov property
\(desc(i) \dfn\) descendants of i. \(X_i \perp X_{j \notin desc(i) \cup par(i)} | X_{par(i)}\).
Pairwise markov property
For non-adjacent i, j with j a non-descendant of i: \(X_i \perp X_j | X_{nd(i) - \set{j}}\), where \(nd(i) \dfn V - desc(i)\), the non-descendants of i.
Connections
Factorization \(\equiv\) Global Markov \(\equiv\) Local Markov \(\implies\) Pairwise markov. If \(f_X(x)\) has full support, pairwise markov \(\implies\) local markov.
Marginalized DAG
Let G = (V, E) be the DAG corresponding to Pr(x). The DAG corresponding to \(f_{X_{V-A}}(x_{V-A})\) is obtained as follows: Take subgraph S in G induced by (V - A). For every \((u, v) \in (V-A)^{2}\), add a new edge if \(\exists\) a directed path (u, s, v) in G, such that s is a sequence of vertices in A. Proof: Using factorization.
Comparison
Expressiveness
Take the rain, sprinkler, grass wet (R, S, G) causal model. \(R \perp S\) but \(R \not\perp S|G\): ‘Explaining away’ phenomenon. Can’t express this combination with undirected models.
Take an undirected graph shaped like a rectangle (a 4-cycle). Can’t make an equivalent directed graph.
Undirected graphical models are better at expressing non-causal, soft relationships amongst RV’s. Directed models are usually very intuitive to construct.
Undirected models less expressive than factor graphs. \why
Structural equivalence
Tree structured undirected graphical model can be expressed as a directed graphical model with the same structure: this is detailed in the undirected graphical models section.
The reverse is not true in general, as seen from the rain, sprinkler, grass wet example. But if no two edges meet head-to-head (no node has more than one parent), a tree structured directed graphical model can be expressed as an undirected graphical model with the same structure.
Independence relationships amongst vars
Conditional independence is easier to determine in undirected models than in directed graphical models. But marginal independence is easier to determine in the latter.
Inference, decoding using Graphical model
See statistics survey for the following: structure learning (learn graph from data); parameter learning (learn parameters given graph).
Problems
Consider some \(S \subseteq V\).
Inference problems
Find marginals \(f_{X_S}\), or the partition function \(Z\), or find \(E[g(x)]\), when \(g(x)\) factors nicely according to the graphical model.
Decoding problem
Find component \(\hat{x}_S\) of the mode \(\hat{x} = \argmax_x f_X\).
Global maximum vs marginal maxima
Note that this is different from the marginal maximum \(\argmax_{x_S} f_{X_{S}}\) which may be found by solving the inference problem to find \(f_{X_S}\) and then taking its maximum.
One cannot simply find the local/ marginal maxima \(\argmax_{x_S} f_{X_{S}}\) and use it to find the global maximum.
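A tiny numeric illustration (joint table made up): the component of the global mode differs from the marginal argmax.
\begin{verbatim}
import numpy as np

# Joint over (X1, X2); rows are X1 = 0, 1 and columns are X2 = 0, 1.
P = np.array([[0.40, 0.00],
              [0.30, 0.30]])

mode = np.unravel_index(P.argmax(), P.shape)          # joint mode: (0, 0)
marg_x1 = P.sum(axis=1)                               # marginal of X1: [0.4, 0.6]
print("component of joint mode:", mode[0])            # 0
print("argmax of marginal     :", marg_x1.argmax())   # 1
\end{verbatim}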
Evidence
Maybe you want to solve these problems when values for some variables may be fixed: Eg: \(X_T = x_T\).
Solving for all variables
Another variation of the inference and decoding problems is to solve them for all singleton sets \(S = \set{i}, i \in V\).
Factorization and graph-based computations
Benefit of factorization
Inference problems involve summation over a subset of \(ran(X)\), while the decoding problem involves finding the maximum over it.
Suppose that \(f_X(x) = Z^{-1}\prod_{c \subseteq V} \gf_{c}(x_c)\). Distributive law is the key to summing/ maxing this function efficiently. Eg: see elimination algorithm, junction tree algorithm.
Elimination ordering
Order the factors to get \(f(x) = \gf_{c_1}(x_{c_1}) \gf_{c_2}(x_{c_2}) .. \). Now, if variables \(X_{T}\) are involved only in factors \(\geq i\), you get: \(\sum_{x_{V-S}} f_X(x) = \sum_{x_{V-S-T}}\gf_{c_1}(x_{c_1}) .. \sum_{x_{T}}\gf_{c_i}(x_{c_i})..\). An identical ordering would be useful if we were doing \(\argmax_{x_{V-S}} f(x)\) instead.
This yields us the following reduction in dimensionality.
\subparagraph*{Reduction in dimensionality} Suppose that \(\max_i |ran(X_i)| = D\). Without the elimination ordering, we would have had to consider a set of \(D^{|V-S|}\) values during summing/ maxing. As a result of using the elimination ordering, we now consider a set of \(D^{|V-S-T|} + D^{|T|}\) values to do the same.
Thus, using this trick repeatedly, suppose that we find the elimination ordering \(f_X(x) = \prod_{c \in p(V)} \gf_c(x_c)\) where \(p(V)\) is a partition of \(V\). Then, we will only be summing/ maxing over \(|p(V)|D^{\max_{c \in p(V)} |c|} = O(D^{\max_{c \in p(V)} |c|})\) values.
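A Python sketch of the distributive law on a chain \(X_1 - X_2 - .. - X_n\) with made-up pairwise factors: elimination along the chain reproduces the brute-force Z at a fraction of the cost.
\begin{verbatim}
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, D = 8, 4
psis = [rng.random((D, D)) + 0.1 for _ in range(n - 1)]   # made-up positive factors

# Brute force: D**n terms.
Z_brute = sum(
    np.prod([psis[i][x[i], x[i + 1]] for i in range(n - 1)])
    for x in itertools.product(range(D), repeat=n)
)

# Elimination along the chain: n-1 matrix-vector products, O(n * D**2).
msg = np.ones(D)
for psi in reversed(psis):
    msg = psi @ msg          # sums out the rightmost remaining variable
Z_elim = msg.sum()

print(np.isclose(Z_brute, Z_elim))   # True
\end{verbatim}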
Finding the right order
Not all orders are equally good. There is a natural way to get this ordering for trees: consider the elimination algorithm.
Graph traversal view
Try to model the problem as one of making special graph traversals. Try to use local computations to replace global computations. Can think of this as nodes passing messages to each other.
Belief propagation
The Bottom-up idea
Exploiting the factorization and the elimination ordering, we can solve the problem bottom-up.
If you are finding \(\argmax_x f_X(x)\), this is the max-product algorithm; if finding \(f_{X_1}(x_1)\), it is the sum-product algorithm.
The idea: take local belief, take max or sum, propagate it to other nodes which need this to calculate their belief.
Node Elimination algorithm: Undirected Trees
Remove nodes to sum out/ max out one by one. Suppose we want to find \(\argmax f_{X_{V-1}|X_1 = x_1}(x_{V-1})\). Then, root the tree at \(x_1\), and do the following.
Message: a function that every node \(X_j\) sends its parent \(X_i\): \(m_{j \to i}(x_i) = \max_{x_j} \gf_{i,j}(x_i, x_j) \prod_{k \neq i, (j,k) \in E} m_{k \to j}(x_j)\). This is the message passing implementation. Belief of \(x_1\): \(b_1(x_1) = \prod_{(j,1) \in E} m_{j \to 1}(x_1)\).
Can use similar algorithm to find marginal \(f_{X_1}(x_1)\).
Using known values
Suppose \(X_2 = x_2\) is fixed in the above process. Nothing changes in the algorithm itself: \(X_2\) is simply treated as having only one value in its range while being summed/ maxed over.
Finding conditional marginals
Suppose you want to find \ \(\argmax Pr(x_{V-1}|x_1, x_2)\). Root the tree at say \(x_1\), execute the algorithm as usual; when you encounter factors involving \(x_2\), don’t sum/ max over \(x_2\).
Reusing messages: Undirected Tree
Maybe you want to find \(f_{X_i}(x_i) \forall i\), and want to reuse messages (computations involving summing out). Then, simply use these update rules: \(m_{j \to i}(x_i) = \sum_{x_j} \gf_{i,j}(x_i, x_j) \prod_{k \neq i, (j,k) \in E} m_{k \to j}(x_j)\; \forall i, j\) (with \(\max_{x_j}\) instead for decoding).
The algorithm is naturally distributed - so scales well with number of nodes. Also, there is no need to compute an elimination ordering.
Feasibility
Each message depends on certain other messages, and it is computed when these messages are available. This is always possible for trees as there are no cyclical dependencies, and all messages are computed eventually.
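A Python sketch of these update rules on a small made-up tree: each message \(m_{j \to i}\) is computed once its prerequisites are available, and the beliefs give every node marginal.
\begin{verbatim}
import numpy as np

D = 3
edges = [(0, 1), (1, 2), (1, 3)]                  # a made-up tree on 4 nodes
rng = np.random.default_rng(1)
psi = {e: rng.random((D, D)) + 0.1 for e in edges}

def get_psi(i, j):
    # psi tables are stored once per undirected edge, indexed [x_i, x_j].
    return psi[(i, j)] if (i, j) in psi else psi[(j, i)].T

nbrs = {v: set() for e in edges for v in e}
for i, j in edges:
    nbrs[i].add(j); nbrs[j].add(i)

# m[(j, i)] is the message j -> i, a vector over x_i. Compute each message once
# all of m[(k, j)], k in N(j) - {i}, are available (always possible on a tree).
m = {}
pending = [(j, i) for i in nbrs for j in nbrs[i]]
while pending:
    j, i = next((j, i) for (j, i) in pending
                if all((k, j) in m for k in nbrs[j] - {i}))
    prod_in = np.ones(D)
    for k in nbrs[j] - {i}:
        prod_in *= m[(k, j)]
    m[(j, i)] = get_psi(i, j) @ prod_in           # sums over x_j
    pending.remove((j, i))

for i in sorted(nbrs):
    b = np.ones(D)
    for j in nbrs[i]:
        b *= m[(j, i)]
    print(i, b / b.sum())                         # marginal f_{X_i}
\end{verbatim}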
Tree Factor graphs
Want to find \(f_{X_i}(x_i) \forall i\), given a tree structured factor graph. Root it at \(x_1\); for every variable v and factor f, use the update rules: \(m_{v \to f}(x_v) = \prod_{f’ \in \nbd(v), f’ \neq f}m_{f’ \to v}(x_v)\), \(m_{f \to v}(x_v) = \sum_{x_{\nbd(f) - v}} f(x_{\nbd(f)}) \prod_{v’ \in \nbd(f) - \set{v}} m_{v’ \to f}(x_{v’})\). Belief \(b_v(x_v) = \prod_{f \in \nbd(v)}m_{f \to v}(x_v)\).
General undirected graphs
Use junction trees
Maybe you don’t have a tree, but a graph for which you can get a junction tree. If the graph does not have one, can always convert it to a chordal graph by adding edges: but finding the minimal chordal supergraph is NP hard.
Belief propagation proceeds as in tree factor graphs, except the clique nodes play the role of variables, and the separators/ nodes representing variables shared between cliques play the role of factors.
Belief for each clique C is thus easily calculated; by induction over the number of cliques in the clique tree, \(b_c(x_c) = \sum_{x_{V - c}} f_X(x)\), as expected.
Finding the belief for each variable depends exponentially on the tree width: must consider \(|dom(X_i)|^{|C|}\) values while summing/ maxing, where \(|C|\) is the largest clique size.
Tree reweighted max product
Take \(p(x) \propto g_1(x)g_2(x)\), such that every clique involved in p(x) is either in \(g_1\) or in \(g_2\). Then, if \(x^{*} \in \argmax g_1(x) \land x^{*} \in \argmax g_2(x)\), then \(x^{*} \in \argmax p(x)\). So, can find smart ways of splitting p; then maximize each g; if the intersection of the maximal points is not empty, then done; otherwise, move around edge-mass.
Approximate inference: Loopy belief propagation
Trouble because of loops
Suppose there are loops. You can try initializing all incoming messages at all nodes with 1, and applying the update rule at each node repeatedly: each node calculates \(m_{i \to j}^{(t+1)}(x_j) = \sum_{x_i} \gf_{i,j}(x_i, x_j)\prod_{k \neq j}m_{k \to i}^{(t)}(x_i)\) (or \(\max_{x_i}\) for max-product). Also, as messages (function value vectors \(m(\cdot)\)) may keep growing bigger, one may need to normalize each message at each iteration.
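A Python sketch of this loop: loopy sum-product with per-iteration normalization on a made-up triangle (so there is a cycle), run for a fixed number of sweeps with no convergence guarantee.
\begin{verbatim}
import numpy as np

D = 2
edges = [(0, 1), (1, 2), (2, 0)]                  # made-up loopy graph
rng = np.random.default_rng(2)
psi = {e: rng.random((D, D)) + 0.1 for e in edges}
get_psi = lambda i, j: psi[(i, j)] if (i, j) in psi else psi[(j, i)].T

nbrs = {v: set() for e in edges for v in e}
for i, j in edges:
    nbrs[i].add(j); nbrs[j].add(i)

m = {(j, i): np.ones(D) / D for i in nbrs for j in nbrs[i]}   # initialize uniformly
for _ in range(50):                                   # fixed number of sweeps
    new = {}
    for (j, i) in m:
        prod_in = np.ones(D)
        for k in nbrs[j] - {i}:
            prod_in *= m[(k, j)]
        msg = get_psi(i, j) @ prod_in
        new[(j, i)] = msg / msg.sum()                 # normalize to stop blow-up
    m = new

for i in sorted(nbrs):
    b = np.prod([m[(j, i)] for j in nbrs[i]], axis=0)
    print(i, b / b.sum())                             # approximate marginals
\end{verbatim}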
Applicability
There are almost no theoretical guarantees of convergence or correctness, but it is widely applied in practice. When applied to trees, it is consistent with usual belief propagation, and yields the right answer.
An example in case of the inference problem involved in decoding binary linear codes is given in the information/ coding theory survey.
Computation tree at iteration j wrt node i
A way to visualize the undirected graph, as it looks to node i at iteration j, while it calculates the belief \(b_i^{(j)}(x_i)\) during loopy belief propagation. For j=0, the tree is just the single node i. For every j, you add a level to the tree, indicating new messages which are considered in \(b_i^{(j)}(x_i)\): the new leaves attached to a node k are \(\nbd(k) - par(k)\), where par(k) is k’s parent in the computation tree. Each tree is a valid graphical model in itself.
Eg: consider triangle 1, 2, 3. Initially, tree is 1. Then new level (2, 3) is added. Then children 3’, 1’ are added to 2; and 1’, 2’ are added to 3. These copies of nodes are conceptually different from the original: the messages they send are different.
‘Correctness in case of steady state’ results
These are useful because loopy belief propagation may reach a steady state before a loop appears in the computation tree; so the result is highly dependent on the initialization!
\subparagraph*{Damped max product} Use a control theory idea to force an oscillating system towards a steady state. Each node actually uses the message \(m_{i \to j}’^{(t+1)} = (m_{i \to j}^{(t)})^{\lambda}(m_{i \to j}^{(t+1)})^{1-\lambda}\).
Max-product on single cycle
But, if you are doing max-product on a graph which is a single cycle, and if you hit a steady state for all messages, then the computation yields the right answer to \(\argmax_x f_{X|X_1 = x_1}(x)\). This also holds for graphs which are trees, single cycles, or single-cycled trees.
Proof idea: Consider the computation tree T wrt \(x_1\) at the steady state. Then, belief computed at \(x_1\) corresponds to \(\argmax_x f_{X \distr T}(x_{V-1}|x_1)\), the max product belief correct for this tree. But, see that \ \(Pr_{T}(x_{V-1}|x_1) = t Pr(x_{V-1}|x_1)^{k}\) for some k, t. Thence relate argmax \(Pr_{T}\) with argmax \(Pr\).
For Gaussian graphical models
Maybe we are given a normal distribution in the form \(f_X(x) \propto e^{-2^{-1}x^{T}Px + h^{T}x}\), by specifying P and \(h = P \mean\), where P is the precision matrix. For every \(P_{i,j} \neq 0\), there is an edge in the model graph G.
Max product finds the mode; sum product finds the marginals: for a Gaussian either case reduces to finding the mean \(\mean\), so they correspond to executing the same algorithm. Marginalizing or maxing over a Gaussian distribution yields another Gaussian \(\propto e^{-2^{-1}x^{T}P’x + h’^{T}x}\), so the messages passed during the message passing algorithm correspond to the parameters of this expression.
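A hedged Python sketch of one common scalar-message parametrization of Gaussian BP: each message \(i \to j\) carries a (precision, potential) pair, and on a tree (here a made-up chain) the recovered means solve \(P\mean = h\) exactly.
\begin{verbatim}
import numpy as np

P = np.array([[2.0, 0.6, 0.0],       # made-up precision matrix, chain 0 - 1 - 2
              [0.6, 2.0, 0.7],
              [0.0, 0.7, 2.0]])
h = np.array([1.0, 0.5, -1.0])       # made-up potential vector

n = len(h)
nbrs = {i: {j for j in range(n) if j != i and P[i, j] != 0} for i in range(n)}
a = {(i, j): 0.0 for i in range(n) for j in nbrs[i]}   # message precisions
b = dict(a)                                            # message potentials

for _ in range(2 * n):                                 # enough sweeps for a tree
    for i in range(n):
        for j in nbrs[i]:
            alpha = P[i, i] + sum(a[(k, i)] for k in nbrs[i] - {j})
            beta = h[i] + sum(b[(k, i)] for k in nbrs[i] - {j})
            a[(i, j)] = -P[i, j] ** 2 / alpha
            b[(i, j)] = -P[i, j] * beta / alpha

mean = np.empty(n)
for i in range(n):
    prec = P[i, i] + sum(a[(j, i)] for j in nbrs[i])   # marginal precision
    mean[i] = (h[i] + sum(b[(j, i)] for j in nbrs[i])) / prec

print(np.allclose(mean, np.linalg.solve(P, h)))        # True: mean solves P mu = h
\end{verbatim}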
Connection with solving Ax = b
Essentially solving \(P\mean = h\) for \(\mean\); so loopy belief propagation can perhaps be used to solve Ax = b for very large \(A \succeq 0\). Convergence is guaranteed if P is diagonally dominant.
If G is a tree, then this corresponds to Gaussian elimination.
Directed graphical models
One can simply convert directed graphical models to equivalent undirected models (eg, via moralization) and use the inference algorithms described for them.