Originally, AI researchers were reluctant to use probability for dealing with uncertainty. This has since changed, and probability theory is now the most common approach to uncertainty in AI, with Bayesian networks being the most popular implementation. The main problem with probability in AI is that most of the associated computational problems are NP-hard.
Probability
Probability is defined in one of two ways. It is either the degree of belief in a statement, or the frequency of occurrence in an infinite population.
The Yule-Simpson paradox states that breaking down a set of figures can show a very different picture from the one suggested by a high-level overview.
If we consider a table of fictitious data, then we have a column for each variable (in this case, each variable has two values), a field header, and, for each possible combination of values, a count. A less compact possibility is to list each combination once for each time it occurs.
Such a table is called a discrete contingency table. The table is really a flattened version of an n-dimensional object: one dimension for each variable. For continuous data, contingency tables are not suitable and new methods are needed. Unfortunately, there is not enough time in AGM to cover these techniques.
A contingency table tells us what has been observed in the past; it contains data. We need to move from this data to probability. A simple way to create a probability distribution from the data is to sum all the counts, and then divide the count in each cell by this total. The result is effectively a prediction of what is likely in the future, and this specific method produces a joint probability distribution.
It is a distribution because a probability mass of 1 has been distributed over the 8 cells. It is a probability distribution because each individual number is a probability, and it is a joint probability distribution because each probability corresponds to a joint instantiation of the 3 variables.
This is also an example of maximum likelihood estimation. It is important to note that it is only an estimate, since the distribution it produces is an estimate of some unknown true distribution.
Putting aside the joint structure of our distribution, our maximum likelihood estimation distribution defines a probability distribution with 8 possible outcomes, like throwing a biased 8-sided die. Given a fixed data size, it defines a multinomial distribution over all the possible datasets of that size. Adopting a multinomial distribution is tantamount to assuming that each datapoint is independently 'drawn' from our probability distribution.
That is, the multinomial distribution gives the probability of a particular observed data set occurring. The data set with the highest multinomial probability is the one most likely to occur within that probabilistic model, and it should be the same as, or very close to, the observed data. To find this most probable data set we could use calculus, but plugging in the numbers from the observed data set generally works just as well.
A probabilistic model imposes structural constraints on what the true probability distribution can be, and a graphical model is just one type of probabilistic model. Formally, a model is just a set of probability distributions.
We can think of probabilistic models as qualitative structural constraints. In a graphical model, fewer edges mean more constraints; a graph in which all nodes are connected imposes no restrictions, and no nodes are independent.
A saturated model is one where there are no constraints, therefore formally it is the set of all possible probability distributions for a given collection of variables.
For a saturated model, we can compute the MLE as follows. Suppose there are n datapoints, of which n(i) fall into cell i; the MLE distribution is then defined by a probability for each cell. If p(i) is the unknown true probability for cell i, then maximum likelihood estimation gives us p̂(i) = n(i)/n as the estimate of p(i), for all values of i.
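To make this concrete, here is a minimal Python sketch of the MLE computation, assuming the contingency table is stored as a dict mapping cells (tuples of variable values) to counts; the variables and numbers are invented for illustration:

```python
# A minimal sketch of maximum likelihood estimation from a contingency
# table; the variables, values and counts are invented.
counts = {
    ('yes', 'yes'): 10, ('yes', 'no'): 20,
    ('no', 'yes'): 30, ('no', 'no'): 40,
}

n = sum(counts.values())                           # total number of datapoints
mle = {cell: c / n for cell, c in counts.items()}  # p-hat(i) = n(i)/n

print(mle)                                         # the values sum to 1
```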
Factors
A joint probability distribution can be represented using a factor. A factor is simply a mapping from joint instantiations of variables to real numbers. Factors are the workhorses of graphical models.
Saturated models have only one factor; however, most joint probability distributions we will study are defined using more than one factor. Saturated models are unusual.
If we let Δ be a set of variables, then for each variable δ ∈ Δ we define Ιδ to be its set of values. We then have a table Ι = ×δ∈ΔΙδ. Each element i of Ι is a vector of values called a cell. A factor with table Ι is therefore simply a function ƒ : Ι → ℜ.
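As an illustration, here is a hedged Python sketch of this definition, with invented variables and values; the table is built with itertools.product:

```python
from itertools import product

# A minimal sketch of a factor: the table I is the Cartesian product of
# each variable's value set, and the factor maps each cell of I to a real.
values = {'A': ['a0', 'a1'], 'B': ['b0', 'b1']}         # I_delta for each delta
variables = ('A', 'B')                                  # Delta

table = list(product(*(values[v] for v in variables)))  # I = x_delta I_delta
factor = {cell: 1.0 for cell in table}                  # f : I -> R (all ones here)

print(factor)
```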
Marginalisation
Marginalisation throws some information away from a factor by reducing the number of variables involved. It is easiest to understand by looking at contingency tables (which are also factors). Each cell in the marginal table is simply the sum of all the corresponding cells in the original table.
Marginalisation is just addition that results in a reduction of dimensionality (projecting down a factor to a smaller number of dimensions), and therefore a more coarse grained view. Marginalisation is sometimes called "summing down", and can be done with both data and probability.
We can define marginalisation formally. Let ƒ(i) be the value associated with cell i in some factor with variables Δ and table Ι, and let a ⊂ Δ. The table for the marginal factor defined by a is Ιa = ×δ∈aΙδ. For any cell i ∈ Ι, let iδ be its value for the variable δ, and let ia, the projection of i onto a ⊂ Δ, be ×δ∈aiδ. We can now define ƒa : Ιa → ℜ (the function defining the marginal factor) as: ƒa(ia) = Σj∈Ι:ja=ia ƒ(j).
We can more informally define marginalisation too. If we suppose a factor f involves the variables A, B, C and D (i.e., these variables are the ones that define its table), then we can normally reflect this by writing f as f(A, B, C, D). It is important to note in this informal definition that the factor is actually a function operating on tuples of the values of its variables, not on the variables themselves. With this informal notation, we can represent marginalising away A as: f′(B, C, D) = ΣAf(A, B, C, D).
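A minimal Python sketch of marginalisation, under the same dict-based factor representation as above (names and numbers invented):

```python
# A hedged sketch of marginalising away a variable from a factor
# represented as (variables, dict from cells to reals).
def marginalise(variables, f, away):
    """Sum out `away`, returning the marginal factor's variables and table."""
    keep = tuple(v for v in variables if v != away)
    idx = [variables.index(v) for v in keep]
    g = {}
    for cell, value in f.items():
        projected = tuple(cell[i] for i in idx)          # i_a: project cell onto a
        g[projected] = g.get(projected, 0.0) + value     # sum corresponding cells
    return keep, g

variables = ('A', 'B')
f = {('a0', 'b0'): 0.1, ('a0', 'b1'): 0.2, ('a1', 'b0'): 0.3, ('a1', 'b1'): 0.4}
print(marginalise(variables, f, 'A'))                    # f'(B) = sum_A f(A, B)
```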
Probability distributions assign probabilities to a sample space of events. In a joint probability distribution, each joint instantiation is an event, and these events are all mutually exclusive.
A marginal distribution can then be computed by summing out one or more variables from the joint distribution. If the original joint distribution is defined by a single factor, then it is computed by marginalising the factor. As all events are exclusive, marginalising away a variable is as simple as adding together all the values involving that variable, e.g., if the variable A has two states, A1 and A2, then f′(B, C, D) = f(A1, B, C, D) + f(A2, B, C, D).
Big joint probability distributions are difficult to make sense of, hence the need to compute less informative marginal distributions. Usually, we want marginal distributions with a single variable. Many of the algorithms we will study are efficient ways to compute marginal distributions for all variables in a joint distribution.
Independence
A = a and B = b are two independent events iff P(A = a ∩ B = b) = P(A = a)P(B = b).
Two random variables C and D are independent in a joint distribution P iff: ∀c∈ΙC, d∈ΙD : P(C=c, D=d) = P(C=c)P(D=d), where P(C=c) = Σd∈ΙDP(C=c, D=d) is the marginal probability of C=c, and P(D=d) is the marginal probability of D=d.
This is a strong condition; independence has to hold for all combinations of values.
If C and D are independent and the distributions for C and D are each represented by a single factor (denoted P(C) and P(D) respectively), then computing a new factor for the joint (denoted P(C,D)) is just a question of multiplying the appropriate values. To do this, we 'broadcast' each factor so that both have the same variables, and then apply pointwise multiplication.
We can now formally define factor multiplication. Each factor involved in defining a joint distribution can be defined on the same table: if we have some table Ι such that P(A), P(B) : Ι → ℜ, then factor multiplication is just normal pointwise function multiplication, and this table involves all the variables in the joint distribution.
However, particular factors typically only depend on a subset of all the variables in the joint distribution, so we take advantage of this to give them a more compact representation. The data broadcasting makes the existence of this shared table more obvious.
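Here is a hedged sketch of the 'broadcast then multiply pointwise' operation, again using the invented dict representation (this is not gPy's actual API):

```python
from itertools import product

# A minimal sketch of factor multiplication via 'broadcasting': each
# factor is extended to the union of the variables, then multiplied
# pointwise. Representation and names are assumptions of this sketch.
def multiply(vars1, f1, vars2, f2, values):
    all_vars = tuple(dict.fromkeys(vars1 + vars2))   # union, order preserved
    result = {}
    for cell in product(*(values[v] for v in all_vars)):
        assignment = dict(zip(all_vars, cell))
        c1 = tuple(assignment[v] for v in vars1)
        c2 = tuple(assignment[v] for v in vars2)
        result[cell] = f1[c1] * f2[c2]               # pointwise multiplication
    return all_vars, result

values = {'C': [0, 1], 'D': [0, 1]}
pC = {(0,): 0.3, (1,): 0.7}
pD = {(0,): 0.6, (1,): 0.4}
print(multiply(('C',), pC, ('D',), pD, values))      # P(C,D) = P(C)P(D)
```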
If two random variables C and D are independent, then their joint distribution P(C,D) is simply the product of the two relevant marginal distributions P(C) and P(D). It then follows that the joint distribution can be represented by these two factors - this is our first example of a factored model.
Similarly, if we have n mutually independent random variables, then their joint distribution can be represented by n factors.
As a model is just a set of probability distributions, then for a set of n variables, we can let the independence model be the set of all distributions where there is independence between all variables. Any distribution in this model can be represented by n univariate distributions (each represented by a factor).
If we suppose that each variable has m values, then if no independence relations exist (i.e., the saturated model), a single factor with m^n numbers is needed. If we assume the independence model, then n factors each with m values are needed - that is, only nm numbers in total (you can save a bit more if you're desperate: since probabilities must sum to 1, one of the values in each factor can be deduced). For example, with n = 10 binary variables (m = 2), the saturated model needs 2^10 = 1024 numbers, while the independence model needs only 20. In addition to these space savings, computations are also far cheaper with the independence model.
The main problem with this is that complete independence rarely models the real world well, so we need models somewhere in between the saturated and independence models. Fortunately, there exists such a class of models - those which use conditional independence.
Conditional Independence
Until now, we have multiplied factors representing independent random variables, but since all factors can be implicitly defined on the same table, the 'broadcasting followed by pointwise multiplication' algorithm can be applied to multiply any two factors.
If we are given a set of factors {f1, f2, ..., fn} with non-negative values, then we can let Z = Σ f1 × f2 × ... × fn. This sum over all the variables is the scalar you get by marginalising away all the variables of the product. As long as Z > 0, the factors define a distribution: P = Z^-1 f1 × f2 × ... × fn
So P = f1 × f2 × ... × (fi/Z) × ... × fn, for any i.
Computing Z, known as the partition function, is a pain. If we only need probabilities up to a common factor, it is better to represent P implicitly by {f1, f2, ..., fn}.
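A minimal sketch of computing Z and normalising, shown for a single product factor with invented values:

```python
# A hedged sketch of normalising a factored distribution; here the
# product of the factors has already been computed into one factor f,
# whose values are invented.
f = {('a0', 'b0'): 2.0, ('a0', 'b1'): 1.0, ('a1', 'b0'): 3.0, ('a1', 'b1'): 2.0}

Z = sum(f.values())   # marginalising away all the variables yields a scalar
assert Z > 0
P = {cell: value / Z for cell, value in f.items()}   # P = Z^-1 f

print(Z, P)
```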
So, if P = f1 × f2 × ... × fn and each fi is a univariate factor for a distinct variable, then, as we have seen, all the variables are independent. We still need to consider the general case, however.
If we let X and Z be subsets of variables of the joint distribution P, then we can let P(X,Z) denote the distribution produced by projecting P down onto X ∪ Z by marginalisation. Similarly, P(Z) is the marginal distribution just on Z.
We can define the distribution on X conditional on Z as P(X|Z) = P(X,Z)/P(Z). P(X|Z) contains conditional distributions P(X|Z = z), one for each instantiation z of Z. P(X|Z = z) is therefore a distribution for X, given that Z = z.
If P(X,Z) and P(Z) are both represented by factors, then P(X|Z) can be computed by pointwise division. P(X|Z) is undefined if there is division by 0.
If we let X, Y and Z be subsets of variables of the joint distribution P, then we say X is conditionally independent of Y given Z under P if P(X,Y|Z) = P(X|Z)P(Y|Z); that is, X and Y are independent once we know Z. We write this as X ⊥ Y|Z[P], or just X ⊥ Y|Z when the P is obvious. The symbol ⊥ means "is independent of".
Also, if P(X|Z) = P(X|Y,Z), then we can say that the probability of X given Z does not change if Y is also given - that is, X and Y are conditionally independent given Z.
It turns out that factored distributions have conditional independence properties which depend on their structure.
Hypergraphs
What matters is not so much the numbers in the factors, but the relationship between the variables in the various factors. We can consider H = { set of variables used by fi }i, e.g., H = { { A, B }, { B, C }, { C, D }, { A, D } }. H is a hypergraph.
A hypergraph H is a set of subsets of a finite set H - the base set. The elements h ∈ H are called the hyperedges, and we only need to consider hypergraphs where H = ∪h∈Hh.
If a hypergraph consists of sets which are all arity 2, then it can be represented as a graph, where the nodes are the base set H, and the edges are defined by which two nodes are in the same set.
Reducing Hypergraphs
red(H) is the reduced hypergraph produced by removing all hyperedges in H that are contained in some other hyperedge. The hyperedges that are left after reduction are called a generating class.
The example hypergraph earlier H = { { A, B }, { B, C }, { C, D }, { A, D } } is already reduced.
A second example: if H′ = { { Smoking }, { Cancer, Smoking }, { Bronchitis, Smoking } }, then red(H′) = { { Cancer, Smoking }, { Bronchitis, Smoking } } - redundant hyperedges correspond to redundant factors.
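Computing red(H) is straightforward; here is a minimal sketch using frozensets as hyperedges, applied to the example above:

```python
# A minimal sketch of red(H): remove every hyperedge contained in some
# other hyperedge.
def reduce_hypergraph(H):
    return {h for h in H
            if not any(h < other for other in H)}   # h < other: proper subset

H = {frozenset({'Smoking'}),
     frozenset({'Cancer', 'Smoking'}),
     frozenset({'Bronchitis', 'Smoking'})}
print(reduce_hypergraph(H))   # the singleton {Smoking} is removed
```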
A generating class (i.e., a reduced hypergraph) H = { h1, h2, ..., hn } defines a set of factored probability distributions. This is simply the set of probability distributions of the form: f1 × f2 × ... × fn, where hi = the set of variables used by fi.
Such a set of probability distributions is called a hierarchical model, so each generating class defines a hierarchical model.
Graphs
A graph is a pair (V,E), where V is a set of vertices and E ⊆ V × V is a set of edges. Graphs are easily represented visually as diagrams.
If (α,β) is an edge, and so is (β,α), then it is an undirected edge, which we can express using the notation α~β. If all edges in a graph are undirected, then we can refer to the entire graph as undirected.
Each hypergraph H has an associated undirected graph H[2] = (V,E), where V = the base set H and (α,β) ∈ E ⇔ {α,β} ⊂ h, for some h ∈ H.
H[2] is called the 2-section of H. If H is the hypergraph for a factored distribution, then H[2] is the interaction graph for the distribution.
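A minimal sketch of computing the 2-section, using the earlier four-hyperedge example:

```python
from itertools import combinations

# A minimal sketch of the 2-section H[2] of a hypergraph: connect two
# vertices iff they appear together in some hyperedge.
def two_section(H):
    vertices = set().union(*H)                      # the base set
    edges = {frozenset(pair)
             for h in H for pair in combinations(h, 2)}
    return vertices, edges

H = [{'A', 'B'}, {'B', 'C'}, {'C', 'D'}, {'A', 'D'}]
print(two_section(H))   # the interaction graph: a 4-cycle A-B-C-D-A
```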
We call a set of vertices A ⊆ V complete if for all distinct α,β ∈ A: α~β - that is, if every vertex in the set is connected to every other vertex in it. We call a set of vertices C a clique if it is maximally complete - that is, it is complete and adding any new vertex to C would produce a set that is not complete. For any graph G, the set of all of its cliques C(G) is the clique hypergraph of the graph.
If we let G be a graph whose vertices are the variables of a joint distribution P, then G has a clique hypergraph C(G) and each clique c ∈ C(G) is a subset of the variables in the joint distribution P. The distribution P is said to factorise according to the graph G if there are factors fc such that: P = ∏c∈C(G)fc, where each factor fc only depends on the variables in c.
The Global Markov Property
If we remember that each factored distribution has an associated hypergraph and interaction graph, then because the interaction graph is generated from the distribution's hypergraph, it is easy to see that a factored distribution always factorises according to its own interaction graph.
We say that a joint distribution P over variables V obeys the global Markov property relative to a graph G = (V,E) if for any triple (A,B,S) of disjoint subsets of V, such that S separates A from B in G, we have: A⊥B|S[P]. So, if the global Markov property holds, we can use the graph to read off conditional independence relations.
So, if P factorises according to G, then P obeys the global Markov property relative to G. This means a distribution's interaction graph can be used to read off its conditional independence relations.
The Hammersley-Clifford theorem
The Hammersley-Clifford theorem states that if a distribution P is everywhere positive, then it factorises according to a graph G, if and only if it obeys the global Markov property relative to G.
If there are zeros in P, then we can have the global Markov property without factorisation holding.
There are Markov properties other than the global one; however, they are all equivalent if P is everywhere positive.
Variable Elimination
The basic inference problem is: given a joint distribution, compute the marginal distribution for a given variable.
One option to solve this is to multiply all the factors to get one (often enormous) factor, and then marginalise on that factor as discussed previously. This is only really useful conceptually.
For factored distributions, it is better to interleave multiplication and addition, and this is the basis of the variable elimination algorithm.
If we had to compute xy + xw + xz + xu, then doing so 'directly' would involve 4 multiplications and 3 additions. However, if we rewrote this as the equivalent x(y + w + z + u), then this involves only 1 multiplication and 3 additions.
We can exploit this elementary arithmetic fact for variable elimination:
Replacing several factors by the single factor which is their product does not alter the distribution represented (multiplication is associative).
To sum out several variables, we can sum them out (i.e., eliminate them) one at a time (addition is associative).
If a variable appears in only one factor, then to sum it out of the distribution it is enough to sum it out of that factor (this is the key to variable elimination).
For information on summing out a variable occurring in only one factor, see lecture slides 6 and 7.
To eliminate a single variable, we first grab all the factors containing the variable, compute their product factor, and then delete them. Finally, we marginalise away the variable from the product factor, and add the resulting factor to the distribution. gPy can do this for us, and the algorithm is explained in the source and on slides 9-11.
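Here is a hedged sketch of eliminating a single variable from a factored distribution, using the invented (variables, table) representation from earlier; this is a sketch of the idea, not gPy's actual API:

```python
from itertools import product

# A minimal sketch of eliminating one variable: multiply all factors
# containing it, marginalise it away, and keep the resulting factor.
def eliminate(factors, values, var):
    involved = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    # multiply the involved factors over the union of their variables
    all_vars = tuple(dict.fromkeys(v for vs, _ in involved for v in vs))
    prod_table = {}
    for cell in product(*(values[v] for v in all_vars)):
        a = dict(zip(all_vars, cell))
        p = 1.0
        for vs, t in involved:
            p *= t[tuple(a[v] for v in vs)]
        prod_table[cell] = p
    # marginalise away var from the product factor
    keep = tuple(v for v in all_vars if v != var)
    msg = {}
    for cell, p in prod_table.items():
        k = tuple(c for v, c in zip(all_vars, cell) if v != var)
        msg[k] = msg.get(k, 0.0) + p
    return rest + [(keep, msg)]

values = {'A': [0, 1], 'B': [0, 1]}
fA = (('A',), {(0,): 0.4, (1,): 0.6})
fAB = (('A', 'B'), {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.2, (1, 1): 0.8})
print(eliminate([fA, fAB], values, 'A'))
```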
The order in which variables are summed out is called an elimination ordering - the choice of elimination ordering makes no difference to the final result (due to the commutativity of addition), but it makes an immense difference to the efficiency of the variable elimination algorithm.
For any given elimination ordering, the variable elimination algorithm can be used to generate a cluster forest. For each variable summed out, (in gPy) there is a prod_factor whose variables are prod_factor.variables(). These sets of variables are called clusters, and make up the nodes of the cluster forest.
There is an arrow from prod_factor_1.variables() to prod_factor_2.variables() if the message produced by prod_factor_1 is one of the factors multiplied together to form prod_factor_2.
Forests are collections of trees, and trees are graphs with no cycles. Since each cluster produces exactly one message, and a message is deleted once it is used in another cluster, this is enough to ensure that the cluster forest is indeed a forest.
Clusters are actually hyperedges, and the nodes of a cluster forest thus form a hypergraph - generally with a lot of redundancy, however. Hypergraphs whose hyperedges form the nodes of forests will prove to be the key data structure for probabilistic inference in factored distributions.
The general inference problem is: given a joint distribution and some observed evidence, compute the marginal distribution for a given variable, e.g., how to compute P(Cancer|XRay=abnormal,Smoking=absent).
One way to do this is to delay normalisation. Using the above example, we have P(L,A,T,X,D,B,S,E|X=abnormal,S=absent), which by definition is equal to P(L,A,T,X=abnormal,D,B,S=absent,E) / P(X=abnormal,S=absent), so we can work with P(L,A,T,X=abnormal,D,B,S=absent,E) and delay normalising until the last moment.
If we imagine that P(L,A,T,X=abnormal,D,B,S=absent,E) were represented by one enormous factor, then this factor has the same value as P(L,A,T,X,D,B,S,E) for those rows where X = abnormal and S = absent, but has a value of 0 for those rows where this is not the case. That is, rows for instantiations which contradict the evidence have a value of 0.
In other words, we can define the same unnormalised distribution by altering the original factors - in each factor, just set any row corresponding to an instantiation contradicting the evidence to 0, leave the other rows as they were, and then run variable elimination as before.
To avoid pointless multiplication (multiplying by 0), it's a bit neater to simply delete any rows which are inconsistent with the evidence - this amounts to not even acknowledging the existence of values which contradict the evidence. This is just informal; in the real conditional distribution values don't just disappear, but it is better for algorithms.
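A minimal sketch of instantiating evidence by deleting inconsistent rows, with invented variables and values:

```python
# A hedged sketch of conditioning on evidence by deleting rows of a
# factor that are inconsistent with it.
def instantiate(variables, f, evidence):
    """Keep only rows consistent with evidence (a dict var -> value)."""
    return {cell: p for cell, p in f.items()
            if all(cell[variables.index(v)] == val
                   for v, val in evidence.items() if v in variables)}

variables = ('Smoking', 'Cancer')
f = {('absent', 'no'): 0.5, ('absent', 'yes'): 0.1,
     ('present', 'no'): 0.25, ('present', 'yes'): 0.15}
print(instantiate(variables, f, {'Smoking': 'absent'}))
```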
The variable elimination algorithm presented here is overly simple, since it does not take advantage of any conditional independence relations.
If we suppose we want the marginal distribution of X, if some factor contains only variables independent of X (perhaps because we have conditioned on some evidence), then we just delete that factor. It's not too difficult to concoct scenarios where this gives you a massive speed-up.
Bayesian Networks
The structure of a Bayesian net is given by a directed acyclic graph (DAG) whose nodes are the variables of a distribution.
If there's an arrow A → B in a directed graph, then A is a parent of B and B is a child of A. A directed graph (or a digraph) is acyclic if the graph contains no loops which follow the direction of the arrows. A DAG defines a partial order on the nodes. A total ordering of nodes consistent with a given DAG is one consistent with this partial ordering.
For each variable, there is a conditional probability table (CPT) which defines a distribution over the values of the variable, for each joint instantiation of the parents of the variable.
Note that each CPT is a factor, and the distribution represented by a Bayesian network is the product of all the CPTs. So, Bayesian nets are factored representations of probability distributions, and so everything we have said about such more general factored representations holds for Bayesian networks as well. Unsurprisingly, Bayesian networks have extra properties.
A product of CPTs, one for each variable, defines a distribution directly, no matter what the numbers are in the CPTs, as long as the corresponding digraph is acyclic. Additionally, there is no need to normalise.
If we ignore that CPTs are actually CPTs and treat them just as factors, then the resulting interaction graph is called a moral graph. Unmarried parents get connected (since they are together in some CPT), and the graph becomes undirected. With this in mind, it's not difficult to see that a Bayesian network factorises according to its moral graph.
Since a Bayesian network factorises according to its moral graph, we can read off some conditional independencies from its moral graph. However, there are further conditional independence relations which hold in a Bayesian network, but can not be deduced from the moral graph. To get these, we need to exploit properties of the CPTs.
A key property of a CPT is that, for any configuration of its parents, the values for the child must add up to one. This is what makes it a CPT.
So, if we marginalise away the child variable in a CPT, we end up with a factor of ones. Factors where all the values are 1 can be deleted from a factored distribution without altering the distribution. So, if a variable appears only in its own CPT, then we can marginalise it away from the Bayesian network by simply removing that CPT.
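A small illustration of that CPT property (invented numbers):

```python
# A minimal sketch: summing a CPT over its child yields a factor of ones.
# The CPT P(child | parent) is keyed as (parent, child); numbers invented.
cpt = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.7}

marginal = {p: sum(cpt[(p, c)] for c in (0, 1)) for p in (0, 1)}
print(marginal)   # {0: 1.0, 1: 1.0} - a factor of ones, safe to delete
```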
In a given digraph, the ancestors of vertex X, denoted as an(X) is the set of nodes Y such that there is a directed path from Y to X. That is, it's just the transitive closure of the parent relationship.
A set of nodes X is an ancestral set if and only if for any node x ∈ X we have an(x) ⊆ X. We let An(Z) denote the smallest ancestral set containing a set of nodes Z; it is produced by just adding ancestors recursively.
If we let P be a joint distribution and X be some subset of its variables, then we denote the distribution produced by marginalising P onto X as PX. If G is a graph, then let GX be the graph formed by removing from G all the nodes not in X.
Using this, we can let P be defined by a Bayesian network with acyclic digraph G and let X be an ancestral set in G, and then PX is given by the Bayesian network with the DAG GX (and the relevant CPTs deleted).
Now, for conditional independence, if we want to know whether the Bayesian network structure implies A⊥B|S for disjoint sets of variables A, B, S, then first construct the DAG for the distribution PAn(A∪B∪S). This is just GAn(A∪B∪S) where G is the original DAG. Then we construct the moral graph for this smaller DAG, and use this smaller moral graph to check for separation.
According to Lauritzen, if we let P factorise recursively according to G, then A⊥B|S, whenever A and B are separated by S in (GAn(A∪B∪S))m, the moral graph of the smallest ancestral set containing A ∪ B ∪ S.
This property, which recursive factorisation implies, is the directed global Markov property.
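The whole check described above fits in a short sketch: build the smallest ancestral set, moralise it, and test separation with a breadth-first search that is not allowed to pass through S. The representation (a dict of parents) and the example are assumptions of this sketch:

```python
from collections import deque

# A hedged sketch of checking whether a DAG implies A ⊥ B | S, via the
# moral graph of the smallest ancestral set containing A ∪ B ∪ S.
def independent(parents, A, B, S):
    # smallest ancestral set containing A ∪ B ∪ S
    anc = set(A) | set(B) | set(S)
    frontier = list(anc)
    while frontier:
        v = frontier.pop()
        for p in parents.get(v, ()):
            if p not in anc:
                anc.add(p)
                frontier.append(p)
    # moral graph: connect co-parents, drop directions
    adj = {v: set() for v in anc}
    for v in anc:
        ps = [p for p in parents.get(v, ()) if p in anc]
        for p in ps:
            adj[v].add(p); adj[p].add(v)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                adj[ps[i]].add(ps[j]); adj[ps[j]].add(ps[i])
    # separation: no path from A to B that avoids S
    seen, queue = set(A), deque(A)
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in seen and w not in S:
                if w in B:
                    return False      # independence not implied
                seen.add(w)
                queue.append(w)
    return True

# Smoking -> Cancer, Smoking -> Bronchitis: Cancer ⊥ Bronchitis | Smoking
parents = {'Cancer': ['Smoking'], 'Bronchitis': ['Smoking']}
print(independent(parents, {'Cancer'}, {'Bronchitis'}, {'Smoking'}))  # True
print(independent(parents, {'Cancer'}, {'Bronchitis'}, set()))        # False
```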
In some DAG G, if we let nd(A) be the non-descendants of node A and pa(A) be the parents of A, then a distribution P obeys the directed local Markov property relative to G if each variable is independent of its non-descendants given its parents: A⊥nd(A)|pa(A).
It turns out that for any DAG G, recursive factorisation, the directed global Markov property and the directed local Markov property are all equivalent.
For inference in Bayesian networks, since a Bayesian net is a factored distribution, one option is just to use variable elimination as previously described. There are also Bayesian network specific options for inference, both exact and approximate.
Triangulation
In variable elimination, to sum out a variable we have to first multiply all factors containing that variable. This corresponds to merging the associated hyperedges. If there is more than one factor, this may produce a new hyperedge. We can then sum out the variable - this corresponds to removing the variable from the possibly new hyperedge. It's easy to see this process by looking at the distribution's interaction graph.
If we recall that two variables are connected in the interaction graph if they both appear in a common hyperedge (i.e., they are both variables for some factor), then merging all hyperedges containing a given variable v corresponds to connecting, in the interaction graph, all variables in these hyperedges. Any new connection added this way is called a fill-in.
Deleting a vertex from the only hyperedge to contain it corresponds to deleting that vertex from the graph.
This process is known as triangulation. If carried out exactly as described we would end up with an empty graph (since all variables would eventually get deleted). Instead, we just mark 'eliminated' vertices and treat them as if they were deleted in the algorithm.
The triangulated graph gives a nice graphical representation of the complexity of variable elimination with a particular ordering. If there are n variables and N is the size (number of rows) of the biggest factor, then variable elimination is Ο(nN).
If this biggest factor has μ variables, each with no more than v values, then we only get an exponential upper bound: N ≤ v^μ. Therefore, we need to keep μ, the size of the biggest clique, small.
Even deciding whether there is an ordering which bounds the biggest clique by some value is NP-complete, thus finding an ordering which minimises the biggest clique is a hard problem.
Nevertheless, there are some reasonable heuristic approaches. One greedy option is, at each point, to eliminate whichever vertex produces the fewest fill-in lines (preferably none). Another option is to compute an elimination ordering backwards, using maximum cardinality search.
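A hedged sketch of the greedy fewest-fill-ins heuristic, on an adjacency-set representation of the interaction graph (the example graph is the earlier 4-cycle):

```python
# A minimal sketch of the greedy min-fill heuristic: repeatedly eliminate
# the vertex whose elimination adds the fewest fill-in lines.
def fill_in_count(adj, v):
    nbrs = list(adj[v])
    return sum(1 for i in range(len(nbrs)) for j in range(i + 1, len(nbrs))
               if nbrs[j] not in adj[nbrs[i]])

def greedy_order(adj):
    adj = {v: set(ns) for v, ns in adj.items()}
    order = []
    while adj:
        v = min(adj, key=lambda u: fill_in_count(adj, u))
        order.append(v)
        for a in adj[v]:          # add fill-ins: make v's neighbours complete
            for b in adj[v]:
                if a != b:
                    adj[a].add(b)
        for a in adj[v]:          # then remove v from the graph
            adj[a].discard(v)
        del adj[v]
    return order

graph = {'A': {'B', 'D'}, 'B': {'A', 'C'}, 'C': {'B', 'D'}, 'D': {'A', 'C'}}
print(greedy_order(graph))
```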
Maximum Cardinality Search
To perform maximum cardinality search, we number the vertices from n - 1 down to 0. As the next vertex to number, we select the vertex adjacent to the largest number of previously numbered vertices, breaking ties arbitrarily.
If there are n vertices and m edges in the graph, then maximum cardinality search is Ο(n + m); that is, it's linear. If a graph has a zero fill-in, maximum cardinality search will find one. If it does not, then Kjærulff has shown that the ordering it finds is generally not that good.
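A minimal sketch of maximum cardinality search under the same adjacency-set representation:

```python
# A hedged sketch of maximum cardinality search: number vertices from
# n-1 down to 0, always picking the vertex adjacent to the most
# already-numbered vertices. The example graph is invented.
def max_cardinality_search(adj):
    numbered = []
    unnumbered = set(adj)
    while unnumbered:
        v = max(unnumbered,
                key=lambda u: sum(1 for w in adj[u] if w in numbered))
        numbered.append(v)       # numbered[0] gets number n-1, and so on
        unnumbered.remove(v)
    return numbered              # eliminate in the reverse of this order

graph = {'A': {'B', 'D'}, 'B': {'A', 'C'}, 'C': {'B', 'D'}, 'D': {'A', 'C'}}
print(max_cardinality_search(graph))
```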
A triangulated graph is defined to be an undirected graph such that every cycle of length n ≥ 4 has a chord: two non-consecutive vertices which are neighbours. Because of this, sometimes triangulated graphs are called chordal. A graph is triangulated if and only if it has zero fill-in.
An elimination ordering for a graph is just an ordering of the vertices of the graph.
The fill-in generated by an elimination ordering is the collection of extra lines added by running the vertex elimination algorithm with this ordering.
An elimination ordering is a zero fill-in if its fill-in is empty.
Join Forests
The cliques of a triangulated graph can be arranged in an important data structure, a join forest. These are very closely related to the cluster trees we have seen previously.
A join forest is a collection of trees (connected graphs without cycles), whose vertices are the cliques of the triangulated graphs.
We can construct them by doing vertex elimination on a distribution's hypergraph, rather than its interaction graph.
A hypergraph is decomposable if its reduction is the clique hypergraph of a decomposable (i.e., triangulated) graph. This is not usually the given definition for decomposable hypergraphs, but it may as well be - in the maths literature, decomposable hypergraphs are usually called acyclic hypergraphs.
Graham's Algorithm
Graham's algorithm is for variable elimination on hypergraphs. To apply it, you repeat the following two operations on a hypergraph until neither apply:
Delete a vertex that appears in only one hyperedge.
Delete a hyperedge that is contained in another hyperedge.
A hypergraph is decomposable if and only if Graham's algorithm deletes all vertices.
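A hedged sketch of Graham's algorithm as a decomposability test, with hyperedges as sets (the second example is the non-decomposable 4-cycle hypergraph from earlier):

```python
# A minimal sketch of Graham's algorithm: repeatedly delete vertices that
# appear in only one hyperedge and hyperedges contained in others. A
# hypergraph is decomposable iff every vertex gets deleted.
def grahams_algorithm(H):
    H = [set(h) for h in H]
    changed = True
    while changed:
        changed = False
        # delete any vertex appearing in exactly one hyperedge
        for h in H:
            for v in list(h):
                if sum(1 for g in H if v in g) == 1:
                    h.remove(v)
                    changed = True
        # delete any hyperedge contained in another hyperedge
        for h in list(H):
            if any(g is not h and h <= g for g in H):
                H.remove(h)
                changed = True
                break
    return all(not h for h in H)   # True iff all vertices were deleted

print(grahams_algorithm([{'A', 'B'}, {'B', 'C'}]))                      # True
print(grahams_algorithm([{'A', 'B'}, {'B', 'C'}, {'C', 'D'}, {'A', 'D'}]))  # False
```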
Join Trees
A join tree T for a hypergraph H is a tree such that:
the vertices of T are the hyperedges of H, and
for all h1, h2 ∈ H, and any h on the unique path between h1 and h2 in T, we have h1 ∩ h2 ⊆ h.
This second condition is the join, or junction, property. Join trees are sometimes called junction trees.
A join forest for a hypergraph H is simply a collection of join trees for hypergraphs Hi where H is the disjoint union of the Hi such that hyperedges in different trees are disjoint.
However, most of the time our join forests contain just a single tree, so most of the literature talks about junction trees, rather than join forests.
A fundamental theorem is that a hypergraph is decomposable if and only if it has a join forest.
Constructing Join Forests
There are a number of different ways of constructing a join forest for a decomposable hypergraph. Recall that a decomposable hypergraph is always the clique hypergraph of some triangulated graph, and that the triangulated graph has an associated elimination ordering which is a zero fill-in. If we run variable elimination with this ordering, then each elimination happens within a clique. If a clique Ci produces a message which is used by clique Cj, then we connect these two cliques in the join forest.
To construct a join forest with Graham's algorithm, we first delete a vertex that appears in only one hyperedge, and then delete a hyperedge that is contained in another hyperedge.
Graham's algorithm is just structural variable elimination on a decomposable hypergraph using a zero fill-in ordering. When a hyperedge gets deleted, just connect it to one of the hyperedges which contained it.
Typically, we want marginal distributions for all uninstantiated variables. We could run variable elimination from scratch for each of them, but, whatever the elimination ordering, we could end up repeating a lot of computation work.
A better way is to run variable elimination in a join forest. Recall the cluster tree representation of the execution of the variable elimination algorithm: a join forest is essentially a cluster tree with redundancy removed, so, unsurprisingly, we can run variable elimination in a join forest.
This means running variable elimination with an elimination ordering which could also be used to run Graham's algorithm. We assume further that this ordering is such that, when we eliminate an isolated variable (i.e., one that is in only one clique), we also eliminate all other isolated variables in that clique, if any. It is easy to see that this still gives us a zero fill-in.
Since our distribution is decomposable, there is a one-to-one mapping between the initial factors and the nodes (i.e., cliques) of the join forest. It is useful to think of each factor as being located at its corresponding clique.
Once all the isolated vertices in a clique have been summed out, a new factor is created (unless the clique contained only isolated vertices). This factor is called a message. Rather than adding messages as a new factor in the factored distribution, we will immediately multiply them into an existing factor.
Decomposability guarantees that there is always a receiving factor which includes all the variables of the message. This corresponds to the 'deleting a redundant hyperedge' step of Graham's algorithm.
Since the join forest can be built from running Graham's algorithm, the sending and receiving cliques will be directly connected in the join forest.
To retain the join forest, rather than actually deleting a factor once it has produced a message, we can just treat it as if it were deleted. So, to run variable elimination in a join tree, each clique except the last 'sends a message' to its neighbour. The last clique is the root clique - once it has collected all its messages, it contains the marginal distribution for the variables it contains. The generalisation from this to join forests is not very interesting.
Now, suppose we have run variable elimination in a join tree as just described. If we consider a clique C2 neighbouring the root clique C1, then C2 has not received a message from the root C1, but has received all the other messages it would have received had it been the root instead of C1.
If C1 were to send it a message now, it would be the wrong message, since C1 includes (by way of multiplication) the message sent to it from C2.
If we store the message mC2→C1 that C2 sends to C1, then C1 can send the right message back to C2, even after receiving mC2→C1. It just divides the normal message (produced by marginalisation) by mC2→C1, thus cancelling out the effect of previously receiving it. Then, C2 will end up with exactly the same factor it would have had, had it been the original root.
This argument can then be extended to all the cliques in the join forest. Storing messages is the key to the probability propagation algorithm in a join forest. It's best to imagine the message being stored in the lines connecting neighbouring cliques in the join forest.
It is now becoming obvious that join forests are the right structure for parallel variable elimination. We can consider a more elegant characterisation of what's going on. A separator between two neighbouring cliques in a join tree is just the intersection of the variables in the 2 cliques. The variables of the message constitute a separator. We can associate factors with separators, just as we associate factors with the cliques.
Suppose we have a decomposable distribution p, such that p = ∏c∈Cfc, where C, the cliques, are the vertices of a join forest. If we now let S be the separators of the join forest, and set 1s to be a table of 1s for each s ∈ S, we can then write p = ∏c∈Cfc / ∏s∈S1s.
If we suppose C1 sends a message mC1→C2 to C2, and their separator is S, then to effect the probability propagation algorithm the factor for C2 gets multiplied by mC1→C2 / fS, and so does the factor for S (so fS is replaced by mC1→C2). Since the numerator and denominator of p = ∏c∈Cfc / ∏s∈Sfs are multiplied by the same quantity, sending a message does not alter the distribution, and we have a loop invariant.
Once each clique has received all its messages, it will contain the marginal distribution for its variables. We just need to ensure that all messages in both directions are passed.
One option to accomplish this is to choose an arbitrary root, and then send all messages towards this root (the upward pass), and then send all messages in the direction away from the root (the downward pass).
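As an illustration of the upward and downward passes, here is a minimal two-clique sketch with cliques {A,B} and {B,C} and separator {B}; the factor values are invented, and after both passes each clique holds the unnormalised marginal on its variables:

```python
# A hedged sketch of two-pass message passing between two cliques
# C1 = {A,B} and C2 = {B,C} with separator S = {B}; factors are dicts
# keyed by value tuples, numbers invented.
f1 = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}   # f(A,B)
f2 = {(0, 0): 0.6, (0, 1): 0.2, (1, 0): 0.5, (1, 1): 0.5}   # f(B,C)
fS = {(0,): 1.0, (1,): 1.0}                                 # separator: ones

# upward pass: C1 sends m(B) = sum_A f1(A,B); C2 absorbs m / fS
m = {(b,): sum(f1[(a, b)] for a in (0, 1)) for b in (0, 1)}
f2 = {(b, c): f2[(b, c)] * m[(b,)] / fS[(b,)] for b in (0, 1) for c in (0, 1)}
fS = m                                                      # fS replaced by m

# downward pass: C2 sends m2(B) = sum_C f2(B,C); C1 absorbs m2 / fS
m2 = {(b,): sum(f2[(b, c)] for c in (0, 1)) for b in (0, 1)}
f1 = {(a, b): f1[(a, b)] * m2[(b,)] / fS[(b,)] for a in (0, 1) for b in (0, 1)}

# each clique now holds the (unnormalised) marginal on its variables
print(f1, f2, sep='\n')
```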
A perfect sequence is an ordering of the cliques in a join tree where the first clique is a root, and such that if Ci is nearer this root than Cj, then Ci < Cj.
It is important to note that this is not the actual definition, and that it can be extended to join forests.
Perfect sequences can be used to schedule messages.
Any distribution can be made to have a decomposable hypergraph, as follows:
Find a (hopefully) small fill-in for the interaction graph, or do something equivalent directly on the hypergraph.
Map each original factor to a clique which includes its variables (this is always possible).
The factor for each clique in the new hypergraph is a (possibly empty) product of those factors mapped to it.