Homework 1 - Bayesian Networks

1 Bayesian Networks [20 points] (Xun)
State True or False, and briefly justify your answer in a few sentences. You can cite theorems from Koller and Friedman (2009). Throughout the section, P is a distribution and G is a BN structure.

1. [2 points] If A ⊥ B | C and A ⊥ C | B, then A ⊥ B and A ⊥ C. (Suppose the joint distribution of A, B, C is positive.)
Figure 1: A Bayesian network (over the nodes A, B, C, D, E).
2. [2 points] In Figure 1, E ⊥ C | B.
3. [2 points] In Figure 1, A ⊥ E | C.

Figure 2: Some relations in Bayesian networks, among the statements "P factorizes over G", "I(G) ⊆ I(P)", and "Iℓ(G) ⊆ I(P)"; the relations are labeled (1), (2), (3).
4. [2 points] In Figure 2, relation (1) is true.
5. [2 points] In Figure 2, relation (2) is true.
6. [2 points] In Figure 2, relation (3) is true.
7. [2 points] If G is an I-map for P, then P may have additional conditional independencies that are not implied by G.
8. [2 points] Two BN structures G1 and G2 are I-equivalent iff they have the same skeleton and the same set of v-structures.
9. [2 points] The minimal I-map of a distribution is the I-map with fewest edges.
10. [2 points] The P-map of a distribution, if it exists, is unique.
2 Undirected Graphical Models [25 points] (Paul)
2.1 Local, Pairwise and Global Markov Properties [18 points]
1. Prove the following properties:
• [2 points] If A ⊥ (B ∪ D) | C, then A ⊥ B | C.
• [2 points] If A ⊥ (B ∪ D) | C, then A ⊥ B | (C ∪ D) and A ⊥ D | (B ∪ C).
• [2 points] For strictly positive distributions, if A ⊥ B | (C ∪ D) and A ⊥ C | (B ∪ D), then A ⊥ (B ∪ C) | D.
2. [6 points] Show that for any undirected graph G and distribution P, if P factorizes according to G, then P will also satisfy the global Markov property of G.
3. [6 points] Show that for any undirected graph G and distribution P, if P satisfies the local Markov property with respect to G, then P will also satisfy the pairwise Markov property of G.
2.2 Gaussian Graphical Models [7 points]
Now we consider a specific instance of undirected graphical models. Let X = {X1, ..., Xd} be a set of random variables that follow a joint Gaussian distribution X ∼ N(µ, Λ⁻¹), where Λ ∈ S++ (the set of symmetric positive definite matrices) is the precision matrix. Let Xj, Xk be two nodes in X, and let Z = {Xi | i ∉ {j, k}} denote the remaining nodes. Show that Xj ⊥ Xk | Z if and only if Λjk = 0.
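This is not a substitute for the proof, but the claim can be sanity-checked numerically. The sketch below (assuming numpy; the dimensions, indices, and variable names are arbitrary) builds a precision matrix with Λjk = 0 and checks that the conditional covariance of (Xj, Xk) given Z has a vanishing off-diagonal entry, and that its inverse recovers the (j, k) block of Λ.

```python
import numpy as np

# Numerical sanity check (not a proof) of the claim above: for a Gaussian
# X ~ N(mu, Lambda^{-1}), X_j is independent of X_k given the rest iff
# Lambda_jk = 0.  Dimensions and indices below are arbitrary.

rng = np.random.default_rng(0)
d, j, k = 5, 0, 3

# Random positive definite precision matrix with the (j, k) entry forced to 0.
B = rng.normal(size=(d, d))
Lam = B @ B.T + d * np.eye(d)
Lam[j, k] = Lam[k, j] = 0.0
assert np.all(np.linalg.eigvalsh(Lam) > 0)

Sigma = np.linalg.inv(Lam)
A = [j, k]
Z = [i for i in range(d) if i not in A]

# Conditional covariance of (X_j, X_k) given X_Z (Schur complement of Sigma_ZZ).
cond_cov = Sigma[np.ix_(A, A)] - Sigma[np.ix_(A, Z)] @ np.linalg.solve(
    Sigma[np.ix_(Z, Z)], Sigma[np.ix_(Z, A)])

# For Gaussians, zero conditional covariance means conditional independence.
print(cond_cov[0, 1])                                # ~0, because Lambda_jk = 0
print(np.linalg.inv(cond_cov) - Lam[np.ix_(A, A)])   # ~0: inverse equals Lambda_AA
```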
3 Exact Inference [40 points] (Xun)
3.1 Variable elimination on a grid [10 points]
Consider the following Markov network:
A - B - C
|   |   |
D - E - F
|   |   |
G - H - I
We are going to see how tree-width, a property of the graph, is related to the intrinsic complexity of variable elimination of a distribution.
1. [2 points] Write down the largest clique(s) for the elimination order E, D, H, F, B, A, G, I, C.
2. [2 points] Write down the largest clique(s) for the elimination order A, G, I, C, D, H, F, B, E.
3. [2 points] Which of the above orderings is preferable? Explain briefly.
4. [4 points] Using this intuition, give a reasonable (≪ n²) upper bound on the tree-width of the n×n grid (see the sketch below for experimenting with elimination orders).
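To check your answers for parts 1-3, and to experiment with larger grids for part 4, one can simulate elimination and track the largest clique it creates. The sketch below is illustrative only; it is not part of any provided template, and the helper name is made up.

```python
import itertools

# Sketch of a clique-size tracker for variable elimination.  Eliminating a
# node connects all of its remaining neighbours, and the node together with
# those neighbours is the clique whose size drives the cost of that step.

def largest_elimination_clique(edges, order):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    largest = set()
    for node in order:
        nbrs = adj.pop(node)
        if len(nbrs) + 1 > len(largest):
            largest = nbrs | {node}
        for a, b in itertools.combinations(nbrs, 2):   # add fill-in edges
            adj[a].add(b)
            adj[b].add(a)
        for n in nbrs:                                 # remove the eliminated node
            adj[n].discard(node)
    return largest

# The 3x3 grid from this subsection (horizontal and vertical adjacencies).
grid_edges = [("A","B"),("B","C"),("D","E"),("E","F"),("G","H"),("H","I"),
              ("A","D"),("D","G"),("B","E"),("E","H"),("C","F"),("F","I")]

print(largest_elimination_clique(grid_edges, list("EDHFBAGIC")))
print(largest_elimination_clique(grid_edges, list("AGICDHFBE")))
```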
3.2 Junction tree in action: part 1 [10 points]
Consider the following Bayesian network G:
[Figure: A Bayesian network G over the nodes A, B, C, D, E. Its CPDs, given in Section 3.3, are P(A), P(B|A), P(E|A), P(C|B), and P(D|C,E), i.e. the edges are A→B, A→E, B→C, C→D, and E→D.]
We are going to construct a junction tree T from G. Please sketch the generated objects in each step.
1. [1 pt] Moralize G to construct an undirected graph H.
2. [3 pts] Triangulate H to construct a chordal graph H∗. (Although there are many ways to triangulate a graph, for the ease of grading, please use the triangulation that corresponds to the elimination order A, B, C, D, E.)
3. [3 pts] Construct a cluster graph U where each node is a maximal clique Ci from H∗ and each edge is the sepset Si,j = Ci ∩ Cj between adjacent cliques Ci and Cj.
4. [3 pts] Run the maximum spanning tree algorithm on U to construct a junction tree T. (The cluster graph is small enough to compute the maximum spanning tree in one's head; a programmatic sketch is also given below.)
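For reference, steps 3-4 can also be done programmatically. The following sketch is not part of junction_tree.py; the helper name is made up, and the example cliques at the bottom are hypothetical, not the maximal cliques of H∗.

```python
import itertools

# Generic sketch of steps 3-4: build the cluster graph over a list of maximal
# cliques, weight each candidate edge by its sepset size, and keep a maximum
# spanning tree (Kruskal's algorithm with union-find).

def junction_tree_edges(cliques):
    candidates = []
    for (i, Ci), (j, Cj) in itertools.combinations(enumerate(cliques), 2):
        sep = set(Ci) & set(Cj)
        if sep:                               # only cliques that share variables
            candidates.append((len(sep), i, j, sep))
    candidates.sort(key=lambda t: -t[0])      # largest sepsets first

    parent = list(range(len(cliques)))        # union-find for cycle detection
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for _, i, j, sep in candidates:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((cliques[i], cliques[j], sep))
    return tree

# Purely hypothetical cliques, only to show the call; use the maximal cliques
# of your own triangulated graph H* instead.
print(junction_tree_edges([{"A", "B"}, {"B", "C"}, {"C", "D"}]))
```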
3.3 Junction tree in action: part 2 [20 points]
Continuing from part 1, now assume all variables are binary and the CPDs are parameterized as follows:
A | P(A)
0 | x0

A B | P(B|A)
0 0 | x1
1 0 | x2

A E | P(E|A)
0 0 | x3
1 0 | x4

B C | P(C|B)
0 0 | x5
1 0 | x6

C E D | P(D|C,E)
0 0 0 | x7
0 1 0 | x8
1 0 0 | x9
1 1 0 | x10
We are going to implement belief propagation on T. The provided template junction_tree.py contains the following tasks:
• initial_clique_potentials(): Compute initial clique potentials ψi(Ci) from factors φi.
• messages(): Compute messages δi→j from the initial clique potentials ψi(Ci).
• beliefs(): Compute calibrated clique beliefs βi(Ci) and sepset beliefs µi,j(Si,j), using the initial clique potentials ψi(Ci) and the messages δi→j.
• Using the beliefs βi(Ci), µi,j(Si,j), compute
  – query1(): P(B)
  – query2(): P(A|C)
  – query3(): P(A,B,C,D,E)
Please finish the unimplemented TODO blocks and submit the completed junction_tree.py to Gradescope (https://www.gradescope.com/courses/36025).
In the implementation, please represent factors as numpy.ndarray and store the different factors in a dictionary keyed by their scopes. For example, as provided in the template, phi['ab'] is a factor φAB represented as a 2×2 matrix, where phi['ab'][0, 0] = φAB(A = 0, B = 0) = P(B = 0|A = 0) = x1. For messages, one can use delta['ab_cd'] to denote a message from AB to CD. Most functions can be written in 3 lines of code. You may find np.einsum() useful.
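To illustrate this representation, here is a minimal sketch of multiplying factors into a clique potential and then marginalizing it down to a sepset with np.einsum(). The clique structure, the numeric values, and the 'a' and 'ab_bc' keys are made up for the example and need not match the actual junction tree of this problem.

```python
import numpy as np

# Illustration of the dictionary-of-ndarrays factor representation.

phi = {
    "a":  np.array([0.2, 0.8]),          # phi['a'][a]     = P(A=a)       (hypothetical)
    "ab": np.array([[0.6, 0.4],          # phi['ab'][a, b] = P(B=b | A=a) (hypothetical values)
                    [0.3, 0.7]]),
}

# Initial potential of a hypothetical clique AB: product of its assigned factors.
psi_ab = np.einsum("a,ab->ab", phi["a"], phi["ab"])

# Message from clique AB to a neighbouring clique over the sepset {B}:
# marginalize out every variable that is not in the sepset.
delta = {"ab_bc": np.einsum("ab->b", psi_ab)}

print(delta["ab_bc"])   # the (unnormalized) marginal over B
```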
4 Parameter Learning [15 points] (Xun)
[Figure: An HMM with hidden chain Y1 → Y2 → ··· → YT and emissions X1, X2, ..., XT.]

Consider an HMM with Yt ∈ [M] and Xt ∈ R^K (M, K ∈ N). Let (π, A, {µi, σi²}, i = 1, ..., M) be its parameters, where π ∈ R^M is the initial state distribution, A ∈ R^{M×M} is the transition matrix, and µi ∈ R^K, σi² > 0 are the parameters of the emission distribution, which is defined to be an isotropic Gaussian. In other words,
P(Y1 = i) = πi                          (1)
P(Yt+1 = j | Yt = i) = Aij              (2)
P(Xt | Yt = i) = N(Xt; µi, σi² I).      (3)
We are going to implement the Baum-Welch (EM) algorithm that estimates the parameters from data X ∈ R^{N×T×K}, a collection of N observed sequences of length T. Note that there are different forms of the forward-backward algorithm, for instance the (α,γ)-recursion, which is slightly different from the (α,β)-recursion we saw in class. For the ease of grading, however, please implement the (α,β) version, and remember to normalize the messages at each step for numerical stability.
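As a reference point, here is a minimal sketch of the normalized (α, β) recursion for a single sequence, assuming per-state isotropic Gaussian emissions. The function name, signature, and array layout are illustrative and do not necessarily match the baum_welch.py template.

```python
import numpy as np

# Sketch of the normalized (alpha, beta) recursion for ONE observed sequence
# x of shape (T, K); pi: (M,), A: (M, M), mu: (M, K), sigma2: (M,) variances.

def forward_backward(x, pi, A, mu, sigma2):
    T, K = x.shape
    M = pi.shape[0]

    # Emission likelihoods p(x_t | Y_t = i) for isotropic Gaussians, computed
    # in log space and shifted per time step (the shift cancels out below).
    diff = x[:, None, :] - mu[None, :, :]                 # (T, M, K)
    logp = -0.5 * (diff ** 2).sum(-1) / sigma2 \
           - 0.5 * K * np.log(2 * np.pi * sigma2)         # (T, M)
    B = np.exp(logp - logp.max(axis=1, keepdims=True))

    alpha = np.zeros((T, M))
    beta = np.ones((T, M))
    c = np.zeros(T)                                       # per-step normalizers

    alpha[0] = pi * B[0]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):                                 # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]                                  # normalize each step
    for t in range(T - 2, -1, -1):                        # backward pass
        beta[t] = (A @ (B[t + 1] * beta[t + 1])) / c[t + 1]

    gamma = alpha * beta                                  # proportional to P(Y_t | x)
    gamma /= gamma.sum(axis=1, keepdims=True)
    return alpha, beta, gamma
```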
Please complete the unimplemented TODO blocks in the template baum_welch.py and submit it to Gradescope (https://www.gradescope.com/courses/36025). The template has its own toy problem to verify the implementation. The test cases are run on other randomly generated problem instances.
