1 Sequential models for POS tagging [30 pts] (Maruan)
The Conditional Random Field (CRF) [1] is a popular probabilistic model for structured prediction, i.e., for modeling conditional probability distributions over structured output spaces. In Homework 1, you implemented an algorithm for learning the parameters of an (unsupervised) Hidden Markov Model (HMM). In this problem, we will apply a (supervised) HMM and a CRF to part-of-speech (POS) tagging and compare the two models.
In part-of-speech tagging, we are given a sentence and would like to categorize each word (or assign it a tag) to a particular part of speech. Here is an example:
Sentence:  The  reaction  in   the  newsroom  was   emotional  .
POS tags:  DET  NOUN      ADP  DET  NOUN      VERB  ADJ        .
In this problem, we will use the universal tagset [2] (which consists of 12 tags), and train and evaluate our algorithms on the Penn Treebank (PTB) corpus. Download the data from: https://sailinglab.github.io/pgm-spring-2019/assets/assignments/hw2/pos-data.zip.
The data folder contains the following files: train_set.npy, test_set.npy, vocab.txt, tagset.txt, which correspond to the training sentences, the testing sentences, the vocabulary, and the tagset, respectively.
1.1 POS tagging with HMM [10 pts]
Since assigning POS tags to words outside of context can be very ambiguous (e.g., try to tag “the sailor dogs the hatch”), we need a model that can take into account the whole sequence of words and infer the most likely sequence of corresponding tags.
1. [1 point] Denoting the sequence of $T$ words as $\mathbf{x} := (x_1, \ldots, x_T)$ and the corresponding sequence of POS tags as $\mathbf{y} := (y_1, \ldots, y_T)$, define an HMM for modeling these sequences: (a) specify how you represent observations and latent states, and (b) specify the parameters of your model.
2. [3 points] Given a set of tagged training sentences, describe and implement the learning algorithm for your HMM. Note that due to sparsity, estimation of the emission probabilities requires smoothing (use $\lambda = 0.01$ for smoothing). Report the following:
• The obtained state transition matrix $A$ (round or truncate the values in the matrix to a reasonable precision for visualization purposes).
• The negative log likelihood of your HMM on the train and test sets (you can report either $-\log P_\theta(\mathbf{x})$ or $-\log P_\theta(\mathbf{x}, \mathbf{y})$, but specify which one you report).
3. [6 pts] Now, having a trained model, we can answer queries such as: what is the most likely sequence of tags for a given sequence of words? Mathematically, we answer such queries by solving the following optimization problem:
$$\mathbf{y}^\star = \arg\max_{\mathbf{y}} P_{\mathrm{HMM}}(\mathbf{y} \mid \mathbf{x}) \qquad (1)$$
Describe and implement an efficient algorithm for solving (1) (hint: you can use some primitives from the Baum-Welch algorithm that you implemented in HW1). Infer the most likely sequences of tags for the test-set sentences. Report the per-word accuracy and visualize the confusion matrix between the tags.
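For reference, below is a minimal sketch of how problems 2-3 might be organized: supervised estimation of the HMM via smoothed relative-frequency counts, followed by Viterbi decoding for (1). It assumes words and tags have already been mapped to integer ids; the data format, function names, and the choice to smooth all counts (not only emissions) are illustrative assumptions, not requirements of the assignment.

```python
import numpy as np

def fit_hmm(tagged_sentences, n_tags, n_words, lam=0.01):
    """Supervised HMM estimation by smoothed relative-frequency counts.

    tagged_sentences: iterable of lists of (word_id, tag_id) pairs.
    Returns (pi, A, B): initial, transition, and emission probabilities.
    """
    pi = np.zeros(n_tags) + lam
    A = np.zeros((n_tags, n_tags)) + lam     # A[i, j] = P(tag_j | tag_i)
    B = np.zeros((n_tags, n_words)) + lam    # B[i, w] = P(word_w | tag_i)
    for sent in tagged_sentences:
        prev = None
        for t, (w, y) in enumerate(sent):
            B[y, w] += 1
            if t == 0:
                pi[y] += 1
            else:
                A[prev, y] += 1
            prev = y
    pi /= pi.sum()
    A /= A.sum(axis=1, keepdims=True)
    B /= B.sum(axis=1, keepdims=True)
    return pi, A, B

def viterbi(words, pi, A, B):
    """Most likely tag sequence argmax_y P(y | x) for one sentence (log space)."""
    T, K = len(words), len(pi)
    delta = np.full((T, K), -np.inf)         # best log score of a path ending in each tag
    back = np.zeros((T, K), dtype=int)       # backpointers
    delta[0] = np.log(pi) + np.log(B[:, words[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)   # scores[i, j]: come from tag i into tag j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, words[t]])
    tags = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        tags.append(int(back[t, tags[-1]]))
    return tags[::-1]
```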
1.2 POS tagging with a linear-chain CRF [20 pts]
POS tagging is a supervised learning problem with sequential inputs and sequential outputs. Instead of using an HMM (a generative model), we could try to directly model the conditional distribution $P(\mathbf{y} \mid \mathbf{x})$ using the following form:
$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{t=1}^{T} \Psi_t(y_t, y_{t-1}, \mathbf{x}), \qquad (2)$$
where $\Psi_t(y_t, y_{t-1}, \mathbf{x})$ is a non-negative potential function that takes as input a pair of tags $y_t, y_{t-1}$ and the sequence of words $\mathbf{x}$, and $Z(\mathbf{x})$ is the normalizing constant.
1. [2 pts] Show that the conditional distribution $P(\mathbf{y} \mid \mathbf{x})$ that results from an HMM is a particular kind of linear-chain CRF (hint: write down the complete likelihood of the HMM as a product of exponentials).
2. [2 pts] Directly modeling $P(\mathbf{y} \mid \mathbf{x})$ with a CRF allows us to additionally "featurize" the model, or in other words, define potentials $\Psi_t(y_t, y_{t-1}, \mathbf{x})$ that may depend on non-trivial features extracted from the input sequence. Mathematically define a linear-chain CRF model whose potentials $\Psi_t(y_t, y_{t-1}, \mathbf{x})$ are log-linear functions of the following features (an illustrative sketch of such featurized potentials is given after this problem list):
(a) the identity of the current word, $x_t$, and the current tag, $y_t$,
(b) an indicator for whether $x_t$ is the first word in the sentence, and the current tag, $y_t$,
(c) an indicator for whether $x_t$ is the last word in the sentence, and the current tag, $y_t$,
(d) the identity of the previous word, $x_{t-1}$, and the current tag, $y_t$,
(e) the identity of the next word, $x_{t+1}$, and the current tag, $y_t$,
(f) an indicator for whether $x_t$ contains a capital letter, and the current tag, $y_t$,
(g) the identity of the previous tag, $y_{t-1}$, and the current tag, $y_t$.
3. [2 pts] Derive an inference algorithm for determining the most likely sequence of POS tags under your CRF model (hint: the algorithm should be very similar to the one you designed for HMM in 1.1).
4. [2 pts] Derive a maximum likelihood learning algorithm for your linear chain CRF.
5. [10 pts] Implement the learning and inference procedures for CRF that you derived in the previous steps.
(a) First, pre-process the training and test data to incorporate the additional features we defined for the potential functions.
(b) Implement the learning and inference algorithms as you described above. External libraries that provide complete implementations of CRF models are prohibited (you would get 0/10 credit for this part of the problem). However, you can use any external packages for solving optimization problems as building blocks for your algorithms (including PyTorch or TensorFlow).
(c) Finally, train your model on the train set and run inference on the test set. Report the following:
i. The negative log conditional likelihood (i.e., $-\log P(\mathbf{y} \mid \mathbf{x})$) of your trained CRF model on both the training and test sets.
ii. Per-word accuracy and the confusion matrix for your model predictions on the test set.
6. [2 pts] As we have seen, the HMM and the CRF are, respectively, generative and discriminative approaches to the sequence prediction problem. Discuss the advantages and disadvantages of each model. What changes if we have limited labeled data?
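To make the featurized, log-linear potentials in problem 2 above concrete, here is a minimal sketch of scoring and training a linear-chain CRF. It assumes that per-position binary feature vectors (for features (a)-(f)) have already been extracted and that PyTorch is used for automatic differentiation; the class and tensor names are illustrative, not part of the assignment.

```python
import torch

class LinearChainCRF(torch.nn.Module):
    """Log-linear linear-chain CRF: per-position features plus a tag-transition table."""

    def __init__(self, n_feats, n_tags):
        super().__init__()
        self.W = torch.nn.Parameter(torch.zeros(n_feats, n_tags))  # weights for features (a)-(f)
        self.T = torch.nn.Parameter(torch.zeros(n_tags, n_tags))   # weights for feature (g): (y_{t-1}, y_t)

    def _unary(self, phi):
        # phi: (T, n_feats) 0-1 feature matrix -> (T, n_tags) log-potentials
        return phi @ self.W

    def log_partition(self, phi):
        """log Z(x) via the forward algorithm in log space."""
        unary = self._unary(phi)                 # (T, K)
        alpha = unary[0]                         # (K,)
        for t in range(1, unary.shape[0]):
            # alpha_j = logsumexp_i (alpha_i + T_ij) + unary[t, j]
            alpha = torch.logsumexp(alpha.unsqueeze(1) + self.T, dim=0) + unary[t]
        return torch.logsumexp(alpha, dim=0)

    def neg_log_likelihood(self, phi, tags):
        """-log P(y | x) for one sentence; tags is a LongTensor of length T."""
        unary = self._unary(phi)
        score = unary[torch.arange(len(tags)), tags].sum()   # unary (feature) score of the gold path
        score = score + self.T[tags[:-1], tags[1:]].sum()    # transition score of the gold path
        return self.log_partition(phi) - score
```

Maximum likelihood training (problem 4) then amounts to minimizing the summed negative log conditional likelihood with any gradient-based optimizer, while the decoder for problem 3 is obtained by replacing logsumexp with a max (plus backpointers), exactly as in the Viterbi sketch of Section 1.1.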
2 Variational Inference (VI) [40 pts] (Hao)
2.1 Warm-up: VI vs. MCMC [10 pts]
We will compare Variational Inference (VI) with Markov Chain Monte Carlo (MCMC), the two most popular methods for approximate inference. Below is a list of algorithmic properties and problem cases. For each item, indicate whether it is better matched to VI or MCMC, and briefly justify your answer in 1-2 sentences.
1. [1 point] Inference results are generally closer to target distributions.
2. [1 point] Non-parametric.
3. [1 point] Amenable to batched computation using GPUs.
4. [1 point] Transforms inference into an optimization problem.
5. [1 point] Easier to integrate with back-propagation.
6. [1 point] Involves more stochasticity.
7. [1 point] Easier to set the termination condition for the computational loop.
8. [1 point] Higher variance under limited computational resources.
9. [1 point] Problem case: Estimating a topic model with online streaming text data.
10. [1 point] Problem case: Estimating a topic model from a very small text corpus.
11. [Bonus, 1 point] Problem case: Posterior inference with non-conjugate distributions.
12. [Bonus, 1 point] Fits more naturally with the stick-breaking process.
2.2 VI in Practice [30 pts]
2.2.1 A Review of the Vanilla LDA [9 pts]
We have covered Latent Dirichlet Allocation (LDA) [3], one of the most successful topic models, in class. Given a collection of text documents $\mathcal{D} = \{\mathbf{w}_d\}_{d=1}^{D}$, we denote by $\mathbf{w}_d$ the $d$th document in the corpus, with $\mathbf{w}_d = \{w_{di}\}_{i=1}^{N_d}$, where $w_{di}$ is the $i$th word in $\mathbf{w}_d$ and $N_d$ is the number of words in $\mathbf{w}_d$. LDA assumes each document is generated from a mixture of $K$ topics with mixture proportions $\theta_d$, where each topic is a multinomial distribution over the $V$ words in the vocabulary, parametrized by $\beta_k$. The generative process of LDA is therefore as follows:
• For $k = 1 \to K$:
– Draw topic $\beta_k \sim \text{Dirichlet}(\eta)$.
• For each document $\mathbf{w}_d \in \mathcal{D}$:
– Draw the per-document mixture proportion $\theta_d \sim \text{Dirichlet}(\alpha)$.
– For each word $w_{di} \in \mathbf{w}_d$:
∗ Sample a topic indicator $z_{di} \sim \text{Multinomial}(\theta_d)$.
∗ Sample the word $w_{di} \sim \text{Multinomial}(\beta_{z_{di}})$,
where we have used $\eta$ and $\alpha$ as the parameters of the two Dirichlet priors on $\beta$ and $\theta$, respectively. The graphical model diagram of LDA is shown in Figure 1, and a short code sketch of the generative process is given after the figure. Answer the following questions about LDA (no need to explain).
Figure 1: The graphical model diagram for LDA (plate notation with plates over the $K$ topics, the $D$ documents, and the $N_d$ words in each document; nodes $\alpha$, $\theta_d$, $z_{di}$, $w_{di}$, $\beta_k$, $\eta$).
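As a concrete reference for the questions below, here is a minimal sketch of the generative process above using NumPy; the corpus sizes, document-length distribution, and symmetric hyper-parameter values are illustrative placeholders, not part of the model specification.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, D = 10, 5000, 100          # topics, vocabulary size, documents (illustrative)
eta, alpha = 0.1, 0.1            # symmetric Dirichlet hyper-parameters (illustrative)

# Topics: beta_k ~ Dirichlet(eta), one distribution over the V words per topic.
beta = rng.dirichlet(np.full(V, eta), size=K)                  # shape (K, V)

corpus = []
for d in range(D):
    N_d = rng.poisson(80)                                      # document length (not part of LDA itself)
    theta_d = rng.dirichlet(np.full(K, alpha))                 # per-document topic proportions
    z_d = rng.choice(K, size=N_d, p=theta_d)                   # topic indicator for every word
    w_d = np.array([rng.choice(V, p=beta[z]) for z in z_d])    # words drawn from their topics
    corpus.append(w_d)
```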
1. [1 point] True or False: Solving LDA involves posterior inference with non-conjugate distributions.
2. [1 point] True or False: Solving LDA using Variational EM involves non-convex optimization in the E-step.
3. [2 points] Count the number of model parameters of LDA using the notation defined above.
4. [2 points] Write down the computational complexity of mean-field VI for LDA using Big O notation.
5. [2 points] Write down the memory complexity of mean-field VI for LDA using Big O notation.
6. [1 point] True or False: The vanilla LDA can be used to extract word embeddings.
2.2.2 More HMMs [21 pts]
The vanilla LDA assumes each word is generated independently given the topic vectors. In this question, we will try to enhance it by further taking into account the order of the words during the document generation process, since in many natural languages the topic of the word at index $i$ may depend on the topic of its previous word (at index $i-1$). A natural and powerful approach to modeling order is to (once again) use the Hidden Markov Model (HMM). To incorporate an HMM into LDA, we assume a transition matrix $T$ among the $K$ topics, where the entry in the $m$th row and $n$th column, denoted $T_{mn}$, represents the probability of transitioning from the $m$th topic to the $n$th topic. With $T$ introduced, it is convenient to model sequential dependencies in the hidden space: instead of sampling the topic of each word independently, we let the topic indicator $z_{di}$ of the word $w_{di}$ ($i > 1$) be sampled from $\theta_d$ with probability $p$ ($0 < p < 1$), and be transitioned from the previous topic $z_{d(i-1)}$ (the topic indicator of the word $w_{d(i-1)}$) with probability $1-p$. We therefore alter the generative process as follows (a code sketch is given after this description):
• For each document $\mathbf{w}_d$:
– Sample $p_d \sim \text{Beta}(\gamma)$.
– Sample $\theta_d \sim \text{Dirichlet}(\alpha)$.
– For the first word (the head of the HMM), sample $z_{d1} \sim \text{Multinomial}(\theta_d)$, $w_{d1} \sim \text{Multinomial}(\beta_{z_{d1}})$.
– For each word $w_{di}$ ($i > 1$):
∗ Sample $\delta_{di} \sim \text{Bernoulli}(p_d)$.
∗ If $\delta_{di} = 1$: sample $z_{di} \sim \text{Multinomial}(\theta_d)$, then $w_{di} \sim \text{Multinomial}(\beta_{z_{di}})$.
∗ If $\delta_{di} = 0$: sample $z_{di} \sim \text{Multinomial}(T_{z_{d(i-1)},:})$, then $w_{di} \sim \text{Multinomial}(\beta_{z_{di}})$,
where we use $T_{m,:}$ to denote the $m$th row of $T$. The posteriors of this model are still intractable, and we will use variational EM to perform posterior inference and estimate the model parameters.
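As referenced above, the following minimal sketch illustrates the altered per-document sampling, extending the vanilla LDA sketch from Section 2.2.1; the sizes K and V, the Beta parameters, and the randomly drawn transition matrix T are illustrative assumptions, not quantities fixed by the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V = 10, 5000                                   # topics, vocabulary size (illustrative)
eta, alpha, gamma = 0.1, 0.1, (1.0, 1.0)          # hyper-parameters; gamma = Beta parameters (illustrative)

beta = rng.dirichlet(np.full(V, eta), size=K)     # topics, exactly as in vanilla LDA
T = rng.dirichlet(np.full(K, 1.0), size=K)        # T[m, n] = P(topic m -> topic n) (illustrative draw)

def generate_document(N_d):
    """Sample one document of length N_d under the HMM-LDA generative process."""
    p_d = rng.beta(*gamma)                        # per-document probability of drawing a fresh topic
    theta_d = rng.dirichlet(np.full(K, alpha))
    z, w = np.empty(N_d, dtype=int), np.empty(N_d, dtype=int)
    z[0] = rng.choice(K, p=theta_d)               # head of the chain: always drawn from theta_d
    w[0] = rng.choice(V, p=beta[z[0]])
    for i in range(1, N_d):
        if rng.random() < p_d:                    # delta_di = 1: fresh topic from theta_d
            z[i] = rng.choice(K, p=theta_d)
        else:                                     # delta_di = 0: transition from the previous topic
            z[i] = rng.choice(K, p=T[z[i - 1]])
        w[i] = rng.choice(V, p=beta[z[i]])
    return w, z
```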
1. [3 points] Identify (a) the set of observed variables, (b) the set of hidden variables, (c) the set of model parameters to be estimated.
2. [3 points] Based on the altered generative process and Figure 1, draw the new graphical model for HMM-enhanced LDA (HMM-LDA).
3. [2 points] Write down the joint distribution defined by this generative process.
4. [3 points] Define variational distributions over hidden variables using proper distributions and variational parameters.
5. [2 points] Derive the ELBO using variational distributions.
6. [4 points] Derive the update rules for all variational parameters, which will be used in the E step.
7. [4 points] Derive the update rules for all model parameters, which will be applied in the M step.
8. [Bonus, 6 points] Implement the derived variational EM algorithm for HMM-LDA using Python (≥ 3.5). Train the model on the provided Wikipedia corpus (more details about the dataset can be found in the bundled readme.txt). Set the number of topics K = 10.
9. [Bonus, 4 points] Visualize each of the 10 learned topics by printing its top 10 words with the highest probabilities, both in your script and in your write-up.
If you choose to do questions 8 and 9, please read the implementation notes below carefully before submitting your code:
1. Submit only a single Python script named YourAndrewID.py and a requirements.txt listing all of your Python dependencies. Assuming the dataset file is placed in the same directory as your script, make sure your script runs (without bugs) in a clean Python (≥ 3.5) virtual environment with the following two commands:
pip install -r requirements.txt
python YourAndrewID.py
2. Tune your hyper-parameters properly to make sure the model converges. Print the value of the ELBO to stdout after each iteration of the E-step and M-step during training. Report the iteration at which convergence occurs, and print the value of the ELBO after convergence to stdout.
3. Visualize the learned K topics after convergence in both stdout and your write-up, in a way similar to the upper part of Figure 8 in [3].
3 Importance Sampling [10 pts] (Paul)
Given a probability distribution $p(\mathbf{x})$ over $\mathbf{x} = [x_1, \ldots, x_D] \in \mathbb{R}^D$, suppose we want to estimate $\mathbb{E}_{p(\mathbf{x})}[f(\mathbf{x})]$ using importance sampling, with $q(\mathbf{x})$ as the proposal distribution. In importance sampling, we draw $L$ i.i.d. samples $\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(L)}$ from $q(\mathbf{x})$, and we have
$$\mathbb{E}_{p(\mathbf{x})}[f(\mathbf{x})] \approx \frac{1}{\sum_{i=1}^{L} u_i} \sum_{i=1}^{L} f(\mathbf{x}^{(i)})\, u_i, \qquad (3)$$
where the (unnormalized) importance weights are $u_i = p(\mathbf{x}^{(i)}) / q(\mathbf{x}^{(i)})$. (A numerical sketch of this estimator is given at the end of this problem.)
1. [2 points] Find the mean and variance of the unnormalized importance weights, $\mathbb{E}_{q(\mathbf{x})}[u_i]$ and $\mathrm{Var}_{q(\mathbf{x})}[u_i]$. You may leave your answer in terms of $\mathbb{E}_{p(\mathbf{x})}$.
2. [3 points] Prove the following lemma: $\mathbb{E}_{p(\mathbf{x})}\!\left[\frac{p(\mathbf{x})}{q(\mathbf{x})}\right] \ge 1$, with equality if and only if $p = q$.
3. [4 points] A measure of the variability of two components of the vector $\mathbf{u} = [u_1, \ldots, u_L]$ is given by $\mathbb{E}_{q(\mathbf{x})}[(u_i - u_j)^2]$. Assume that $p \ne q$ and that both $p$ and $q$ factorize, i.e., $p(\mathbf{x}) = \prod_{i=1}^{D} p_i(x_i)$ and $q(\mathbf{x}) = \prod_{i=1}^{D} q_i(x_i)$. Show that $\mathbb{E}_{q(\mathbf{x})}[(u_i - u_j)^2]$ grows exponentially with $D$.
4. [1 point] Use the conclusion of part 3 to explain why standard importance sampling does not scale well with dimensionality and would blow up in high-dimensional cases.
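To make equation (3) concrete, here is a minimal sketch of the self-normalized estimator on a toy 1-D example; the particular choices of p, q, f, and L below are purely illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
L = 10_000

# Toy target p = N(1, 0.5^2), proposal q = N(0, 1), integrand f(x) = x^2 (illustrative choices).
p = norm(loc=1.0, scale=0.5)
q = norm(loc=0.0, scale=1.0)
f = lambda x: x ** 2

x = q.rvs(size=L, random_state=rng)       # x^(i) ~ q
u = p.pdf(x) / q.pdf(x)                   # unnormalized weights u_i = p(x^(i)) / q(x^(i))
estimate = np.sum(f(x) * u) / np.sum(u)   # self-normalized estimator, equation (3)

print(estimate)                           # should be close to E_p[x^2] = 1^2 + 0.5^2 = 1.25
```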
4 Markov Chain Monte Carlo [20 pts] (Paul)
Nowadays, statistical modelling of sports data has become an important part of sports analytics and is often a critical reference for managers in their decision-making process. In this part, we will work on a real-world example from professional sports. Specifically, we will use data from the 2013-2014 Premier League, the top-flight English professional league for men's football clubs, and build a predictive model of the number of goals scored in a single game by the two opponents. A Bayesian hierarchical model is a good candidate for this kind of modeling task. We model each team's strength (both attacking and defending) as latent variables. Then, in each game, the number of goals scored by the home team is a random variable conditioned on the attacking strength of the home team and the defending strength of the away team; similarly, the number of goals scored by the away team is a random variable conditioned on the attacking strength of the away team and the defending strength of the home team. Therefore, the distribution of the scoreline of a specific game depends on the relative strength of the home team A and the away team B, which in turn depends on the relative strength of those teams against their other opponents.
Table 1: 2013-2014 Premier League Teams
Index  Team                   Index  Team
0      Arsenal                10     Manchester United
1      Aston Villa            11     Newcastle United
2      Cardiff City           12     Norwich City
3      Chelsea                13     Southampton
4      Crystal Palace         14     Stoke City
5      Everton                15     Sunderland
6      Fulham                 16     Swansea City
7      Hull City              17     Tottenham Hotspurs
8      Liverpool              18     West Bromwich Albion
9      Manchester City        19     West Ham United
Here we consider the same model as described by Baio and Blangiardo (2010) [4]. The Premier League has 20 teams, which we index as in Table 1. Each team plays 38 matches every season (playing each of the other 19 teams home and away), for a total of 380 games in the entire season. For the $g$-th game, let $h(g)$ denote the index of the home team and $a(g)$ the index of the away team. The observed numbers of goals $(y_{g0}, y_{g1})$ of the home and away team are modeled as independent Poisson random variables:
$$y_{gj} \mid \theta_{gj} \sim \text{Poisson}(\theta_{gj}), \quad j = 0, 1, \qquad (4)$$
where $\theta_g = (\theta_{g0}, \theta_{g1})$ represents the scoring intensity in the $g$-th game for the team playing at home ($j = 0$) and away ($j = 1$), respectively. We place a log-linear model on the $\theta$s:
$$\log \theta_{g0} = home + att_{h(g)} - def_{a(g)} \qquad (5)$$
$$\log \theta_{g1} = att_{a(g)} - def_{h(g)} \qquad (6)$$
Note that team strength is broken into attacking and defending strength, and $home$ represents the home-team advantage, which in this model is assumed to be constant across teams. The prior on $home$ is a normal distribution:
$$home \sim \mathcal{N}(0, \tau_0^{-1}), \qquad (7)$$
where we set the precision $\tau_0 = 0.0001$.
The team-specific attacking and defending effects are modeled as:
$$att_t \sim \mathcal{N}(\mu_{att}, \tau_{att}^{-1}) \qquad (8)$$
$$def_t \sim \mathcal{N}(\mu_{def}, \tau_{def}^{-1}) \qquad (9)$$
We use conjugate priors as the hyper-priors on the attacking and defending means and precisions:
$$\mu_{att} \sim \mathcal{N}(0, \tau_1^{-1}) \qquad (10)$$
$$\mu_{def} \sim \mathcal{N}(0, \tau_1^{-1}) \qquad (11)$$
$$\tau_{att} \sim \text{Gamma}(\alpha, \beta) \qquad (12)$$
$$\tau_{def} \sim \text{Gamma}(\alpha, \beta) \qquad (13)$$
where the precision $\tau_1 = 0.0001$, and we set the parameters $\alpha = \beta = 0.1$.
This hierarchical Bayesian model can be represented using a directed acyclic graph as shown in Figure 2.
Figure 2: The DAG representation of the hierarchical Bayesian model (hyper-parameters $\mu_{att}, \tau_{att}, \mu_{def}, \tau_{def}$; team effects $att$, $def$, and the $home$ advantage; scoring intensities $\theta_{g0}, \theta_{g1}$; observed goals $y_{g0}, y_{g1}$). Figure adapted from [4].
The goals of each game, $\mathbf{y} = \{y_{gj} \mid g = 0, 1, \ldots, 379,\ j = 0, 1\}$, are the observed variables, while the parameters $\theta = \{home, att_0, def_0, \ldots, att_{19}, def_{19}\}$ and the hyper-parameters $\eta = (\mu_{att}, \mu_{def}, \tau_{att}, \tau_{def})$ are unobserved variables on which we need to perform inference. To ensure identifiability, we use the first team as the reference and set its attacking and defending strength to 0:
$$att_0 = def_0 = 0 \qquad (14)$$
In this question, we want to estimate the posterior means of the attacking and defending strengths of each team, i.e., $\mathbb{E}_{p(\theta,\eta \mid \mathbf{y})}[att_i]$, $\mathbb{E}_{p(\theta,\eta \mid \mathbf{y})}[def_i]$, and $\mathbb{E}_{p(\theta,\eta \mid \mathbf{y})}[home]$.
1. [4 points] Find the joint likelihood $p(\mathbf{y}, \theta, \eta)$.
2. [4 points] Write down the Metropolis-Hastings algorithm for sampling from the posterior $p(\theta, \eta \mid \mathbf{y})$, and derive the acceptance function for a proposal distribution of your choice (e.g., an isotropic Gaussian).
3. [12 points] Implement the Metropolis-Hastings algorithm to infer the posterior distribution (an illustrative code skeleton is given after the note at the end of this section). The data can be found at https://sailinglab.github.io/pgm-spring-2019/assets/assignments/hw2/premier_league_2013_2014.dat, which contains a 380 × 4 matrix. The first column is the number of goals $y_{g0}$ scored by the home team, the second column is the number of goals $y_{g1}$ scored by the away team, the third column is the index of the home team $h(g)$, and the fourth column is the index of the away team $a(g)$.
• Use an isotropic Gaussian proposal distribution $\mathcal{N}(0, \sigma^2 I)$ and use 0.1 as the starting point.
• Run the MCMC chain for 5000 steps to burn in and then collect 5000 samples with $t$ steps in between (i.e., run M-H for $5000t$ steps and collect only every $t$-th sample). This is called thinning, which reduces the autocorrelation of the MCMC samples introduced by the Markovian process. The parameter settings are $\sigma = 0.005, 0.05, 0.5$ and $t = 1, 5, 20, 50$.
• Plot the trace plot of the burn-in phase and of the MCMC samples for the latent variable $home$ using proposal distributions with different $\sigma$ and $t$.
• Estimate the rejection ratio for each parameter setting and report your results in a table.
• Comment on the results. Which parameter setting worked best for the algorithm?
• Using the results from the optimal parameter setting: (a) plot the posterior histogram of the variable $home$ from the MCMC samples; (b) plot the estimated attacking strength $\mathbb{E}_{p(\theta,\eta \mid \mathbf{y})}[att_i]$ against the estimated defending strength $\mathbb{E}_{p(\theta,\eta \mid \mathbf{y})}[def_i]$ for each team in one scatter plot. Make sure to identify the team index of each point on your scatter plot using the index-to-team mapping in Table 1.
4. [0 points] Despite what the data says, conclude that Manchester United is the best team in the Premier League!
Note: You are free to use Python or MATLAB for your implementation. You are NOT allowed to use any existing implementations of Metropolis-Hastings in this problem. Please include all the required results (figures and tables) in your write-up PDF submission, and submit your code to Gradescope separately.
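For problems 2-3 above, the following is a minimal, illustrative skeleton of a random-walk Metropolis-Hastings update around a user-supplied log posterior. It is not the required derivation or a complete solution: the function names, the flattened parameter-vector layout, and the fixed-σ random-walk proposal are assumptions made only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_hastings(log_post, x0, sigma, n_burn, n_collect, thin):
    """Random-walk Metropolis-Hastings with an isotropic Gaussian proposal.

    log_post: function mapping a flat parameter vector to its unnormalized
              log posterior, log p(theta, eta | y) + const.
    x0:       starting point (e.g., all entries set to 0.1).
    Returns the collected samples and the overall rejection ratio.
    """
    x, lp = np.asarray(x0, dtype=float), log_post(x0)
    samples, n_reject = [], 0
    n_total = n_burn + n_collect * thin
    for step in range(n_total):
        proposal = x + sigma * rng.standard_normal(x.shape)   # symmetric proposal, so the
        lp_prop = log_post(proposal)                          # Hastings ratio reduces to
        if np.log(rng.random()) < lp_prop - lp:               # min(1, p(x') / p(x))
            x, lp = proposal, lp_prop
        else:
            n_reject += 1
        if step >= n_burn and (step - n_burn) % thin == 0:
            samples.append(x.copy())                          # keep every thin-th post-burn-in sample
    return np.array(samples), n_reject / n_total
```

In this sketch, log_post would be the log of the joint likelihood from problem 1 (holding $att_0 = def_0 = 0$ fixed), and the assignment's parameter grid corresponds to calling the routine with sigma in {0.005, 0.05, 0.5} and thin in {1, 5, 20, 50}.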