Assignment 3: Probability
This assignment will give you a chance to practice probabilistic inference for some real-world problems.
Part 1: Natural Language Processing

Natural language processing (NLP) is an important research area in artificial intelligence, with a history dating back to at least the 1950s. One of the most basic problems in NLP is part-of-speech tagging, in which the goal is to mark every word in a sentence with its part of speech (noun, verb, adjective, etc.).
Figure 1: Three Bayes Nets for part of speech tagging: (a) an HMM, (b) a simplified model, and (c) a more complicated model.
This is a first step towards extracting semantics from natural language text. For example, consider the following sentence:
Her position covers a number of daily tasks common to any social director.
Part-of-speech tagging here is not easy because many of these words can take on different parts of speech depending on context. For example, position can be a noun (as in the above sentence) or a verb (as in “They position themselves near the exit”). In fact, covers, number, and tasks can all be used as either nouns or verbs, while social and common can be nouns or adjectives, and daily can be an adjective, noun, or adverb. The correct labeling for the above sentence is:
Her position covers a   number of  daily tasks common to  any social director.
DET NOUN     VERB   DET NOUN   ADP ADJ   NOUN  ADJ    ADP DET ADJ    NOUN
where DET stands for a determiner, ADP is an adposition, ADJ is an adjective, and ADV is an adverb.1 Labeling parts of speech thus involves an understanding of the intended meaning of the words in the sentence, as well as the relationships between the words.
Fortunately, it turns out that a relatively simple Bayesian network can model the part-of-speech problem surprisingly well. In particular, consider the Bayesian network shown in Figure 1(a). This Bayes net has a set of N random variables S = {S1, ..., SN} and N random variables W = {W1, ..., WN}. The W variables represent observed words in a sentence, so Wi ∈ {w | w is a word in the English language}. The variables in S represent part-of-speech tags, so Si ∈ {VERB, NOUN, ...}. The arrows between S and W nodes model the probabilistic relationship between a given observed word and the possible parts of speech it can take on, P(Wi|Si). (For example, these distributions can model the fact that the word “dog” is a fairly common noun but a very rare verb.) The arrows between S nodes model the probability that a word of one part of speech follows a word of another part of speech, P(Si+1|Si). (For example, these arrows can model the fact that verbs are very likely to follow nouns, but are unlikely to follow adjectives.)
We can use this model to perform part-of-speech tagging as follows. Given a sentence with N words, we construct a Bayes Net with 2N nodes as above. The values of the variables W are just the words in the sentence, and then we can perform Bayesian inference to estimate the variables in S.
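Concretely, the network in Figure 1(a) says that the joint distribution factors into exactly the conditional probability tables you will estimate in step 1 below:

$P(S, W) = P(S_1) \prod_{i=1}^{N-1} P(S_{i+1} \mid S_i) \prod_{i=1}^{N} P(W_i \mid S_i)$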
Your goal in this part is to implement a part-of-speech tagger in Python, using Bayesian networks.
Data. To help you get started, we’ve prepared a large corpus of labeled training and testing data, consisting of nearly 1 million words and 50,000 sentences. The file format of the datasets is quite simple: each line consists of a word, followed by a space, followed by one of 12 part-of-speech tags: ADJ (adjective), ADV (adverb), ADP (adposition), CONJ (conjunction), DET (determiner), NOUN, NUM (number), PRON (pronoun), PRT (particle), VERB, X (foreign word), and . (punctuation mark). Sentence boundaries are indicated by blank lines.2

1If you didn’t know the term “adposition”, neither did I. The adpositions in English are prepositions; in many languages, there are postpositions too. But you won’t need to understand the linguistic theory behind these parts of speech to complete the assignment; if you’re curious, you might check out the “Part of Speech” Wikipedia article for some background.
1. First you’ll need to estimate the conditional probability tables encoded in the Bayesian network above, namely P(S1), P(Si+1|Si), and P(Wi|Si). To do this, use the labeled training corpus file we’ve provided. (A code sketch covering steps 1 through 3 appears after this list.)
2. Your goal now is to label new sentences with parts of speech, using the probability distributions learned in step 1. To get started, consider the simplified Bayes net in Figure 1(b). To perform part-of-speech tagging, we’ll want to estimate the most-probable tag $s_i^*$ for each word $W_i$,

$s_i^* = \arg\max_{s_i} P(S_i = s_i \mid W).$

Implement part-of-speech tagging using this simple model, and display the result as well as the actual marginal probability $P(S_i = s_i^* \mid W)$ for each chosen state. (Note that this probability is like a “confidence” score, indicating how certain the algorithm feels about each word’s part of speech.) Hint: This is easy; if you don’t see why, try running Variable Elimination by hand on the Bayes net in Figure 1(b).
3. Now consider the Bayes net of Figure 1(a), which is a better model that incorporates dependencies between words. Implement the Viterbi algorithm to find the maximum a posteriori (MAP) labeling for the sentence – i.e. the most likely state sequence:

$(s_1^*, \ldots, s_N^*) = \arg\max_{s_1, \ldots, s_N} P(S_1 = s_1, \ldots, S_N = s_N \mid W).$
4. Consider the Bayes net of Figure 1(c), which is a better model of language because it incorporates some longer-term dependencies between words. However, it’s no longer an HMM, so one can’t use Viterbi. Use Variable Elimination to estimate the best labeling for each word (by picking the maximum marginal for each word, $s_i^* = \arg\max_{s_i} P(S_i = s_i \mid W)$, as in step 2), as well as the “confidence score” for each word, $P(S_i = s_i^* \mid W)$.
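To make these steps concrete, here is a minimal Python sketch of steps 1 through 3. It assumes sentences have already been parsed into lists of (word, tag) pairs; the function names, smoothing constants, and overall structure are illustrative choices for this write-up, not the interfaces of the supplied skeleton code in pos_solver.py:

import math
from collections import defaultdict

TAGS = ["adj", "adv", "adp", "conj", "det", "noun",
        "num", "pron", "prt", "verb", "x", "."]

def train(sentences):
    # Step 1: estimate P(S1), P(S_{i+1}|S_i), and P(W_i|S_i) by counting.
    # `sentences` is a list of [(word, tag), ...] lists. Add-one smoothing
    # (an illustrative choice) avoids zero probabilities for unseen events.
    init, trans, emit, tag_count = (defaultdict(int) for _ in range(4))
    for sent in sentences:
        init[sent[0][1]] += 1
        for i, (word, tag) in enumerate(sent):
            tag_count[tag] += 1
            emit[(tag, word)] += 1
            if i + 1 < len(sent):
                trans[(tag, sent[i + 1][1])] += 1
    n_sent = sum(init.values())
    n_words = sum(tag_count.values())
    p_init = {t: (init[t] + 1) / (n_sent + len(TAGS)) for t in TAGS}
    p_tag = {t: (tag_count[t] + 1) / (n_words + len(TAGS)) for t in TAGS}
    def p_trans(t, u):            # P(S_{i+1} = u | S_i = t)
        return (trans[(t, u)] + 1) / (tag_count[t] + len(TAGS))
    def p_emit(t, w):             # P(W_i = w | S_i = t), crudely smoothed
        return (emit[(t, w)] + 0.001) / (tag_count[t] + 1)
    return p_init, p_tag, p_trans, p_emit

def simplified(words, p_tag, p_emit):
    # Step 2: tag each word independently via argmax_t P(W=w|S=t) P(S=t).
    return [max(TAGS, key=lambda t: p_emit(t, w) * p_tag[t]) for w in words]

def viterbi(words, p_init, p_trans, p_emit):
    # Step 3: MAP tag sequence for the HMM of Figure 1(a), in log space.
    V = [{t: math.log(p_init[t]) + math.log(p_emit(t, words[0]))
          for t in TAGS}]
    back = []
    for w in words[1:]:
        row, ptr = {}, {}
        for u in TAGS:
            best = max(TAGS, key=lambda t: V[-1][t] + math.log(p_trans(t, u)))
            ptr[u] = best
            row[u] = (V[-1][best] + math.log(p_trans(best, u))
                      + math.log(p_emit(u, w)))
        V.append(row)
        back.append(ptr)
    seq = [max(TAGS, key=lambda t: V[-1][t])]   # best final state...
    for ptr in reversed(back):                  # ...then follow back-pointers
        seq.append(ptr[seq[-1]])
    return list(reversed(seq))

Note that for the simplified model of Figure 1(b), the per-word confidence $P(S_i = s_i^* \mid W)$ is just the normalized version of the same product that simplified() maximizes.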
Your program should be called label.py and should take as input two filenames: a training file and a testing file. The program should use the training corpus for Step 1, and then display the output of Steps 2 through 4 on each sentence in the testing file. It should also display the logarithm of the posterior probability for each solution it finds, as well as a running evaluation showing the percentage of words and whole sentences that have been labeled correctly according to the ground truth. For example:
[djcran@raichu djc-sol]$ python label.py training_file testing_file
Learning model...
Loading test data...
Testing classifiers...
                         : Magnus ab integro seclorum nascitur ordo .
0. Ground truth (-143.52): noun verb adv conj noun noun .
1. Simplified   (-146.10): noun verb adv conj adv noun .
                         : 1.00 1.00 1.00 1.00 0.26 0.74 .
2. HMM          (-142.68): noun verb adv conj det noun .
3. Complex      (-142.70): noun verb adv conj det noun .
                         : 1.00 1.00 1.00 1.00 0.26 0.74 .
== So far scored 1 sentences with 17 words.
                   Words correct:   Sentences correct:
0. Ground truth:       100.00%           100.00%
1. Simplified:          83.33%             0.00%
2. HMM:                 83.33%             0.00%
3. Complex:             83.33%             0.00%

2This dataset is based on the Brown corpus. Modern part-of-speech taggers often use a much larger set of tags – often over 100 tags, depending on the language of interest – that carry finer-grained information like the tense and mood of verbs, whether nouns are singular or plural, etc. In this assignment we’ve simplified the set of tags to the 12 described here; the simple tag set is due to Petrov, Das and McDonald, and is discussed in detail in their 2012 LREC paper if you’re interested.

Figure 2: (a) Sample input image, (b) corresponding edge strength map, and (c) highlighted ridge.
---
We’ve already implemented some skeleton code to get you started, in three files: label.py, which is the main program; pos_scorer.py, which has the scoring code; and pos_solver.py, which will contain the actual part-of-speech estimation code. You should only modify the last of these files; the current version of pos_solver.py we’ve supplied is very simple, as you’ll see.
In your report, please make sure to include your results (accuracies) for each technique on the test file we’ve supplied, bc.test.
Part 2: Mountain finding
Given a photo, how can you determine where on Earth it was taken? If the photo has GPS metadata, this is of course trivial, and if it’s a photo of a major tourist attraction, it’s relatively easy to use image matching to compare it to a library of many tourist sites. But what if it’s a photo like that of Figure 2(a)? It turns out that one relatively powerful cue is the shape of the mountains you see in the distance. The shape of these mountains potentially forms a distinctive “fingerprint” that can be matched to a digital elevation map of the world, to determine (at least roughly) where the photographer was standing.
In this part, we’ll consider the very first step in solving this problem automatically: identifying the shape of mountains in a photo. In particular, we’ll try to identify the ridgeline, i.e. the boundary between the sky and the mountain. We’ll assume relatively clean images like that of Figure 2(a), where the mountain is plainly visible, there are no other objects obstructing the mountain’s ridgeline, the mountain takes up the full horizontal dimension of the image, and the sky is relatively clear. Under these assumptions, for each column of the image we need to estimate the row of the image corresponding to the boundary position. Plotting this estimated row for each column will give a ridgeline estimate.
We’ve given you some code that reads in an image file and produces an “edge strength map” that measures how strong the image gradient (local contrast) is at each point. We could assume that the stronger the image gradient, the higher the chance that the pixel lies along the ridgeline. So for an m×n image, this is a 2-d function that we’ll call I(x, y), measuring the strength at each pixel (x, y) ∈ [1, m] × [1, n] in the original image. Your goal is to estimate, for each column x ∈ [1, m], the row sx corresponding to the ridgeline. We can solve this problem using a Bayes net, where s1, s2, ..., sm correspond to the hidden variables, and the gradient data are the observed variables (specifically w1, w2, ..., wm, where w1 is a vector corresponding to column 1 of the gradient image).
1. Perhaps the simplest approach would be to use the Bayes net in Figure 1(b). Implement this technique, showing the result of the detected boundary with a red line in the output image.
2. A better approach would use the Bayes net in Figure 1(a). Use MCMC to do this, showing the result of the detected boundary with a blue line in the output image. (Hint: What should your transition probabilities look like? It’s up to you, but you probably want to use a distribution that encourages “smoothness,” i.e. that P(si+1|si) is high if si+1 = si and is low if they are very different. How about the emission probabilities? Again, this is for you to decide, but intuitively a higher gradient should correspond to higher probability. One possible sampler is sketched after this list.)
3. A simple way of improving the results further would be to incorporate some human feedback in the process. Assume that a human gives you a single (x,y) point that is known to lie on the ridgeline. Modify your answer to step 2 to incorporate this additional information. (Hint: you can do this by just tweaking the HMM’s probability distributions – no need to change the algorithms themselves!) Show the detected boundary in green.
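For step 2, one possible shape for the MCMC is a Gibbs sampler that repeatedly resamples each column’s row from its full conditional given its two neighboring columns. The Gaussian smoothness prior and gradient-proportional emission model below are illustrative choices (the hints above leave them up to you), and gibbs_ridge is a hypothetical helper name, not part of the supplied code:

import math
import random

def gibbs_ridge(edge_strength, iters=1000, sigma=2.0):
    # edge_strength[y][x] is the gradient magnitude at row y, column x.
    n, m = len(edge_strength), len(edge_strength[0])
    # Initialize with the per-column argmax (step 1's answer).
    ridge = [max(range(n), key=lambda y: edge_strength[y][x])
             for x in range(m)]
    def trans(a, b):   # smoothness prior: favor nearby rows next door
        return math.exp(-((a - b) ** 2) / (2 * sigma ** 2))
    def emit(y, x):    # stronger gradient -> higher probability
        return edge_strength[y][x] + 1e-6
    counts = [[0] * n for _ in range(m)]   # per-column sample counts
    for it in range(iters):
        for x in range(m):
            # Full conditional of s_x given its neighbors and column x.
            weights = []
            for y in range(n):
                w = emit(y, x)
                if x > 0:
                    w *= trans(ridge[x - 1], y)
                if x + 1 < m:
                    w *= trans(y, ridge[x + 1])
                weights.append(w)
            ridge[x] = random.choices(range(n), weights=weights)[0]
            if it >= iters // 2:           # record after burn-in
                counts[x][ridge[x]] += 1
    # Report the most frequently sampled row in each column.
    return [max(range(n), key=lambda y: counts[x][y]) for x in range(m)]

Step 3’s human-labeled point can then be folded in by clamping that one column: make the emission probability in column col_coord (nearly) zero for every row except row_coord, leaving the sampler itself unchanged, just as the hint above suggests.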
Your program should be run like this:
python ./mountain.py input_file.jpg output_file.jpg row_coord col_coord
where row_coord and col_coord are the image row and column coordinates of the human-labeled pixel, and output_file.jpg should show the original image with the red, green, and blue lines superimposed.
Run your code on our sample images (and feel free to try other sample images of your own), and please commit the output images to your repo.