Reading: Russell & Norvig, Chapter 13.

Your shopping cart is empty.

Reading: Russell & Norvig, Chapter 13.
1.1 Conditioningonbackgroundevidence
It is often useful to consider the impact of speciﬁc events in the context of general background evidence, rather than in the absence of information.
(a) Denoting such evidence by E, prove the conditionalized version of the product rule: P(X,Y|E) = P(X|Y,E)P(Y|E). (b) Also, prove the conditionalized version of Bayes rule:
P(X|Y,E) =
P(Y|X,E)P(X|E) P(Y|E)
.
(c) Also, prove the conditionalized version of marginalization: P(X|E) = X y P(X,Y =y|E).
1.2 Conditionalindependence
Show that the following three statements about random variables X, Y , and E are equivalent: (1) P(X,Y|E) = P(X|E)P(Y|E) (2) P(X|Y,E) = P(X|E) (3) P(Y|X,E) = P(Y|E) Inotherwords,showthat(1)implies(2)&(3),that(2)implies(1)&(3),andthat(3)implies(1)&(2). You should become ﬂuent with all these ways of expressing that X is conditionally independent of Y given E.
1.3 Creativewriting
Attach events to the binary random variables X, Y , and Z that are consistent with the following patterns of commonsense reasoning. You may use different events for the different parts of the problem.
(a) Cumulative evidence:
P(X=1) < P(X=1|Y =1) < P(X=1|Y =1,Z=1)
1
(b) Explaining away:
P(X=1|Y =1) P(X=1), P(X=1|Y =1,Z=1) < P(X=1|Y =1)
(c) Conditional independence: P(X=1,Y =1) 6= P(X=1)P(Y =1) P(X=1,Y =1|Z=1) = P(X=1|Z=1)P(Y =1|Z=1)
1.4 BayesRule
Supposethat1%ofcompetitivecyclistsuseperformance-enhancingdrugsandthataparticulardrugtesthas a 5% false positive rate and a 10% false negative rate.
(a) Let D ∈ {0,1} indicate whether a cyclist is doping, and let T ∈ {0,1} indicate the outcome of the drug test. Draw the belief network for these random variables, and use the information above to deduce the (conditional) probability tables for P(D) and P(T|D). (b) Cyclist A tests negative for drug use. What is the probability that Cyclist A is not using drugs?
(c) Cyclist B tests positive for drug use. What is the probability that Cyclist B is using drugs?
1.5 Entropy (a) Let X be a discrete random variable with P(X =xi) = pi for i ∈{1,2,...,n}. The entropyH[X] of the random variable X is a measure of its uncertainty. It is deﬁned as:
H[X] = −
n X i=1
pi logpi.
ShowthattheentropyH[X]ismaximizedwhenpi = 1 n foralli. Youshoulddothisbycomputingthe gradient with respect to pi and using Lagrange multipliers to enforce the constraint thatPi pi = 1. Later in the course, we will use similar calculations for learning probabilistic models.
(b) The joint entropy of n discrete random variables (X1,X2,...,Xn) is deﬁned as: H(X1,X2,...,Xn) = −X x1 X x2 ...X xn P(x1,x2,...,xn)logP(x1,x2,...,xn), wherethesumsrangeoverallpossibleinstantiationsofX1,X2,...,Xn. ShowthatifthevariablesXi are independent, then their joint entropy is the sum of their individual entropies: namely,
P(x1,x2,...,xn) =
n Y i=1
P(xi) implies H(X1,X2,...,Xn) =
n X i=1
H(Xi).
2
1.6 Kullback-Leiblerdistance Considertwodiscreteprobabilitydistributions,pi andqi,withPpi =Pqi = 1. TheKullback-Leibler(KL) distance between these distributions (also known as the relative entropy) is deﬁned as: KL(p,q) =X i pi log(pi/qi). (a) Consider the natural logarithm (in base e). By sketching graphs of log(x) and x − 1, verify the inequality: log(x) ≤ x−1, with equality if and only if x=1. Conﬁrm this result by differentiation of log(x)−(x−1). (b) Use the previous result to prove that KL(p,q) ≥ 0, with equality if and only if the two distributions pi and qi are equal. (c) Using the inequality in (a), as well as the simple equality logx = 2log√x, derive the tighter lower bound: KL(p,q) ≥ X i (√pi −√qi)2. (d) Provide a counterexample to show that the KL distance is not a symmetric function of its arguments: KL(p,q) 6= KL(q,p). Despite this asymmetry, it is still common to refer to KL(p,q) as a measure of distance. Many algorithms for machine learning are based on minimizing KL distances between probability distributions.
1.7 Mutualinformation
The mutual information I(X,Y ) between two discrete random variables X and Y is deﬁned as I(X,Y ) = X x X y P(x,y)log? P(x,y) P(x)P(y)?, wherethesumisoverallpossiblevaluesoftherandomvariablesX andY . Notehowthemutualinformation is related to the deﬁnitions in the previous two problems.
(a) Prove that the mutual information is nonnegative.
(b) Show that the mutual information I(X,Y ) vanishes if and only if X and Y are independent random variables. (Thus, I(X,Y ) provides one quantitative measure of dependence between X and Y .)
3
1.8 Compareandcontrast
Consider the different belief networks (BNs) shown below for the discrete random variables X, Y , and Z.
belief network #3belief network #1 belief network #2
ZY
X
ZY
X
ZY
X
(a) Does the ﬁrst belief network imply a statement of marginal or conditional independence that is not implied by the second? If yes, provide an example.
(b) Doesthesecondbeliefnetworkimplyastatementofmarginalorconditionalindependencethatisnot implied by the third? If yes, provide an example.
(c) Does the third belief network imply a statement of marginal or conditional independence that is not implied by the ﬁrst? If yes, provide an example.
1.9 Hangman
Consider the belief network shown below, where the random variable W stores a ﬁve-letter word and the random variable Li ∈ {A,B,...,Z} reveals only the word’s ith letter. Also, suppose that these ﬁve-letter words are chosen at random from a large corpus of text according to their frequency:
P(W =w) =
COUNT(w) Pw0 COUNT(w0),where COUNT(w) denotes the number of times that w appears in the corpus and where the denominator is a sum over all ﬁve-letter words. Note that in this model the conditional probability tables for the random variables Li are particularly simple: P(Li=`|W =w) = ( 1 if ` is the ith letter of w, 0 otherwise.
Nowimagineagameinwhichyouareaskedtoguesstheword w oneletteratatime. Therulesofthisgame are as follows: after each letter (A through Z) that you guess, you’ll be told whether the letter appears in the word and also where it appears. Given the evidence that you have at any stage in this game, the critical question is what letter to guess next.
4
L1 L2
W
L3 L4 L5
Let’s work an example. Suppose that after three guesses—the letters D, I, M—you’ve learned that the letter I does not appear, and that the letters D and M appear as follows:
M D M
Now consider your next guess: call it `. In this game the best guess is the letter ` that maximizes P?L2=` or L4=`? ? ?L1=M,L3=D,L5=M,L26∈{D,I,M},L46∈{D,I,M}?.In other works, pick the letter ` that is most likely to appear in the blank (unguessed) spaces of the word. For any letter ` we can compute this probability as follows: P?L2=` or L4=`? ? ?L1=M,L3=D,L5=M,L26∈{D,I,M},L46∈{D,I,M}?= X w P?W =w,L2=` or L4=`? ? ?L1=M,L3=D,L5=M,L26∈{D,I,M},L46∈{D,I,M}?, marginalization= X w P(W =w|L1=M,L3=D,L5=M,L26∈{D,I,M},L46∈{D,I,M}?P(L2=` or L4=`|W =w) productrule&CI where in the third line we have exploited the conditional independence (CI) of the letters Li given the word W. Inside this sum there are two terms, and they are both easy to compute. In particular, the second term is more or less trivial: P(L2=` or L4=`|W =w) = ( 1 if ` is the second or fourth letter of w 0 otherwise.
And the ﬁrst term we obtain from Bayes rule: P(W =w|L1=M,L3=D,L5=M,L26∈{D,I,M},L46∈{D,I,M}?
=
P(L1=M,L3=D,L5=M,L26∈{D,I,M},L46∈{D,I,M}|W =w?P(W =w) P(L1=M,L3=D,L5=M,L26∈{D,I,M},L46∈{D,I,M})
Bayesrule
In the numerator of Bayes rule are two terms; the left term is equal to zero or one (depending on whether the evidence is compatible with the word w), and the right term is the prior probability P(W = w), as
5
determined by the empirical word frequencies. The denominator of Bayes rule is given by: P(L1=M,L3=D,L5=M,L26∈{D,I,M},L46∈{D,I,M}) = X w P(W =w,L1=M,L3=D,L5=M,L26∈{D,I,M},L46∈{D,I,M}), marginalization = X w P(W =w)P(L1=M,L3=D,L5=M,L26∈{D,I,M},L46∈{D,I,M}|W =w), productrule where again all the right terms inside the sum are equal to zero or one. Note that the denominator merely sums the empirical frequencies of words that are compatible with the observed evidence. Now let’s consider the general problem. Let E denote the evidence at some intermediate round of the game: in general, some letters will have been guessed correctly and their places revealed in the word, while other letters will have been guessed incorrectly and thus revealed to be absent. There are two essential computations. The ﬁrst is the posterior probability, obtained from Bayes rule:
P(W =w|E) =
P(E|W =w)P(W =w)P w0 P(E|W =w0)P(W =w0)
.
The second key computation is the predictive probability, based on the evidence, that the letter ` appears somewhere in the word: P?Li=`for somei∈{1,2,3,4,5}? ? ?E? = X w P?Li=`for somei∈{1,2,3,4,5}? ? ?W =w?P?W =w? ? ?E?. Note in particular how the ﬁrst computation feeds into the second. Your assignment in this problem is implement both of these calculations. Youmayprograminthelanguageofyourchoice.
(a) Download the ﬁle hw1 word counts 05.txt that appears with the homework assignment. The ﬁle contains a list of 5-letter words (including names and proper nouns) and their counts from a large corpus of Wall Street Journal articles (roughly three million sentences). From the counts in this ﬁle compute the prior probability P(w) = COUNT(w)/COUNTtotal. As a sanity check, print out the ﬁfteen mostfrequent 5-letter words, as well as the fourteen leastfrequent 5-letter words. Do yourresultsmakesense? (b) Consider the following stages of the game. For each of the following, indicate the best next guess— namely, the letter ` that is most likely (probable) to be among the missing letters. Also report the probability P(Li =` for some i∈{1,2,3,4,5}|E) for your guess `. Your answers should ﬁll in the last two columns of this table. (Some answers are shown so that you can check your work.) correctly guessed incorrectly guessed best next guess ` P(Li=` for some i∈{1,2,3,4,5}|E) – – – – – {} – – – – – {E, A} A – – – S {} A – – – S {I} – – O – – {A, E, M, N, T} – – – – – {E, O} I 0.6366 D – – I – {} A 0.8207 D – – I – {A} E 0.7521 – U – – – {A, E, I, O, S} Y 0.6270 6
(c) Turn in a hard-copy printout of your source code. Do not forget the source code: it is worth many points on this assignment.
More fun: The demo on Piazza (also under resources) implements this program for words of length 6-10. You will also ﬁnd count ﬁles for words of these lengths on Piazza, and if you modify your code to handle these different word lengths, you will also be able to check your answers against the demo. (This is totally optional, though.) Just to be perfectly clear, you are not required in this problem to implement a user interface or any general functionality for the game of hangman. You will only be graded on your word lists in (a), the completed table for (b), and your source code in (c).

Shopping cart

US$0

Reading: Russell & Norvig, Chapter 13.

More products