Assignment 6
6.1 Survey
This week you will receive a link to a survey on movies released since 2010; please complete it by Friday, November 11. We are collecting this data for the next assignment, in which you will build a simple movie recommendation system. The survey will be sent to the email address that is listed for you on GradeSource.
6.2 EM algorithm
[Figure: belief network over nodes A, B, C, D. As indicated by the CPTs below, A has no parents, B has parent A, C has parents A and B, and D has parents B and C.]
(a) Posterior probability
Consider the belief network shown above, with observed nodes C and D and hidden nodes A and B. Compute the posterior probability P(a,b|c,d) in terms of the CPTs of the belief network; that is, in terms of P(a), P(b|a), P(c|a,b), and P(d|b,c).
(b) Posterior probability
Compute the posterior probabilities P(a|c,d) and P(b|c,d) in terms of your answer from part (a); that is, for this problem, you may assume that P(a,b|c,d) is given.
(c) Log-likelihood
Consider a partially complete data set of i.i.d. examples {(c_t, d_t)}_{t=1}^{T} drawn from the joint distribution of the above belief network. The log-likelihood of the data set is given by:

    L = \sum_t \log P(C=c_t, D=d_t).

Compute this log-likelihood in terms of the CPTs of the belief network. You may re-use work from earlier parts of the problem.
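For concreteness, the marginalization behind this log-likelihood can be sanity-checked numerically. Below is a minimal sketch, assuming binary variables and hypothetical CPT arrays pA, pB_A, pC_AB, and pD_BC (these names and values are illustrative, not part of the assignment); for each example it sums the joint P(a, b, c_t, d_t) over the hidden nodes A and B.

```python
import numpy as np

# Hypothetical CPTs for binary A, B, C, D (values chosen only for illustration).
pA = np.array([0.6, 0.4])                           # pA[a]          = P(A=a)
pB_A = np.array([[0.7, 0.3], [0.2, 0.8]])           # pB_A[a, b]     = P(B=b|A=a)
pC_AB = np.random.dirichlet([1, 1], size=(2, 2))    # pC_AB[a, b, c] = P(C=c|A=a,B=b)
pD_BC = np.random.dirichlet([1, 1], size=(2, 2))    # pD_BC[b, c, d] = P(D=d|B=b,C=c)

def log_likelihood(data):
    """data: list of observed pairs (c_t, d_t); A and B are hidden."""
    L = 0.0
    for c, d in data:
        # Marginalize the joint distribution over the hidden nodes A and B.
        p_cd = sum(pA[a] * pB_A[a, b] * pC_AB[a, b, c] * pD_BC[b, c, d]
                   for a in (0, 1) for b in (0, 1))
        L += np.log(p_cd)
    return L

print(log_likelihood([(0, 1), (1, 1), (0, 0)]))
```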
(d) EM algorithm
Give the EM updates to estimate CPTs that maximize the log-likelihood in part (c); in particular, complete the numerator and denominator in the expressions below for the update rules. Simplify your answers as much as possible, expressing them in terms of the posterior probabilities P(a,b|c_t,d_t), P(a|c_t,d_t), and P(b|c_t,d_t), as well as the functions I(c, c_t) and I(d, d_t).
    P(A=a) ← ____________ / ____________

    P(B=b|A=a) ← ____________ / ____________

    P(C=c|A=a,B=b) ← ____________ / ____________

    P(D=d|B=b,C=c) ← ____________ / ____________
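As a rough illustration of what these updates compute, here is a minimal sketch of one EM iteration, using the same hypothetical CPT array names as in the earlier snippet (pA, pB_A, pC_AB, pD_BC are illustrative, not part of the assignment). It follows the generic recipe from lecture: accumulate expected counts using the posterior over the hidden nodes, then normalize.

```python
import numpy as np

# Hypothetical CPTs (uniform initialization; shapes as in the previous snippet).
pA = np.full(2, 0.5)
pB_A = np.full((2, 2), 0.5)
pC_AB = np.full((2, 2, 2), 0.5)
pD_BC = np.full((2, 2, 2), 0.5)

def em_step(data):
    """One EM iteration: expected counts from P(a,b|c_t,d_t), then normalization."""
    global pA, pB_A, pC_AB, pD_BC
    nA = np.zeros(2)
    nAB = np.zeros((2, 2))
    nABC = np.zeros((2, 2, 2))
    nBCD = np.zeros((2, 2, 2))
    for c, d in data:
        # E-step: posterior over the hidden nodes A, B for this example.
        joint = np.array([[pA[a] * pB_A[a, b] * pC_AB[a, b, c] * pD_BC[b, c, d]
                           for b in (0, 1)] for a in (0, 1)])
        post = joint / joint.sum()               # post[a, b] = P(a, b | c_t, d_t)
        nA += post.sum(axis=1)
        nAB += post
        nABC[:, :, c] += post
        nBCD[:, c, d] += post.sum(axis=0)
    # M-step: normalize expected counts into new CPTs.
    pA = nA / nA.sum()
    pB_A = nAB / nAB.sum(axis=1, keepdims=True)
    pC_AB = nABC / nABC.sum(axis=2, keepdims=True)
    pD_BC = nBCD / nBCD.sum(axis=2, keepdims=True)

em_step([(0, 1), (1, 1), (0, 0)])
```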
6.3 EM algorithm for noisy-OR
Consider the belief network shown below, with binary random variables X ∈ {0,1}^n and Y ∈ {0,1} and a noisy-OR conditional probability table (CPT). The noisy-OR CPT is given by:

    P(Y=1|X) = 1 - \prod_{i=1}^{n} (1 - p_i)^{X_i},

which is expressed in terms of the noisy-OR parameters p_i ∈ [0,1].
[Figure: noisy-OR belief network with parent nodes X_1, X_2, X_3, ..., X_n and a single child node Y.]
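As a quick illustration of this CPT (not required by the assignment), the following sketch evaluates P(Y=1|X=x) for a hypothetical parameter vector; the name noisy_or_prob and the numbers are made up for the example.

```python
import numpy as np

def noisy_or_prob(p, x):
    """P(Y=1 | X=x) = 1 - prod_i (1 - p_i)^{x_i} for binary x."""
    p = np.asarray(p, dtype=float)
    x = np.asarray(x, dtype=float)
    return 1.0 - np.prod((1.0 - p) ** x)

# Example: three parents, only X_1 and X_3 are "on".
print(noisy_or_prob([0.9, 0.5, 0.2], [1, 0, 1]))   # 1 - (0.1)(0.8) = 0.92
```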
In this problem, you will derive and implement an EM algorithm for estimating the noisy-OR parameters p_i. It may seem that the EM algorithm is not suited to this problem, in which all the nodes are observed, and the CPT has a parameterized form. In fact, the EM algorithm can be applied, but first we must express the model in a different but equivalent form.
Consider the belief network shown below. In this network, a binary random variable Z_i ∈ {0,1} intercedes between each pair of nodes X_i and Y. Suppose that:

    P(Z_i=1|X_i=0) = 0,
    P(Z_i=1|X_i=1) = p_i.

Also, let the node Y be determined by the logical-OR of the Z_i. In other words:

    P(Y=1|Z) = \begin{cases} 1 & \text{if } Z_i=1 \text{ for any } i, \\ 0 & \text{if } Z_i=0 \text{ for all } i. \end{cases}

[Figure: extended belief network in which each X_i has a child Z_i, and Y is the logical-OR of Z_1, Z_2, Z_3, ..., Z_n.]
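Part (a) below can also be checked numerically before working out the algebra. The sketch that follows sums P(Y=1, Z|X=x) over all 2^n settings of Z in the extended network and compares the result to the noisy-OR CPT; the function name and the test parameters are made up for illustration.

```python
import itertools
import numpy as np

def p_y1_given_x_extended(p, x):
    """Marginalize over Z: sum_Z P(Y=1, Z | X=x) in the extended network."""
    n = len(p)
    total = 0.0
    for z in itertools.product((0, 1), repeat=n):
        # P(Z|X) factorizes over i; P(Z_i=1|X_i=x_i) is p_i if x_i=1, else 0.
        pz = 1.0
        for i in range(n):
            q = p[i] * x[i]
            pz *= q if z[i] == 1 else 1.0 - q
        y1 = 1.0 if any(z) else 0.0            # P(Y=1|Z) is a deterministic OR
        total += pz * y1
    return total

p, x = [0.9, 0.5, 0.2], [1, 0, 1]
direct = 1.0 - np.prod((1.0 - np.asarray(p)) ** np.asarray(x))
print(p_y1_given_x_extended(p, x), direct)     # both should be (approximately) 0.92
```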
(a) Show that this “extended” belief network defines the same conditional distribution P(Y|X) as the original one. In particular, starting from

    P(Y=1|X) = \sum_{Z \in \{0,1\}^n} P(Y=1, Z|X),

show that the right hand side of this equation reduces to the noisy-OR CPT with parameters p_i. To perform this marginalization, you will need to exploit various conditional independence relations.

(b) Consider estimating the noisy-OR parameters p_i to maximize the (conditional) likelihood of the observed data. The (normalized) log-likelihood in this case is given by:
    L = \frac{1}{T} \sum_{t=1}^{T} \log P\big(Y = y^{(t)} \,\big|\, X = \vec{x}^{(t)}\big),
where (\vec{x}^{(t)}, y^{(t)}) is the t-th joint observation of X and Y, and where for convenience we have divided the overall log-likelihood by the number of examples T. From your result in part (a), it follows that we can estimate the parameters p_i in either the original network or the extended one (since in both networks they would be maximizing the same equation for the log-likelihood).

Notice that in the extended network, we can view X and Y as observed nodes and Z as hidden nodes. Thus in this network, we can use the EM algorithm to estimate each parameter p_i, which simply defines one row of the “look-up” CPT for the node Z_i.

Compute the posterior probability that appears in the E-step of this EM algorithm. In particular, for joint observations x ∈ {0,1}^n and y ∈ {0,1}, use Bayes rule to show that:

    P(Z_i=1, X_i=1 | X=x, Y=y) = \frac{y\, x_i\, p_i}{1 - \prod_j (1-p_j)^{x_j}}.

(c) For the data set {\vec{x}^{(t)}, y^{(t)}}_{t=1}^{T}, show that the EM update for the parameters p_i is given by:

    p_i \leftarrow \frac{1}{T_i} \sum_t P\big(Z_i=1, X_i=1 \,\big|\, X=\vec{x}^{(t)}, Y=y^{(t)}\big),

where T_i is the number of examples in which X_i=1. (You should derive this update as a special case of the general form presented in lecture.)

(d) Download the data files on the course web site, and use the EM algorithm to estimate the parameters p_i. The data set [1] has T = 267 examples over n = 23 inputs. To check your solution, initialize all p_i = 0.05 and perform 256 iterations of the EM algorithm. At each iteration, compute the log-likelihood shown in part (b). (If you have implemented the EM algorithm correctly, this log-likelihood will always increase from one iteration to the next.) Also compute the number of mistakes M ≤ T made by the model at each iteration; a mistake occurs either when y_t = 0 and P(y_t=1|\vec{x}_t) ≥ 0.5 (indicating a false positive) or when y_t = 1 and P(y_t=1|\vec{x}_t) ≤ 0.5 (indicating a false negative). The number of mistakes should generally decrease as the model is trained, though it is not guaranteed to do so at each iteration. Complete the following table:

    iteration    number of mistakes M    log-likelihood L
        0                 175                -0.95809
        1                  56
        2                                    -0.40822
        4
        8
       16
       32
       64                  37
      128
      256                                    -0.31016

You may use the already completed entries of this table to check your work.

(e) Turn in your source code. As always, you may program in the language of your choice.
[1] For those interested, more information about this data set is available at http://archive.ics.uci.edu/ml/datasets/SPECT+Heart. However, be sure to use the data files provided on the course web site, as they have been specially assembled for this assignment.
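A minimal sketch of the EM loop in part (d) is given below. It assumes the data are stored as whitespace-delimited text files; the file names spectX.txt and spectY.txt are placeholders for whatever files the course web site actually provides, and the other names are illustrative.

```python
import numpy as np

# Placeholder file names; substitute the data files from the course web site.
X = np.loadtxt("spectX.txt")              # shape (T, n): binary inputs x^(t)
y = np.loadtxt("spectY.txt")              # shape (T,):   binary labels y^(t)
T, n = X.shape

p = np.full(n, 0.05)                      # initialize all p_i = 0.05
Ti = X.sum(axis=0)                        # T_i = number of examples with X_i = 1

def p_y1(p, X):
    """Noisy-OR probability P(Y=1|x) = 1 - prod_i (1-p_i)^{x_i}, for every row of X."""
    return 1.0 - np.prod((1.0 - p) ** X, axis=1)

for it in range(257):
    prob1 = np.clip(p_y1(p, X), 1e-12, 1.0 - 1e-12)   # clipped for numerical safety
    # Normalized log-likelihood and mistake count, for the table in part (d).
    L = np.mean(np.where(y == 1, np.log(prob1), np.log(1.0 - prob1)))
    M = int(np.sum((y == 0) & (prob1 >= 0.5)) + np.sum((y == 1) & (prob1 <= 0.5)))
    if it in (0, 1, 2, 4, 8, 16, 32, 64, 128, 256):
        print(f"iteration {it:3d}   mistakes {M:3d}   log-likelihood {L:.5f}")
    # E-step posterior P(Z_i=1, X_i=1 | x^(t), y^(t)), then M-step update of p_i.
    post = (y[:, None] * X * p[None, :]) / prob1[:, None]
    p = post.sum(axis=0) / Ti
```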
6.4 Auxiliary function
In class we derived an auxiliary function for optimizing the log-likelihood in belief networks with hidden variables. In this problem you will derive an auxiliary function for optimizing a simpler function that is nearly quadratic near its minimum, but nearly linear far away from its minimum.
(a) Consider the function f(x) = log cosh(x). Show that the minimum occurs at x = 0.

(b) Show that f''(x) ≤ 1 for all x.

(c) Consider the function Q(x,y) = f(y) + f'(y)(x-y) + \frac{1}{2}(x-y)^2. Plot f(x), Q(x,2), and Q(x,3) as a function of x.

(d) Prove that Q(x,y) is an auxiliary function for f(x). In particular, show that it satisfies:
    (i) Q(x,x) = f(x)
    (ii) Q(x,y) ≥ f(x)
Hint: use part (b), and note that f(x) = f(y) + \int_y^x du\, f'(u) = f(y) + \int_y^x du \left[ f'(y) + \int_y^u dv\, f''(v) \right].

(e) Derive the form of the update rule x_{n+1} = argmin_x Q(x, x_n).

(f) Write a simple program to show that your update rule in (e) converges numerically for the initial guesses x_0 = 2 and x_0 = 3. Turn in your source code as well as plots of x_n versus n.

(g) Repeat parts (e) and (f) using the update rule for Newton's method: namely, x_{n+1} = x_n - f'(x_n)/f''(x_n). What happens and why? Determine an upper bound on |x_0| so that Newton's method converges. (Hint: require |x_1| < |x_0|.)

(h) Plot the function g(x) = \frac{1}{10}\sum_{k=1}^{10} \log\cosh\!\left(x + \frac{2}{\sqrt{k}}\right). Is it still simple to find the exact minimum?

(i) Consider the function R(x,y) = g(y) + g'(y)(x-y) + \frac{1}{2}(x-y)^2. Prove that R(x,y) is an auxiliary function for g(x).

(j) Derive the form of the update rule x_{n+1} = argmin_x R(x, x_n).

(k) Use the update rule from part (j) to locate the minimum of g(x) to four significant digits. In addition to your answer for the minimum, turn in your source code as well as plots of x_n versus n.
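The numerical experiments in parts (f), (g), and (k) can be prototyped along the following lines. This is only a sketch: it assumes the auxiliary-function update works out to subtracting the gradient, i.e. x_{n+1} = x_n - f'(x_n) (which is what parts (e) and (j) ask you to verify), it uses the reconstruction of g(x) given in part (h), and the function names are made up. Plotting x_n versus n is left to you.

```python
import numpy as np

def f_prime(x):
    """f(x) = log cosh(x), so f'(x) = tanh(x)."""
    return np.tanh(x)

def g_prime(x):
    """g(x) = (1/10) * sum_k log cosh(x + 2/sqrt(k)), so g'(x) is the mean of tanh terms."""
    k = np.arange(1, 11)
    return np.mean(np.tanh(x + 2.0 / np.sqrt(k)))

def auxiliary_descent(grad, x0, iters=50):
    """Iterate x_{n+1} = x_n - grad(x_n), the minimizer of the quadratic auxiliary function."""
    xs = [x0]
    for _ in range(iters):
        xs.append(xs[-1] - grad(xs[-1]))
    return xs

def newton(x0, iters=50):
    """Newton's method on f(x) = log cosh(x): x_{n+1} = x_n - sinh(x_n) cosh(x_n)."""
    xs = [x0]
    for _ in range(iters):
        xs.append(xs[-1] - np.sinh(xs[-1]) * np.cosh(xs[-1]))
    return xs

# Part (f): the auxiliary update converges from x0 = 2 and x0 = 3.
print(auxiliary_descent(f_prime, 2.0)[-1], auxiliary_descent(f_prime, 3.0)[-1])
# Part (g): Newton's method converges from a small start but diverges from x0 = 2.
print(newton(1.0)[-1], newton(2.0, iters=2))
# Part (k): minimize g(x) with the same style of update.
print(auxiliary_descent(g_prime, 0.0)[-1])
```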