
CSE 250a. Assignment 4

4.1 Gradient-based learning
[Figure: belief network with inputs X1, X2, X3, ..., Xd and output Y.]
Consider the belief network shown above with binary random variable $Y \in \{0,1\}$ and conditional probability table (CPT):
$$P(Y=1|x_1, x_2, \ldots, x_d) \;=\; f\!\left(\sum_{i=1}^d w_i x_i\right) \;=\; f(\vec{w}\cdot\vec{x}).$$
Here, we assume that the weights $w_i \in \mathbb{R}$ are real-valued parameters to be estimated from data and that $f : \mathbb{R} \rightarrow [0,1]$ is a differentiable function that maps its real-valued argument to the unit interval.
Consider a data set of i.i.d. examples $\{(\vec{x}_t, y_t)\}_{t=1}^T$, where $\vec{x}_t = (x_{1t}, x_{2t}, \ldots, x_{dt})$ denotes the observed vector of values for the network's inputs in the $t$th example. Also, as shorthand, let $p_t = P(Y=1|\vec{x}_t)$.
(a) Show that the gradient of the conditional log-likelihood $L = \sum_t \log P(y_t|\vec{x}_t)$ is given by:
$$\frac{\partial L}{\partial w_i} \;=\; \sum_{t=1}^T \left(\frac{f'(\vec{w}\cdot\vec{x}_t)}{p_t(1-p_t)}\right) (y_t - p_t)\, x_{it},$$
where $f'(z)$ denotes the first derivative of $f(z)$. Intuitively, this result shows that the differences between observed values $y_t$ and predictions $p_t$ appear as error signals for learning.
(b) Show that the general result in part (a) reduces to the result in lecture when $f(z) = [1+e^{-z}]^{-1}$ is the sigmoid function. (An illustrative numerical check of the formula in part (a) is sketched below.)
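As an illustrative aside (not part of what you must turn in), the following minimal NumPy sketch checks the general gradient formula from part (a) against a centered finite-difference approximation, using the sigmoid choice of f from part (b) and synthetic random data; all names and sizes here are arbitrary choices made only for the sketch.

import numpy as np

# Minimal sketch: finite-difference check of the part (a) gradient,
# assuming the sigmoid f from part (b) and synthetic random data.
rng = np.random.default_rng(0)
T, d = 100, 5
X = rng.normal(size=(T, d))                 # row t is the input vector x_t
w = rng.normal(size=d)
y = (rng.random(T) < 0.5).astype(float)     # arbitrary binary targets y_t

f = lambda z: 1.0 / (1.0 + np.exp(-z))      # sigmoid f(z)
fprime = lambda z: f(z) * (1.0 - f(z))      # its derivative f'(z)

def log_likelihood(w):
    p = f(X @ w)                            # p_t = P(Y=1|x_t)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(w):
    z = X @ w
    p = f(z)
    # part (a): [f'(w.x_t) / (p_t (1 - p_t))] (y_t - p_t) x_it, summed over t
    coeff = fprime(z) / (p * (1 - p)) * (y - p)
    return X.T @ coeff

eps = 1e-6
num_grad = np.array([
    (log_likelihood(w + eps * e) - log_likelihood(w - eps * e)) / (2 * eps)
    for e in np.eye(d)
])
print(np.max(np.abs(gradient(w) - num_grad)))   # should be tiny (around 1e-6 or less)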
4.2 Multinomial logistic regression
[Figure: belief network with inputs X1, X2, X3, ..., Xd and output Y.]
A simple generalization of logistic regression is to predict a discrete (but non-binary) label $y \in \{1, 2, \ldots, c\}$ from a real-valued vector $\vec{x} \in \mathbb{R}^d$. Here $c$ is the number of classes. For the belief network shown above, consider the following parameterized conditional probability table (CPT):
$$P(Y=i|\vec{X}=\vec{x}) \;=\; \frac{e^{\vec{w}_i\cdot\vec{x}}}{\sum_{j=1}^c e^{\vec{w}_j\cdot\vec{x}}}.$$
The parameters of this CPT are the weight vectors $\vec{w}_i$, which must be learned for each possible label. The denominator normalizes the distribution so that the elements of the CPT sum to one.
Consider a data set of $T$ examples $\{(\vec{x}_t, y_t)\}_{t=1}^T$, which you can assume to be independently and identically distributed (i.i.d.). As shorthand, let $y_{it} \in \{0,1\}$ denote the $c \times T$ matrix of target assignments with elements
$$y_{it} \;=\; \begin{cases} 1 & \text{if } y_t = i, \\ 0 & \text{otherwise.} \end{cases}$$
Also, let $p_{it} \in [0,1]$ denote the conditional probability that the model classifies the $t$th example by the $i$th possible label:
$$p_{it} \;=\; \frac{e^{\vec{w}_i\cdot\vec{x}_t}}{\sum_{j=1}^c e^{\vec{w}_j\cdot\vec{x}_t}}.$$
The weight vectors can be obtained by maximum likelihood estimation using gradient ascent. Show that the gradient of the conditional log-likelihood $L = \sum_t \log P(y_t|\vec{x}_t)$ is given by:
$$\frac{\partial L}{\partial \vec{w}_i} \;=\; \sum_t (y_{it} - p_{it})\, \vec{x}_t.$$
Again, this result shows that the differences between observed values $y_{it}$ and predictions $p_{it}$ appear as error signals for learning.
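As an optional illustration, here is a minimal NumPy sketch of the quantities defined above: it computes the probabilities p_it, evaluates the stated gradient, and compares one component against a finite-difference approximation on synthetic data. The data sizes and names are arbitrary choices made only for the sketch.

import numpy as np

# Minimal sketch: softmax probabilities p_it and the stated gradient,
# verified by finite differences on synthetic data (names are illustrative).
rng = np.random.default_rng(1)
T, d, c = 200, 4, 3
X = rng.normal(size=(T, d))                    # rows are examples x_t
labels = rng.integers(c, size=T)               # y_t in {0, ..., c-1}
Y = np.eye(c)[labels].T                        # c x T matrix of indicators y_it
W = rng.normal(size=(c, d))                    # one weight vector w_i per class

def probs(W):
    Z = W @ X.T                                # c x T matrix of w_i . x_t
    Z -= Z.max(axis=0)                         # stabilize the exponentials
    E = np.exp(Z)
    return E / E.sum(axis=0)                   # p_it

def log_likelihood(W):
    return np.sum(Y * np.log(probs(W)))

def gradient(W):
    return (Y - probs(W)) @ X                  # row i is dL/dw_i

# centered finite-difference check of one weight component
i, k, eps = 1, 2, 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[i, k] += eps
Wm[i, k] -= eps
approx = (log_likelihood(Wp) - log_likelihood(Wm)) / (2 * eps)
print(gradient(W)[i, k], approx)               # should agree closely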
4.3 Convergence of gradient descent
One way to gain intuition for gradient descent is to analyze its behavior in simple settings. For a one-dimensional function $f(x)$ over the real line, gradient descent takes the form:
$$x_{n+1} \;=\; x_n - \eta f'(x_n).$$
(a) Consider minimizing the function $f(x) = \frac{\alpha}{2}(x-x^*)^2$ by gradient descent, where $\alpha > 0$. (In this case, we know that the minimum occurs at $x = x^*$; our goal is to analyze the rate of convergence to this minimum.) Derive an expression for the error $\varepsilon_n = x_n - x^*$ at the $n$th iteration in terms of the initial error $\varepsilon_0$ and the step size $\eta > 0$.
(b) For what values of the step size $\eta$ does the update rule converge to the minimum at $x^*$? What step size leads to the fastest convergence, and how is it related to $f''(x_n)$?
In practice, the gradient descent learning rule is often modified to dampen oscillations at the end of the learning procedure. A common variant of gradient descent involves adding a so-called momentum term:
$$\vec{x}_{n+1} \;=\; \vec{x}_n - \eta \nabla f + \beta(\vec{x}_n - \vec{x}_{n-1}),$$
where $\beta \geq 0$. Intuitively, the name arises because the optimization continues of its own momentum (stepping in the same direction as its previous update) even when the gradient vanishes. In one dimension, this learning rule simplifies to:
$$x_{n+1} \;=\; x_n - \eta f'(x_n) + \beta(x_n - x_{n-1}).$$
(c) Consider minimizing the quadratic function in part (a) by gradient descent with a momentum term. Again, let $\varepsilon_n = x_n - x^*$ denote the error at the $n$th iteration. Show that the error in this case satisfies the recursion relation:
$$\varepsilon_{n+1} \;=\; (1 - \alpha\eta + \beta)\,\varepsilon_n - \beta\,\varepsilon_{n-1}.$$
(d) Suppose that the second derivative $f''(x^*)$ is given by $\alpha = 1$, the learning rate by $\eta = \frac{4}{9}$, and the momentum parameter by $\beta = \frac{1}{9}$. Show that one solution to the recursion in part (c) is given by:
$$\varepsilon_n \;=\; \lambda^n \varepsilon_0,$$
where $\varepsilon_0$ is the initial error and $\lambda$ is a numerical constant to be determined. (Other solutions are also possible, depending on the way that the momentum term is defined at time $t = 0$; do not concern yourself with this.) How does this rate of convergence compare to that of gradient descent with the same learning rate ($\eta = \frac{4}{9}$) but no momentum parameter ($\beta = 0$)? (An illustrative numerical comparison is sketched below.)
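The following optional sketch runs both update rules on the quadratic from part (a), using the constants of part (d) and an arbitrary starting point. Because it initializes the momentum term with the previous iterate equal to the starting point, the iterates follow one particular solution of the recursion; printing the successive error ratios still gives a useful numerical hint about the two convergence rates (the momentum ratios drift toward the constant of part (d) only asymptotically).

# Illustrative sketch: gradient descent on f(x) = (alpha/2)(x - xstar)^2,
# with and without a momentum term, using the constants from part (d).
alpha, eta, beta = 1.0, 4.0 / 9.0, 1.0 / 9.0
xstar, x0 = 0.0, 1.0                 # arbitrary minimum and starting point

# plain gradient descent (beta = 0)
x = x0
plain = [x - xstar]
for _ in range(10):
    x = x - eta * alpha * (x - xstar)
    plain.append(x - xstar)

# gradient descent with momentum (previous iterate initialized to x0)
x_prev, x = x0, x0
mom = [x - xstar]
for _ in range(10):
    x_next = x - eta * alpha * (x - xstar) + beta * (x - x_prev)
    x_prev, x = x, x_next
    mom.append(x - xstar)

# successive error ratios reveal the per-iteration convergence factor
print([round(mom[n + 1] / mom[n], 4) for n in range(1, 9)])
print([round(plain[n + 1] / plain[n], 4) for n in range(1, 9)])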
4.4 Newton’s method
One way to gain intuition for Newton's method is to analyze its behavior in simple settings. For a twice-differentiable function $f(x)$ over the real line, Newton's method takes the form:
$$x_{n+1} \;=\; x_n - \frac{f'(x_n)}{f''(x_n)}.$$
(a) Consider the polynomial function $f(x) = (x-x^*)^{2p}$ for positive integers $p$, whose minimum occurs at $x = x^*$. Suppose that Newton's method is used to minimize this function, starting from some initial estimate $x_0$. Derive an expression for the error $\varepsilon_n = |x_n - x^*|$ at the $n$th iteration in terms of the initial error $\varepsilon_0$.
(b) For the function in part (a), how many iterations of Newton's method are required to reduce the initial error by a constant factor $\delta < 1$, such that $\varepsilon_n \leq \delta\varepsilon_0$? Starting from your previous answer, show that $n \geq (2p-1)\log(1/\delta)$ iterations are sufficient. (Hint: use the inequality $\log z \leq z-1$ for $z > 0$.)
(c) Consider the function $f(x) = x^* \log(x^*/x) - x^* + x$, where $x^* > 0$. Show that the minimum occurs at $x = x^*$, and sketch the function in the region $|x - x^*| < x^*$.
(d) Consider minimizing the function in part (c) by Newton's method. Derive an expression for the relative error $\rho_n = (x_n - x^*)/x^*$ at the $n$th iteration in terms of the initial relative error $\rho_0$. Note the rapid convergence (which is typical of Newton's method). For what range of initial values (for $x_0$) does Newton's method converge to the correct answer? (A short numerical illustration follows this problem.)
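As a short optional illustration of part (d), the sketch below runs Newton's method on the function from part (c) with an assumed value x* = 2 and an assumed starting point x0 = 3 (both chosen arbitrarily), printing the relative error at each iteration; the rapid shrinkage of the error should be visible after a few steps. It is not a substitute for the derivation.

# Illustrative sketch: Newton's method on f(x) = x*·log(x*/x) - x* + x,
# with an assumed x* = 2 and starting point x0 = 3.
xstar = 2.0
x = 3.0

fprime = lambda x: 1.0 - xstar / x        # f'(x)
fsecond = lambda x: xstar / (x * x)       # f''(x)

for n in range(6):
    rho = (x - xstar) / xstar             # relative error rho_n
    print(n, rho)
    x = x - fprime(x) / fsecond(x)        # Newton update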
4.5 Stock market prediction
In this problem, you will apply a simple linear model to predicting the stock market. From the course web
site, download the files nasdaq00.txt and nasdaq01.txt, which contain the NASDAQ indices at the
close of business days in 2000 and 2001.
[Figure: NASDAQ index (price, 1K-6K) versus year; the 2000 data are labeled TRAIN and the 2001 data are labeled TEST.]
(a) Linear coefficients
How accurately can the index on one day be predicted by a linear combination of the three preceding indices? Using only data from the year 2000, compute the linear coefficients $(a_1, a_2, a_3)$ that maximize the log-likelihood $L = \sum_t \log P(x_t|x_{t-1}, x_{t-2}, x_{t-3})$, where:
$$P(x_t|x_{t-1}, x_{t-2}, x_{t-3}) \;=\; \frac{1}{\sqrt{2\pi}} \exp\!\left[-\frac{1}{2}\bigl(x_t - a_1 x_{t-1} - a_2 x_{t-2} - a_3 x_{t-3}\bigr)^2\right],$$
and the sum is over business days in the year 2000 (starting from the fourth day). (One possible NumPy setup is sketched after part (c).)
(b) Mean squared prediction error
For the coefficients estimated in part (a), compare the model’s performance (in terms of mean squared
error) on the data from the years 2000 and 2001. Would you recommend this linear model for stock
market prediction?
(c) Source code
Turn in a print-out of your source code. You may program in the language of your choice, and you
may solve the required system of linear equations either by hand or by using built-in routines (e.g., in
Matlab, NumPy).
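Purely as an illustration of one possible setup (not the required solution), the sketch below builds the least-squares system for (a1, a2, a3) from the 2000 data and reports mean squared error on both years. It assumes each file holds one closing index value per line, and it starts from the fourth day of each year; both are assumptions made only for the sketch.

import numpy as np

# Minimal sketch: fit x_t ≈ a1*x_{t-1} + a2*x_{t-2} + a3*x_{t-3} by least squares
# on the 2000 data, then evaluate mean squared error on both years.
def design(x):
    # rows are (x_{t-1}, x_{t-2}, x_{t-3}); targets are x_t, starting from the fourth day
    A = np.column_stack([x[2:-1], x[1:-2], x[0:-3]])
    b = x[3:]
    return A, b

train = np.loadtxt("nasdaq00.txt")          # assumes one closing value per line
test = np.loadtxt("nasdaq01.txt")

A, b = design(train)
# maximizing the Gaussian log-likelihood above is equivalent to minimizing squared error
a = np.linalg.solve(A.T @ A, A.T @ b)       # normal equations
print("coefficients (a1, a2, a3):", a)

for name, x in [("2000", train), ("2001", test)]:
    A, b = design(x)
    print(name, "mean squared error:", np.mean((b - A @ a) ** 2))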
4.6 Handwritten digit classification
In this problem, you will use logistic regression to classify images of handwritten digits. From the course
web site, download the files train3.txt, test3.txt, train5.txt, and test5.txt. These files
contain data for binary images of handwritten digits. Each image is an 8x8 bitmap represented in the files
by one line of text. Some of the examples are shown in the following figure.
(a) Training
Perform a logistic regression (using gradient ascent or Newton’s method) on the images in files
train3.txt and train5.txt. Indicate clearly the algorithm used, and provide evidence that
it has converged (or nearly converged) by plotting or printing out the log-likelihood on several iterations of the algorithm, as well as the percent error rate on the images in these files. Also, print out the
64 elements of your solution for the weight vector as an 8x8 matrix. (A bare-bones sketch of such a training loop appears after part (c).)
(b) Testing
Use the model learned in part (a) to label the images in the files test3.txt and test5.txt.
Report your percent error rate on these images.
(c) Source code
Turn in a print-out of your source code. Once again, you may program in the language of your
choice. You should write your own routines for computing the model’s log-likelihood, gradient,
and/or Hessian, as well as for updating its weight vector.
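As a purely illustrative aside (you must still write and submit your own routines), here is a bare-bones sketch of one possible gradient-ascent loop for this problem. It assumes each line of the data files holds 64 whitespace-separated 0/1 pixel values and labels the 5s as class 1; both are assumptions made only for the sketch, and the step size and iteration count are arbitrary.

import numpy as np

# Bare-bones sketch of binary logistic regression by gradient ascent (illustrative only).
# Assumes each line of the files holds 64 whitespace-separated 0/1 pixel values.
X3 = np.loadtxt("train3.txt")               # images of 3s
X5 = np.loadtxt("train5.txt")               # images of 5s
X = np.vstack([X3, X5])
y = np.concatenate([np.zeros(len(X3)), np.ones(len(X5))])   # label the 5s as 1

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
w = np.zeros(X.shape[1])
eta = 0.02 / len(X)                         # small step size (tune as needed)

for it in range(2000):
    p = sigmoid(X @ w)
    w += eta * X.T @ (y - p)                # gradient ascent on the log-likelihood
    if it % 500 == 0:
        q = np.clip(p, 1e-12, 1 - 1e-12)    # guard the logs against saturation
        L = np.sum(y * np.log(q) + (1 - y) * np.log(1 - q))
        err = np.mean((p > 0.5) != y)
        print(it, L, err)                   # log-likelihood and training error rate

print(w.reshape(8, 8))                      # weight vector displayed as an 8x8 matrix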
