CS559-B: Machine Learning
Final Exam
The final exam is to be completed individually. Use of partial or entire solutions obtained from
others or online is strictly prohibited.
• There will be 7 pages in this exam (including this cover sheet and scratch paper)
• This is a SEMI-OPEN-BOOK exam. You can use lecture notes as reference, but do not search the
internet for answers.
• Work efficiently and independently.
• You have 150 minutes.
• Good luck!
Question Topic Max. score Score
1 Short Questions 30
2 Support Vector Machine 10
3 Boosting Machine 10
4 Bayesian Network 20
5 Decision Tree 15
6 Neural Network and Back-propagation 15
Total 100
1. Short Questions (30 points)
(a) (2 pts) True or False: Bagging algorithms attach weights α1, α2, ...αN to a set of N weak learners.
They re-weight the learners and convert them into strong ones. Boosting algorithms draw N sample
distributions (usually with replacement) from an original data set for learners to train on.
(b) (2 pts) True or False: An infinite depth binary Decision Tree can always achieve 100% training
accuracy, provided that no point is mislabelled in the training set.
(c) (2 pts) True or False: A neural network with multiple hidden layers and sigmoid nodes can form
non-linear decision boundaries.
(d) (2 pts) True or False: A random forest is an ensemble learning method that attempts to lower the
bias error of decision trees.
(e) (2 pts) Assume that you initialize all weights in a neural net to the same value and you do the same
for the bias terms. Which of the following statements is correct?
(a) This is a good idea since it treats every edge equally.
(b) This is a bad idea.
(f) (3 pts) Which of the following statements are true?
(a) The more training examples, the more accurate the prediction of a k-nearest-neighbor classifier.
(b) k-nearest-neighbors cannot be used for regression.
(c) A k-nearest-neighbor classifier is sensitive to outliers.
(d) Training a k-nearest-neighbor classifier takes more computational time than applying it / using
it for prediction.
(g) (3 pts) Consider the following joint distribution on X and Y , where both random variables take
on the values 0, 1: p(X = 0, Y = 0) = 0.1, p(X = 0, Y = 1) = 0.2, p(X = 1, Y = 0) = 0.3,
p(X = 1, Y = 1) = 0.4. You receive X = 1. What is the largest probability of being correct you
can achieve when predicting Y in this case?
(a) 1/3
(b) 3/4
(c) 1/7
(d) 0
(e) 1
(f) 2/3
(g) 6/7
(h) 4/7
(i) 3/7
(j) 1/4
(k) 2/4
(h) (3 pts) Consider the K-means algorithm. Which of the following assertions is wrong?
(a) Regardless of the initialization the algorithm converges.
(b) Regardless of the initialization the algorithm always finds the same clusters.
(c) If we initialize the K-means algorithm with optimal clusters then it will find in one step optimal
representation points.
(d) If we initialize the K-means algorithm with optimal representation points then it will find in
one step optimal clusters.
(i) (3 pts) What strategies can help reduce overfitting in decision trees?
(a) Pruning
(b) Enforce a minimum number of samples in leaf nodes
(c) Make sure each leaf node is one pure class
(d) Enforce a maximum depth for the tree
(j) (3 pts) In the setting of the EM algorithm, where xn is the data and zn is the latent variable, which
quantity is called the posterior?
(a) p(xn|zn, θ)
(b) p(xn, zn|θ)
(c) p(zn|xn, θ)
(k) (5 pts) Suppose we clustered a set of N data points using two different clustering algorithms: k-means
and Gaussian mixtures. In both cases we obtained 5 clusters, and in both cases the centers of
the clusters are exactly the same. Can 3 points that are assigned to different clusters in the k-means
solution be assigned to the same cluster in the Gaussian mixture solution? If not, explain why. If so,
sketch an example or explain in 1-2 sentences.
2. Support Vector Machine (SVM) (10 points)
(a) (3 pts) Consider the three linearly separable two-dimensional input vectors in the following figure.
Find the linear SVM that optimally separates the classes by maximizing the margin. You only need
to draw on the figure.
(b) (2 pts) In the solution for (a), how many support vectors are there?
(c) (5 pts) What are the advantages of SVM?
3. Boosting Machine - AdaBoost (10 points)
Figure 1: h1 is chosen at the first iteration of boosting. Points above h1 are predicted to be negative,
while points below h1 are predicted to be positive.
(a) (5 pts) The above figure shows a dataset of 8 points, equally divided among the two classes (positive
and negative). The figure also shows a particular choice of decision line h1 picked by AdaBoost in
the first iteration. What is the weight α1 that will be assigned to h1 by AdaBoost? (Initial weights
of all the data points are equal, i.e., 1/8.)
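For reference (assuming the standard AdaBoost formulation covered in lecture), the weight assigned to the
weak learner chosen at iteration t is

    αt = (1/2) ln((1 − εt) / εt),

where εt is the weighted training error of ht under the current data-point weights.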
(b) (5 pts) AdaBoost will eventually reach zero training error, regardless of the type of weak classifier
it uses, provided enough weak classifiers have been combined. (True or False, briefly explain)
4. Bayesian Network (15 points)
(a) (5 pts) Draw a Bayesian network which represents the following joint distribution:
P(A, B, C, D, E) = P(A)P(B|A)P(C|A)P(D|A, B)P(E|A, C, D)
(b) (5 pts) How many independent parameters are needed to fully specify the joint distribution in (a)?
(c) (3 pts) Suppose we do not make any independence assumptions over the 5 random variables A, B, C, D, E.
Write down one possible factorization of P(A, B, C, D, E).
(d) (2 pts) How many independent parameters are needed to fully specify the joint distribution in (c)?
5. Decision Tree (15 points) We will use the dataset below to learn a decision tree which predicts if people
pass machine learning (Yes or No), based on their previous GPA (High, Medium, or Low) and whether
or not they studied.
GPA Studied Passed (Y)
L F F
L T T
M F F
M T T
H F T
H T T
For this problem, write your answers using log2; it may be helpful to note that log2 3 ≈ 1.6.
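For reference (assuming the usual base-2 definitions from lecture), entropy and conditional entropy are
computed as

    H(Y) = − Σy p(Y = y) log2 p(Y = y)
    H(Y | X) = Σx p(X = x) H(Y | X = x)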
(a) (2 pts) What is the entropy H(Y )?
(b) (3 pts) What is the entropy H(Y |GPA)?
(c) (5 pts) What is the entropy H(Y |Studied)?
(d) (5 pts) Draw the full decision tree that would be learned for this data set. You do not need to show
any calculations.
6. Neural Network and Back-propagation (15 points) You want to train a neural network to predict an exam
score based on two inputs: how many hours you studied and how many hours you slept the day before.
Your neural network consists of an input layer with 2 units (hours studied and hours slept), a hidden
layer with 3 units, and an output layer with 1 unit (the exam score). You use the sigmoid activation
function for the hidden units and no activation function for the outputs (or inputs). We use the
following notation:
• x is the training input vector, y is the true label vector (in this case, the given exam scores), and ŷ is
the output of your neural network. All vectors are column vectors. Note that the vector x has two
elements, while y and ŷ each have only one element (i.e., they are scalars).
• The sigmoid function is defined as σ(x) = 1/(1 + e^(−x)). σ(·) is applied element-wise to a vector. The
derivative of the sigmoid function is σ'(x) = dσ(x)/dx = σ(x)(1 − σ(x)).
• g is the vector of hidden unit values before the sigmoid activation functions are applied, and h = σ(g)
is the vector of hidden unit values after they are applied.
• V , bV are the weight matrix and bias terms that map the input layer to the hidden layer, i.e.,
g = V x + bV .
• W, bW are the weight matrix and bias terms that map the hidden layer to the output layer, i.e.,
ŷ = Wh + bW. (A small shape-check sketch of this setup follows below.)
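The following is a minimal sketch of the setup described above, written in Python/NumPy with placeholder
parameter values chosen only for illustration (the variable names mirror the notation of this question):

    import numpy as np

    def sigmoid(z):
        # element-wise sigmoid: 1 / (1 + exp(-z))
        return 1.0 / (1.0 + np.exp(-z))

    # Placeholder parameters (random values, for illustration only):
    V = np.random.randn(3, 2)    # weights: 2 inputs -> 3 hidden units
    b_V = np.random.randn(3)     # hidden-layer bias
    W = np.random.randn(1, 3)    # weights: 3 hidden units -> 1 output
    b_W = np.random.randn(1)     # output bias

    x = np.array([5.0, 8.0])     # hours studied, hours slept
    g = V @ x + b_V              # hidden values before the sigmoid
    h = sigmoid(g)               # hidden values after the sigmoid
    y_hat = W @ h + b_W          # network output (no output activation)

    print(y_hat.shape)           # (1,)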
(a) (5 pts) What function does this one-hidden-layer neural network represent? Write down the function
expression for ŷ in terms of the input x and all related weight/bias parameters.
(b) (5 pts) Calculate the number of parameters (weights) in this network. You can leave your answer
as an expression. Be sure to account for the bias terms. (Hint: consider the sizes of the weight
matrices (i.e., V and W) and bias terms (i.e., bV and bW).)
(c) (5 pts) Suppose you train your network with the cost function J = (1/2)|y − ŷ|². What is ∂J/∂W?
(Hint: ∂J/∂W is the matrix/vector with the same dimension as W; express the gradient ∂J/∂W in terms
of a proper vector product using ŷ, y, and h.)