MACHINE LEARNING COMS 4771, HOMEWORK 3

1 Problem 1 (5 points)
Suppose you work at a casino and you are responsible for making sure no one is cheating. Specifically,
you are interested in making sure the dice used in the craps tables are fair.
Craps is a game that involves throwing two 6-sided dice each round. For this exercise let’s consider
a simplified version of this game:
• Each round a player throws a pair of dice.
• If the sum of the outcomes of the dice is 7, the player wins, otherwise the player loses.
The casino has noticed that some players are secretly replacing the regular (fair) dice with a pair of
loaded (unfair) dice. When an unfair die is thrown, the outcome is 3 with probability 1/3,
4 with probability 1/3, and each of the remaining outcomes (1, 2, 5, 6) is equally likely. Historical
data support the assumption that, at any point in time, there is a 1/1000 probability that a craps
table has unfair dice. The plan
of action of the casino is to replace both dice of a table when there is more than 50% probability
that they are unfair.
The casino manager asks you two things:
1. A mathematical function that takes the number of rounds played at a table and outputs
the minimum number of player wins for which the dice are more likely to be unfair than fair.
2. What is the minimum number of rounds they have to observe so they can infer a pair of
dice is more likely to be unfair? What are the necessary outcomes of the dice that result in
the minimum number of observations needed for the inference?
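The posterior probability the manager cares about follows from Bayes' rule with a binomial likelihood over the observed wins. The sketch below is one way to check candidate answers numerically; the function name and its interface are illustrative, not part of the assignment:

```python
from math import comb

def posterior_unfair(n_rounds, n_wins, prior_unfair=1/1000):
    """P(dice unfair | n_wins wins in n_rounds), via Bayes' rule."""
    # Fair dice: 6 of the 36 ordered outcomes sum to 7.
    p_fair = 6 / 36
    # Unfair die: P(3) = P(4) = 1/3; the remaining probability mass of 1/3
    # is split evenly over the outcomes 1, 2, 5, 6 (1/12 each).
    die = {1: 1/12, 2: 1/12, 3: 1/3, 4: 1/3, 5: 1/12, 6: 1/12}
    p_unfair = sum(die[a] * die[7 - a] for a in range(1, 7))
    # Binomial likelihoods of the observed win count under each hypothesis.
    like_u = comb(n_rounds, n_wins) * p_unfair**n_wins * (1 - p_unfair)**(n_rounds - n_wins)
    like_f = comb(n_rounds, n_wins) * p_fair**n_wins * (1 - p_fair)**(n_rounds - n_wins)
    num = like_u * prior_unfair
    return num / (num + like_f * (1 - prior_unfair))
```

Note that the win probability with two unfair dice works out to 1/4, versus 1/6 for fair dice, which is what makes the inference possible.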
2 Problem 2 (10 points)
For this problem, suppose we have a coin that when tossed, has probability µ of landing on heads
(and probability 1 − µ of tails). However, we do not know µ.
Consider two possible prior distributions: (A) µ ∼ Uniform[0, 1]; (B) we have some reason to think
the coin is likely to be fair, so assume µ has a probability distribution which has the form of a
concave parabola with its maximum at µ = 1/2, falling to 0 at µ = 0 and µ = 1.
Also consider two possible realizations: (1) D1 = {H, T}; (2) D2 = {T, T, T}.
For each of the 4 combinations of priors and realizations, derive with proof:
• p(H) given the prior
• the posterior distribution p(µ|D) [ensure this is properly normalized]
• the maximum likelihood estimate µML given the data D
• the MAP estimate µMAP given the data D
• p(H|D), i.e. the full Bayesian probability that the next toss is H
• the variance of the posterior distribution p(µ|D)
Put the final results of all items for all combinations of priors and realizations in a single table.
Give one reason why the maximum likelihood estimate might not be very reliable in this context.
Use the results you calculated to support your argument.
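When checking the final table, it helps to note that both priors are Beta distributions: the uniform prior is Beta(1, 1), and the parabola prior, 6µ(1 − µ), is Beta(2, 2), so each posterior is again Beta with updated counts. The sketch below collects the standard conjugate-update formulas; it is a cross-check aid, not a substitute for the required derivations:

```python
def beta_summary(a, b, heads, tails):
    """Posterior quantities for a Beta(a, b) prior on mu after observing
    `heads` H's and `tails` T's."""
    A, B = a + heads, b + tails               # posterior is Beta(A, B)
    p_next_H = A / (A + B)                    # full Bayesian predictive p(H|D)
    mle = heads / (heads + tails)             # maximum likelihood estimate
    map_est = (A - 1) / (A + B - 2) if A + B > 2 else None  # posterior mode
    var = A * B / ((A + B) ** 2 * (A + B + 1))              # posterior variance
    return p_next_H, mle, map_est, var
```

For example, `beta_summary(1, 1, 0, 3)` covers prior (A) with realization D2, where the MLE of 0 already hints at the reliability issue the last question asks about.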
3 Problem 3 (5 points): Conditional Independence
Prove or disprove with a counterexample. For random variables a, b, c:
1. a ⊥⊥ b|c → a ⊥⊥ b.
2. a ⊥⊥ b → a ⊥⊥ b|c
3. (a ⊥⊥ b) ∧ (b ⊥⊥ c) → a ⊥⊥ c
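A finite joint distribution is enough to settle statements like these. The sketch below builds the classic XOR construction (a and b independent fair bits, c = a XOR b), under which a ⊥⊥ b holds marginally but fails once we condition on c; the helper names are my own:

```python
from itertools import product

# a, b independent fair bits; c = a XOR b. Each (a, b) pair has mass 1/4.
joint = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}

def prob(pred):
    """Probability of the event described by the predicate pred(a, b, c)."""
    return sum(p for outcome, p in joint.items() if pred(*outcome))

# Marginally, a and b are independent:
p_ab = prob(lambda a, b, c: a == 1 and b == 1)   # 0.25
p_a = prob(lambda a, b, c: a == 1)               # 0.5
p_b = prob(lambda a, b, c: b == 1)               # 0.5

# Conditioned on c = 0, a determines b, so independence fails:
p_c0 = prob(lambda a, b, c: c == 0)
p_ab_c0 = prob(lambda a, b, c: a == 1 and b == 1 and c == 0) / p_c0
p_a_c0 = prob(lambda a, b, c: a == 1 and c == 0) / p_c0
p_b_c0 = prob(lambda a, b, c: b == 1 and c == 0) / p_c0
```

Here p_ab equals p_a · p_b, while p_ab_c0 = 1/2 differs from p_a_c0 · p_b_c0 = 1/4, disproving statement 2.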
4 Problem 4 (10 points): Maximum Likelihood
The Gamma distribution is a probability density over non-negative scalars, x ≥ 0. One particular
form is as follows
p(x | α, λ) = (1 / Γ(α)) λ^α x^(α−1) e^(−λx)
The density has two scalar parameters λ, α > 0.
What is the maximum likelihood estimate of λ from a dataset x1, · · · , xn of n iid samples, given
that α is fixed?
NOTE: make sure you show that your answer is a maximum point of the likelihood, and not a
minimum.
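Setting the derivative of the log-likelihood in λ to zero yields the candidate λ̂ = α / x̄, with x̄ the sample mean. The snippet below is a numerical sanity check on made-up data that this candidate is a maximizer rather than a minimizer; it does not replace the analytic argument the NOTE asks for:

```python
from math import log, lgamma

def gamma_loglik(lam, alpha, xs):
    """Log-likelihood of iid Gamma samples under p(x|alpha, lam)
    = lam^alpha x^(alpha-1) e^(-lam x) / Gamma(alpha)."""
    n = len(xs)
    return (n * (alpha * log(lam) - lgamma(alpha))
            + (alpha - 1) * sum(log(x) for x in xs)
            - lam * sum(xs))

# Illustrative fixed alpha and synthetic data.
alpha = 2.5
xs = [0.4, 1.1, 2.3, 0.9, 1.7]
lam_hat = alpha / (sum(xs) / len(xs))   # candidate MLE: alpha / sample mean
```

Evaluating `gamma_loglik` at `lam_hat` and at perturbed values of λ shows the candidate dominates its neighborhood, consistent with the log-likelihood being concave in λ.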
5 Problem 5 (20 points): Kernel Logistic Regression
Load dataset 'data1.mat'. Implement kernel logistic regression with an ℓ2 regularizer using the
empirical kernel map. In other words, minimize
J(w) = −Σ_{i=1}^{N} log(σ(y_i wᵀk_i)) + λ wᵀw
to get w. Here k_i is a column vector such that k_i = [k(x_i, x_1), · · · , k(x_i, x_j), · · · , k(x_i, x_N)]ᵀ.
Here y_i ∈ {−1, +1} is the given label for each input x_i ∈ ℝᵈ. The function σ is defined as
σ(v) = 1/(1 + e^(−v)).
For this exercise, use the RBF kernel k(x_i, x_j) = exp(−‖x_i − x_j‖² / κ²), where the parameter κ
is set by κ² = (1/N²) Σ_{i,j=1}^{N} ‖x_i − x_j‖².
After w is obtained, for any test input x, compute p(y = 1|x) = σ(wᵀk_x), where
k_x = [k(x, x_1), · · · , k(x, x_j), · · · , k(x, x_N)]ᵀ. If p(y = 1|x) > 0.5, the predicted label is y = 1;
otherwise, y = −1.
To optimize J(w), use three different methods: gradient descent and two different flavors of stochastic
gradient descent (SGD). For SGD, use p random points to estimate the gradient in each round.
Experiment with p = 1 and p = 100.
For the three methods, experiment with the step size and choose one that works for you. Stop the
iterations when the norm of the gradient becomes smaller than epsilon (say, 1e−5). There is no
need to cross-validate to choose the regularization factor; use a fixed value (say, 1e−3).
For each of the three optimization methods, report:
1. Test accuracy
2. Chosen step size
3. Number of iterations until convergence
4. Total training time (in seconds)
5. One plot of the cost J(w) as a function of time (in seconds) since the beginning of training.
Plot the first 5-10 seconds, depending on your computer's speed. To produce this plot,
simply record the cost at every iteration for the first 5-10 seconds and then plot it over a
linear space from 0 to 5-10 seconds (assuming all iterations took the same amount of time)
Finally, add a single plot overlaying the cost functions of the three methods. Which methods seem
faster in the first seconds of training?
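One possible NumPy sketch of the pieces above (the Gram matrix with the prescribed κ², the regularized objective's gradient, and full-batch vs. minibatch updates) is shown below. The function names, step size, and the synthetic data in the test are illustrative choices, not prescribed by the assignment:

```python
import numpy as np

def rbf_gram(X):
    """K[i, j] = exp(-||x_i - x_j||^2 / kappa^2), with
    kappa^2 = (1/N^2) * sum_{i,j} ||x_i - x_j||^2 as prescribed."""
    sq = np.sum(X * X, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    kappa2 = d2.mean()
    return np.exp(-d2 / kappa2)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def fit_klr(K, y, lam=1e-3, step=1e-2, eps=1e-5, max_iter=20000, p=None, seed=0):
    """Minimize J(w) = -sum_i log(sigma(y_i w^T k_i)) + lam * w^T w.
    p=None gives full gradient descent; p=1 or p=100 gives minibatch SGD."""
    N = len(y)
    w = np.zeros(N)
    rng = np.random.default_rng(seed)
    for _ in range(max_iter):
        idx = np.arange(N) if p is None else rng.choice(N, size=p)
        margins = y[idx] * (K[idx] @ w)
        # Gradient: -sum_i (1 - sigma(y_i w^T k_i)) y_i k_i + 2 lam w
        grad = -K[idx].T @ ((1.0 - sigmoid(margins)) * y[idx]) + 2.0 * lam * w
        if np.linalg.norm(grad) < eps:
            break
        w -= step * grad
    return w

def predict(K_test, w):
    """Rows of K_test are k_x = [k(x, x_1), ..., k(x, x_N)] for each test x."""
    return np.where(sigmoid(K_test @ w) > 0.5, 1, -1)
```

Loading 'data1.mat' (for example with scipy.io.loadmat) and timing each optimizer for the requested plots is left as stated in the assignment.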
