Machine Learning for Signal Processing
(ENGR-E 511; CSCI-B 590)
Homework 1
P1: Picky Eater [3 points]
1. Prof. K has a one-year-old baby, who is a picky eater: she only eats blueberries, strawberries,
and dried yogurt drops. Prof. K decided to put her in daycare because the large MLSP course
keeps him busy.
2. Since Prof. K is a good father and worries about his daughter's health, when he picks her up
at the end of the day, he asks the daycare teachers how many blueberries, strawberries, and
yogurt drops she ate for lunch.
3. For the first week at the daycare, the toddler had quite a good amount of food, which Prof.
K recorded in a spreadsheet:
Days           M    T    W    T    F
Blueberries    20   25   18   15   28
Strawberries    5    6    3    2    7
Yogurt drops   10   12    8   15    5

Table 1: A picky eater's lunch record
4. You know, her eating behavior follows the multinomial distribution with three parameters:
p_blueberry, p_strawberry, and p_yogurt. Find them out using the maximum likelihood technique
(hint: don’t worry about the different numbers for different days. You can just think of the
table as an observation for the entire week by summing up the numbers. You don’t have to
derive the MLE solution, either, because I already did it for you in class, M01-L03.).
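If you want a quick numerical check, here is a minimal sketch in Python (NumPy assumed) of the ML estimate from M01-L03, which reduces to normalizing the summed weekly counts:

import numpy as np

# Weekly totals from Table 1 (blueberries, strawberries, yogurt drops).
counts = np.array([20 + 25 + 18 + 15 + 28,   # blueberries
                   5 + 6 + 3 + 2 + 7,        # strawberries
                   10 + 12 + 8 + 15 + 5])    # yogurt drops

# ML estimate of the multinomial parameters: normalize the counts.
p_ml = counts / counts.sum()
print(p_ml)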
5. Prof. K and his wife were quite surprised, because the one-year-old usually doesn't eat
that many blueberries at home. There is something called the "daycare effect": a kid tends to be
encouraged to eat more if there are a bunch of other kids around eating at the same time.
Even if the daycare effect is real, they suspect that their kid might not have eaten all of the
blueberries. For example, the toddler could have thrown the food on the floor like she does
at home, which might have caused overcounting.
Figure 1: Food on the floor
6. Meanwhile, since she started eating those three kinds of food, the parents have recorded the
amounts she had. It turned out that the baby had eaten 1080 blueberries, 450 strawberries,
and 900 yogurt drops over the month before she entered the daycare. This is their a priori
knowledge.
7. Use the a priori knowledge to better estimate the three parameters. Feel free to come up
with a derivation for MAP estimation: you would need to do the optimization on the log of the
posterior probability (M01-L01-S13) using Lagrange multipliers (M01-L03-S21). But I don't
require that, and I actually don't encourage that approach, at least for this problem.
8. The thing is, from your intuition, you can easily come up with a MAP estimate for this
problem, like you did in P1-4 (where you could have used your intuition to find the ML
estimate, too). Try to figure this out without math. The goal of this question is to give you
an insight into pseudo-counts.
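For a quick check of that intuition, here is a minimal sketch assuming the pseudo-count reading of the hint (treating the prior month's counts as extra observations; this is one reasonable interpretation, not the official derivation):

import numpy as np

# Observed counts from the daycare week (blueberries, strawberries, yogurt drops).
observed = np.array([106, 23, 50])
# Prior ("pseudo") counts recorded by the parents over the previous month.
prior = np.array([1080, 450, 900])

# MAP-style estimate: add the pseudo-counts to the observations, then normalize.
p_map = (observed + prior) / (observed + prior).sum()
print(p_map)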
9. In your LaTeX report, include your ML and MAP estimates and explain how you got
them.
P2: Central Limit Theorem [3 points]
1. You’re taking Prof. K’s MLSP course as a residential student. It turned out that you like his
offline lectures better than the recorded versions, because he tells a lot of jokes when there’s
no camera around. So, you decided to record his offline lectures without his permission.
2. When you played the recording at home, you realized that the recording was contaminated
by some annoying classmates around your smartphone. They were too talkative in the middle
of the class, making Prof. K’s deep and beautiful voice inaudible.
3. You know you'll learn about source separation soon in class, but you don't know how to do
it yet. Still, you at least want to know which recording is better than the other, because you
found out that another student recorded the same lecture. For example, you could think that
a recording with more interfering sources will be less preferable than one with less
interference.
4. x1.wav and x2.wav are two recordings made in the MLSP class by two devices. Even though
both of them are capturing the same part of the lecture, the numbers of interfering sources
are different from each other. To assure you that this was actually about MLSP, I’m also
providing s.wav which is Prof. K’s ground-truth speech. Listen to s.wav, x1.wav and x2.wav.
Obviously, s.wav must be the best among the three, but can you tell which one is cleaner
between x1.wav and x2.wav (i.e. with less interfering sources)?
5. You can use the Central Limit Theorem (CLT) to figure out which one has more sources.
First, we assume that the samples from a given source come from a random variable with an
unknown probability distribution. We assume that it's not a Gaussian distribution. For
example, draw a histogram from the samples of s.wav. Does it look like a Gaussian? To me it's
too spiky, so maybe not. Although you cannot draw histograms of the other interfering sources
(because you don't know them), let's assume they come from non-Gaussian distributions, too.
6. According to the CLT, the sum of random variables gets closer to a Gaussian distribution. If
more random variables (i.e. sources) are added up, the mixture will be more
Gaussian-like. Therefore, you can measure the Gaussian-likeness of the sample distributions
of x1.wav and x2.wav to see which one has more sources.
7. You will use a non-Gaussianity metric, called “Kurtosis,” for this. Long story short, kurtosis
is defined as follows if the distribution of the random variable x is normalized (i.e. zero mean
and unit variance):

K(x) = E(x^4) − 3    (1)
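In Python (NumPy assumed), this metric is a one-liner; a minimal sketch, assuming x is already standardized:

import numpy as np

def kurtosis(x):
    """Excess kurtosis K(x) = E(x^4) - 3, assuming x has zero mean and unit variance."""
    return np.mean(x ** 4) - 3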
8. For Python, it's convenient to import librosa (http://librosa.github.io), which has a load function as follows:
import librosa
x, sr = librosa.load('x1.wav', sr=None)
(Note: the sr=None argument makes sure that the load function doesn’t change the original
sampling rate. For now don’t think too much about this, and just use this argument.)
9. Standardize both signals by subtracting their sample means and dividing by their sample
standard deviations.
10. Calculate the kurtosis. Which one is less Gaussian-like according to the kurtosis values?
Draw histograms of the two signals. Can you eyeball the graphs to see which one is less
Gaussian-like?
11. As a sanity check, compare them with the histogram of s.wav as well. Can you see that the
mixture signals' histograms are more Gaussian-like than the ground-truth source's? Calculate
the kurtosis of s.wav as well for comparison.
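Putting the pieces together, here is one possible sketch of the whole comparison (librosa, NumPy, and matplotlib assumed; file names as given above):

import librosa
import numpy as np
import matplotlib.pyplot as plt

for name in ['s.wav', 'x1.wav', 'x2.wav']:
    x, sr = librosa.load(name, sr=None)
    z = (x - x.mean()) / x.std()          # standardize: zero mean, unit variance
    k = np.mean(z ** 4) - 3               # excess kurtosis, Eq. (1)
    print(name, k)
    plt.hist(z, bins=100, histtype='step', label=name)

plt.legend()
plt.show()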
12. Submit your histograms, the kurtosis values of the three signals, and your verbal explanation
of your decision as to which recording has more interfering sources. Don't forget to submit
your code.
P3: Eigenvectors for Two-Notes [5 points]
1. Write your own power iteration routine that calculates the eigenvectors one by one (M01-L02-
S11).
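A minimal sketch of one way to structure such a routine, assuming you run it on a symmetric matrix (e.g. something like X @ X.T) and use deflation to pull out the eigenvectors one by one:

import numpy as np

def power_iteration(A, num_vectors=2, num_iters=1000):
    """Estimate the leading eigenvectors of a symmetric matrix A, one by one."""
    A = A.copy()
    vectors = []
    for _ in range(num_vectors):
        v = np.random.rand(A.shape[0])
        for _ in range(num_iters):
            v = A @ v
            v /= np.linalg.norm(v)
        lam = v @ A @ v                      # Rayleigh quotient: eigenvalue estimate
        vectors.append(v)
        A = A - lam * np.outer(v, v)         # deflate to expose the next eigenvector
    return np.stack(vectors, axis=1)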
2. Load flute.mat. It is a matrix representation of the two musical notes played by a flute.
Plot this matrix by using a color map of your choice to show the intensity of all the horizontal
bars (they’re something called harmonics). Your plot will look like the one in M01-C02-S07.
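For example, a loading/plotting sketch (scipy and matplotlib assumed; the key 'X' below is a hypothetical variable name, so check the keys of the loaded dictionary first):

import matplotlib.pyplot as plt
from scipy.io import loadmat

mat = loadmat('flute.mat')
print(mat.keys())                 # find the actual variable name stored in the file
X = mat['X']                      # hypothetical key; replace with the real one
plt.imshow(X, aspect='auto', origin='lower', cmap='magma')
plt.xlabel('time frame'); plt.ylabel('frequency bin')
plt.show()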
3. The input matrix X has 143 column vectors, each of which has 128 frequency elements.
Estimate two eigenvectors from this by using the power iteration routine you wrote. Plot
your eigenvectors and put them on your report along with the X matrix. This time, since
the eigenvectors are multidimensional, you need to draw a graph of each of them rather than
a line. Once again, it will look like the tall two-column matrix in the middle of M01-C02-S07.
4. Now you know the representative spectra for the two notes. How would you recover their
temporal activations? They will be two row vectors for the two notes, respectively. You need
to come up with an equation for this, and plot the activation (row) vectors you got from this
procedure in the report. Once again, it will look like the third fat matrix in M01-C02-S07.
5. Finally, since you know the basis vectors and their activations, you can recover each source
separately. How would you recover the first note? Come up with an equation for this, and
plot the result.
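To sketch items 4 and 5 in code (the least-squares form below is just one reasonable choice for the activation equation, and the placeholder arrays stand in for your actual W and X):

import numpy as np

# Placeholders with the shapes from the problem; in practice W comes from your
# power iteration routine and X from flute.mat.
X = np.random.rand(128, 143)                   # input matrix (stand-in)
W = np.random.rand(128, 2)                     # two estimated eigenvectors (stand-in)

# Item 4: temporal activations, solving W H ~= X in the least-squares sense.
H = np.linalg.lstsq(W, X, rcond=None)[0]       # 2 x 143 activation matrix

# Item 5: recover the first note from its basis vector and activation row.
note1 = np.outer(W[:, 0], H[0, :])             # 128 x 143 reconstruction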
6. You don’t have to answer this question, but think about whether you like this separation
result or not. I will teach you some better ways to do this in the future.
P4: BFGS [5 points]
1. Load sg_train.jpg. This is the clean image you use as the target of the optimization. Load
sgx_train.jpg, too. It's the noisy image you use as the input training data. Your test input
is sgx_test.jpg. Replicate the denoising procedure (including testing) from pages 7 to 11
of "Lecture 03 - Optimization." Show me the denoised images (of both the training and testing
images).
2. On top of the gradient descent algorithm, now use the BFGS algorithm to do the same
denoising job. You know that you have to implement your own BFGS routine.
Report the results.
3. Compare the convergence behavior of the two approaches in graphs.
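If it helps to get started, here is a generic BFGS sketch, not tied to the denoising objective; the fixed step size is a simplification (a proper line search would be better):

import numpy as np

def bfgs(f, grad, x0, num_iters=100, step=1.0):
    """Generic BFGS sketch: maintain an inverse-Hessian estimate B and update it
    from the parameter and gradient differences (s, y) at every iteration."""
    x = x0.copy()
    B = np.eye(len(x0))                      # inverse-Hessian approximation
    g = grad(x)
    losses = []
    for _ in range(num_iters):
        d = -B @ g                           # search direction
        x_new = x + step * d                 # fixed step; a line search is better
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        sy = s @ y
        if sy > 1e-10:                       # skip update if curvature condition fails
            rho = 1.0 / sy
            I = np.eye(len(x0))
            B = ((I - rho * np.outer(s, y)) @ B @ (I - rho * np.outer(y, s))
                 + rho * np.outer(s, s))
        x, g = x_new, g_new
        losses.append(f(x))
    return x, losses

# Tiny sanity check on a quadratic: the minimizer should approach [1, 2].
f = lambda x: np.sum((x - np.array([1.0, 2.0])) ** 2)
grad = lambda x: 2 * (x - np.array([1.0, 2.0]))
x_opt, losses = bfgs(f, grad, np.zeros(2))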
Hint on converting the image into the X matrix: the f vector is a vectorized version of the
2D denoising filter you're estimating (you can construct it by concatenating all the column
vectors into one very long column vector). For example, f ∈ R^{225×1} if your
denoising filter is a 15 × 15 matrix. That's the filter you have to estimate. You can convert
your noisy input image into X by vectorizing each patch and concatenating the column
vectors horizontally, where a patch is a small section of your image with the same
dimensions as your filter (15 × 15). You collect these patches from the top-left corner to the
bottom-right corner of your input image by shifting it by one pixel each time. For example,
for an input image of 200 × 200 pixels and a 15 × 15 filter, you have 186 × 186 = 34,596
patches, each of which has 15 × 15 pixels. Eventually, f^⊤ X will result in a cleaned-up image
in the form of a row vector, which you can reshape back into the 2D image.
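A sketch of the patch-to-column conversion described in this hint (pure NumPy; the double loop is slow but easy to read):

import numpy as np

def image_to_patch_matrix(img, filter_size=15):
    """Collect all filter_size x filter_size patches (1-pixel stride) as columns of X."""
    h, w = img.shape
    cols = []
    for i in range(h - filter_size + 1):
        for j in range(w - filter_size + 1):
            patch = img[i:i + filter_size, j:j + filter_size]
            cols.append(patch.reshape(-1, order='F'))   # column-major vectorization
    return np.stack(cols, axis=1)                       # shape: 225 x num_patches

For a 200 × 200 input this gives a 225 × 34596 matrix; f.T @ X is then a 1 × 34596 row vector you can reshape to 186 × 186, keeping the same patch ordering.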
Hint on logistic functions:

g(x) = 1 / (1 + exp(−x))    (2)

g′(x) = g(x)(1 − g(x))    (3)