Machine Learning for Signal Processing
(ENGR-E 511; CSCI-B 590)
Homework 2
Instructions
• Submission format: Jupyter Notebook + HTML
– Your notebook should be a comprehensive report, not just a code snippet. Markdown write-ups are
mandatory for answering the homework questions. You need to use LaTeX equations in the
markdown cells if you’re asked.
– Google Colab is the best place to start if this is your first time using an IPython
notebook. No need to use GPUs.
– Download your notebook as an .html version and submit it as well, so that the AIs can
check out the plots and audio. Here is how to convert to HTML in Google Colab.
– This means you need to embed an audio player in the notebook if you’re asked to submit an
audio file.
• Avoid using toolboxes.
P1: White Noise [4 points]
1. Have you ever wondered what we mean by “white” noise? It actually comes from light. When
light is the sum of all visible frequencies, it looks white to human eyes. If you pass the
light through a prism, you all of a sudden see the rainbow colors, the so-called “spectrum.”
Yes, the prism does an analogue version of the Fourier transform.
2. So, even though we don’t see the sound we listen to, if the signal consists of many sinusoids
with different frequencies, it sounds “white.” I know, I know, it doesn’t make sense.
3. You may also want to note that the sample distribution of a white noise signal looks like a
Gaussian distribution, which is not news to us because we all know the central limit theorem.
4. x.wav is a speech signal contaminated by white noise. As I haven’t taught you guys how to
properly do speech enhancement yet, you’re not supposed to know a machine learning-based
solution to this problem (don’t worry, I’ll cover it soon). Instead, you did learn how to do
STFT, so I want you to at least manually erase the white noise from this signal to recover
the clean speech source. For some reason, we know that the white noise added to the signal
doesn’t change its volume over time. So, what we’re going to do is to listen to the sound and
eyeball the spectrogram to find the frames that contain only white noise. Then, we will build
a simple noise model, with which we will suppress the noise in the other speech-plus-noise
frames.
(Note: don’t forget to turn off resampling by setting sr=None if you use librosa.load.)
5. First off, create a DFT matrix F using the equation shown in M02-L01-S11 and S12. You’ll
of course create an N × N complex matrix, but if you look at its real and imaginary parts
separately, you’ll see something like the ones in M02-L01-S14 (the ones in the slide are 20 × 20,
i.e. N = 20). For this problem let’s fix N = 1024. (A consolidated code sketch covering steps
5 through 10 appears after the last step of this problem.)
6. Prepare your data matrix X. You extract the first frame of N samples from the input signal
and apply a Hann window [1]. What that means is that, from the definition of the Hann window,
you create a window of size N and element-wise multiply the window and your N audio samples.
Place the result as the first column vector of the data matrix X. Move by N/2 samples. Extract
another frame of N samples and apply the window. This goes to the second column vector of
X. Do it for your third frame (which should start from the (N + 1)-th sample), and so on. Since
you moved by just half of the frame size, your frames overlap each other by 50%.
(Note: this time it’s okay to use a toolbox to calculate Hann windows.)
7. Apply the DFT matrix to your data matrix, i.e. Y = FX. This is your spectrogram with
complex values. See how it looks (by taking magnitudes and plotting). For example, you
can use imshow in matplotlib.
8. In this spectrogram, identify the frames that contain only noise [2]. For example, the ones at the
end of the signal would be a good choice. Take a sample mean of the chosen column vectors (the
original magnitudes, not the exponentiated ones), e.g. $M = \frac{1}{|C_{\text{noise}}|} \sum_{i \in C_{\text{noise}}} |Y_{:,i}|$, where
$C_{\text{noise}}$ is the set of chosen frames and $|C_{\text{noise}}|$ is the number of frames. This is your noise
model.
9. Subtract M from all the magnitude spectra, |Y|. This will give you residual magnitudes
with suppressed noise. Be careful with negative values: you don’t want them in
your “magnitude” spectra. One quick way to remove them is to turn them into zeros.
Get the original phase from the input spectrogram, i.e. Y/|Y| (element-wise division), and
multiply each of the phase values by the corresponding cleaned-up magnitude to recover the
complex-valued spectra of the estimated clean speech.
10. Multiply by the inverse DFT matrix, which you can also create using the equation in S12.
Let’s call this $F^*$. Since it’s the inverse transform, $F^*F \approx I$ (you can check it, although the
off-diagonal entries might be very small numbers rather than exactly zero). You multiply this matrix with
your spectrogram, which now has the white noise suppressed, to get back the recovered version
of your data matrix, $\hat{X}$. In theory this should give you a real-valued matrix, but you’ll still
see some imaginary parts with very small values. Ignore them by just taking the real part.
Reverse the procedure in step 6 to get the time-domain signal. Basically, it must be a procedure
that takes every column vector of $\hat{X}$ and overlap-and-adds the second half of the $t$-th frame
with the first half of the $(t + 1)$-th frame, and so on. Listen to the signal to check whether the
white noise is suppressed.
[1] https://en.wikipedia.org/wiki/Hann_function
https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.get_window.html
[2] Depending on the plotting function you use, it’s possible that you can’t really “see” the white noise, because
your white noise is not loud enough. What you can do to better visualize this spectrogram is to exaggerate the small
magnitudes while suppressing the large ones. For example, I can visualize $|Y|^{0.5}$ instead of $|Y|$, where the exponentiation
is element-wise. Don’t worry about this visualization issue if you can see the white-noise-only frames in your
spectrogram.
11. Submit your code and the denoised audio file. Do NOT use any STFT functions you can find
in toolboxes.
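Below is a minimal, non-authoritative sketch of how steps 5 through 10 could be wired together in Python, assuming numpy, scipy, librosa, and soundfile are available. The file name x.wav comes from the problem statement, but the variable names, the small numerical stabilizer, and the choice of the last ten frames as the noise-only frames are illustrative assumptions; pick the noise frames by actually inspecting your own spectrogram.

import numpy as np
import librosa
import scipy.signal
import soundfile as sf

N = 1024                      # frame size / DFT size
hop = N // 2                  # 50% overlap

# Step 5: N x N complex DFT matrix, F[f, n] = exp(-j*2*pi*f*n/N)
# (check the sign convention against the slide)
n = np.arange(N)
F = np.exp(-2j * np.pi * np.outer(n, n) / N)

# Step 6: frame the signal with a Hann window, one frame per column of X
x, sr = librosa.load('x.wav', sr=None)              # sr=None keeps the original rate
win = scipy.signal.get_window('hann', N)
starts = np.arange(0, len(x) - N + 1, hop)
X = np.stack([x[s:s + N] * win for s in starts], axis=1)   # N x T

# Step 7: complex-valued spectrogram
Y = F @ X
mag, phase = np.abs(Y), Y / (np.abs(Y) + 1e-12)

# Step 8: noise model from frames judged to be noise-only (here: the last 10, an assumption)
noise_frames = np.arange(mag.shape[1] - 10, mag.shape[1])
M = mag[:, noise_frames].mean(axis=1, keepdims=True)

# Step 9: subtract the noise model, zero out negatives, restore the phase
mag_clean = np.maximum(mag - M, 0.0)
Y_clean = mag_clean * phase

# Step 10: inverse DFT (conjugate transpose divided by N) and overlap-add
F_inv = np.conj(F).T / N
X_hat = np.real(F_inv @ Y_clean)
y = np.zeros(starts[-1] + N)
for t, s in enumerate(starts):
    y[s:s + N] += X_hat[:, t]
sf.write('x_denoised.wav', y, sr)

Note that no toolbox STFT function is used here: the spectrogram comes from the hand-built DFT matrix, and the Hann window is the only toolbox call, which the problem explicitly allows.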
P2: DCT and PCA [4 points]
1. s.wav is a recording of Prof. K’s voice. Load it. Randomly select 8 consecutive samples out
of the 5,000,000 samples. This is your first column vector of your data matrix X. Repeat
this procedure 10 times. Then, the size of X is 8 × 10.
2. Calculate the covariance matrix out of this, whose size must be 8 × 8. Do an eigendecomposition
and extract 8 eigenvectors, each of which has 8 dimensions. Yes, you just did PCA. Plot
your $W^\top$ matrix and compare it to the DCT matrix shown in M02-L01-S21. Similar? Submit
your plot and code.
3. Create another data matrix with 100 samples, i.e. $X \in \mathbb{R}^{8 \times 100}$. Do PCA on this one. How
about 1,000 samples? Can you see that your PCA is getting better with larger datasets? Why do
you think your PCA is getting better? Try to explain in comparison with the DCT matrix. (A code
sketch for building the data matrix and running PCA appears after this problem.)
4. You just saw that PCA might be able to replace the pre-fixed DCT basis vectors. But, as
you can see in your matrices, they are not the same. Discuss the pros and cons of PCA and
DCT in your report.
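A minimal sketch of one possible workflow for this problem, assuming numpy, librosa, and matplotlib. The file name s.wav comes from the problem, while the helper names (random_frames, pca_basis), the random seed, and the unnormalized DCT-II matrix used for the visual comparison are illustrative assumptions (the slide's DCT matrix may be scaled differently).

import numpy as np
import librosa
import matplotlib.pyplot as plt

def random_frames(x, dim=8, num=10, seed=0):
    # data matrix of `num` random length-`dim` consecutive-sample segments (dim x num)
    rng = np.random.default_rng(seed)
    starts = rng.integers(0, len(x) - dim, size=num)
    return np.stack([x[s:s + dim] for s in starts], axis=1)

def pca_basis(X):
    # eigenvectors of the sample covariance, sorted by descending eigenvalue (rows of W^T)
    C = np.cov(X)                      # 8 x 8 covariance matrix
    evals, evecs = np.linalg.eigh(C)   # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1]
    return evecs[:, order].T           # W^T: one eigenvector per row

s, sr = librosa.load('s.wav', sr=None)
for num in (10, 100, 1000):
    W_T = pca_basis(random_frames(s, num=num))
    plt.figure(); plt.imshow(W_T); plt.title(f'PCA basis, {num} frames'); plt.colorbar()

# 8 x 8 unnormalized DCT-II matrix for comparison (rows are the DCT basis vectors)
k, n = np.meshgrid(np.arange(8), np.arange(8), indexing='ij')
D = np.cos(np.pi / 8 * (n + 0.5) * k)
plt.figure(); plt.imshow(D); plt.title('DCT basis')
plt.show()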
P3: Stereo matching [3 points]
1. If you have multiple cameras taking the same scene from different positions, you can recover
the depth of the objects. That’s why we humans can recognize the distance of a visual object
(we have two eyes). See Figure 1 for an example. But I guess our brains have to work
hard to actually estimate the depths of the objects in the visual scene. In this problem we
mimic this process (without knowing exactly how the brain works).
2. im0.ppm (left) and im8.ppm (right) are pictures taken from two different camera positions [3].
If you load the images, each will be a three-dimensional array of 381 × 430 × 3, whose third
dimension is for the three color channels (RGB). Let’s call them $X^L$ and $X^R$. For the $(i,j)$-th
pixel in the right image, $X^R_{(i,j,:)}$, which is a 3-d vector of RGB intensities, we can scan and
find the most similar pixel in the $i$-th row of the left image (using a metric of your choice). For
example, I did the search from $X^L_{(i,j,:)}$ to $X^L_{(i,j+39,:)}$, to see which pixel among the 40 is
the closest. I record the index-distance of the closest pixel. Let’s say that $X^L_{(i,j+19,:)}$ is the
most similar one to $X^R_{(i,j,:)}$. Then, the index-distance is 19. I record this index-distance (to
the closest pixel in the left image) for all pixels in my right image to create a matrix called the
“disparity map”, $D$, whose $(i,j)$-th element says the index-distance between the $(i,j)$-th pixel
of the right image and its closest pixel in the left image. For an object in the right image
(e.g. the tree), if its pixels are associated with an object in the left image, but are shifted far
away, that means the object is close to the cameras, and vice versa.
3. Calculate the disparity map $D$ from im0.ppm and im8.ppm, which will be a matrix of 381 × 390
(since we search within only 40 pixels). Vectorize the disparity matrix and draw a histogram.
How many clusters do you see? (A code sketch for the disparity map appears after this problem.)
[3] http://vision.middlebury.edu/stereo/data/
[Figure 1: two photographs of the same scene, labeled “Left image” and “Right image”.]
Figure 1: The tree is closer than the mountain. So, from the left camera, the tree is located on
the right-hand side, while the right camera captures it on the left-hand side. On the contrary, the
mountain in the back does not have this disparity.
4. Submit your histogram and answer in the report. Submit your code that created the disparity
map, too.
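A minimal sketch of the disparity-map computation described above, assuming imageio for loading the .ppm files and squared Euclidean distance over the RGB vector as the similarity metric; both the loader and the metric are illustrative choices (the problem leaves the metric up to you).

import numpy as np
import imageio.v2 as imageio
import matplotlib.pyplot as plt

left = imageio.imread('im0.ppm').astype(float)    # 381 x 430 x 3
right = imageio.imread('im8.ppm').astype(float)

H, W, _ = right.shape
max_disp = 40
cols = W - max_disp                               # 390 columns can be searched fully
D = np.zeros((H, cols), dtype=int)

# plain double loop for clarity; vectorize if it feels slow
for i in range(H):
    for j in range(cols):
        # distances from the right pixel to the 40 candidate left pixels on the same row
        candidates = left[i, j:j + max_disp, :]               # 40 x 3
        dists = np.sum((candidates - right[i, j, :]) ** 2, axis=1)
        D[i, j] = np.argmin(dists)                            # index-distance 0..39

plt.hist(D.ravel(), bins=40)
plt.xlabel('disparity (index-distance)'); plt.ylabel('count')
plt.show()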
P4: GMM and k-means clustering for stereo matching [5 points]
1. Write up your own k-means clustering code, and cluster the disparity values in D. Each
value will belong to (only) one of the clusters. The number of clusters is the number of
depth levels. For example, in Figure 1, there are only two depths, so k = 2. If you replace
the disparity values with the cluster means, you can recover the depth map with k levels.
Plot your depth map (the disparity map with its values replaced by the mean disparities, as in the image
quantization examples) in gray scale: pixels of the frontal objects should be bright, while the
ones in the back get darker. Submit your plot along with your k-means clustering code. (A code
sketch covering both clustering methods appears after the next step.)
2. Write up your own GMM clustering code, and cluster the disparity values in D. The posterior
probability will give you the (soft) membership of each value in one of the clusters. Recover
the depth map using the means you obtained from GMM, and submit the plot along with
your code.
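A minimal sketch of both clustering steps on the 1-D disparity values, written from scratch as the problem requires; D is the disparity map from P3, while the number of clusters k, the random initialization, the iteration counts, and the small numerical stabilizers are illustrative assumptions you should tune (pick k from your histogram).

import numpy as np
import matplotlib.pyplot as plt

def kmeans_1d(x, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    means = rng.choice(x, size=k, replace=False).astype(float)
    for _ in range(iters):
        # assign each value to its nearest mean, then recompute the means
        labels = np.argmin(np.abs(x[:, None] - means[None, :]), axis=1)
        means = np.array([x[labels == c].mean() if np.any(labels == c) else means[c]
                          for c in range(k)])
    return means, labels

def gmm_1d(x, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    means = rng.choice(x, size=k, replace=False).astype(float)
    variances = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: posterior (soft membership) of each value for each cluster
        lik = weights * np.exp(-0.5 * (x[:, None] - means) ** 2 / variances) \
              / np.sqrt(2 * np.pi * variances)
        post = lik / (lik.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: update weights, means, and variances from the posteriors
        Nk = post.sum(axis=0) + 1e-12
        weights = Nk / len(x)
        means = (post * x[:, None]).sum(axis=0) / Nk
        variances = (post * (x[:, None] - means) ** 2).sum(axis=0) / Nk + 1e-6
    return means, post

x = D.ravel().astype(float)            # disparity values from P3
k = 4                                  # an assumption: set this to the cluster count in your histogram

means_km, labels = kmeans_1d(x, k)
depth_km = means_km[labels].reshape(D.shape)

# hard-assign each value to its highest-posterior cluster to draw the GMM depth map
means_gmm, post = gmm_1d(x, k)
depth_gmm = means_gmm[np.argmax(post, axis=1)].reshape(D.shape)

for depth, title in [(depth_km, 'k-means depth map'), (depth_gmm, 'GMM depth map')]:
    plt.figure(); plt.imshow(depth, cmap='gray'); plt.title(title)
plt.show()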