CM146
Problem Set 3: VC Dimension and Neural Networks

Submission Instructions
• Submit your solutions electronically on the course Gradescope site as PDF files.
• If you plan to typeset your solutions, please use the LaTeX solution template. If you must
submit scanned handwritten solutions, please use a black pen on blank white paper and a
high-quality scanner app.
• For questions involving math and derivation, please provide important intermediate steps and
box the final answer clearly.
• You are required to submit the code only for Question 4 - Digit Recognizer on CCLE. For
the sub-questions in Question 4 requiring you to complete a piece of code, you are expected
to copy-paste your code as a part of the solution in the submission PDF to receive full credit.
Part of this assignment is adapted from the course materials by Hsuan-Tien Lin (National Taiwan University).
1 VC Dimension [16 pts]
For the following problems, we classify a point x to either a positive label +1 or a negative label
−1 by a hypothesis set H.
(a) Positive ray classifier. Consider x ∈ R and H = {sign(x − b) | b ∈ R}. That is, the label is
+1 if x is greater than b and −1 otherwise. [8 pts]
i. For N points, prove that there are at most N + 1 label outcomes that can be generated
by H. For example, when we have 4 points, x1 ≤ x2 ≤ x3 ≤ x4, there are at most 5
label outcomes that can be generated by H, as shown below.
x1 x2 x3 x4
+1 +1 +1 +1
−1 +1 +1 +1
−1 −1 +1 +1
−1 −1 −1 +1
−1 −1 −1 −1
ii. What is the VC dimension of H?
(b) Positive interval classifier. Consider x ∈ R and H = {sign(1(x ∈ [a, b]) − 0.5) | a, b ∈ R, a ≤
b}, where 1(·) is the indicator function. That is, the label is +1 if x is in the interval [a, b]
and −1 otherwise. [8 pts]
i. For N points, prove that there are at most (N² + N)/2 + 1 label outcomes that can be
generated by H.
ii. What is the VC dimension of H?
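Though not required for the submission, you can sanity-check both counts with a quick brute-force
sweep over concrete points before proving them. A minimal sketch (the function names are ours,
not part of the assignment):

    def count_ray_dichotomies(xs):
        # Sweep the threshold b below, between, and above the sorted points.
        xs = sorted(xs)
        cuts = [xs[0] - 1] + [(p + q) / 2 for p, q in zip(xs, xs[1:])] + [xs[-1] + 1]
        return len({tuple(1 if x > b else -1 for x in xs) for b in cuts})

    def count_interval_dichotomies(xs):
        # Try every [lo, hi] whose endpoints fall strictly between or outside the points.
        xs = sorted(xs)
        grid = [xs[0] - 1] + [(p + q) / 2 for p, q in zip(xs, xs[1:])] + [xs[-1] + 1]
        return len({tuple(1 if lo <= x <= hi else -1 for x in xs)
                    for lo in grid for hi in grid if lo <= hi})

    pts = [0.5, 1.2, 3.0, 4.7]
    print(count_ray_dichotomies(pts))       # 5  = N + 1
    print(count_interval_dichotomies(pts))  # 11 = (N^2 + N)/2 + 1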
2 Bound of VC dimension [16 pts]
Assume that the VC dimension of an empty hypothesis set is zero. Now, we have two hypothesis sets H1 and H2.
(a) Let H3 = H1 ∩ H2. Show that VC(H1) ≥ VC(H3). [6 pts]
(b) Let H3 = H1 ∪ H2. Give an example to show that VC(H1) + VC(H2) < VC(H3) is possible.
[10 pts]
3 Neural Network [20 pts]
We design a neural network to implement the XOR operation of X1, X2, X3, X4, X5. We use +1 to
represent true and −1 to represent false. Consider the following neural network.
[Figure: a two-layer network with inputs X1, …, X5 (plus a +1 bias), hidden units H1, …, H5 (plus a +1 bias), and output T.]
We use wij to represent the weight between Xi and Hj , and use w0j to represent the weight between
the first layer bias term (+1) and Hj . We use vi to represent the weight between Hi and T, and
use v0 to represent the weight between the second layer bias term (+1) and T.
Now, let Xi ∈ {+1, −1}, Hj = sign(∑_{i=0}^{5} wij Xi), and T = sign(∑_{i=0}^{5} vi Hi), where
X0 = +1 and H0 = +1 are the bias terms.
(a) Specify wij such that Hj is +1 if there are at least j positive values among X1, X2, X3, X4, X5,
and −1 otherwise. If there are multiple acceptable weights, you only need to write down one of
them. [8 pts]
(b) Given wij and Hj defined as above, specify vi such that the whole neural network behaves
like the XOR operation of X1, X2, X3, X4, X5. If there are multiple acceptable weights, you
only need to write down one of them. [8 pts]
(c) Justify why the output of the neural network behaves like the XOR operation of X1, X2, X3, X4, X5.
[4 pts]
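Though the problem asks for weights and a justification on paper, a short brute-force tester can
confirm that a candidate weight setting behaves like five-input XOR on all 32 inputs. In the sketch
below the zero weights are placeholders for your own answers, and sign(0) is arbitrarily treated
as +1:

    from itertools import product

    def net(x, w, w0, v, v0):
        # x: five ±1 inputs; w[i][j] connects Xi to Hj, w0[j] is the bias weight
        # into Hj; v[j] connects Hj to T, v0 is the bias weight into T.
        H = [1 if w0[j] + sum(w[i][j] * x[i] for i in range(5)) >= 0 else -1
             for j in range(5)]
        return 1 if v0 + sum(v[j] * H[j] for j in range(5)) >= 0 else -1

    # Placeholders -- substitute your answers to (a) and (b).
    w, w0 = [[0] * 5 for _ in range(5)], [0] * 5
    v, v0 = [0] * 5, 0

    xor = lambda x: 1 if x.count(1) % 2 == 1 else -1   # true iff an odd number of +1s
    print(all(net(x, w, w0, v, v0) == xor(x)
              for x in product([1, -1], repeat=5)))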
4 Implementation: Digit Recognizer [48 pts]
In this exercise, you will implement a digit recognizer in PyTorch. Our data contains pairs of 28×28
images xn and the corresponding digit labels yn ∈ {0, 1, 2}. For simplicity, we view a 28×28 image
xn as a 784-dimensional vector by concatenating the row pixels; in other words, xn ∈ R^784. Your
goal is to implement two digit recognizers (OneLayerNetwork and TwoLayerNetwork) and compare
their performance.
Code and data
• code : Fall2020-CS146-HW3.ipynb
• data : hw3_train.csv, hw3_valid.csv, hw3_test.csv
Please use your @g.ucla.edu email ID to access the code and data. Similar to HW-1, copy the Colab
notebook to your drive and make the changes. Mount the drive appropriately and copy the shared
data folder to your drive to access it via Colab. For a Colab usage demo, check out the Discussion
recordings for Week 2 in CCLE. The notebook has marked blocks where you need to code:
### ========== TODO : START ========== ###
### ========== TODO : END ========== ###
Note: For the questions requiring you to complete a piece of code, you are expected
to copy-paste your code as a part of the solution in the submission PDF. Tip: If you
are using LaTeX, check out the Minted package (example) for code highlighting.
Data Visualization and Preparation [10 pts]
(a) Randomly select three training examples with different labels and print out the images by
using the plot_img function. Include those images in your report. [2 pts]
(b) The loaded examples are numpy arrays. Convert the numpy arrays to tensors. [3 pts]
(c) Prepare train_loader, valid_loader, and test_loader by using TensorDataset and DataLoader.
We expect to get a batch of pairs (xn, yn) from the dataloader. Please set the batch size to
10. [5 pts]
You can refer to https://pytorch.org/docs/stable/data.html for more information about
TensorDataset and DataLoader.
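As a rough sketch of the pattern parts (b) and (c) ask for (the array names and shapes below are
stand-ins; in the notebook the arrays come from the loaded CSV files):

    import numpy as np
    import torch
    from torch.utils.data import TensorDataset, DataLoader

    # Stand-in data with the assignment's shapes; replace with the loaded arrays.
    X_train = np.random.rand(300, 784).astype(np.float32)
    y_train = np.random.randint(0, 3, size=300)

    X_t = torch.from_numpy(X_train)          # float32 feature tensor
    y_t = torch.from_numpy(y_train).long()   # int64 labels, as CrossEntropyLoss expects

    train_loader = DataLoader(TensorDataset(X_t, y_t), batch_size=10, shuffle=True)
    xb, yb = next(iter(train_loader))        # one batch of (xn, yn) pairs
    print(xb.shape, yb.shape)                # torch.Size([10, 784]) torch.Size([10])

valid_loader and test_loader follow the same pattern (shuffling is usually unnecessary for them).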
One-Layer Network [15 pts]
For the one-layer network, we consider a 784–3 network. In other words, we learn a 784 × 3 weight
matrix W. Given an xn, we can compute the probability vector pn = σ(Wxn), where σ(·) is the
element-wise sigmoid function and pn,c denotes the probability of class c. Then, we focus on the
cross entropy loss

−∑_n ∑_c 1(c = yn) log(pn,c),

where the sum over n runs over the N examples, the sum over c runs over the C classes, and 1 is
the indicator function.
(d) Implement the constructor of OneLayerNetwork with torch.nn.Linear and implement the
forward function to compute the outputs of the single fully connected layer, i.e., Wxn. Notice
that we do not compute the sigmoid function here since we will use torch.nn.CrossEntropyLoss
later. [5 pts]
You can refer to https://pytorch.org/docs/stable/generated/torch.nn.Linear.html
for more information about torch.nn.Linear and refer to https://pytorch.org/docs/
stable/generated/torch.nn.CrossEntropyLoss.html for more information about using
torch.nn.CrossEntropyLoss.
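One way the constructor and forward can look (a minimal sketch, not necessarily matching the
notebook's scaffolding; note that nn.Linear(784, 3) stores its weight as a 3 × 784 matrix and
includes a bias term by default):

    import torch.nn as nn

    class OneLayerNetwork(nn.Module):
        def __init__(self):
            super().__init__()
            # Single fully connected layer: 784 pixel inputs -> 3 class scores.
            self.fc = nn.Linear(784, 3)

        def forward(self, x):
            # Return the raw scores Wxn; CrossEntropyLoss applies the rest.
            return self.fc(x)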
(e) Create an instance of OneLayerNetwork, set up a criterion with torch.nn.CrossEntropyLoss,
and set up an SGD optimizer with learning rate 0.0005 by using torch.optim.SGD. [2 pts]
You can refer to https://pytorch.org/docs/stable/optim.html for more information about
torch.optim.SGD.
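The setup itself is only a few lines (a sketch; the variable names are ours):

    import torch.nn as nn
    import torch.optim as optim

    model_one = OneLayerNetwork()
    criterion = nn.CrossEntropyLoss()            # expects raw scores and integer labels
    optimizer = optim.SGD(model_one.parameters(), lr=0.0005)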
(f) Implement the training process. This includes the forward pass, zeroing the gradients,
computing the loss, calling backward on the loss, and updating the model parameters. If you
implement everything correctly, after running the train function in main, you should get results
similar to the following. [8 pts]
Start training OneLayerNetwork...
| epoch 1 | train loss 1.075387 | train acc 0.453333 | valid loss ...
| epoch 2 | train loss 1.021301 | train acc 0.563333 | valid loss ...
| epoch 3 | train loss 0.972599 | train acc 0.630000 | valid loss ...
| epoch 4 | train loss 0.928335 | train acc 0.710000 | valid loss ...
...
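The body of one training epoch follows the standard PyTorch pattern (a sketch; the notebook's
train function additionally loops over epochs and tracks the losses and accuracies printed above):

    model_one.train()
    for xb, yb in train_loader:
        optimizer.zero_grad()            # reset gradients from the previous step
        logits = model_one(xb)           # forward pass
        loss = criterion(logits, yb)     # cross entropy on the raw outputs
        loss.backward()                  # backpropagate
        optimizer.step()                 # SGD parameter update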
Two-Layer Network [7 pts]
For the two-layer network, we consider a 784–400–3 network. In other words, the first layer consists
of a fully connected layer with a 784 × 400 weight matrix W1, and the second layer consists of a
400 × 3 weight matrix W2. Given an xn, we can compute the probability vector pn = σ(W2 σ(W1 xn)),
where σ(·) is the element-wise sigmoid function. Again, we focus on the cross entropy loss, hence
the network will implement W2 σ(W1 xn) (note the outer sigmoid will be taken care of implicitly
in our loss).
(g) Implement the constructor of TwoLayerNetwork with torch.nn.Linear and implement the
forward function to compute W2 σ(W1 xn). [5 pts]
(h) Create an instance of TwoLayerNetwork, set up a criterion with torch.nn.CrossEntropyLoss,
and set up an SGD optimizer with learning rate 0.0005 by using torch.optim.SGD. Then train
the TwoLayerNetwork. [2 pts]
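The two-layer version only adds a hidden layer and an explicit sigmoid between the two linear
maps (again a sketch, not the required solution):

    import torch
    import torch.nn as nn

    class TwoLayerNetwork(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc1 = nn.Linear(784, 400)   # W1: 784 -> 400
            self.fc2 = nn.Linear(400, 3)     # W2: 400 -> 3

        def forward(self, x):
            h = torch.sigmoid(self.fc1(x))   # element-wise sigmoid between layers
            return self.fc2(h)               # raw scores; the loss handles the rest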
Performance Comparison [16 pts]
(i) Generate a plot depicting how one_train_loss, one_valid_loss, two_train_loss, and two_valid_loss
vary with epochs. Include the plot in the report and describe your findings. [3 pts]
(j) Generate a plot depicting how one_train_acc, one_valid_acc, two_train_acc, and two_valid_acc
vary with epochs. Include the plot in the report and describe your findings. [3 pts]
(k) Calculate and report the test accuracy of both the one-layer network and the two-layer network. Explain why we get such results. [3 pts]
(l) Replace the SGD optimizer with the Adam optimizer and do the experiments again. Show
the loss figure, the accuracy figure, and the test accuracy. Include the figures in the report
and describe your findings. [7 pts]
You can refer to https://pytorch.org/docs/stable/optim.html for more information about
torch.optim.Adam.
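Swapping the optimizer is a one-line change (the handout does not fix a learning rate for Adam;
reusing 0.0005 is an assumption, one reasonable starting point):

    import torch
    optimizer = torch.optim.Adam(model_one.parameters(), lr=0.0005)  # lr is an assumption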
Submission instructions for programming problems
• Please export the notebook to a .py file by clicking “File” → “Download .py” and upload it
to CCLE.
Your code should be commented appropriately. The most important things:
– Your name and the assignment number should be at the top of each file.
– Each class and method should have an appropriate docstring.
– If anything is complicated, it should include some comments.
There are many possible ways to approach the programming portion of this assignment, which
makes code style and comments very important so that staff can understand what you did.
For this reason, you will lose points for poorly commented or poorly organized code.
• Please submit all the plots and the rest of the solutions (other than code) to Gradescope.