COMP 4471 & ELEC 4240
Spring 2020
Midterm
14 April 2020
Time Limit: 80 Minutes
Name:
Student ID:
This exam contains 9 pages (including this cover page) and 4 questions.
The total number of points is 100.
You can use a single-sided cheat sheet for the midterm.
Grade Table (for teacher use only)
Question    Points    Score
1           40
2           20
3           20
4           20
Total:      100
• Prepare 3 blank sheets of paper to use as answer sheets. Alternatively, you can print the PDF and write on the printed midterm, or use a tablet to answer the questions.
• Prepare a black pen and a smartphone to capture images
• Sign the honor code. No communication among students
• Turn on video cameras (ensure that’s you)
• Open book exam. Free to browse materials (including the hyperlinks to external websites) on the course website
• No Google. No external websites during the exam, but you can download external
materials in advance
• Exam Time: 12:00 pm – 1:20 pm. Please join Zoom (https://hkust.zoom.us/j/794864322)
at 11:50 am.
• A PDF of the midterm will be shared at 12:00 pm in the Zoom chatroom and via email
announcement
• Write your name, student ID, and answers on your answer sheets
• Take clear images of your answers
• Send an email to comp4471.hkust@gmail.com with images of your answers by 1:35 pm.
Or you can upload them to Canvas by 1:35 pm
• If there is any technical issue uploading images, you should send an email to comp4471.hkust@gmail.com with only the final answer to each question, in text, by 1:50 pm. Only choices and final numbers/equations are needed in the email as a record. You can then send your images by 3 pm
• We will check all the submissions by 2 pm to ensure every student has submitted their
midterms correctly
• Ask questions during the midterm via the chatroom on Zoom
• TAs and the instructor will be on Zoom 11:45 am - 2 pm
Honor Code
Honesty and integrity are central to the academic work of HKUST. Students of the University
must observe and uphold the highest standards of academic integrity and honesty in all the
work they do throughout their program of study.
As members of the University community, students have the responsibility to help maintain the academic reputation of HKUST in its academic endeavors.
Sanctions will be imposed on students, if they are found to have violated the regulations
governing academic integrity and honesty.
Please write "I have read and understood the honor code" on your answer sheet after your name and student ID.
1. (40 points) Short questions. Please choose the correct choice(s) for each question. There may be more than one correct choice.
1. Why are deep learning models preferable to classical machine learning methods in image classification? AlexNet showed a significant improvement in the ImageNet challenge in 2012.
(a) It is faster to train a deep learning model
(b) Features are hand-crafted in classical machine learning methods, but
are learned automatically in deep learning
(c) Deep learning models can be trained on a big dataset but classical machine
learning methods can not
(d) Deep learning models can be trained on GPU but classical machine learning
methods can not
2. Elastic Net regularizer:
(a) Has L1-Norm
(b) Has L2-Norm
(c) Prevents over-fitting
(d) Is differentiable everywhere
3. Which are true about second-order optimization:
(a) It is theoretically more optimal than first-order optimization (we will not penalize choosing this choice)
(b) It is often used in practice for training deep learning models
(c) It is computationally expensive
(d) Adam is second-order optimization
4. Consider a four-layer convolutional network with only 3 × 3 dilated convolutions¹. In the first layer, the dilation rate is 1. In the second layer, the dilation rate is 2. In the third layer, the dilation rate is 4. In the fourth layer, the dilation rate is 1. What is the receptive field of each neuron in the activation map right after the fourth layer? (A short sketch of this calculation appears after this question set.)
(a) 17
(b) 15
(c) 13
(d) 11
¹ In case you are not familiar with dilated convolutions, you can find the details at https://cs231n.github.io/convolutional-networks/ (search "dilated").
5. Which are true about dynamic and static computational graphs in TensorFlow and PyTorch?
(a) We can build the computational graph first and execute the computation later in a static computation graph.
(b) We can build the computational graph first and execute the computation later in a dynamic computation graph.
(c) We can build the computational graph and execute the computation simultaneously in a static computation graph.
(d) We can build the computational graph and execute the computation simultaneously in a dynamic computation graph.
6. Which of the following are true for the multiclass SVM loss (assume that we compute the loss on a single image)? Select all that apply. (No correct answer.)
(a) It is positive if and only if the correct-class score is not the highest among all
the scores
(b) It is positive if and only if the correct-class score is the highest among all the
scores
(c) It is positive if and only if the correct-class score is higher than the second-
largest score (among all the scores) by a certain margin
(d) It can be negative
7. Which of the following are true of convolutional neural networks for image analysis:
(a) Filters in earlier layers can be replaced by classical edge detectors
(b) Pooling layers reduce the spatial resolution of the image
(c) They have more parameters than fully connected networks with the same num-
ber of layers and the same numbers of neurons in each layer
(d) A convolutional neural network can be trained for unsupervised learning tasks,
whereas an ordinary neural net cannot
8. Which layer may have the largest number of trainable parameters? You can assume the input to this layer has an unknown dimension N × M × D where N, M, D ≥ 1.
(a) A convolutional layer with 10 3 × 3 filters (with biases)
(b) A convolutional layer with 4 5 × 5 filters (with biases)
(c) A 2 × 2 max-pooling layer
(d) A fully connected layer that maps the input to 10 scores
9. Recurrent neural networks (RNNs) are often applied to sequential data, since:
(a) The training time required is shorter than that of CNNs.
(b) RNNs can theoretically handle sequences of unbounded length.
(c) RNN models are less likely to suffer from vanishing gradients.
(d) RNNs can be used for generation tasks, which is impossible for CNNs.
10. If your test error is poor while your training error is good, which of the following may improve your test error? Select all that apply:
(a) Apply cross-validation and choose better hyperparameters.
(b) Add data augmentation
(c) Add dropout layers in the model.
(d) Apply early stopping.
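
For question 4 above, the receptive field grows layer by layer: with stride 1, each 3 × 3 convolution with dilation rate d enlarges the receptive field by 2d. The following is a minimal sketch of that calculation (assuming stride 1 everywhere, as in the question); it yields 17, i.e. choice (a).

# Receptive field of stacked 3x3, stride-1 dilated convolutions.
def receptive_field(kernel_size, dilation_rates):
    rf = 1
    for d in dilation_rates:
        rf += (kernel_size - 1) * d      # stride 1: each layer adds (k - 1) * d
    return rf

print(receptive_field(3, [1, 2, 4, 1]))  # 17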
2. (20 points) Short questions. Only the final answers are needed for each question.
1. Consider a simple model where z = ReLU(x) × ReLU(y) + ReLU(x) + ReLU(y), x = −1, and y = 1. What are ∂z/∂x and ∂z/∂y?
∂z/∂x = 0, ∂z/∂y = 1
2. Suppose a loss function L = (Ax)ᵀ(By) where x, y are column vectors, A, B are square matrices, and L is a scalar. What are ∂L/∂x and ∂L/∂y?
In practice the derivative should have the same dimensions as the variable (i.e., column vectors), but we accept row-vector answers as long as the dimensions are consistent:
∂L/∂x = (By)ᵀA, ∂L/∂y = (Ax)ᵀB
or
∂L/∂x = ((By)ᵀA)ᵀ = AᵀBy, ∂L/∂y = ((Ax)ᵀB)ᵀ = BᵀAx
3. True or False. In a neural network, an activation function must be differentiable
everywhere so that back-propagation can be performed. If true, please explain why;
if false, please provide a counter example.
False. ReLU is a counterexample: it is not differentiable at 0, yet back-propagation works in practice by using a subgradient at that point.
4. For AdaGrad (see Slide 27 of Lecture 7), there is a step
x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7). What is the role of 1e-7?
The 1e-7 term prevents division by zero when the accumulated squared gradient is zero.
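
The two gradient answers above can be spot-checked numerically. The following is a minimal sketch using PyTorch autograd; the dimensions and the random A, B, x, y are arbitrary example values, not part of the exam.

import torch

# Question 1: z = ReLU(x) * ReLU(y) + ReLU(x) + ReLU(y) at x = -1, y = 1
x = torch.tensor(-1.0, requires_grad=True)
y = torch.tensor(1.0, requires_grad=True)
z = torch.relu(x) * torch.relu(y) + torch.relu(x) + torch.relu(y)
z.backward()
print(x.grad, y.grad)                       # expected: dz/dx = 0, dz/dy = 1

# Question 2: L = (Ax)^T (By); expected dL/dx = A^T B y, dL/dy = B^T A x (column forms)
torch.manual_seed(0)
A, B = torch.randn(4, 4), torch.randn(4, 4)
x = torch.randn(4, requires_grad=True)
y = torch.randn(4, requires_grad=True)
L = (A @ x) @ (B @ y)                       # scalar loss
L.backward()
print(torch.allclose(x.grad, A.T @ B @ y))  # True
print(torch.allclose(y.grad, B.T @ A @ x))  # True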
3. (20 points) Consider a convolutional network that takes a 32 × 32 × 3 RGB image as input. It has 15 convolutional layers followed by a 2 × 2 average pooling layer. Each convolutional layer has 64 3 × 3 convolutional filters with biases and no padding. ReLU is used after each convolutional layer.
(a) (4 points) What is the size of the network output?
It is a vector of size 64, or 1 × 1 × 64.
(b) (4 points) Determine the number of parameters in the first convolutional layer.
(Biases are used)
64 × (3 × 3 × 3 + 1) = 1792
(c) (4 points) Determine the number of parameters in the average pooling layer.
0
(d) (4 points) Determine the number of parameters in this network. (Biases are used)
64 × (3 × 3 × 3 + 1) + 14 × 64 × (3 × 3 × 64 + 1) = 1792 + 14 × 36928 = 518784
(e) (4 points) Now suppose we add one more fully-connected layer at the end of the network so that it outputs 100 scores. What is the dimension of this fully connected layer? (Biases are used)
The weight matrix of the fully connected layer has dimension 64 × 100 and the bias is a vector of size 100.
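
The answers above can be verified by building the network and counting parameters. Below is a minimal PyTorch sketch; the layer arrangement follows the description in the question, and the variable names are my own.

import torch
import torch.nn as nn

# 15 convolutional layers (64 filters of size 3x3, biases, no padding), each followed
# by ReLU, then a 2x2 average pooling layer.
layers, in_channels = [], 3
for _ in range(15):
    layers += [nn.Conv2d(in_channels, 64, kernel_size=3, padding=0), nn.ReLU()]
    in_channels = 64
layers.append(nn.AvgPool2d(kernel_size=2))
net = nn.Sequential(*layers)

image = torch.randn(1, 3, 32, 32)                # one 32x32x3 RGB image
print(net(image).shape)                          # torch.Size([1, 64, 1, 1]), i.e. 1 x 1 x 64
print(sum(p.numel() for p in net.parameters()))  # 518784 = 64*(3*3*3+1) + 14*64*(3*3*64+1)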
4. (20 points) The following figure shows a simple single-layer, single-output network. In this model, x, W ∈ Rⁿ and a, z ∈ R; x is an input vector, W is a hidden weight vector, a = Wᵀx, and z = f(a) for some activation function f. For this problem, we will use the logistic (sigmoid) function and the mean squared error loss, so:
f(a) = z = 1 / (1 + e⁻ᵃ), L(z) = (1/2)(y − z)²
where L is the stochastic loss over a single input pair (x, y).
(a) Given the derivative f′ = f(1 − f), derive the simplest expression for ∂L/∂W and ∂L/∂x in terms of x, y, z, and/or W.
By the chain rule,
∂L/∂W = (∂L/∂z)(∂z/∂a)(∂a/∂W) = −(y − z)z(1 − z)x
Similarly,
∂L/∂x = (∂L/∂z)(∂z/∂a)(∂a/∂x) = −z(1 − z)(y − z)W
Now let us widen the network: in the following model, the input x ∈ Rⁿ, the outputs a, z ∈ Rᵐ, and the weight matrix W ∈ Rᵐˣⁿ. z = f(a), where f is the sigmoid function applied elementwise. The loss is still a scalar:
L = ∑ᵢ (1/2)(yᵢ − zᵢ)², where the sum runs over i = 1, …, m.
(b) Use your previous result in (a) to derive the simplest expression for ∂L/∂Wᵢⱼ, by considering the two cases ∂((1/2)(yₖ − zₖ)²)/∂Wᵢⱼ where k = i and k ≠ i.
∂L/∂Wᵢⱼ = −zᵢ(1 − zᵢ)(yᵢ − zᵢ)xⱼ.
Since a = Wx, the value aₖ is equal to Wₖx, where Wₖ is the kth row of W. We consider the two cases ∂((1/2)(yₖ − zₖ)²)/∂Wᵢⱼ where k = i and k ≠ i. The first case reduces to our previous answer, −zᵢ(1 − zᵢ)(yᵢ − zᵢ)xⱼ. The second case is zero, since we established that aₖ, and therefore zₖ, is calculated using only values in the kth row of W (which does not include Wᵢⱼ when k ≠ i). The derivative of a sum is the sum of the derivatives, so the derivative of the loss is equal to our first case plus zero: −zᵢ(1 − zᵢ)(yᵢ − zᵢ)xⱼ.
(c) Using your result in (b), show that the gradient of the loss ∂L/∂W can be written as the matrix product of two vectors. Hint: by the definition of matrix multiplication, M = AB iff Mᵢₖ = ∑ⱼ AᵢⱼBⱼₖ. We can set j = 1, in which case for any two vectors a and b, M = abᵀ iff Mᵢⱼ = aᵢbⱼ. What are a and b? You can use u ⊙ v to indicate elementwise multiplication between two vectors u and v.
Let a = −z ⊙ (1 − z) ⊙ (y − z) and let b = x, where ⊙ represents elementwise multiplication. From part (b) we have ∂L/∂Wᵢⱼ = aᵢbⱼ, so it follows that the full matrix ∂L/∂W = abᵀ.
(d) Derive the simplest expression for ∂L/∂x. You can use u ⊙ v to indicate elementwise multiplication between two vectors u and v.
∂L/∂x = −Wᵀ(z ⊙ (1 − z) ⊙ (y − z))
where ⊙ represents elementwise multiplication.
Let Lᵢ = (1/2)(yᵢ − zᵢ)². From (a), we know that
∂Lᵢ/∂x = −zᵢ(1 − zᵢ)(yᵢ − zᵢ)Wᵢᵀ
where Wᵢ is the ith row of W. We transpose it because this gradient is a column vector. Since L = ∑ᵢ Lᵢ and the derivative of a sum is the sum of the derivatives,
∂L/∂x = ∑ᵢ −zᵢ(1 − zᵢ)(yᵢ − zᵢ)Wᵢᵀ = ∑ᵢ −Wᵢᵀ(zᵢ(1 − zᵢ)(yᵢ − zᵢ))
since zᵢ(1 − zᵢ)(yᵢ − zᵢ) is a scalar. By the definition of matrix multiplication, this is equivalent to −Wᵀ(z ⊙ (1 − z) ⊙ (y − z)).
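
As a sanity check, the closed-form gradients from parts (c) and (d) can be compared against autograd. This is a minimal sketch with arbitrary example dimensions (n = 4, m = 3); it is not part of the original exam.

import torch

torch.manual_seed(0)
n, m = 4, 3
x = torch.randn(n, requires_grad=True)
W = torch.randn(m, n, requires_grad=True)
y = torch.randn(m)

z = torch.sigmoid(W @ x)                  # z = f(a) with a = Wx
L = 0.5 * ((y - z) ** 2).sum()            # L = sum_i (1/2)(y_i - z_i)^2
L.backward()

a_vec = -z * (1 - z) * (y - z)            # the vector a from part (c); b = x
print(torch.allclose(W.grad, torch.outer(a_vec, x)))             # dL/dW = a b^T
print(torch.allclose(x.grad, -W.T @ (z * (1 - z) * (y - z))))    # dL/dx from part (d)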