
COMP 4471 & ELEC 4240                         Name: ____________________

Spring 2020

Midterm

14 April 2020

Time Limit: 80 Minutes                        Student ID: ____________________

This exam contains 9 pages (including this cover page) and 4 questions.

The total number of points is 100.

You can use a single-sided cheat sheet for the midterm.

Grade Table (for teacher use only)

Question Points Score

1 40

2 20

3 20

4 20

Total: 100

• Prepare 3 sheets of white paper, or print the PDF and write on the printed midterm. You can also use a tablet to answer the questions if you want.

• Prepare a black pen and a smartphone to capture images

• Sign the honor code. No communication among students

• Turn on your video camera (so we can verify it is you)

• Open-book exam. You are free to browse materials (including the hyperlinks to external websites) on the course website

• No Google. No external websites during the exam, but you can download external materials in advance

• Exam time: 12:00 pm – 1:20 pm. Please join Zoom (https://hkust.zoom.us/j/794864322) at 11:50 am.

• A PDF of the midterm will be shared at 12:00 pm in the Zoom chatroom and via email announcement

• Write your name, student ID, and answers on your white papers

• Take clear images of your answers

COMP 4471 & ELEC 4240 Midterm - Page 2 of 9 14 April 2020

• Send an email to comp4471.hkust@gmail.com with images of your answers by 1:35 pm, or upload them to Canvas by 1:35 pm

• If there are any technical issues uploading images, you should send an email to comp4471.hkust@gmail.com with only the final answer, in text, to each question by 1:50 pm. Only choices and final numbers/equations are needed in the email as a record. Then you can send your images by 3 pm

• We will check all the submissions by 2 pm to ensure every student has submitted their midterm correctly

• Ask questions during the midterm via the chatroom on Zoom

• TAs and the instructor will be on Zoom 11:45 am - 2 pm

Honor Code

Honesty and integrity are central to the academic work of HKUST. Students of the University must observe and uphold the highest standards of academic integrity and honesty in all the work they do throughout their program of study.

As members of the University community, students have the responsibility to help maintain the academic reputation of HKUST in its academic endeavors.

Sanctions will be imposed on students if they are found to have violated the regulations governing academic integrity and honesty.

Please write "I have read and understood the honor code" on your white paper after your name and student ID.


1. (40 points) Short questions. Please choose the correct choice(s) for each question. There may be more than one correct choice.

1. Why are deep learning models preferable to classical machine learning methods in image classification? (AlexNet showed a significant improvement in the ImageNet challenge in 2012.)

(a) It is faster to train a deep learning model

(b) Features are hand-crafted in classical machine learning methods, but are learned automatically in deep learning

(c) Deep learning models can be trained on a big dataset but classical machine learning methods cannot

(d) Deep learning models can be trained on a GPU but classical machine learning methods cannot

2. The Elastic Net regularizer:

(a) Has L1-Norm

(b) Has L2-Norm

(c) Prevents over-fitting

(d) Is differentiable everywhere

3. Which of the following are true about second-order optimization?

(a) It is theoretically more optimal than first-order optimization (we will not penalize choosing this choice)

(b) It is often used in practice for training deep learning models

(c) It is computationally expensive

(d) Adam is a second-order optimization method

4. Consider a four-layer convolutional network with only 3 × 3 dilated convolutions.¹ In the first layer, the dilation rate is 1. In the second layer, the dilation rate is 2. In the third layer, the dilation rate is 4. In the fourth layer, the dilation rate is 1. What is the receptive field of each neuron in the activation map right after the fourth layer?

(a) 17

(b) 15

(c) 13

(d) 11
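The receptive field of stacked dilated convolutions can be checked with a short sketch: with stride 1, each k × k layer with dilation d enlarges the receptive field by (k − 1) × d. The function below is our own helper, not part of the exam.

```python
# Receptive field of stacked dilated convolutions, assuming stride 1.
# Each k x k layer with dilation d enlarges the receptive field by (k - 1) * d.
def receptive_field(kernel=3, dilations=(1, 2, 4, 1)):
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

print(receptive_field())  # 1 + 2*1 + 2*2 + 2*4 + 2*1 = 17
```

This matches choice (a).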

5. Which are true about dynamic and static computational graphs in TensorFlow and PyTorch?

(a) We can build the computational graph first and execute computation later in a static computation graph.

¹ In case you are not familiar with dilated convolutions, you can find the details at https://cs231n.github.io/convolutional-networks/ (search "dilated")


(b) We can build the computational graph first and execute computation later in a dynamic computation graph.

(c) We can build the computational graph and execute computation simultaneously in a static computation graph.

(d) We can build the computational graph and execute computation simultaneously in a dynamic computation graph.

6. Which of the following are true for the multiclass SVM loss (assume that we compute the loss on a single image)? Select all that apply: No correct answer

(a) It is positive if and only if the correct-class score is not the highest among all the scores

(b) It is positive if and only if the correct-class score is the highest among all the scores

(c) It is positive if and only if the correct-class score is higher than the second-largest score (among all the scores) by a certain margin

(d) It can be negative
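To see why none of the choices hold, here is a minimal numpy sketch of the multiclass SVM loss on a single image; the helper name and the margin value are our own choices.

```python
import numpy as np

def multiclass_svm_loss(scores, y, margin=1.0):
    """Multiclass SVM (hinge) loss on a single image.

    Each incorrect class j contributes max(0, s_j - s_y + margin).
    """
    margins = np.maximum(0.0, scores - scores[y] + margin)
    margins[y] = 0.0  # the correct class contributes nothing
    return margins.sum()

# A sum of max(0, .) terms can never be negative, and the loss can be
# positive even when the correct class scores highest, if the margin
# is not met:
print(multiclass_svm_loss(np.array([3.0, 2.5, 1.0]), 0))  # 0.5
```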

7. Which of the following are true of convolutional neural networks for image analysis:

(a) Filters in earlier layers can be replaced by classical edge detectors

(b) Pooling layers reduce the spatial resolution of the image

(c) They have more parameters than fully connected networks with the same number of layers and the same numbers of neurons in each layer

(d) A convolutional neural network can be trained for unsupervised learning tasks, whereas an ordinary neural net cannot

8. Which layer may have the largest number of trainable parameters? You can assume the input to this layer has an unknown dimension N × M × D where N, M, D ≥ 1.

(a) A convolutional layer with 10 3×3 filters (with biases)

(b) A convolutional layer with 4 5×5 filters (with biases)

(c) A 2× 2 max-pooling layer

(d) A fully connected layer that maps the input to 10 scores

9. Recurrent neural networks (RNNs) are often applied to sequential data, since:

(a) The training time required is shorter than that of CNNs.

(b) RNNs can theoretically handle infinite-length sequences.

(c) RNN models are less likely to suffer from vanishing gradients.

(d) RNNs can be used for generation tasks, which is impossible for CNNs.

10. If your test error is poor while your training error is good, which of the following may improve your test error? Select all that apply:

(a) Apply cross-validation and choose better hyperparameters.

(b) Add data augmentation

(c) Add dropout layers in the model.

(d) Apply early stopping.


2. (20 points) Short questions. Only the final answers are needed for each question.

1. Consider a simple model where z = ReLU(x) × ReLU(y) + ReLU(x) + ReLU(y), x = −1, and y = 1. What are ∂z/∂x and ∂z/∂y?

∂z/∂x = 0, ∂z/∂y = 1
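A quick finite-difference check of this answer; both inputs are away from the ReLU kink at 0, so the numerical gradient is well defined.

```python
def relu(v):
    return max(0.0, v)

def z(x, y):
    return relu(x) * relu(y) + relu(x) + relu(y)

# Central differences at (x, y) = (-1, 1)
h = 1e-5
dz_dx = (z(-1 + h, 1) - z(-1 - h, 1)) / (2 * h)
dz_dy = (z(-1, 1 + h) - z(-1, 1 - h)) / (2 * h)
print(round(dz_dx, 6), round(dz_dy, 6))  # 0.0 1.0
```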

2. Suppose a loss function L = (Ax)^T (By) where x, y are column vectors, A, B are square matrices, and L is a scalar. What are ∂L/∂x and ∂L/∂y?

In practice, the derivative should have the same dimensions as the variable (i.e. a column vector). However, we accept row-vector answers as long as the dimensions are consistent.

∂L/∂x = (By)^T A, ∂L/∂y = (Ax)^T B

or

∂L/∂x = ((By)^T A)^T, ∂L/∂y = ((Ax)^T B)^T
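The column-vector forms ((By)^T A)^T = A^T By and ((Ax)^T B)^T = B^T Ax can be verified numerically; a sketch on random data, where the sizes and seed are arbitrary choices for the check:

```python
import numpy as np

# Random data; sizes and seed are arbitrary choices for the check.
rng = np.random.default_rng(0)
n = 4
A, B = rng.normal(size=(n, n)), rng.normal(size=(n, n))
x, y = rng.normal(size=n), rng.normal(size=n)

def loss(x_, y_):
    return (A @ x_) @ (B @ y_)  # scalar (Ax)^T (By)

# Analytic column-vector gradients from the answer above
dL_dx = A.T @ (B @ y)  # = ((By)^T A)^T
dL_dy = B.T @ (A @ x)  # = ((Ax)^T B)^T

def num_grad(f, v, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(v)
    for i in range(v.size):
        e = np.zeros_like(v)
        e[i] = h
        g[i] = (f(v + e) - f(v - e)) / (2 * h)
    return g

assert np.allclose(dL_dx, num_grad(lambda v: loss(v, y), x), atol=1e-6)
assert np.allclose(dL_dy, num_grad(lambda v: loss(x, v), y), atol=1e-6)
```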

3. True or False. In a neural network, an activation function must be differentiable everywhere so that back-propagation can be performed. If true, please explain why; if false, please provide a counterexample.

False. ReLU is not differentiable everywhere (it is not differentiable at 0), yet networks using ReLU are routinely trained with back-propagation.

4. For AdaGrad (see Slide 27 of Lecture 7), there is a step x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7). What is the role of 1e-7?

The 1e-7 term prevents division by zero when the accumulated squared gradient is still zero.
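A minimal sketch of the AdaGrad step (the function name is ours) shows the role of the epsilon: without it, a first step taken before any squared gradient has accumulated would divide by zero.

```python
import numpy as np

def adagrad_step(x, dx, grad_squared, learning_rate=0.1, eps=1e-7):
    """One AdaGrad update; eps keeps the denominator positive."""
    grad_squared = grad_squared + dx * dx
    x = x - learning_rate * dx / (np.sqrt(grad_squared) + eps)
    return x, grad_squared

x, g2 = np.array([1.0]), np.array([0.0])
x, g2 = adagrad_step(x, np.array([0.5]), g2)
print(x)  # ~[0.9]: a step of learning_rate * 0.5 / (0.5 + 1e-7)
```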


3. (20 points) Consider a convolutional network that takes a 32 × 32 × 3 RGB image as input. It has 15 convolutional layers followed by a 2 × 2 average pooling. Each convolutional layer has 64 3 × 3 convolutional filters with biases and no padding. ReLU is used after each convolutional layer.

(a) (4 points) What is the size of the network output?

Each unpadded 3 × 3 convolution shrinks the spatial size by 2, so 15 layers take 32 × 32 down to 2 × 2; the 2 × 2 average pooling then yields a vector of size 64, or 1 × 1 × 64.

(b) (4 points) Determine the number of parameters in the first convolutional layer. (Biases are used)

64 × (3 × 3 × 3 + 1)

(c) (4 points) Determine the number of parameters in the average pooling layer.

0

(d) (4 points) Determine the number of parameters in this network. (Biases are used)

64 × (3 × 3 × 3 + 1) + 14 × 64 × (3 × 3 × 64 + 1)

(e) (4 points) Now suppose we add one more fully-connected layer at the end of the network so that it outputs 100 scores. What is the dimension of this fully connected layer? (Biases are used)

The weight of the FC layer has dimension 64 × 100 and the bias is a vector of size 100.
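The counts above can be tallied with a short sketch; the helper name is our own.

```python
# Parameter counts for Question 3: 3x3 convolutions with one bias per filter.
def conv_params(n_filters, k, in_channels):
    return n_filters * (k * k * in_channels + 1)  # +1 bias per filter

first = conv_params(64, 3, 3)        # part (b): 64 x (3*3*3 + 1)
rest = 14 * conv_params(64, 3, 64)   # 14 further layers on 64 channels
fc = 64 * 100 + 100                  # part (e): weights plus biases
print(first, first + rest, fc)       # 1792 518784 6500
```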


4. (20 points) The following figure shows a simple single-layer, single-output network. In this model, x, W ∈ Rⁿ and a, z ∈ R; x is an input vector, W is a hidden weight vector, a = W^T x, and z = f(a) for some activation function f. For this problem, we will use the logistic transition function and mean squared error loss, so:

f(a) = z = 1 / (1 + e^(−a))

L(z) = (1/2)(y − z)²

where L is the stochastic loss over a single input pair (x, y).

(a) Given the derivative f′ = f(1 − f), derive the simplest expression for ∂L/∂W and ∂L/∂x in terms of x, y, z, and/or W.

By the chain rule,

∂L/∂W = (∂L/∂z)(∂z/∂a)(∂a/∂W) = −(y − z) z (1 − z) x

Similarly,

∂L/∂x = (∂L/∂z)(∂z/∂a)(∂a/∂x) = −(y − z) z (1 − z) W
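Both gradients can be verified with a finite-difference check on random data; the sizes, seed, and target value below are our own choices for the check.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
x, W = rng.normal(size=n), rng.normal(size=n)
y = 1.0

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def loss(W_, x_):
    return 0.5 * (y - sigmoid(W_ @ x_)) ** 2

z = sigmoid(W @ x)
dL_dW = -(y - z) * z * (1 - z) * x  # analytic gradient from above
dL_dx = -(y - z) * z * (1 - z) * W

# Central-difference checks, one coordinate at a time
h = 1e-6
num_W = np.array([(loss(W + h * np.eye(n)[i], x) - loss(W - h * np.eye(n)[i], x)) / (2 * h)
                  for i in range(n)])
num_x = np.array([(loss(W, x + h * np.eye(n)[i]) - loss(W, x - h * np.eye(n)[i])) / (2 * h)
                  for i in range(n)])
assert np.allclose(dL_dW, num_W, atol=1e-6)
assert np.allclose(dL_dx, num_x, atol=1e-6)
```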


Now let us widen the network: in the following model, the input x ∈ Rⁿ, the outputs a, z ∈ Rᵐ, and the weight matrix W ∈ Rᵐˣⁿ. z = f(a), where f is the sigmoid function applied elementwise. The loss is still a scalar:

L = Σᵢ₌₁ᵐ (1/2)(y_i − z_i)²

(b) Use your previous result in (a) to derive the simplest expression for ∂L/∂W_{i,j}, by considering the two cases ∂((1/2)(y_k − z_k)²)/∂W_{i,j} where k = i and k ≠ i.

∂L/∂W_{i,j} = −z_i(1 − z_i)(y_i − z_i) x_j.

Since a = Wx, the value a_k is equal to W_k x, where W_k is the kth row of W. We consider the two cases ∂((1/2)(y_k − z_k)²)/∂W_{i,j} where k = i and k ≠ i. The first case reduces to our previous answer, −z_i(1 − z_i)(y_i − z_i)x_j. The second case is zero, since a_k, and therefore z_k, is calculated using only values in the kth row of W, which does not contain W_{i,j} when k ≠ i. The derivative of a sum is the sum of the derivatives, so the derivative of the loss is equal to our first case plus zero: −z_i(1 − z_i)(y_i − z_i)x_j.

(c) Using your result in (b), show that the gradient of the loss ∂L/∂W can be written as the matrix product of two vectors. Hint: By the definition of matrix multiplication, M = AB iff M_{ik} = Σ_j A_{ij}B_{jk}. We can set j = 1, in which case for any two vectors a and b, M = ab^T iff M_{ij} = a_i b_j. What are a and b? You can use u ⊙ v to indicate elementwise multiplication between two vectors u and v.

Let a = −z ⊙ (1 − z) ⊙ (y − z) and let b = x, where ⊙ represents elementwise multiplication. From part (b) we have ∂L/∂W_{i,j} = a_i b_j, so it follows that the full matrix ∂L/∂W = ab^T.

(d) Derive the simplest expression for ∂L/∂x. You can use u ⊙ v to indicate elementwise multiplication between two vectors u and v.

∂L/∂x = −W^T (z ⊙ (1 − z) ⊙ (y − z))

where ⊙ represents elementwise multiplication.

Let L_i = (1/2)(y_i − z_i)². From (a), we know that

∂L_i/∂x = −z_i(1 − z_i)(y_i − z_i) W_i^T

where W_i is the ith row of W. We transpose it because this gradient is a column vector. Since L = Σ_i L_i and the derivative of a sum is the sum of the derivatives,

∂L/∂x = Σ_i −z_i(1 − z_i)(y_i − z_i) W_i^T = Σ_i −W_i^T (z_i(1 − z_i)(y_i − z_i))

since z_i(1 − z_i)(y_i − z_i) is a scalar. By the definition of matrix multiplication, this is equivalent to −W^T (z ⊙ (1 − z) ⊙ (y − z)).
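Putting (b), (c), and (d) together, a short numerical sketch (sizes and seed are our own choices) confirms both the outer-product form of ∂L/∂W and the input gradient ∂L/∂x:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 4
W = rng.normal(size=(m, n))
x, y = rng.normal(size=n), rng.normal(size=m)

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
z = sigmoid(W @ x)

a_vec = -z * (1 - z) * (y - z)  # the vector a from part (c)
dL_dW = np.outer(a_vec, x)      # ab^T with b = x
dL_dx = W.T @ a_vec             # -W^T (z * (1 - z) * (y - z))

def loss(W_, x_):
    return 0.5 * np.sum((y - sigmoid(W_ @ x_)) ** 2)

# Central-difference check of both gradients
h = 1e-6
num_x = np.array([(loss(W, x + h * np.eye(n)[j]) - loss(W, x - h * np.eye(n)[j])) / (2 * h)
                  for j in range(n)])
assert np.allclose(dL_dx, num_x, atol=1e-6)

E = np.zeros((m, n))
num_W = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        E[i, j] = h
        num_W[i, j] = (loss(W + E, x) - loss(W - E, x)) / (2 * h)
        E[i, j] = 0.0
assert np.allclose(dL_dW, num_W, atol=1e-6)
```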
