Office Use Only
Semester One 2019
Examination Period
Faculty of Information Technology
EXAM CODES: FIT3181
TITLE OF PAPER: Final Examination Paper for Unit FIT3181 Deep Learning
EXAM DURATION: 2 hours writing time
READING TIME: 10 minutes
THIS PAPER IS FOR STUDENTS STUDYING AT: (tick where applicable)
□ Caulfield □ Clayton □ Parkville □ Peninsula
□ Monash Extension □ Off Campus Learning □ Malaysia □ Sth Africa
□ Other (specify)
During an exam, you must not have in your possession any item/material that has not been authorised for
your exam. This includes books, notes, paper, electronic device/s, mobile phone, smart watch/device,
calculator, pencil case, or writing on any part of your body. Any authorised items are listed below.
Items/materials on your desk, chair, in your clothing or otherwise on your person will be deemed to be in
your possession.
No examination materials are to be removed from the room. This includes retaining, copying, memorising
or noting down content of exam material for personal use or to share with any other person by any means
following your exam.
Failure to comply with the above instructions, or attempting to cheat or cheating in an exam is a discipline
offence under Part 7 of the Monash University (Council) Regulations, or a breach of instructions under Part
3 of the Monash University (Academic Board) Regulations.
AUTHORISED MATERIALS
OPEN BOOK □ YES □ NO
CALCULATORS □ YES □ NO
SPECIFICALLY PERMITTED ITEMS □ YES □ NO
If yes, items permitted are:
Candidates must complete this section if required to write answers within this paper
STUDENT ID: __ __ __ __ __ __ __ __ DESK NUMBER: __ __ __ __ __
PART A: Multiple-Choice Questions
● This part contains 8 multiple-choice questions
● The total number of marks for this part is 15
● For each multiple-choice question, you must select all applicable answers to receive full
marks for that question.
Question A1 [1 mark]
In a CNN, unlike the convolutional layer, the pooling layer has no learnable parameters.
a) True
b) False
Solution: (a)
Question A2 [3 marks]
Consider a machine learning problem to detect breast cancer from a dataset consisting of
mammograms and medical data collected from a cohort of patients. The task is to predict whether
a patient has breast cancer. The following table summarizes the confusion matrix on the test
dataset. Select all applicable answers below:
                            True Labels
                       CANCER (1)   NORMAL (0)
Predicted  CANCER (1)      9            10
Class      NORMAL (0)      1            90
a) The total number of instances is 110 with 10 labelled as CANCER and 100 labelled as
NORMAL
b) The true positive rate (TPR) is 9/10 = 90%
c) This test dataset is a balanced dataset
d) It is not possible to calculate the AUC because we don't have the ROC curve, which requires
performance information at different threshold levels
Solution: (a), (b), (d)
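Note: as a quick check, these quantities can be computed from the table in plain Python (a minimal
sketch; the counts are read off the confusion matrix above):

tp, fp = 9, 10                 # predicted CANCER: 9 truly CANCER, 10 truly NORMAL
fn, tn = 1, 90                 # predicted NORMAL: 1 truly CANCER, 90 truly NORMAL
total = tp + fp + fn + tn      # 110 instances in total
pos, neg = tp + fn, fp + tn    # 10 labelled CANCER, 100 labelled NORMAL
tpr = tp / pos                 # 9/10 = 0.9, i.e. 90%
print(total, pos, neg, tpr)    # 110 10 100 0.9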
Question A3 [2 marks]
When training a deep neural network (DNN), which of the following statements are applicable?
a) One can use a Stochastic Gradient method known as the Back Propagation algorithm to
optimize the loss function
b) The Back Propagation method is robust against overfitting, hence likely to always produce a
good global optimal solution
c) The learning rate is an important parameter when training a DNN
d) With TensorFlow, a simple way to detect the problem of gradient vanishing is to draw the
histograms of the gradients and visually inspect them in TensorBoard
Solution: (a), (c), (d)
Question A4 [2 marks]
Factors which have driven recent success in deep learning include:
a) Deep learning models are generally very flexible and powerful
b) Modern advancements in hardware have enabled fast distributed and parallel computation
c) The recent release of iPhone X has enabled centralized computation on a single device
d) The availability of massive scale modern datasets has enabled computer scientists to train
powerful machine learning models
Solution: (a), (b), (d)
Question A5 [2 marks]
Which of the following statements are true regarding the Gradient Descent (GD) method when
applied to minimize the objective function J(w):
a) With an appropriate learning rate, GD is guaranteed to converge to a global minimum if J(w) is a
convex function
b) With an appropriate learning rate, GD is always guaranteed to converge to some local minimum
if J(w) is a nonconvex function
c) GD updates the parameter in the opposite direction of the current gradient
d) GD is a second-order optimization method
e) GD runs much faster than Stochastic GD for large datasets
Solution: (a), (c)
Question A6 [2 marks]
With respect to the Pooling layer in a CNN, which of the following statements are true?
a) It operates on a combination of multiple activation maps to produce a dependent output
b) It reduces the resolution of the image, hence provides better computational efficiency
c) Max-pooling is locally invariant in the sense that input numbers within a local filter window
can be shuffled without changing the final output
d) Output tensor of a max-pooling layer will always have the same depth as the input tensor
Solution: (b), (c), (d)
Question A7 [1 mark]
Consider a text modelling problem where a corpus of texts is given and contains the two words ‘king’
and ‘queen’. If these two words are represented as one-hot encoded vectors, what is the
possible value of their cosine similarity?
a) 0
b) 0.25
c) 0.5
d) 1.0
e) None of above
Solution: (a)
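Note: distinct one-hot vectors are always orthogonal, so their cosine similarity is 0. A minimal
NumPy check (the vocabulary positions used for ‘king’ and ‘queen’ are hypothetical):

import numpy as np
king  = np.array([1., 0., 0., 0., 0.])   # hypothetical one-hot position for 'king'
queen = np.array([0., 1., 0., 0., 0.])   # hypothetical one-hot position for 'queen'
cos = king @ queen / (np.linalg.norm(king) * np.linalg.norm(queen))
print(cos)                               # 0.0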
Question A8 [2 marks]
You are given a corpus of texts which is assumed to be sufficiently large to learn the semantic
meaning of words. After applying word2vec embedding on this corpus, each word has been
associated with a vector. As you have learned from the lecture, this now allows you to perform
analogical reasoning. Which answer do you expect if we reason “mom – dad = ? – man”?
a) child
b) girl
c) mother
d) woman
Solution: (d)
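Note: with a trained word2vec model this analogy can be queried directly via vector arithmetic
(mom − dad + man ≈ woman). A minimal gensim sketch, where the model file name is hypothetical:

from gensim.models import Word2Vec
model = Word2Vec.load("corpus_w2v.model")   # hypothetical model trained on the corpus
# analogical reasoning: mom - dad + man -> expected to be close to 'woman'
print(model.wv.most_similar(positive=["mom", "man"], negative=["dad"], topn=1))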
PART B: Short Workout Questions
● This part contains 6 workout questions
● The total number of marks for this part is 35
Question B1 [3 marks]
Draw the computational graph for the function f(x, y, z) = 2x + 2y + z
Solution:
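Note: the expected answer is a drawing of the graph’s nodes and edges. For reference, a minimal
TensorFlow sketch that builds the same computational graph, assuming the function
f(x, y, z) = 2x + 2y + z as reconstructed above:

import tensorflow as tf
x = tf.placeholder(tf.float32, name="x")
y = tf.placeholder(tf.float32, name="y")
z = tf.placeholder(tf.float32, name="z")
f = tf.add_n([2 * x, 2 * y, z], name="f")   # graph nodes: two scalings feeding a sum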
Question B2 [5 marks]
Generative Adversarial Networks (GAN) is a deep generative model that uses a generator G_θ
to simulate data by mapping random noise z ~ P_z through G_θ(z). Given a training dataset
D = {x_i, i = 1, ..., N} whose empirical distribution is denoted by P_data, GAN further uses a
discriminator D_ψ(x) to distinguish if a data point x is real or fake; hence training a GAN is a form
of minimax optimization. Both G_θ and D_ψ are deep NNs in the formulation of GAN.
Write down the minimax objective function of GAN with respect to θ and ψ.
Solution:
Students should be able to reason that the generator G tries to make its simulated examples look
as if they come from the data distribution P_data, while the discriminator tries to compete by
telling real examples apart from simulated ones:

min_θ max_ψ J(θ, ψ) = E_{x ~ P_data}[ln D_ψ(x)] + E_{z ~ P_z}[ln(1 − D_ψ(G_θ(z)))]
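Note: a minimal NumPy sketch of how this objective is estimated on a mini-batch (the discriminator
outputs below are hypothetical values, for illustration only):

import numpy as np
d_real = np.array([0.9, 0.8, 0.7])   # hypothetical D_psi(x) for real x ~ P_data
d_fake = np.array([0.2, 0.1, 0.3])   # hypothetical D_psi(G_theta(z)) for z ~ P_z
# Monte-Carlo estimate of J(theta, psi); psi ascends this value while theta descends it
J = np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))
print(J)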
Question B3 [5 marks]
Consider a convolutional layer applied to an RGB image whose width and height are both 32,
where 100 filters of size 5x5x3 are used.
a) What is the depth of the convolutional layer?
b) What is the dimension of the output layer if stride size = 3 and zero padding size = 1?
Solution:
a) Depth: 100
b) Width = height = (32 + 2 * 1 – 5) // 3 + 1 = 10, so the dimension of the output layer is
10x10x100.
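Note: the arithmetic can be checked with the standard output-size formula (W + 2P − F) // S + 1;
a minimal sketch in plain Python:

width, filt, stride, pad, n_filters = 32, 5, 3, 1, 100
out = (width + 2 * pad - filt) // stride + 1   # (32 + 2 - 5) // 3 + 1 = 10
print(out, out, n_filters)                     # 10 10 100 -> output volume 10x10x100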
Question B4 [5 marks]
Given two discrete distributions of the same size, p = [p_1, ..., p_M] and q = [q_1, ..., q_M]:
a) What are the values of Σ_{i=1}^{M} p_i and Σ_{i=1}^{M} q_i?
b) Write down the expression for the cross-entropy CE(p, q)
Solution:
a) 1, 1
b) CE(p, q) = − Σ_{i=1}^{M} p_i ln(q_i)
Question B5 [6 marks]
Consider the following Deep NN for classification that we have encountered in class, where
the softmax function is used for the classification task.
a) If the last output layer h^(out) has the value [ln 3, ln 4], what would be the value of
the output vector y?
b) This network has an input layer, an output layer and three hidden layers. What is the
total number of learnable parameters if back propagation is used?
Solution:
a) y = [3/(3+4), 4/(3+4)] = [0.429, 0.571]
b) The total number of learnable parameters:
(5 x 7 + 7) + (7 x 5 + 5) + (5 x 4 + 4) + (4 x 2 + 2) = 116
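Note: both results can be verified with a short NumPy sketch (the layer sizes 5, 7, 5, 4 and 2 are
read off the parameter count above):

import numpy as np
h_out = np.array([np.log(3.), np.log(4.)])
y = np.exp(h_out) / np.exp(h_out).sum()        # softmax
print(y)                                       # [0.4286 0.5714], i.e. [3/7, 4/7]
sizes = [5, 7, 5, 4, 2]                        # layer widths, input to output
n_params = sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))   # weights + biases
print(n_params)                                # 116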
Question B6 [11 marks]
Consider a simple two time-slice RNN (without output) below.
a) Write down the mathematical expressions for h_0 and h_1, assuming the tanh() activation
function is used.
b) Using the static_rnn() function in TensorFlow, we can declare this network, where n_inputs
and n_neurons are the dimensions of the input vector and hidden vector respectively, given
in advance.
Describe the key shortcomings of using the static_rnn() function in TensorFlow to build
RNNs in general.
c) Assuming that the above RNN network structure is extended to 100 time-slices, write your
TensorFlow code using the dynamic_rnn() function to declare this new RNN network.
Solution:
a) h_0 = tanh(W_x x_0 + b) and h_1 = tanh(W_x x_1 + W_h h_0 + b)
b) Shortcomings of static_rnn:
- Takes a list of tensors of shape [batch_size, input_size] as inputs => must pass the
list of variables at each time step
- Creates an unrolled graph of fixed length, so graph creation is slow.
- Inefficient when dealing with inputs of variable sequence length.
- Unable to deal with sequences longer than the fixed length of the graph created.
c) Code to use dynamic_rnn:
inputs = tf.placeholder(tf.float32, [None, None, n_inputs])
basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, inputs, dtype=tf.float32)
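Note: unlike static_rnn(), dynamic_rnn() can also be told the true length of each sequence, which
handles variable-length inputs; a minimal continuation of the code above:

seq_length = tf.placeholder(tf.int32, [None])   # true length of each input sequence
outputs, states = tf.nn.dynamic_rnn(basic_cell, inputs, dtype=tf.float32,
                                    sequence_length=seq_length)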
PART C: Short Answer Questions
● This part contains 5 short-answer questions
● The total number of marks for this part is 50
Question C1 [5 marks]
What is the early stopping strategy to address overfitting in training deep NNs? Draw a plot
(training epoch versus performance) to illustrate the behaviour of this strategy with respect to
the validation set.
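Note: a minimal pseudo-Python sketch of the strategy (train_one_epoch(), evaluate() and
save_checkpoint() are hypothetical helpers):

best_val, patience, wait = float("inf"), 5, 0
for epoch in range(max_epochs):
    train_one_epoch(model)                # hypothetical: one pass over the training set
    val_loss = evaluate(model, val_set)   # hypothetical: loss on the validation set
    if val_loss < best_val:
        best_val, wait = val_loss, 0
        save_checkpoint(model)            # hypothetical: keep the best model so far
    else:
        wait += 1
        if wait >= patience:              # validation performance stopped improving
            break                         # stop early and restore the saved checkpoint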
Question C2 [5 marks]
Describe the key advantages of using a CNN architecture over a fully connected DNN for image
classification tasks.
Solution:
- CNN has many fewer parameters due to weight sharing, hence it is faster to train, reduces the
risk of overfitting and requires less training data.
- Once a CNN has learned a kernel to detect a particular pattern, it can detect that pattern
anywhere in the image. In contrast, when a DNN learns a feature in one location, it can detect
it only in that particular location. Images have such repetitive features, hence CNNs are able
to generalize better than DNNs for image processing tasks with less data.
- A DNN has no knowledge of how pixels are organized; it does not know that nearby pixels are
similar. A CNN embeds this prior knowledge in its architecture: lower layers capture patterns
in small areas of the image, while higher layers combine them to capture larger-scale patterns.
Note: if a student misses the last point, s/he can still earn 80% of the mark for this question
Question C3 [12 marks]
Most supervised learning tasks in machine learning and deep learning reduce to the following
optimization form:

min_θ J(θ) = Ω(θ) + (1/N) Σ_{i=1}^{N} l(y_i, f(x_i, θ))

where D = {x_i, y_i}_{i=1}^{N} is the training dataset and f(x, θ) is the learning function.
a) What is the role of the first term Ω(θ) and what can it be used for? Give a popular choice
for Ω(θ) used in the practice of deep learning.
b) What is the role of the term l(y_i, f(x_i, θ))? And what are popular choices for this term
when applied to classification and regression problems using Deep NNs?
Solution:
a) It is the regularization term, used to penalize overly complex models. Popular choices are
the l1 and l2 norms.
b) It is the loss function measuring the loss incurred by the prediction in comparison
with the true outcome. Classification: cross-entropy; regression: squared error (l2).
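Note: a minimal NumPy sketch of this objective with an l2 regularizer (the parameter values and
per-example losses below are hypothetical):

import numpy as np
theta  = np.array([0.5, -1.0, 2.0])      # hypothetical model parameters
losses = np.array([0.2, 0.7, 0.1, 0.4])  # hypothetical l(y_i, f(x_i, theta)) values
lam = 0.01                               # regularization strength
omega = lam * np.sum(theta ** 2)         # Omega(theta): l2 regularizer
J = omega + losses.mean()                # J(theta) = Omega(theta) + average loss
print(J)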
Question C4 [20 marks]
Additional information on the Long Short-Term Memory (LSTM) cell architecture is provided
below for your reference.
a) List two main problems that LSTM can overcome but basic RNNs are inadequate to address
b) Which variables represent the long-term and short-term state respectively?
c) Write the mathematical expression for the forget gate and explain its role
d) Write the mathematical expression for the input gate and explain its role
e) Write the mathematical expressions for o_t, c_t and h_t
f) What are the peephole connections and how might they help to improve the basic LSTM
cell?
Solution:
a) Two problems are:
- Gradient vanishing: self-connection in LSTM helps maintain the gradient during
backpropagation
- Long-term dependency problem: LSTM is much better at solving the long-term dependency problem
- Weight input/output update conflict: Consider the weight connecting node i to node j:
at some time steps it might want to retain information in i by turning j on and keeping j from
being switched off in later steps; but at other time steps it might want j to ignore input
from i. Similarly, consider the weight connecting node j to node k: at some time steps it might
want to retrieve information from j, and at some other time steps it might want to prevent j
from perturbing k. LSTM solves this problem by introducing a context-sensitive gating
mechanism to protect nodes (or cells) from being disturbed by irrelevant inputs.
b) c_t (long-term state) and h_t (short-term state)
c) f_t = σ(W_f h_{t−1} + U_f x_t + b_f)
The forget gate helps the network to be selective about which long-term memory to retain
and allows irrelevant (or obsolete) information to decay. Technically, the forget gate keeps
c_t from growing too large in magnitude, which would cause gradient saturation for nonlinear
activation functions such as tanh.
d) i_t = σ(W_i h_{t−1} + U_i x_t + b_i). The input gate learns what relevant information to add
to the long-term memory c_t.
e) o_t = σ(W_o h_{t−1} + U_o x_t + b_o)
g_t = tanh(W_g h_{t−1} + U_g x_t + b_g)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)
f) Gates in the basic LSTM cell only use information from x_t and h_{t−1}, and don’t make use
of the long-term state. Peephole connections allow c_{t−1} as an extra input to the forget gate
and input gate, and the current c_t as an extra input to the output gate.
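Note: a minimal NumPy sketch of one LSTM step under the equations above (the weight values
and layer sizes are hypothetical):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

n_h, n_x = 3, 2                                      # hypothetical hidden/input sizes
rng = np.random.RandomState(0)
W = {k: 0.1 * rng.randn(n_h, n_h) for k in "fiog"}   # recurrent weights W_f, W_i, W_o, W_g
U = {k: 0.1 * rng.randn(n_h, n_x) for k in "fiog"}   # input weights U_f, U_i, U_o, U_g
b = {k: np.zeros(n_h) for k in "fiog"}               # biases
h_prev, c_prev, x = np.zeros(n_h), np.zeros(n_h), rng.randn(n_x)

f = sigmoid(W["f"] @ h_prev + U["f"] @ x + b["f"])   # forget gate
i = sigmoid(W["i"] @ h_prev + U["i"] @ x + b["i"])   # input gate
o = sigmoid(W["o"] @ h_prev + U["o"] @ x + b["o"])   # output gate
g = np.tanh(W["g"] @ h_prev + U["g"] @ x + b["g"])   # candidate memory
c = f * c_prev + i * g                               # long-term state c_t
h = o * np.tanh(c)                                   # short-term state h_t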
Question C5 [8 marks]
What is an autoencoder? List four main tasks that autoencoders can be used for. If an
autoencoder perfectly reconstructs its inputs, is it necessarily a good encoder, and why?
Solution:
- An autoencoder is a special type of neural network that learns an internal representation
of its inputs in order to reconstruct them; hence in an autoencoder the dimension of the
input must be the same as the dimension of the output
- Note: Students can still get full marks if the wording is different but the key points remain
the same.
- Main tasks autoencoder can be used for:
o Learning representation and feature extraction
o Unsupervised pretraining
o Dimensionality reduction
o Generative models to generate data
o Abnormality detection
- No; perfectly reconstructing its inputs does not necessarily make it a good autoencoder, as
this usually implies that no useful internal representation has been learned. It could simply
be an overcomplete autoencoder which copies its inputs to its outputs because the internal
architecture is too powerful. However, if an autoencoder produces bad reconstructions, then
it is not a good autoencoder.
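Note: a minimal TensorFlow sketch of an autoencoder (the layer sizes are hypothetical; the input
and output dimensions must be equal, as stated above):

import tensorflow as tf

n_inputs, n_hidden = 784, 64                                     # hypothetical sizes
X = tf.placeholder(tf.float32, [None, n_inputs])
codings = tf.layers.dense(X, n_hidden, activation=tf.nn.relu)    # encoder: internal representation
outputs = tf.layers.dense(codings, n_inputs)                     # decoder: same dim as the input
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))     # reconstruct the inputs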
END OF EXAM PAPER