PAPER CODE NO.: COMP219    EXAMINER: Xiaowei Huang    TEL. NO.: 07831378101
DEPARTMENT: Computer Science
FIRST SEMESTER EXAMINATIONS 2020/21
Advanced Artificial Intelligence
TIME ALLOWED: TWO Hours and Thirty Minutes
INSTRUCTIONS TO CANDIDATES
NAME OF CANDIDATE ..............................................    SEAT NO ................
USUAL SIGNATURE ...........................................................
READ THE FOLLOWING CAREFULLY:
1. Each of the following questions comprises 5 statements, from which you should select one appropriate answer by placing a tick in the appropriate box.
2. The exam mark is based on the overall number of correctly answered questions. The more questions you answer correctly, the higher your mark; incorrectly answered questions do not count against you.
3. Enter your name and examination number IN PENCIL on the computer answer sheet
according to the instructions on that sheet.
4. When you have completed this exam paper, read the instructions on the computer answer
sheet carefully and transfer your answers from the exam paper. Use an HB pencil to mark the computer answer sheet, and if you change your mind, be sure to erase the mark you have made. You may then mark the alternative answer.
5. At the end of the examination, be absolutely sure to hand in BOTH this exam paper AND the
computer answer sheet.
6. Calculators are permitted.
THIS PAPER MUST NOT BE REMOVED FROM THE EXAMINATION ROOM
Part 1: Basic Knowledge
1. Which learning task best suits the following description: given a set of training instances {(x^(1), y^(1)), ..., (x^(n), y^(n))} of an unknown target function f, where x^(i) is the feature vector and y^(i) is the label for i ∈ {1, ..., n}, it outputs a model h that best approximates f.
A. Unsupervised learning
B. Supervised learning
C. Reinforcement learning
D. Semi-supervised learning
E. none of the above
2. Which learning task best suits the following description: given a set of training instances {x^(1), ..., x^(n)}, where x^(i) is the feature vector for i ∈ {1, ..., n}, it outputs a model h that represents each x^(i) with a lower-dimensional feature vector while still preserving key properties of the data.
A. Unsupervised learning
B. Clustering analysis
C. Dimensionality reduction
D. Anomaly detection
E. Supervised Learning
Figure 1: Joint probability for student grade and intelligence
3. Compute the following probability according to the table in Figure 1
P(Intelligence = Low) =
A. 0.28
B. 0.7
C. 0.35
D. 0.07
E. 0.18
4. Compute the following conditional probability according to the table in Figure 1
P(Intelligence = Low | Grade = B) =
A. 0.28/0.37
B. 0.07/0.25
C. 0.35/0.7
D. 0.28/0.7
E. 0.09/0.37
Figure 2: Plot of the y = sin(θ) function
5. Which of the following statements is correct according to Figure 2?
A. max_θ sin(θ) = 0
B. max_θ sin(θ) ≤ 1
C. 0.5π is not in argmax_θ sin(θ)
D. π is in argmax_θ sin(θ)
E. max_θ sin(θ) > 1
Figure 3: Probabilistic Graph of Disease (A) and Symptom (B)
6. Use the information provided in Figure 3 to compute the following joint probability
P(A = a1,B = b0) =
A. 0.12
B. 0.36
C. 0.3
D. 0.48
E. 0.16
7. Use the information provided in Figure 3 to compute the following expression
max_{A,B} P(A,B) =
A. 0.48
B. a1,b1
C. 0.5
D. a0,b1
E. 0.36
8. Use the information provided in Figure 3 to compute the following maximum a posteriori
expression
MAP(A,B) =
A. 0.36
B. a1,b1
C. 0.5
D. a0,b1
E. a1,b1
9. Understanding a simple NumPy command.
Assume that a = np.arange(10).reshape((2, 5)). Then a.T.shape =
A. 10
B. (2,5)
C. (5,2)
D. 2
E. 5
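A minimal sketch for checking this behaviour in NumPy (the array values are simply whatever np.arange produces):

    import numpy as np

    a = np.arange(10).reshape((2, 5))  # 2 rows, 5 columns
    print(a.shape)                     # (2, 5)
    print(a.T.shape)                   # .T swaps the two axes of a 2-D array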
10. Let x = (1, 2, 3, −4) be a vector. Then its L2 norm ||x||_2 =
A. 10
B. √30
C. 4
D. 3
E. √10
11. Let x = (1, 2, 3, −4) be a vector. Then its L1 norm ||x||_1 =
A. 2
B. √30
C. 10
D. 4
E. 3
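Both norms can be checked with NumPy's np.linalg.norm (a minimal sketch; the ord argument selects the norm):

    import numpy as np

    x = np.array([1, 2, 3, -4])
    print(np.linalg.norm(x, ord=2))  # L2: sqrt(1 + 4 + 9 + 16) = sqrt(30)
    print(np.linalg.norm(x, ord=1))  # L1: |1| + |2| + |3| + |-4| = 10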
Part 2: Simple Learning Models
Figure 4: Decision Trees
12. Which decision trees in Figure 4 can represent the Boolean formula (x2 ∧ x5) ∨ (x3 ∧ ¬x1)?
A. A
B. B
C. C
D. A and C
E. none of the above
13. Figure 5 gives an example dataset D about playing tennis. Please indicate which of the following expressions is used to compute its entropy H_D(Y), where Y is the random variable for labelling:
A. −(8/14) log2(8/14) − (6/14) log2(6/14)
B. (8/14) log2(8/14) + (6/14) log2(6/14)
C. −(8/14) log2(6/14) − (6/14) log2(8/14)
D. (8/14) log2(6/14) + (6/14) log2(8/14)
E. (8/14) log2(8/14) − (6/14) log2(6/14)
Figure 5: Dataset for playing tennis
14. Figure 5 gives an example dataset D about playing tennis. Please compute the information gain of splitting over the feature Wind, InfoGain(D, Wind) = H_D(Y) − H_D(Y | Wind):
A. 0.985
B. −0.189
C. 0.128
D. 0.151
E. 0.048
15. Figure 5 gives an example dataset D about playing tennis. Please compute the information gain of splitting over the feature Humidity, InfoGain(D, Humidity) = H_D(Y) − H_D(Y | Humidity):
A. 0.258
B. −0.189
C. 0.128
D. 0.151
E. 0.048
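For questions 13-15, the entropy and information-gain formulas collected on the last page of this paper translate directly into code. A minimal sketch, assuming the 8/6 label split of question 13; the per-branch label counts for Wind and Humidity must be read off Figure 5:

    import math

    def entropy(counts):
        # entropy of a label distribution given as a list of counts
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

    def info_gain(label_counts, branches):
        # branches: one list of label counts per value of the split feature
        total = sum(label_counts)
        h_cond = sum(sum(b) / total * entropy(b) for b in branches)
        return entropy(label_counts) - h_cond

    h_y = entropy([8, 6])  # H_D(Y) for the 8/14 vs 6/14 label split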
Figure 6: A set of two-dimensional input samples
16. Assume that, as shown in Figure 6, we have a set of training instances with two features X1
and X2:
{(0.5, 3), (0.5, 0.5), (1, 0.5), (1, 2), (1.5, 0.5), (2, 4), (2.5, 3), (3, 0.5), (3, 3.5), (3.5, 4)}
such that
• the instance (0.5, 3) is labeled with value 0,
• the instances (0.5, 0.5), (1, 0.5), (1, 2), (1.5, 0.5) are labeled with value 1,
• the four instances (2, 4), (2.5, 3), (3, 0.5), (3, 3.5) are labeled with value 2, and
• the instance (3.5, 4) is labeled with value 3.
Now, we have a new input (2.5, 1.8). Please indicate which of the following points are not
considered for the 3-nn (3-nearest neighbor) classification, according to the Manhattan (L1)
distance.
A. Both (1.5, 0.5) and (3, 0.5)
B. Both (1.5, 0.5) and (2, 4)
C. (3, 0.5)
D. Both (3, 3.5) and (1, 2)
E. Both (1.5, 0.5) and (3, 0.5)
17. Continue with the above. Now, for new input (2.5, 1.8), please compute its regression result
for the 3-nn (3-nearest neighbor) regression, according to the Manhattan distance.
A. 5/3
B. 6.1/3
C. 6.2/3
D. 2.1
E. 2.2
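A minimal sketch for checking questions 16 and 17, using the points and labels listed in question 16:

    import numpy as np

    points = np.array([(0.5, 3), (0.5, 0.5), (1, 0.5), (1, 2), (1.5, 0.5),
                       (2, 4), (2.5, 3), (3, 0.5), (3, 3.5), (3.5, 4)])
    labels = np.array([0, 1, 1, 1, 1, 2, 2, 2, 2, 3])
    query = np.array([2.5, 1.8])

    dists = np.abs(points - query).sum(axis=1)  # Manhattan (L1) distances
    nearest3 = np.argsort(dists)[:3]            # indices of the 3 nearest points
    print(points[nearest3])                     # the neighbours used by 3-nn
    print(labels[nearest3].mean())              # 3-nn regression: mean of their labels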
18. Please select the correct statement from the following:
A. Validation dataset is another term for the test dataset
B. Validation dataset is part of the training dataset
C. Validation dataset is part of the test dataset
D. Validation dataset cannot be used for regularization
E. Test dataset can be overlapped with the training dataset
Figure 7: A confusion matrix for the two-class problem
19. Assume a two-class problem where each instance is classified as either 1 (positive) or -1
(negative). We have a training dataset of 1,000 instances, such that 550 of them are labeled
as 1 and 450 of them are labeled as -1. After training, we apply the trained model to classify
the 1,000 instances and find that 850 instances are classified correctly. Moreover, we know
that 500 instances are classified as 1 and that, within those 500 instances, 50 are actually labeled as -1. Please indicate which numbers should be filled in for (A, B, C, D) in Figure 7.
A. (450, 50, 150, 350)
B. (450, 50, 100, 400)
C. (400, 100, 100, 400)
D. (450, 100, 50, 400)
E. (400, 50, 150, 400)
20. Continue with the above question. Please compute the error rate of the trained model.
A. 50/1000
B. 150/1000
C. 100/1000
D. 150/2000
E. 150/850
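The four counts in Figure 7 and the error rate in question 20 follow from the figures given in question 19; a minimal sketch of that arithmetic:

    total, correct = 1000, 850
    pred_pos, fp = 500, 50       # classified as 1; of those, 50 actually labeled -1
    tp = pred_pos - fp           # true positives
    fn = 550 - tp                # the remaining instances labeled 1 were missed
    tn = 450 - fp                # negatives not mistaken for positives
    assert tp + tn == correct    # correct classifications sit on the diagonal
    print(tp, fp, fn, tn, (total - correct) / total)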
21. Given training data {(x^(i), y^(i)) | 1 ≤ i ≤ m}, linear regression is used to find a linear function f(x) = w^T x that
A. minimises the loss L̂ = (1/m) Σ_{i=1}^m w^T x^(i)
B. minimises the loss L̂ = (1/m) Σ_{i=1}^m (σ(w^T x^(i)) − y^(i))², where σ(a) = 1/(1 + exp(−a))
C. minimises the loss L̂ = (1/m) Σ_{i=1}^m (w^T x^(i) − y^(i))²
D. minimises the loss L̂ = (1/m) Σ_{i=1}^m (log(w^T x^(i)) − y^(i))²
E. maximises the loss L̂ = (1/m) Σ_{i=1}^m (log(w^T x^(i)) − y^(i))²
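A minimal sketch of minimising a squared-error loss of this form in NumPy (the toy data here are hypothetical; np.linalg.lstsq is standard NumPy):

    import numpy as np

    X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # toy inputs with a bias column
    y = np.array([2.0, 3.0, 4.0])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)           # least-squares fit
    loss = ((X @ w - y) ** 2).mean()                    # (1/m) Σ (w^T x^(i) − y^(i))²
    print(w, loss)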
22. Let f(X) = 3X1² + 4X2³ + 5X3 be a function, where X1, X2 and X3 are three variables. Please indicate which of the following gradient expressions is correct:
A. ∇_X f(X) = 6X1
B. ∇_X f(X) = (3, 4, 5)
C. ∇_X f(X) = (6X1, 12X2, 5)
D. ∇_X f(X) = (6X1, 12X2², 5)
E. ∇_X f(X) = (6X1, 24X2, 5)
23. The naive Bayes method is based on the following assumption, where Xi for i ∈ {1, ..., n} represent the features of an instance and Y represents the label:
A. P(X1, ..., Xn) = ∏_{i=1}^n P(Xi)
B. P(X1, ..., Xn) = 1 − ∏_{i=1}^n P(Xi)
C. P(X1, ..., Xn | Y) = ∏_{i=1}^n P(Xi | Y)
D. P(X1, ..., Xn) = Σ_{i=1}^n P(Xi)
E. P(X1, ..., Xn | Y) = Σ_{i=1}^n P(Xi | Y)
Part 3: Deep Learning
24. Which of the following is one of the ingredients of Rosenblatt's perceptron?
A. backward propagation for learning
B. a simple learning algorithm that adapts the weights according to the samples
C. multiple perceptrons on a single layer
D. multi-layer perceptron
E. None of the above
25. Which of the following statements is correct?
A. exclusive or (XOR) cannot be solved by a single perceptron
B. Multi-layer perceptrons cannot solve XOR
C. Rosenblatt’s algorithm can be used to train multi-layer perceptrons
D. Multi-layer perceptrons cannot solve OR and AND
E. none of the above
26. Which of the following does not contribute as a key reason for the success of deep learning?
A. Better hardware
B. Bigger data
C. Recurrent neural network architectures such as LSTM
D. Better optimization methods, such as Adam and batch normalization
E. All of them are key reasons
Figure 8: A simple 3-layer neural network
27. Figure 8 gives a simple 3-layer neural network with 3 inputs x1, x2, x3 and a single output z.
Let z = y1 + 5y2 + 2, y1 = 2x1 + 3x2 + x3 + 1, y2 = 3x1 + x2 + x3 − 2. Please indicate which of
the following expressions is correct for the gradient.
A. ∂z/∂x1 = 15
B. ∂z/∂x3 = 17
C. ∂z/∂x2 = 8
D. ∂z/∂x2 = 10
E. ∂z/∂x3 = 5
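The gradients in question 27 follow from the chain rule, e.g. ∂z/∂x1 = (∂z/∂y1)(∂y1/∂x1) + (∂z/∂y2)(∂y2/∂x1). A minimal numerical sketch using central finite differences:

    def z(x1, x2, x3):
        y1 = 2*x1 + 3*x2 + x3 + 1
        y2 = 3*x1 + x2 + x3 - 2
        return y1 + 5*y2 + 2

    eps = 1e-6
    x = [0.0, 0.0, 0.0]  # z is linear, so the gradient is the same everywhere
    for i in range(3):
        hi = list(x); hi[i] += eps
        lo = list(x); lo[i] -= eps
        print((z(*hi) - z(*lo)) / (2 * eps))  # approximates ∂z/∂x_i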
Figure 9: A two-dimensional input and a convolutional filter
28. The following four questions relate to Figure 9, in which we have a two-dimensional input and a convolutional filter. Given stride = 1, please indicate which of the following statements is correct if zero-padding is not applied.
A. the result is a one dimensional array of length 9
B. the result is a one dimensional array of length 16
C. the result is a two dimensional array of shape (3, 3)
D. the result is a two dimensional array of shape (4, 4)
E. None of the above is correct
29. Continue with the above question. Please indicate which of the following statements is correct for the result of applying the convolutional filter to the input.
A. there isn’t an element 45
B. there is an element 39
C. there isn’t an element 34
D. there isn’t an element 47
E. None of the above is correct
30. Take the same input as in Figure 9 and apply max-pooling with a 2×2 filter. Please indicate which of the following statements is correct.
A. the result is a one dimensional array of length 2
B. the result is a one dimensional array of length 4
C. the result is a two dimensional array of shape (2, 2)
D. the result is a two dimensional array of shape (4, 4)
E. None of the above is correct
31. Continue with the above question. Please indicate which of the following statements is correct for the result of applying max-pooling with a 2×2 filter (stride 1) to the input.
A. there is a single element with value 7
B. there is a single element with value 9
C. there are two elements with value 7
D. there are three elements with value 9
E. None of the above is correct
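Since the numeric contents of Figure 9 are not reproduced in this text, the following is a shape-level sketch only, assuming (as the answer options suggest) a 4×4 input and a 2×2 filter; the actual values must be read from the figure:

    import numpy as np

    def conv2d_valid(x, k):
        # cross-correlation with stride 1 and no zero-padding ("valid" mode)
        H, W = x.shape
        h, w = k.shape
        out = np.empty((H - h + 1, W - w + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = (x[i:i+h, j:j+w] * k).sum()
        return out

    def maxpool2d(x, size, stride):
        rows = (x.shape[0] - size) // stride + 1
        cols = (x.shape[1] - size) // stride + 1
        out = np.empty((rows, cols))
        for i in range(rows):
            for j in range(cols):
                out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
        return out

    x = np.arange(16).reshape(4, 4)  # hypothetical stand-in for the Figure 9 input
    k = np.ones((2, 2))              # hypothetical stand-in for the filter
    print(conv2d_valid(x, k).shape)  # (3, 3): one entry per valid filter placement
    print(maxpool2d(x, 2, 2).shape)  # (2, 2) with stride 2
    print(maxpool2d(x, 2, 1).shape)  # (3, 3) with stride 1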
32. Which of the following statements is incorrect with respect to features and feature manifolds?
A. Very often, high-dimensional data lie on lower-dimensional feature manifolds.
B. Computing the coordinates of the data with respect to feature manifolds enables an easy separation of the data.
C. Feature manifolds are linear, so they are easier to compute.
D. In an end-to-end learning of feature hierarchy, initial modules capture low-level features,
middle modules capture mid-level features, and last modules capture high level, class
specific features.
E. None of the above is correct
Part 4: Probabilistic Graphical Models
33. Which of the following is key for Bayesian networks to represent a joint probability distribution:
A. graph and conditional probability distributions
B. chain rules and conditional probability distributions
C. chain rules and joint probability distribution table
D. graph and joint probability distribution table
E. None of the above
Figure 10: Simple Probabilistic Graphical Model
34. Figure 10 provides a simple probabilistic graphical model of three variables S, G, and I. We
already know that
P |= (S⊥G | I)
Which of the following is the value of P(i0, s1,g2)?
A. 0.0410
B. 0.0119
C. 0.0408
D. 0.458
E. 0.216
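Note that, given the stated independence, the joint factorises as P(i0, s1, g2) = P(i0) · P(s1 | i0) · P(g2 | i0), with each factor read from the tables in Figure 10.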
Figure 11: Joint probability of three random variables
35. Figure 11 (a) provides a joint probability P. Let I(P) be the set of conditional independence assertions of the form (X⊥Y | Z) that hold in P. Which of the following is correct?
A. (X⊥Y | Z ) ∈ I(P)
B. (Y⊥Z | X ) ∈ I(P)
C. (X⊥Y ) ∈ I(P)
D. I(P) = ∅
E. None of the above is correct
36. Figure 11 (b) provides a joint probability P. Let I(P) be the set of conditional independence assertions of the form (X⊥Y | Z) that hold in P. Which of the following is correct?
A. (X⊥Z | Y) ∉ I(P)
B. (Y⊥Z | X) ∉ I(P)
C. (X⊥Y) ∉ I(P)
D. I(P) = ∅
E. None of the above is correct
Figure 12: A Bayesian network
37. Consider the Bayesian network model G in Figure 12 and indicate which of the following is
not in I(G):
A. (L⊥I,D,S | G)
B. (G⊥S | D, I)
C. (I⊥D)
D. (D⊥I,S | G)
E. None of the above is correct
38. Consider the Bayesian network model G in Figure 12 and calculate the following value
P(i1,d0,g2, s1, l0) =
A. 0.004608
B. 0.4608
C. 0.5329
D. 0.001435
E. 0.101435
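Question 38 is an application of the chain rule for Bayesian networks: assuming the edge structure of Figure 12 has G depending on I and D, S on I, and L on G, the joint factorises as P(i1, d0, g2, s1, l0) = P(i1) · P(d0) · P(g2 | i1, d0) · P(s1 | i1) · P(l0 | g2), with each factor read from the CPTs in the figure.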
39. Consider the probabilistic graphical model in Figure 12 and indicate which of the following statements is correct.
A. We can do evidential reasoning by computing P(l1) and P(l1 | i0,d0)
B. We can do causal reasoning by computing P(i1) and P(i1 | l0,g0)
C. We can do intercausal reasoning by computing P(i1) and P(i1 | l0,g0)
D. We can do causal reasoning by computing P(l1) and P(l1 | d0)
E. None of the above is correct
Figure 13: A simple probabilistic graphical model
40. Consider the probabilistic graphical model in Figure 13. Please indicate which of the following statements is incorrect.
A. D can influence L when G is observed
B. D can influence I when G is not observed and L is observed
C. G can influence S when I is not observed
D. D can influence I when G is observed
E. None of the above is correct
This page collects some formulas/expressions that may be used in this exam.
1. Entropy:
H(Y) = − Σ_{y ∈ values(Y)} P(y) log2 P(y)
2. Conditional entropy:
H(Y | X) = Σ_{x ∈ values(X)} P(X = x) H(Y | X = x)
where
H(Y | X = x) = − Σ_{y ∈ values(Y)} P(Y = y | X = x) log2 P(Y = y | X = x)