COM6509
Data Provided:
None
DEPARTMENT OF COMPUTER SCIENCE AUTUMN SEMESTER 2018
Machine Learning and Adaptive Intelligence 2 hours
Answer ALL the questions.
Figures in square brackets indicate the marks allocated to each part of a question,
out of 100.
1. Machine Learning and Probability [Total: 25 marks]
a) Give two examples of supervised learning problems and two examples of unsupervised
learning problems. In each example, state what the output is and what the inputs are.
[5 marks]
b) Let X and Y be two random variables with joint probability distribution P(X, Y).
Show that the mean of their sum satisfies

    E[X + Y] = E[X] + E[Y].
[10 marks]
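Illustrative aside (not part of the question): the identity can be sanity-checked numerically on a small, made-up joint distribution, for example in Python. The table P below is invented purely for the check:

import numpy as np

# A made-up 2x3 joint distribution P(X, Y); any non-negative table
# summing to 1 will do for this check.
P = np.array([[0.10, 0.20, 0.05],
              [0.25, 0.15, 0.25]])
x_vals = np.array([0.0, 1.0])        # values taken by X (rows)
y_vals = np.array([-1.0, 0.0, 2.0])  # values taken by Y (columns)

E_X = x_vals @ P.sum(axis=1)         # E[X] from the marginal of X
E_Y = y_vals @ P.sum(axis=0)         # E[Y] from the marginal of Y

# E[X + Y] computed directly from the joint distribution.
E_sum = sum(P[i, j] * (x_vals[i] + y_vals[j])
            for i in range(len(x_vals)) for j in range(len(y_vals)))

print(E_sum, E_X + E_Y)              # both print 0.9: the identity holds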
c) We are building a probabilistic model to predict whether or not a patient has
Meningitis, based on descriptive features related to the symptoms that the patient
exhibits. Our dataset is shown in the following table.
ID  Headache (H)  Fever (F)  Vomiting (V)  Meningitis (M)
 1  true          true       false         false
 2  false         true       false         false
 3  true          false      true          false
 4  true          false      true          false
 5  false         true       false         true
 6  true          false      true          false
 7  true          false      true          false
 8  true          false      true          true
 9  false         true       false         false
10  true          false      true          true
Symptoms include the presence or absence of headache (column Headache in the
table above), the presence or absence of fever (column Fever), and the presence or
absence of vomiting (column Vomiting). What is the probability that a patient has
Meningitis if they exhibit the following features:
Headache=true, Fever=false, and Vomiting=true? [10 marks]
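Illustrative aside (not part of the question): one way to sanity-check an answer is to count matching rows in the table directly; a naive Bayes model would instead factorise the class-conditional likelihood. A minimal sketch in Python:

# Estimate P(M=true | H=true, F=false, V=true) by direct counting
# over the ten rows of the table above. This is only an empirical
# sanity check, not necessarily the intended derivation.
rows = [  # (H, F, V, M) for IDs 1..10
    (True,  True,  False, False),
    (False, True,  False, False),
    (True,  False, True,  False),
    (True,  False, True,  False),
    (False, True,  False, True),
    (True,  False, True,  False),
    (True,  False, True,  False),
    (True,  False, True,  True),
    (False, True,  False, False),
    (True,  False, True,  True),
]
match = [m for (h, f, v, m) in rows if h and not f and v]
print(sum(match) / len(match))  # fraction of matching rows with M=true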
2. Linear Regression and Basis Functions [Total: 25 marks]
Consider a regression problem for which each observed output y_i has an associated weight
factor r_i > 0, such that the sum of squared errors is given as

    E(w) = ∑_{i=1}^{n} r_i (y_i − wᵀφ_i)²,

where w = [w_0, . . . , w_m]ᵀ is the vector of parameters, and φ_i = [φ_0(x_i), . . . , φ_m(x_i)]ᵀ
is the vector of basis functions evaluated at the input x_i.
a) Starting from the expression above, write the sum of squared errors in matrix form.
You should include each of the steps necessary to arrive at the matrix form. [HINT:
a diagonal matrix is a matrix that is zero everywhere except for the entries on its main
diagonal. The weight factors r_i > 0 can be written as the main diagonal of a diagonal
matrix R of size n × n.] [15 marks]
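Illustrative aside (a sketch of the target form, for reference: it assumes the hint's diagonal matrix R = diag(r_1, . . . , r_n), the stacked output vector y = [y_1, . . . , y_n]ᵀ, and the design matrix Φ whose i-th row is φ_iᵀ):

    E(w) = (y − Φw)ᵀ R (y − Φw).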
b) Find the optimal value of w, w∗, that minimises the sum of squared errors. The
solution should be in matrix form. Use matrix derivatives. [10 marks]
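Illustrative aside (not part of the question): setting the matrix derivative to zero leads to the standard weighted least squares form w* = (ΦᵀRΦ)⁻¹ΦᵀRy, which can be sanity-checked numerically on made-up data. Variable names below are illustrative:

import numpy as np

rng = np.random.default_rng(0)

# Made-up data: n observations, polynomial basis of degree m.
n, m = 50, 3
x = rng.uniform(-1, 1, size=n)
y = 2.0 - x + 0.5 * x**3 + 0.1 * rng.standard_normal(n)
r = rng.uniform(0.5, 2.0, size=n)           # per-observation weights r_i > 0

Phi = np.vander(x, m + 1, increasing=True)  # design matrix, rows are phi_i^T
R = np.diag(r)                              # diagonal weight matrix

# Closed-form minimiser of E(w) = (y - Phi w)^T R (y - Phi w);
# solve() is preferred over forming the inverse explicitly.
w_star = np.linalg.solve(Phi.T @ R @ Phi, Phi.T @ R @ y)
print(w_star)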
3. Bayesian Regression and Naive Bayes [Total: 25 marks]
a) Use one or two sentences to briefly describe the purpose of each of the following four
components of Bayes' rule, and write down Bayes' rule in terms of these components
as a formula: 1) prior, 2) likelihood, 3) marginal likelihood, 4) posterior. [8 marks]
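Illustrative aside (the standard form, for reference; D denotes observed data and θ the quantity of interest, symbols which are generic rather than taken from the question):

    posterior = likelihood × prior / marginal likelihood,   i.e.   P(θ | D) = P(D | θ) P(θ) / P(D).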
b) What does the term marginalise mean in relation to probability distributions? Give
a descriptive answer, and then write down the appropriate formulas for marginalising
both discrete and continuous random variables. [5 marks]
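Illustrative aside (the standard sum rule and its continuous analogue, for reference):

    P(X = x) = ∑_y P(X = x, Y = y)      (discrete)
    p(x) = ∫ p(x, y) dy                 (continuous)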
c) Explain the naive Bayes assumption that lets us simplify the expression
P(X_1 = v_1, · · · , X_d = v_d | C = c) P(C = c). [3 marks]
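Illustrative aside (for reference, the factorised form that the assumption yields):

    P(X_1 = v_1, · · · , X_d = v_d | C = c) P(C = c) = P(C = c) ∏_{j=1}^{d} P(X_j = v_j | C = c).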
d) Assume we have a random sample that is Bernoulli distributed, X_1, · · · , X_n ∼ Bernoulli(θ).
We are going to derive the maximum likelihood estimate (MLE) of θ. Recall that
a Bernoulli random variable X takes values in {0, 1} and has probability mass function
given by

    P(X; θ) = θ^X (1 − θ)^(1−X).

Derive the likelihood, denoted L(θ; X_1, · · · , X_n), and the log-likelihood, denoted
ℓ(θ; X_1, · · · , X_n). Show your steps. [6 marks]
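Illustrative aside (a sketch of the standard derivation, assuming the sample is i.i.d., which is the usual reading of "random sample"):

    L(θ; X_1, · · · , X_n) = ∏_{i=1}^{n} θ^(X_i) (1 − θ)^(1−X_i) = θ^(∑_i X_i) (1 − θ)^(n − ∑_i X_i),
    ℓ(θ; X_1, · · · , X_n) = log L = (∑_i X_i) log θ + (n − ∑_i X_i) log(1 − θ).

Setting dℓ/dθ = 0 then gives the MLE θ̂ = (1/n) ∑_i X_i.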
e) Practical implementations of a Naive Bayes classifier often use log probabilities.
Explain why. [3 marks]
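Illustrative aside (a minimal sketch of the numerical issue involved; the numbers are made up):

import math

# A made-up long product of small per-feature likelihoods.
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p
print(product)                      # 0.0 -- the product underflows float64

log_sum = sum(math.log(p) for p in probs)
print(log_sum)                      # about -1151.3, easily representable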
4. Principal Component Analysis (PCA) and Logistic Regression [Total: 25 marks]
a) You are given a dataset Xtrn with the following properties:

    Covariance matrix:    C = [ 1  2
                                3  4 ]

    Eigenvectors of C:    w1 = [ −0.55, 0.83 ]ᵀ,   w2 = [ 0.83, 0.55 ]ᵀ

    Eigenvalues of C:     λ1 = 5,  λ2 = 51.
(i) Explain the criterion function that PCA optimises. [2 marks]
(ii) Which direction vector above gives the first principal component for this dataset?
Why? [2 marks]
(iii) How much of the variance in the data is explained by the first principal
component? [3 marks]
(iv) What is the geometric relation between the first and the second principal
components? [1 mark]
(v) Given a test data sample xtst, you are required to get its PCA representation
ytst. Write down, as specifically as possible, a mathematical formula [or
expression] that will allow you to get ytst. [5 marks]
(vi) For the training dataset Xtrn, the PCA-transformed dataset is Ytrn. What is
the covariance matrix for the PCA-transformed data? (Hint: no computation
is needed to get the answer.) [4 marks]
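Illustrative aside (a sketch with made-up data standing in for Xtrn, not the matrix C given above): centre with the training mean, project onto the eigenvectors, and observe that the transformed covariance is (near-)diagonal:

import numpy as np

rng = np.random.default_rng(1)

# Made-up training data: n samples x 2 features.
Xtrn = rng.multivariate_normal([0, 0], [[3.0, 1.2], [1.2, 1.0]], size=200)

mu = Xtrn.mean(axis=0)                 # training mean
C = np.cov(Xtrn, rowvar=False)         # sample covariance
eigvals, W = np.linalg.eigh(C)         # columns of W are eigenvectors
W = W[:, ::-1]                         # sort by decreasing eigenvalue

# PCA representation of a test sample: y_tst = W^T (x_tst - mu).
x_tst = np.array([1.0, -0.5])
y_tst = W.T @ (x_tst - mu)
print(y_tst)

# Covariance of the transformed training data is (near-)diagonal,
# with the eigenvalues of C on its diagonal.
Ytrn = (Xtrn - mu) @ W
print(np.cov(Ytrn, rowvar=False).round(6))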
b) This question is about binary logistic regression with two output classes.
An experiment is conducted on the toxicity of doses of an insecticide given to the
tobacco budworm moth. In the experiment, batches of 20 male moths were exposed
to the insecticide for 3 days, and the number in each batch that were dead or knocked
down was recorded. The data are given below.
    Dose level          1    2    4    8   16   32
    log2(Dose level)    0    1    2    3    4    5
    Dead or down        1    4    9   13   18   20
(i) What is the definition of the odds in binary logistic regression? [2 marks]
(ii) What are the maximal and minimal possible values of the odds? [2 marks]
(iii) From the table above, what are the observed odds of dead or down at dose
level 16? [2 marks]
(iv) How would you build a classification model that can handle three output classes
using binary logistic regression? [2 marks]
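Illustrative aside on items (i) and (iii) above (a minimal sketch; the counts are read off the table):

dead, total = 18, 20          # 'dead or down' count at dose level 16
p = dead / total              # estimated probability of dead or down
odds = dead / (total - dead)  # equivalently p / (1 - p)
print(p, odds)                # 0.9 9.0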
END OF QUESTION PAPER