COM6509

Data Provided: None

DEPARTMENT OF COMPUTER SCIENCE AUTUMN SEMESTER 2018

Machine Learning and Adaptive Intelligence 2 hours

Answer ALL the questions.

Figures in square brackets indicate the marks allocated to each part of a question, out of 100.


1. Machine Learning and Probability [Total: 25 marks]

a) Give two examples of supervised learning problems and two examples of unsupervised learning problems. In each example, state what the output is and what the inputs are. [5 marks]

b) Let $X$ and $Y$ be two random variables with joint probability distribution $P(X, Y)$. Show that the mean of their sum satisfies

$$E[X + Y] = E[X] + E[Y].$$

[10 marks]

c) We are building a probabilistic model for predicting whether a patient has Meningitis or not based on some descriptive features related to the symptoms that the patient exhibits. Our dataset is shown in the following table:

ID   Headache (H)   Fever (F)   Vomiting (V)   Meningitis (M)
1    true           true        false          false
2    false          true        false          false
3    true           false       true           false
4    true           false       true           false
5    false          true        false          true
6    true           false       true           false
7    true           false       true           false
8    true           false       true           true
9    false          true        false          false
10   true           false       true           true

Symptoms include the presence or absence of headache (column Headache in the table above), the presence or absence of fever (column Fever in the table above), and the presence or absence of vomiting (column Vomiting in the table above). What is the probability that a patient has Meningitis if they exhibit the following features: Headache=true, Fever=false, and Vomiting=true? [10 marks]
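One way to sanity-check an answer to part c) numerically is sketched below. The question does not state which modelling assumption to use, so the sketch shows two possible readings, both of which are assumptions rather than the intended solution: a direct relative-frequency estimate over the matching rows, and a naive Bayes estimate that treats the symptoms as conditionally independent given Meningitis. The names `DATA`, `direct_estimate` and `naive_bayes_estimate` are illustrative.

```python
# Rows of the table above: (Headache, Fever, Vomiting, Meningitis).
DATA = [
    (True, True, False, False),
    (False, True, False, False),
    (True, False, True, False),
    (True, False, True, False),
    (False, True, False, True),
    (True, False, True, False),
    (True, False, True, False),
    (True, False, True, True),
    (False, True, False, False),
    (True, False, True, True),
]


def direct_estimate(query):
    """Relative frequency of M=true among rows whose symptoms equal `query`."""
    matches = [row for row in DATA if row[:3] == query]
    return sum(row[3] for row in matches) / len(matches)


def naive_bayes_estimate(query):
    """P(M=true | query), assuming symptoms are independent given M."""
    scores = {}
    for m in (True, False):
        rows = [row for row in DATA if row[3] == m]
        score = len(rows) / len(DATA)  # prior P(M=m)
        for j, value in enumerate(query):
            # Per-symptom likelihood P(symptom_j = value | M=m), by counting.
            score *= sum(row[j] == value for row in rows) / len(rows)
        scores[m] = score
    # Normalise so the two class scores sum to one.
    return scores[True] / (scores[True] + scores[False])


query = (True, False, True)  # Headache=true, Fever=false, Vomiting=true
print(direct_estimate(query), naive_bayes_estimate(query))
```

The two estimates generally differ because they encode different assumptions; no smoothing is applied in either.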


2. Linear Regression and Basis Functions [Total: 25 marks]

Consider a regression problem for which each observed output $y_i$ has an associated weight factor $r_i > 0$, such that the sum of squared errors is given as

$$E(\mathbf{w}) = \sum_{i=1}^{n} r_i \left(y_i - \mathbf{w}^\top \boldsymbol{\phi}_i\right)^2,$$

where $\mathbf{w} = [w_0, \ldots, w_m]^\top$ is the vector of parameters, and $\boldsymbol{\phi}_i = [\phi_0(x_i), \ldots, \phi_m(x_i)]^\top$ is a vector-valued function of basis functions.
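As a concrete reading of the weighted sum of squared errors defined above, the following minimal sketch evaluates it term by term in plain Python. The polynomial basis, the function names and the toy numbers are all assumptions made for illustration; the question itself leaves the basis functions unspecified.

```python
def poly_basis(x, m):
    """Hypothetical basis (an assumption): phi_j(x) = x**j for j = 0..m."""
    return [x ** j for j in range(m + 1)]


def weighted_sse(w, xs, ys, rs):
    """E(w) = sum_i r_i * (y_i - w^T phi_i)^2, evaluated one term at a time."""
    total = 0.0
    for x, y, r in zip(xs, ys, rs):
        phi = poly_basis(x, len(w) - 1)
        prediction = sum(wj * pj for wj, pj in zip(w, phi))
        total += r * (y - prediction) ** 2
    return total


# Toy values chosen only for illustration.
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]
rs = [1.0, 2.0, 0.5]

print(weighted_sse([1.0, 2.0], xs, ys, rs))  # the line y = 1 + 2x fits exactly
print(weighted_sse([0.0, 2.0], xs, ys, rs))  # every residual equals 1
```

In the second call each residual is 1, so the error reduces to the sum of the weight factors, which shows how $r_i$ scales the contribution of each observation.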

a) Starting with the expression above, write the sum of squared errors in matrix form. You should include each of the steps necessary to get the matrix form solution. [HINT: a diagonal matrix is a matrix that is zero everywhere except for the entries on its main diagonal. The weight factors $r_i > 0$ can be written as the main diagonal of a diagonal matrix $\mathbf{R}$ of size $n \times n$.] [15 marks]

b) Find the optimal value of $\mathbf{w}$, denoted $\mathbf{w}^*$, that minimises the sum of squared errors. The solution should be in matrix form. Use matrix derivatives. [10 marks]


3. Bayesian Regression and Naive Bayes [Total: 25 marks]

a) Use one or two sentences to briefly describe the purpose of each of the four components in Bayes' rule, and write down Bayes' rule in terms of these components as a formula: 1) prior, 2) likelihood, 3) marginal likelihood, 4) posterior. [8 marks]

b) What does the term marginalise mean in relation to probability distributions? Give a descriptive answer, and then write down the appropriate formulas for marginalising both discrete and continuous random variables. [5 marks]

c) Explain the naive Bayes assumption that lets us simplify the expression $P(X_1 = v_1, \ldots, X_d = v_d \mid C = c)\,P(C = c)$. [3 marks]

d) Assume we have a random sample that is Bernoulli distributed, $X_1, \ldots, X_n \sim \text{Bernoulli}(\theta)$. We are going to derive the Maximum Likelihood Estimation (MLE) for $\theta$. Recall that a Bernoulli random variable $X$ takes values in $\{0, 1\}$ and has probability mass function given by

$$P(X; \theta) = \theta^{X}(1 - \theta)^{1 - X}.$$

Derive the likelihood, denoted as $L(\theta; X_1, \ldots, X_n)$, and the log likelihood, denoted as $\ell(\theta; X_1, \ldots, X_n)$. Show your steps. [6 marks]
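A derived log likelihood can be checked numerically. The sketch below assumes the sample is independent and identically distributed, so that the log likelihood is the sum of the log mass function over the observations, and searches a grid of $\theta$ values for the maximiser. The sample values, the grid and the function name are made up for illustration.

```python
import math


def bernoulli_log_likelihood(theta, sample):
    """Sum over the sample of X*log(theta) + (1 - X)*log(1 - theta)."""
    return sum(
        x * math.log(theta) + (1 - x) * math.log(1 - theta) for x in sample
    )


sample = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]  # made-up observations, six ones
grid = [k / 100 for k in range(1, 100)]  # theta in (0, 1), avoiding log(0)

# Grid value with the largest log likelihood.
best_theta = max(grid, key=lambda t: bernoulli_log_likelihood(t, sample))
print(best_theta, sum(sample) / len(sample))
```

For this sample the grid maximiser coincides with the sample mean, which is a useful cross-check on an analytical derivation.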

e) Practical implementations of a Naive Bayes classifier often use log probabilities. Explain why. [3 marks]
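One numerical effect relevant to part e) can be seen in a few lines of Python. The sketch multiplies many small probabilities directly and, separately, sums their logarithms. The probability values are made up, and this illustrates only the floating-point aspect; it is not meant as a complete answer.

```python
import math

probs = [0.01] * 1000  # made-up per-feature likelihoods

# Direct product: the true value, 1e-2000, is far below the smallest
# positive double, so the running product underflows to exactly 0.0.
product = 1.0
for p in probs:
    product *= p

# Sum of logs: the same quantity on a log scale stays finite.
log_sum = sum(math.log(p) for p in probs)

print(product)
print(log_sum)
```

Once the product has underflowed to zero for every class, the classes can no longer be compared, whereas the log sums remain distinguishable.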


4. Principal Component Analysis (PCA) and Logistic Regression [Total: 25 marks]

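a) You are given a dataset $X_{\text{trn}}$ with the following properties:

Covariance matrix: $C = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$

Eigenvectors of $C$: $\mathbf{w}_1 = \begin{bmatrix} -0.55 \\ 0.83 \end{bmatrix}$, $\mathbf{w}_2 = \begin{bmatrix} 0.83 \\ 0.55 \end{bmatrix}$

Eigenvalues of $C$: $\lambda_1 = 5$, $\lambda_2 = 51$.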

(i) Explain the criterion function that PCA optimises. [2 marks]

(ii) Which direction vector above gives the first principal component for this dataset? Why? [2 marks]

(iii) How much of the variance in the data is explained by the first principal component? [3 marks]

(iv) What is the geometrical relation between the first and the second principal components? [1 mark]

(v) Given a test data sample $\mathbf{x}_{\text{tst}}$, you are required to get its PCA representation $\mathbf{y}_{\text{tst}}$. Write down, as specifically as possible, a mathematical formula [or expression] that will allow you to get $\mathbf{y}_{\text{tst}}$. [5 marks]

(vi) For the training dataset $X_{\text{trn}}$, the PCA-transformed dataset is $Y_{\text{trn}}$. What is the covariance matrix for the PCA-transformed data? (Hint: no computation is needed to get the answer.) [4 marks]

b) This question is about binary logistic regression with two output classes.

An experiment is conducted on the toxicity of doses of an insecticide given to the tobacco budworm moth. In the experiment, batches of 20 male moths were exposed for 3 days to the insecticide, and the number in each batch that were dead or knocked down was recorded. The data are given below.

Dose level          1    2    4    8    16   32
log2(Dose level)    0    1    2    3    4    5
Dead or down        1    4    9    13   18   20

(i) What is the definition of the odds in binary logistic regression? [2 marks]

(ii) What are the maximal and minimal possible values of the odds? [2 marks]

(iii) From the table above, what are the observed odds of dead or down at dose level 16? [2 marks]

(iv) How would you build a classification model that can handle three output classes using binary logistic regression? [2 marks]
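For checking answers against the moth data, the sketch below computes the observed proportion and the observed odds for each dose level in the table. It takes the odds to be the ratio of affected to unaffected moths in a batch, which equals $p / (1 - p)$ for an observed proportion $p$. The names `BATCH_SIZE` and `observed_odds` are illustrative, and treating a batch with no survivors as having infinite odds is a convention chosen here, not something stated in the question.

```python
import math

BATCH_SIZE = 20  # moths per batch, as stated in the question
dose_levels = [1, 2, 4, 8, 16, 32]
dead_or_down = [1, 4, 9, 13, 18, 20]


def observed_odds(events, total):
    """Odds = events / non-events; infinite when every moth is affected."""
    non_events = total - events
    return math.inf if non_events == 0 else events / non_events


for dose, dead in zip(dose_levels, dead_or_down):
    # Columns: dose level, observed proportion, observed odds.
    print(dose, dead / BATCH_SIZE, observed_odds(dead, BATCH_SIZE))
```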

END OF QUESTION PAPER
