程序代写案例-CMPT 726-Assignment 2
时间:2022-03-10
CMPT 726 Machine Learning
Spring 2022 Assignment 2
Due: Monday, March 14 at 11:59 pm Pacific Time
This assignment comes with a zip archive containing starter code, which is required to complete
the programming component of this assignment. Each part is equally weighted, and partial credit
will be awarded to reasonable attempts at solving each question, even if they are not completely
correct.
This assignment is designed to be substantially more challenging than the quizzes and requires
thorough understanding of the course material and extensive thought, so start early! If you get
stuck, post on the discussion board or come to office hours. Make sure to check the discussion
board often for updates, clarifications, corrections and/or hints.
There will be two office hours dedicated to assignment-related questions in the week the assign-
ment is due and in the preceding week. Times and Zoom links will be posted on Canvas. If you
have a question that you would like to ask during office hours, make sure to post a brief summary
of your question in the office hours thread on the Canvas discussion board. During office hours,
questions will be answered in the order they appeared on the office hours thread. We may not be
able to get to all questions, so please start early and plan ahead.
Requests for extensions will not be considered under any circumstances – make sure you know
how to submit your assignment on Canvas and leave sufficient time to deal with any technical
difficulties, Internet outages and server downtimes. It is possible to update your assignment after
submission, so we recommend uploading a version of your assignment at least several hours before
the deadline, even if it is incomplete.
Warning: Copying others’ solutions, seeking help from others not in this course or posting ques-
tions online are considered cheating. Consequences are severe and could lead to suspension or
expulsion. If you become aware of such instances, you have an affirmative duty to report them.
They should be reported at https://forms.gle/8FGsMop3KP9A6jEL6.
Submission Instructions:
Carefully follow the instructions below when submitting your assignment. Not following instruc-
tions will result in point deductions.
1. Each part should start on a new page. Specify the page numbers of the start of each part
both on the first page of your PDF submission and on Canvas. Submit your PDF to Crowd-
mark by signing into Crowdmark with Canvas and then selecting this course, followed by
“Assignment 2”. You will need to upload your cover page, declaration of collaborators and
solutions for each part. To do so, you can either upload the same PDF for each part and
remove the irrelevant pages directly in Crowdmark, or split the PDF into different parts on
your computer and upload each part. Submit your code as a zip archive to Canvas. The di-
©Simon
Fraser University. All Rights Reserved. Sharing this publicly
constitutes both academic misconduct and copyright violation. 1
rectory structure and contents of the zip file should be as prescribed in the instructions below
for programming questions. The submission portal is located at Assignments → Assignment
2 → Assignment 2 Code.
We strongly encourage typesetting your solutions in LaTeX using the assignment template
posted on Canvas. If you have not used LaTeX before, you can follow the LaTeX tutorial
linked from Canvas. You may also type your solutions using Word, but please make sure
to submit in PDF format as opposed to .doc/.docx format. Neatly handwritten and scanned
solutions are acceptable as well; however, please note that we will not be able to give credit
to solutions that are not legible and reserve the right to deduct points for messy handwriting.
If there are graphs, include those graphs in the correct parts. Do not put them in an appendix.
We need the solution to each part to be self-contained on pages of its own.
• At the start of each question, please state: (1) who you received help from on the prob-
lem, and (2) who you provided help to.
• On the first page of your PDF submission, please copy the following statement and sign
your signature next to it. (Mac Preview and FoxIt PDF Reader, among others, have tools
to let you sign a PDF file.) We want to make it extra clear so that no one inadvertently
cheats.
“I certify that all solutions are entirely in my own words and that I have not looked at
another student’s solutions. I have given credit to all external sources I consulted.”
2. For all theory questions, make sure to show your work and include all key steps in your
derivations and proofs. If you apply a non-trivial theorem or fact, you must state the theorem
or fact you used, either by name (e.g.: Jensen’s inequality) or by writing down the general
statement of the theorem or fact. You may use theorems and facts covered in lecture without
proof. If you use non-trivial theorems and facts not covered in lecture, you must prove them.
All conditions should be checked before applying a theorem, and if a statement follows
from a more general fact, the reason why it is a special case of the general fact should be
stated. The direction of logical implication must be clearly stated, e.g.: if statement A implies
statement B, statement B implies statement A, or statement A is equivalent to statement B
(i.e.: statement A is true if and only if statement B is true). You should number the equations
that you refer to in other parts and refer to them by their numbers. You must highlight the
final answer or conclusion in your PDF submission, either by changing the font colour to
red or drawing a box around it.
3. For all programming questions, starter code is provided as a Jupyter notebook. You should
work out your solution in the notebook. We recommend uploading your notebook to a hosted
service like Google Colab to avoid the installation of dependencies and minimize setup
time. When you are done, take a screenshot of your code and output and include it in your
PDF. In addition, you should download the Jupyter notebook with your solutions (File →
Download → Download .ipynb) and also download your code (File → Download →
Download .py). You need to submit both as a zip archive named “CMPT726-419 A2 ⟨Last
Name⟩ ⟨Student ID⟩.zip”, whose directory structure should be the same as that of the zip
archive for the starter code. Do NOT include any data files we provided. Please include a
©Simon
Fraser University. All Rights Reserved. Sharing this publicly
constitutes both academic misconduct and copyright violation. 2
short file named README listing your name and student ID. The PDF should not be in-
cluded in the zip archive and should be submitted separately. Please make sure that your
code doesn’t take up inordinate amounts of time or memory. If your code cannot be executed,
your solution cannot be verified.
©Simon
Fraser University. All Rights Reserved. Sharing this publicly
constitutes both academic misconduct and copyright violation. 3
1 Probability (Theory and Programming)
To date, COVID-19 has infected almost 370 million people around
the world. Several tests have been developed to detect COVID-19
infections, with some more reliable than others. In this problem,
we will use Bayes rule to interpret the results of different COVID-
19 tests.
(a) Suppose you are a doctor and suspect that one of your patients may have been infected with
COVID-19. So you order a CRP test, which has a true positive rate of 76% and true negative
rate of 68%, i.e.: 76% of the test results are positive when the patient actually has the virus,
and 68% of the test results are negative when the patient doesn’t have the virus. You work in
country A, where 25 out of every 1,000 people has COVID-19. How likely is your patient to
actually have COVID-19 when they test negative? How about if they test positive?
(b) If your patient tests positive, are they more likely to actually have COVID-19 than not?
If your patient tests negative, are they more likely to not have COVID-19 than to have it?
Show your work.
(c) Now suppose you are in country B, where 10 out of every 1,000 people have COVID-19. There
is a new test with an accuracy of p when it returns positive, that is, p of those with COVID-19
test positive and 1− p of those without COVID-19 also test positive. If a patient tests positive,
what is the probability of them actually having COVID-19? Let q be the probability that a
patient actually has COVID-19 if they test positive. Suppose we consider a test to be reliable
if q ≥ 0.76. How much does p have to be at least in order for the test to be considered
reliable?
(d) Consider a random variable X with the following probability density function (PDF):
f (x) =
cx 0 ≤ x ≤ 10 otherwise (1)
Calculate the following quantities:
i. c
ii. E[X]
iii. Var(X)
iv. E[2X−2]
v. Var(2X−2)
©Simon
Fraser University. All Rights Reserved. Sharing this publicly
constitutes both academic misconduct and copyright violation. 4
(e) For the next three parts, we will explore three different distributions and estimate their moments
and probability density from samples. In general, given samples {x⃗i}ni=1 of a random vector X⃗,
the true mean µ⃗ := E[X⃗] and true covariance Σ := E[(X⃗ − E[X⃗])(X⃗ − E[X⃗])⊤] of X⃗ can be
estimated with the sample mean ̂⃗µ and sample covariance Σ̂, the formulas of which are below:
̂⃗µ = 1
n
n∑
i=1
x⃗i (2)
Σ̂ =
1
n − 1
n∑
i=1
(
x⃗i − ̂⃗µ) (x⃗i − ̂⃗µ)⊤ (3)
We can also estimate the probability density function of a random vector X⃗ from samples when
X⃗ has relatively low dimensions. The kernel density estimator (KDE) is one such approach.
In the starter code, we provide a function, KDE plot, which estimates the joint and marginal
probability densities of samples of 2D random vectors and plots them. In the plot, the con-
tour plot of the estimated joint distribution is shown at the centre, and the estimated marginal
distributions of each coordinate are shown at the top and on the right.
In this part, we will consider the multivariate Gaussian distribution with mean µ⃗ =
23
and covariance Σ =
10 00 3
. Generate 5000 samples from this distribution using the
generate gaussian samples function provided in the starter code and plot the samples
as a scatter plot using the scatter plot function. Implement the estimate mean and
estimate cov functions to compute the sample mean ̂⃗µ and the sample covariance Σ̂,
and then run them on the generated samples. How do the sample mean ̂⃗µ and sample
covariance Σ̂ compare to the true mean µ⃗ and covariance Σ? Finally, use the KDE plot
function to generate a plot of the estimated joint and marginal densities. For this part,
you are not permitted to use the built-in Numpy functions for computing the sample mean and
covariance, but you may use built-in Numpy functions for other elementary operations, such
as summation and matrix multiplication.
Make sure to include the values of the sample mean ̂⃗µ and sample covariance Σ̂ and the
plots of the samples and the estimated probability density in the output. Take screenshots
of both your code and the output in the Jupyter notebook and include them in your PDF
submission. The screenshot resolution should be high enough for the code and output to be
readable after zooming in.
(f) Take the samples from part (e) and apply the following function f to all the samples to obtain
a new set of samples:
f (X⃗) = f
((
X1 X2
)⊤)
=
(
X1
∣∣∣X2 − µ2∣∣∣ + µ2 )⊤ if X1 ≥ µ1(
X1 −
∣∣∣X2 − µ2∣∣∣ + µ2 )⊤ if X1 < µ1 (4)
©Simon
Fraser University. All Rights Reserved. Sharing this publicly
constitutes both academic misconduct and copyright violation. 5
where µ1 and µ2 denotes the first and second elements of the true mean µ⃗ respectively, as
defined in part (e). More precisely, the new set of samples we consider is { f (x⃗i)}ni=1, where
{x⃗i}ni=1 is the set of samples from part (e).
Implement the transformation function in the Jupyter notebook to compute the trans-
formation f , and then run it on the samples from part (e). Plot the new samples as a
scatter plot using the scatter plot function. Also generate a plot of the estimated joint
and marginal densities using the KDE plot function. Compare the estimated joint and
marginal densities to those from part (e) and explain what the similarities and differences
are. What does this show in terms of the Gaussianity of the joint and marginal distribu-
tions of the new samples? Compute the sample mean ̂⃗µ and the sample covariance Σ̂
using the estimate mean and estimate cov functions. Compare the sample mean and
covariance to those from part (e) and explain what the similarities and differences are.
Make sure to include the values of the sample mean ̂⃗µ and sample covariance Σ̂ and the
plots of the samples and the estimated probability density in the output. Take screenshots
of both your code and the output in the Jupyter notebook and include them in your PDF
submission. The screenshot resolution should be high enough for the code and output to be
readable after zooming in.
(g) Consider two independent random variables Y and Z. Y follows a univariate Gaussian distribu-
tion with mean µ1 = 2 and variance σ21 = 10, and Z follows a univariate Gaussian distribution
with mean µ2 = 3 and variance σ22 = 3.
Generate 5000 samples of Y and 5000 samples of Z, which we will denote as {yi}ni=1 and
{zi}ni=1 respectively. Put the two sets of samples together into a set of 2D samples, i.e.:{(
yi zi
)⊤}n
i=1
. Plot this set of 2D samples as a scatter plot using the scatter plot func-
tion. Compute the sample mean ̂⃗µ and the sample covariance Σ̂ of the joint distribution(
Y Z
)⊤
from these samples using the estimate mean and estimate cov functions.
How do the sample mean and covariance compare to those from part (e)? Generate a
plot of the estimated joint and marginal densities using the KDE plot function. How does
this plot compare to the plot from part (e)? What does this show in terms of the rela-
tionship between the distributions considered in this part and the distribution in part
(e)?
2 Linear Regression (Theory and Programming)
The earth mover’s distance (EMD), also known as the Wasserstein distance, is a way of measuring
distance between two probability distributions. In this question, we will formulate EMD as a
problem similar to linear regression and solve for it.
(a) Consider two discrete probability distributions P, and Q, each over l possible outcomes, de-
noted as {ui}li=1 and {v j}lj=1 respectively. We can represent the probability mass functions of P
and Q as two vectors, p⃗, q⃗ ∈ Rl, where each element of p⃗ and q⃗, denoted as pi and q j, represents
the probability mass at ui and v j respectively.
©Simon
Fraser University. All Rights Reserved. Sharing this publicly
constitutes both academic misconduct and copyright violation. 6
We would like to transform P into Q by moving probability mass from U to V . More con-
cretely, if we think of the probability mass of P as l piles of dirt placed at {ui}li=1 where the
amount of dirt at ui is pi, our goal is to move appropriate amounts of dirt to {v j}lj=1, so that the
amount of dirt at v j is q j, which would represent the probability mass of Q.
Figure 1: A sample of the probability distributions P and Q when l = 10.
There are infinitely many ways to move the dirt around; each such way is known as a transport
plan. Each transport plan requires a different amount of work. Naturally, we would like to
accomplish our goal with as little as work as possible, and so would like to find the optimal
transport plan. This problem is known as optimal transport.
More precisely, the amount of dirt moved from one location to another is known as the flow,
and amount of work that is required is given by the flow multiplied by the distance moved. For
example, the work required to move t units of dirt from ui to v j is t · |ui − v j|. The earth mover’s
distance (EMD) is the minimal total work that is required to transform the piles of dirt given
by P to those given by Q.
Consider the transport plan (which is not necessarily optimal) shown in the table below. We
will move dirt step-by-step in a greedy fashion following three rules:
• We complete one step before moving onto the next.
• At each step we move as much dirt as possible until the destination pile has accumulated
the desired amount of dirt.
• If we run out of dirt in a source pile, we don’t transport any dirt.
Fill in the table below with the amount of flow and work at each step. Calculate the total
flow and total work and put them below the table. Round all numbers to two decimal
places.
Source Distribution P Target Distribution Q
Location Probability Mass Location Probability Mass
u1 = 1 p1 = 0.6 v1 = 2 q1 = 0.8
u2 = 3 p2 = 0.4 v2 = 5 q2 = 0.2
©Simon
Fraser University. All Rights Reserved. Sharing this publicly
constitutes both academic misconduct and copyright violation. 7
Step From To Distance Flow Work
1 u1 v1
2 u1 v2
3 u2 v1
4 u2 v2
Total Flow:
Total Work:
(b) We are given a discrete probability distribution P, which has a probability mass of 1 on u1. We
would like to transform P into some distribution Q over {v j}10j=1 such that the sum of probability
masses at odd-indexed outcomes, i.e.: {v2k+1}4k=0, is 0.8. Let γ⃗ ∈ R10 represent the transport plan
from P to Q, where the jth element, γ j, represents the flow from u1 to v j. For the transport plan
to be a valid plan to turn P into Q, two conditions must be satisfied: (1) the total flow from
u1 must match the probability mass of P on u1, and (2) the total flow to each v j must satisfy
the property of Q on the sum of odd-indexed probability masses. Mathematically, the former
can be characterized by the equation
∑10
j=1 γ j = 1, and the latter can be characterized by the
equation
∑4
k=0 γ2k+1 = 0.8.
Since there are two equations and 10 unknowns, there are infinitely many valid plans. For
this part, suppose we prefer the plan with the smallest flow between each pair of locations.
Mathematically, this can be expressed as the following optimization problem: minγ⃗
∑10
j=1 γ
2
j .
To balance this goal with the desire to find a valid plan, we can add terms to this objective that
penalize violations of the conditions above. To this end, we introduce two additional terms,
(
∑10
j=1 γ j − 1)2 and (
∑4
k=0 γ2k+1 − 0.8)2, so that the final optimization problem becomes:
min
γ⃗
L(γ⃗), where L(γ⃗) :=
10∑
j=1
γ2j +
10∑
j=1
γ j − 1
2
+
4∑
k=0
γ2k+1 − 0.8
2
Formulate this as a ridge regression problem, that is, write down what the data matrix A,
label vector y⃗, parameter vector w⃗ and the regularizer coefficient λ correspond to in the
context of this problem. Then write down an analytical expression for the optimal γ⃗ that
minimizes L, which can be in terms of A, y⃗ and λ. You do not need to compute the value of
optimal γ⃗ for this part.
(c) In the Jupyter notebook included in the starter code zip archive, implement the part 2c
function to compute the optimal γ⃗. Make sure to include the values of the data matrix A,
the label vector y⃗, the optimal γ⃗, the prediction and the loss in the output. Take screen-
shots of both your code and the output in Jupyter notebook and include them in your
PDF submission. The screenshot resolution should be high enough for the code and output to
be readable after zooming in.
(d) We are given two discrete probability distributions P and Q over {ui}li=1 and {v j}lj=1 respectively,
where l > 2. We would like to transform P into some distribution Q. Since P now has more
©Simon
Fraser University. All Rights Reserved. Sharing this publicly
constitutes both academic misconduct and copyright violation. 8
than one possible outcomes, the transport plan is now represented as a matrix Γ ∈ Rl×l, where
each element Γi, j represents the flow from ui to v j.
For the transport plan to be a valid plan to turn P into Q, two conditions must be satisfied: (1)
the total flow from each ui must match the probability mass of P, and (2) the total flow to each
v j must match the probability mass of Q. Mathematically, the former can be characterized
by l equations
∑l
j=1 Γi, j = pi for each i, and the latter can be characterized by l equations∑l
i=1 Γi, j = q j for each j.
Since there are 2l equations and l2 unknowns, there are infinitely many valid plans. As in part
(b), suppose we prefer the plan with the smallest flow between each pair of locations. This
preference, combined with the conditions above, yields the following optimization problem:
min
Γ
L(Γ), where L(Γ) :=
l∑
i=1
l∑
j=1
Γ2i, j +
l∑
i=1
l∑
j=1
Γi, j − pi
2
+
l∑
j=1
l∑
i=1
Γi, j − q j
2
Formulate this as a ridge regression problem, that is, write down what the data matrix A,
label vector y⃗, parameter vector w⃗ and the regularizer coefficient λ correspond to in the
context of this problem. Then write down an analytical expression for the optimal Γ that
minimizes L, which can be in terms of A, y⃗ and λ. You do not need to compute the value of
optimal Γ for this part.
Hint: The data matrix A should be a (2l) × l2 matrix, and Γ should be flattened into a vector,
which we will denote as vec(Γ).
(e) In this part, we take P and Q to be Poisson distributions with parameters 4 and 10 truncated
to {0, . . . , 9} and {5, . . . , 14} respectively. The exact values of p⃗, q⃗, {ui}li=1 and {v j}lj=1 are given
in the Jupyter notebook included in the zip archive. Implement the part 2e function in the
Jupyter notebook to compute the optimal vec(Γ). Make sure to include the values of the
data matrix A, the label vector y⃗, the optimal Γ, the prediction and the loss in the output.
Take screenshots of both your code and the output in Jupyter notebook and include them
in your PDF submission. The screenshot resolution should be high enough for the code and
output to be readable after zooming in.
(f) Consider the same setting as part (d), except that we now prefer the plan with the minimum
total work rather than the plan with the smallest flow between each pair of locations. To
formulate the objective function, we construct a distance matrix D ∈ Rl×l, where each element
Di, j is the distance between ui and v j (i.e.: |ui − v j|) and is non-negative. The total work is∑l
i=1
∑l
j=1 Γi, jDi, j, which can also be written as ⟨Γ,D⟩F or vec(Γ)⊤vec(D), where ⟨·, ·⟩F denotes
the Frobenius inner product and vec(·) denotes a function that flattens the input matrix into a
column vector.
This results in the following optimization problem:
min
Γ
L(Γ), where L(Γ) :=
l∑
i=1
l∑
j=1
Γi, jDi, j
2
+
l∑
i=1
l∑
j=1
Γi, j − pi
2
+
l∑
j=1
l∑
i=1
Γi, j − q j
2
©Simon
Fraser University. All Rights Reserved. Sharing this publicly
constitutes both academic misconduct and copyright violation. 9
where the distance matrix D is treated as a constant.
Rewrite the loss function in terms of vectors and matrices, which may be ones given in
the question or ones you define, and then find an analytical expression for the optimal
Γ that minimizes L. To find an optimal Γ, you need to derive the expression for a critical
point and prove the convexity of the loss function. You do not need to compute the value of an
optimal Γ for this part.
(g) Consider the same distributions P and Q as in part (e). Implement the part 2g function in
the Jupyter notebook to compute the optimal vec(Γ). Make sure to include the values of
the data matrix A, the label vector y⃗, the optimal Γ, the distance matrix D, the prediction
and the loss in the output. Take screenshots of both your code and the output in Jupyter
notebook and include them in your PDF submission. The screenshot resolution should be
high enough for the code and output to be readable after zooming in.
©Simon
Fraser University. All Rights Reserved. Sharing this publicly
constitutes both academic misconduct and copyright violation. 10