Machine Learning
CMPT 726 / 419
Ke Li

SFU School of Computing Science

February 11, 2022
Announcements
Assignment 1 due on Sunday at 11:59pm.

• Please submit your assignment to CrowdMark - you should have all gotten an
email. (This supersedes the instructions on the first page of the assignment.)

• You do not need to split the PDF into different files for different parts. Instead
you should upload the same PDF for each part and select the relevant pages
for each part.

• Upload your assignment early! The server could go down, the Wi-Fi could go
down, there might be a power outage, your laptop could freeze, etc.

• We give a 30-minute buffer to resolve any technical difficulties, and you will lose
1% every minute after the 30 minutes are up. No exceptions will be granted.
Recap
General Polynomial Features
Polynomial features map raw data to monomials of varying degrees.

E.g.: polynomial features up to degree 2:

ϕ(x⃗) = (1, x₁, …, xₙ, x₁², x₁x₂, …, x₁xₙ, x₂², x₂x₃, …, xₙ²)⊤

Here 1 is the degree-0 term; x₁, …, xₙ are the degree-1 terms; and x₁², x₁x₂, …, x₁xₙ, x₂², x₂x₃, …, xₙ² are the degree-2 terms (don’t forget the cross terms!).

In general, perform the multinomial expansion of (1 + x₁ + ⋯ + xₙ)^d and make each term (without the coefficient) a basis function.
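To make this concrete, here is a minimal NumPy sketch of the degree-2 feature map above (an illustration, not code from the course; the helper name poly_features_deg2 is made up):

    import numpy as np

    def poly_features_deg2(x):
        """Map a raw vector x = (x_1, ..., x_n) to polynomial features up to
        degree 2: the constant 1, all x_i, and all products x_i * x_j with
        i <= j (squares and cross terms)."""
        n = len(x)
        feats = [1.0]                      # degree 0
        feats.extend(x)                    # degree 1
        for i in range(n):
            for j in range(i, n):
                feats.append(x[i] * x[j])  # degree 2, including cross terms
        return np.array(feats)

    # E.g., for n = 2: (1, x1, x2, x1^2, x1*x2, x2^2)
    print(poly_features_deg2(np.array([2.0, 3.0])))  # [1. 2. 3. 4. 6. 9.]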
Overfitting
[Plots of polynomial fits for d = 5, d = 10, d = 20: the loss gets smaller as the degree increases.]

What degree should we choose?

Should we choose d = 20 because it attains the lowest loss?

Recall that the data is actually generated from a degree-4 polynomial. So the trained model is too complex - it is fitting to noise. This is known as overfitting.

Intuitively, d = 5 feels like a better choice than d = 20. Why is this exactly?
Generalization
Let’s generate a new set of data points from the same polynomial as before:

yᵢ = −0.1xᵢ⁴ − 0.4xᵢ³ − 0.5xᵢ² + 0.5xᵢ + 1 + εᵢ, where εᵢ ∼ 𝒩(0, 0.1), xᵢ ∼ Uniform(−1, 1)

This new set of data points (green) comes from the same distribution as the original set of data points (blue).

While the loss on the old data is smaller for d = 20, it is larger on the new data!

So the trained model does not generalize well to new data.

[Plots of the d = 5 and d = 20 fits on the old (blue) and new (green) data.]
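A minimal NumPy sketch of this data-generating process (illustrative only; the seed and sample sizes are made up, and 𝒩(0, 0.1) is read as variance 0.1 per the deck’s 𝒩(μ, σ²) convention):

    import numpy as np

    rng = np.random.default_rng(0)  # seed chosen arbitrarily

    def generate_data(n):
        """Sample n points from the degree-4 polynomial above with Gaussian
        noise. N(0, 0.1) has standard deviation sqrt(0.1)."""
        x = rng.uniform(-1.0, 1.0, size=n)
        y = -0.1*x**4 - 0.4*x**3 - 0.5*x**2 + 0.5*x + 1 \
            + rng.normal(0.0, np.sqrt(0.1), size=n)
        return x, y

    x_old, y_old = generate_data(20)  # the original (blue) points
    x_new, y_new = generate_data(20)  # new (green) points, same distribution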
Variance
Consider the trained models on the old and new data.

When d = 4, the models don’t differ by much.

When d = 20, the models differ substantially.

[Plots of the d = 4 and d = 20 models trained on the old vs. new data.]
Goal of Machine Learning
Training set or train set: set of observations used for training / fitting the model

Testing set or test set: set of new observations (which are not used for training)

Goal of machine learning is to achieve low loss / high accuracy on the testing set.

The loss on the testing set should approximate how much loss we expect the model to incur on
unseen observations when deployed in the real world.

Never, ever train on the testing set! (Otherwise the testing set becomes pointless)

In fact, the testing set should not be used at any point during the development process. It
should only be used once when you are completely satisfied with the model and are ready
to deploy it.
Goal of Machine Learning
Training / train loss and testing / test loss: loss on the training set and testing set respectively.

Testing loss is generally higher than training loss.

When the model achieves low training loss and low testing loss, the model is said to generalize.

Generalization gap: Testing loss − Training loss

Overfitting: When the model is fitted too well to the data, i.e.: when testing loss is much higher
than training loss

Underfitting: When the model is not fitted well enough, i.e.: when both training and testing loss
are high and the generalization gap is small

Model capacity / expressive power: The flexibility of the model in fitting the data. A model that
can fit more “shapes” of the data distribution is said to be more expressive / have more model
capacity / expressive power.
Underfitting vs. Overfitting
[Plots: d = 1 (underfitting), d = 4 (just right), d = 20 (overfitting).]
Underfitting vs. Overfitting
[Plot: loss vs. model capacity, with underfitting at low capacity, a "just right" point in the middle, and overfitting at high capacity.]
Avoiding Overfitting
How can we avoid overfitting?

• We can collect more training data, which is often expensive or impractical

• We can manually clean the training data to reduce the amount of noise
present, which is often expensive or impractical

• We can limit the model capacity

• We can add a regularizer/impose a prior on the parameters (more on what
this means later)
Adding More Data
What if we can collect more data?

In general, training a model on more data makes it generalize better. So if our goal is to produce
a model that achieves the best accuracy on the testing set, it is always better to train on as much
data as possible. (We will see an exception to this next.)
[Plots of d = 5, d = 10, d = 20 fits with more training data.]
Hyperparameters and Model Selection
Hyperparameters specify which model is used to fit the
data, e.g.: the degree of the polynomial.

Many (but not all) hyperparameters control the model
capacity.

Choosing the best hyperparameter setting is known as
model selection.

It is tempting to pick the hyperparameters based on
which setting achieves the lowest testing loss.
However, we must never, ever do so!

• This would cause overfitting to the testing set, which
would mean the model would generalize poorly to
actual unseen data when deployed in the real world.

• This is dangerous.
[Plot: loss vs. model capacity, annotated with the underfitting, just right, and overfitting regimes.]
Validation Set
To avoid performing model selection based on the testing loss, we can simulate the testing loss by
carving out a portion of the training set which is not used for training the model.

The carved-out set is known as a validation set, and the remainder becomes a smaller training set.

We will use the validation set to simulate the role of the testing set, and compute a loss on the
validation set, which is known as the validation loss. This validation loss is used for model selection.

So instead of picking the hyperparameter setting that minimizes the testing loss, we will pick the one
that minimizes the validation loss.

Both the validation set and the testing set are held-out data, i.e.: data that is not used to train the
model. This is a situation where we don’t train our model on as much data as we have access to.

[Diagram: validation losses for each hyperparameter setting, with the lowest one marked as the selected model.]
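As an illustration of validation-based model selection (my own sketch, not the course’s code; the split sizes, candidate degrees, and helper names are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    # Data from the degree-4 polynomial used in the lecture
    x = rng.uniform(-1, 1, size=50)
    y = -0.1*x**4 - 0.4*x**3 - 0.5*x**2 + 0.5*x + 1 \
        + rng.normal(0, np.sqrt(0.1), size=50)

    x_train, y_train = x[:40], y[:40]   # smaller training set
    x_val, y_val = x[40:], y[40:]       # carved-out validation set

    def design_matrix(x, d):
        """Columns 1, x, x^2, ..., x^d (polynomial features)."""
        return np.stack([x**k for k in range(d + 1)], axis=1)

    best_d, best_loss = None, np.inf
    for d in [1, 2, 4, 5, 10, 20]:
        # lstsq is a numerically stable way to fit OLS
        w, *_ = np.linalg.lstsq(design_matrix(x_train, d), y_train, rcond=None)
        val_loss = np.mean((y_val - design_matrix(x_val, d) @ w) ** 2)
        if val_loss < best_loss:
            best_d, best_loss = d, val_loss
    print("selected degree:", best_d)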
K-Fold Cross Validation

Drawback of a validation set: the training set is smaller, so the validation loss is a less
accurate gauge of true performance on the testing set.

K-fold cross validation: Divide the dataset into K equal-sized subsets, and train each model
K times. Each time, treat one of the subsets as the validation set and the others as the
training set. At the end, average the K validation losses and use the average to perform
model selection.
Useful when little data is available, but comes
at the expense of greater computational cost.
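A minimal sketch of K-fold cross validation for the same degree-selection problem (again illustrative; K = 5 and all names are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=50)
    y = -0.1*x**4 - 0.4*x**3 - 0.5*x**2 + 0.5*x + 1 \
        + rng.normal(0, np.sqrt(0.1), size=50)

    def design_matrix(x, d):
        return np.stack([x**k for k in range(d + 1)], axis=1)

    def cv_loss(x, y, d, K=5):
        """Average validation MSE over K folds for a degree-d polynomial."""
        folds = np.array_split(rng.permutation(len(x)), K)
        losses = []
        for k in range(K):
            val = folds[k]  # fold k is the validation set this round
            train = np.concatenate([folds[j] for j in range(K) if j != k])
            w, *_ = np.linalg.lstsq(design_matrix(x[train], d), y[train],
                                    rcond=None)
            losses.append(np.mean((y[val] - design_matrix(x[val], d) @ w) ** 2))
        return np.mean(losses)  # average the K validation losses

    best_d = min([1, 2, 4, 5, 10, 20], key=lambda d: cv_loss(x, y, d))
    print("selected degree:", best_d)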
After Model Selection
After we have finished performing model selection, we should train the
model one last time with the selected hyperparameters on all the data we
have access to before deploying it.

This is because, in general, training a model on more data makes it
generalize better, and after model selection is complete, the held-out data
will no longer be useful for model selection and can be better used to
train the final model.
In-Class Quiz
Q1: Which of the following facts is NOT true?
(A) In general, the training loss is less than the validation loss
(B) In general, the validation loss is less than the testing loss
(C) Hyperparameters should be chosen so that the training loss is
minimized
(D) Hyperparameters should be chosen so that the validation loss is
minimized
(E) Hyperparameters should be chosen so that the testing loss is
minimized
(F) All of the above are true
(G) Don’t know
Overfitted OLS Model Parameters

Recall: For ordinary least squares, the optimal parameters are:

w⃗* := argmin_w⃗ L(w⃗) = (X⊤X)⁻¹X⊤y⃗ = X†y⃗

When d = 20, what happens to w⃗*?

X⊤X is symmetric, so we can perform an eigendecomposition, i.e.: X⊤X = UΛU⊤.

It turns out that the smallest eigenvalue of X⊤X (or equivalently the smallest singular value of X) is < 1.8 × 10⁻⁷.

Since (X⊤X)⁻¹ = (UΛU⊤)⁻¹ = (U⊤)⁻¹Λ⁻¹U⁻¹ = UΛ⁻¹U⊤ (using U⁻¹ = U⊤ for the orthogonal matrix U), the largest eigenvalue of (X⊤X)⁻¹ is > 1/(1.8 × 10⁻⁷) > 5.5 × 10⁶.

Since X⊤X is positive semi-definite, all the entries of Λ must be non-negative, and so all the entries of Λ⁻¹ must be as well.

So UΛ⁻¹U⊤ is also a singular value decomposition of (X⊤X)⁻¹. This implies that the largest singular value of (X⊤X)⁻¹ is > 5.5 × 10⁶, and so ∥(X⊤X)⁻¹∥₂ > 5.5 × 10⁶. So ∥w⃗*∥₂ could be 5.5 × 10⁶ times larger than ∥X⊤y⃗∥₂.
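To see this effect numerically, here is a small sketch (my own; the inputs are made up, so the numbers will differ from the slide’s 1.8 × 10⁻⁷) that computes the smallest singular value of X for degree-20 polynomial features:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=50)        # made-up inputs, not the slide's data

    X = np.stack([x**k for k in range(21)], axis=1)  # degree-20 poly features
    s = np.linalg.svd(X, compute_uv=False)           # singular values, descending

    print("smallest singular value of X:", s[-1])
    print("smallest eigenvalue of X^T X:", s[-1]**2)   # eigvals of X^T X = s^2
    print("largest eigenvalue of (X^T X)^-1:", 1.0 / s[-1]**2)
    # The tiny smallest eigenvalue makes (X^T X)^-1 enormous, which is why
    # the OLS parameter norm can blow up at high degree.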
Ridge Regression
Recall: When OLS overfits, ∥w⃗*∥₂ could be large.

Idea: Change the loss function to penalize large ∥w⃗∥₂.

OLS: L(w⃗) = ∑_{i=1}^N (yᵢ − w⃗⊤x⃗ᵢ)² = ∥y⃗ − Xw⃗∥₂²,

where X is the matrix with rows x⃗₁⊤, …, x⃗_N⊤ and y⃗ = (y₁, …, y_N)⊤.

Ridge regression: L(w⃗) = ∑_{i=1}^N (yᵢ − w⃗⊤x⃗ᵢ)² + λ∥w⃗∥₂² = ∥y⃗ − Xw⃗∥₂² + λ∥w⃗∥₂²,

where λ > 0 is a hyperparameter.
Linear Regression

[Figure slide.]
Ridge Regression

L(w⃗) = ∥y⃗ − Xw⃗∥₂² + λ∥w⃗∥₂²
     = (y⃗ − Xw⃗)⊤(y⃗ − Xw⃗) + λw⃗⊤w⃗
     = y⃗⊤y⃗ − (Xw⃗)⊤y⃗ − y⃗⊤(Xw⃗) + (Xw⃗)⊤(Xw⃗) + λw⃗⊤w⃗
     = y⃗⊤y⃗ − (2X⊤y⃗)⊤w⃗ + w⃗⊤(X⊤X)w⃗ + λw⃗⊤w⃗

Setting the derivative to zero:

∂L/∂w⃗ = ∂(y⃗⊤y⃗)/∂w⃗ − ∂((2X⊤y⃗)⊤w⃗)/∂w⃗ + ∂(w⃗⊤(X⊤X)w⃗)/∂w⃗ + ∂(λw⃗⊤w⃗)/∂w⃗ = 0
0 − 2X⊤y⃗ + (X⊤X + (X⊤X)⊤)w⃗ + λ(I + I⊤)w⃗ = 0
−2X⊤y⃗ + 2(X⊤X)w⃗ + 2λIw⃗ = 0
−2X⊤y⃗ + 2(X⊤X + λI)w⃗ = 0
2(X⊤X + λI)w⃗ = 2X⊤y⃗
(X⊤X + λI)w⃗ = X⊤y⃗
w⃗ = (X⊤X + λI)⁻¹X⊤y⃗
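A minimal sketch of this closed-form solution (illustrative; ridge_fit and the test data are made up):

    import numpy as np

    def ridge_fit(X, y, lam):
        """Solve (X^T X + lam*I) w = X^T y. Using solve() instead of forming
        the explicit inverse is more numerically stable."""
        A = X.T @ X + lam * np.eye(X.shape[1])
        return np.linalg.solve(A, X.T @ y)

    # Tiny usage example with random data
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.1, 50)
    print(ridge_fit(X, y, lam=0.1))  # close to the true weights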
Ridge Regression
Recall:

∂L/∂w⃗ = −2X⊤y⃗ + 2(X⊤X + λI)w⃗

∂²L/∂w⃗∂w⃗⊤ = ∂/∂w⃗ (∂L/∂w⃗)
          = ∂/∂w⃗ (−2X⊤y⃗ + 2(X⊤X + λI)w⃗)
          = 0 + 2(X⊤X + λI)⊤
          = 2(X⊤X + λI)

Claim: X⊤X + λI ≻ 0

Proof:

For any w⃗ ≠ 0⃗,

w⃗⊤(X⊤X + λI)w⃗ = w⃗⊤(X⊤X)w⃗ + w⃗⊤(λI)w⃗ = (Xw⃗)⊤(Xw⃗) + λw⃗⊤w⃗ = ∥Xw⃗∥₂² + λ∥w⃗∥₂²

Since ∥Xw⃗∥₂² ≥ 0, λ > 0, and ∥w⃗∥₂² > 0, we have ∥Xw⃗∥₂² + λ∥w⃗∥₂² > 0 for all w⃗ ≠ 0⃗.

So the loss function is strictly convex.
Ridge Regression
For a strictly convex function, there is a unique critical point, which is a local minimum, which is a global minimum.

So, the critical point

w⃗* = (X⊤X + λI)⁻¹X⊤y⃗

is the only optimal parameter vector, regardless of whether X is full-rank or not.
Ridge Regression: Summary
Model: ŷ = w⃗⊤x⃗

Parameters: w⃗

Loss function: L(w⃗) = ∑_{i=1}^N (yᵢ − w⃗⊤x⃗ᵢ)² + λ∥w⃗∥₂² = ∥y⃗ − Xw⃗∥₂² + λ∥w⃗∥₂²,

where X is the matrix with rows x⃗₁⊤, …, x⃗_N⊤ and y⃗ = (y₁, …, y_N)⊤.

Optimal parameters: w⃗* := argmin_w⃗ L(w⃗) = (X⊤X + λI)⁻¹X⊤y⃗
OLS vs. Ridge Regression
Model: ŷ = w⃗⊤x⃗

Parameters: w⃗

OLS:

Loss function: L(w⃗) = ∑_{i=1}^N (yᵢ − w⃗⊤x⃗ᵢ)² = ∥y⃗ − Xw⃗∥₂²

Optimal parameters: w⃗* := argmin_w⃗ L(w⃗) = (X⊤X)⁻¹X⊤y⃗

Ridge regression:

Loss function: L(w⃗) = ∑_{i=1}^N (yᵢ − w⃗⊤x⃗ᵢ)² + λ∥w⃗∥₂² = ∥y⃗ − Xw⃗∥₂² + λ∥w⃗∥₂²

Optimal parameters: w⃗* := argmin_w⃗ L(w⃗) = (X⊤X + λI)⁻¹X⊤y⃗
In-Class Quiz
Q2: Which of the following facts about ridge regression is NOT true?
(A) Ridge regression is less prone to overfitting compared to ordinary least
squares
(B) Ridge regression always has a unique optimal parameter vector
(C) Compared to ordinary least squares, ridge regression adds a
regularizer
(D) Ridge regression uses more hyperparameters than ordinary least
squares
(E) Ridge regression uses more parameters than ordinary least squares
(F) Ridge regression uses a strictly convex loss function
(G) All of the above are true
(H) Don’t know
Probability Review
Terminology
• Sample space Ω: Set of all possible outcomes of a random phenomenon

  E.g.: Toss two coins - the sample space is {HH, HT, TH, TT}

• Event E: A subset of the possible outcomes

  E.g.: The event that the second coin turns out to be heads, i.e.: {HH, TH}

• Probability (formally a “probability measure”) Pr(⋅): A function that assigns every
  possible event a number, representing the chance that the event happens

  E.g.: If both coins are fair, Pr(second coin turns out to be heads) = 1/2

• Random variables (RVs): Variables whose values depend on the outcome of a random
  phenomenon

  E.g.: Xᵢ = 1 if the ith coin turns out to be heads and 0 otherwise, or Y = ∑ᵢ Xᵢ (the number of heads)
Terminology
• Discrete random variables: RVs that take on values from a discrete set

• Continuous random variables: RVs that take on values from a continuous
range

• Probability distribution: a function that characterizes the probability of
different realizations of RVs

• Can be represented as cumulative distribution functions (cdfs), probability
mass functions (pmfs) in the case of discrete RVs, or probability density
functions (pdfs)

• Support of a distribution supp(X): the set of realizations of RVs where the pmf
(in the case of discrete RVs) or pdf (in the case of continuous RVs) is non-zero
Discrete vs. Continuous RVs
Discrete random variables:

• Cumulative distribution function (cdf): F_X(x) = Pr(X ≤ x)

• Probability mass function (pmf): p_X(x) = Pr(X = x)

• Examples: Bernoulli RVs, Categorical RVs

Continuous random variables:

• Cumulative distribution function (cdf): F_X(x) = Pr(X ≤ x)

• Probability density function (pdf): f_X(x) = (d/dx) F_X(x)

• Examples: Uniform RVs, Normal RVs
Discrete vs. Continuous RVs

[Figures: pmf and cdf of a discrete RV; pdf and cdf of a continuous RV. Credit: Wikipedia]
Discrete vs. Continuous RVs
Discrete random variables:

• p_X(x) = Pr(X = x) ≥ 0 ∀x

• p_X(x) ≤ 1 ∀x

• ∑_{x∈Ω} p_X(x) = 1

• F_X(x) = Pr(X ≤ x) = ∑_{x̃∈Ω: x̃≤x} p_X(x̃)

Continuous random variables:

• f_X(x) = (d/dx) F_X(x) ≥ 0 ∀x

• Pr(X = x) = 0 ∀x ⟹ f_X(x) ≠ Pr(X = x)

• f_X(x) may be larger than 1 (could be arbitrarily large), so the cdf could have arbitrarily high slope

• ∫_{−∞}^{∞} f_X(x) dx = 1

• F_X(x) = Pr(X ≤ x) = ∫_{−∞}^{x} f_X(x̃) dx̃

In both cases, the cdf is non-decreasing.
Common Discrete Distributions
Bernoulli distribution: X ∼ Bernoulli(p)

p_X(x) = Pr(X = x) = p if x = 1, and 1 − p if x = 0

More mathematically convenient form:

p_X(x) = Pr(X = x) = p^x (1 − p)^(1−x)
Common Discrete Distributions
Categorical distribution: X ∼ Categorical(p₁, …, p_k)

p_X(x) = Pr(X = x) = p₁ if x = 1; p₂ if x = 2; …; p_k if x = k

More mathematically convenient form:

p_X(x) = Pr(X = x) = ∏_{i=1}^k p_i^[x=i], where [x = i] = 1 if x = i and 0 if x ≠ i
Common Continuous Distributions
Uniform distribution: X ∼ Uniform(a, b)

f_X(x) = 1/(b − a) if x ∈ [a, b], and 0 if x ∉ [a, b]

supp(X) = [a, b]

[Figure: pdf and cdf of the uniform distribution. Credit: Wikipedia]
Common Continuous Distributions
Normal distribution: X ∼ 𝒩(μ, σ²)
(In ML, more commonly referred to as the Gaussian distribution)

f_X(x) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²))

supp(X) = ℝ

Standard normal distribution: Z ∼ 𝒩(0, 1)

Z + μ ∼ 𝒩(μ, 1) and σZ ∼ 𝒩(0, σ²) ⟹ μ + σZ ∼ 𝒩(μ, σ²)

Hence, (X − μ)/σ ∼ 𝒩(0, 1).

[Figure: pdf and cdf of the normal distribution. Credit: Wikipedia]
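A quick sampling sanity check of the standardization identity (my own sketch; μ, σ, and the sample size are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 2.0, 3.0

    z = rng.standard_normal(1_000_000)   # Z ~ N(0, 1)
    x = mu + sigma * z                   # should be distributed as N(mu, sigma^2)
    print(x.mean(), x.var())             # approx 2.0 and 9.0

    standardized = (x - mu) / sigma      # should be distributed as N(0, 1)
    print(standardized.mean(), standardized.var())  # approx 0.0 and 1.0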
Multiple Random Variables
• What if we have multiple random variables, which may depend on one another? How do
we represent the dependence between them?

• E.g.: Tomorrow’s temperature and snowfall

• Going forward, we will use slightly different notation:

  • We will use capital letters, e.g.: X, to denote RVs and corresponding lowercase letters,
    e.g.: x, to denote a realized value of the RVs. We can therefore drop the subscripts in
    p_X(x) and F_X(x).

  • We will overload the notation p(x) to mean the pmf p_X(x) if X is discrete and the pdf
    f_X(x) if X is continuous.

  • So, the cdf of X will be denoted as F(x) and the pdf/pmf of X will be denoted as p(x).
Joint Probability Distributions
Two random variables:

F(x, y) = Pr(X ≤ x and Y ≤ y)

p(x, y) = Pr(X = x and Y = y) if X, Y are discrete; (∂²/∂x∂y) F(x, y) if X, Y are continuous

In general:

F(x₁, …, xₙ) = Pr(X₁ ≤ x₁ and ⋯ and Xₙ ≤ xₙ)

p(x₁, …, xₙ) = Pr(X₁ = x₁ and ⋯ and Xₙ = xₙ) if X₁, …, Xₙ are discrete; (∂ⁿ/∂x₁⋯∂xₙ) F(x₁, …, xₙ) if X₁, …, Xₙ are continuous

(Credit: Jeff Howbert)
Marginal Probability Distributions
Two random variables:

p(x) = ∑_{y∈Ω_Y} p(x, y) if X, Y are discrete; ∫_{−∞}^{∞} p(x, y) dy if X, Y are continuous

In general (“marginalizing out X_{m+1}, …, Xₙ”):

p(x₁, …, x_m) = ∑_{x_{m+1}∈Ω_{X_{m+1}}} ⋯ ∑_{xₙ∈Ω_{Xₙ}} p(x₁, …, xₙ) if X₁, …, Xₙ are discrete; ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} p(x₁, …, xₙ) dx_{m+1} ⋯ dxₙ if X₁, …, Xₙ are continuous

(Credit: Jeff Howbert)
Conditional Probability Distributions
Two random variables:

p(y|x) = p(x, y)/p(x), where p(x) = ∑_{y∈Ω_Y} p(x, y) if X, Y are discrete and p(x) = ∫_{−∞}^{∞} p(x, y) dy if X, Y are continuous

In general:

p(x_{m+1}, …, xₙ | x₁, …, x_m) = p(x₁, …, xₙ)/p(x₁, …, x_m), where p(x₁, …, x_m) is obtained by marginalizing out x_{m+1}, …, xₙ (sums for discrete RVs, integrals for continuous RVs)

(Credit: Jeff Howbert)
Chain Rule of Probability
Two random variables:

p(y|x) = p(x, y)/p(x) ⟹ p(x, y) = p(x) p(y|x)

In general:

p(x₁, …, xₙ) = p(x₁) p(x₂|x₁) p(x₃|x₁, x₂) ⋯ p(xₙ|x₁, …, x_{n−1})
Chain Rule of Probability (Conditional Case)
Two random variables:

p(y|x, z) = p(x, y|z)/p(x|z) ⟹ p(x, y|z) = p(x|z) p(y|x, z)

In general:

p(x₁, …, xₙ | z₁, …, z_l) = p(x₁|z₁, …, z_l) p(x₂|x₁, z₁, …, z_l) ⋯ p(xₙ|x₁, …, x_{n−1}, z₁, …, z_l)
Independence
Two random variables X and Y are independent if:

p(y|x) = p(y) ∀x (or equivalently, p(x|y) = p(x) ∀y)

Since p(x, y) = p(x) p(y|x) in general, an equivalent definition is

p(x, y) = p(x) p(y)

Random variables X₁, …, Xₙ are (mutually) independent if:

p(x₁, …, xₙ) = p(x₁)⋯p(xₙ)
Conditional Independence
Two random variables X and Y are conditionally independent given Z = z if:

p(y|x, z) = p(y|z) ∀x (or equivalently, p(x|y, z) = p(x|z) ∀y)

Since p(x, y|z) = p(x|z) p(y|x, z) in general, an equivalent definition is

p(x, y|z) = p(x|z) p(y|z)

Random variables X₁, …, Xₙ are conditionally independent given Z₁ = z₁, …, Z_l = z_l if:

p(x₁, …, xₙ | z₁, …, z_l) = p(x₁|z₁, …, z_l)⋯p(xₙ|z₁, …, z_l)
Bayes’ Rule
An identity that relates p(y|x) to p(x|y):

p(y|x) = p(x, y)/p(x) = p(y) p(x|y)/p(x)

Expanding p(x) further is often useful (assuming continuous RVs):

p(y|x) = p(y) p(x|y)/p(x) = p(y) p(x|y) / ∫_{−∞}^{∞} p(x, y) dy = p(y) p(x|y) / ∫_{−∞}^{∞} p(y) p(x|y) dy

True in the conditional case as well (show this as an exercise):

p(y|x, z₁, …, z_l) = p(y|z₁, …, z_l) p(x|y, z₁, …, z_l) / p(x|z₁, …, z_l)
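A small worked example of Bayes’ rule for discrete RVs (my own, with made-up numbers), where the integral in the denominator becomes a sum:

    # Hypothetical example: Y = disease status (1 or 0), X = test result (1 or 0).
    # All probabilities below are made up for illustration.
    p_y = {1: 0.01, 0: 0.99}                    # prior p(y)
    p_x_given_y = {(1, 1): 0.95, (0, 1): 0.05,  # p(x | y), keyed by (x, y)
                   (1, 0): 0.10, (0, 0): 0.90}

    x = 1  # we observe a positive test
    # Denominator: p(x) = sum over y of p(y) p(x | y)
    p_x = sum(p_y[y] * p_x_given_y[(x, y)] for y in p_y)
    # Bayes' rule: p(y | x) = p(y) p(x | y) / p(x)
    posterior = {y: p_y[y] * p_x_given_y[(x, y)] / p_x for y in p_y}
    print(posterior)  # {1: ~0.088, 0: ~0.912}: most positives are false positives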
Expected Value
Two random variables:

E[f(X, Y)] = ∑_{x∈Ω_X} ∑_{y∈Ω_Y} f(x, y) p(x, y) in the discrete case; ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) p(x, y) dx dy in the continuous case

In general:

E[f(X₁, …, Xₙ)] = ∑_{x₁∈Ω_{X₁}} ⋯ ∑_{xₙ∈Ω_{Xₙ}} f(x₁, …, xₙ) p(x₁, …, xₙ) in the discrete case; ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} f(x₁, …, xₙ) p(x₁, …, xₙ) dx₁ ⋯ dxₙ in the continuous case

(Technically this is the “law of the unconscious statistician” rather than the definition.)

f(⋅) could be vector-valued, in which case E[f(X₁, …, Xₙ)] = (E[f₁(X₁, …, Xₙ)], …, E[f_m(X₁, …, Xₙ)])⊤, where fᵢ(⋅) is the ith component of f(⋅).
Expected Value
Linearity of expectation:

E[X + Y] = E[X] + E[Y] (always true, even if X and Y are dependent)

E[cX] = c E[X]

Not multiplicative unless independent:

In general, E[XY] ≠ E[X] E[Y].

However, if X and Y are independent, E[XY] = E[X] E[Y].
Moments
Mean: E[X]

Covariance: Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y]

Covariance is symmetric: Cov(X, Y) = Cov(Y, X)

Variance: Var(X) := Cov(X, X) = E[(X − E[X])²] = E[X²] − (E[X])²

Standard deviation: √Var(X)

Pearson’s correlation coefficient: ρ_{X,Y} = Cov(X, Y) / √(Var(X) Var(Y)) ∈ [−1, 1]
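A short sketch (my own, with made-up parameters) estimating these moments from samples:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(0, 1, size=100_000)
    y = 2.0 * x + rng.normal(0, 1, size=100_000)  # Y depends on X

    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # Cov(X, Y)
    var_x, var_y = x.var(), y.var()                    # Var(X) = Cov(X, X)
    rho = cov_xy / np.sqrt(var_x * var_y)              # Pearson correlation

    print(cov_xy)  # approx 2.0, since Cov(X, 2X + noise) = 2 Var(X)
    print(rho)     # approx 2 / sqrt(1 * 5) ~ 0.894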
Zero Covariance vs. Independence
If X and Y are independent,

Cov(X, Y) = E[XY] − E[X] E[Y] = E[X] E[Y] − E[X] E[Y] = 0

However, if Cov(X, Y) = 0, X and Y are not necessarily independent.
Conditional Expectation
Two random variables X and Y conditioned on Z = z:

E[f(X, Y) | Z = z] = ∑_{x∈Ω_X} ∑_{y∈Ω_Y} f(x, y) p(x, y|z) in the discrete case; ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) p(x, y|z) dx dy in the continuous case

In general:

E[f(X₁, …, Xₙ) | Z₁ = z₁, …, Z_l = z_l] = ∑_{x₁∈Ω_{X₁}} ⋯ ∑_{xₙ∈Ω_{Xₙ}} f(x₁, …, xₙ) p(x₁, …, xₙ | z₁, …, z_l) in the discrete case; ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} f(x₁, …, xₙ) p(x₁, …, xₙ | z₁, …, z_l) dx₁ ⋯ dxₙ in the continuous case
Conditional Moments
Conditioning on one variable:

Conditional mean: E[X | Z = z]

Conditional variance: Var(X | Z = z) = E[(X − E[X | Z = z])² | Z = z]

In general:

Conditional mean: E[X | Z₁ = z₁, …, Z_l = z_l]

Conditional variance: Var(X | Z₁ = z₁, …, Z_l = z_l) = E[(X − E[X | Z₁ = z₁, …, Z_l = z_l])² | Z₁ = z₁, …, Z_l = z_l]

Law of total expectation:

E_{Z₁,…,Z_l}[E_X[f(X, Z₁, …, Z_l) | Z₁, …, Z_l]] = E_{X,Z₁,…,Z_l}[f(X, Z₁, …, Z_l)]
Entropy
Entropy measures the amount of uncertainty in a discrete distribution.

For a discrete RV X, the entropy of X is defined as:

H(X) = E[−log_b p(X)] = −∑_{x∈Ω_X} p(x) log_b(p(x))

(When evaluating the above expression, 0 log 0 should be treated as if it evaluates to 0.)

Typically the base b of the logarithm is e or 2. The units of entropy are known as
“nats” if the base is e, and “bits” if the base is 2. When the base is not specified, in
ML, the base is typically assumed to be e.
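A small sketch (my own) of computing entropy with the 0 log 0 = 0 convention handled explicitly:

    import numpy as np

    def entropy(p, base=np.e):
        """Entropy of a discrete distribution given as an array of probabilities.
        Terms with p(x) = 0 are dropped, per the 0 log 0 = 0 convention."""
        p = np.asarray(p, dtype=float)
        nz = p[p > 0]
        return -np.sum(nz * np.log(nz)) / np.log(base)

    print(entropy([1.0, 0.0]))          # 0: deterministic
    print(entropy([0.5, 0.5], base=2))  # 1 bit: fair coin
    print(entropy([0.25] * 4, base=2))  # 2 bits: uniform over 4 outcomes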
Entropy
Properties:

For any discrete RV X, H(X) ≥ 0.

H(X) = 0 if and only if X is deterministic.

H(X) is maximized when p(x) is the same for all x ∈ Ω_X (i.e.: when the distribution is discrete uniform).

[Figure. Credit: Ethan Weinberger]
Joint Entropy
Joint entropy measures the total amount of uncertainty in a discrete joint distribution.

Two discrete RVs:

H(X, Y) = E[−log_b p(X, Y)] = −∑_{x∈Ω_X} ∑_{y∈Ω_Y} p(x, y) log_b(p(x, y))

In general:

H(X₁, …, Xₙ) = E[−log_b p(X₁, …, Xₙ)] = −∑_{x₁∈Ω_{X₁}} ⋯ ∑_{xₙ∈Ω_{Xₙ}} p(x₁, …, xₙ) log_b(p(x₁, …, xₙ))

(When evaluating the above expressions, 0 log 0 should be treated as if it evaluates to 0.)
Conditional Entropy
Two discrete RVs:

H(X | Y = y) = E_X[−log_b p(X|y) | Y = y] = −∑_{x∈Ω_X} p(x|y) log_b(p(x|y))

H(X | Y) = E_Y[E_X[−log_b p(X|Y) | Y]]
         = E_{X,Y}[−log_b p(X|Y)]
         = −∑_{x∈Ω_X} ∑_{y∈Ω_Y} p(x, y) log_b(p(x|y))
         = −∑_{x∈Ω_X} ∑_{y∈Ω_Y} (p(x, y) log_b(p(x, y)) − p(x, y) log_b(p(y)))

(When evaluating the above expressions, 0 log 0 should be treated as if it evaluates to 0.)
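A sketch (my own, with a made-up joint pmf) computing H(X|Y) both via the last identity above, which amounts to H(X|Y) = H(X, Y) − H(Y), and directly:

    import numpy as np

    # Joint pmf p(x, y) as a table: rows index x, columns index y (made-up numbers)
    p_xy = np.array([[0.30, 0.10],
                     [0.20, 0.40]])

    def H(p):
        """Entropy in nats of the probabilities in p, with 0 log 0 = 0."""
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    p_y = p_xy.sum(axis=0)          # marginal p(y)
    H_cond = H(p_xy) - H(p_y)       # H(X|Y) = H(X, Y) - H(Y)

    # Direct check: -sum over x, y of p(x, y) log p(x | y)
    p_x_given_y = p_xy / p_y        # divides each column by the matching p(y)
    direct = -np.sum(p_xy * np.log(p_x_given_y))
    print(H_cond, direct)           # the two should agree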
Conditional Entropy
In general:

H(X | Y₁ = y₁, …, Y_l = y_l) = E_X[−log_b p(X|y₁, …, y_l) | Y₁ = y₁, …, Y_l = y_l] = −∑_{x∈Ω_X} p(x|y₁, …, y_l) log_b(p(x|y₁, …, y_l))

H(X | Y₁, …, Y_l) = E_{Y₁,…,Y_l}[E_X[−log_b p(X|Y₁, …, Y_l) | Y₁, …, Y_l]]
                  = E_{X,Y₁,…,Y_l}[−log_b p(X|Y₁, …, Y_l)]
                  = −∑_{x∈Ω_X} ∑_{y₁∈Ω_{Y₁}} ⋯ ∑_{y_l∈Ω_{Y_l}} p(x, y₁, …, y_l) log_b(p(x|y₁, …, y_l))
                  = −∑_{x∈Ω_X} ∑_{y₁∈Ω_{Y₁}} ⋯ ∑_{y_l∈Ω_{Y_l}} (p(x, y₁, …, y_l) log_b(p(x, y₁, …, y_l)) − p(x, y₁, …, y_l) log_b(p(y₁, …, y_l)))

(When evaluating the above expressions, 0 log 0 should be treated as if it evaluates to 0.)