Machine Learning
CMPT 726 / 419
Ke Li

SFU School of Computing Science

March 4, 2022
Announcements
Assignment 1 regrade deadline on Sunday at 11:59pm.

Assignment 2 office hours next Monday from 2pm to 4pm (Q2: 2pm to 3pm,
Q1: 3pm to 4pm).
Recap
Probabilistic Model

Suppose we know how the data was generated (known as the “data generating process”):

$y = \vec{w}^\top \vec{x} + \sigma\epsilon$, where $\epsilon \sim \mathcal{N}(0,1)$

$\vec{w}$ and $\sigma$ are unknown parameters.

We observe many tuples $(\vec{x}, y)$, denoted as $\mathcal{D} := \{(\vec{x}_i, y_i)\}_{i=1}^N$, which we assume are
generated i.i.d. from the process above.

Terminology: i.i.d. stands for “independent and identically distributed”, which means that
each tuple $(\vec{x}_i, y_i)$ is independent of other tuples and has the same joint distribution as any
other tuple.

Goal: Estimate the values of unknown parameters from observations (known as “parameter
estimation”).
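To make the data generating process concrete, here is a minimal simulation sketch; the particular values of $\vec{w}$, $\sigma$, and $N$ below are arbitrary illustrative choices:

    # Simulate y = w^T x + sigma * eps with eps ~ N(0, 1), i.i.d. across tuples.
    import numpy as np

    rng = np.random.default_rng(0)
    w_true = np.array([2.0, -1.0])      # "unknown" parameters, fixed here for illustration
    sigma_true = 0.5
    N = 100

    X = rng.normal(size=(N, 2))         # observed inputs x_i
    eps = rng.normal(size=N)            # standard Gaussian noise
    y = X @ w_true + sigma_true * eps   # observed targets y_i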
Probabilistic Model

A probabilistic model assigns a probability to every possible observation.

We choose a probabilistic model based on the data generating process:

Recall the data generating process: $y = \vec{w}^\top \vec{x} + \sigma\epsilon$, where $\epsilon \sim \mathcal{N}(0,1)$

Recall: if $Z \sim \mathcal{N}(0,1)$, then $\mu + \sigma Z \sim \mathcal{N}(\mu, \sigma^2)$

So, $y \,|\, \vec{x}, \vec{w}, \sigma \sim \mathcal{N}(\vec{w}^\top \vec{x}, \sigma^2)$

We can write down the expression for $p(y \,|\, \vec{x}, \vec{w}, \sigma)$:

Recall: if $X \sim \mathcal{N}(\mu, \sigma^2)$, then $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$

So, $p(y \,|\, \vec{x}, \vec{w}, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - \vec{w}^\top \vec{x})^2}{2\sigma^2}\right)$
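As a sanity check on the density above, here is a minimal sketch that evaluates $p(y \,|\, \vec{x}, \vec{w}, \sigma)$ both from the formula and via scipy; the inputs and parameter values are arbitrary example values:

    import numpy as np
    from scipy.stats import norm

    def likelihood(y, x, w, sigma):
        """p(y | x, w, sigma) for the model y | x, w, sigma ~ N(w^T x, sigma^2)."""
        mean = w @ x
        return np.exp(-(y - mean) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

    x = np.array([1.0, 2.0])
    w = np.array([0.5, -0.25])
    print(likelihood(1.0, x, w, sigma=0.5))     # from the formula above
    print(norm.pdf(1.0, loc=w @ x, scale=0.5))  # same value from scipy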
Parameter Estimation

We have two unknown parameters: $\vec{w}$ and $\sigma$.

For the purposes of making predictions, we care mostly about $\vec{w}$ and so will focus
on estimating $\vec{w}$.

In general:

The parameters we care about are known as parameters of interest.

The parameters we don’t care about are known as nuisance parameters.

In this case, $\vec{w}$ is the parameter of interest, and $\sigma$ is the nuisance parameter.
Maximum Likelihood Estimation (MLE)

Idea: Find the parameter value at which the probability of observing the data is maximized.

Step 1: Derive the joint probability density of the observations, $p(\mathcal{D} \,|\, \vec{w}, \sigma)$.

$p(\mathcal{D} \,|\, \vec{w}, \sigma) := p(y_1, \ldots, y_N \,|\, \vec{x}_1, \ldots, \vec{x}_N, \vec{w}, \sigma)$

$= \prod_{i=1}^N p(y_i \,|\, \vec{x}_1, \ldots, \vec{x}_N, \vec{w}, \sigma)$  (conditioned on $\vec{x}_1, \ldots, \vec{x}_N$, the $y_i$’s are independent)

$= \prod_{i=1}^N p(y_i \,|\, \vec{x}_i, \vec{w}, \sigma)$  (conditioned on each $\vec{x}_i$, $y_i$ and the $\vec{x}_j$’s for $j \neq i$ are independent)

$= \prod_{i=1}^N \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \vec{w}^\top \vec{x}_i)^2}{2\sigma^2}\right)$

$=: \mathcal{L}(\vec{w}, \sigma; \{(\vec{x}_i, y_i)\}_{i=1}^N) = \mathcal{L}(\vec{w}, \sigma; \mathcal{D})$

This is known as the “likelihood function”.
Maximum Likelihood Estimation (MLE)

Step 2: Find the value of the parameter of interest that maximizes the likelihood function.

$\hat{\vec{w}}_{\text{MLE}} = \operatorname{argmax}_{\vec{w}} \mathcal{L}(\vec{w}, \sigma; \mathcal{D})$

To find it, we can try computing the gradient and setting it to zero:

$\nabla_{\vec{w}} \mathcal{L}(\vec{w}, \sigma; \mathcal{D}) = \nabla_{\vec{w}} \left( \prod_{i=1}^N p(y_i \,|\, \vec{x}_i, \vec{w}, \sigma) \right)$

$= \sum_{i=1}^N \nabla_{\vec{w}}\, p(y_i \,|\, \vec{x}_i, \vec{w}, \sigma) \prod_{j \neq i} p(y_j \,|\, \vec{x}_j, \vec{w}, \sigma)$

Unwieldy!
Maximum Likelihood Estimation (MLE)

Step 2: Find the value of the parameter of interest that maximizes the likelihood function.

$\hat{\vec{w}}_{\text{MLE}} = \operatorname{argmax}_{\vec{w}} \mathcal{L}(\vec{w}, \sigma; \mathcal{D})$

Instead, use the fact that $\operatorname{argmax}_{\vec{w}} \mathcal{L}(\vec{w}, \sigma; \mathcal{D}) = \operatorname{argmax}_{\vec{w}} \log \mathcal{L}(\vec{w}, \sigma; \mathcal{D})$.

(This is true because $x \mapsto \log(x)$ is strictly increasing, i.e., as $x$ increases, $\log(x)$ increases.)

$\log \mathcal{L}(\vec{w}, \sigma; \mathcal{D}) = \log \prod_{i=1}^N p(y_i \,|\, \vec{x}_i, \vec{w}, \sigma) = \sum_{i=1}^N \log p(y_i \,|\, \vec{x}_i, \vec{w}, \sigma)$

This is known as the “log-likelihood function”. (When the base of the log is not shown, it defaults to $e$, Euler’s number.)

This turns the product into a sum, which is easy to differentiate:

$\nabla_{\vec{w}} \log \mathcal{L}(\vec{w}, \sigma; \mathcal{D}) = \nabla_{\vec{w}} \left( \sum_{i=1}^N \log p(y_i \,|\, \vec{x}_i, \vec{w}, \sigma) \right) = \sum_{i=1}^N \nabla_{\vec{w}} \log p(y_i \,|\, \vec{x}_i, \vec{w}, \sigma)$
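Besides making differentiation easier, the log also matters numerically: a product of many densities can underflow in floating point, while the sum of logs stays well-scaled. A small illustrative sketch (the densities below are stand-ins, not computed from a real model):

    import numpy as np

    rng = np.random.default_rng(0)
    densities = rng.uniform(0.01, 0.1, size=500)  # stand-ins for p(y_i | x_i, w, sigma)

    print(np.prod(densities))         # 0.0 -- the product underflows
    print(np.sum(np.log(densities)))  # a moderate negative number -- well-behaved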
Maximum Likelihood Estimation (MLE)

Step 2: Find the value of the parameter of interest that maximizes the likelihood function.

$\hat{\vec{w}}_{\text{MLE}} = \operatorname{argmax}_{\vec{w}} \mathcal{L}(\vec{w}, \sigma; \mathcal{D}) = \operatorname{argmax}_{\vec{w}} \log \mathcal{L}(\vec{w}, \sigma; \mathcal{D})$

$\log \mathcal{L}(\vec{w}, \sigma; \mathcal{D}) = \sum_{i=1}^N \log p(y_i \,|\, \vec{x}_i, \vec{w}, \sigma)$

$= \sum_{i=1}^N \log \left( \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \vec{w}^\top \vec{x}_i)^2}{2\sigma^2}\right) \right)$

$= \sum_{i=1}^N \left( -\frac{(y_i - \vec{w}^\top \vec{x}_i)^2}{2\sigma^2} - \log \sqrt{2\pi\sigma^2} \right)$

$= -\sum_{i=1}^N \frac{(y_i - \vec{w}^\top \vec{x}_i)^2}{2\sigma^2} - N \log \sqrt{2\pi\sigma^2}$

$= -\frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - \vec{w}^\top \vec{x}_i)^2 - N \log \sqrt{2\pi\sigma^2}$
Maximum Likelihood Estimation (MLE)

$\hat{\vec{w}}_{\text{MLE}} = \operatorname{argmax}_{\vec{w}} \log \mathcal{L}(\vec{w}, \sigma; \mathcal{D})$

$= \operatorname{argmax}_{\vec{w}} \left( -\frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - \vec{w}^\top \vec{x}_i)^2 - N \log \sqrt{2\pi\sigma^2} \right)$

$= \operatorname{argmin}_{\vec{w}} \left( \frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - \vec{w}^\top \vec{x}_i)^2 + N \log \sqrt{2\pi\sigma^2} \right)$

$= \operatorname{argmin}_{\vec{w}} \frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - \vec{w}^\top \vec{x}_i)^2$

$= \operatorname{argmin}_{\vec{w}} \sum_{i=1}^N (y_i - \vec{w}^\top \vec{x}_i)^2$

Compare to the loss function of OLS: $L(\vec{w}) = \sum_{i=1}^N (y_i - \vec{w}^\top \vec{x}_i)^2$
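A minimal numeric check of this result on made-up data: maximizing the Gaussian log-likelihood over $\vec{w}$ gives exactly the OLS solution.

    import numpy as np

    rng = np.random.default_rng(0)
    w_true, sigma = np.array([2.0, -1.0]), 0.5
    X = rng.normal(size=(100, 2))
    y = X @ w_true + sigma * rng.normal(size=100)

    # OLS / MLE closed form: w = (X^T X)^{-1} X^T y
    w_mle = np.linalg.solve(X.T @ X, X.T @ y)
    print(w_mle)  # close to w_true (it recovers it exactly only as N grows)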
Frequentist vs. Bayesian Statistics
Previously we treated parameters as fixed (albeit unknown) quantities.

Suppose we know what values of the parameters are more plausible
compared to others. This is known as prior belief or prior knowledge.

We can represent this prior belief as a probability distribution over
parameters, which is known as a prior probability distribution.

Instead of treating parameters as fixed quantities, we treat the parameters
as random variables that follow the prior probability distribution.

This latter approach is known as Bayesian statistics, whereas the former
approach is known as frequentist statistics.
Maximum A Posteriori Estimation (MAP)

In addition to the probabilistic model, we have a prior distribution $p(\vec{\theta})$ over the
parameters of interest $\vec{\theta}$.

Idea: Find the parameter value that maximizes the conditional probability over
parameters given the observations, $p(\vec{\theta} \,|\, \mathcal{D})$.

We can use Bayes’ rule to find $p(\vec{\theta} \,|\, \mathcal{D})$:

$p(\vec{\theta} \,|\, \mathcal{D}) = \frac{p(\vec{\theta})\, p(\mathcal{D} \,|\, \vec{\theta})}{p(\mathcal{D})}$

Here $p(\vec{\theta} \,|\, \mathcal{D})$ is called the “posterior”, $p(\vec{\theta})$ the “prior”, and $p(\mathcal{D} \,|\, \vec{\theta})$ the “likelihood”.

$\hat{\vec{\theta}}_{\text{MAP}} = \operatorname{argmax}_{\vec{\theta}}\, p(\vec{\theta} \,|\, \mathcal{D})$

Note: The term “likelihood” is used in broader contexts than “likelihood function”. While its meaning coincides with
“likelihood function” in this context, it could mean the probability of a single observation $p(y \,|\, \vec{x}, \vec{\theta})$, or any
$p(\vec{x} \,|\, \vec{y})$ where we observe $\vec{x}$ and would like to infer an unobserved $\vec{y}$ based on $\vec{x}$.
Maximum A Posteriori Estimation (MAP)

Consider the same probabilistic model as before:

$y \,|\, \vec{x}, \vec{w}, \sigma \sim \mathcal{N}(\vec{w}^\top \vec{x}, \sigma^2)$

$p(y \,|\, \vec{x}, \vec{w}, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - \vec{w}^\top \vec{x})^2}{2\sigma^2}\right)$

We choose the following prior distribution over the parameter of interest $\vec{w}$:

$\vec{w} \,|\, \sigma \sim \mathcal{N}\left(\vec{0}, \frac{\sigma^2}{\lambda} I\right)$

$p(\vec{w} \,|\, \sigma) = \frac{1}{\sqrt{(2\pi)^n \det\left(\frac{\sigma^2}{\lambda} I\right)}} \exp\left(-\frac{1}{2} \vec{w}^\top \left(\frac{\sigma^2}{\lambda} I\right)^{-1} \vec{w}\right) = \frac{1}{\left(\sqrt{\frac{2\pi\sigma^2}{\lambda}}\right)^n} \exp\left(-\frac{\lambda}{2\sigma^2} \|\vec{w}\|_2^2\right)$
Maximum A Posteriori Estimation (MAP)

$\hat{\vec{w}}_{\text{MAP}} = \operatorname{argmax}_{\vec{w}}\, p(\vec{w} \,|\, \mathcal{D}, \sigma)$

$= \operatorname{argmax}_{\vec{w}} \log p(\vec{w} \,|\, \mathcal{D}, \sigma)$

$= \operatorname{argmax}_{\vec{w}} \log \left( \frac{p(\vec{w} \,|\, \sigma)\, p(\mathcal{D} \,|\, \vec{w}, \sigma)}{p(\mathcal{D} \,|\, \sigma)} \right)$

$= \operatorname{argmax}_{\vec{w}} \log p(\vec{w} \,|\, \sigma) + \log p(\mathcal{D} \,|\, \vec{w}, \sigma) - \log p(\mathcal{D} \,|\, \sigma)$

$= \operatorname{argmax}_{\vec{w}} \log p(\vec{w} \,|\, \sigma) + \log p(\mathcal{D} \,|\, \vec{w}, \sigma)$

$= \operatorname{argmax}_{\vec{w}} \log p(\vec{w} \,|\, \sigma) + \log p(y_1, \ldots, y_N \,|\, \vec{x}_1, \ldots, \vec{x}_N, \vec{w}, \sigma)$

$= \operatorname{argmax}_{\vec{w}} \log p(\vec{w} \,|\, \sigma) + \log \prod_{i=1}^N p(y_i \,|\, \vec{x}_i, \vec{w}, \sigma)$

$= \operatorname{argmax}_{\vec{w}} \log p(\vec{w} \,|\, \sigma) + \sum_{i=1}^N \log p(y_i \,|\, \vec{x}_i, \vec{w}, \sigma)$
Maximum A Posteriori Estimation (MAP)

$\hat{\vec{w}}_{\text{MAP}} = \operatorname{argmax}_{\vec{w}} \log p(\vec{w} \,|\, \sigma) + \sum_{i=1}^N \log p(y_i \,|\, \vec{x}_i, \vec{w}, \sigma)$

$= \operatorname{argmax}_{\vec{w}} \log \frac{1}{\left(\sqrt{\frac{2\pi\sigma^2}{\lambda}}\right)^n} \exp\left(-\frac{\lambda}{2\sigma^2}\|\vec{w}\|_2^2\right) + \sum_{i=1}^N \log \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \vec{w}^\top \vec{x}_i)^2}{2\sigma^2}\right)$

$= \operatorname{argmax}_{\vec{w}} -\frac{\lambda}{2\sigma^2}\|\vec{w}\|_2^2 - \log\left(\sqrt{\frac{2\pi\sigma^2}{\lambda}}\right)^n + \sum_{i=1}^N \left( -\frac{(y_i - \vec{w}^\top \vec{x}_i)^2}{2\sigma^2} - \log \sqrt{2\pi\sigma^2} \right)$

$= \operatorname{argmax}_{\vec{w}} -\frac{\lambda}{2\sigma^2}\|\vec{w}\|_2^2 + \sum_{i=1}^N -\frac{(y_i - \vec{w}^\top \vec{x}_i)^2}{2\sigma^2}$
Maximum A Posteriori Estimation (MAP)

$\hat{\vec{w}}_{\text{MAP}} = \operatorname{argmax}_{\vec{w}} -\frac{\lambda}{2\sigma^2}\|\vec{w}\|_2^2 + \sum_{i=1}^N -\frac{(y_i - \vec{w}^\top \vec{x}_i)^2}{2\sigma^2}$

$= \operatorname{argmax}_{\vec{w}} -\frac{1}{2\sigma^2}\left( \lambda\|\vec{w}\|_2^2 + \sum_{i=1}^N (y_i - \vec{w}^\top \vec{x}_i)^2 \right)$

$= \operatorname{argmin}_{\vec{w}} \frac{1}{2\sigma^2}\left( \lambda\|\vec{w}\|_2^2 + \sum_{i=1}^N (y_i - \vec{w}^\top \vec{x}_i)^2 \right)$

$= \operatorname{argmin}_{\vec{w}} \lambda\|\vec{w}\|_2^2 + \sum_{i=1}^N (y_i - \vec{w}^\top \vec{x}_i)^2$

Compare to the loss function of ridge regression: $L(\vec{w}) = \sum_{i=1}^N (y_i - \vec{w}^\top \vec{x}_i)^2 + \lambda\|\vec{w}\|_2^2$
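As a sanity check (on made-up data), minimizing the MAP objective above with a general-purpose optimizer should land on the ridge regression closed form given on the next slide:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    y = X @ np.array([1.0, -0.5, 2.0]) + 0.3 * rng.normal(size=50)
    lam = 5.0

    objective = lambda w: lam * w @ w + np.sum((y - X @ w) ** 2)
    w_opt = minimize(objective, x0=np.zeros(3)).x                   # numerical argmin
    w_closed = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)  # ridge closed form
    print(np.allclose(w_opt, w_closed, atol=1e-4))                  # True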
OLS vs. Ridge Regression
(Deterministic Interpretation)

Model: $\hat{y} = \vec{w}^\top \vec{x}$

Parameters: $\vec{w}$

OLS:

Loss function: $L(\vec{w}) = \sum_{i=1}^N (y_i - \vec{w}^\top \vec{x}_i)^2 = \|\vec{y} - X\vec{w}\|_2^2$

Optimal parameters: $\vec{w}^* := \operatorname{argmin}_{\vec{w}} L(\vec{w}) = (X^\top X)^{-1} X^\top \vec{y}$

Ridge Regression:

Loss function: $L(\vec{w}) = \sum_{i=1}^N (y_i - \vec{w}^\top \vec{x}_i)^2 + \lambda\|\vec{w}\|_2^2 = \|\vec{y} - X\vec{w}\|_2^2 + \lambda\|\vec{w}\|_2^2$

Optimal parameters: $\vec{w}^* := \operatorname{argmin}_{\vec{w}} L(\vec{w}) = (X^\top X + \lambda I)^{-1} X^\top \vec{y}$
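A minimal sketch (made-up data) comparing the two closed-form solutions; the ridge solution is shrunk toward $\vec{0}$, matching the zero-mean Gaussian prior in the probabilistic interpretation that follows:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=100)
    lam, d = 10.0, X.shape[1]

    w_ols = np.linalg.solve(X.T @ X, X.T @ y)                      # (X^T X)^{-1} X^T y
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)  # (X^T X + lam I)^{-1} X^T y
    print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))          # ridge norm is smaller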
OLS vs. Ridge Regression
(Probabilistic Interpretation)

Probabilistic Model: $y \,|\, \vec{x}, \vec{w}, \sigma \sim \mathcal{N}(\vec{w}^\top \vec{x}, \sigma^2)$

Parameter of Interest: $\vec{w}$

Nuisance Parameter: $\sigma$

OLS:

MLE estimate: $\hat{\vec{w}}_{\text{MLE}} = \operatorname{argmin}_{\vec{w}} \sum_{i=1}^N (y_i - \vec{w}^\top \vec{x}_i)^2 = \operatorname{argmin}_{\vec{w}} \|\vec{y} - X\vec{w}\|_2^2 = (X^\top X)^{-1} X^\top \vec{y}$

Ridge Regression:

Prior: $\vec{w} \,|\, \sigma \sim \mathcal{N}\left(\vec{0}, \frac{\sigma^2}{\lambda} I\right)$

MAP estimate: $\hat{\vec{w}}_{\text{MAP}} = \operatorname{argmin}_{\vec{w}} \sum_{i=1}^N (y_i - \vec{w}^\top \vec{x}_i)^2 + \lambda\|\vec{w}\|_2^2 = \operatorname{argmin}_{\vec{w}} \|\vec{y} - X\vec{w}\|_2^2 + \lambda\|\vec{w}\|_2^2 = (X^\top X + \lambda I)^{-1} X^\top \vec{y}$
Takeaways

There are deep connections between deterministic and probabilistic formulations of machine learning methods.

Many methods have both deterministic and probabilistic interpretations.

Primary loss function is related to the likelihood.

Regularizer is related to the prior.

Square losses are related to (univariate) Gaussian observation noise:

$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$

Squared L2 losses are related to (multivariate) isotropic Gaussian observation noise:

$p(\vec{x}) = \frac{1}{\sqrt{(2\pi\sigma^2)^n}} \exp\left(-\frac{1}{2\sigma^2}\|\vec{x} - \vec{\mu}\|_2^2\right)$
(Full) Bayesian Inference

Bayesian inference refers to computing the full posterior:

$p(\vec{\theta} \,|\, \mathcal{D}) = \frac{p(\vec{\theta})\, p(\mathcal{D} \,|\, \vec{\theta})}{p(\mathcal{D})}$

Compare to MAP:

$\hat{\vec{\theta}}_{\text{MAP}} = \operatorname{argmax}_{\vec{\theta}}\, p(\vec{\theta} \,|\, \mathcal{D})$

Unlike MAP estimation, it does not just produce a single value for the parameter estimate (known
as a point estimate). Instead, it produces a full distribution over possible parameter values.

Can be extremely computationally challenging: computing the posterior exactly is intractable for
all but some special cases. Must typically rely on sampling or approximations.
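The linear-Gaussian model from this lecture is one of those special cases: with the prior $\vec{w} \,|\, \sigma \sim \mathcal{N}(\vec{0}, \frac{\sigma^2}{\lambda} I)$, the posterior over $\vec{w}$ is itself Gaussian, with mean $(X^\top X + \lambda I)^{-1} X^\top \vec{y}$ (exactly the MAP/ridge estimate) and covariance $\sigma^2 (X^\top X + \lambda I)^{-1}$. A minimal sketch on made-up data, assuming $\sigma$ is known:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.5 * rng.normal(size=100)
    sigma, lam = 0.5, 1.0

    A = X.T @ X + lam * np.eye(X.shape[1])
    posterior_mean = np.linalg.solve(A, X.T @ y)    # equals the MAP / ridge estimate
    posterior_cov = sigma ** 2 * np.linalg.inv(A)   # uncertainty around that estimate

    print(posterior_mean)
    print(np.sqrt(np.diag(posterior_cov)))          # per-coordinate posterior std. dev.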
Terminology: “Modes”

A mode of a probability distribution is:

• in the case of a continuous distribution: a local maximum of the probability density

• in the case of a discrete distribution: a value at which the probability mass is highest

“Unimodal”: a probability distribution with a single mode (e.g., Gaussian, Gamma, log-normal, etc.)

“Bimodal”: a probability distribution with two modes (e.g., a mixture of two Gaussians)

“Multimodal”: a probability distribution with many modes (e.g., a mixture of k Gaussians)

(Figure illustrating unimodal and multimodal densities omitted. Credit: Wikipedia)
Probabilistic Interpretation of Linear Regression
Full Bayesian Inference vs. MAP

When the posterior $p(\vec{\theta} \,|\, \mathcal{D})$ is unimodal:

• There is one parameter value that is the most plausible given the data.

• The MAP estimate corresponds to this parameter value.

• The MAP estimate would capture the most important information of the posterior well.

When the posterior $p(\vec{\theta} \,|\, \mathcal{D})$ is multimodal:

• There could be multiple parameter values that are all quite plausible (or even equally plausible).

• The MAP estimate corresponds to one of these parameter values (in the case of multiple
equally plausible values, which one it corresponds to is arbitrary).

• The MAP estimate would lose a lot of information in the posterior.
Terminology: “Inference”
Parameter estimation is also known as inference (especially in the context of Bayesian inference), since the
purpose is to infer the value of unknown parameters from observed data.

Note: In machine learning, the term “inference” is overloaded and can sometimes mean testing/making
predictions (as opposed to training).

This can get confusing, because parameter estimation in a probabilistic model corresponds to training the
model, which is also known as inference. On the other hand, testing the model is also sometimes known as
inference. So, inference can mean either training or testing!

Meaning depends on context:

• “inference time”, “training and inference”: means testing

• “Bayesian inference”, “probabilistic inference”, “parameter inference”: means training

• “inference procedure”: ambiguous

In this course, we will avoid using the term as much as possible to minimize confusion.
Terminology: “A Priori” vs. “A Posteriori”
“A Priori” (literally means “from the earlier”): Prior knowledge that we have before seeing the data

“A Posteriori” (literally means “from the later”): Knowledge acquired after seeing the data

Examples:

“The degree of the polynomial is not known a priori” means that we don’t know what the degree of the
polynomial should be without seeing the data.

“The a posteriori estimates are mostly concentrated around 0.5” means that after seeing the data, most of
the estimates are near 0.5.

Recall Bayes’ rule: $p(\vec{\theta} \,|\, \mathcal{D}) = \frac{p(\vec{\theta})\, p(\mathcal{D} \,|\, \vec{\theta})}{p(\mathcal{D})}$, where $p(\vec{\theta} \,|\, \mathcal{D})$ is the “posterior”, $p(\vec{\theta})$ the “prior”, and $p(\mathcal{D} \,|\, \vec{\theta})$ the “likelihood”.
In-Class Quiz
Q3: Which of the following statements about MAP and full Bayesian
inference is NOT true?
(A) MAP produces a point estimate, whereas full Bayesian inference does
not
(B) The MAP estimate is always a mode of the distribution produced by full
Bayesian inference
(C) MAP involves optimizing a function, whereas full Bayesian inference
does not
(D) MAP adds an additional optimization step on top of full Bayesian
inference and so is more expensive to compute
(E) MAP uses a prior, and so does full Bayesian inference
(F) All are true
(G) Don’t know
Bias-Variance Tradeoff
Understanding Overfitting
Recall: Overfitting happens when the
model is fitting to the noise in the training
data.

So, when overfitting happens, different
training datasets can result in very different
optimal parameter values, which often
result in wrong predictions for some inputs.

Ideally, we would like to use a family of
models that would typically produce
accurate predictions on unseen testing
data, regardless of the particular training
dataset.
Bias-Variance Decomposition

$\epsilon(\vec{x}, f, L) = \mathbb{E}_{y,\mathcal{D}}\left[(f(\vec{x}; \theta^*(\mathcal{D}, L)) - y)^2 \,|\, \vec{x}\right]$

$= \underbrace{\left(\mathbb{E}_{\mathcal{D}}[f(\vec{x}; \theta^*(\mathcal{D}, L)) \,|\, \vec{x}] - \mathbb{E}_y[y \,|\, \vec{x}]\right)^2}_{\text{Bias Squared}} + \underbrace{\mathrm{Var}_{\mathcal{D}}(f(\vec{x}; \theta^*(\mathcal{D}, L)) \,|\, \vec{x})}_{\text{Variance}} + \underbrace{\mathrm{Var}_y(y \,|\, \vec{x})}_{\text{Irreducible Error}}$

Reading each piece of the decomposition:

• $\vec{x}$: the unseen testing example

• $f$: the model

• $L$: the loss function

• $\mathcal{D}$: the training dataset

• $\theta^*(\mathcal{D}, L)$: the optimal parameters on the training data

• $f(\vec{x}; \theta^*(\mathcal{D}, L))$: the prediction of the trained model on the new testing example

• $y$: the (possibly noisy) label corresponding to the testing example

• $\epsilon(\vec{x}, f, L)$, the expected error: the mean squared error (MSE) on the testing example, i.e., the testing error averaged over training datasets and different noisy versions of the label

• $\mathbb{E}_{\mathcal{D}}[f(\vec{x}; \theta^*(\mathcal{D}, L)) \,|\, \vec{x}]$: the prediction of the trained model on the testing example, averaged over training datasets

• $\mathbb{E}_y[y \,|\, \vec{x}]$: the average of the noisy versions of the label for the testing example

• Bias squared: the squared difference between the average prediction and the average label

• Variance: the variance of the prediction on the testing example over different training datasets

• Irreducible error: the variance of the different noisy versions of the label of the testing example
Bias-Variance Tradeoff

$\epsilon(\vec{x}, f, L) = \underbrace{\left(\mathbb{E}_{\mathcal{D}}[f(\vec{x}; \theta^*(\mathcal{D}, L)) \,|\, \vec{x}] - \mathbb{E}_y[y \,|\, \vec{x}]\right)^2}_{\text{Bias Squared}} + \underbrace{\mathrm{Var}_{\mathcal{D}}(f(\vec{x}; \theta^*(\mathcal{D}, L)) \,|\, \vec{x})}_{\text{Variance}} + \underbrace{\mathrm{Var}_y(y \,|\, \vec{x})}_{\text{Irreducible Error}}$

When overfitting: bias squared is small, variance is large.

When underfitting: bias squared is large, variance is small.

When fitted just right: bias squared is small, variance is small.

(The irreducible error does not depend on the model, so it is the same in all three cases.)
Bias-Variance Tradeoff

Proof (writing $\hat{f} := f(\vec{x}; \theta^*(\mathcal{D}, L))$ for brevity):

$\mathbb{E}_{y,\mathcal{D}}[(\hat{f} - y)^2 \,|\, \vec{x}]$

$= \mathbb{E}_{\mathcal{D}}[\hat{f}^2 \,|\, \vec{x}] - 2\,\mathbb{E}_{\mathcal{D},y}[\hat{f} \cdot y \,|\, \vec{x}] + \mathbb{E}_y[y^2 \,|\, \vec{x}]$

$= \left(\mathrm{Var}_{\mathcal{D}}(\hat{f} \,|\, \vec{x}) + \mathbb{E}_{\mathcal{D}}[\hat{f} \,|\, \vec{x}]^2\right) - 2\,\mathbb{E}_{\mathcal{D},y}[\hat{f} \cdot y \,|\, \vec{x}] + \left(\mathrm{Var}_y(y \,|\, \vec{x}) + \mathbb{E}_y[y \,|\, \vec{x}]^2\right)$

$= \left(\mathbb{E}_{\mathcal{D}}[\hat{f} \,|\, \vec{x}]^2 - 2\,\mathbb{E}_{\mathcal{D},y}[\hat{f} \cdot y \,|\, \vec{x}] + \mathbb{E}_y[y \,|\, \vec{x}]^2\right) + \mathrm{Var}_{\mathcal{D}}(\hat{f} \,|\, \vec{x}) + \mathrm{Var}_y(y \,|\, \vec{x})$

$= \left(\mathbb{E}_{\mathcal{D}}[\hat{f} \,|\, \vec{x}]^2 - 2\,\mathbb{E}_{\mathcal{D}}[\hat{f} \,|\, \vec{x}]\,\mathbb{E}_y[y \,|\, \vec{x}] + \mathbb{E}_y[y \,|\, \vec{x}]^2\right) + \mathrm{Var}_{\mathcal{D}}(\hat{f} \,|\, \vec{x}) + \mathrm{Var}_y(y \,|\, \vec{x})$
(using that $\hat{f}$ and $y$ are independent given $\vec{x}$, so the cross expectation factorizes)

$= \left(\mathbb{E}_{\mathcal{D}}[\hat{f} \,|\, \vec{x}] - \mathbb{E}_y[y \,|\, \vec{x}]\right)^2 + \mathrm{Var}_{\mathcal{D}}(\hat{f} \,|\, \vec{x}) + \mathrm{Var}_y(y \,|\, \vec{x})$
Takeaways
Expected Error on Testing Example = Bias² + Variance + Irreducible Error

Bias: Difference between average prediction and average label of the testing example

Variance: Variance of the prediction on the testing example over different training
datasets

Both bias and variance must be low in order for the expected error to be low

When overfitting, bias is low and variance is high

When underfitting, bias is high and variance is low

Simply achieving low bias (which happens when the model is very expressive) isn’t
enough!
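A minimal simulation sketch of these regimes, using an illustrative setup (a sine target, polynomial models, and arbitrary constants): refit the model on many freshly drawn training sets and measure the bias and variance of its prediction at one fixed test point.

    import numpy as np

    rng = np.random.default_rng(0)
    true_f = lambda x: np.sin(2 * np.pi * x)    # E_y[y | x]
    sigma, N, trials, x0 = 0.3, 50, 500, 0.25   # x0 is the fixed testing example

    for degree in (1, 3, 9):                    # underfit, reasonable fit, overfit
        preds = []
        for _ in range(trials):
            x = rng.uniform(0, 1, size=N)       # a fresh training dataset D
            y = true_f(x) + sigma * rng.normal(size=N)
            coeffs = np.polyfit(x, y, degree)   # least-squares polynomial fit
            preds.append(np.polyval(coeffs, x0))
        preds = np.array(preds)
        bias2 = (preds.mean() - true_f(x0)) ** 2
        print(f"degree {degree}: bias^2 = {bias2:.5f}, variance = {preds.var():.5f}")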
Nonlinear Regression
Nonlinear Least Squares

Recall: By replacing raw data with features, we can make the prediction
depend non-linearly on the raw data:

$\hat{y} = \vec{w}^\top \phi(\vec{x})$

However, the prediction $\hat{y}$ still depends linearly on the parameters $\vec{w}$.

Can we make the prediction depend non-linearly on the parameters?

E.g.: $\hat{y} = (\vec{w}^\top \vec{x})^2$
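To make the distinction concrete, here is a small sketch with illustrative choices of $\phi$ and $f$; both predictions depend non-linearly on the raw input $\vec{x}$, but only the second depends non-linearly on the parameters:

    import numpy as np

    x = np.array([1.0, 2.0])

    # Linear least squares with features: nonlinear in x, linear in w.
    phi = lambda x: np.array([1.0, x[0], x[0] * x[1]])  # illustrative feature map
    w = np.array([0.5, -1.0, 0.25])
    y_hat_features = w @ phi(x)

    # Nonlinear least squares: nonlinear in the parameters themselves.
    w2 = np.array([0.5, -1.0])
    y_hat_nonlinear = (w2 @ x) ** 2

    print(y_hat_features, y_hat_nonlinear)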
Nonlinear Least Squares

Suppose we have the following data generating process:

$y = f(\vec{x}; \vec{w}) + \sigma\epsilon$, where $\epsilon \sim \mathcal{N}(0,1)$

So, $y \,|\, \vec{x}, \vec{w}, \sigma \sim \mathcal{N}(f(\vec{x}; \vec{w}), \sigma^2)$

Hence the likelihood function is:

$\mathcal{L}(\vec{w}, \sigma; \mathcal{D}) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - f(\vec{x}_i; \vec{w}))^2}{2\sigma^2}\right)$

$\hat{\vec{w}}_{\text{MLE}} = \operatorname{argmax}_{\vec{w}} \log \mathcal{L}(\vec{w}, \sigma; \mathcal{D}) = \operatorname{argmin}_{\vec{w}} \sum_{i=1}^N (y_i - f(\vec{x}_i; \vec{w}))^2$

How do we find the optimal parameters $\hat{\vec{w}}_{\text{MLE}}$?
Nonlinear Least Squares

To summarize, nonlinear least squares requires solving the following:

$\hat{\vec{w}}_{\text{MLE}} = \operatorname{argmax}_{\vec{w}} \log \mathcal{L}(\vec{w}, \sigma; \mathcal{D}) = \operatorname{argmin}_{\vec{w}} \sum_{i=1}^N (y_i - f(\vec{x}_i; \vec{w}))^2$

This can be viewed as solving an instance of the following more general problem:

$\vec{\theta}^* = \operatorname{argmin}_{\vec{\theta}} L(\vec{\theta})$

In this case, $\vec{\theta} = \vec{w}$ and $L(\vec{\theta}) = -\log \mathcal{L}(\vec{\theta}, \sigma; \mathcal{D})$.

Such a problem is known as an optimization problem, or more specifically, an
unconstrained minimization problem. $L$ is known as the “objective function”, and $L(\vec{\theta})$ is known as the “objective value”.
Nonlinear Least Squares

Goal: Find $\vec{\theta}^* = \operatorname{argmin}_{\vec{\theta}} L(\vec{\theta})$

The first step to finding a global minimum is to find a critical point. We can try to find
the critical points by setting the gradient to zero:

$\frac{\partial}{\partial \vec{w}} \left( \sum_{i=1}^N (y_i - f(\vec{x}_i; \vec{w}))^2 \right) = \sum_{i=1}^N \frac{\partial}{\partial \vec{w}} (y_i - f(\vec{x}_i; \vec{w}))^2 = -\sum_{i=1}^N 2\,(y_i - f(\vec{x}_i; \vec{w}))\, \frac{\partial f}{\partial \vec{w}}(\vec{x}_i; \vec{w}) = \vec{0}$

Cannot solve this in closed form in general, because the form of $\frac{\partial f}{\partial \vec{w}}$ could be arbitrary.
Optimization
Iterative Optimization Algorithms

Goal: Find $\vec{\theta}^* = \operatorname{argmin}_{\vec{\theta}} L(\vec{\theta})$

Because we cannot in general solve for the critical point analytically, we find it
numerically using an iterative optimization algorithm.

The simplest of these algorithms is gradient descent:

    $\vec{\theta}^{(0)} \leftarrow$ random vector
    for $t = 1, 2, 3, \ldots$
        $\vec{\theta}^{(t)} \leftarrow \vec{\theta}^{(t-1)} - \gamma_t \frac{\partial L}{\partial \vec{\theta}}(\vec{\theta}^{(t-1)})$
        if algorithm has converged
            return $\vec{\theta}^{(t)}$

$\gamma_t$ is a hyperparameter and is known as the “step size” or “learning rate”.
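A minimal runnable sketch of the pseudocode above, applied to the nonlinear least squares problem with $f(\vec{x}; \vec{w}) = (\vec{w}^\top \vec{x})^2$ on made-up data. Here $L$ is taken to be the mean (rather than the sum) of squared residuals, which rescales the gradient but not the argmin, and the step size and iteration budget are illustrative choices:

    import numpy as np

    rng = np.random.default_rng(0)
    w_true = np.array([1.0, -2.0])
    X = rng.normal(size=(200, 2))
    y = (X @ w_true) ** 2 + 0.1 * rng.normal(size=200)

    def grad_L(w):
        """Gradient of L(w) = (1/N) * sum_i (y_i - (w^T x_i)^2)^2."""
        r = y - (X @ w) ** 2                   # residuals
        # Chain rule: d/dw (w^T x_i)^2 = 2 (w^T x_i) x_i
        return (-2 * r * 2 * (X @ w)) @ X / len(y)

    w = rng.normal(size=2)                     # theta^(0): random vector
    gamma = 0.01                               # step size ("learning rate")
    for t in range(5000):
        step = gamma * grad_L(w)
        w = w - step
        if np.linalg.norm(step) < 1e-10:       # crude convergence test
            break

    # f is even in w, so w_true and -w_true fit equally well; expect to land near
    # one of them, though gradient descent on a nonconvex objective has no guarantee.
    print(w)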
Iterative Optimization Algorithms

(Figures: gradient descent trajectories on example objective functions, illustrating the update rule above. Credits: Pierre Vigier; Ayoosh Kathuria)

