Machine Learning
CMPT 726 / 419
Ke Li

SFU School of Computing Science

March 16, 2022
Announcements
Deadline for academic dishonesty amnesty application is this coming
Sunday, March 20th.

If you committed an academic offence and do not fill out the form, the
standard penalty will apply (at minimum, a zero or -100% on the work for a
single offence, and an F in the course for multiple offences), without exception.
Recap
Nesterov’s Accelerated Gradient (NAG)
Gradient Descent with Momentum:

$\vec\theta^{(0)} \leftarrow$ random vector
$\Delta\vec\theta \leftarrow \vec 0$
for $t = 1, 2, 3, \ldots$
    $\Delta\vec\theta \leftarrow \alpha\,\Delta\vec\theta - \gamma_t \frac{\partial L}{\partial \vec\theta}(\vec\theta^{(t-1)})$
    $\vec\theta^{(t)} \leftarrow \vec\theta^{(t-1)} + \Delta\vec\theta$
    if algorithm has converged
        return $\vec\theta^{(t)}$

Nesterov's Accelerated Gradient:

$\vec\theta^{(0)} \leftarrow$ random vector
$\Delta\vec\theta \leftarrow \vec 0$
for $t = 1, 2, 3, \ldots$
    $\Delta\vec\theta \leftarrow \alpha\,\Delta\vec\theta - \gamma_t \frac{\partial L}{\partial \vec\theta}(\vec\theta^{(t-1)} + \alpha\,\Delta\vec\theta)$
    $\vec\theta^{(t)} \leftarrow \vec\theta^{(t-1)} + \Delta\vec\theta$
    if algorithm has converged
        return $\vec\theta^{(t)}$

Lookahead: account for the inertia when choosing where to compute the gradient.
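To make the difference concrete, here is a minimal sketch of both updates in Python; the quadratic test function, starting point, and hyperparameter values are illustrative choices, not from the slides:

```python
import numpy as np

# Gradient of the illustrative objective f(x, y) = 4x^2 + (y + 3)^2 / 15.
def grad_f(theta):
    x, y = theta
    return np.array([8.0 * x, 2.0 * (y + 3.0) / 15.0])

def momentum_step(theta, delta, lr=0.05, alpha=0.9):
    # Momentum: gradient evaluated at the current iterate.
    delta = alpha * delta - lr * grad_f(theta)
    return theta + delta, delta

def nag_step(theta, delta, lr=0.05, alpha=0.9):
    # NAG: gradient evaluated at the "lookahead" point the inertia carries us to.
    delta = alpha * delta - lr * grad_f(theta + alpha * delta)
    return theta + delta, delta

theta, delta = np.array([1.0, 1.0]), np.zeros(2)
for t in range(200):
    theta, delta = nag_step(theta, delta)
print(theta)  # approaches the minimum at (0, -3)
```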
Adaptive Gradient Methods
Recall the function $f(x, y) = 4x^2 + \frac{(y + 3)^2}{15}$.

$\nabla^2 f(x, y) = \begin{pmatrix} \frac{\partial^2 f}{\partial x \partial x}(x, y) & \frac{\partial^2 f}{\partial x \partial y}(x, y) \\ \frac{\partial^2 f}{\partial y \partial x}(x, y) & \frac{\partial^2 f}{\partial y \partial y}(x, y) \end{pmatrix} = \begin{pmatrix} 8 & 0 \\ 0 & \frac{2}{15} \end{pmatrix} \succ 0$

The eigenvalues of the Hessian (which in this case are the diagonal entries) have very different magnitudes.

Gradient descent converges to the optimal parameters slowly.
Adaptive Gradient Methods
Recall the function $f(x, y) = 4x^2 + \frac{(y + 3)^2}{15}$, with Hessian $\begin{pmatrix} 8 & 0 \\ 0 & \frac{2}{15} \end{pmatrix} \succ 0$.

Idea: Try to make the coefficient of each term 1.

Let $x' = 2x$ and $y' = \frac{y}{\sqrt{15}}$, so $x = \frac{x'}{2}$ and $y = \sqrt{15}\,y'$. ("Reparameterization")
Adaptive Gradient Methods
Let $x' = 2x$ and $y' = \frac{y}{\sqrt{15}}$, so $x = \frac{x'}{2}$ and $y = \sqrt{15}\,y'$. Then

$f(x, y) = 4x^2 + \frac{(y + 3)^2}{15} = 4\left(\frac{x'}{2}\right)^2 + \frac{\left(\sqrt{15}\,y' + 3\right)^2}{15} = 4\cdot\frac{x'^2}{4} + \frac{15\left(y' + \frac{3}{\sqrt{15}}\right)^2}{15} = x'^2 + \left(y' + \frac{3}{\sqrt{15}}\right)^2$

Gradient descent converges quickly after reparameterization!
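As a quick numeric check (the learning rates and starting point are illustrative), gradient descent in the reparameterized coordinates reaches the minimum in a single step, while in the original coordinates the $y$ coordinate crawls:

```python
import numpy as np

def gd(grad, theta, lr, steps):
    # plain gradient descent
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# original coordinates: f(x, y) = 4x^2 + (y + 3)^2 / 15
grad_f = lambda t: np.array([8 * t[0], 2 * (t[1] + 3) / 15])
# reparameterized: g(x', y') = x'^2 + (y' + 3 / sqrt(15))^2
grad_g = lambda t: np.array([2 * t[0], 2 * (t[1] + 3 / np.sqrt(15))])

print(gd(grad_f, np.array([1.0, 1.0]), lr=0.1, steps=50))  # y still far from -3
print(gd(grad_g, np.array([1.0, 1.0]), lr=0.5, steps=1))   # exact minimum in one step
```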
Adaptive Gradient Methods
In general, suppose we reparameterize $\vec\theta' = A^{-1}\vec\theta$, so $\vec\theta = A\vec\theta'$. We can then write $L(\vec\theta)$ as $L(A\vec\theta')$, and

$\frac{\partial L}{\partial \vec\theta'} = \frac{\partial (A\vec\theta')}{\partial \vec\theta'} \frac{\partial L}{\partial (A\vec\theta')} = \frac{\partial (A\vec\theta')}{\partial \vec\theta'} \frac{\partial L}{\partial \vec\theta} = A^\top \frac{\partial L}{\partial \vec\theta}, \quad\text{so}\quad \frac{\partial L}{\partial \vec\theta} = (A^\top)^{-1} \frac{\partial L}{\partial \vec\theta'}$

Challenge: We don't know which $A$ to choose to make gradient descent converge quickly.

Adaptive gradient methods pick $A$ adaptively based on past gradients, effectively reparameterizing on-the-fly.
AdaGrad
Gradient Descent:

$\vec\theta^{(0)} \leftarrow$ random vector
for $t = 1, 2, 3, \ldots$
    $\vec\theta^{(t)} \leftarrow \vec\theta^{(t-1)} - \gamma_t \frac{\partial L}{\partial \vec\theta}(\vec\theta^{(t-1)})$
    if algorithm has converged
        return $\vec\theta^{(t)}$

AdaGrad (short for "adaptive gradient"):

$\vec\theta^{(0)} \leftarrow$ random vector
for $t = 1, 2, 3, \ldots$
    $D \leftarrow \mathrm{diag}\left(\left(\sum_{t'=1}^{t-1}\left(\frac{\partial L}{\partial \theta_i}(\vec\theta^{(t')})\right)^2\right)_{i=1}^{n}\right)$    ("preconditioner")
    $\vec\theta^{(t)} \leftarrow \vec\theta^{(t-1)} - \gamma_t D^{-1}\frac{\partial L}{\partial \vec\theta}(\vec\theta^{(t-1)})$
    if algorithm has converged
        return $\vec\theta^{(t)}$
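A minimal AdaGrad sketch is below. Note a small deviation from the pseudocode above: like most practical implementations, it includes the current gradient in the accumulator and preconditions by $D^{-1/2}$ plus a small $\epsilon$, whereas the slide writes the preconditioner as $D^{-1}$ over past gradients only; the test function and hyperparameters are illustrative:

```python
import numpy as np

def adagrad(grad, theta, lr=0.5, steps=500, eps=1e-8):
    accum = np.zeros_like(theta)  # per-coordinate sum of squared gradients (diag of D)
    for _ in range(steps):
        g = grad(theta)
        accum += g ** 2
        theta = theta - lr * g / (np.sqrt(accum) + eps)  # per-coordinate preconditioning
    return theta

# On the badly conditioned quadratic, both coordinates now make progress
# at comparable rates despite very different curvatures.
grad_f = lambda t: np.array([8 * t[0], 2 * (t[1] + 3) / 15])
print(adagrad(grad_f, np.array([1.0, 1.0])))  # approaches (0, -3)
```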
Adam
Gradient Descent with Momentum:

$\vec\theta^{(0)} \leftarrow$ random vector
$\Delta\vec\theta \leftarrow \vec 0$
for $t = 1, 2, 3, \ldots$
    $\Delta\vec\theta \leftarrow \alpha\,\Delta\vec\theta - \gamma_t \frac{\partial L}{\partial \vec\theta}(\vec\theta^{(t-1)})$
    $\vec\theta^{(t)} \leftarrow \vec\theta^{(t-1)} + \Delta\vec\theta$
    if algorithm has converged
        return $\vec\theta^{(t)}$

Adam (short for "adaptive moment estimation"):

$\vec\theta^{(0)} \leftarrow$ random vector; $\vec m \leftarrow \vec 0$; $D \leftarrow 0$
for $t = 1, 2, 3, \ldots$
    $\vec m \leftarrow \alpha\,\vec m + (1 - \alpha)\frac{\partial L}{\partial \vec\theta}(\vec\theta^{(t-1)})$
    $D \leftarrow \beta D + (1 - \beta)\,\mathrm{diag}\left(\left(\frac{\partial L}{\partial \theta_i}(\vec\theta^{(t-1)})\right)^2\right)_{i=1}^{n}$
    $\vec\theta^{(t)} \leftarrow \vec\theta^{(t-1)} - \gamma_t\left(\left(\frac{1}{1-\beta^t} D\right)^{1/2} + \epsilon I\right)^{-1}\left(\frac{1}{1-\alpha^t}\vec m\right)$
    if algorithm has converged
        return $\vec\theta^{(t)}$
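A sketch of these updates in Python, storing $\vec m$ and the diagonal of $D$ as arrays; the test function and hyperparameter values are illustrative:

```python
import numpy as np

def adam(grad, theta, lr=0.1, alpha=0.9, beta=0.999, eps=1e-8, steps=500):
    m = np.zeros_like(theta)   # first-moment estimate (running mean of gradients)
    v = np.zeros_like(theta)   # second-moment estimate (diagonal of D)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = alpha * m + (1 - alpha) * g
        v = beta * v + (1 - beta) * g ** 2
        m_hat = m / (1 - alpha ** t)   # bias-corrected first moment
        v_hat = v / (1 - beta ** t)    # bias-corrected second moment
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

grad_f = lambda t: np.array([8 * t[0], 2 * (t[1] + 3) / 15])
print(adam(grad_f, np.array([1.0, 1.0])))  # approaches (0, -3)
```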
Optimization
Non-Diagonal Preconditioners
Previous methods use a diagonal preconditioner, which corresponds to choosing a diagonal reparameterization matrix $A$.

If $A$ is diagonal, $\vec\theta = A\vec\theta' = \begin{pmatrix} a_{1,1}\theta'_1 \\ \vdots \\ a_{n,n}\theta'_n \end{pmatrix}$. So, the reparameterization can only scale along each axis.

In some cases, this is less than ideal.
Non-Diagonal Preconditioners
Example: rotate the function by 45 degrees. We apply the transformation

$\begin{pmatrix} x \\ y \end{pmatrix} \mapsto \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x/\sqrt{2} - y/\sqrt{2} \\ x/\sqrt{2} + y/\sqrt{2} \end{pmatrix}$

which gives

$f(x, y) = 4\left(x/\sqrt{2} - y/\sqrt{2}\right)^2 + \frac{\left(x/\sqrt{2} + y/\sqrt{2} + 3\right)^2}{15} = 2(x - y)^2 + \frac{\left(x + y + 3\sqrt{2}\right)^2}{30}$

Both $x$ and $y$ now appear in both terms, so we cannot make the coefficient of each term 1 by scaling alone.

Instead, we can reparameterize using a non-diagonal matrix $A$.
Newton’s Method
Idea: Use the inverse Hessian as the preconditioner.

This is known as a second-order optimization method because it uses the Hessian (from the second-order term in the Taylor expansion) in addition to the gradient (from the first-order term in the Taylor expansion).

Methods that only use the gradient are known as first-order methods.

Gradient Descent:

$\vec\theta^{(0)} \leftarrow$ random vector
for $t = 1, 2, 3, \ldots$
    $\vec\theta^{(t)} \leftarrow \vec\theta^{(t-1)} - \gamma_t \frac{\partial L}{\partial \vec\theta}(\vec\theta^{(t-1)})$
    if algorithm has converged
        return $\vec\theta^{(t)}$

Newton's Method:

$\vec\theta^{(0)} \leftarrow$ random vector
for $t = 1, 2, 3, \ldots$
    $H \leftarrow \frac{\partial^2 L}{\partial \vec\theta\,\partial \vec\theta^\top}(\vec\theta^{(t-1)})$
    $\vec\theta^{(t)} \leftarrow \vec\theta^{(t-1)} - \gamma_t H^{-1}\frac{\partial L}{\partial \vec\theta}(\vec\theta^{(t-1)})$
    if algorithm has converged
        return $\vec\theta^{(t)}$
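A sketch of a (damped) Newton step in Python. On the earlier quadratic, whose Hessian is constant, a single full step with $\gamma_t = 1$ lands exactly on the minimum (the function choice is illustrative):

```python
import numpy as np

def newton(grad, hess, theta, lr=1.0, steps=1):
    for _ in range(steps):
        H = hess(theta)
        # solve H d = grad(theta) instead of forming H^{-1} explicitly
        theta = theta - lr * np.linalg.solve(H, grad(theta))
    return theta

grad_f = lambda t: np.array([8 * t[0], 2 * (t[1] + 3) / 15])
hess_f = lambda t: np.array([[8.0, 0.0], [0.0, 2.0 / 15.0]])
print(newton(grad_f, hess_f, np.array([1.0, 1.0])))  # exactly (0, -3)
```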
Newton’s Method
Benefits:
Invariant to invertible linear reparameterizations.

When initialized sufficiently close to a local minimum, Newton's method will converge very quickly at a quadratic rate (that is, as the number of iterations $t$ increases, the gap between the local minimum and the objective value decreases at a rate of $O(e^{-e^t})$).
Newton’s Method
Drawbacks:
May not try to minimize the objective function - it may actually maximize it!

• Recall the update formula: $\vec\theta^{(t)} \leftarrow \vec\theta^{(t-1)} - \gamma_t H^{-1}\frac{\partial L}{\partial \vec\theta}(\vec\theta^{(t-1)})$

• If the Hessian is negative definite, the update will move in the opposite direction from the direction of steepest descent.

When not initialized close to a local minimum, it may diverge!

• If the Hessian has small eigenvalues, the inverse Hessian will have large eigenvalues, causing the update to blow up.

The Hessian is very expensive (or even intractable) to compute because its number of entries is quadratic in the dimension of the parameter vector.
Comparison of Optimization Algorithms
There is no single best optimization algorithm -
which one works best depends on the objective function.
Learning to Optimize
Possible to discover new optimization algorithms automatically using machine learning -
one of my research topics.

Idea: Replace the update formula with a model, and then find the parameters of the model
that would make the optimization algorithm converge the fastest. Effectively learning the
update formula from data on optimizing functions.

“Learning to Optimize”, Ke Li, Jitendra Malik, arXiv:1606.01885, 2016 and ICLR, 2017
In-Class Quiz
Q1: Which of the following optimization algorithms uses a non-diagonal
preconditioner?
(A) Gradient descent
(B) Gradient descent with momentum
(C) Nesterov’s accelerated gradient
(D) AdaGrad
(E) Adam
(F) Newton’s method
(G) All are true
(H) Don’t know
Stochastic Optimization
Also known as “stochastic approximation” when used in broader contexts, such as root
finding.

Recall the optimization problem we would like to solve:

$\vec\theta^* = \operatorname*{argmin}_{\vec\theta} L(\vec\theta)$, where $L(\vec\theta) = -\log \mathcal{L}(\vec\theta, \sigma; \mathcal{D}) = \sum_{i=1}^{N}\left(y_i - f(\vec x_i; \vec\theta)\right)^2 := \sum_{i=1}^{N} L_i(\vec\theta)$

(here $\mathcal{D}$ denotes the training data)

In each iteration of these optimization algorithms, we need to compute the gradient $\frac{\partial L}{\partial \vec\theta}(\vec\theta^{(t-1)})$:

$\frac{\partial L}{\partial \vec\theta}(\vec\theta^{(t-1)}) = \sum_{i=1}^{N}\frac{\partial L_i}{\partial \vec\theta}(\vec\theta^{(t-1)})$

This involves summing $N$ terms. If the dataset is large, computing this is expensive.
Stochastic Optimization
Key Idea: Instead of computing all the terms, we can compute a random subset of the terms and add them together (suitably rescaled). This can be viewed as a noisy version of the true gradient that is, in expectation, the same as the true gradient.

Consider the extreme case of just sampling one term:

Let $i'$ be a discrete uniform RV from $1$ to $N$ inclusive, i.e. $\Pr(i' = k) = \frac{1}{N}$ for all $k \in \{1, \ldots, N\}$. Then

$E_{i'}\!\left[N \cdot \frac{\partial L_{i'}}{\partial \vec\theta}(\vec\theta^{(t-1)})\right] = N \cdot E\!\left[\frac{\partial L_{i'}}{\partial \vec\theta}(\vec\theta^{(t-1)})\right] = N\sum_{k=1}^{N}\Pr(i' = k)\,\frac{\partial L_k}{\partial \vec\theta}(\vec\theta^{(t-1)}) = N\sum_{k=1}^{N}\frac{1}{N}\,\frac{\partial L_k}{\partial \vec\theta}(\vec\theta^{(t-1)}) = \sum_{k=1}^{N}\frac{\partial L_k}{\partial \vec\theta}(\vec\theta^{(t-1)}) = \frac{\partial L}{\partial \vec\theta}(\vec\theta^{(t-1)})$

When the expectation of a random variable equals a fixed quantity, the random variable is said to be an unbiased estimate of that quantity. In this case, the subsampled gradient (known as a gradient estimate) is an unbiased estimate of the true gradient.
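A quick numeric check of this derivation; the data, the model $f(x; \theta) = \theta x$, and the loss $L_i = (y_i - \theta x_i)^2$ are illustrative. Averaging the single-sample estimate $N \cdot \partial L_{i'}/\partial\theta$ over all $N$ choices of $i'$ recovers the full gradient exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
x, y = rng.normal(size=N), rng.normal(size=N)   # illustrative data
theta = 0.5

# dL_i/dtheta for L_i = (y_i - theta * x_i)^2
grad_i = lambda i, th: 2 * (th * x[i] - y[i]) * x[i]

full_grad = sum(grad_i(i, theta) for i in range(N))
avg_estimate = np.mean([N * grad_i(i, theta) for i in range(N)])
print(np.isclose(full_grad, avg_estimate))  # True: the estimate is unbiased
```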
Stochastic Gradient Descent (SGD)
It turns out that we can replace the true gradient in gradient descent with the gradient estimate. The resulting algorithm is known as stochastic gradient descent (SGD).

Stochastic Gradient Descent (SGD):

$\vec\theta^{(0)} \leftarrow$ random vector
for $t = 1, 2, 3, \ldots$
    Sample $i'$ from the discrete uniform distribution from $1$ to $N$ inclusive
    $\vec\theta^{(t)} \leftarrow \vec\theta^{(t-1)} - \gamma_t \frac{\partial L_{i'}}{\partial \vec\theta}(\vec\theta^{(t-1)})$
    if algorithm has converged
        return $\vec\theta^{(t)}$

(The factor of $N$ from the unbiased estimate is absorbed into the learning rate $\gamma_t$.)
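A minimal SGD sketch for a scalar least-squares problem; the data, model $f(x; \theta) = \theta x$, learning rate, and iteration count are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
x = rng.normal(size=N)
y = 3.0 * x + 0.1 * rng.normal(size=N)      # true theta is 3 (illustrative data)

theta, lr = 0.0, 0.01
for t in range(20000):
    i = rng.integers(N)                      # sample i' uniformly from {0, ..., N-1}
    g = 2 * (theta * x[i] - y[i]) * x[i]     # gradient of the single term L_i'
    theta -= lr * g                          # the factor N is absorbed into lr
print(theta)  # roughly 3
```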
SGD vs. Gradient Descent
SGD comes with two benefits compared to vanilla gradient descent:

• Computing the gradient estimate is much cheaper

• Less prone to getting stuck at a local minimum or saddle point, so tends to
converge faster on non-convex optimization problems

It does have a drawback:

• In the worst case, the number of iterations needed to achieve a given
objective value is increased by a factor of $N$ compared to vanilla gradient
descent.
Terminology
(Full Batch) Gradient Descent: Computing the gradient on the entire dataset every iteration.

Mini-batch Gradient Descent: Computing the gradient on a subset of data points (known as
a mini-batch) every iteration. (Sometimes also referred to as "stochastic gradient descent".)

Stochastic Gradient Descent: Computing the gradient on a single data point every iteration.

The smallest number of iterations needed to perform one pass over the dataset is known as
an epoch. It is:

- 1 for full batch gradient descent

- N/M for mini-batch gradient descent where M is the size of the mini-batches

- N for stochastic gradient descent

In practice, mini-batch gradient descent tends to work the best.
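A sketch of one epoch of mini-batch gradient descent, reusing the scalar least-squares setup from the SGD sketch above; the batch size $M$, learning rate, and the choice to average (rather than sum) per-example gradients are illustrative:

```python
import numpy as np

def run_epoch(x, y, theta, lr=0.01, M=32):
    # one epoch = ~N/M mini-batch steps, i.e. one pass over the data
    N = len(x)
    perm = np.random.permutation(N)          # shuffle once per epoch
    for start in range(0, N, M):
        batch = perm[start:start + M]
        # mean of per-example gradients of L_i = (y_i - theta * x_i)^2
        g = np.mean(2 * (theta * x[batch] - y[batch]) * x[batch])
        theta -= lr * g
    return theta
```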
Other Stochastic Algorithms
Stochastic optimization can be combined with algorithms other than gradient descent
to yield stochastic versions of those algorithms.

For example, stochastic gradient descent with momentum, stochastic NAG, stochastic
AdaGrad, stochastic Adam.

However, sometimes all these methods are referred to as stochastic gradient methods.

In modern machine learning, because the datasets are typically large, stochastic
versions are often assumed to be used by default, and the word “stochastic” is often
not explicitly mentioned.

While replacing true gradients with gradient estimates is typically fine for first-order methods, it often does not work well with second-order methods, e.g. Newton's method and quasi-Newton methods like L-BFGS.
Practical Tips for Tuning Hyperparameters
Iterative optimization algorithms have many hyperparameters!

• Step size / learning rate (in all methods): $\gamma_t$ (possibly varying in each iteration)
  • Initial step size / learning rate
  • Learning rate schedule: constant, step / piecewise constant, inverse time, etc.
• Momentum parameter (in momentum and Adam): $\alpha$
• Momentum parameter for the preconditioner (in Adam): $\beta$
Practical Tips for Tuning Hyperparameters
(Based on personal experience - your mileage may vary.)

• Step size / learning rate (in all methods), $\gamma_t$:
  • Start with a constant learning rate schedule, and try different values uniformly sampled on a logarithmic scale, e.g. 0.1, 0.01, 0.001, etc.
  • If the objective value keeps going up or oscillates without going down, try a smaller learning rate.
  • If the objective value goes down very slowly, try a larger learning rate.
Practical Tips for Tuning Hyperparameters
• If the objective value goes down steadily but plateaus at a later iteration, try a step / piecewise constant learning rate schedule that decreases the step size after the point in time at which the objective value plateaus.

• If the parameter vector is low-dimensional and you are using a stochastic method, try an inverse time learning rate schedule ($\gamma_t = \gamma_0 / t$). This is backed by theory - look up "Robbins-Monro sequence". (Sketches of both schedules are below.)
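Possible implementations of the two schedules; the base rate, milestones, and decay factor are illustrative hyperparameters:

```python
def piecewise_constant(t, gamma0=0.1, milestones=(1000, 2000), factor=0.1):
    # step / piecewise-constant: multiply the rate by `factor`
    # each time t passes a milestone
    gamma = gamma0
    for m in milestones:
        if t >= m:
            gamma *= factor
    return gamma

def inverse_time(t, gamma0=0.1):
    # inverse time (Robbins-Monro): gamma_t = gamma_0 / t for t >= 1
    return gamma0 / t
```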
Practical Tips for Tuning Hyperparameters
• Momentum parameter (in momentum and Adam), $\alpha$:
  • Typically 0.9 works well. Consider increasing it to 0.99 or 0.999 if the mini-batch size (the number of data points in a mini-batch) is small.

• Momentum parameter for the preconditioner (in Adam), $\beta$:
  • Typically 0.999 works well.
Neural Networks
Features / Basis Functions
Recall: By replacing raw data with features, we can make the prediction depend non-linearly on the raw data:

$\hat y = \vec w^\top \phi(\vec x)$

The feature map $\phi$ is fixed. Can we learn the features?

We can replace the feature map with another model and find the optimal parameters of that model (often referred to as "learning" the parameters of the model).

This is the idea behind neural networks.
Neural Networks
We start with the multiple-output linear regression model on features:

$\hat{\vec y} = W\phi(\vec x)$

Let's rename $W$ as $W_L$ and use $\vec h_L$ to denote $\phi(\vec x)$. So, $\hat{\vec y} = W_L \vec h_L$.

Let's replace the feature map $\vec x \mapsto \phi(\vec x)$ with $\vec x \mapsto \psi(W_{L-1}\vec x)$, where $\psi(\cdot)$ is some known function. We can then learn both the features $\vec h_L$ (by learning $W_{L-1}$) and the parameters $W_L$.

So, now the model becomes:

$\hat{\vec y} = W_L \vec h_L$, where $\vec h_L = \psi(W_{L-1}\vec x)$.

The model parameters consist of both $W_L$ and $W_{L-1}$.
Neural Networks
Let's replace the raw data with features again. The model becomes:

$\hat{\vec y} = W_L \vec h_L$, where $\vec h_L = \psi(W_{L-1}\phi(\vec x))$.

We use $\vec h_{L-1}$ to denote $\phi(\vec x)$. The model can then be written as:

$\hat{\vec y} = W_L \vec h_L$, where $\vec h_L = \psi(W_{L-1}\vec h_{L-1})$.

Let's replace the feature map $\vec x \mapsto \phi(\vec x)$ with $\vec x \mapsto \psi(W_{L-2}\vec x)$, where $\psi(\cdot)$ is some known function. We can then learn both the features $\vec h_{L-1}$ (by learning $W_{L-2}$) and the parameters $W_L$ and $W_{L-1}$.

So, now the model becomes:

$\hat{\vec y} = W_L \vec h_L$, where $\vec h_L = \psi(W_{L-1}\vec h_{L-1})$ and $\vec h_{L-1} = \psi(W_{L-2}\vec x)$.

The model parameters consist of $W_L$, $W_{L-1}$, and $W_{L-2}$.
Neural Networks
Applying this idea recursively gives us the following model:

$\hat{\vec y} = W_L \vec h_L$, where $\vec h_L = \psi(W_{L-1}\vec h_{L-1})$, $\vec h_{L-1} = \psi(W_{L-2}\vec h_{L-2})$, …, and $\vec h_1 = \psi(W_0\vec x)$.

The model parameters consist of $W_L$, $W_{L-1}$, …, and $W_0$.

We can write this more compactly:

$\hat{\vec y} = W_L\psi(W_{L-1}\psi(W_{L-2}\cdots\psi(W_0\vec x)))$

This model is the simplest type of neural network, known as a multi-layer perceptron (MLP).

Neural networks are also known as neural nets or artificial neural networks.
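A forward-pass sketch of this model in Python; the bias handling (appending a constant 1 to each layer's activations) follows the convention introduced on the next slide, and the layer sizes, weights, and the choice of ReLU for $\psi$ are illustrative:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(Ws, x):
    h = np.append(x, 1.0)                 # input with a constant bias feature
    for W in Ws[:-1]:
        h = np.append(relu(W @ h), 1.0)   # h_{l+1} = psi(W_l h_l)
    return Ws[-1] @ h                     # y_hat = W_L h_L (no output nonlinearity)

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)),   # W_0: 2 inputs + bias -> 4 pre-activations
      rng.normal(size=(4, 5)),   # W_1: 4 hidden units + bias -> 4 pre-activations
      rng.normal(size=(1, 5))]   # W_L: 4 hidden units + bias -> 1 output
print(mlp_forward(Ws, np.array([0.5, -1.0])))
```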
Weights and Biases
Multi-layer perceptron model:

$\hat{\vec y} = W_L \vec h_L$, where $\vec h_L = \psi(W_{L-1}\vec h_{L-1})$, $\vec h_{L-1} = \psi(W_{L-2}\vec h_{L-2})$, …, and $\vec h_1 = \psi(W_0\vec x)$.

Recall: In linear regression, $\hat y = \vec w^\top\vec x$, where $\vec w = \begin{pmatrix} w_1 \\ \vdots \\ w_{n-1} \\ b \end{pmatrix}$ and $\vec x = \begin{pmatrix} x_1 \\ \vdots \\ x_{n-1} \\ 1 \end{pmatrix}$.

Here, $\vec x = \begin{pmatrix} x_1 \\ \vdots \\ x_{n_0-1} \\ 1 \end{pmatrix}$, $\vec h_l = \psi(\vec z_l) = \begin{pmatrix} g(z_1) \\ \vdots \\ g(z_{n_l-1}) \\ 1 \end{pmatrix}$, and

$W_l = \begin{pmatrix} w_{1,1} & \cdots & w_{1,n_l} & b_1 \\ \vdots & \ddots & \vdots & \vdots \\ w_{n_{l+1},1} & \cdots & w_{n_{l+1},n_l} & b_{n_{l+1}} \end{pmatrix}$

The $w_{i,j}$ are known as weights, the $b_i$ are known as biases, and $g(\cdot)$ is known as an activation function.
Terminology
Multi-layer perceptron model:

$\hat{\vec y} = W_L \vec h_L$, where $\vec h_L = \psi(W_{L-1}\vec h_{L-1})$, $\vec h_{L-1} = \psi(W_{L-2}\vec h_{L-2})$, …, and $\vec h_1 = \psi(W_0\vec x)$.

$\hat{\vec y}$: Output layer / last layer

$\vec h_l$: $l$th hidden layer / $(l+1)$th layer / features / (post-)activations

$\vec x$: Input layer / first layer

$\vec z_{l+1} := W_l \vec h_l$: Pre-activations
Terminology
Multi-layer perceptron model:

$\hat{\vec y} = W_L \vec h_L$, where $\vec h_L = \psi(W_{L-1}\vec h_{L-1})$, $\vec h_{L-1} = \psi(W_{L-2}\vec h_{L-2})$, …, and $\vec h_1 = \psi(W_0\vec x)$.

An element of $\hat{\vec y}$, $\vec h_l$, or $\vec x$ is known as a neuron / a unit.

Input neuron / input unit: an element of $\vec x$

Hidden neuron / hidden unit: an element of $\vec h_l$

Output neuron / output unit: an element of $\hat{\vec y}$

[Figure: a small feedforward network; each unit $j$ applies its activation function $g$ to the input function $\mathrm{in}_j = \sum_i w_{i,j} a_i + w_{0,j}$ (a bias weight, with $a_0 = 1$) and emits $a_j = g(\mathrm{in}_j)$ along its output links. Credit: Russell & Norvig.]

The model can be interpreted as a network.
Terminology
Multi-layer perceptron model:

$\hat{\vec y} = W_L \vec h_L$, where $\vec h_L = \psi(W_{L-1}\vec h_{L-1})$, $\vec h_{L-1} = \psi(W_{L-2}\vec h_{L-2})$, …, and $\vec h_1 = \psi(W_0\vec x)$.

(Same neuron/unit definitions as on the previous slide.)

[Figure 6.4 from Duda, Hart & Stork, Pattern Classification: a fully connected three-layer network with $d$ input units, $n_H$ hidden units, and $c$ output units. Each hidden unit computes its net activation $\mathrm{net}_j$ as the inner product of the input signals with weights $w_{ji}$ and emits $y_j = f(\mathrm{net}_j)$, where $f(\cdot)$ is the nonlinear activation function; each output unit likewise computes $\mathrm{net}_k$ from the hidden-unit signals and emits $z_k = f(\mathrm{net}_k)$, used as discriminant functions for classification. During training, these outputs are compared with a target vector $\vec t$ and the difference is used to train the weights.]

Represents a network of neurons.
Activation Functions
Sigmoid: $g(z) = \frac{1}{1 + \exp(-z)} := \sigma(z)$

Hyperbolic tangent (tanh): $g(z) = \tanh(z)$

Softplus: $g(z) = \log(1 + \exp(z))$

Most common: ReLU (stands for rectified linear unit): $g(z) = \max(0, z)$
Activation Functions
Linear: $g(z) = z$

But this is not a common choice of activation function. Why?

$\hat{\vec y} = W_L \vec h_L$, where $\vec h_L = \begin{pmatrix} W_{L-1}\vec h_{L-1} \\ 1 \end{pmatrix}$, $\vec h_{L-1} = \begin{pmatrix} W_{L-2}\vec h_{L-2} \\ 1 \end{pmatrix}$, …, and $\vec h_1 = \begin{pmatrix} W_0\vec x \\ 1 \end{pmatrix}$.

Let $\tilde W_l$ denote the first $n_l$ columns of $W_l$ and $\vec b_l$ denote the last column of $W_l$. Then

$\hat{\vec y} = \tilde W_L\tilde W_{L-1}\cdots\tilde W_0\vec x + \left(\vec b_L + \tilde W_L\vec b_{L-1} + \cdots + \tilde W_L\cdots\tilde W_1\vec b_0\right)$

Define $\tilde W = \tilde W_L\tilde W_{L-1}\cdots\tilde W_0$ and $\vec b = \vec b_L + \tilde W_L\vec b_{L-1} + \cdots + \tilde W_L\cdots\tilde W_1\vec b_0$.

This becomes $\hat{\vec y} = \tilde W\vec x + \vec b$, which is just a linear regression model.

As a result, the activation function is also commonly known as the non-linearity.
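A quick numeric check of this collapse (the weights and sizes are illustrative): composing two linear layers with the identity activation equals a single affine map:

```python
import numpy as np

rng = np.random.default_rng(0)
W0t, b0 = rng.normal(size=(4, 2)), rng.normal(size=4)   # tilde-W_0 and b_0
W1t, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)   # tilde-W_1 and b_1

x = rng.normal(size=2)
deep = W1t @ (W0t @ x + b0) + b1            # two "layers" with g(z) = z
W_tilde, b = W1t @ W0t, W1t @ b0 + b1       # collapsed weights and bias
print(np.allclose(deep, W_tilde @ x + b))   # True: just linear regression
```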
Model Summary
Model: $\hat{\vec y} := f(\vec x; \{W_l\}_{l=0}^{L}) = W_L\psi(W_{L-1}\psi(W_{L-2}\cdots\psi(W_0\vec x)))$

Parameters: $W_L$, $W_{L-1}$, …, and $W_0$

Loss function (for now, let's consider the multiple-output regression loss):

$L(\{W_l\}_{l=0}^{L}) = \sum_{i=1}^{N}\left\|\vec y_i - \hat{\vec y}_i\right\|_2^2 = \sum_{i=1}^{N}\left\|\vec y_i - f(\vec x_i; \{W_l\}_{l=0}^{L})\right\|_2^2$

How can we find the optimal parameters?

As we saw with non-linear least squares, because the model is non-linear in the parameters, we won't be able to solve for the optimal parameters analytically. We need to use iterative optimization algorithms.
Universal Function Approximation
For any continuous function on a closed and bounded set, a multi-layer perceptron with one hidden layer and sufficiently many hidden units on that layer can approximate it to any arbitrary precision, for most choices of activation functions*.

*The activation function must be nonlinear, bounded, nondecreasing, and continuous.

Because of this property, neural networks are universal function approximators.

[Plot: an arbitrary continuous function on $[-5, 5]$.]
Universal Function Approximation
Example with the ReLU activation function $g(z) = \max(0, z)$:

$g(z + 1/2)$: a ReLU shifted left by $1/2$.

$-g(z - 1/2)$: a negated ReLU shifted right by $1/2$.

$g(z + 1/2) - g(z - 1/2)$: their sum, a "step" that is $0$ for $z < -1/2$, rises linearly, and equals $1$ for $z > 1/2$.

[Plots of the three functions on $[-5, 5]$.]
Universal Function Approximation

The building block can be shifted and scaled freely:

$g(z + 1/2) - g(z - 1/2) + a$: $a$ shifts the step vertically

$g((z - b) + 1/2) - g((z - b) - 1/2)$: $b$ shifts the step horizontally

$g(cz + 1/2) - g(cz - 1/2)$: $c$ sharpens the transition

$g(z + \frac{1}{2d}) - g(z - \frac{1}{2d})$: $d$ scales the step height to $1/d$

$g(c(z - b) + \frac{1}{2d}) - g(c(z - b) - \frac{1}{2d}) + a$: all parameters combined

$\left(\sum_i g(c_i(z - b_i) + \tfrac{1}{2d}) - g(c_i(z - b_i) - \tfrac{1}{2d}) + a_i\right) + \alpha$: a sum of such building blocks approximates the target function

This sum can be written in the MLP form:

$f(z) = \left(\sum_i g(c_i(z - b_i) + \tfrac{1}{2d}) - g(c_i(z - b_i) - \tfrac{1}{2d}) + a_i\right) + \alpha$

$= \begin{pmatrix} 1 & -1 & \cdots & 1 & -1 & \left(\sum_i a_i\right) + \alpha \end{pmatrix}\psi\left(\begin{pmatrix} c_1 & -b_1 c_1 + \frac{1}{2d} \\ c_1 & -b_1 c_1 - \frac{1}{2d} \\ \vdots & \vdots \\ c_n & -b_n c_n + \frac{1}{2d} \\ c_n & -b_n c_n - \frac{1}{2d} \end{pmatrix}\begin{pmatrix} z \\ 1 \end{pmatrix}\right)$

So this is indeed a multi-layer perceptron with one hidden layer.
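A sketch of this construction in Python: the difference of two shifted ReLUs approaches a unit step as the sharpness $c$ grows, and summing scaled, shifted steps approximates a continuous target (the target function, knot grid, and $c$ are illustrative):

```python
import numpy as np

g = lambda z: np.maximum(0.0, z)                          # ReLU
step = lambda z, c=50.0: g(c * z + 0.5) - g(c * z - 0.5)  # ~unit step at z = 0

target = np.sin                                           # illustrative target
knots = np.linspace(-5, 5, 41)
z = np.linspace(-5, 5, 1001)

# piecewise-constant approximation: start at target(knots[0]) and add the
# increment between consecutive knots with one step unit per knot
approx = np.full_like(z, target(knots[0]))
for b_prev, b in zip(knots[:-1], knots[1:]):
    approx += (target(b) - target(b_prev)) * step(z - b)
print(np.max(np.abs(approx - target(z))))  # error ~ knot spacing; shrinks with more knots
```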
Training Neural Networks
Flatten the parameters $\{W_l\}_{l=0}^{L}$ and the gradients $\left\{\frac{\partial L}{\partial W_l}\right\}_{l=0}^{L}$ into vectors $\vec\theta$ and $\frac{\partial L}{\partial \vec\theta}$, and then apply an iterative optimization algorithm like gradient descent.

Gradient Descent:

$\vec\theta^{(0)} \leftarrow$ random vector
for $t = 1, 2, 3, \ldots$
    $\vec\theta^{(t)} \leftarrow \vec\theta^{(t-1)} - \gamma_t \frac{\partial L}{\partial \vec\theta}(\vec\theta^{(t-1)})$
    if algorithm has converged
        return $\vec\theta^{(t)}$

How do we compute the gradients $\left\{\frac{\partial L}{\partial W_l}\right\}_{l=0}^{L}$?
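In practice, frameworks compute these gradients by backpropagation, covered next. A minimal PyTorch sketch; the sizes, data, and learning rate are illustrative, and biases are omitted for brevity:

```python
import torch

# Two-layer MLP without biases.
W0 = torch.randn(4, 3, requires_grad=True)
W1 = torch.randn(1, 4, requires_grad=True)
x, y = torch.randn(8, 3), torch.randn(8, 1)

y_hat = torch.relu(x @ W0.T) @ W1.T      # forward pass
loss = ((y - y_hat) ** 2).sum()
loss.backward()                          # backpropagation fills W0.grad, W1.grad

with torch.no_grad():                    # one gradient descent step
    for W in (W0, W1):
        W -= 0.01 * W.grad
        W.grad.zero_()
```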
In-Class Quiz
Q2: Which of the following facts about neural networks is NOT true?
(A) MLPs are a generalization of the linear regression model
(B) In general, MLPs are trained using iterative optimization algorithms
(C) For MLPs to be capable of approximating any continuous function, the
activation function must be non-linear
(D) A linear regression model is an MLP without any hidden layers
(E) An MLP without biases can be a universal function approximator
(F) All are true
(G) Don’t know
Multivariate Chain Rule
We have two functions $v(x, y)$ and $w(x, y)$; consider the function $u(v(x, y), w(x, y))$.

If $u(v, w)$ is differentiable in $v$ and $w$, and $v(x, y)$ and $w(x, y)$ are both differentiable in $x$ and $y$, then

$\frac{\partial u}{\partial x} = \frac{\partial u}{\partial v}\frac{\partial v}{\partial x} + \frac{\partial u}{\partial w}\frac{\partial w}{\partial x}$

Example:

$\frac{\partial L}{\partial W_l} = \frac{\partial}{\partial W_l}\sum_{i=1}^{N}\left\|\vec y_i - \hat{\vec y}_i\right\|_2^2 = \sum_{i=1}^{N}\frac{\partial}{\partial W_l}\left\|\vec y_i - \hat{\vec y}_i\right\|_2^2 = \sum_{i=1}^{N}\sum_{j=1}^{n_{L+1}} -2\left(y_{i,j} - \hat y_{i,j}\right)\frac{\partial \hat y_{i,j}}{\partial W_l}$
Chain Rule for Matrix Derivatives
We can define $\vec x = \begin{pmatrix} x \\ y \end{pmatrix}$ and $\vec v(\vec x) = \begin{pmatrix} v(\vec x) \\ w(\vec x) \end{pmatrix}$; consider the function $u(\vec v(\vec x))$.

If $u$ is differentiable in $\vec v$, and $\vec v$ is differentiable in $\vec x$, then

$\frac{\partial u}{\partial \vec x} = \frac{\partial \vec v}{\partial \vec x}\frac{\partial u}{\partial \vec v}$

In general, if we have vector-valued functions $\vec u: \mathbb{R}^m \to \mathbb{R}^n$ and $\vec v: \mathbb{R}^l \to \mathbb{R}^m$ and a vector $\vec x \in \mathbb{R}^l$,

$\frac{\partial \vec u}{\partial \vec x} = \frac{\partial \vec v}{\partial \vec x}\frac{\partial \vec u}{\partial \vec v}$
Backpropagation
Backpropagation is an efficient algorithm for computing gradients.

Simple example:

Let $\vec x = (x_1\;\; x_2\;\; x_3\;\; x_4)^\top$. Consider $y = \underbrace{(x_1 + x_2)}_{z_1}\underbrace{(x_3 + x_4)}_{z_2}$.

Naïve approach: Compute the partial derivatives of the output w.r.t. the input variables and evaluate any unknown partial derivatives that arise. (Also known as forward mode differentiation.)

$\frac{\partial y}{\partial x_1} = \frac{\partial y}{\partial z_1}\frac{\partial z_1}{\partial x_1} + \frac{\partial y}{\partial z_2}\frac{\partial z_2}{\partial x_1} = \frac{\partial y}{\partial z_1}\cdot 1 + \frac{\partial y}{\partial z_2}\cdot 0$, with $\frac{\partial y}{\partial z_1} = z_2 = x_3 + x_4$

$\frac{\partial y}{\partial x_2} = \frac{\partial y}{\partial z_1}\frac{\partial z_1}{\partial x_2} + \frac{\partial y}{\partial z_2}\frac{\partial z_2}{\partial x_2} = \frac{\partial y}{\partial z_1}\cdot 1 + \frac{\partial y}{\partial z_2}\cdot 0$, with $\frac{\partial y}{\partial z_1} = z_2 = x_3 + x_4$
Backpropagation
(Forward mode differentiation, continued)

$\frac{\partial y}{\partial x_3} = \frac{\partial y}{\partial z_1}\frac{\partial z_1}{\partial x_3} + \frac{\partial y}{\partial z_2}\frac{\partial z_2}{\partial x_3} = \frac{\partial y}{\partial z_1}\cdot 0 + \frac{\partial y}{\partial z_2}\cdot 1$, with $\frac{\partial y}{\partial z_2} = z_1 = x_1 + x_2$

$\frac{\partial y}{\partial x_4} = \frac{\partial y}{\partial z_1}\frac{\partial z_1}{\partial x_4} + \frac{\partial y}{\partial z_2}\frac{\partial z_2}{\partial x_4} = \frac{\partial y}{\partial z_1}\cdot 0 + \frac{\partial y}{\partial z_2}\cdot 1$, with $\frac{\partial y}{\partial z_2} = z_1 = x_1 + x_2$

Need to perform 4 additions (the sums $x_3 + x_4$ and $x_1 + x_2$ are each recomputed twice).
Backpropagation
Backpropagation approach: Compute the partial derivatives of the output w.r.t. the intermediate variables before computing the partial derivatives of the output w.r.t. the input variables. This is a form of dynamic programming. (Also known as reverse mode differentiation.)

First, $\frac{\partial y}{\partial z_1} = z_2 = x_3 + x_4$ and $\frac{\partial y}{\partial z_2} = z_1 = x_1 + x_2$. Then

$\frac{\partial y}{\partial x_1} = \frac{\partial y}{\partial z_1}\frac{\partial z_1}{\partial x_1} + \frac{\partial y}{\partial z_2}\frac{\partial z_2}{\partial x_1} = z_2 \cdot 1 + z_1 \cdot 0 = z_2$

$\frac{\partial y}{\partial x_2} = \frac{\partial y}{\partial z_1}\frac{\partial z_1}{\partial x_2} + \frac{\partial y}{\partial z_2}\frac{\partial z_2}{\partial x_2} = z_2 \cdot 1 + z_1 \cdot 0 = z_2$

$\frac{\partial y}{\partial x_3} = \frac{\partial y}{\partial z_1}\frac{\partial z_1}{\partial x_3} + \frac{\partial y}{\partial z_2}\frac{\partial z_2}{\partial x_3} = z_2 \cdot 0 + z_1 \cdot 1 = z_1$

$\frac{\partial y}{\partial x_4} = \frac{\partial y}{\partial z_1}\frac{\partial z_1}{\partial x_4} + \frac{\partial y}{\partial z_2}\frac{\partial z_2}{\partial x_4} = z_2 \cdot 0 + z_1 \cdot 1 = z_1$

Need to perform 2 additions ($z_1$ and $z_2$ are each computed once).
Computation Graph
We can represent the function as a computation graph, where each intermediate variable and operator is represented as a node.

E.g.: $y = z_1 z_2$, where $z_1 = x_1 + x_2$ and $z_2 = x_3 + x_4$.

[Figure: the computation graph of $y = (x_1 + x_2)(x_3 + x_4)$, with two "+" nodes producing $z_1$ and $z_2$ feeding a "$\times$" node producing $y$; the forward pass flows from the inputs up to $y$, the backward pass in the reverse direction.]

Backpropagation consists of two stages:

Forward pass: Computing the values of all variables, starting from the children and ending at the ancestor.

Backward pass: Computing the partial derivatives of the ancestor w.r.t. each variable, starting from the ancestor and ending at the children.
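The same example as Python code, mirroring the two stages on the graph above:

```python
# Backpropagation on the computation graph of y = (x1 + x2)(x3 + x4).
x1, x2, x3, x4 = 1.0, 2.0, 3.0, 4.0

# Forward pass: compute every intermediate variable.
z1 = x1 + x2
z2 = x3 + x4
y = z1 * z2

# Backward pass: partial derivatives of y, from the output down.
dy_dz1 = z2            # d(z1 * z2)/dz1
dy_dz2 = z1
dy_dx1 = dy_dz1 * 1.0  # dz1/dx1 = 1
dy_dx2 = dy_dz1 * 1.0
dy_dx3 = dy_dz2 * 1.0
dy_dx4 = dy_dz2 * 1.0
print(dy_dx1, dy_dx2, dy_dx3, dy_dx4)  # 7.0 7.0 3.0 3.0
```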
Backpropagation in Neural Networks
Recall: Multi-layer perceptron

$\hat{\vec y} = W_L \vec h_L$, where $\vec h_L = \psi(\vec z_L)$, $\vec z_L = W_{L-1}\vec h_{L-1}$, …, $\vec h_1 = \psi(\vec z_1)$, and $\vec z_1 = W_0\vec x$. Let $L(\hat{\vec y})$ be an arbitrary loss function.

Forward pass: Compute the values of all intermediate variables (i.e. the pre- and post-activations at each hidden layer):

$\vec z_1 = W_0\vec x$
$\vec h_1 = \psi(\vec z_1)$
$\vec z_2 = W_1\vec h_1$
$\vec h_2 = \psi(\vec z_2)$
$\;\;\vdots$

[Figure 6.5 from Duda, Hart & Stork, Pattern Classification: the sensitivity at a hidden unit is proportional to the weighted sum of the sensitivities at the output units, $\delta_j = f'(\mathrm{net}_j)\sum_{k=1}^{c} w_{kj}\delta_k$; the output-unit sensitivities are thus propagated "back" to the hidden units. The same figure accompanies the following slides.]
Backpropagation in Neural Networks
Forward pass (continued):

$\;\;\vdots$
$\vec z_L = W_{L-1}\vec h_{L-1}$
$\vec h_L = \psi(\vec z_L)$
$\hat{\vec y} = W_L\vec h_L$
Backpropagation in Neural Networks
Backward pass: The eventual goal is to compute the gradient of the loss function w.r.t. the first layer's weights and biases. We do so by computing the gradients of the loss function w.r.t. all later layers first, starting from the last layer.

Partial derivatives of the loss w.r.t. the post-activations of the last hidden layer: since $\hat{\vec y} = W_L\vec h_L$,

$\frac{\partial L}{\partial \vec h_L} = \frac{\partial \hat{\vec y}}{\partial \vec h_L}\frac{\partial L}{\partial \hat{\vec y}} = W_L^\top\frac{\partial L}{\partial \hat{\vec y}}$
Backpropagation in Neural Networks
Partial derivatives of the post-activations of a hidden layer w.r.t. its pre-activations:

$\frac{\partial L}{\partial \vec z_l} = \frac{\partial \vec h_l}{\partial \vec z_l}\frac{\partial L}{\partial \vec h_l} = \begin{pmatrix} g'(z_{l,1}) & 0 & \cdots & 0 \\ 0 & g'(z_{l,2}) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & g'(z_{l,n_l}) \end{pmatrix}\frac{\partial L}{\partial \vec h_l}$

Partial derivatives of the pre-activations of a hidden layer w.r.t. the post-activations of the preceding layer: since $\vec z_{l+1} = W_l\vec h_l$,

$\frac{\partial L}{\partial \vec h_l} = \frac{\partial \vec z_{l+1}}{\partial \vec h_l}\frac{\partial L}{\partial \vec z_{l+1}} = W_l^\top\frac{\partial L}{\partial \vec z_{l+1}}$
Backpropagation in Neural Networks
Recall: Multi-layer perceptron

$\hat{\vec y} = W_L \vec h_L$, where $\vec h_L = \psi(\vec z_L)$, $\vec z_L = W_{L-1}\vec h_{L-1}$, …, $\vec h_1 = \psi(\vec z_1)$, and $\vec z_1 = W_0\vec x$. Let $L(\hat{\vec y})$ be an arbitrary loss function.

Partial derivatives of the pre-activations w.r.t. the weights between that layer and the preceding layer (here $\vec w_{l,j,\cdot}^\top$ denotes the $j$th row of $W_l$):

$z_{l+1,j} = \vec w_{l,j,\cdot}^\top\vec h_l$, so $\frac{\partial z_{l+1,j}}{\partial \vec w_{l,j,\cdot}^\top} = \vec h_l$ and $\frac{\partial z_{l+1,j}}{\partial \vec w_{l,j',\cdot}^\top} = \vec 0$ for $j' \neq j$.

$\frac{\partial \vec z_{l+1}}{\partial \vec w_{l,j,\cdot}^\top} = \begin{pmatrix} \vec 0 & \cdots & \vec 0 & \vec h_l & \vec 0 & \cdots & \vec 0 \end{pmatrix}$ ($\vec h_l$ in the $j$th column)

$\frac{\partial L}{\partial \vec w_{l,j,\cdot}^\top} = \frac{\partial \vec z_{l+1}}{\partial \vec w_{l,j,\cdot}^\top}\frac{\partial L}{\partial \vec z_{l+1}} = \begin{pmatrix} \vec 0 & \cdots & \vec 0 & \vec h_l & \vec 0 & \cdots & \vec 0 \end{pmatrix}\frac{\partial L}{\partial \vec z_{l+1}}$
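A sketch of these formulas for a one-hidden-layer network in Python, with a finite-difference spot check; the sizes, the tanh activation, and the squared-error loss are illustrative, and the bias-augmentation of $\vec h_l$ is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
g, gprime = np.tanh, lambda z: 1.0 - np.tanh(z) ** 2

W0, W1 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x, y = rng.normal(size=3), rng.normal(size=2)

# forward pass: store pre- and post-activations
z1 = W0 @ x
h1 = g(z1)
y_hat = W1 @ h1

# backward pass for L = ||y - y_hat||^2
dL_dyhat = 2.0 * (y_hat - y)
dL_dh1 = W1.T @ dL_dyhat          # dL/dh_l = W_l^T (dL/dz_{l+1})
dL_dz1 = gprime(z1) * dL_dh1      # dL/dz_l = diag(g'(z_l)) (dL/dh_l)
dL_dW1 = np.outer(dL_dyhat, h1)   # row j of dL/dW_l is (dL/dz_{l+1})_j h_l^T
dL_dW0 = np.outer(dL_dz1, x)

# finite-difference spot check of one weight
eps = 1e-6
W0p = W0.copy()
W0p[0, 0] += eps
loss = lambda W: np.sum((y - W1 @ g(W @ x)) ** 2)
print(dL_dW0[0, 0], (loss(W0p) - loss(W0)) / eps)  # should match closely
```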
