COMP8536: Advanced Topics in
Deep Learning for Computer Vision
Week 2 — Linear Classifiers and Multilayer Perceptrons
Stephen Gould
stephen.gould@anu.edu.au
Australian National University
Semester 2, 2024
Overview
▶ binary classification
▶ logistic regression
▶ the XOR problem
▶ multilayer perceptron
▶ activation functions
▶ the universal approximation theorem
▶ multi-class classification
▶ loss functions
▶ gradient descent optimization
▶ calculating gradients
Binary (linear) Classification
▶ data: training examples D = {(x^{(i)}, y^{(i)})}_{i=1}^{N} composed of
▶ features x^{(i)} ∈ R^n
▶ target labels y^{(i)} ∈ {0, 1}
▶ regression task: learn a conditional distribution P(y | x) from the training examples, which estimates the probability that the label is 0 or 1 for an input x
▶ classifier task: learn a linear classifier f(x) = a^T x + b from the training examples, which predicts 1 on an input x if a^T x + b ≥ 0 and 0 otherwise
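As a minimal sketch of this decision rule (the function name and sample values are illustrative, not from the slides):

import numpy as np

def predict(a, b, x):
    """Linear classifier: predict 1 if a^T x + b >= 0, else 0."""
    return 1 if a @ x + b >= 0 else 0

a, b = np.array([1.0, -2.0]), 0.5
print(predict(a, b, np.array([3.0, 1.0])))   # a^T x + b = 1.5 >= 0, so predicts 1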
2D Illustration
(figure: a separable case, showing the decision boundary a^T x + b = 0 and the normal vector a, and an inseparable case)
▶ a^T x + b measures the (signed) distance from the hyperplane
Logistic Regression
Let y ∈ {0, 1} be a random variable with distribution
P(y = 1 | x) = exp(a^T x + b) / (1 + exp(a^T x + b))
where a, b are parameters and x ∈ R^n is the observed feature vector.
The maximum likelihood principle states that we should choose the parameters a, b that maximize the probability of the observed data under the model,
P(D) = ∏_{i=1}^{N} P(y^{(i)} | x^{(i)})
In practice we (equivalently) minimize the negative log-likelihood function instead,
ℓ(a, b; D) = −log [ ∏_{i: y^{(i)}=1} exp(a^T x^{(i)} + b) / (1 + exp(a^T x^{(i)} + b)) · ∏_{i: y^{(i)}=0} 1 / (1 + exp(a^T x^{(i)} + b)) ]
           = −∑_{i: y^{(i)}=1} (a^T x^{(i)} + b) + ∑_{i=1}^{N} log(1 + exp(a^T x^{(i)} + b))
which is convex in a and b.
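As a concrete check of the formula above, here is a minimal NumPy sketch of the negative log-likelihood (the array shapes and the function name are my own, not from the slides):

import numpy as np

def logistic_nll(a, b, X, y):
    """ℓ(a, b; D) for logistic regression.

    X: (N, n) features, y: (N,) labels in {0, 1}, a: (n,) weights, b: scalar bias.
    """
    z = X @ a + b                                   # z_i = a^T x^(i) + b
    # -sum over positives of z_i, plus sum over all i of log(1 + exp(z_i))
    return -np.sum(z[y == 1]) + np.sum(np.logaddexp(0.0, z))

Here np.logaddexp(0, z) evaluates log(1 + exp(z)) without overflow for large z.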
Logistic Function
y = σ(z) = exp(z) / (exp(z) + 1) = 1 / (1 + exp(−z))
(plot of the logistic function y against z = a^T x + b)
▶ observe that σ : R → [0, 1] and σ(z) ≥ 1/2 ⇐⇒ z = a^T x + b ≥ 0
Logistic Function (2D Feature Space)
y = 1 / (1 + exp(−a^T x − b))
(surface plot of y over the 2D feature space (x1, x2))
2D Illustration Revisited
(figure: the 2D examples revisited, showing the direction a and the level sets of a^T x + b for each case)
Regularization
▶ we often don’t want our classifiers to be too
confident in the margin between positive and
negative samples
▶ regularization is one way to reduce confidence in
that region and generalize better to unseen data
▶ add a penalty on the magnitude of parameters
minimize_{a,b}  ℓ(a, b; D) + (λ/2) ∥a∥²
▶ also called weight decay
∇_a ( ℓ(a, b; D) + (λ/2) ∥a∥² ) = ∇_a ℓ + λa
▶ we’ll see other techniques for preventing overfitting later, e.g., data augmentation
(figure: the regularized classifier, showing the direction a and the decision function a^T x + b)
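For reference, in PyTorch this penalty is usually applied through the optimizer's weight_decay argument rather than by adding the term to the loss by hand (note that, unlike the slide's penalty on a only, it applies to every parameter passed to the optimizer); a minimal sketch with an illustrative λ:

import torch

model = torch.nn.Linear(10, 1)   # stand-in model (illustrative)
# weight_decay adds λ·θ to each parameter's gradient, matching ∇_a ℓ + λa above
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)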
Temperature Scaling
Introduce temperature parameter τ > 0,
y = σ(z/τ) = 1 / (1 + exp(−z/τ))
(plot of y against z = a^T x + b for no scaling, low τ, and high τ)
▶ high τ has a similar effect to regularization (restricted to the logistic and not trained)
▶ can be applied after training to sharpen/flatten distribution
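A minimal sketch of temperature scaling (the function name and values are illustrative):

import numpy as np

def scaled_sigmoid(z, tau=1.0):
    """σ(z/τ): τ > 1 flattens the output towards 0.5, τ < 1 sharpens it towards {0, 1}."""
    return 1.0 / (1.0 + np.exp(-z / tau))

z = 2.0
print(scaled_sigmoid(z), scaled_sigmoid(z, tau=5.0), scaled_sigmoid(z, tau=0.2))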
Logistic Regression as a Single Neuron
classic view, y = σ(∑_{i=1}^{n} a_i x_i + b)

(diagram: inputs x1, . . . , xn weighted by a1, . . . , an, summed with bias b, and passed through σ(·) to give y)

more modern view, y = σ(a^T x + b)

(diagram: x → [a^T x + b] → z → σ(·) → y, with parameters a, b)
XOR Problem
▶ cannot be separated by a (single) linear decision boundary
▶ but can be separated by an additional layer of processing
XOR Problem: Two Layers
(figure: hidden units computing NAND and OR of the inputs, combined by an AND unit in the second layer)
(the first layer defines a new feature space; here we are visualizing in the original feature space)
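As a sanity check, here is a hand-constructed two-layer network along these lines; the specific weights are my own illustrative choice (large magnitudes so the sigmoids saturate), not taken from the slides:

import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

def xor_net(x1, x2):
    h_or   = sigma(20 * x1 + 20 * x2 - 10)      # ≈ OR(x1, x2)
    h_nand = sigma(-20 * x1 - 20 * x2 + 30)     # ≈ NAND(x1, x2)
    return sigma(20 * h_or + 20 * h_nand - 30)  # ≈ AND of the two hidden units

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(xor_net(x1, x2)))       # prints 0, 1, 1, 0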
Multilayer Perceptron
(Rosenblatt 1961, Widrow & Hoff 1960)
Compose multiple logistic layers together. E.g., for two layers,

y = σ(c^T σ(Ax + b) + d)

where σ is applied elementwise. Here A ∈ R^{m×n}, b ∈ R^m, c ∈ R^m, and d ∈ R.

(diagram: x → [Ax + b] → z1 → σ(·) → z2 → [c^T z2 + d] → z3 → σ(·) → y, with parameters A, b and c, d; dimensions n → m → m → 1 → 1)
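A minimal NumPy sketch of this two-layer forward pass (the sizes and random values are illustrative):

import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
A, b = rng.normal(size=(m, n)), rng.normal(size=m)
c, d = rng.normal(size=m), rng.normal()

sigma = lambda v: 1.0 / (1.0 + np.exp(-v))   # elementwise logistic

def mlp(x):
    z = sigma(A @ x + b)        # hidden layer, in R^m
    return sigma(c @ z + d)     # scalar output in (0, 1)

print(mlp(rng.normal(size=n)))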
Activation Functions
▶ logistic function (sigmoid): σ(z) = 1 / (1 + e^{−z})
▶ hyperbolic tangent: σ(z) = tanh(z) = (e^{2z} − 1) / (e^{2z} + 1)
▶ rectified linear unit (ReLU): σ(z) = max{0, z}
▶ and many others (e.g., Leaky ReLU, GLU, etc.)
(plots of the three activation functions over z ∈ [−10, 10])
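The three activations as a minimal NumPy sketch:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)             # equivalently (np.exp(2*z) - 1) / (np.exp(2*z) + 1)

def relu(z):
    return np.maximum(0.0, z)

z = np.linspace(-10, 10, 5)
print(sigmoid(z), tanh(z), relu(z), sep="\n")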
Universal Approximation Theorem
(Cybenko 1989, Hornik 1991)
Theorem (roughly):
▶ let σ : R → R be a (well-behaved) activation function
▶ let f : R^n → R be any continuous function
▶ then, for any ϵ > 0, there exist parameters A ∈ R^{m×n}, b ∈ R^m, c ∈ R^m, and d ∈ R such that the function

f̂(x) = c^T σ(Ax + b) + d

approximates f(x) everywhere
▶ that is, |f̂(x) − f(x)| ≤ ϵ for all x
Proof Sketch (1D)
(figure: a 1D function f(x) used for the proof sketch)
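The proof itself isn't reproduced here, but the following rough empirical sketch (entirely my own construction, not from the slides) fixes m random sigmoid units, fits only the output layer c, d by least squares, and reports how closely the resulting two-layer network matches a simple 1D function:

import numpy as np

rng = np.random.default_rng(0)
m = 200                                        # number of hidden units
x = np.linspace(0, 2 * np.pi, 500)[:, None]    # 1D inputs, shape (500, 1)
f = np.sin(x[:, 0])                            # target function to approximate

A = rng.normal(scale=5.0, size=(m, 1))         # fixed random first-layer weights
b = rng.uniform(-10.0, 10.0, size=m)           # fixed random biases
H = 1.0 / (1.0 + np.exp(-(x @ A.T + b)))       # hidden activations σ(Ax + b)

# least-squares fit of c and d for f_hat(x) = c^T σ(Ax + b) + d
Phi = np.hstack([H, np.ones((len(x), 1))])
coef, *_ = np.linalg.lstsq(Phi, f, rcond=None)
print("max |f_hat(x) - f(x)| on the grid:", np.max(np.abs(Phi @ coef - f)))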
Why Deep Learning?
▶ so we can approximate any function with a two-layer network
▶ problem 1: doesn’t tell us how many parameters are needed
▶ still an active area of research, but the indication is that we need O(e^{1/ϵ}), in general
▶ problem 2: doesn’t tell us how to find the parameters efficiently
▶ we can reduce the number of parameters by going deeper
▶ learning turns out to be “easier” in deeper networks
▶ currently not well understood
▶ in practice we choose activation functions that violate the conditions of the
theorem, e.g., ReLU
Advanced Topic: Identifiability
▶ consider the two-layer perceptron (w/ or w/o second activation function),
(diagram: x → [Ax + b] → σ(·) → [c^T z + d] → σ(·) → y, with parameters A, b and c, d)
▶ let P ∈ R^{m×m} be a permutation matrix, then

y = c^T σ(Ax + b) + d
  = c^T σ(P^{−1}PAx + P^{−1}Pb) + d   (since P^{−1}P = I)
  = c^T P^{−1} σ(PAx + Pb) + d        (since σ is applied elementwise)
  = c̃^T σ(Ãx + b̃) + d                 (with Ã = PA, b̃ = Pb, and c̃^T = c^T P^{−1})
▶ so neural network parameters are never unique
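A quick numerical check of this non-uniqueness (sizes and random values are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 4
A, b = rng.normal(size=(m, n)), rng.normal(size=m)
c, d = rng.normal(size=m), rng.normal()
x = rng.normal(size=n)

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

P = np.eye(m)[rng.permutation(m)]          # random permutation matrix
A2, b2, c2 = P @ A, P @ b, P @ c           # permuted ("tilde") parameters

y1 = c @ sigma(A @ x + b) + d
y2 = c2 @ sigma(A2 @ x + b2) + d
print(np.isclose(y1, y2))                  # True: same function, different parameters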
Multi-class Classification
▶ task: K-way classification (e.g., ImageNet)
▶ data: training examples D = {(x^{(i)}, y^{(i)})}_{i=1}^{N} composed of
▶ features x^{(i)} ∈ R^n
▶ target labels y^{(i)} ∈ {1, . . . ,K}
▶ we can extend the logistic function to the multi-class logistic
P(y = k | x) = exp(a_k^T x + b_k) / ∑_{j=1}^{K} exp(a_j^T x + b_j)   for k = 1, . . . ,K

parametrized by {(a_k, b_k) | k = 1, . . . ,K}
▶ the function that takes a vector z = (z_1, . . . , z_K) and computes (exp(z_1)/Z, . . . , exp(z_K)/Z) with Z = ∑_{k=1}^{K} exp(z_k) is called the softmax; the z_k are called logits
▶ can be computed in a numerically stable way (how?)
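One standard answer to the "(how?)" above, as a minimal sketch: subtract the maximum logit before exponentiating (the softmax is unchanged by adding a constant to every z_k, and this prevents overflow):

import numpy as np

def softmax(z):
    """Numerically stable softmax over a vector of logits."""
    z = z - np.max(z)          # shift so the largest logit is 0; exp() can no longer overflow
    e = np.exp(z)
    return e / np.sum(e)

print(softmax(np.array([1000.0, 1001.0, 1002.0])))   # works despite huge logits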
Multi-label Classification
▶ task: set of K binary classification problems (e.g., attribute classification)
▶ data: training examples D = {(x^{(i)}, y^{(i)})}_{i=1}^{N} composed of
▶ features x^{(i)} ∈ R^n
▶ target labels y^{(i)} ∈ {0, 1}^K
▶ model with a set of K sigmoid functions
P(y_k = 1 | x) = exp(a_k^T x + b_k) / (1 + exp(a_k^T x + b_k))   for k = 1, . . . ,K

parametrized by {(a_k, b_k) | k = 1, . . . ,K}
As Multilayer Perceptrons
with parameters A ∈ R^{m×n}, b ∈ R^m, C ∈ R^{K×m}, and d ∈ R^K,

y_multi-class = softmax(Cσ(Ax + b) + d)
y_multi-label = sigmoid(Cσ(Ax + b) + d)

(diagrams: x → [Ax + b] → z1 → σ(·) → z2 → [Cz2 + d] → z3 → softmax(·) or sigmoid(·) → y, with parameters A, b and C, d; dimensions n → m → m → K → K)
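A minimal PyTorch sketch of the multi-class model (the sizes n, m, K are placeholders; there is deliberately no softmax layer, since the cross-entropy loss on the next slide expects raw logits):

import torch.nn as nn

n, m, K = 16, 32, 5                 # illustrative sizes
model = nn.Sequential(
    nn.Linear(n, m),                # Ax + b
    nn.Sigmoid(),                   # elementwise σ(·)
    nn.Linear(m, K),                # Cz + d (logits)
)

For the multi-label case the same network is used, with a sigmoid applied to each output (e.g., inside BCEWithLogitsLoss).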
Common Loss Functions
▶ loss functions tell the optimizer what to do
▶ a surrogate for what we really care about
▶ typically decompose over training examples, (x, y) ∼ D
Mean Square Error. standard loss for regression, (1/2)∥f_θ(x) − y∥²
0-1 Loss. what we care about for classification, 0 if [[f_θ(x) ≥ 0]] = y and 1 otherwise;
but the 0-1 loss is non-differentiable, so it is difficult to optimize
Negative Log-likelihood. standard loss for classification, −log p_θ(y | x)
Cross Entropy. negative log-likelihood loss where p_θ is the multi-class logistic, p_θ ∝ exp(f_θ(x))
Binary Cross Entropy. negative log-likelihood loss for K independent binary variables,

−∑_{k=1}^{K} [ y_k log p_{θ,k}(1 | x) + (1 − y_k) log p_{θ,k}(0 | x) ]
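For reference, the cross-entropy and binary cross-entropy losses correspond to standard PyTorch modules (the batch size and random values below are illustrative):

import torch
import torch.nn as nn

logits = torch.randn(8, 5)                          # batch of 8, K = 5 classes
targets = torch.randint(0, 5, (8,))                 # class indices in {0, ..., 4}
ce = nn.CrossEntropyLoss()(logits, targets)         # softmax + negative log-likelihood on raw logits

attr_logits = torch.randn(8, 5)                     # K = 5 independent binary attributes
attr_targets = torch.randint(0, 2, (8, 5)).float()
bce = nn.BCEWithLogitsLoss()(attr_logits, attr_targets)   # sigmoid + binary cross entropy
print(ce.item(), bce.item())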
Gradient Descent Optimization
(figure: a 1D function f(x) with gradient descent steps)
▶ function decreases in the negative gradient direction
Multi-variate Calculus Review: Gradients
Let f : Rn → R be some function, fix some x ∈ Rn, and consider the expression
lim_{α→0}  ( f(x + α e_i) − f(x) ) / α

where e_i = (0, . . . , 1, 0, . . .) is the i-th canonical vector.
If the limit exists, it’s called the i-th partial derivative of f at x and is denoted by ∂f(x)/∂x_i.
The gradient of f(x) with respect to x ∈ Rn is the vector of partial derivatives
∇xf(x) =
∂f
∂x1
...
∂f
∂xn
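This definition translates directly into a finite-difference gradient check, useful later for verifying hand-derived gradients (a central difference is used here for accuracy; the function name is my own):

import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Approximate ∇_x f(x) by central finite differences."""
    g = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e[i] = 1.0                                    # i-th canonical vector
        g[i] = (f(x + h * e) - f(x - h * e)) / (2 * h)
    return g

print(numerical_gradient(lambda v: v[0] ** 2 + 3 * v[1], np.array([2.0, 1.0])))  # ≈ [4, 3]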
Gradient Descent Algorithm
θ^{(t+1)} = θ^{(t)} − η ∇ℓ(θ^{(t)})
▶ η is the (iteration dependent) step size or step length or learning rate
▶ ideally chosen by line search (expensive)
▶ instead, in deep learning, chosen by fixed schedule
def gradient_descent(loss, theta0):
    """Generic gradient descent method (gradient, line_search, converged, and max_iters are assumed to be defined elsewhere)."""

    theta = theta0
    for t in range(max_iters):
        dtheta = -1.0 * gradient(loss, theta)   # descent direction is the negative of the gradient at theta
        eta = line_search(loss, theta, dtheta)  # choose step size by line search
        theta = theta + eta * dtheta            # update

        if converged(theta):                    # check stopping criteria
            break

    return theta
▶ stopping criterion usually of the form ∥∇ℓ(θ)∥_2 ≤ ϵ
▶ very simple—just need gradient of ℓ with respect to θ—but can be very slow
(Vanilla) Gradient Descent Optimization in PyTorch
import torch

dataloader = ...   # way of loading batches of input-label pairs for training
model = ...        # definition of our prediction function

criterion = torch.nn.CrossEntropyLoss(reduction='mean')
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(max_epochs):
    for iteration, batch in enumerate(dataloader, 0):
        inputs, labels = batch

        # zero gradient buffers
        optimizer.zero_grad()

        # compute model outputs for given inputs
        outputs = model(inputs)

        # compute and print the loss
        loss = criterion(outputs, labels)
        print(loss.item())

        # compute gradient of the loss wrt model parameters
        loss.backward()

        # take a gradient step
        optimizer.step()
Example Quadratic Problem in R2
ℓ(θ) = (1/2)( θ_1² + γ θ_2² )   (γ > 0)

with exact line search, starting at θ^{(0)} = (γ, 1):

θ_1^{(t)} = γ ( (γ − 1)/(γ + 1) )^t ,   θ_2^{(t)} = ( −(γ − 1)/(γ + 1) )^t
(figure: zig-zagging iterates θ_init, θ^{(1)}, . . . on the level sets of ℓ)
▶ very slow if γ ≫ 1 or γ ≪ 1
▶ example above is for γ = 10
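A short numerical sketch of this example (for a quadratic with Hessian H, exact line search has the closed form η = g^T g / g^T H g; γ = 10 as in the figure):

import numpy as np

gamma = 10.0
H = np.diag([1.0, gamma])            # Hessian of ℓ(θ) = ½(θ1² + γ θ2²)
theta = np.array([gamma, 1.0])       # θ^(0) = (γ, 1)

for t in range(1, 11):
    g = H @ theta                    # gradient ∇ℓ(θ) = (θ1, γ θ2)
    eta = (g @ g) / (g @ H @ g)      # exact line search step for a quadratic
    theta = theta - eta * g
    rho = (gamma - 1) / (gamma + 1)
    closed_form = np.array([gamma * rho**t, (-rho)**t])
    assert np.allclose(theta, closed_form)   # iterates match the closed form above

print(theta)   # converging only slowly toward the optimum θ* = 0 when γ = 10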
Calculating Gradients: Worked Example
▶ two-layer perceptron with sigmoid activation
z = σ(Ax + b)
y = σ(c^T z + d)

where σ(ξ) = (1 + e^{−ξ})^{−1} is applied elementwise
▶ an arbitrary loss function ℓ to minimize
▶ for gradient descent we need dℓ/dA, dℓ/db, etc.
▶ observe that

dσ/dξ = (−e^{−ξ})(−1)(1 + e^{−ξ})^{−2}
      = ( 1/(1 + e^{−ξ}) ) ( e^{−ξ}/(1 + e^{−ξ}) )
      = σ(ξ)(1 − σ(ξ))
Calculating Gradients: Worked Example (cont.)
▶ let ξ = c^T z + d
▶ layer 2 parameters:

∂ℓ/∂d = (dℓ/dy)(∂y/∂d) = (dℓ/dy)(dy/dξ)(∂ξ/∂d) = (dℓ/dy) y(1 − y)

∇_c ℓ = (dℓ/dy) ∇_c y = (dℓ/dy)(dy/dξ) ∇_c ξ = (dℓ/dy) y(1 − y) z
▶ where the last step in each line follows since y = σ(ξ), so dy/dξ = y(1 − y)
Calculating Gradients: Worked Example (cont.)
▶ layer 2 input / layer 1 output:
∇_z ℓ = (dℓ/dy) y(1 − y) c

▶ layer 1 parameters:

∇_b ℓ = ∇_z ℓ ∘ z ∘ (1 − z)

∇_A ℓ = (dℓ/dy) y(1 − y) (c ∘ z ∘ (1 − z)) x^T

where ∘ denotes elementwise product
▶ note that we have reused calculations from evaluation of ℓ; we could have written
y and z in terms of x in the gradient expressions (memory-vs-compute trade-off)
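Putting the worked example together, here is a minimal NumPy sketch that evaluates these gradient expressions and checks one of them against a finite difference (the sizes, the choice ℓ(y) = ½y², and the random values are all illustrative):

import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
A, b = rng.normal(size=(m, n)), rng.normal(size=m)
c, d = rng.normal(size=m), rng.normal()
x = rng.normal(size=n)

sigma = lambda v: 1.0 / (1.0 + np.exp(-v))

# forward pass; take ℓ(y) = ½ y² as the arbitrary loss, so dℓ/dy = y
z = sigma(A @ x + b)
y = sigma(c @ z + d)
dl_dy = y

# gradient expressions from the slides
grad_d = dl_dy * y * (1 - y)
grad_c = dl_dy * y * (1 - y) * z
grad_z = dl_dy * y * (1 - y) * c
grad_b = grad_z * z * (1 - z)
grad_A = np.outer(dl_dy * y * (1 - y) * c * z * (1 - z), x)

# finite-difference check on d (the other gradients can be checked the same way)
h = 1e-6
loss_of_d = lambda dd: 0.5 * sigma(c @ sigma(A @ x + b) + dd) ** 2
print(grad_d, (loss_of_d(d + h) - loss_of_d(d - h)) / (2 * h))   # should agree closely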