xuebaunion@vip.163.com

3551 Trousdale Rkwy, University Park, Los Angeles, CA

留学生论文指导和课程辅导

无忧GPA：https://www.essaygpa.com

工作时间：全年无休-早上8点到凌晨3点

微信客服：xiaoxionga100

微信客服：ITCS521

CS 361-CS361程序代写案例

时间：2021-04-29

CS 361: Probability and Statistics for Computer Science (Spring 2021)

Stochastic Optimization Project

1 Stochastic Optimization Theory Part

1.1 A common stochastic optimization task

In many machine learning problems, we are trying to minimize function f(θ) in the following format.

f(θ) =

1

k

k∑

j=1

Q(θ, j) (1.1)

where Q(θ, j) is the loss function for jth data point where we have k training data points in the data set D. Here,

θ is the set of parameters of the model that we try to optimize for, that is we want to find θ∗ = arg minθ f(θ).

However, f is either unknown or very expensive to collect, but we have access to the noisy version g(θ). Using

g(θ) to find the optimized θ∗ for f(θ) would be a stochastic optimization task.

Exercise 1.

(4 points) Define the random variable g(θ) specifically here to be Q(θ, i), where i is a random variable that

is of uniform distribution in the integer set [1 : k]. So, ∇g(θ) is to be ∇Q(θ, i). Further, we define the noise

z = Q(θ, i)− f(θ), hence g(θ) = Q(θ, i) = f(θ) + z. That’s why we call g is the noisy version of f . Given this

definition of f and g, prove that equations E[g(θ)] = f(θ), E[∇g(θ)] = ∇f(θ), and E[z] = 0 are satisfied.

1.2 Review of GD and SGD

The goal of optimization is to find the θ∗ that minimizes f(θ), which could be more general than that in (1.1).

Again sometimes f is either unknown or very expensive to collect, but we have access to the noisy version g(θ).

The following are two approximation methods to find θ∗ iteratively.

1.2.1 Gradient Descent Algorithm (GD)

• Initialize θ1

• For t = 1, 2, · · · Then do the following update

θt+1 = θt − ηt∇f(θt) (1.2)

1.2.2 Stochastic Gradient Descent Algorithm (SGD)

• Initialize θ1

• For t = 1, 2, · · · . Obtain a noisy version of the gradient (i.e. ∇g(θt)). Then do the following update

θt+1 = θt − ηt∇g(θt) (1.3)

1.3 Convergence rates for SGD and GD

Suppose we want to minimize function f with learning rate sequence ηt, We have the following theorems:

Theorem 1.1 Assume ∇f has a unique root θ∗. If f is twice continuously differentiable and strongly convex,

and ηt = O(t

−1), then to reach an approximation error = |E[f(θt)] − f(θ∗)|, stochastic gradient descent

requires t = O( 1 ) updates.

1

– Stochastic Optimization Project 2

Theorem 1.2 Assume ∇f has a unique root θ∗. If either the smoothness assumptions about f in theorem 1.1

or ηt = O(t

−1) is not met, then to reach an approximation error = |E[f(θt)] − f(θ∗)|, stochastic gradient

descent requires t = O( 12 ) updates.

Theorem 1.3 Assume ∇f has a unique root θ∗. To reach an approximation error = |E[f(θt)] − f(θ∗)|,

gradient descent requires t = O(ln( 1 )) updates.

2 Comparing SGD vs GD for their running time

Let us consider the f minimization task discussed earlier in (1.1).

• f(θ) and g(θ) are as described in exercise 1, ∇f(θ∗) = 0.

• Let’s assume that f is strongly convex, and is twice continuously differentiable.

• We use the ηt = 1t learning rate sequence.

• We have k data points.

• We want to achieve a training precision of = |E[f(θt)]− f(θ∗)|.

• Assume finding ∇Q(θ, j) takes c time, and finding ∇f(θ) takes kc time.

Exercise 2.

(15 points) Fill in table 1, with complexity orders in terms of k and

Table 1 Comparing SGD vs GD in terms of training precision

SGD GD

Computational Cost per Update O(1)

Number of updates to reach

Total Cost

Explain your answer for each element of the second row by providing a reference to the theorem

mentioned in this document. For every other element, explain your answer.

3 Comparing SGD vs GD with respect to test loss

The last exercise was about finding the computational complexity required to reach a specific precision with

respect to the optimal training loss. However, a specific precision of training loss does not translate to the

same precision of test loss. Here are some helpful facts:

• To achieve a specific test loss precision ρ, you need a large enough number of training samples k. The

relation

ρ = O(k−γ)

must hold for most loss functions where 0 < γ ≤ 1 is an unknown constant.

• Let’s assume = Θ(ρ); for simplicity, you can assume that it is as if is c×ρ where c is a positive constant.

Exercise 3.

(16 points) 1) Fill in table 2, with complexity orders in terms of ρ and γ. You can refer to the previous table,

and replace the elements with appropriate values. Make sure you state the reason for each element even

if it looks obvious.

– Stochastic Optimization Project 3

Table 2 Comparing SGD vs GD in terms of testing precision

SGD GD

Computational Cost per Update O(1)

Number of updates to reach ρ

Total Cost

2) For a typical γ = 0.2, Explain why choosing GD over SGD can be very unwise.

CS 361: Probability and Statistics for Computer Science (Spring 2021)

Stochastic Optimization Implementation

Objective: Implement state-of-the-art stochastic optimization algorithms for Machine Learning Problems,

Linear Regression and Classification (using Neural Networks).

3.1 Adaptive Momentum Estimation (ADAM) Gradient Descent Algorithm

SGD can be ineffective when the function is highly non-convex. Unfortunately, most modern applications of

Machine Learning involve non-convex optimization problems. One stochastic optimization algorithm that works

well even for non-convexity is ADAM [KB14]. ADAM uses past gradient information to “speed” up optimization

by smoothing, and the algoirthm has been further improved [SSS18]. You will implement the ADAM stochastic

optimization algorithm for a Linear Regression problem.

The pseudo-code for ADAM has been reproduced here from this paper. Credits to [KB14].

Disclaimer: The textbook, in Chapter 13, uses β for parameters but we will be using θ.

Algorithm 1: g2t indicates the elementwise square gtgt. Good default settings for the tested machine learning

problems are α = 0.001, β1 = 0.9, β2 = 0.999 and = 10

−8. All operations on vectors are element-wise. With

βt1 and β

t

2, we denote β1 and β2 to the power t.

Require: α: Stepsize

Require: : Division-by-zero control parameter

Require: β1, β2 ∈ [0, 1): Exponential decay rates for the moment estimates

Require: f(θ): Stochastic objective function with parameters θ.

Require: θ0: Initial parameter vector

m0 ← 0 (Initialize 1st moment vector)

v0 ← 0 (Initialize 2nd moment vector)

t← 0 (Initialize timestep)

while θt not converged do

t← t+ 1

gt ← ∇θft(θt−1) (Get gradients w.r.t. stochastic objective at timestep t)

mt ← β1 ·mt−1 + (1− β1) · gt (Update biased first moment estimate)

vt ← β2 · vt−1 + (1− β2) · g2t (Update biased second raw moment estimate)

mˆt ← mt/(1− βt1) (Compute bias-corrected first moment estimate).

vˆt ← vt/(1− βt2) (Compute bias-correct second raw moment estimate).

θˆt ← θt−1 − α · mˆt/(

√

vˆt + ) (Update parameters)

end while

return θt (Resulting parameters)

Exercise 4.

Consider the following problem setting:

• The number of training data points is k = 1000. The number of test data points is 50.

• Generation of the data set to fit for the model: the jth data point is represented by xj and is a 10-

dimensional vector. Each element of xj is being generated from a uniform distribution over the interval

[−1, 1].

• θtrue (i.e. the true parameter set) and θ (i.e. the variable indicating a candidate parameter set) are also

10-dimensional vectors. Assume that θtrue is all ones. However, we will pretend that we do not know it

and we want to estimate it using the training data.

• Data is being generated according to yj = xTj θtrue+0.1 ·ξ (where ξ ∼ N (0, 1) is a sample from the normal

distribution). The label yj is a scalar.

1

– Stochastic Optimization Implementation 2

• Let us assume that all the algorithm variants start from the same initial θ, whose elements are picked

uniformly in the interval [0, 0.1].

• Use 1000 updates.

Since this is an optimization problem, we need to define the loss function. Let’s use the following format

F (θ) =

1

k

k∑

j=1

Q(θ, j) (3.1)

Q(θ, j) =

∣∣∣∣xTj θ − yj∣∣∣∣γ (3.2)

Where γ is a hyper-parameter that we use to control and define the loss function.

For the following problems, state findings, provide analysis, and substantiate them in your write-up with screen-

shots of the plots from the Gradient Descent Python notebook code. Please save the notebook to your

own Google drive so that your work could be preserved after the runtime expires.

1. (5 points) For γ = 2, the problem becomes the least squares regression that you learned from Chapter

13. State the closed form solution (i.e. θ = ...) in your report, and then implement the solution to solve

for θ. Provide the results of the experiment and state whether it is close to the true value.

2. (5 points) For Q(θ, j), find an expression that gives you the gradient ∇Q(θ, j). Report this expression,

and implement it in the appropriate function.

hint: For r(θ) = h(e(θ)) = h(xTj θ − yj), the gradient can be written as ∇r(θ) = ∂h∂e · ∇e(θ) = ∂h∂e · xj

according to the chain rule.

hint: The sign function, sgn(x), may be useful.

3. (5 points) Code the ADAM optimization algorithm (with default hyper-parameters such as the learning

rate as mentioned in the pseudocode above) and SGD to find the best parameter θ. Use a batch size of 1

for this problem, and perform 1000 parameter updates. Report the final set of parameters.

4. (5 points) Update your code to compute the average test loss at every 50th update, and plot these values.

You might notice that the error bars of ADAM and SGD overlap. This is due to the inherent randomness

from sampling values. To avoid this probabilistic overlap, increase the number of replicates (num rep in

the starter code) until the error bars between ADAM and SGD do not overlap. Report this curve.

5. (9 points) Run ADAM method for each of γ = 0.4, 0.7, 1, 2, 3, and 5. Report your observations clearly,

and analyze the trends you are seeing. State whether ADAM consistently out performs SGD. Your analysis

should include the reason why one method outperforms the other under each setting.

Example Plot

Note: Your plots will probably not end up looking exactly like this one.

– Stochastic Optimization Implementation 3

3.2 Classifying Handwritten Digits with Neural Networks

Next, we’ll use Neural Networks to classify handwritten digits from MNIST dataset (dataset of handwritten

digits). The objective is to use different stochastic optimization algorithms that you have seen so far and

compare their performances (GD, SGD, ADAM). You will train the model and then classify handwritten digits.

Fun fact: One of the co-creators of MNIST dataset (Dr. Yann LeCun) is also the co-recicpient of 2018 ACM

Turing award for his work on Neural Networks.

Exercise 5.

For the following problems, please modify the Neural Networks Python notebook code.

1. To run the Gradient Descent (GD) Algorithm: Set the (mini) batch size to 60000 (the size of the MNIST

dataset). Run the SGD optimizer with a learning rate of 0.003.

(1 points) Why is this the same as the GD algorithm if we are using SGD optimizer?

(2 points) What do you observe? Report the accuracy.

(2 points) List at least 2 ways to improve the accuracy.

2. To run the Stochastic Gradient Descent (SGD) Algorithm: Set the (mini) batch size to 64. Run the SGD

optimizer with a learning rate of 0.003.

(2 points) What do you observe? Report the accuracy.

(1 points) List at least one way to improve accuracy further.

3. To run the ADAM algorithm: Set the (mini) batch size to 64. Run the ADAM optimizer with default

learning rates (Algorithm 1)

Hint: You can use Pytorch (or any other library in any language) for setting up the ADAM optimizer.

(2 points) What do you observe? Report the accuracy.

(1 points) Why does ADAM converge faster than SGD?

– Stochastic Optimization Implementation 4

References

[RM51] Herbert Robbins and Sutton Monro. “A stochastic approximation method”. In: The annals of

mathematical statistics (1951), pp. 400–407.

[Sac58] Jerome Sacks. “Asymptotic distribution of stochastic approximation procedures”. In: The Annals

of Mathematical Statistics 29.2 (1958), pp. 373–405.

[NY83] Semenovich Nemirovsky Arkadi and David Borisovich Yudin. “Problem complexity and method

efficiency in optimization.” In: Wiley-Interscience series in discrete mathematics. (1983).

[CZ07] Felipe Cucker and Ding Xuan Zhou. Learning theory: an approximation theory viewpoint. Vol. 24.

Cambridge University Press, 2007.

[Nem+09] Arkadi Nemirovski et al. “Robust stochastic approximation approach to stochastic programming”.

In: SIAM Journal on optimization 19.4 (2009), pp. 1574–1609.

[Bot10] Le´on Bottou. “Large-scale machine learning with stochastic gradient descent”. In: Proceedings of

COMPSTAT’2010. Springer, 2010, pp. 177–186.

[Bot12] Le´on Bottou. “Stochastic gradient descent tricks”. In: Neural networks: Tricks of the trade. Springer,

2012, pp. 421–436.

[DGL13] Luc Devroye, La´szlo´ Gyo¨rfi, and Ga´bor Lugosi. A probabilistic theory of pattern recognition. Vol. 31.

Springer Science & Business Media, 2013.

[JZ13] Rie Johnson and Tong Zhang. “Accelerating stochastic gradient descent using predictive variance

reduction”. In: Advances in neural information processing systems. 2013, pp. 315–323.

[Vap13] Vladimir Vapnik. The nature of statistical learning theory. Springer science & business media, 2013.

[KB14] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. 2014. arXiv:

1412.6980 [cs.LG].

[MV18] Pierre Moulin and Venugopal Veeravalli. Statistical inference for engineers and data scientists.

Cambridge University Press, 2018.

[SSS18] Sashank, Satyen, and Sanjiv. On the Convergence of Adam and Beyond. 2018.

Acknowledgements

Ehsan Saleh, Anay Pattanik created the first draft of the project outline.

Hongye Liu, Ajay Fewell, Chenhui Zhang, Muhammed Imran, Brian Yang, and Jinglin contributed to the newest

edition.

This document was compiled and inspired by the work and ideas shown in [RM51; Sac58; NY83; Nem+09;

Bot10; JZ13; Bot12; Vap13; DGL13; CZ07; MV18]

Stochastic Optimization Project

1 Stochastic Optimization Theory Part

1.1 A common stochastic optimization task

In many machine learning problems, we are trying to minimize function f(θ) in the following format.

f(θ) =

1

k

k∑

j=1

Q(θ, j) (1.1)

where Q(θ, j) is the loss function for jth data point where we have k training data points in the data set D. Here,

θ is the set of parameters of the model that we try to optimize for, that is we want to find θ∗ = arg minθ f(θ).

However, f is either unknown or very expensive to collect, but we have access to the noisy version g(θ). Using

g(θ) to find the optimized θ∗ for f(θ) would be a stochastic optimization task.

Exercise 1.

(4 points) Define the random variable g(θ) specifically here to be Q(θ, i), where i is a random variable that

is of uniform distribution in the integer set [1 : k]. So, ∇g(θ) is to be ∇Q(θ, i). Further, we define the noise

z = Q(θ, i)− f(θ), hence g(θ) = Q(θ, i) = f(θ) + z. That’s why we call g is the noisy version of f . Given this

definition of f and g, prove that equations E[g(θ)] = f(θ), E[∇g(θ)] = ∇f(θ), and E[z] = 0 are satisfied.

1.2 Review of GD and SGD

The goal of optimization is to find the θ∗ that minimizes f(θ), which could be more general than that in (1.1).

Again sometimes f is either unknown or very expensive to collect, but we have access to the noisy version g(θ).

The following are two approximation methods to find θ∗ iteratively.

1.2.1 Gradient Descent Algorithm (GD)

• Initialize θ1

• For t = 1, 2, · · · Then do the following update

θt+1 = θt − ηt∇f(θt) (1.2)

1.2.2 Stochastic Gradient Descent Algorithm (SGD)

• Initialize θ1

• For t = 1, 2, · · · . Obtain a noisy version of the gradient (i.e. ∇g(θt)). Then do the following update

θt+1 = θt − ηt∇g(θt) (1.3)

1.3 Convergence rates for SGD and GD

Suppose we want to minimize function f with learning rate sequence ηt, We have the following theorems:

Theorem 1.1 Assume ∇f has a unique root θ∗. If f is twice continuously differentiable and strongly convex,

and ηt = O(t

−1), then to reach an approximation error = |E[f(θt)] − f(θ∗)|, stochastic gradient descent

requires t = O( 1 ) updates.

1

– Stochastic Optimization Project 2

Theorem 1.2 Assume ∇f has a unique root θ∗. If either the smoothness assumptions about f in theorem 1.1

or ηt = O(t

−1) is not met, then to reach an approximation error = |E[f(θt)] − f(θ∗)|, stochastic gradient

descent requires t = O( 12 ) updates.

Theorem 1.3 Assume ∇f has a unique root θ∗. To reach an approximation error = |E[f(θt)] − f(θ∗)|,

gradient descent requires t = O(ln( 1 )) updates.

2 Comparing SGD vs GD for their running time

Let us consider the f minimization task discussed earlier in (1.1).

• f(θ) and g(θ) are as described in exercise 1, ∇f(θ∗) = 0.

• Let’s assume that f is strongly convex, and is twice continuously differentiable.

• We use the ηt = 1t learning rate sequence.

• We have k data points.

• We want to achieve a training precision of = |E[f(θt)]− f(θ∗)|.

• Assume finding ∇Q(θ, j) takes c time, and finding ∇f(θ) takes kc time.

Exercise 2.

(15 points) Fill in table 1, with complexity orders in terms of k and

Table 1 Comparing SGD vs GD in terms of training precision

SGD GD

Computational Cost per Update O(1)

Number of updates to reach

Total Cost

Explain your answer for each element of the second row by providing a reference to the theorem

mentioned in this document. For every other element, explain your answer.

3 Comparing SGD vs GD with respect to test loss

The last exercise was about finding the computational complexity required to reach a specific precision with

respect to the optimal training loss. However, a specific precision of training loss does not translate to the

same precision of test loss. Here are some helpful facts:

• To achieve a specific test loss precision ρ, you need a large enough number of training samples k. The

relation

ρ = O(k−γ)

must hold for most loss functions where 0 < γ ≤ 1 is an unknown constant.

• Let’s assume = Θ(ρ); for simplicity, you can assume that it is as if is c×ρ where c is a positive constant.

Exercise 3.

(16 points) 1) Fill in table 2, with complexity orders in terms of ρ and γ. You can refer to the previous table,

and replace the elements with appropriate values. Make sure you state the reason for each element even

if it looks obvious.

– Stochastic Optimization Project 3

Table 2 Comparing SGD vs GD in terms of testing precision

SGD GD

Computational Cost per Update O(1)

Number of updates to reach ρ

Total Cost

2) For a typical γ = 0.2, Explain why choosing GD over SGD can be very unwise.

CS 361: Probability and Statistics for Computer Science (Spring 2021)

Stochastic Optimization Implementation

Objective: Implement state-of-the-art stochastic optimization algorithms for Machine Learning Problems,

Linear Regression and Classification (using Neural Networks).

3.1 Adaptive Momentum Estimation (ADAM) Gradient Descent Algorithm

SGD can be ineffective when the function is highly non-convex. Unfortunately, most modern applications of

Machine Learning involve non-convex optimization problems. One stochastic optimization algorithm that works

well even for non-convexity is ADAM [KB14]. ADAM uses past gradient information to “speed” up optimization

by smoothing, and the algoirthm has been further improved [SSS18]. You will implement the ADAM stochastic

optimization algorithm for a Linear Regression problem.

The pseudo-code for ADAM has been reproduced here from this paper. Credits to [KB14].

Disclaimer: The textbook, in Chapter 13, uses β for parameters but we will be using θ.

Algorithm 1: g2t indicates the elementwise square gtgt. Good default settings for the tested machine learning

problems are α = 0.001, β1 = 0.9, β2 = 0.999 and = 10

−8. All operations on vectors are element-wise. With

βt1 and β

t

2, we denote β1 and β2 to the power t.

Require: α: Stepsize

Require: : Division-by-zero control parameter

Require: β1, β2 ∈ [0, 1): Exponential decay rates for the moment estimates

Require: f(θ): Stochastic objective function with parameters θ.

Require: θ0: Initial parameter vector

m0 ← 0 (Initialize 1st moment vector)

v0 ← 0 (Initialize 2nd moment vector)

t← 0 (Initialize timestep)

while θt not converged do

t← t+ 1

gt ← ∇θft(θt−1) (Get gradients w.r.t. stochastic objective at timestep t)

mt ← β1 ·mt−1 + (1− β1) · gt (Update biased first moment estimate)

vt ← β2 · vt−1 + (1− β2) · g2t (Update biased second raw moment estimate)

mˆt ← mt/(1− βt1) (Compute bias-corrected first moment estimate).

vˆt ← vt/(1− βt2) (Compute bias-correct second raw moment estimate).

θˆt ← θt−1 − α · mˆt/(

√

vˆt + ) (Update parameters)

end while

return θt (Resulting parameters)

Exercise 4.

Consider the following problem setting:

• The number of training data points is k = 1000. The number of test data points is 50.

• Generation of the data set to fit for the model: the jth data point is represented by xj and is a 10-

dimensional vector. Each element of xj is being generated from a uniform distribution over the interval

[−1, 1].

• θtrue (i.e. the true parameter set) and θ (i.e. the variable indicating a candidate parameter set) are also

10-dimensional vectors. Assume that θtrue is all ones. However, we will pretend that we do not know it

and we want to estimate it using the training data.

• Data is being generated according to yj = xTj θtrue+0.1 ·ξ (where ξ ∼ N (0, 1) is a sample from the normal

distribution). The label yj is a scalar.

1

– Stochastic Optimization Implementation 2

• Let us assume that all the algorithm variants start from the same initial θ, whose elements are picked

uniformly in the interval [0, 0.1].

• Use 1000 updates.

Since this is an optimization problem, we need to define the loss function. Let’s use the following format

F (θ) =

1

k

k∑

j=1

Q(θ, j) (3.1)

Q(θ, j) =

∣∣∣∣xTj θ − yj∣∣∣∣γ (3.2)

Where γ is a hyper-parameter that we use to control and define the loss function.

For the following problems, state findings, provide analysis, and substantiate them in your write-up with screen-

shots of the plots from the Gradient Descent Python notebook code. Please save the notebook to your

own Google drive so that your work could be preserved after the runtime expires.

1. (5 points) For γ = 2, the problem becomes the least squares regression that you learned from Chapter

13. State the closed form solution (i.e. θ = ...) in your report, and then implement the solution to solve

for θ. Provide the results of the experiment and state whether it is close to the true value.

2. (5 points) For Q(θ, j), find an expression that gives you the gradient ∇Q(θ, j). Report this expression,

and implement it in the appropriate function.

hint: For r(θ) = h(e(θ)) = h(xTj θ − yj), the gradient can be written as ∇r(θ) = ∂h∂e · ∇e(θ) = ∂h∂e · xj

according to the chain rule.

hint: The sign function, sgn(x), may be useful.

3. (5 points) Code the ADAM optimization algorithm (with default hyper-parameters such as the learning

rate as mentioned in the pseudocode above) and SGD to find the best parameter θ. Use a batch size of 1

for this problem, and perform 1000 parameter updates. Report the final set of parameters.

4. (5 points) Update your code to compute the average test loss at every 50th update, and plot these values.

You might notice that the error bars of ADAM and SGD overlap. This is due to the inherent randomness

from sampling values. To avoid this probabilistic overlap, increase the number of replicates (num rep in

the starter code) until the error bars between ADAM and SGD do not overlap. Report this curve.

5. (9 points) Run ADAM method for each of γ = 0.4, 0.7, 1, 2, 3, and 5. Report your observations clearly,

and analyze the trends you are seeing. State whether ADAM consistently out performs SGD. Your analysis

should include the reason why one method outperforms the other under each setting.

Example Plot

Note: Your plots will probably not end up looking exactly like this one.

– Stochastic Optimization Implementation 3

3.2 Classifying Handwritten Digits with Neural Networks

Next, we’ll use Neural Networks to classify handwritten digits from MNIST dataset (dataset of handwritten

digits). The objective is to use different stochastic optimization algorithms that you have seen so far and

compare their performances (GD, SGD, ADAM). You will train the model and then classify handwritten digits.

Fun fact: One of the co-creators of MNIST dataset (Dr. Yann LeCun) is also the co-recicpient of 2018 ACM

Turing award for his work on Neural Networks.

Exercise 5.

For the following problems, please modify the Neural Networks Python notebook code.

1. To run the Gradient Descent (GD) Algorithm: Set the (mini) batch size to 60000 (the size of the MNIST

dataset). Run the SGD optimizer with a learning rate of 0.003.

(1 points) Why is this the same as the GD algorithm if we are using SGD optimizer?

(2 points) What do you observe? Report the accuracy.

(2 points) List at least 2 ways to improve the accuracy.

2. To run the Stochastic Gradient Descent (SGD) Algorithm: Set the (mini) batch size to 64. Run the SGD

optimizer with a learning rate of 0.003.

(2 points) What do you observe? Report the accuracy.

(1 points) List at least one way to improve accuracy further.

3. To run the ADAM algorithm: Set the (mini) batch size to 64. Run the ADAM optimizer with default

learning rates (Algorithm 1)

Hint: You can use Pytorch (or any other library in any language) for setting up the ADAM optimizer.

(2 points) What do you observe? Report the accuracy.

(1 points) Why does ADAM converge faster than SGD?

– Stochastic Optimization Implementation 4

References

[RM51] Herbert Robbins and Sutton Monro. “A stochastic approximation method”. In: The annals of

mathematical statistics (1951), pp. 400–407.

[Sac58] Jerome Sacks. “Asymptotic distribution of stochastic approximation procedures”. In: The Annals

of Mathematical Statistics 29.2 (1958), pp. 373–405.

[NY83] Semenovich Nemirovsky Arkadi and David Borisovich Yudin. “Problem complexity and method

efficiency in optimization.” In: Wiley-Interscience series in discrete mathematics. (1983).

[CZ07] Felipe Cucker and Ding Xuan Zhou. Learning theory: an approximation theory viewpoint. Vol. 24.

Cambridge University Press, 2007.

[Nem+09] Arkadi Nemirovski et al. “Robust stochastic approximation approach to stochastic programming”.

In: SIAM Journal on optimization 19.4 (2009), pp. 1574–1609.

[Bot10] Le´on Bottou. “Large-scale machine learning with stochastic gradient descent”. In: Proceedings of

COMPSTAT’2010. Springer, 2010, pp. 177–186.

[Bot12] Le´on Bottou. “Stochastic gradient descent tricks”. In: Neural networks: Tricks of the trade. Springer,

2012, pp. 421–436.

[DGL13] Luc Devroye, La´szlo´ Gyo¨rfi, and Ga´bor Lugosi. A probabilistic theory of pattern recognition. Vol. 31.

Springer Science & Business Media, 2013.

[JZ13] Rie Johnson and Tong Zhang. “Accelerating stochastic gradient descent using predictive variance

reduction”. In: Advances in neural information processing systems. 2013, pp. 315–323.

[Vap13] Vladimir Vapnik. The nature of statistical learning theory. Springer science & business media, 2013.

[KB14] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. 2014. arXiv:

1412.6980 [cs.LG].

[MV18] Pierre Moulin and Venugopal Veeravalli. Statistical inference for engineers and data scientists.

Cambridge University Press, 2018.

[SSS18] Sashank, Satyen, and Sanjiv. On the Convergence of Adam and Beyond. 2018.

Acknowledgements

Ehsan Saleh, Anay Pattanik created the first draft of the project outline.

Hongye Liu, Ajay Fewell, Chenhui Zhang, Muhammed Imran, Brian Yang, and Jinglin contributed to the newest

edition.

This document was compiled and inspired by the work and ideas shown in [RM51; Sac58; NY83; Nem+09;

Bot10; JZ13; Bot12; Vap13; DGL13; CZ07; MV18]