Dartmouth CS70 Foundations of Applied Computer Science, 2021 Spring
Final Programming Assignment: Neural Network with Linear Algebra and Nonlinear Optimization

In the final project you will implement a neural network from scratch in Python by combining the mathematical modeling techniques you have learned over the quarter, including linear algebra, least squares, the chain rule, gradient descent, and nonlinear optimization. Your network will be tested on the classical task of hand-written digit classification with the MNIST dataset.

Background Reading

Before you get started, read the following materials to make sure you have enough background knowledge for a basic machine learning application. In particular, please make sure you are familiar with the concepts of matrix-matrix multiplication, feature vector, loss function, gradient descent optimization, and the chain rule. We also highly recommend finishing TA6 before working on these programming tasks.

• Course notes
• VMLS book, Chapters 13 and 14
• Flower book, Chapter 6
• Starter code (read through all of it, both comments and implementations)
• How to build your own Neural Network from scratch in Python
• More about forward
• More about gradient

Task Overview

You are asked to implement a data-fitting classifier that can predict the label of an input vector. Specifically, in the context of hand-written digit classification, the label is an integer between 0 and 9, and the input vector is a vectorized 28 × 28 image whose elements are floating-point numbers ranging from 0 to 1 (the gray scale of each pixel). You need to solve the problem in both the least-squares way and the neural-network way.

Appetizer Tasks (5 points)

Before diving into the world of neural networks, let's first try to solve the hand-written digit classification problem with our well-understood mathematical tool of least squares. We will use least squares to classify a subset of the MNIST hand-written digit dataset. The training dataset for this project has 1000 images, stored as a 1000 × 784 matrix in train_data.txt (you may refer to the constructor of the class Classifier to learn how to read a matrix from a text file). The testing dataset for this project has 200 images, stored as a 200 × 784 matrix in test_data.txt. The labels for the training set are stored as a 1D array (or a 1000 × 1 matrix) in train_labels.txt, and the labels for the testing set are in test_labels.txt.

Here is a brief, high-level description of the mathematical model you may implement. You are suggested to solve a least-squares problem XW = Y to fit your dataset, with X as a 1000 × 785 data matrix (with the bias term!), W as a 785 × 10 model matrix, and Y as a 1000 × 10 right-hand-side matrix of true values composed of 1 and −1 (or any values that distinguish the two classes). For ease of implementation, you may solve the problem as 10 separate least-squares problems, each with one column θ_i of W (a 785 × 1 vector) as the unknowns. After calculating W, you may use it to predict the class of a new image x by a simple vector-matrix multiplication v = xW, and then pick the index with the maximum value from that vector: p = argmax_i {v_i}. You need to test your least-squares classifier on the test dataset and report the accuracy of your least-squares model (by counting how many images are predicted correctly).

Hint: you may suffer from a singular X due to the large blank regions in each image. To solve the problem, you need to add a small random number to each element of X to make it solvable in your least-squares solver. For example, you may add noise in the range [0, 0.001] to each element of X to perturb its entries.
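For orientation, here is a minimal NumPy sketch of one possible least-squares pipeline. It follows the file layout and the perturbation hint above; the function and variable names, and the use of numpy.linalg.lstsq, are illustrative choices rather than the starter code's API. Solving XW = Y for all 10 columns at once is equivalent to solving the 10 separate least-squares problems described above.

```python
import numpy as np

# A minimal sketch, not the starter code's API: file names follow the handout,
# the function and variable names are illustrative.
def fit_least_squares(data_path="train_data.txt", labels_path="train_labels.txt"):
    X = np.loadtxt(data_path)                              # (1000, 784) image matrix
    labels = np.loadtxt(labels_path).astype(int)           # (1000,) digit labels
    X = np.hstack([X, np.ones((X.shape[0], 1))])           # append the bias column -> (1000, 785)
    X = X + np.random.uniform(0.0, 0.001, X.shape)         # perturb to avoid a singular system (hint above)
    # Right-hand side Y: column i is +1 where the label equals i and -1 elsewhere.
    Y = np.where(labels[:, None] == np.arange(10)[None, :], 1.0, -1.0)   # (1000, 10)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)              # (785, 10); solves all 10 columns at once
    return W

def predict(W, X):
    X = np.hstack([X, np.ones((X.shape[0], 1))])           # same bias augmentation as in training
    return np.argmax(X @ W, axis=1)                        # p = argmax_i v_i for each image

# Example accuracy check on the test set:
# W = fit_least_squares()
# X_test = np.loadtxt("test_data.txt")
# y_test = np.loadtxt("test_labels.txt").astype(int)
# print("accuracy:", (predict(W, X_test) == y_test).mean())
```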
Entree Tasks (10 points)

After building your least-squares approximators, your next step is to re-implement the same task with a neural network model to improve the learning performance. The key idea is to express the data flow of the neural network with multiple layers. Typically, you may build an architecture composed of alternating linear and nonlinear layers, ending with a loss-function layer (i.e., the structure of the network layers is linear -> nonlinear -> linear -> ... -> loss). For each layer, you need to implement two functions: forward propagation and backward propagation. You will chain the layers together and calculate the correct derivatives to optimize the loss function with gradient descent. The conceptual background and the technical details for implementing the network can be found in the course notes and the reading materials listed above.

The programming of our handcrafted neural network consists of the following steps:

1. Task 1: Linear layer

• Forward (Linear):
– Input: X (#samples, #features0)
– Output: Y (#samples, #features1)
– Stored data: input X (#samples, #features0)

In this function you need to forward the data from the previous layer to the next layer by performing a matrix-matrix multiplication. You need to store the input data because you will need it in the backward step (remember the intermediate data we discussed for our simple 1D network model in class). For simplicity, we do not consider the bias in our NN model.

• Backward (Linear):
– Input: ∂loss/∂Y (#samples, #features1)
– Output: ∂loss/∂X (#samples, #features0)
– Stored data: weight gradient matrix ∂loss/∂W (#features0, #features1)

The input of the backward function is ∂loss/∂Y, the partial derivative of the loss with respect to the layer's output Y. Notice that the shape of ∂loss/∂Y is the same as the shape of the output Y of the forward function. In particular, we have

∂loss/∂X = (∂loss/∂Y)(∂Y/∂X) = (∂loss/∂Y) W^T  (note that we always store the current weight W)
∂loss/∂W = X^T (∂loss/∂Y)  (remember that we stored X in the forward step, so you can use it here)

• Checkpoint: numerical derivative

There is a standard way to validate the correctness of an analytical derivative in computational mathematics. You can always calculate the numerical derivative by following its definition, df(x)/dx ≈ (f(x + Δx) − f(x)) / Δx, with a very small Δx. If your analytical derivatives are implemented correctly, the two results should be consistent with each other. We can use this idea to check the implementation of each network layer before assembling them together, as illustrated in the sketch below.
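As a concrete illustration, here is a minimal sketch of such a linear layer together with its finite-difference checkpoint. It assumes gradients are passed around as (#samples, #features) matrices; the class, attribute, and variable names are illustrative, not the starter code's API.

```python
import numpy as np

# A minimal sketch of the linear layer (no bias); names are illustrative.
class Linear:
    def __init__(self, in_features, out_features):
        self.W = 0.01 * np.random.randn(in_features, out_features)  # weight matrix (#features0, #features1)
        self.X = None                                                # input stored by the forward pass
        self.grad_W = None                                           # weight gradient stored by the backward pass

    def forward(self, X):
        self.X = X                       # keep X for the backward step
        return X @ self.W                # Y = X W

    def backward(self, grad_Y):
        self.grad_W = self.X.T @ grad_Y  # dloss/dW = X^T (dloss/dY)
        return grad_Y @ self.W.T         # dloss/dX = (dloss/dY) W^T

# Task 1 checkpoint: finite-difference check of one weight entry,
# using the dummy scalar "loss" = sum of all outputs (so dloss/dY is all ones).
layer = Linear(4, 3)
X = np.random.randn(2, 4)
base = np.sum(layer.forward(X))
layer.backward(np.ones((2, 3)))                       # fills layer.grad_W analytically
eps = 1e-6
layer.W[0, 0] += eps
numeric = (np.sum(layer.forward(X)) - base) / eps     # (f(W + ΔW) - f(W)) / ΔW
print(numeric, layer.grad_W[0, 0])                    # the two numbers should nearly match
```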
2. Task 2: Nonlinear activation function

• Forward (ReLU):
– Input: X (#samples, #features)
– Output: Y (#samples, #features)
– Stored data: input X (#samples, #features)

You will implement the ReLU activation function in this step. You need to create a matrix Y with the same size as the input (#samples, #features). The ReLU function is defined as

ReLU(x) = max(0, x),    (0.1)

which is an element-wise operation on the matrix. This means that you need to go through each element of the matrix and set Y(i, j) = max(0, X(i, j)). At the same time, you need to store the input data for future use in the backward step.

• Backward (ReLU):
– Input: ∂loss/∂Y (#samples, #features)
– Output: ∂loss/∂X (#samples, #features)

Note that the gradient of ReLU is

grad(ReLU(x)) = 1 if x > 0, and 0 if x ≤ 0.

Therefore, by the chain rule, we calculate each element of the output matrix as

∂loss/∂X(i, j) = ∂loss/∂Y(i, j) if X(i, j) > 0, and 0 if X(i, j) ≤ 0.

Notice that we need the value of X(i, j) that you stored in the forward step for the if-statement in this expression.

3. Task 3: MSE loss function

• Forward (Loss):
– Input: Ypred (#samples, #features), Ytruth (#samples, #features)
– Output: loss (scalar)
– Stored data: Ypred − Ytruth (#samples, #features)

Finally, we calculate the loss function. First, you need to store the difference between the prediction Ypred (output of the last linear layer) and the ground truth Ytruth as intermediate data (you will need it for the backward step). Then you can calculate the loss as

loss = (1 / #samples) ‖Ypred − Ytruth‖²₂.

• Backward (Loss):
– Input: no input
– Output: ∂loss/∂Ypred (#samples, #features)

It is interesting to notice that the gradient of the loss with respect to the prediction does not depend on the loss value itself:

∂loss/∂Ypred = (2 / #samples) (Ypred − Ytruth).

4. Task 4: Network architecture

The network contains a vector of linear layers and a vector of nonlinear layers. The sizes of the weights in the linear layers are specified by a vector of pairs of integers, which is the input of the constructor. Each pair corresponds to the weight matrix W of a layer, with the matrix dimensions specified by the pair.

• Forward (pipeline):
– Input: X (#samples, #features0)
– Output: Ypred (#samples, #features1)

You need to propagate the input X from the first layer to the last layer (before the loss function) by calling the forward function of each layer in sequence. The output of the previous forward is the input of the next forward. For example, for a network with k linear layers and k − 1 activation layers, the data flow could be: linear[0] → activation[0] → linear[1] → activation[1] → ... → linear[k−2] → activation[k−2] → linear[k−1]. Note that the first and the last layers should always be linear.

• Backward (pipeline):
– Input: ∂loss/∂Ypred (#samples, #features)
– Output: ∂loss/∂X (#samples, #features)

You need to propagate the gradient of the output (the one we got from the forward method) back through the network. The traversal order is reversed in the backward propagation.

• Checkpoint: regression

We provide a sample implementation of the training function for this regression checkpoint; there is no implementation requirement for this part. You may use this checkpoint to validate the correctness of your network architecture (by observing the decrease of your loss). At the same time, you may read how it implements gradient descent and write your own version in your next implementation. The function you need to read and mimic is Train_One_Epoch, in which you establish your gradient descent optimizer. You first forward the batched data through the network to obtain the loss. Then you backward the gradient of the loss through the loss function and the network. Remember that we have stored the gradients for all the weights in the backward pass, so you only need to update the weights with the stored weight gradients. Do not forget to multiply the negative gradients by the learning rate! Of course, you can also play with the learning rate to see how it affects the performance of the gradient descent. A minimal sketch of such a training step is given below.
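The following sketch shows the ReLU layer, the MSE loss, and one gradient-descent training step in the spirit of Train_One_Epoch. It reuses the Linear class sketched earlier; all class and function names here are illustrative, not the starter code's API.

```python
import numpy as np

# Minimal sketches of the ReLU layer and MSE loss (Tasks 2-3), plus one
# training step over a list of Linear layers (Task 4); names are illustrative.
class ReLU:
    def forward(self, X):
        self.X = X                               # stored for the backward step
        return np.maximum(0.0, X)                # element-wise max(0, x)

    def backward(self, grad_Y):
        return grad_Y * (self.X > 0)             # pass the gradient only where the input was positive

class MSELoss:
    def forward(self, Y_pred, Y_truth):
        self.diff = Y_pred - Y_truth             # stored difference (needed in backward)
        return np.sum(self.diff ** 2) / Y_pred.shape[0]    # (1 / #samples) * ||Y_pred - Y_truth||^2

    def backward(self):
        return 2.0 / self.diff.shape[0] * self.diff        # (2 / #samples) * (Y_pred - Y_truth)

def train_one_batch(linears, activations, loss_fn, X, Y_truth, lr=1e-2):
    # Forward pipeline: linear[0] -> activation[0] -> ... -> linear[k-1] -> loss.
    out = X
    for lin, act in zip(linears[:-1], activations):
        out = act.forward(lin.forward(out))
    Y_pred = linears[-1].forward(out)
    loss = loss_fn.forward(Y_pred, Y_truth)

    # Backward pipeline: the same layers traversed in reverse order.
    grad = loss_fn.backward()
    grad = linears[-1].backward(grad)
    for lin, act in zip(reversed(linears[:-1]), reversed(activations)):
        grad = lin.backward(act.backward(grad))

    # Gradient descent: step along the negative gradient, scaled by the learning rate.
    for lin in linears:
        lin.W -= lr * lin.grad_W
    return loss
```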
5. Task 5: Classification

The framework is almost the same as the regression. The only difference is that you need to convert each image label from a number 0-9 to its one-hot encoding vector. For example, the labels [0, 3, 8]^T (with dimension (3, 1)) become

[1 0 0 0 0 0 0 0 0 0
 0 0 0 1 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 0]    (0.2)

This step is slightly different from the encoding vector you created in the least-squares code (in which you used −1 instead of 0). A small encoding sketch is given after this task list.

6. Evaluation

After finishing everything, we are ready to test the performance of your neural network on the regression and classification challenges. Run the test functions in both classes and observe the performance gain. If everything is implemented correctly, you should be able to observe the decrease of the loss and the increase of the accuracy in the output. Compare the results with the least-squares model. Play with the different parameters in the network to improve its performance.
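Referring back to Task 5, here is a minimal one-hot encoding helper, assuming NumPy and labels given as an integer array; the function name is illustrative.

```python
import numpy as np

# A minimal one-hot encoding sketch for Task 5; not the starter code's API.
def one_hot(labels, num_classes=10):
    labels = np.asarray(labels, dtype=int)           # e.g. [0, 3, 8]
    Y = np.zeros((labels.shape[0], num_classes))     # (#samples, 10) matrix of zeros
    Y[np.arange(labels.shape[0]), labels] = 1.0      # set the column of the true digit to 1
    return Y

# one_hot([0, 3, 8]) reproduces the matrix in Eq. (0.2); the least-squares
# encoding earlier used -1 in place of these zeros.
```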
Dessert Task (Extra Credit, up to 5 points)

Finish the following task(s) to receive extra credit:

1. Implement the bias in the neural network model for classification.
2. Design a new nonlinear activation function along with its derivative and use it to replace ReLU in the current architecture. Run the same learning tests and evaluate the performance by comparing the results with the standard implementation.
3. Design a new network architecture (e.g., [1 linear] → [2 nonlinears] → [1 linear] → [2 nonlinears] → ...) and evaluate its performance.
4. Employ the network to solve a new regression or classification problem. You need to collect or generate data for the new problem and report the performance of the learning model. Designing a new analytical function for regression does not count.
5. The accuracy of your network ranks in the Top 5 of the class.

Teamwork

The final project allows teamwork. Each team may have up to two team members. Each team member needs to clarify his/her contribution in the final report.

Presentation

Each team needs to submit an 8-10 minute presentation video to our Slack channel to present the final project. The presentation should be structured as: methodology, code implementation (by going through the source code), live demo (by training a model to fit a given function and demonstrating the accuracy), results discussion, and the main challenges you have faced and addressed in the project. Each team member needs to present his/her own contributions to the project (if it is a team project).

Final Report

Each team needs to submit a technical report covering the aforementioned aspects of the presentation requirement (except the live demo). Each team is encouraged to demonstrate detailed experimental results with charts, tables, and figures. The submission deadline for the report is 24 hours after the presentation.

Acknowledgement

The CS70 teaching team wants to acknowledge Xingzhe He's contribution in creating the original C++ version of the starter code and the mathematical models, and Chongyang Gao and Kang Gu's contributions in adapting it to the Python environment.