The homework is generally split into programming exercises and written
exercises.
This homework is due on Dec 3rd, 2020 at 11:59 PM ET. Upload your
homework to Gradescope
(Canvas->Gradescope). There are two assignments for this homework in
Gradescope. Please note a
complete submission should include:
1. A write-up as a single .pdf file. → Submit to “Homework 4- Write Up”.
2. Source code for all of your experiments (AND figures) zipped into a
single .zip file, in .py files
if you use Python or .ipynb files if you use the IPython Notebook. If
you use some other language, include all build scripts necessary to
build and run your project along with instructions
on how to compile and run your code. If you use the IPython Notebook to
create any graphs,
please make sure you also include them. → Submit to “Homework 4- Code”.
The write-up should contain a general summary of what you did, how well
your solution works, any
insights you found, etc. On the cover page, include the class name,
homework number, and team
member names. You are responsible for submitting clear, organized
answers to the questions. You
could use online LaTeX templates from Overleaf, under “Homework
Assignment” or “Project / Lab
Report”.
Please include all relevant information for a question, including text
response, equations, figures,
graphs, output, etc. If you include graphs, be sure to include the
source code that generated them.
Please pay attention to Canvas for relevant information regarding
updates, tips, and policy changes.
You are encouraged (but not required) to work in groups of 2.
IF YOU NEED HELP
There are several strategies available to you.
• If you get stuck, we encourage you to post a question on the
Discussions section of Canvas. That
way, your solutions will be available to other students in the class.
• The professor and TAs offer office hours, which are a great way to get
some one-on-one help.
• You are allowed to use well known libraries such as scikit-learn,
scikit-image, numpy, scipy,
etc. for this assignment. Any reference to or copy of public code should be properly cited
in your submission (examples include GitHub, Wikipedia, and blogs).
PROGRAMMING EXERCISES
You will need to download extra data files for the following problems.
You can find them on Canvas
under Files->Homework_data.
1. Clustering for text analysis. (20 pts) In this problem, you will
analyze all the articles from the journal Science in the year 2000.
(Thanks to JSTOR for providing the data.) Many of the parameters of
this analysis will be left for you to decide. For this problem you will need
science2k-vocab.npy and science2k-titles.npy, which contain the vocabulary terms and the
article titles, respectively.
(a) (10 pts) The extra data file science2k-doc-word.npy contains a 1373 ×
5476 matrix, where
each row is an article in Science described by 5476 word features. The
articles and words are
in the same order as in the vocabulary and titles files above. You can
read this file using
numpy.load("science2k-doc-word.npy")
To obtain the features, we performed the following transformation.
First, we computed per-document smoothed word frequencies. Second, we
took the log of those frequencies. Finally,
we centered the per-document log frequencies to have zero mean.
Cluster the documents using k-means and various values of k (go up to at
least k = 20).
Select a value of k.
For that value, report the top 10 words of each cluster in order of the
largest positive distance
from the average value across all data. More specifically, if x is the 5476-vector of
average values across all documents and m_i is the i-th cluster mean, report the words
associated with the top components of m_i − x. Also report the top ten documents that fall
closest to each cluster center.
You can find the titles in the science2k-titles.txt file.
Comment on these results. What has the algorithm captured? How might
such an algorithm
be useful?
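As a starting point only, here is a minimal sketch of one way to produce these summaries, assuming the file names given in the problem statement and scikit-learn's KMeans; the choice k = 10 is purely illustrative and should be replaced with the k you select.

# Sketch, not a required solution: k-means over the document-word matrix, then the
# top-10 words per cluster by the largest components of m_i - x_bar, and the 10
# documents closest to each cluster center. k = 10 is an arbitrary illustrative choice.
import numpy as np
from sklearn.cluster import KMeans

docs = np.load("science2k-doc-word.npy")                     # 1373 x 5476
vocab = np.load("science2k-vocab.npy", allow_pickle=True)    # 5476 terms
titles = np.load("science2k-titles.npy", allow_pickle=True)  # 1373 titles

k = 10
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(docs)

x_bar = docs.mean(axis=0)                                    # average value across all documents
for i, m_i in enumerate(km.cluster_centers_):
    top_words = np.argsort(m_i - x_bar)[::-1][:10]           # largest positive components of m_i - x_bar
    closest_docs = np.argsort(np.linalg.norm(docs - m_i, axis=1))[:10]
    print(f"cluster {i} words:", [vocab[j] for j in top_words])
    print(f"cluster {i} titles:", [titles[j] for j in closest_docs])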
(b) (10 pts) The file science2k-word-doc.npy is similar, but captures term-wise rather than
document-wise features. That is, for each term, we
count the frequency as the number of documents that
term appears in rather than the other way around. This allows us to
characterize individual
terms.
This matrix is 5476×1373, where each row is a term in Science described
by 1373 “document”
features. These are transformed document frequencies (as above). Repeat
the analysis above,
but cluster terms instead of documents. The terms are listed in science2k-vocab.txt.
Comment on these results. How might such an algorithm be useful? What is
different about
clustering terms from clustering documents?
2. EM algorithm and implementation (25 pts)
(a) (5 pts) Download the Old Faithful Geyser Dataset. The data file
contains 272 observations of
(eruption time, waiting time). Treat each entry as a 2-dimensional feature vector. Parse the
data and plot all points in the 2-D plane.
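A small sketch of the parsing and plotting step; the file name faithful.txt and its whitespace-separated two-column layout are assumptions about the downloaded file, so adapt them as needed.

# Sketch: load the 272 (eruption time, waiting time) pairs and scatter-plot them.
# "faithful.txt" and the whitespace-separated layout are assumptions about the download.
import numpy as np
import matplotlib.pyplot as plt

X = np.loadtxt("faithful.txt")        # shape (272, 2): eruption time, waiting time
plt.scatter(X[:, 0], X[:, 1], s=10)
plt.xlabel("eruption time (min)")
plt.ylabel("waiting time (min)")
plt.show()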
(b) (10 pts) Implement a bimodal GMM to fit all the data points using the EM algorithm.
Explain the reasoning behind your termination criteria. For this problem, we assume the covariance
matrix is spherical (i.e., it has the form σ²I for a scalar σ) and you
can randomly initialize
Gaussian parameters. For evaluation purposes, please submit the
following figures:
• Plot the trajectories of the two mean vectors in 2 dimensions (i.e., coordinates vs. iteration).
• Run your program 50 times with different initial parameter guesses. Show the distribution
of the total number of iterations needed for the algorithm to converge.
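As one possible starting point (not the required implementation), the loop below runs EM for a two-component spherical GMM on the array X from part (a); the function name, the log-likelihood-based termination rule, the tolerance, and the random initialization are illustrative choices you should adapt and justify yourself.

# A rough sketch of EM for a two-component GMM with spherical covariances (sigma_k^2 * I).
# The termination rule and the random initialization are illustrative, not prescribed;
# the returned mean history supports the trajectory plot, and `it` the iteration-count histogram.
import numpy as np

def em_spherical_gmm(X, n_iter=500, tol=1e-6, seed=None):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.array([0.5, 0.5])                              # mixing weights
    mu = X[rng.choice(n, size=2, replace=False)].copy()    # random means drawn from the data
    sigma2 = np.array([X.var(), X.var()])                  # spherical variances
    mu_history, prev_ll = [mu.copy()], -np.inf
    for it in range(1, n_iter + 1):
        # E-step: responsibilities under isotropic Gaussians, computed in log space
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)            # n x 2 squared distances
        log_p = np.log(pi) - 0.5 * d * np.log(2 * np.pi * sigma2) - sq / (2 * sigma2)
        log_norm = np.logaddexp(log_p[:, 0], log_p[:, 1])
        r = np.exp(log_p - log_norm[:, None])
        # M-step: update weights, means, and spherical variances
        nk = r.sum(axis=0)
        pi = nk / n
        mu = (r.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        sigma2 = (r * sq).sum(axis=0) / (d * nk)
        mu_history.append(mu.copy())
        # terminate when the log-likelihood stops improving by more than tol
        ll = log_norm.sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return mu, sigma2, pi, np.array(mu_history), it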
(c) (10 pts) Repeat the task in (b) but with the initial guesses of the
parameters generated from
the following process:
• Run a k-means algorithm over all the data points with K = 2 and label
each point with
one of the two clusters.
• Estimate the first guess of the mean and covariance matrices using
maximum likelihood
over the labeled data points.
Compare the performance of the algorithms in (b) and (c).
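A sketch of how the k-means initialization in (c) might look, again assuming the array X from part (a) and scikit-learn; the per-cluster maximum-likelihood estimates would then be passed to the same EM routine as starting values.

# Sketch of the initialization in (c): label points with K = 2 k-means, then use per-cluster
# maximum-likelihood estimates (mean and spherical variance) as the EM starting parameters.
import numpy as np
from sklearn.cluster import KMeans

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
mu0 = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])
sigma2_0 = np.array([((X[labels == k] - mu0[k]) ** 2).mean() for k in (0, 1)])  # spherical ML variance
pi0 = np.array([(labels == k).mean() for k in (0, 1)])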
3. Eigenface for face recognition (40 pts).
In this assignment you will implement the Eigenface method for
recognizing human faces. You will
use face images from The Yale Face Database B, which contains 64 images under different
lighting conditions for each of 10 distinct subjects, 640 face images in total.
With your implementation,
you will explore the power of the Singular Value Decomposition (SVD) in
representing face images.
Read more (optional):
• Eigenface on Wikipedia: https://en.wikipedia.org/wiki/Eigenface
• Eigenface on Scholarpedia:
http://www.scholarpedia.org/article/Eigenfaces
(a) (2 pts) Download and unzip The Face Dataset (faces.zip). You will find a folder called
images which contains all the training and test images; train.txt and test.txt specify the
training set and test (validation) set split, respectively, where each line gives an image
path and the corresponding label.
(b) (2 pts) Load the training set into a matrix X: there are 540
training images in total, each of which is a 50 × 50 pixel image that needs to be flattened
into a 2500-dimensional vector. So the size of X should be 540 × 2500, where each row is a
flattened face image. Pick a
face image from X and
display that image in grayscale. Do the same thing for the test set. The
size of matrix Xtest for
the test set should be 100×2500.
Below is the sample code for loading data from the training set. You can
directly run it in
Jupyter Notebook:
import numpy as np
from scipy import misc   # note: scipy.misc.imread was removed in newer SciPy; imageio.imread is a drop-in alternative
from matplotlib import pylab as plt
import matplotlib.cm as cm
%matplotlib inline

train_labels, train_data = [], []
for line in open('./faces/train.txt'):
    im = misc.imread(line.strip().split()[0])
    train_data.append(im.reshape(2500,))
    train_labels.append(line.strip().split()[1])
train_data, train_labels = np.array(train_data, dtype=float), np.array(train_labels, dtype=int)

print(train_data.shape, train_labels.shape)
plt.imshow(train_data[10, :].reshape(50, 50), cmap=cm.Greys_r)
plt.show()
(c) (3 pts) Average Face. Compute the average face µ from the whole
training set by summing up
every row in X then dividing by the number of faces. Display the average
face as a grayscale
image.
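Continuing from the sample code above, a short sketch of this step (the variable names train_data, plt, and cm come from that code):

# Sketch: the average face is the per-pixel mean over the rows of X (here train_data).
mu = train_data.mean(axis=0)                              # 2500-dimensional average face
plt.imshow(mu.reshape(50, 50), cmap=cm.Greys_r)
plt.show()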
(d) (3 pts) Mean Subtraction. Subtract the average face µ from every row in X. That is,
x_i := x_i − µ, where x_i is the i-th row of X. Pick a face image after mean subtraction from
the new X and display that image in grayscale. Do the same thing for the test set Xtest
using the average face µ computed in (c).
(e) (10 pts) Eigenface. Perform an eigendecomposition X^T X = V Λ V^T to get the eigenvectors
V^T, where each row of V^T has the same dimension as a face image. We refer to v_i, the i-th
row of V^T, as the i-th eigenface. Display the first 10 eigenfaces as 10 images in grayscale.
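One possible route (a sketch, not the only correct approach) is to take the SVD of the mean-subtracted X from (d): the rows of Vt are exactly the eigenvectors of X^T X, ordered by decreasing eigenvalue, so X^T X never needs to be formed explicitly.

# Sketch: the rows of Vt from the SVD of X are the eigenvectors of X^T X, ordered by
# decreasing eigenvalue (the squared singular values), so they can be displayed directly.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
fig, axes = plt.subplots(1, 10, figsize=(20, 2))
for i, ax in enumerate(axes):
    ax.imshow(Vt[i].reshape(50, 50), cmap=cm.Greys_r)
    ax.axis("off")
plt.show()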
(f) (10 pts) Eigenface Feature. The top r eigenfaces V^T[:r, :] = {v_1, v_2, ..., v_r}^T span
an r-dimensional linear subspace of the original image space called the face space, whose
origin is the average face µ and whose axes are the eigenfaces {v_1, v_2, ..., v_r}.
Therefore, using the top r eigenfaces {v_1, v_2, ..., v_r}, we can represent a
2500-dimensional face image z as an r-dimensional feature vector f: f = V^T[:r, :] z =
[v_1, v_2, ..., v_r]^T z. Write a function to generate the r-dimensional feature matrices F
and Ftest for the training images X and the test images Xtest, respectively (to get F,
multiply X by the transpose of the first r rows of V^T; F should have the same number of
rows as X and r columns; similarly for Xtest).
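A sketch of such a function, reusing the Vt computed in the SVD sketch above; the function name eigenface_features is just illustrative.

# Sketch: project the mean-subtracted images onto the top-r eigenfaces.
def eigenface_features(A, Vt, r):
    return A @ Vt[:r].T            # one r-dimensional feature vector per row of A

F = eigenface_features(X, Vt, 10)            # 540 x 10
F_test = eigenface_features(X_test, Vt, 10)  # 100 x 10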
(g) (10 pts) Face Recognition. Extract training and test features for r =
10. Train a Logistic Regression model using F and test on Ftest.
Report the classification accuracy on the test set.
Plot the classification accuracy on the test set as a function of r when
r = 1, 2,..., 200. Use
“one-vs-rest” logistic regression, where a classifier is trained for
each possible output label.
Each classifier is trained on faces with that label as positive data and
all faces with other labels
as negative data. sklearn calls this “ovr” mode.
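A sketch of the accuracy-versus-r sweep with scikit-learn's one-vs-rest logistic regression; test_labels is assumed to have been loaded from test.txt like train_labels, and max_iter is raised only to avoid convergence warnings.

# Sketch: train one-vs-rest logistic regression on r-dimensional eigenface features
# and record test accuracy for r = 1, ..., 200.
from sklearn.linear_model import LogisticRegression

rs = range(1, 201)
accs = []
for r in rs:
    clf = LogisticRegression(multi_class="ovr", max_iter=1000)
    clf.fit(eigenface_features(X, Vt, r), train_labels)
    accs.append(clf.score(eigenface_features(X_test, Vt, r), test_labels))

plt.plot(list(rs), accs)
plt.xlabel("r (number of eigenfaces)")
plt.ylabel("test accuracy")
plt.show()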
WRITTEN EXERCISES
1. SVD and eigendecomposition. (5 pts) Show that from the Singular Value
Decomposition (SVD) of
a matrix X, we can obtain the eigendecomposition of X^T X. This tells us that we can compute
the SVD of X and get the same result as the eigendecomposition of X^T X, but the SVD is
faster and easier.
2. Weights for clustering (ESL Exercise 14.1). (5 pts) In clustering
algorithms like K-means, we need
to compute distances in the feature space. Sometimes people use weights
to value some feature
more than others. Show that the weighted Euclidean distance between p-dimensional points
x_i and x_{i'},

d_e^{(w)}(x_i, x_{i'}) = \frac{\sum_{l=1}^{p} w_l (x_{il} - x_{i'l})^2}{\sum_{l=1}^{p} w_l},   (1)

satisfies

d_e^{(w)}(x_i, x_{i'}) = d_e(z_i, z_{i'}) = \sum_{l=1}^{p} (z_{il} - z_{i'l})^2,   (2)

where

z_{il} = x_{il} \cdot \left( \frac{w_l}{\sum_{l'=1}^{p} w_{l'}} \right)^{1/2}.   (3)

Thus weighted Euclidean distance based on x is equivalent to unweighted Euclidean distance
based on a properly transformed version z of the data.
3. SVD of Rank-Deficient Matrix. (10 pts) You can use the NumPy library for this problem. In
this extreme example of dimensionality reduction, we are going to see that the original
data's dimensionality is redundant and can be completely captured in fewer dimensions.
Consider the matrix M below. It has rank 2, as you can see by observing that three times the
first column minus the sum of the other two columns is 0.

M = \begin{pmatrix} 1 & 0 & 3 \\ 3 & 7 & 2 \\ 2 & -2 & 8 \\ 0 & -1 & 1 \\ 5 & 8 & 7 \end{pmatrix}.   (4)
(a) Compute the matrices MT M and MMT .
(b) Find the eigenvalues for your matrices of part (a).
(c) Find the eigenvectors for the matrices of part (a).
(d) Find the SVD for the original matrix M from parts (b) and (c). Note
that there are only two
nonzero eigenvalues, so your matrix Σ should have only two singular
values, while U and V
have only two columns.
(e) There are 2 nonzero singular values. If we keep only one of them by setting the smaller
singular value to 0, then the data will be represented in 1D only. Compute this
one-dimensional approximation to M.