Advanced Machine Learning GR5242, Fall 2021
Homework 3
Due October 27, 7 pm, NY time. All problems count the same.

No collaboration (of any sort, including with peers from other sections or previous years) is allowed. Instead, the assignment is to be treated as you would a take-home exam.

Homework submission: please submit your homework by publishing a notebook that cleanly displays your code, results and plots to pdf. For the non-coding questions, you can typeset using LaTeX or markdown, or you can include neatly scanned answers. Please make sure to put everything together into a single pdf file for submission. Assignments won't be accepted otherwise.

Problem 1: Newton Direction and General Hessian

Let $F(\theta)$ be some objective function over $\theta \in \mathbb{R}^d$, whereby every Newton step is of the form
$$\theta \leftarrow \theta - \eta \cdot (\nabla^2 F(\theta))^{-1} \cdot \nabla F(\theta).$$
We considered a simple example in class where $H(\theta) \doteq \nabla^2 F(\theta)$ was assumed diagonal, and saw that Newton adjusts $\nabla F(\theta)$ most towards the coordinate direction $\arg\min_{i \in [d]} H_{i,i}(\theta)$, i.e., the coordinate with smallest second derivative $H_{i,i}(\theta) \doteq \frac{d^2 F(\theta)}{d\theta_i^2}$, where $\theta_i \doteq$ the $i$-th coordinate of $\theta$.

Let's argue the general case. Assume only that $H(\theta)$ is symmetric, positive, and diagonalizable as $H = V \Lambda V^\top$, for eigenvectors $V = [v_1, \ldots, v_d]$ and eigenvalues $\lambda_i \doteq \Lambda_{i,i}$, with $\lambda_i \ge \lambda_{i+1}$.

(a) Let $\alpha \doteq V^\top \nabla F(\theta)$ denote the coordinates of $\nabla F(\theta)$ in the basis $V$. Derive the coordinates in the basis $V$ of the Newton direction $(\nabla^2 F(\theta))^{-1} \cdot \nabla F(\theta)$, in terms of the $\alpha_i$'s and $\lambda_i$'s.

(b) We now see that Newton adjusts the gradient most in a particular direction $v_i$. Which $i$?

(c) Using the multi-dimensional chain rule, argue that $v^\top H(\theta) v$ is just the second derivative of $F$ along a direction $v$, $\|v\| = 1$ (consider differentiating $g(t) \doteq F(\theta + tv)$ at $t = 0$). It follows that $\lambda_1, \lambda_d$ are respectively the highest and smallest second derivatives along any $v$.

Problem 2: Generalized SGD/Adagrad vs Newton

Consider an objective $F(\theta) = \frac{1}{n} \sum_{i \in [n]} F_i(\theta)$, for $\theta \in \mathbb{R}^d$, e.g., $F_i(\theta) = \mathrm{loss}(f_\theta(X_i), Y_i)$ on a dataset $\{(X_i, Y_i)\}_{i \in [n]}$. For a stochastic gradient $\tilde\nabla F(\theta) \doteq \nabla F_i(\theta)$, i.e., over a random draw of $i \in [n]$ with probability $1/n$ (whereby $\mathbb{E}[\tilde\nabla F] = \nabla F$), we considered SGD steps of the form
$$\theta \leftarrow \theta - \frac{C}{\sqrt{t}} \cdot \Lambda^{-1/2} \cdot \tilde\nabla F(\theta),$$
for a $d \times d$ diagonal matrix $\Lambda$ whose elements we either set to (i) $\Lambda_{jj} = \hat\sigma^2$ or (ii) $\Lambda_{jj} = \hat\sigma_j^2$, $j \in [d]$, where $\hat\sigma^2$ estimates $\mathbb{E}\|\tilde\nabla F(\theta)\|^2$, and $\hat\sigma_j^2$ estimates $\mathbb{E}\big[\big(\tfrac{d}{d\theta_j} F_i(\theta)\big)^2\big]$.

Let's consider the case where the estimates are formed from an i.i.d. draw $\{\nabla F_{i_s}(\theta)\}_{s \in [m]}$ of size $m \ll n$, i.e.
$$\hat\sigma^2 = \frac{1}{m} \sum_{s \in [m]} \|\nabla F_{i_s}(\theta)\|^2 \quad \text{and} \quad \hat\sigma_j^2 = \frac{1}{m} \sum_{s \in [m]} \Big(\frac{d}{d\theta_j} F_{i_s}(\theta)\Big)^2.$$

So far, we've already seen all this in class, perhaps under slightly different notation. Now consider the following matrix:
$$\hat\Sigma = \frac{1}{m} \sum_{s \in [m]} \nabla F_{i_s}(\theta) \cdot \nabla^\top F_{i_s}(\theta).$$

(a) Express $\hat\sigma^2$ and $\hat\sigma_j^2$ in terms of $\hat\Sigma$.

(b) A generalization of SGD is to replace $\Lambda$ with $\hat\Sigma$, which can be seen as encoding the covariance of $\tilde\nabla F$, i.e., all directional variances thereof. This leads to the step $\theta \leftarrow \theta - \frac{C}{\sqrt{t}} \cdot \hat\Sigma^{-1/2} \cdot \tilde\nabla F(\theta)$. This adjusts the gradient most into a given direction. Which?
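For concreteness, the estimators above can be illustrated in a few lines of NumPy; this is purely to fix notation and is not part of what is being asked. The array grads below is a made-up stand-in for the sampled gradients $\{\nabla F_{i_s}(\theta)\}_{s \in [m]}$:

import numpy as np

# Toy stand-in for the m sampled gradients grad F_{i_s}(theta), s = 1, ..., m,
# each of dimension d; in practice these would come from your model and loss.
rng = np.random.default_rng(0)
m, d = 5, 3
grads = rng.normal(size=(m, d))            # row s is grad F_{i_s}(theta)

# sigma_hat^2: average squared norm of the sampled gradients
sigma2 = np.mean(np.sum(grads**2, axis=1))

# sigma_hat_j^2: average squared j-th partial derivative, one value per coordinate j
sigma2_j = np.mean(grads**2, axis=0)

# Sigma_hat: average outer product (1/m) sum_s grad F_{i_s} grad F_{i_s}^T
Sigma_hat = (grads.T @ grads) / m

print(sigma2, sigma2_j, Sigma_hat)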
Problem 3: Compare DBSCAN and MeanShift

This problem is an extension of Homework 1 Problem 8. In that problem you were given a data generator for 3 datasets, on which you had to compare the performances of MeanShift, K-means, and Spectral Clustering. You are to re-use the data generator given in Homework 1 Problem 8, and now run DBSCAN on the three datasets (we want to compare this to MeanShift on the same datasets).

The corresponding Python function DBSCAN is imported as follows:

from sklearn.cluster import DBSCAN

The DBSCAN function uses two parameters, eps and min_samples. The first parameter eps corresponds to the bandwidth parameter $h$ in the course notes. The second parameter min_samples is proportional to the level $\lambda$ from the course notes; more precisely, we have $\lambda = \frac{\texttt{min\_samples}}{\mathrm{Vol}(B(0,h))}$. Therefore you first have to make appropriate choices of these parameters as follows.

• Set eps to be the 20% quantile of interpoint distances when using DBSCAN, i.e., the same setting as for MeanShift in Homework 1. You can use the function estimate_bandwidth() from sklearn.cluster.
• For min_samples in the range [100, 400, 700, 1000], pick the smallest value that gives more than one cluster.

(a) Plot the results of the clustering obtained using DBSCAN for the 3 datasets (as done in Homework 1).

(b) Plot the results of clustering using MeanShift (with parameters set as in Homework 1). If your results for Homework 1 were correct you can simply provide those again here.

(c) DBSCAN should vastly outperform MeanShift on at least 2 of the datasets. Which 2 datasets? Give a simple explanation as to why DBSCAN might do better there.

Problem 4: Regularization - Gradient Boosting

We will work on the classical digits dataset for this question. The dataset contains 1797 handwritten digits, 0 to 9, as 8 × 8 gray-scale images. Each datapoint (or feature vector) $x$ is of dimension 64, encoding the 64 pixel values of an image; the class label $y$ is simply the digit itself. For instance, Figure 1 shows the image feature of an observation with label 0. For more details, please refer to https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html.

[Figure 1: An example of the digit image with label 0]

You are asked to implement gradient boosting in Python to return a classifier from images to digits. Code for loading the dataset:

from sklearn import datasets

# Load the dataset
d = datasets.load_digits()
X, y = (d.data, d.target)

# Split into training and test sets
X_train, X_test = X[:1000], X[1000:]
y_train, y_test = y[:1000], y[1000:]

Complete the following tasks:

(a) Fit sklearn.ensemble.GradientBoostingClassifier objects with all the combinations (four in total) of the choices of parameters as follows:

• use deviance as the fitting criterion;
• total number of iterations (i.e. total number of weak learners): T = 1500;
• set max_leaf_nodes = 4, i.e. the weak learners are trees with at most four leaf nodes;
• learning rate: try both η = 0.1 and 0.01;
• the fraction of training data to use for gradient fitting: try both p = 0.5 and 1;
• leave other parameters as default.

For each of the 4 parameter settings, plot the test deviance against the boosting iterations t = 1, ..., T. (Hint: GradientBoostingClassifier.staged_decision_function() and GradientBoostingClassifier.loss_() are useful here.)
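For part (a), and for illustration only, here is one possible way (a sketch, not prescribed starter code) to fit a single (η, p) configuration and trace its test deviance using the hinted functions. The variable names and the use of the subsample parameter to set p are choices made here; clf.loss_ and the loss="deviance" spelling follow the scikit-learn releases current when this assignment was written and have since been removed or renamed in newer versions:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier

# Data split as above
d = datasets.load_digits()
X, y = d.data, d.target
X_train, X_test = X[:1000], X[1000:]
y_train, y_test = y[:1000], y[1000:]

T = 1500
eta, p = 0.1, 0.5          # one of the four (eta, p) settings; repeat for the others

clf = GradientBoostingClassifier(loss="deviance", n_estimators=T,
                                 learning_rate=eta, subsample=p,
                                 max_leaf_nodes=4)
clf.fit(X_train, y_train)

# Test deviance after each boosting iteration t = 1, ..., T, from the staged raw scores.
# If clf.loss_ is unavailable in your scikit-learn version, the multinomial deviance
# can instead be computed by hand from the raw scores (e.g., via a softmax and log loss).
test_deviance = np.empty(T)
for t, raw in enumerate(clf.staged_decision_function(X_test)):
    test_deviance[t] = clf.loss_(y_test, raw)

plt.plot(np.arange(1, T + 1), test_deviance, label="eta=0.1, p=0.5")
plt.xlabel("boosting iteration t")
plt.ylabel("test deviance")
plt.legend()
plt.show()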
(b) How does stochastic gradient boosting (SGB) compare to the non-stochastic one, especially for the larger η = 0.1? Is there any other advantage of SGB, other than the performance on the test dataset?

(c) For the cases where η = 0.1, will early stopping be helpful? What about for the cases where η = 0.01? Give a simple reason why.

(d) Consider the case where η = 0.1 and p = 1. Determine an optimal stopping time by 10-fold cross-validation. Plot the stopping time as a dashed vertical line and the resulting test deviance as a dashed horizontal line in the corresponding plot from part (a). The candidate stopping times are {20, 30, 40, 60, 80, 100, 200, 500}. (Hint: Use the GridSearchCV() function from sklearn.model_selection to implement the cross-validation over the parameter n_estimators.)

(e) Now, let's explore the performance under the 0-1 misclassification error, which is the most common criterion of classification performance. It is common folklore that, under 0-1 loss, boosting procedures tend not to overfit as fast as under other performance measures. You are being asked to verify this folklore observation. Train a model as in (a) with T = 3000, η = 0.1 and p = 1, and plot the test misclassification error against the boosting iteration t, for t = 1, ..., 3000. Will early stopping be helpful here, and approximately at what iteration would you stop?

Problem 5: Bias and Variance

Consider a classification problem (X, Y) where $X \sim \sum_{i=1}^{3} a_i \mathcal{N}(\mu_i, 25 I_d)$, and the corresponding label $Y = i \in \{1, 2, 3\}$ represents the component $i$ generating $X$, with probabilities $(a_1, a_2, a_3) = (0.5, 0.25, 0.25)$. We let $d = 2$, $\mu_1 = (-2, -3.5)$, $\mu_2 = (0, 0)$, and $\mu_3 = (2, -3.5)$.

The intent of this question is to illustrate model selection, using an ε-NN (standing for ε nearest neighbors) classifier defined as follows. Given training data $\{(X_i, Y_i)\}_{i=1}^{n}$ of size $n = 500$, let $\varepsilon\mathrm{NN}(x)$ denote all $X_i$ in $\{X_i\}_{i=1}^{n}$ such that $\|X_i - x\| \le \varepsilon$. Then the ε-NN classifier at $x$ is given as $\hat f_\varepsilon(x) =$ the majority label of $\varepsilon\mathrm{NN}(x)$. In other words, if there are 3 data points in $\varepsilon\mathrm{NN}(x)$ having labels 1, 1, 2, then $\hat f_\varepsilon(x) = 1$. You should use the Python implementation of this classifier, which can be called as below:

from sklearn.neighbors import RadiusNeighborsClassifier

trainX, trainY = Data_Generate(n)
ENN = RadiusNeighborsClassifier(radius=r)
ENN.fit(trainX, trainY)

The data generator for the above data distribution is given to you as:

import random
import numpy as np

d = 2
mu0 = [[-2, -3.5], [0, 0], [2, -3.5]]
Id = [[25, 0], [0, 25]]

def Data_Generate(n):
    outX = []
    outY = np.random.choice(list(range(3)), n, replace=True, p=[0.5, 0.25, 0.25])
    for i in range(n):
        mu = mu0[outY[i]]
        X = np.random.multivariate_normal(mu, Id, 1)[0].tolist()
        outX.append(X)
    return outX, outY

Our aim is to see the effect of the choice of ε, which stands in for inverse model complexity: the smaller ε is, the more patterns the model can fit, while for extremely large ε the fitting power is greatly reduced. Therefore, in what follows, we are to compare the test error of $\hat f_\varepsilon$ over different values of ε.

(a) For large ε, the model $\hat f_\varepsilon$ is stable but poor. Show that for all $M > 0$, there exists $A > 0$ such that for all $\varepsilon \ge A$, $\hat f_\varepsilon$ is constant over $\{x \in \mathbb{R}^d : \|x\| \le M\}$.

(b) Draw and fix a test sample $(X'_i, Y'_i)_{i=1}^{n}$ of the same size $n$. (A sketch of one possible setup appears after this problem.)

• Generate training data $\{(X_i, Y_i)\}_{i=1}^{n}$ and fit the classifier $\hat f_\varepsilon$ for ε being the q quantile of interpoint distances, where q = 0.1, 0.2, ..., 1.0.
• If a test sample does not have a training sample in its ε-neighborhood, then the prediction by $\hat f_\varepsilon$ is not well-defined. Compute the proportion of test samples whose prediction by $\hat f_\varepsilon$ is well-defined, for each ε.
• For those test samples whose prediction by $\hat f_\varepsilon$ is well-defined (suppose there are $n_0$ of them), compute the test error:
$$\hat R'_n(\hat f_\varepsilon) = \frac{1}{n_0} \sum_{i=1}^{n_0} \mathbf{1}\{\hat f_\varepsilon(x'_i) \ne y'_i\}.$$

Repeat the above 100 times.

• Plot the average (over repetitions) of $\hat R'_n(\hat f_\varepsilon)$ against q and show error bars.
• Plot the average (over repetitions) proportion of test samples whose prediction is well-defined against q and show error bars.

(c) For which values of ε do we observe higher variances in the two plots? That is, for the smaller or higher values? Explain why. How does the performance change with q?
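For Problem 5(b), the sketch below shows one possible way (not prescribed) to set the radius from an interpoint-distance quantile and to restrict predictions to the test points for which the prediction is well-defined. Here the quantile is taken over pairwise distances within the training sample, and the variable names are illustrative choices:

import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import RadiusNeighborsClassifier

n = 500
q = 0.2                                    # one of q = 0.1, ..., 1.0

trainX, trainY = Data_Generate(n)          # generator given above
testX, testY = Data_Generate(n)            # a fixed test sample of the same size
trainX, testX = np.asarray(trainX), np.asarray(testX)
trainY, testY = np.asarray(trainY), np.asarray(testY)

# epsilon = q-quantile of the pairwise distances among training points
dists = pairwise_distances(trainX)
r = np.quantile(dists[np.triu_indices(n, k=1)], q)

ENN = RadiusNeighborsClassifier(radius=r)
ENN.fit(trainX, trainY)

# A test point's prediction is well-defined only if it has at least one training
# point within distance r of it; find those test points explicitly.
_, neigh_ind = ENN.radius_neighbors(testX)
ok = np.array([len(ind) > 0 for ind in neigh_ind])
prop_defined = ok.mean()

# Test error over the well-defined test points only
preds = ENN.predict(testX[ok])
test_err = np.mean(preds != testY[ok])
print(prop_defined, test_err)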
Problem 6: Model Selection

(a) Rank the following model classes by their complexity (or state which are equally complex):
$$\mathcal{F}_1 = \{ f(x) = w^\top x : w \in \mathbb{R}^2 \}$$
$$\mathcal{F}_2 = \{ f(x) = w^\top x : w \in \mathbb{R}^{20},\ w_i = 0 \text{ for } i \le 19, \text{ while } w_{20} \in \mathbb{R} \}$$
$$\mathcal{F}_3 = \{ f(x) = w^\top x : w \in \mathbb{R}^{5000},\ w_i = i \text{ for } i \ge 2, \text{ while } w_1 \in \mathbb{R} \}$$

(b) Consider Gradient Descent over $\hat R(w) \doteq \frac{1}{2}\|Xw - Y\|^2$, $w \in \mathbb{R}^d$, $X \in \mathbb{R}^{n \times d}$, $Y \in \mathbb{R}^n$. Suppose $X^\top X$ is of rank $k \le d$. Consider the Gradient Descent path, starting at $w_0 = 0$,
$$w_{t+1} = w_t - \eta \nabla \hat R(w_t).$$
Argue that this descent path is at most $k$-dimensional, that is, $\{w_t\}_{t \in \mathbb{N}}$ belong to a $k$-dimensional space, i.e., a hyperplane of dimension $k$ (although we're operating in $\mathbb{R}^d$). Remark that for this to be a sound procedure that converges, we require an appropriate step-size $\eta$ so that $\|I_d - \eta X^\top X\|_{\mathrm{op}} < 1$.

As a hint, recall the characterization
$$w_t = \eta \cdot \Big( \sum_{s=1}^{t} \big( I_d - \eta \cdot X^\top X \big)^{s-1} \Big) \cdot X^\top Y, \qquad t \ge 1,$$
and notice that we might write $w_t = \eta (A_t + B) b$ for $b = X^\top Y \in \mathbb{R}^d$. Alternatively, the relation $w_{t+1} = w_t - \eta \nabla \hat R(w_t)$ yields the same conclusion by simple induction.
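Purely as an optional numerical sanity check of the hinted characterization (not a substitute for the argument the problem asks for), the sketch below runs the gradient descent recursion above on a small random instance with rank-deficient X and compares the final iterate with the closed-form sum from the hint; the sizes, random seed and step size are arbitrary choices:

import numpy as np

rng = np.random.default_rng(0)
n, d, k = 8, 5, 2                               # arbitrary sizes with rank(X) = k < d
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))   # X has rank k (almost surely)
Y = rng.normal(size=n)

eta = 0.9 / np.linalg.norm(X.T @ X, 2)          # step size below 1 / ||X^T X||_op
T = 50

# Gradient descent on R_hat(w) = (1/2) ||Xw - Y||^2, starting from w_0 = 0
w = np.zeros(d)
for t in range(T):
    w = w - eta * (X.T @ (X @ w - Y))

# Closed form from the hint: w_T = eta * (sum_{s=1}^{T} (I_d - eta X^T X)^{s-1}) X^T Y
M = np.eye(d) - eta * (X.T @ X)
S = np.zeros((d, d))
P = np.eye(d)                                   # P holds M^{s-1}
for s in range(1, T + 1):
    S += P
    P = P @ M
w_closed = eta * S @ (X.T @ Y)

print(np.allclose(w, w_closed))                 # expected: True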