STAT7008 Programming for Data Science
Department of Statistics and Actuarial Science
The University of Hong Kong
Assignment 3
(Due on Dec. 7, 2023)
There are 8 questions (100 pts total) which cover 4 chapters, i.e., machine learning,
web scraping, deep learning, and computer vision. Please submit the write-up in either
PDF or HTML format and the code in the provided Jupyter notebook.
Machine Learning
1. (15 pts) Given a matrix X, where each row represents a different data point, you are asked to
perform k-means clustering on this dataset using the Euclidean distance as the distance
function. Here k is chosen as 3. The Euclidean distance between a vector x = [x1, x2, …, xp]T
and a vector y = [y1, y2, …, yp]T, both in Rp, is defined as d(x, y) = √((x1 - y1)² + (x2 - y2)² + … + (xp - yp)²). All
data in X are given below. Three points, i.e., (x3, x5, x8), were randomly chosen as the
initial centers of the three clusters, namely µ1 = (5.5, 3.0), µ2 = (6.5, 3.0), µ3 = (6.5, 3.5).
X = [x1, x2, x3, x4, x5, x6, x7, x8]
= [(3.5, 3.5), (3.0, 3.0), (5.5, 3.0), (5.5, 2.5), (6.5, 3.5), (5.0, 4.0), (5.5, 3.5), (6.5, 3.0)]
(1) What is the center of the first cluster (µ1) after one iteration? (Hint: one iteration
includes two steps, i.e., assigning the data points to clusters and estimating the new cluster
centers.) (3 pts)
(2) What’s the center of the second cluster (µ2) after two iterations? (3 pts)
(3) How many iterations are required for the K-means algorithm to converge? Specify the
final centers of the three clusters at convergence. (6 pts)
(4) Using X, perform simple K-means clustering with the scikit-learn library and visualize
the final results. (3 pts)
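A minimal sketch for part (4), assuming the given initial centers are passed directly to KMeans (any valid initialization would also satisfy the question):

```python
# Sketch for Question 1(4): K-means (k=3) on the 8 points with scikit-learn.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.array([(3.5, 3.5), (3.0, 3.0), (5.5, 3.0), (5.5, 2.5),
              (6.5, 3.5), (5.0, 4.0), (5.5, 3.5), (6.5, 3.0)])
init_centers = np.array([(5.5, 3.0), (6.5, 3.0), (6.5, 3.5)])  # the given µ1, µ2, µ3

km = KMeans(n_clusters=3, init=init_centers, n_init=1).fit(X)

# scatter plot of the points colored by cluster, with final centers marked
plt.scatter(X[:, 0], X[:, 1], c=km.labels_)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            c='red', marker='x', s=100)
plt.title('K-means clustering (k = 3)')
plt.show()
```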
2. (5 pts) Handwritten digits dataset loading and preprocessing:
(1) Load the digits data with sklearn.datasets.load_digits. (1 pt)
(2) Use MinMaxScaler to normalize the covariates X. (2 pts)
(3) Split the data into training and test sets with test_size = 0.2 and random_state = 2020. (2 pts)
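A minimal sketch of the loading, scaling, and splitting steps with scikit-learn:

```python
# Sketch for Question 2: load, scale, and split the digits data.
from sklearn.datasets import load_digits
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

digits = load_digits()
X, y = digits.data, digits.target

# (2) scale each covariate to [0, 1]
X_scaled = MinMaxScaler().fit_transform(X)

# (3) 80/20 split with the required random seed
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=2020)
```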
3. (5 pts) Following question 2, fit the model specified below with different hyper-parameters,
and report the performance.
(1) Fit the Naive Bayes model MultinomialNB on the digits training set with different values
of the parameter α ∈ {1, 2, …, 20}. (1 pt)
(2) Record the accuracy scores on the test set for each α. (2 pts)
(3) Draw a line plot of the accuracy scores versus α. (2 pts)
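A possible sketch of the α sweep, assuming the X_train/X_test/y_train/y_test split from the Question 2 sketch above:

```python
# Sketch for Question 3: MultinomialNB over alpha = 1..20, then a line plot.
import matplotlib.pyplot as plt
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

alphas = list(range(1, 21))
scores = []
for a in alphas:
    clf = MultinomialNB(alpha=a).fit(X_train, y_train)      # fit for this alpha
    scores.append(accuracy_score(y_test, clf.predict(X_test)))

plt.plot(alphas, scores, marker='o')
plt.xlabel('alpha')
plt.ylabel('test accuracy')
plt.title('MultinomialNB accuracy vs. alpha')
plt.show()
```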
4. (5 pts) Following question 2, apply dimensionality reduction methods to the digits
dataset.
(1) Fit a Principal Component Analysis (PCA, n_components=2) model on the digits training set
for dimension reduction. (1 pt)
(2) Apply the model from (1) to the train/test sets to compute the 2-
dimensional embedded train/test sets. (1 pt)
(3) Fit a nearest-neighbor classifier (KNN, n_neighbors=3) on the embedded training set.
Compute the nearest-neighbor accuracy on the embedded test set, plot the projected test-set
points, and show the evaluation score. (1 pt)
(4) Use Neighborhood Components Analysis (NCA, n_components=2) for dimensionality
reduction and repeat (1), (2), and (3). (2 pts)
Note: present the results in the image format shown below; no outputs are required for (1) and (2).
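One possible sketch covering (1)-(4), again assuming the split from the Question 2 sketch; the helper name embed_and_score is illustrative only:

```python
# Sketch for Question 4: PCA vs. NCA embeddings followed by a 3-NN classifier.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis

def embed_and_score(reducer, name):
    """Fit the reducer, embed train/test, fit 3-NN, report accuracy, plot test embedding."""
    reducer.fit(X_train, y_train)            # PCA ignores y; NCA uses the labels
    Z_train = reducer.transform(X_train)
    Z_test = reducer.transform(X_test)
    knn = KNeighborsClassifier(n_neighbors=3).fit(Z_train, y_train)
    acc = knn.score(Z_test, y_test)
    plt.scatter(Z_test[:, 0], Z_test[:, 1], c=y_test, s=10)
    plt.title(f'{name} embedding, KNN test accuracy = {acc:.3f}')
    plt.show()

embed_and_score(PCA(n_components=2), 'PCA')
embed_and_score(NeighborhoodComponentsAnalysis(n_components=2, random_state=2020), 'NCA')
```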
Web Scraping
5. (15 pts) Crawl information from sciencedirect.com
(1) Crawl key information about all articles published in 2023 from the website
https://www.sciencedirect.com/journal/journal-of-econometrics/issues, including year,
volume, article content, title, authors, and pages. Crawl volumes 233 to
236 only. (5 pts)
(2) Remove “\xa0” from volume_name and store the crawled data in a pandas DataFrame. (5 pts)
(3) Filter out authors with null values and then find the top 10 authors who published the most
articles. (5 pts)
Hint:
• Click the button of the targeted item
• Pass the HTML to BeautifulSoup and get all links
• Use requests to get the article content, title, authors, and pages for each block
For example:
article content: Research article
title: Identification in nonparametric models for dynamic treatment effects
authors: Sukjin Han
pages: Pages 132-147
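A heavily hedged sketch of one possible approach with requests, BeautifulSoup, and pandas. The issue-page URL pattern, tag names, and CSS classes below are assumptions for illustration only and must be checked against the live ScienceDirect markup; a browser-automation tool (e.g., Selenium) may be needed to click through the issue list.

```python
# Sketch for Question 5: crawl volumes 233-236 and summarize authors.
import requests
import pandas as pd
from bs4 import BeautifulSoup

BASE = 'https://www.sciencedirect.com/journal/journal-of-econometrics'
HEADERS = {'User-Agent': 'Mozilla/5.0'}          # the site may reject bare requests

def text_of(node):
    """Safely extract text and drop non-breaking spaces ('\xa0')."""
    return node.get_text(strip=True).replace('\xa0', ' ') if node else ''

records = []
for vol in range(233, 237):                      # volumes 233-236 only
    url = f'{BASE}/vol/{vol}'                    # hypothetical URL pattern
    soup = BeautifulSoup(requests.get(url, headers=HEADERS).text, 'html.parser')
    volume_name = text_of(soup.find('h2'))       # assumed location of the volume name
    for block in soup.find_all('li', class_='js-article-list-item'):   # assumed class
        records.append({
            'year': 2023,
            'volume_name': volume_name,
            'article_content': text_of(block.find('span', class_='js-article-subtype')),
            'title': text_of(block.find('span', class_='js-article-title')),
            'authors': text_of(block.find('div', class_='js-article__item__authors')),
            'pages': text_of(block.find('div', class_='js-article-item-pages')),
        })

df = pd.DataFrame(records)                       # (2) store in a DataFrame

# (3) drop empty authors, then count the 10 most frequent individual authors
top10 = (df.loc[df['authors'] != '', 'authors']
           .str.split(', ').explode()
           .value_counts().head(10))
print(top10)
```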
Deep Learning
6. (20 pts) In this question, you need to build a lightweight CNN for an image classification task
on the EMNIST data.
(1) Complete the tasks (implement the functions and the analysis) listed below. Please
check ‘Question6.ipynb’ for more details.
• Create the training and validation datasets (2 pts)
• Build the model (2 pts)
• Train the model (4 pts)
• Compute the accuracy on the test set (2 pts)
(2) Write down what methods/parameters you have tried and a detailed analysis of the results.
You can write your analysis either in a PDF or in the notebook file. (10 pts)
Note: This is a coding contest based on CNNs and feature extraction, and it aims to encourage
participants to apply their knowledge to problem solving.
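A minimal sketch of a lightweight CNN in PyTorch. The input shape (1×28×28) and the number of classes are assumptions that depend on the EMNIST split used in Question6.ipynb:

```python
# Sketch for Question 6: a small two-block CNN for 28x28 grayscale images.
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes=26):          # assumption: letters split
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x):
        x = self.features(x)                     # (N, 32, 7, 7) after two poolings
        return self.classifier(x.flatten(1))     # raw logits

model = SmallCNN()
out = model(torch.randn(4, 1, 28, 28))           # sanity check: shape (4, num_classes)
print(out.shape)
```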
7. (20 pts) In this question, you are required to understand how to train a neural network
implemented in PyTorch. Before we begin, let’s prepare the environment:
• Open Question7_nn.ipynb, provided in the supplementary material, in your Jupyter
notebook or Google Colab.
• Install the necessary Python libraries, particularly PyTorch.
(1) Design the baseline model in Question7_nn.ipynb based on the forward graph
shown in Figure 1. (5 pts)
Figure 1: Forward graph of the baseline model
(2) Create two models: one with a Sigmoid activation function and another with no
activation function. Compare the baseline model with these modified versions.
Analyze the results and provide justification for your observations. (5 pts)
(3) Instead of using the cross-entropy loss from PyTorch, implement it yourself.
Compare the results between the PyTorch cross-entropy loss function and your own
implementation (see the sketch after this question). (5 pts)
(4) Try to improve the performance of the current baseline model. Briefly describe what
you have improved and the reason behind your design choice. (5 pts)
Note: Please submit Question7_nn.ipynb to Moodle. Remember to save your running
results of Question7_nn.ipynb in Colab and download the file for submission.
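For part (3), a minimal sketch of a hand-written cross-entropy that should match nn.CrossEntropyLoss (mean reduction) on raw logits and integer class targets:

```python
# Sketch for Question 7(3): manual cross-entropy vs. PyTorch's built-in loss.
import torch
import torch.nn.functional as F

def my_cross_entropy(logits, targets):
    # log-softmax via logsumexp for numerical stability, then average the
    # negative log-probabilities of the true classes
    log_probs = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    return -log_probs[torch.arange(len(targets)), targets].mean()

logits = torch.randn(5, 10)
targets = torch.randint(0, 10, (5,))
print(my_cross_entropy(logits, targets))   # should agree with the line below
print(F.cross_entropy(logits, targets))
```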
Computer Vision
8. (15 pts) Face and Eye Detection
(1) Please write code to detect the faces and the eyes in Question8_face.jpg. Draw
red rectangles around the faces and green rectangles around the eyes. (10 pts)
(2) If we want to open the front camera to capture video and perform face and eye
detection, how should we modify the above code? (5 pts)
Hint: you may use the auxiliary .xml files and the detection algorithm based on Haar-like
features, both provided by OpenCV.
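A minimal sketch for part (1) using OpenCV's bundled Haar cascades; the cascade file names come from the opencv-python distribution, while the detection parameters are illustrative and may need tuning for the given image:

```python
# Sketch for Question 8(1): Haar-cascade face and eye detection.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_eye.xml')

img = cv2.imread('Question8_face.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

for (x, y, w, h) in face_cascade.detectMultiScale(gray, 1.3, 5):
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 0, 255), 2)   # red face box (BGR)
    roi_gray = gray[y:y + h, x:x + w]
    roi_color = img[y:y + h, x:x + w]
    for (ex, ey, ew, eh) in eye_cascade.detectMultiScale(roi_gray):
        cv2.rectangle(roi_color, (ex, ey), (ex + ew, ey + eh), (0, 255, 0), 2)  # green eye box

cv2.imwrite('Question8_detected.jpg', img)

# For (2): replace the static image with frames from the default camera,
# e.g. cap = cv2.VideoCapture(0), run the same detection on each frame inside
# a while loop, display with cv2.imshow, and break on a key press.
```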