STAT7008 Project list
For questions on the topics, you are encouraged to post them on Moodle. For
individual questions, please contact:
Mr. Minghao Yin (yinmh17@connect.hku.hk) for Topics 1-3;
Mr. Silin Cheng (hnslcheng@connect.hku.hk) for Topics 4-6;
Mr. Hongjun Wang (hjwang@connect.hku.hk) for Topics 7-9.
1. Sudoku Puzzle
Sudoku is a logic-based number-placement puzzle. The objective is to fill a
9 × 9 grid with digits so that each column, each row, and each of the nine 3 × 3
subgrids that compose the grid (also called "boxes", "blocks", or "regions")
contains all digits from 1 to 9. The puzzle setter provides a partially completed
grid. A well-posed puzzle has only a single solution. A typical Sudoku puzzle and
its solution are shown below.
1) Implement your own Sudoku generation program. The program will generate
Sudoku puzzles automatically. For the Sudoku generator, there are three
questions you need to answer:
i. How to ensure the existence of a solution to your generated Sudoku puzzles?
ii. How to evaluate the difficulty of generated Sudoku puzzles?
iii. Moreover, how to ensure that each generated Sudoku puzzle has a unique
solution?
2) Design and implement a Sudoku solver that solves the puzzles generated by
your own generator, and evaluate its solving speed. It is strongly recommended
to design an efficient solver; a minimal sketch is given below.
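A minimal backtracking sketch for questions i and iii and for the solver in 2), assuming the grid is represented as a 9 × 9 list of lists with 0 marking empty cells; treat it as a starting point, not a definitive implementation:

```python
# Backtracking solver plus a solution counter for uniqueness checking.
# Assumes `grid` is a 9x9 list of lists with 0 for empty cells.

def candidates(grid, r, c):
    """Digits that can legally be placed at (r, c)."""
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return [d for d in range(1, 10) if d not in used]

def solve(grid):
    """Fill grid in place; returns True if a solution exists."""
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                for d in candidates(grid, r, c):
                    grid[r][c] = d
                    if solve(grid):
                        return True
                    grid[r][c] = 0
                return False
    return True  # no empty cell left: solved

def count_solutions(grid, limit=2):
    """Count solutions by backtracking, stopping early once `limit` is reached."""
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                total = 0
                for d in candidates(grid, r, c):
                    grid[r][c] = d
                    total += count_solutions(grid, limit - total)
                    grid[r][c] = 0
                    if total >= limit:
                        return total
                return total
    return 1  # complete grid: exactly one solution on this branch
```

One common generator design: solve an empty grid with the candidate order shuffled to obtain a random completed grid, then remove cells one at a time, keeping each removal only while count_solutions(grid) == 1 (questions i and iii); difficulty (question ii) can then be estimated from the number of remaining givens or from the backtracking effort the solver needs.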
2. Revenue Prediction for Movies
Here we provide movie information with a variety of metadata obtained from The
Movie DataBase (TMDB). Movie attributes include id, cast, crew, plot
keywords, budget, posters, release dates, languages, production companies,
countries, etc. A few example columns from the movies DataFrame are shown below.
1) You should predict the revenue of movies (the last column in the DataFrame)
based on the given information. We provide training data in tmdb_train.csv
and testing data in tmdb_test.csv. Design and implement your own revenue
prediction algorithm.
2) For evaluation on testing data, you should use the Root Mean Squared Log
Error (RMSLE). Note that, in the formulation below, X is the predicted value and
Y is the actual (ground-truth) value.
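The formula itself appears as an image in the original handout; the standard RMSLE definition, consistent with the description above, is

$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(X_i + 1) - \log(Y_i + 1)\bigr)^2}$$

where n is the number of test movies.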
3) Analyze which elements of a movie most strongly influence its final revenue.
For example, you can study the key success factors of some top-grossing
movies. Data visualization should be used in the analysis.
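A minimal baseline sketch for 1) and 2), assuming pandas and scikit-learn; the feature columns named here ("budget", "popularity", "runtime") are a hypothetical subset to adapt to the actual DataFrame:

```python
# Baseline revenue prediction: random forest on a few numeric features,
# trained on log1p(revenue) so the squared loss matches the RMSLE objective.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def rmsle(pred, actual):
    """Root Mean Squared Log Error between predictions X and ground truth Y."""
    return np.sqrt(np.mean((np.log1p(pred) - np.log1p(actual)) ** 2))

train = pd.read_csv("tmdb_train.csv")
features = ["budget", "popularity", "runtime"]   # hypothetical subset of columns
X = train[features].fillna(0)
y = train["revenue"]

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_tr, np.log1p(y_tr))
val_pred = np.expm1(model.predict(X_val))        # back to the revenue scale
print("validation RMSLE:", rmsle(val_pred, y_val))
```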
3. K-MNIST Dataset Classification
Kuzushiji-MNIST is a dataset focused on Kuzushiji (cursive Japanese); you
can visit https://github.com/rois-codh/kmnist to learn more about the
dataset. The figure below shows the 10 classes of the K-MNIST dataset, with the
first column showing each character's modern hiragana counterpart.
1) Download the K-MNIST dataset. Try to cluster the images without using labels.
Evaluate your clustering result with criteria such as NMI (Normalized Mutual
Information) and ARI (Adjusted Rand Index).
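A minimal sketch for 1), assuming torchvision is available to download the data and scikit-learn provides K-Means and the two criteria; labels are used only for evaluation:

```python
# K-Means on flattened K-MNIST images, evaluated with NMI and ARI.
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score
from torchvision.datasets import KMNIST

data = KMNIST(root="./data", train=True, download=True)
X = data.data.numpy().reshape(len(data), -1)[:10000] / 255.0  # subsample for speed
y = data.targets.numpy()[:10000]

kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
print("NMI:", normalized_mutual_info_score(y, kmeans.labels_))
print("ARI:", adjusted_rand_score(y, kmeans.labels_))
```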
2) Implement an image dataloader for the K-MNIST dataset using the torchvision
package, and set a proper input batch size, for example 64 for
training and 1000 for testing on this dataset.
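A minimal dataloader sketch for 2) with the suggested batch sizes; the normalization constants are an assumption:

```python
# K-MNIST dataloaders via torchvision (64 for training, 1000 for testing).
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),                 # [0, 255] -> [0.0, 1.0]
    transforms.Normalize((0.5,), (0.5,)),  # assumed normalization constants
])

train_set = datasets.KMNIST("./data", train=True, download=True, transform=transform)
test_set = datasets.KMNIST("./data", train=False, download=True, transform=transform)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=1000, shuffle=False)
```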
3) Implement an MLP (multilayer perceptron) model for K-MNIST
classification. You could define a function for the training loop of one epoch
and a function for evaluation on the validation/test set that returns the averaged
loss and accuracy.
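A minimal sketch for 3), reusing the loaders above: one function trains for a single epoch, the other returns the averaged loss and accuracy:

```python
# Simple MLP plus train/evaluate functions for K-MNIST classification.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 256), nn.ReLU(),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        return self.net(x)

def train_one_epoch(model, loader, optimizer):
    model.train()
    for images, labels in loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(images), labels)
        loss.backward()
        optimizer.step()

def evaluate(model, loader):
    """Returns (averaged loss, accuracy) on the given set."""
    model.eval()
    total_loss, correct = 0.0, 0
    with torch.no_grad():
        for images, labels in loader:
            logits = model(images)
            total_loss += F.cross_entropy(logits, labels, reduction="sum").item()
            correct += (logits.argmax(dim=1) == labels).sum().item()
    n = len(loader.dataset)
    return total_loss / n, correct / n
```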
4) Implement a convolutional neural network (such as a ResNet model) for K-
MNIST classification. (You can use the torchvision package, which
provides stable and well-tested implementations of various network
architectures.)
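For 4), one common approach (an assumption, not the only option) is to adapt torchvision's ResNet-18 to the 1-channel 28 × 28 input:

```python
# Adapt ResNet-18 to grayscale K-MNIST: replace the first convolution
# and disable the initial max-pool so small images keep enough resolution.
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(num_classes=10)               # 10 K-MNIST classes
model.conv1 = nn.Conv2d(1, 64, kernel_size=3,  # 1 input channel, smaller kernel
                        stride=1, padding=1, bias=False)
model.maxpool = nn.Identity()
# Train with the same loop as the MLP sketch above.
```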
5) Visualize and analyze the classification result with a confusion matrix. Figure
out which classes or what kinds of samples are easily misclassified.
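A minimal confusion-matrix sketch for 5), assuming model and test_loader from the sketches above:

```python
# Collect predictions on the test set and plot the confusion matrix.
import torch
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

preds, labels = [], []
model.eval()
with torch.no_grad():
    for images, targets in test_loader:
        preds.extend(model(images).argmax(dim=1).tolist())
        labels.extend(targets.tolist())

ConfusionMatrixDisplay.from_predictions(labels, preds)
plt.show()
```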
6) Generally, for the classification tasks in 3) and 4), there are several steps
you could follow:
a) Configure your (MLP or CNN) model, optimizer, and scheduler.
b) Design training and testing functions.
c) Train the model and test the validation performance after each epoch.
d) Visualize the training process via the training loss and validation accuracy.
e) Report the averaged loss and accuracy on the test set.
f) Report the class-wise accuracy on the test set.
4. Digital Media Hotspot Mining using word clouds
A word cloud is an image made of words that together resemble a cloudy shape.
The size of a word in the cloud shows how important it is, e.g., how often it
appears in the text (its frequency). People typically use a word cloud to
summarize large documents (reports, speeches), to create art on a topic
(gifts, displays), or to visualize data (tables, surveys).
For example, if we want to extract hot words from the headlines on news web
pages, we can draw a word cloud through the following steps:
(1) Crawl and analyze news headlines from news websites
a) Use the get function of the Requests library to crawl news web pages.
b) Use the findall() function in the re module to extract news titles and store
them in a file.
(2) Segment the text of the news titles.
To catch the hot words, you must first segment the words in the headlines of the
news articles. You can use a word segmentation tool such as
jieba (https://github.com/fxsjy/jieba) or NLTK (https://www.nltk.org/) in Python.
(3) Remove punctuation and stop words.
(4) Draw a word cloud according to word frequency.
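A minimal end-to-end sketch of steps (1)-(4) for an English-language news page; the URL and the regular expression are placeholders to adapt to the target site:

```python
# Crawl headlines, segment, drop stop words, and render a word cloud.
import re
import requests
import nltk
from wordcloud import WordCloud

nltk.download("stopwords", quiet=True)
stop = set(nltk.corpus.stopwords.words("english"))

html = requests.get("https://example.com/news").text       # (1a) placeholder URL
titles = re.findall(r"<h2[^>]*>(.*?)</h2>", html)          # (1b) placeholder pattern

tokens = re.findall(r"[a-z']+", " ".join(titles).lower())  # (2) crude segmentation
tokens = [t for t in tokens if t not in stop]              # (3) remove stop words

cloud = WordCloud(width=800, height=400).generate(" ".join(tokens))  # (4)
cloud.to_file("news_wordcloud.png")
```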
5. Sentiment analysis
Sentiment analysis (also known as opinion mining or emotion AI) is the use
of natural language processing, text analysis, computational linguistics,
and biometrics to systematically identify, extract, quantify, and study affective
states and subjective information. With sentiment analysis, machines
can automatically judge the positive or negative emotional tendency of natural
language texts with subjective descriptions. It is widely used in comment analysis
for decision-making, e-commerce review classification, public opinion
monitoring, etc.
(1) Since sentiments can be classified into discrete polarities or scales (for
example, positive and negative), we can regard sentiment analysis as a text
classification task, which converts text sequences of varying length into a
fixed-length representation. In this project we will use Stanford University's
large movie review dataset for sentiment analysis
(https://ai.stanford.edu/~amaas/data/sentiment/). It consists of a training set
and a test set, each containing 25,000 movie reviews downloaded from IMDb. In
both sets, the numbers of "positive" and "negative" labels, indicating the two
sentiment polarities, are the same.
(2) First, download the IMDb review dataset.
(3) Process the training and test datasets. Each example is a review with a label: 1
for "positive" and 0 for "negative". You can treat each word as a token,
filter out the words that appear fewer than 5 times, and create a
vocabulary from the training set.
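A minimal vocabulary-building sketch under the rules above (tokens are whitespace-split words, minimum frequency 5); train_texts is a placeholder for the reviews you load:

```python
# Build a word-to-index vocabulary from the training reviews.
from collections import Counter

def build_vocab(texts, min_freq=5):
    counter = Counter(word for text in texts for word in text.lower().split())
    vocab = {"<unk>": 0}  # index 0 reserved for out-of-vocabulary words
    for word, freq in counter.most_common():
        if freq >= min_freq:
            vocab[word] = len(vocab)
    return vocab

# Toy example; use the default min_freq=5 on the real training set.
train_texts = ["this movie was great", "this movie was terrible"]
vocab = build_vocab(train_texts, min_freq=1)
```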
(4) Use pre-trained word-vector models for sentiment analysis. Because the
IMDb dataset is not large, using text representations pre-trained on a large
corpus can reduce overfitting of the model. We recommend you use the
pre-trained GloVe
(https://github.com/allenai/spv2/blob/master/model/glove.6B.100d.txt.gz)
model to represent each token, and feed these token representations into
at least 2 different models (such as a multi-layer bidirectional recurrent neural
network, textCNN, etc.) to obtain the final text sequence representation.
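One possible first step (a sketch, not the required method): load the GloVe vectors into a frozen PyTorch embedding layer, assuming the downloaded file unzips to glove.6B.100d.txt and that vocab is the word-to-index mapping from the sketch above:

```python
# Load GloVe vectors for the words in our vocabulary into an embedding layer.
import numpy as np
import torch
import torch.nn as nn

dim = 100
weights = np.random.normal(scale=0.1, size=(len(vocab), dim))  # init for OOV words
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, *values = line.rstrip().split(" ")
        if word in vocab:
            weights[vocab[word]] = np.asarray(values, dtype=np.float32)

# freeze=True keeps the pre-trained vectors fixed, which helps on small datasets.
embedding = nn.Embedding.from_pretrained(
    torch.tensor(weights, dtype=torch.float32), freeze=True)
```

The resulting embedding layer can then be placed in front of the two sequence models (e.g. a bidirectional RNN and a textCNN) mentioned above.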
6. Recommender systems
Recommender systems aim to predict users' interests and recommend product
items that are likely to be interesting to them. The objective of this project is to
implement a recommendation system to filter and predict movies that users may
like based on their historical review data。
(1) First, you should understand different filtering strategies in recommendation
systems, including content-based filtering, item-based collaborative filtering,
and user-based collaborative filtering.
(2) Preprocess the data. For example, we do not want to consider movies that
were rated by only a small number of users, because such ratings are not credible
enough. Similarly, users who rate only a few films should not be considered.
Therefore, considering all these factors, we will reduce the noise by adding some
filters to the final data (a minimal sketch follows item (3)):
a) A valid movie should be voted on by at least 10 users.
b) A valid user should have voted for at least 50 movies.
(3) Please try to use a variety (at least 2) of recommendation algorithms (content-
based filtering, item-based collaborative filtering, and user-based collaborative
filtering) to complete the recommendations for the provided movie dataset.
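A minimal sketch of the filters in (2) followed by item-based collaborative filtering for (3); it assumes a MovieLens-style ratings.csv with userId, movieId and rating columns, which may differ from the provided dataset:

```python
# Filter sparse movies/users, then compute item-item cosine similarity.
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

ratings = pd.read_csv("ratings.csv")

# (2a) keep movies rated by at least 10 users
ratings = ratings.groupby("movieId").filter(lambda g: len(g) >= 10)
# (2b) keep users who have rated at least 50 movies
ratings = ratings.groupby("userId").filter(lambda g: len(g) >= 50)

# Item-based CF: similarity between movies over the user-item matrix.
matrix = ratings.pivot_table(index="movieId", columns="userId",
                             values="rating").fillna(0)
similarity = pd.DataFrame(cosine_similarity(matrix),
                          index=matrix.index, columns=matrix.index)

def similar_movies(movie_id, k=10):
    """Top-k movies most similar to the given movie."""
    return similarity[movie_id].drop(movie_id).nlargest(k)
```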
7. Dim Sum Classification
Dim Sum is the most famous cuisine in Hong Kong. It comprises a large range of
small Chinese dishes that are traditionally enjoyed in restaurants for brunch, like
Shumai, egg tarts, sweet cream buns, etc. But newcomers often neither know the
names of specific dishes nor understand Cantonese, which makes things even
harder when the waiter can only speak Cantonese. So our goal is to design a model
that can automatically recognize the category of Dim Sum and help the user place
the right order.
(1) Please collect a dataset for Dim Sum. The images can be scraped from websites.
You should cover at least five types of dishes.
(2) Preprocess the collected data.
a) Randomly divide the samples you collected into a training set and a testing
set that are non-overlapping (i.e. samples in the training set must not
be included in the testing set).
b) The testing set should only contain categories already seen in the training set
(e.g. if the training set only has images of Shumai and egg tart, images of the
sweet cream bun should not exist in the testing set).
c) The testing set should be composed of images from websites and photos
you take.
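A minimal split sketch for (2a), assuming the scraped images are arranged as dimsum/<class_name>/*.jpg; note that a plain random split does not by itself guarantee requirement (2b), so a per-class (stratified) split is safer:

```python
# Load scraped Dim Sum images and make a non-overlapping 80/20 split.
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

dataset = datasets.ImageFolder("dimsum", transform=transform)

n_train = int(0.8 * len(dataset))
train_set, test_set = torch.utils.data.random_split(
    dataset, [n_train, len(dataset) - n_train],
    generator=torch.Generator().manual_seed(0))  # fixed seed: reproducible split
```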
(3) Design a model to categorize the different dishes.
a) Select a suitable model (e.g. SVM, MLP, CNN) and use data from the training
set to train your model.
b) Evaluate performance on the testing set after each epoch.
c) Plot the training loss, and the accuracy on both the training set and testing set,
throughout training.
Note: Please analyze the performance of the model and give your observations, such as
differences between images collected from websites and photos you/your friends take.
Optionally, you may further use photos of Dim Sum taken by yourselves for testing.
The dim sum menu has not stayed fixed over time. Teahouse owners gradually added
various snacks called "dim sum" to their offerings, and the practice of having tea with
an increasing variety of dim sum eventually evolved into the modern yum cha (brunch).
(4) If the master of a teahouse plans to roll out a new dish as the specialty of his/her
house (i.e. innovative and only available there), how would our model perform? To
simulate the process, please collect additional images that do not belong to any
category above as the specialty, then evaluate our model on these new samples.
Give the class-wise predictions of our model and analyze the phenomenon.
(5) Can we find a solution for our model to handle this nuisance? For example,
introduce images of that category into the training dataset and retrain the model, or
fine-tune the model using these samples. Compare the evaluation results of models
trained with different strategies on the testing set and give your analysis.
8. Make Money in Stocks!
Investing is one of the best ways to build wealth over your lifetime, and it requires
less effort than you might think. You can invest in any famous company like Tesla
Inc (TSLA), Apple Inc (AAPL), NIKE Inc (NKE) and Amazon Inc (AMZN).
There is a lot of data involved in stock market trend analysis, and in order to start
analyzing, we must first identify which sector to pick. The focus can either
be on a type of industry, like the pharmaceutical sector, or on a kind of
investment, like the bond market. Only when you have selected your sector can
you start analyzing it.
(1) Collect stock data covering the past five years, including the names of companies,
dates, prices (open, high, low, close) and trading volume. Save the stock data of at
least 50 companies in a .csv file.
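A minimal data-collection sketch for (1) using the yfinance package, one common wrapper around the Yahoo Finance data mentioned in the hints; the ticker list is a placeholder to extend to at least 50 companies:

```python
# Download five years of daily OHLC + volume data and save to one CSV.
import pandas as pd
import yfinance as yf

tickers = ["TSLA", "AAPL", "NKE", "AMZN"]  # extend to at least 50 companies

frames = []
for ticker in tickers:
    df = yf.download(ticker, period="5y", interval="1d")  # open/high/low/close/volume
    df["Name"] = ticker
    frames.append(df)

pd.concat(frames).to_csv("stocks.csv")
```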
(2) Categorize companies according to their sectors. You should split them into at
least five sectors and find the most profitable companies in each sector.
(3) The slope of a trend indicates how much the price should move each day. Steep
lines, moving either upward or downward, indicate a definite trend. Please plot
the trend of the stock prices of different companies during the last three years, and
describe your observations. It would be nice if you could visualize the trend sector
by sector.
Market makers are compensated for the risk of holding assets because they may
see a decline in the value of a security after it has been purchased from a seller
and before it is sold to a buyer. Consequently, they commonly charge a
spread on each security they cover. For example, suppose an
investor thinks that Meta Platforms Inc. (META), formerly Facebook, is
overvalued at $200 per share and will decline in price. In that case, the investor
could "borrow" 10 shares of Meta from their broker and then sell the shares for
the current market price of $200. If the stock goes down to $125, the investor
could buy the 10 shares back at this price, return the borrowed shares to their
broker, and net $750 ($2,000 - $1,250). However, if Meta's share price rises to
$250, the investor would lose $500 ($2,000 - $2,500).
(4) If time travel were possible and you could go back three months, could you
allocate your savings to maximize the profit (and minimize the loss) based on the
historical prices of the past five years (excluding the most recent three months)?
Suppose you have a starting capital fund of $10,000; provide your model/strategy
and give your analysis.
The stock market trend analysis includes both external and internal forces that
affect it. Changes in a similar industry or the introduction of a new governmental
regulation qualify as forces impacting the market. Analysts then take this data and
attempt to predict the direction the market will take, moving forward.
(5) You might find many turning points in (3). Using the knowledge we learned
in the lectures, could you automatically find on social media the events that might
have affected the stock price when the turning points occurred? Please provide
quantitative evidence and your analysis of why an event affects a company,
a sector, or even the whole stock market.
Hints:
(i) You can use the Yahoo Finance API to capture real stock data.
(ii) Making money from stocks does not mean trading (i.e. buying or
selling) often; you can also own and hold your securities for a long time.
9. Attitudes toward the Incoming Quarantine Scheme
Social media platforms are no longer just a means of connecting with people.
Over time, they have played an essential role in setting notions for various
political parties and in letting citizens voice their opinions or spread awareness
regarding different political parties; they have rather become a medium for
voicing opinions. Digital movements like #StopFundingHate, #BlackLivesMatter,
#MeToo, etc., have been recognized and discussed globally. Political parties have
realized the influence of social media and accordingly analyze the sentiments of
citizens.
(1) As a warm-up, first pick a social media platform such as Twitter or
Facebook, as you wish; then select one of the above three hashtags and scrape the
relevant tweets under it. The number of scraped tweets should not
be less than 5000.
(2) Did Elon Musk / Donald Trump / Bill Gates post tweets related to the above
topics? If so, pick out all such tweets. Otherwise, print the five most influential
tweets from him.
Recently, the HKSAR government announced the city would switch from a "3+4"
scheme, which required arrivals to undergo three days of hotel quarantine and four
more under medical surveillance, to a less onerous "0+3" regime, with the change
coming into effect on September 26th, 2022.
(3) Suppose you are hired by the government and assigned the task of investigating
the attitude of 'netizens'. Scrape the public posts and political texts under certain
hashtags on the chosen social media platform to analyze the general sentiments
of the citizens regarding the new policy.
a) Group tweets you scraped from different perspectives, like economy,
healthcare, management and happiness of residents.
b) Pick a suitable pretrained model to analyze the texts in different groups.
You can also train a model by yourself if you have sufficient computational
resources.
c) Find out the most positive/neutral/negative tweets for each group by
calculating the VADER sentiment scores of the texts.
d) Visualize the distribution of sentiment scores for each group. We would
like to see the intensity of all positive/neutral/negative feedback for the
new quarantine policy.
e) Based on the above quantitative results, what is your opinion of the new
policy? Should we agree or disagree with it after considering our "opinion
poll"?
Hints: You can use Natural Language Toolkit (NLTK) to calculate the VADER
score.
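A minimal sketch of the VADER scoring mentioned in the hint; the example tweets are placeholders for the texts you scrape:

```python
# Score texts with NLTK's VADER sentiment analyzer.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

tweets = [  # placeholders for scraped posts
    "The new 0+3 scheme is great news for travellers!",
    "Quarantine rules are still too strict.",
]
for tweet in tweets:
    scores = sia.polarity_scores(tweet)  # neg/neu/pos plus compound in [-1, 1]
    print(scores["compound"], tweet)
```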