GMGT 7530 - Assignment 3 (R)

Date: 2021-07-14

GMGT 7530 (T23)
(Instructors: Dr. Xikui Wang, Dr. Wenxi Pu, Dr. Carson Leung)
Assignment 3 (25%) on textual analysis
Due: July 16th, 2021 (before 12 noon Winnipeg time)
Drop your submissions to the UM Learn assignment 3 folder

Use R to complete the following questions. Submit your R code in its original form so that I can
test-run it. Include your results and your answers to questions 2.a and 3.c in a separate PDF file.
You are given a set of news articles about entrepreneurs from a list of newspapers headquartered
in the United States (EntrepreneurNewsArticles.zip). Each text file is one news article. The file
names include a random ID, the newspaper name, the publication year (YYYY), and the date (MMDD).
News articles from the same newspaper are organized into one folder. Please perform the
following tasks on this dataset.
Remark: Plenty of R code is available for this data set, covering various statistical analyses and
machine learning tasks. You are allowed to learn from that code. However, you must digest it and
write your own code, and attribute the original ideas properly. Install any necessary packages. Since
there are many things available online, feel free to go beyond what this assignment asks for.

1. (5%) Preparation of the data set
a. Load the data set into a data frame with columns for each news
article's source (i.e., the newspaper), publication year and date, and content
(similar to what you did in the lab, but you need to extract the year and date
instead of using the file names directly).
b. Clean the data (similar to what you did in the assignment; you might want to
modify the list of extra stop words that we created).
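
As a starting point for question 1, the sketch below shows one way to split file paths into source, year, and date columns and to apply a basic cleaning pass, using base R only. The file-name layout in the first comment is an assumption (the handout does not give the exact format), and the function names parse_article_paths, load_articles, and clean_text are hypothetical. Adjust the pattern once you have inspected the real file names.

```r
# Assumed layout (hypothetical; adjust the pattern to the real file names):
#   <root>/<Newspaper>/<ID>_<Newspaper>_<YYYY>_<MMDD>.txt

# Split article paths into source, year, and date columns.
parse_article_paths <- function(paths) {
  file_names <- basename(paths)
  # Trailing YYYY and MMDD, with at most one non-digit separator between them.
  pattern <- "([0-9]{4})[^0-9]?([0-9]{4})\\.txt$"
  matches <- regmatches(file_names, regexec(pattern, file_names))
  # Return capture group i - 1 for each file, or NA when the name does not match.
  pick <- function(i) {
    vapply(matches,
           function(m) if (length(m) == 3) m[i] else NA_character_,
           character(1))
  }
  data.frame(
    source = basename(dirname(paths)),  # folder name is the newspaper
    year = pick(2),
    date = pick(3),
    stringsAsFactors = FALSE
  )
}

# Read every .txt file under `root` into one data frame, one row per article.
load_articles <- function(root) {
  paths <- list.files(root, pattern = "\\.txt$", recursive = TRUE,
                      full.names = TRUE)
  articles <- parse_article_paths(paths)
  articles$content <- vapply(
    paths,
    function(p) paste(readLines(p, warn = FALSE), collapse = " "),
    character(1), USE.NAMES = FALSE
  )
  articles
}

# Lower-case the text, keep ASCII letters only, and drop the supplied stop words.
clean_text <- function(text, stop_words = character(0)) {
  text <- gsub("[^a-z]+", " ", tolower(text))
  tokens <- strsplit(trimws(text), " ", fixed = TRUE)
  vapply(tokens,
         function(tok) paste(tok[!tok %in% stop_words], collapse = " "),
         character(1))
}
```

A possible call, assuming the archive was unzipped into a folder named EntrepreneurNewsArticles, is articles <- load_articles("EntrepreneurNewsArticles") followed by articles$clean <- clean_text(articles$content, c(tm::stopwords("english"), "said")). The extra stop word "said" is only an illustration; your own list should come from inspecting the corpus.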
2. (10%) Deciding the number of topics
a. Use the ldatuning package to derive the optimal number of topics. Please provide
your justifications for your choice. Note that the dataset is a bit larger, so it is
better to sample a manageable subset (say, 25%) from the whole dataset for
deciding the number of topics.
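
For question 2, a minimal sketch of the sampling and tuning step might look like the following. It assumes the tm, slam, and ldatuning packages are installed, and that clean_texts is a character vector of cleaned articles. The helper names, the candidate range of topic counts, and the seed are all illustrative choices, not requirements of the assignment.

```r
# Draw a reproducible random subset of row indices (25% by default).
# Note: this resets the global random seed.
sample_fraction <- function(n, fraction = 0.25, seed = 7530) {
  set.seed(seed)
  sort(sample.int(n, size = ceiling(n * fraction)))
}

# Build a document-term matrix from a sample of the cleaned texts and score
# each candidate number of topics with the four ldatuning metrics.
# `candidate_k` is an illustrative range, not a recommendation.
tune_topic_count <- function(clean_texts, candidate_k = seq(5, 50, by = 5)) {
  idx <- sample_fraction(length(clean_texts))
  corpus <- tm::VCorpus(tm::VectorSource(clean_texts[idx]))
  dtm <- tm::DocumentTermMatrix(corpus)
  # Drop documents left with no terms after cleaning; LDA cannot use empty rows.
  dtm <- dtm[slam::row_sums(dtm) > 0, ]
  ldatuning::FindTopicsNumber(
    dtm,
    topics = candidate_k,
    metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
    method = "Gibbs",
    control = list(seed = 7530),
    verbose = TRUE
  )
}

# Example usage (commented out because it needs the real data and takes time):
# result <- tune_topic_count(articles$clean)
# ldatuning::FindTopicsNumber_plot(result)
```

When reading the plot, CaoJuan2009 and Arun2010 are metrics to minimize, while Griffiths2004 and Deveaud2014 are metrics to maximize. The metrics rarely agree on a single value, which is why the question asks for a justification of the number you settle on.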
3. (10%) Run an LDA model on the cleaned dataset with the derived number of topics
a. Develop a label for each topic.
b. Get the document-topic distribution.
c. Propose potential research questions and potential datasets that can be merged
with the document-topic distribution you got.
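
For question 3, the sketch below shows one way to fit the model and pull out the two pieces needed for labelling topics and for the document-topic distribution. It assumes the topicmodels package is installed and that dtm is a document-term matrix built from the full cleaned data set. The function names run_lda, top_terms, and dominant_topic are hypothetical helpers, and Gibbs sampling is just one of the estimation methods that LDA() offers.

```r
# beta: topics x terms matrix of term probabilities, with terms as column names.
# Returns an n x topics character matrix; column j holds the top terms of topic j.
top_terms <- function(beta, n = 10) {
  apply(beta, 1, function(p) names(sort(p, decreasing = TRUE))[seq_len(n)])
}

# gamma: documents x topics matrix of topic proportions.
# Returns the index of the most probable topic for each document.
dominant_topic <- function(gamma) {
  max.col(gamma, ties.method = "first")
}

# Fit LDA with k topics, then collect the top terms per topic (for labelling)
# and the document-topic distribution.
run_lda <- function(dtm, k, seed = 7530) {
  model <- topicmodels::LDA(dtm, k = k, method = "Gibbs",
                            control = list(seed = seed))
  post <- topicmodels::posterior(model)
  list(
    model = model,
    top_terms = top_terms(post$terms),
    doc_topics = post$topics
  )
}

# Example usage (commented out because it needs the real data):
# fit <- run_lda(dtm, k = 20)          # k = 20 is a placeholder value
# fit$top_terms                        # inspect these to write topic labels
# head(fit$doc_topics)                 # one row per document, rows sum to 1
```

Topic labels still have to be written by hand after reading the top terms and a few representative articles. For 3.c, the rows of doc_topics can be combined with the source and year columns from question 1, after which base R's merge() can join them to an external table on a shared key such as year.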
