IB9CW0
Text Analytics
Term Two 2021/2022
WARWICK BUSINESS SCHOOL
[60%] Group Assignment
2000 words:
This is a strict limit, not a guideline: if a submission exceeds the limit, the excess will not
be marked. Code fragments and the comments embedded in them do not count towards the
word limit.
Overview and Pedagogical Goal
The goal of this assignment is to familiarize you with the complete process of extracting, refining and
delivering insights of particular business value derived from unstructured data. This is an individual
assignment where you will work alone in order to build a dataset to address a particular business
problem. The assignment maps to qualification level 7 and aims to establish your ability to develop
in-depth, original solutions to a domain-specific problem of high business value.
The task is structured in three (3) parts. The first part (Part A) covers your ability to construct and
demonstrate the handling of text data. It aims to familiarize you with the principles of text mining,
including normalizing and cleaning textual corpora, the bag-of-words model and the development of
metrics that can be used to analyze structural elements of text. The core of this assignment involves the
translation of these insights into actionable features that can be used to predict an outcome variable of
business interest. Therefore, the second and third parts (Part B and Part C) are concerned with the
identification of features, in particular (a) polarity, i.e. whether the text under consideration is
positive or negative; (b) sentiment, i.e. the extraction of affective states from the text; and (c) the
evaluation and extraction of the important topics covered in the corpus that you have constructed
(Part C).
The report should be written from the perspective of an analyst who applies text mining methods to
produce a professionally written piece of work. It should be both academic and practical, and should
consider possible application scenarios in which text mining can be used for prescriptive analytics.
Marking Criteria and Weights
The marking criteria for all parts of the assignment are as follows:
• Part A: 30% - Completeness of the solution, efficiency of the code, interpretation of the
results.
• Part B: 25% - Completeness of the solution, efficiency of the code, interpretation of the
results.
• Part C: 25% - Completeness of the solution, efficiency of the code, interpretation of the
results.
The remaining 20% is reserved for the overall academic content of the report, distributed equally among
the motivation for the construction of the textual corpus, the interpretation of the outcomes of the
analysis and the persuasiveness of the line of argument in presenting the results.
Feedback
Feedback will be provided in individual sessions upon request with points for further improvement.
Submission Instructions
The assignment solutions should be submitted as a single PDF document containing both the narrative
for each part and the code, in the form of a compiled R Markdown notebook in HTML. Students
should combine all files in a zip file with the following naming format:
student_number.pdf
No other files are going to be considered. It is your responsibility to comply with the requirements of
the submission, otherwise this will have repercussions for marking.
Part A: Construction of Corpus – Creating a dataset
For this part you are required to build a corpus of textual information that can be used to evaluate an
outcome variable. Possible sources include tweets, Wikipedia articles, news articles and academic
articles from the literature. Your goal is to associate the text with an outcome variable, and therefore
you should provide an overview of how you selected the business case to cover, the ideas behind the
construction of the dataset, and the normalization and text mining pipeline you used to assemble it.
After your dataset is assembled and you have completed all the relevant text normalization parts, you
are required to perform an analysis of word importance as well as construct the document term matrix
to calculate feature importance both at a unigram as well as n-gram level.
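The document term matrix step above can be sketched as follows. This is a minimal illustration only, written in Python with the standard library; the submission itself is expected as an R Markdown notebook, where a text mining package would normally do this work. The toy documents, the tiny stopword list and the function names are all assumptions made for the example, and TF-IDF is used here as one possible measure of word importance, not a required one.

```python
import math
import re
from collections import Counter

# Toy documents, invented purely for illustration (not a real corpus).
docs = [
    "Profit rose as strong sales lifted profit margins",
    "Weak sales and rising costs hurt profit",
    "Strong demand lifted sales in the quarter",
]

STOPWORDS = {"as", "and", "in", "the"}  # hypothetical, deliberately tiny list


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_term_matrix(docs, n=1):
    """Return (vocabulary, rows); rows[d][j] counts term j in document d."""
    # Stopwords are removed before n-grams are formed, so an n-gram can
    # span a position where a stopword used to be.
    counts = [Counter(ngrams(tokenize(d), n)) for d in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[term] for term in vocab] for c in counts]


def tfidf(rows):
    """Weight raw counts by log(N / document frequency)."""
    n_docs = len(rows)
    df = [sum(1 for row in rows if row[j] > 0) for j in range(len(rows[0]))]
    return [[row[j] * math.log(n_docs / df[j]) for j in range(len(row))] for row in rows]


uni_vocab, uni_dtm = document_term_matrix(docs, n=1)  # unigram level
bi_vocab, bi_dtm = document_term_matrix(docs, n=2)    # bigram level
uni_weights = tfidf(uni_dtm)
```

A term that occurs in every document receives a TF-IDF weight of zero, which is why this weighting is often preferred to raw counts when ranking terms by importance.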
To construct the dataset you can use sources available online; however, you are not allowed to
use pre-staged datasets such as Amazon or Airbnb reviews. Extra attention will be given to the
innovativeness of the dataset and the application of data staging skills in that regard.
Part B: Text features and Sentiment association with Target Outcomes
Using polarity and sentiment, you are asked to demonstrate how text-derived features connect with
the stock price. You can use different aspects of sentiment, such as affect categorization and
context-specific keywords. To achieve that you can use several different dictionaries, such
as Loughran-McDonald, AFINN, NRC, WordNet-Affect etc. In all cases, a regression model should
be used to evaluate the fit of the selected dictionary against the target outcomes. In addition, you are
expected to supplement your analysis with other text metrics that you can extract from your corpus, such
as readability, divergence, vocabulary metrics etc.
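The link between a dictionary-based score and a target outcome can be sketched as follows. This is only an illustration in standard-library Python (the submission itself is expected in R Markdown): the mini-lexicon, the texts and the outcome values are invented for the example, the polarity formula is one common choice among several, and a single-predictor least-squares fit stands in for whatever regression model you choose.

```python
import re

# Hypothetical mini-lexicon for illustration only; a real analysis would load
# a published dictionary such as Loughran-McDonald or NRC.
POSITIVE = {"gain", "growth", "strong", "profit"}
NEGATIVE = {"loss", "weak", "decline", "risk"}


def polarity(text):
    """(positive - negative) / (positive + negative) matches; 0.0 if none match."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / (pos + neg) if pos + neg else 0.0


def fit_ols(x, y):
    """Fit y = intercept + slope * x by least squares.

    Returns (intercept, slope, r_squared). Assumes x is not constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1 - ss_res / ss_tot


# Made-up texts and outcome values (e.g. next-day returns in %), not real data.
texts = [
    "Strong growth and record profit",
    "Weak demand and a decline in orders",
    "Profit up despite some risk",
    "Loss widens on weak sales",
]
outcome = [1.2, -0.8, 0.1, -1.1]

scores = [polarity(t) for t in texts]
intercept, slope, r_squared = fit_ols(scores, outcome)
```

Repeating the fit with scores from each candidate dictionary and comparing the resulting R-squared values is one simple way to evaluate which dictionary fits the target outcome best.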
Part C: Topic Modelling and Latent Dirichlet Allocation
Using a topic model, you are requested to provide an analysis of the topics that become dominant
using both an unsupervised and supervised approach. Your analysis should focus on finding which
topics become important from your dataset using a time dimension as well as how the metadata of
the text can influence the prevalence of the topics.
Your topic solution should evaluate, among others:
a. The number of topics (Kappa) that needs to be created for the particular corpus. You can opt
to use the coherence criterion to select Kappa.
b. The semantic coherence of the word-topic associations that are going to be created.
c. Any other issues that may arise from the selection of the optimal topic solution.
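The coherence criterion in item (a) can be illustrated with a short sketch. The brief does not say which coherence measure to use; the example below assumes the UMass measure, which scores a topic by how often its top words co-occur in the same documents. It is written in standard-library Python for illustration (the submission itself is expected in R Markdown), and the toy corpus and the candidate topic lists are invented; in practice the top words would come from topic models fitted with each candidate Kappa.

```python
import math
from itertools import combinations


def umass_coherence(top_words, tokenized_docs):
    """UMass coherence of one topic, given its top words in ranked order.

    Every word in top_words must occur in at least one document.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(all(w in doc for w in words) for doc in doc_sets)

    score = 0.0
    # combinations() preserves order, so `earlier` is ranked above `later`.
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_freq(earlier, later) + 1) / doc_freq(earlier))
    return score


def mean_coherence(topics, tokenized_docs):
    """Average UMass coherence over all topics of one candidate solution."""
    return sum(umass_coherence(t, tokenized_docs) for t in topics) / len(topics)


# Toy tokenized corpus, invented for illustration.
corpus = [
    ["price", "stock", "market"],
    ["stock", "market", "trade"],
    ["match", "team", "goal"],
    ["team", "goal", "coach"],
]

# Hypothetical top words per topic for two candidate values of Kappa.
candidates = {
    2: [["stock", "market"], ["team", "goal"]],
    3: [["stock", "market"], ["team", "goal"], ["price", "coach"]],
}

best_kappa = max(candidates, key=lambda k: mean_coherence(candidates[k], corpus))
```

Because the score sums over word pairs, candidate solutions should be compared using the same number of top words per topic. Higher (less negative) values indicate more coherent topics.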
All solutions need to be properly described and articulated, with the relevant fragments of
code provided. Your report should reflect the effort and the steps you took in your analysis and the
potential insights that can be generated by these textual features.
SUBMISSION DEADLINE: Mon 11th April 20:00 (UK time)
Word Count Policy and Formatting (found in your Masters Student Handbook Section 6.2c)
Guidelines for Online Submission (found in your Masters Student Handbook Section 6.2e)
The submission deadline is precise and uploading of the document must be completed before 20:00 (UK time)
on the submission date. Any document submitted even seconds later than 20:00 precisely will be penalised for
late submission in line with WBS policy. Please consult your student handbook on my.wbs for more detailed
information.
The online assignment submission system will only accept documents in Portable Document Format (PDF).
Please note that we will not accept PDF files of scanned documents. You should create your assignment in your
chosen package (for example, Word), then convert it straight to PDF before uploading. Please place your student
ID number, NOT YOUR NAME, on the front of your submission as all submissions are marked anonymously.