Cardiff School of Computer Science and Informatics
Coursework Assessment Pro-forma
Module Code : CMT309
Module Title : Computational Data Science
Lecturer : Dr. Oktay Karakus, Dr. Luis Espinosa-Anke
Assessment Title : CMT309 Data Analysis Portfolio
Assessment Number : 2
Date set : 03-02-2022
Submission date and time : 12-05-2022 at 9:30am
Return date : 09-06-2022
Extenuating Circumstances submission deadline will be 1 week after the submission date above.
Extenuating Circumstances marks and feedback return will be 1 week after the feedback return date above.
This assignment is the CMT309 Data Science Portfolio, which accounts for 70% of the total marks available for this module. If coursework is submitted late (and where there are no extenuating circumstances):
1.) If the assessment is submitted no later than 24 hours after the deadline, the mark for the
assessment will be capped at the minimum pass mark;
2.) If the assessment is submitted more than 24 hours after the deadline, a mark of 0 will be given
for the assessment.
Extensions to the coursework submission date can only be requested using the Extenuating Circumstances procedure. Only students with approved extenuating circumstances may use the extenuating circumstances submission deadline. Any coursework submitted after the initial submission deadline without approved extenuating circumstances will be treated as late.
More information on the extenuating circumstances procedure can be found on the Intranet: https://intranet.cardiff.ac.uk/students/study/exams-and-assessment/extenuating-circumstances
By submitting this assignment you are accepting the terms of the following declaration:
I hereby declare that my submission (or my contribution to it in the case of group submissions) is all my own work, that it has not previously been submitted for assessment and that I have not knowingly allowed it to be copied by another student. I understand that deceiving or attempting to deceive examiners by passing off the work of another writer as one's own is plagiarism. I also understand that plagiarising another's work or knowingly allowing another student to plagiarise from my work is against the University regulations and that doing so will result in loss of marks and possible disciplinary proceedings.1
Assessment
(1) You have to upload the files mentioned in the Submission Instructions section below.
(2) Failing to follow the required file names and file types (e.g. naming your file p1.py instead of P1.py) will incur a penalty of 10 points from your total mark.
(3) The coursework includes different datasets, which are downloaded automatically. Since these files are already with the markers, students do not need to submit them back.
(4) Changing the txt file names and developing your code with those changed file names will cause errors during marking, since the markers will use Python marking code developed with the original file names.
1 https://intranet.cardiff.ac.uk/students/study/exams-and-assessment/academic-integrity/cheating-and-academic-misconduct
(5) You can use any Python expression or package that was used in the lectures and practical sessions. Additional packages are not allowed unless instructed in the question. Failing to follow this rule might cause you to lose all marks for that specific part of the question(s).
(6) You are free to use any Python environment or version to develop your code. However, you should fill in and test your notebook in Google Colab, since the testing and marking process will be done via Google Colab.
(7) If any submitted code for any sub-question fails to run in Google Colab, that part of the code will be marked as 0 without testing the code in Jupyter or any other environment.
(8) You are not allowed to use the input() function to ask the user to enter values.
(9) If you are asked to develop a function, the name and input arguments of that function should be the same as instructed in the paper.
Learning Outcomes Assessed
• Carry out data analysis and statistical testing using code
• Critically analyse and discuss methods of data collection, management and storage
• Extract textual and numeric data from a range of sources, including online
• Reflect upon the legal, ethical and social issues relating to data science and its applications
Criteria for assessment
Credit will be awarded against the following criteria. Different criteria are applied to pandas code (using pandas outside of a function), function code, and figures obtained with matplotlib or seaborn. pandas code is judged exclusively by its functionality. Functions are judged by their functionality and, additionally, their quality will be assessed. Figures are judged by their quality and completeness. The tables below explain the criteria.
Functions (Functionality 80%, Quality 20%)
• Distinction (70-100%). Functionality: fully working application that demonstrates an excellent understanding of the assignment problem using a relevant Python approach. Quality: excellent documentation with use of docstrings and comments.
• Merit (60-69%). Functionality: all required functionality is met, and the application works properly with some minor errors. Quality: good documentation with some minor missing comments.
• Pass (50-59%). Functionality: some of the functionality is developed, with major errors and incorrect output. Quality: fair documentation.
• Fail (0-50%). Functionality: faulty application with wrong implementation and wrong output. Quality: no comments or documentation at all.
Pandas code (Functionality 100%)
• Distinction (70-100%): fully working application that demonstrates an excellent understanding of the assignment problem using a relevant Python approach.
• Merit (60-69%): all required functionality is met, and the application works properly with some minor errors.
• Pass (50-59%): some of the functionality is developed, with major errors and incorrect output.
• Fail (0-50%): faulty application with wrong implementation and wrong output.
Figures (Quality and completeness 100%)
• Distinction (70-100%): excellent figures with complete and informative data and formatting, labels, titles, and legends if appropriate.
• Merit (60-69%): good figures with good formatting, labels, titles, and legends.
• Pass (50-59%): acceptable figures with missing information, bad formatting, labels, titles, or legends.
• Fail (0-50%): faulty or missing figures.
Ethics (Quality 100%)
• Distinction (70-100%): in addition to the requirements for Merit, there is a scholarly approach, including references to external resources or types of biases not covered in class.
• Merit (60-69%): significant discussion is provided, with deep mapping between several or all sources of bias and the argumentation.
• Pass (50-59%): some discussion is provided, with shallow mapping between a few sources of bias and the discussion.
• Fail (0-50%): incomplete discussion; sources of bias not discussed or discussed with major mistakes.
Feedback and suggestions for future learning
Feedback on your coursework will address the above criteria. Feedback and marks will be returned
within 4 weeks of your submission date via Learning Central. In case you require further details, you
are welcome to schedule a one-to-one meeting.
Submission Instructions
Start by downloading P1.ipynb and P2.ipynb from Learning Central, then answer the following
questions. You can use any Python expression or package that was used in the lectures and practical
sessions. Additional packages are not allowed unless instructed in the question. You answer the
questions by filling in the appropriate sections in the Jupyter Notebook.
Your coursework should be submitted via Learning Central by the above deadline. You have to upload
the following files:
Description                Type                                             Name
Your solution to part 1    Compulsory; one Jupyter notebook (.ipynb) file   P1.ipynb
Your solution to part 2    Compulsory; one Jupyter notebook (.ipynb) file   P2.ipynb
Make sure to include your student number as a comment in all of the Python files! Any deviation
from the submission instructions (including the number and types of files submitted) may result in a
reduction of marks for the assessment or question part.
You can submit multiple times on Learning Central. ONLY files contained in the last attempt
will be marked, so make sure that you upload all files in the last attempt.
Staff reserve the right to invite students to a meeting to discuss the Coursework submissions.
Part 1 - Text Data and Ethics (45 marks)
This part covers the course content of weeks 1 to 4. Students are advised to complete this part by the end of Week 5, to better plan their time and leave enough preparation time for the second part of the assignment!
In this question you will write Python code for processing, analysing and understanding the social network Reddit (www.reddit.com). Reddit is a platform that allows users to upload posts and comment on them, and is divided into subreddits, often covering specific themes or areas of interest (for example, world news, ukpolitics or nintendo). You are provided with a subset of Reddit with posts from Covid-related subreddits (e.g., CoronavirusUK or NoNewNormal), as well as randomly selected subreddits (e.g., donaldtrump or razer).
The csv dataset you are provided with contains one row per post and has information about three entities: posts, users and subreddits. The column names are self-explanatory: columns starting with the prefix user_ describe users, those starting with the prefix subr_ describe subreddits, the subreddit column is the subreddit name, and the rest of the columns are post attributes (author, posted_at, title, the post text in the selftext column, the number of comments in num_comments, score, etc.).
In this exercise, you are asked to perform a number of operations to gain insights from the data.
P1.1 - Text data processing (20 marks)
P1.1.1 - Offensive authors per subreddit (5 marks)
As you will see, the dataset contains a lot of strings of the form [***]. These have been used to mask (or remove) swearwords to make the text less offensive. We are interested in finding, for each subreddit, the users that have posted at least one swearword in it. We do this by counting occurrences of the [***] string in the selftext column (we can assume that an occurrence of [***] equals a swearword in the original dataset).
What to implement: A function offensive_authors(df) that takes as input the original dataframe and returns a dataframe of the form below, where each row contains an author that posted at least one swearword in the corresponding subreddit.
subreddit author
0 40kLore Cross_Ange
1 40kLore DaRandomGitty2
2 40kLore EMB1981
3 40kLore Evoxrus_XV
4 40kLore Grtrshop
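For illustration only, a minimal sketch of one possible approach is given below (it assumes the mask appears literally as [***] in selftext and that missing selftext values can be treated as empty strings):

    def offensive_authors(df):
        # Keep only posts whose body contains the literal mask string [***]
        has_swearword = df['selftext'].fillna('').str.contains(r'\[\*\*\*\]', regex=True)
        pairs = df.loc[has_swearword, ['subreddit', 'author']]
        # One row per (subreddit, author) pair, sorted for readability
        return (pairs.drop_duplicates()
                     .sort_values(['subreddit', 'author'])
                     .reset_index(drop=True))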
P1.1.2 - Most common trigrams per subreddit (15 marks)
We are interested in learning about the ten most frequent trigrams (a trigram is a sequence of
three consecutive words) in each subreddit’s content. You must compute these trigrams on both
the selftext and title columns. Your task is to generate a Python dictionary of the form:
{subreddit1: [(trigram1, freq1), (trigram2, freq2), ... , (trigram10, freq10)],
subreddit2: [(trigram1, freq1), (trigram2, freq2), ... , (trigram10, freq10)],
...
subreddit63: [(trigram1, freq1), (trigram2, freq2), ... , (trigram10, freq10)]}
That is, for each subreddit, the 10 most frequent trigrams and their frequency, stored in a list of tuples. Each trigram will also be stored as a tuple containing 3 strings.
What to implement: A function get_tris(df, stopwords_list, punctuation_list) that will take
as input the original dataframe, a list of stopwords and a list of punctuation signs (e.g., ? or !), and will
return a Python dictionary with the above format. Your function must implement the following steps in
order:
• (1 mark) Create a new dataframe called newdf with only subreddit, title and selftext
columns.
• (1 mark) Add a new column to newdf called full_text, which will contain title and selftext
concatenated with the string ’.’ (a full stop) followed by a space. That is, A simple title and This is a text body would become A simple title. This is a text body.
• (1 mark) Remove all occurrences of the following strings from full_text. You must do this
without creating a new column:
– [***]
– &
– >
– https
• (1 mark) You must also remove all occurrences of at least three consecutive hyphens, for
example, you should remove strings like ’---’, ’----’, ’-----’, etc., but not ’--’ and not
’-’.
• (1 mark) Tokenize the contents of the full_text column after lower casing (removing all capitalization). You should use the word_tokenize function in nltk. Add the results to a new column called full_text_tokenized.
• (2 marks) Remove all tokens that are either stopwords or punctuation from full_text_tokenized and store the results in a new column called full_text_tokenized_clean. See Note 1.
• (2 marks) Create a new dataframe called adf (which will stand for aggregated dataframe),
which will have one row per subreddit (i.e., 63 rows), and will have two columns: subreddit
(the subreddit name), and all_words, which will be a big list with all the words that belong to
that subreddit, as extracted from the full_text_tokenized_clean column.
• (3 marks) Obtain trigram counts, which will be stored in a dictionary where each key will be a trigram (a tuple containing 3 consecutive tokens), and each value will be its overall frequency in that subreddit. You are encouraged to use functions from the nltk package, although
you can choose any approach to solve this part.
• (3 marks) Finally, use the information you have in adf for generating the desired dictionary, and
return it. See Note 2.
Note 1. You can obtain stopwords and punctuation as follows.
• Stopwords:
>>> from nltk.corpus import stopwords
>>> stopwords = stopwords.words('english')
• Punctuation:
>>> import string
>>> punctuation = list(string.punctuation)
Note 2. You do not have to apply an additional ordering when there are several trigrams with the
same frequency.
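For orientation, a condensed sketch of the steps above is given below. It is not the only valid solution: it assumes nltk's word_tokenize, ngrams and FreqDist (a one-off nltk.download('punkt') may be needed) and uses a single regular expression for the string removals.

    from nltk import word_tokenize, ngrams, FreqDist

    def get_tris(df, stopwords_list, punctuation_list):
        newdf = df[['subreddit', 'title', 'selftext']].copy()
        newdf['full_text'] = newdf['title'].fillna('') + '. ' + newdf['selftext'].fillna('')
        # Remove the listed strings and runs of 3 or more hyphens, in place
        newdf['full_text'] = newdf['full_text'].str.replace(r'\[\*\*\*\]|&|>|https|-{3,}', '', regex=True)
        newdf['full_text_tokenized'] = newdf['full_text'].str.lower().apply(word_tokenize)
        drop = set(stopwords_list) | set(punctuation_list)
        newdf['full_text_tokenized_clean'] = newdf['full_text_tokenized'].apply(
            lambda toks: [t for t in toks if t not in drop])
        # One row per subreddit, with all of its tokens collected in one list
        adf = (newdf.groupby('subreddit')['full_text_tokenized_clean']
                    .agg(lambda lists: [tok for toks in lists for tok in toks])
                    .reset_index()
                    .rename(columns={'full_text_tokenized_clean': 'all_words'}))
        result = {}
        for _, row in adf.iterrows():
            counts = FreqDist(ngrams(row['all_words'], 3))   # trigram tuple -> frequency
            result[row['subreddit']] = counts.most_common(10)
        return result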
P1.2 - Answering questions with pandas (15 marks)
In this question, your task is to use pandas to answer questions about the data.
P1.2.1 - Authors that post highly commented posts (3 marks)
Find the top 1000 most commented posts. Then, obtain the names of the authors that have at least
3 posts among these posts.
What to implement: Implement a function find_popular_authors(df) that takes as input the original dataframe and returns a list of strings, where each string is the name of an author that satisfies the above criteria.
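A minimal sketch of one way to do this, using the num_comments and author columns mentioned above:

    def find_popular_authors(df):
        # 1000 most commented posts, then authors with at least 3 posts among them
        top = df.nlargest(1000, 'num_comments')
        counts = top['author'].value_counts()
        return counts[counts >= 3].index.tolist()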
P1.2.2 - Distribution of posts per weekday (5 marks)
Find the percentage of posts that were posted on each weekday (Monday, Tuesday, etc.). You can
use an external calendar or you can use any functionality for dealing with dates available in pandas.
What to implement: A function get_weekday_post_distribution(df) that takes as input the origi-
nal dataframe and returns a dictionary of the form (the values are made up):
{’Monday’: ’14%’,
’Tuesday’: ’23%’,
... }
Note that you must report the percentages with two decimal places only, and you must include the percentage sign in the output dictionary. The order of the keys in the returned dictionary does not matter.
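One possible sketch, assuming posted_at can be parsed by pandas' datetime functionality:

    import pandas as pd

    def get_weekday_post_distribution(df):
        # Weekday name of each post, then the percentage share per weekday
        days = pd.to_datetime(df['posted_at']).dt.day_name()
        pct = days.value_counts(normalize=True) * 100
        return {day: f'{value:.2f}%' for day, value in pct.items()}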
P1.2.3 - The 100 most passionate redditors (7 marks)
We would like to know which 100 redditors (author column) are the most passionate. We will measure this by checking, for each redditor, the ratio at which they use adjectives. This ratio will be computed by dividing the number of adjectives by the total number of words each redditor used. The analysis will only consider redditors that have written more than 1000 words.
What to implement: A function called get_passionate_redditors(df) that takes as input the original dataframe and returns a list of the top 100 redditors (authors) by the ratio at which they use adjectives, considering both the title and selftext columns. The returned list should be a list of tuples, where each tuple has two elements: the redditor (author) name and the ratio of adjectives they used. The returned list should be sorted by adjective ratio in descending order (highest first). Only redditors that wrote more than 1000 words should be considered. You should use nltk's word_tokenize and pos_tag functions to tokenize and find adjectives. You do not need to do any preprocessing such as stopword removal, lemmatization or stemming.
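A sketch of one possible approach is given below. It counts words as nltk tokens and treats the tags JJ, JJR and JJS as adjectives (a one-off download of the nltk tokenizer and tagger models may be needed):

    from nltk import word_tokenize, pos_tag

    def get_passionate_redditors(df):
        # Concatenate title and selftext, then collect all text per author
        text = df['title'].fillna('') + ' ' + df['selftext'].fillna('')
        per_author = text.groupby(df['author']).agg(lambda s: ' '.join(s))
        ratios = []
        for author, joined in per_author.items():
            tokens = word_tokenize(joined)
            if len(tokens) <= 1000:          # only redditors with more than 1000 words
                continue
            n_adj = sum(1 for _, tag in pos_tag(tokens) if tag.startswith('JJ'))
            ratios.append((author, n_adj / len(tokens)))
        ratios.sort(key=lambda pair: pair[1], reverse=True)
        return ratios[:100]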
P1.3 Ethics (10 marks)
Imagine you are the head of a data mining company that needs to use the insights gained in this
assignment to scan social media for covid-related content, and automatically flag it as conspiracy
or not conspiracy (for example, for hiding potentially harmful tweets or Facebook posts). Some
information about the project and the team:
• Your client is a political party concerned about misinformation.
• The project requires mining Facebook, Reddit and Instagram data.
• The team consists of Joe, an American mathematician who just finished college; Fei, a senior
software engineer from China; and Francisco, a data scientist from Spain.
Reflect on the impact of exploiting data science for such an application. You should map your discussion to one of the five actions outlined in the UK's Data Ethics Framework.
Your answer should address the following:
• Identify the action in which your project is the weakest.
• Then, justify your choice by critically analyzing the three key principles for that action outlined
in the Framework, namely transparency, accountability and fairness.
• Finally, you should propose one solution that explicitly addresses one point related to one of
these three principles, reflecting on how your solution would improve the data cycle in this
particular use case.
Your answer should be between 500 and 700 words. You are strongly encouraged to follow a
scholarly approach, e.g., with references to peer reviewed publications. References do not
count towards the word limit.
Part 2 - Numerical Data (55 marks)
This question has been created to test your statistical analysis and programming knowledge in
Python.
You are given a csv file, which includes various data entries for UFC fights held between 1994 and 2021. Each row presents several statistics for a fighter in a specific fight (Match_ID), and the result of the fight is stored in the Winner column. Descriptions of some other important columns are automatically downloaded in a .txt file.
In this exercise, you are asked to perform a number of operations to (1) perform statistical analysis of
the data, and (2) gain insights from the data.
P2.1 - Probability and Visualisation using pandas (15 marks)
In this question, your task is to use pandas and other required modules to query and analyse the data
frame, df.
P2.1.1 - The Tall, Young and Winner (3 marks)
Find the probability of a fighter winning the fight while (1) being younger than 25 years of age and (2) being taller than 180 cm. Write a one-liner to solve this.
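One possible reading of this question, as the share of all rows satisfying the three conditions, is sketched below. The age column name ('Age') is a guess, so check the downloaded column description file, and Winner == 1 is assumed to mark a win:

    # Hypothetical column names: adjust 'Age' to whatever the dataset actually uses
    p = ((df['Winner'] == 1) & (df['Age'] < 25) & (df['Height_cms'] > 180)).mean()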
P2.1.2 - The most durable fighters (3 marks)
Find the most durable fighters among all the fighters in the dataframe. You need to filter df for the 10 most durable fighters: those who win fights in which they received more than 100 significant strikes landed but did not have any knockdowns (KD).
P2.1.3 - KO-Machines (3 marks)
Find the probability of fighters who (at the beginning of each fight!) have at least 2 KO/TKOs and whose win-by-KO/TKO ratio is higher than 0.75.
P2.1.4 - Ideal Body & Strategy (6 marks)
Create a figure with two subplots, using a filtered version of the dataframe df that includes only the rows of WINNERS.
Subplot 1: Create a 2D histogram via hexagonal bins for the 'Height_cms' and 'Weight_lbs' columns of the data frame. Colour corresponds to the number of winners for each height-weight pair.
Subplot 2: Create a heatmap figure plotting the correlations between the columns
['Avg_KD', 'Avg_REV', 'Avg_SIG_STR_landed', 'Avg_TOTAL_STR_landed', 'Avg_TD_landed',
'Avg_HEAD_landed', 'Avg_BODY_landed', 'Avg_LEG_landed', 'Avg_DISTANCE_landed',
'Avg_CLINCH_landed', 'Avg_GROUND_landed', 'Avg_CTRL_time']
In order to solve this question, you need to use either pandas or matplotlib visualisation commands. You cannot use seaborn, plotly or other libraries for this question.
You do not have to replicate the figures given below, but axes labels, titles and other visualisation details should be there.
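A matplotlib-only sketch of such a figure is shown below; it assumes Winner == 1 marks the winning rows, and the formatting choices (grid size, colour maps) are arbitrary:

    import matplotlib.pyplot as plt

    winners = df[df['Winner'] == 1]
    cols = ['Avg_KD', 'Avg_REV', 'Avg_SIG_STR_landed', 'Avg_TOTAL_STR_landed',
            'Avg_TD_landed', 'Avg_HEAD_landed', 'Avg_BODY_landed', 'Avg_LEG_landed',
            'Avg_DISTANCE_landed', 'Avg_CLINCH_landed', 'Avg_GROUND_landed', 'Avg_CTRL_time']

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

    # Subplot 1: hexagonal-bin 2D histogram of height vs weight
    hb = ax1.hexbin(winners['Height_cms'], winners['Weight_lbs'], gridsize=25, cmap='viridis')
    fig.colorbar(hb, ax=ax1, label='Number of winners')
    ax1.set_xlabel('Height (cm)')
    ax1.set_ylabel('Weight (lbs)')
    ax1.set_title('Height vs weight of winners')

    # Subplot 2: correlation heatmap of the selected columns
    corr = winners[cols].corr()
    im = ax2.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
    ax2.set_xticks(range(len(cols)))
    ax2.set_yticks(range(len(cols)))
    ax2.set_xticklabels(cols, rotation=90)
    ax2.set_yticklabels(cols)
    fig.colorbar(im, ax=ax2, label='Correlation')
    ax2.set_title('Correlation between winner statistics')

    plt.tight_layout()
    plt.show()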
P2.2 - Variable selection for Regression Analysis (9 marks)
In variable selection ('variable' means the same as 'predictor'), variables are iteratively added to or removed from the regression model. Once finished, the model typically contains only a subset of the original variables. This makes the model easier to interpret and, in some cases, helps it generalise better to new data.
Figure 1: Top: Example of P2.1.4. Bottom-left: Example of P2.4.1. Bottom-right: Example of
P2.4.2.
To perform variable selection, create a function select_variable(df, main_pred, main_target, alpha), where
• main_pred is a list of variables that includes all columns of the data frame except 'Fighter', 'Referee', 'Date' and 'Match_ID'.
• main_target is the target variable for the regression, 'Winner'.
• alpha is the significance level for selecting significant predictors.
The function should return
• main_pred, a list which stores the selected subset of the initial main_pred.
To calculate regression fits and p-values you will use statsmodels. The general procedure follows
two stages:
• Stage 1 (adding predictors): you build a model by adding variables one after the other. You keep adding variables that increase the adjusted R2 value (provided by the statsmodels package).
– Start with an empty set of variables.
– Fit multiple one-variable regression models. In each iteration, use one of the variables provided in main_pred. The variable that leads to the largest increase in adjusted R2 is added to the model.
– Now proceed by adding a second variable into the model. Starting from the remaining
variables, again choose the variable that leads to the largest increase in adjusted R2.
– Continue in the same way for the third, fourth, . . . variable.
– You are finished when there is no variable left that increases adjusted R2.
• Stage 2 (removing non-significant predictors): if any of the utilised predictors are not significant, you need to remove them. Keep removing variables until all variables in the model are significant.
– Start by fitting a model using the variables that have been added to the model in Stage 1.
– If there is a variable that is not significant, remove the variable with the largest p-value and
fit the model again with the reduced set of variables.
– Keep removing variables and re-fitting the model until all remaining variables are significant.
– The remaining significant variables are the output of your function.
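The two stages can be sketched as follows. This sketch uses ordinary least squares for both stages even though the target is binary; adapt the model family to whatever your solution actually requires.

    import statsmodels.api as sm

    def select_variable(df, main_pred, main_target, alpha):
        y = df[main_target]
        selected, remaining = [], list(main_pred)
        best_adj_r2 = 0.0   # adjusted R^2 of the empty model

        # Stage 1: keep adding the predictor that most increases adjusted R^2
        while remaining:
            scores = []
            for var in remaining:
                fit = sm.OLS(y, sm.add_constant(df[selected + [var]])).fit()
                scores.append((fit.rsquared_adj, var))
            top_score, top_var = max(scores)
            if top_score > best_adj_r2:
                best_adj_r2 = top_score
                selected.append(top_var)
                remaining.remove(top_var)
            else:
                break

        # Stage 2: keep removing the least significant predictor until all are significant
        while selected:
            pvals = sm.OLS(y, sm.add_constant(df[selected])).fit().pvalues.drop('const')
            if pvals.max() > alpha:
                selected.remove(pvals.idxmax())
            else:
                break

        return selected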
P2.3 - Regression Analysis (15 marks)
In this part of the statistical analysis, you are asked to develop various regression models for predicting the winning probability of a fighter using the significant predictors found in P2.2.
You are asked to write a function regression_models_UFC(df, main_pred, main_target) that takes the data frame df, significant predictors main_pred and target main_target as its arguments, and
• splits the data into training and test samples with a 1:1 ratio.
• fits Linear, Logistic, Probit and Bayesian regression models using the training samples, and then predicts winning probabilities using the test samples.
regression_models_UFC() returns a single object results, which is a tuple of tuples whose elements are:
• the statsmodels and pymc3 model objects for each regression model (lin_reg, logit_reg, probit_reg, bayes_reg)
• the predicted probabilities for each model (y_lin, y_logit, y_probit, y_bayes)
• the split training and test samples (x_train, x_test, y_train, y_test)
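A partial sketch of the function is shown below. The statsmodels models are fitted directly; the pymc3 Bayesian model is left as a placeholder because its exact specification (priors, sampler settings) is up to you.

    import statsmodels.api as sm
    from sklearn.model_selection import train_test_split

    def regression_models_UFC(df, main_pred, main_target):
        X, y = df[main_pred], df[main_target]
        # 1:1 split into training and test samples
        x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

        xtr, xte = sm.add_constant(x_train), sm.add_constant(x_test)

        lin_reg = sm.OLS(y_train, xtr).fit()
        logit_reg = sm.Logit(y_train, xtr).fit()
        probit_reg = sm.Probit(y_train, xtr).fit()
        bayes_reg = None   # a pymc3 Bayesian regression would be fitted here

        y_lin = lin_reg.predict(xte)
        y_logit = logit_reg.predict(xte)
        y_probit = probit_reg.predict(xte)
        y_bayes = None     # predictions from the pymc3 model would go here

        return ((lin_reg, logit_reg, probit_reg, bayes_reg),
                (y_lin, y_logit, y_probit, y_bayes),
                (x_train, x_test, y_train, y_test))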
P2.4 – Data Analytics, performance and visualisation (16 marks)
P2.4.1 - In-Fight Winning Analysis (5 marks)
Assume you are the data analyst of a UFC fighter. You have developed a Logistic regression model above and are going to use it to make an in-fight analysis.
During round 4, you are creating a function in_fight_analysis(results). This function is going to take as input the output of the regression_models_UFC() function, and will give you some data analysis insights for the last round of the fight. The function in_fight_analysis(results) will
• randomly select a fighter from the test data. (Hint: select a row, not a fighter!) Assume that this is your fighter!
• analyse two parameters: 'Avg_HEAD_landed' and 'Avg_opp_CTRL_time'. (Hint: these two parameters are two of the significant predictors. If your select_variable() function does not return these two, you are doing something wrong.)
• create a seaborn heatmap figure that depicts how changes in the two parameters mentioned above affect the winning probability of your fighter.
For both of the variables, your arrays will start from your fighter's existing values, and you are going to check changes up to two times those values.
Winning probabilities will be predicted using the Logistic regression model developed in P2.3.
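A sketch that follows the results layout from the P2.3 sketch above is given below; the number of grid points (10 per axis) is an arbitrary choice.

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    def in_fight_analysis(results):
        (lin_reg, logit_reg, probit_reg, bayes_reg), _, (x_train, x_test, y_train, y_test) = results
        fighter = x_test.sample(1).iloc[0]          # one random test row: "your" fighter

        # Vary the two predictors from their current values up to twice those values
        head_vals = np.linspace(fighter['Avg_HEAD_landed'], 2 * fighter['Avg_HEAD_landed'], 10)
        ctrl_vals = np.linspace(fighter['Avg_opp_CTRL_time'], 2 * fighter['Avg_opp_CTRL_time'], 10)

        probs = np.zeros((len(ctrl_vals), len(head_vals)))
        for i, ctrl in enumerate(ctrl_vals):
            for j, head in enumerate(head_vals):
                row = fighter.copy()
                row['Avg_HEAD_landed'], row['Avg_opp_CTRL_time'] = head, ctrl
                X = sm.add_constant(pd.DataFrame([row]), has_constant='add')
                probs[i, j] = logit_reg.predict(X)[0]

        sns.heatmap(probs, xticklabels=np.round(head_vals, 1), yticklabels=np.round(ctrl_vals, 1),
                    cmap='viridis', cbar_kws={'label': 'Winning probability'})
        plt.xlabel('Avg_HEAD_landed')
        plt.ylabel('Avg_opp_CTRL_time')
        plt.title('Predicted winning probability for the selected fighter')
        plt.show()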
P2.4.2 - Height-Reach Analysis (5 Marks)
Write a function height_reach_analysis(df, results) in order to analyse the effects of height and reach differences between fighters. You are asked to:
• take as input the dataframe df and the regression modelling output object results.
• calculate height and reach differences for each specific fight.
• A unique 'Match_ID' corresponds to two different rows in df, i.e. one for the winner and one for the loser.
• For each pair, you need to find the differences between the columns and create two new columns with these values: 'dHeight' and 'dReach'.
• If the fighter's values are higher, the specific difference values will be positive; otherwise they will be negative.
• An example:
Fighter Winner Height_cms Reach_cms dHeight dReach Match_ID
3132 Ray Borg 1 162.56 160.02 -2.54 -10.16 975
8947 Jussier Formiga 0 165.10 170.18 2.54 10.16 975
• filter df for the test data and add a new column 'WinProb', taken from the input argument results and belonging to the Probit regression.
• plot a scatter plot where the 'dHeight' and 'dReach' columns correspond to the axes and 'WinProb' provides the colour of the points. You must use the plotly.express module for this question.
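A sketch is given below; it again assumes the results layout from the P2.3 sketch and that the train/test split preserves the original dataframe index.

    import numpy as np
    import plotly.express as px

    def height_reach_analysis(df, results):
        df = df.copy()
        # Opponent's height/reach within each Match_ID (two rows per match), then the differences
        opp = df.groupby('Match_ID')[['Height_cms', 'Reach_cms']].transform('sum') \
              - df[['Height_cms', 'Reach_cms']]
        df['dHeight'] = df['Height_cms'] - opp['Height_cms']
        df['dReach'] = df['Reach_cms'] - opp['Reach_cms']

        # Keep only the test rows and attach the Probit winning probabilities
        _, (y_lin, y_logit, y_probit, y_bayes), (x_train, x_test, y_train, y_test) = results
        test = df.loc[x_test.index].copy()
        test['WinProb'] = np.asarray(y_probit)

        fig = px.scatter(test, x='dHeight', y='dReach', color='WinProb',
                         title='Height and reach differences vs predicted winning probability')
        fig.show()
        return test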
P2.4.3 - Prediction Performance (6 marks)
You will now need to visualise the prediction performance of the models, and evaluate them in terms
of prediction accuracy (Acc%), mean square error (MSE) and area under curve (AUC) metrics. For
this purpose, create a function prediction_perf(gt, model_predictions) which evaluates the prediction performance of the reference models. Up to this point, you should have obtained
• predictions from each model, stored in model_predictions.
• The ground-truth values from data frame df, stored in gt.
Assume predicted values for a given model are stored in a variable P. The first performance measure will be the MSE, and will be calculated for each model from the expression below:

\mathrm{MSE} = \frac{1}{N} \sum_{i=0}^{N-1} \left( P_i - \mathrm{Winner}_i \right)^2
In order to obtain the prediction accuracy for each model, you need to use the sklearn module and its accuracy_score() function. Similarly, by using the sklearn methods roc_curve() and auc(), find the ROC curve parameters and the AUC metric for each prediction model.
In order to present the performance analysis results neatly, you then need to create a new pandas dataframe df_results which will be in the form of
Model Acc% MSE AUC
0 Linear 77.00 0.1260 0.911
1 Logistic 81.00 0.1086 0.911
2 Probit 76.00 0.1490 0.884
3 Bayesian 77.00 0.1389 0.899
Consequently, the prediction_perf() function should print and return the data frame df_results.
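A sketch of the evaluation is shown below. It assumes model_predictions is a dictionary mapping model names to predicted probabilities (adapt this to however you actually store them) and thresholds probabilities at 0.5 for the accuracy calculation.

    import numpy as np
    import pandas as pd
    from sklearn.metrics import accuracy_score, roc_curve, auc

    def prediction_perf(gt, model_predictions):
        gt = np.asarray(gt)
        rows = []
        for name, pred in model_predictions.items():
            pred = np.asarray(pred)
            mse = np.mean((pred - gt) ** 2)                          # MSE as defined above
            acc = accuracy_score(gt, (pred >= 0.5).astype(int)) * 100
            fpr, tpr, _ = roc_curve(gt, pred)
            rows.append({'Model': name, 'Acc%': acc, 'MSE': mse, 'AUC': auc(fpr, tpr)})
        df_results = pd.DataFrame(rows, columns=['Model', 'Acc%', 'MSE', 'AUC'])
        print(df_results)
        return df_results

For example, prediction_perf(y_test, {'Linear': y_lin, 'Logistic': y_logit, 'Probit': y_probit, 'Bayesian': y_bayes}) would produce a table of the shape shown above (with your own numbers).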
Support for assessment
Questions about the assessment can be asked on https://stackoverflow.com/c/comsc/ and tagged
with #CMT309, or during the online session which will be held in Week 2.