Faculty of Information Technology
Semester 1, 2023
FIT5145: Introduction to Data Science
Assignment 3: Description
Sunday, Week 14 (June 11, 2023) 11:55 PM
Assessment Details:
● Assessment Type: Individual Assignment
● Total marks: 40%
● Due Date: Sunday, Week 14 (June 11, 2023) 11:55 PM. Please notice that we do not
accept submissions after June 18, 2023 (i.e., 7 days after the due date).
Hand in Requirements:
In this assignment, two files (PDF report and RMD file) should be submitted.
1. A report in PDF containing the (a) code, (b) answers, and (c) explanations used to
answer each question. Please make sure that your answers to all the questions are
numbered correspondingly.
(a) code: Make sure to include all the shell commands for Task A and the R code
for Tasks B-D in the PDF report. For the shell commands, please copy your code
and paste it into Word or other word processing software (please do NOT take
screenshots of your code).
For the R code, there are two ways to include it in the PDF. If you use Word or
other word processing software to format your submission, please copy your code
from the RMD file and paste it into Word (again, please do NOT take screenshots
of your code, which may result in the deduction of marks allocated for code).
Alternatively, you can directly convert the RMD file including your code into
the PDF file.
(b) answers: Please make sure to include screenshots/images of the code outputs
and written answers for each question of Tasks A-D in the PDF.
(c) explanations: Please explain how you answered each question (i.e., explain
your code or summarise your work for each question).
Marks will be assigned to reports based on their correctness and clarity. For example,
higher marks will be given to reports containing graphs with appropriately labelled
axes.
2. The RMarkdown file: Please submit the RMarkdown file that contains your R code
for Tasks B-D of this assignment. Your file should contain all the code, proper
comments, and instructions for any libraries that need to be installed.
Notes:
1. Whenever a question asks for a certain value, your code should produce the value. For
example, when a question asks for the number of rows contained in a table, your code
should print out the answer. Manually extracting the answer will not earn any
marks.
2. Assignment should be submitted in two files (PDF report and RMD file):
(a) Failing to submit one of the two files will result in losing 20% of the total mark
of this assignment
(b) An RMD file that generates errors when running will not be considered
3. Please do NOT zip your submission files. Zip file submission will have a penalty of
20% of the total mark of the assignment.
4. Please make sure that text in your PDF can be selected and highlighted, so that the
Turnitin score can be generated properly (we only need the Turnitin score for the
PDF file, not the RMD file).
5. Generative AI Use: In this assessment, you must not use generative artificial
intelligence (AI) to generate any materials or content in relation to the assessment
task.
Task A: Shell commands
In this task, you are required to explore and wrangle the data in the file
“consumer_complaints.csv”, provided by the Consumer Financial Protection Bureau. The file
contains consumers’ complaints about financial products and services to companies, and the
companies’ responses to those complaints. In the file, there are different variables describing
each complaint about the products and services, as listed below. Please refer to this link
to get more information about the data.
Column Name: Description
● Date_received: The date the complaint was received
● Product: The type of product the consumer identified in the complaint
● Sub-product: The type of sub-product the consumer identified in the complaint
● Issue: The issue the consumer identified in the complaint
● Sub-issue: The sub-issue the consumer identified in the complaint
● Consumer_complaint_narrative: The consumer-submitted description of "what happened", taken from the complaint
● Company_public_response: The company's optional public response, selected from a pre-set list of options and posted on the public database
● Company: The company the complaint is about
● State: The state of the mailing address provided by the consumer
● ZIP_code: The mailing ZIP code provided by the consumer
● Tags: Data that supports easier searching and sorting of complaints submitted by or on behalf of consumers
● Consumer_consent_provided?: Identifies whether the consumer opted in to publish their complaint narrative
● Submitted_via: How the complaint was submitted
● Complaint_ID: The unique identification number for a complaint
Please note that you are only allowed to use shell commands for Task A, as you would run
in Linux shell, Mac terminal, or Cygwin, to tackle this task. Using other utilities or tools such
as PowerShell is NOT allowed.
1. The first step is to explore the contents of the file. Display the first 5 data records and
the last 5 data records of the data file. When displaying the data records, include the
header. Please note that the header is not counted as a data record.
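A sketch of the kind of commands expected, using a tiny hypothetical sample file in place of the real "consumer_complaints.csv" (file contents below are made up for illustration):

```shell
# Hypothetical sample file standing in for consumer_complaints.csv
printf 'Complaint_ID,Product\n1,Mortgage\n2,Student loan\n3,Credit card\n4,Mortgage\n5,Debt collection\n6,Payday loan\n7,Mortgage\n' > sample.csv

# First 5 data records plus the header: 6 lines in total
head -n 6 sample.csv

# Last 5 data records, with the header printed first
head -n 1 sample.csv
tail -n 5 sample.csv
```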
2. Display the size of the file and number of lines inside the file. The size should be
displayed in MB.
3. Display the number of variables (i.e., the columns) in the data and display the column
names.
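One possible approach, sketched on a hypothetical header (note this simple field count assumes no quoted commas inside column names):

```shell
# Hypothetical sample header standing in for the real file
printf 'Date_received,Product,Issue,State,Complaint_ID\n2015-03-01,Mortgage,Late fees,FL,1\n' > sample.csv

# Number of variables = number of comma-separated fields in the header
head -n 1 sample.csv | awk -F',' '{print NF}'

# Column names, one per line
head -n 1 sample.csv | tr ',' '\n'
```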
4. What is the date range of the collected data records? Please note that the file is not
guaranteed to be sorted and Nulls (NA and empty values) should not be considered.
5. How many times was “Student loan” mentioned in the column
Consumer_complaint_narrative? When was the term “Student loan” first mentioned
in the column Consumer_complaint_narrative? Please note that the term to be
searched is not case sensitive.
6. How many unique products are there in the Product column? Display the top 5 most
frequent product values in the dataset (i.e., the top 5 products with the largest number
of complaints). Please also display the 5 least frequent product values in the dataset.
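The classic sort/uniq pattern applies here; a sketch on a hypothetical sample where the product sits in field 2 (in the real file you would first locate the correct column):

```shell
# Hypothetical sample file with a Product column in field 2
printf 'ID,Product\n1,Mortgage\n2,Mortgage\n3,Credit card\n4,Student loan\n5,Mortgage\n6,Credit card\n' > sample.csv

# Number of unique products (tail -n +2 skips the header)
tail -n +2 sample.csv | cut -d',' -f2 | sort -u | wc -l

# Most frequent products first (take the top 5 with head)
tail -n +2 sample.csv | cut -d',' -f2 | sort | uniq -c | sort -rn | head -n 5

# Least frequent products first
tail -n +2 sample.csv | cut -d',' -f2 | sort | uniq -c | sort -n | head -n 5
```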
7. How many times does each of the two words, “fraud” and “account”, appear in the
column Consumer_complaint_narrative of the dataset (ignoring case)? If a term
appears twice in a narrative, it should be counted twice.
How many narratives include both of these two terms (ignoring case)?
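A sketch of the counting idea with grep on a hypothetical sample (for the real task you would first restrict the search to the narrative column, e.g. with cut or awk, rather than grepping whole lines):

```shell
# Hypothetical file; field 3 is the narrative
printf 'ID,State,Narrative\n1,FL,Account fraud on my account\n2,CA,Billing issue\n3,TX,Fraud alert\n' > sample.csv

# Total occurrences of "fraud" (case-insensitive); -o prints each match
# on its own line, so repeats within one narrative are counted
tail -n +2 sample.csv | grep -io 'fraud' | wc -l

# Total occurrences of "account"
tail -n +2 sample.csv | grep -io 'account' | wc -l

# Number of narratives containing both terms
tail -n +2 sample.csv | grep -i 'fraud' | grep -i 'account' | wc -l
```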
8. Which state has the highest number of complaints that include the term “account” in
the column Issue?
9. In the following, let’s focus on certain columns contained in the file. Please only keep
columns that are in the following list:
○ Date_received
○ Product
○ Consumer_complaint_narrative
○ State
○ Submitted_via
In particular, only keep the complaints satisfying all of the following: (i) the
Date_received value is after 2015; (ii) the Product value is “Mortgage”; (iii) the
State value is “FL”; and (iv) the Submitted_via value is “Web”. How many
consumer complaint narratives are left? Export the selected data to a new file named
“filtered_narratives.csv” and display its first 5 data records (i.e., a header + 5 data
records). Please make sure that the csv file contains the filtered data and the
requested column names.
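One way to sketch the filter-and-export step with awk, on a hypothetical mini file already reduced to the five kept columns (the ISO yyyy-mm-dd dates below are an assumption; the real file's date format may differ, in which case the comparison needs adjusting):

```shell
# Hypothetical mini file with the five kept columns and ISO-format dates
printf 'Date_received,Product,Consumer_complaint_narrative,State,Submitted_via\n' > kept.csv
printf '2016-02-10,Mortgage,Late fee dispute,FL,Web\n' >> kept.csv
printf '2014-05-01,Mortgage,Old complaint,FL,Web\n' >> kept.csv
printf '2016-03-15,Student loan,Servicer issue,FL,Web\n' >> kept.csv

# Keep the header (NR==1) plus rows after 2015 matching the other conditions;
# plain string comparison works because the dates are in ISO format
awk -F',' 'NR==1 || ($1 > "2015-12-31" && $2 == "Mortgage" && $4 == "FL" && $5 == "Web")' kept.csv > filtered_narratives.csv

# Number of remaining narratives (excluding the header)
tail -n +2 filtered_narratives.csv | wc -l
```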
Task B: Data Collection and Exploratory Data Analysis Using R
There are many ways to collect data from different sources; one of them is web scraping. In
this task, you are required to scrape data from two different websites, wrangle the scraped
data if required, and visualise it. Please see the detailed instructions below:
1. Scrape data contained in table format from two different websites. The data scraped
from the two websites must come from different business domains.
2. Wrangle the data if required.
3. Create a plot for each website's data. The two plots must be different chart types
(e.g., a bar chart and a line chart).
4. Discuss the two charts and the information or insights that can be drawn from them.
Note: Please refer to Week 3 lab activity material for web scraping.
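As a starting point, a minimal sketch of table scraping with rvest might look like the following; the URL, table index, and column names (Name, Value) are placeholders, not a prescribed solution:

```r
# Sketch only: the URL, table index, and column names below are hypothetical
library(rvest)
library(ggplot2)

page   <- read_html("https://example.com/some-page-with-a-table")
tables <- html_table(page)   # extracts all HTML <table> elements as data frames
df     <- tables[[1]]        # pick the table of interest

# Example wrangling step: strip commas from a numeric column stored as text
df$Value <- as.numeric(gsub(",", "", df$Value))

# One possible chart type: a bar chart
ggplot(df, aes(x = Name, y = Value)) + geom_col()
```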
Task C: Exploratory Data Analysis Using R
Are you interested in buying a property in Melbourne? Have you realised that rents and
home prices have seen significant increases over the past year? In this task, you are required
to perform exploratory data analysis on the data in the file
“property_transaction_victoria.csv”, which contains most of the property transactions that
took place in Greater Melbourne between 2010 and 2023. The data was collected from one of
the top real estate websites. The file contains different variables to describe each collected
transaction record, as described below.
Column Name: Description
● id: The unique ID of a transaction record, which usually consists of 9 digits
● badge: Whether a property is for rent, for sale, or already sold
● url: URL of the property listing
● suburb: The suburb where the property is located
● state: The state where the property is located
● postcode: The postcode of the property
● short_address: Short address of the property
● full_address: Full address of the property
● property_type: Whether the property is a House, Townhouse, etc.
● price: The price for which the property was sold
● bedrooms: The number of bedrooms the property has
● bathrooms: The number of bathrooms the property has
● parking_spaces: The number of parking spaces the property has
● building_size: The building size of the property
● building_size_unit: The unit of the building size (square metres)
● land_size: The land area of the property
● land_size_unit: The unit of the land area (square metres)
● listing_company_id: ID of the real estate agency that managed the transaction
● listing_company_name: Name of the real estate agency
● listing_company_phone: Phone number of the real estate agency
● auction_date: Date of the property auction
● available_date: Date from which a buyer can move into the property
● sold_date: The date on which the transaction was made
● description: A textual description that the real estate agent used to describe the property and attract potential buyers before the transaction was made
● images: URLs of the property images
● images_floorplans: URLs of the property floor plan images
● listers: List of the real estate agent information
● inspections: Inspection dates for the property
Please note that you are only allowed to use R (e.g., in RStudio) for Task C.
1. Read the dataset in the file “property_transaction_victoria.csv” and display the
dimensions of the dataset with a proper output message. Please make sure that the
whole dataset only contains Victorian property transactions.
2. Some data might be unnecessary for the following analysis. Remove the following
columns: “badge”, “url”, “building_size_unit”, “land_size_unit”,
“listing_company_id”, “listing_company_phone”, “auction_date”, “available_date”,
“images”, “images_floorplans”, “listers”, and “inspections” and print out the first 5
data records.
3. Sometimes, it could be useful to learn about the maximum, minimum, mean and
median price of different types of properties in a specific suburb, especially if you are
interested in buying a property in the suburb. Display the maximum, minimum, mean,
and median price of Apartment, House, Townhouse and Unit for each of the
following suburbs: (i) Clayton; (ii) Mount Waverley; (iii) Glen Waverley; and (iv) one
suburb that you are interested in. (Note: NA is not a value). Please describe the
distribution of property prices, based on the output.
4. As you probably realise, some data values are missing in the file. It is important to
know how much data is missing. Display the number of missing values and the
percentages of missing values for each of the columns.
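One minimal way to sketch this in R (assuming the data frame is named df):

```r
# Missing value count and percentage per column (df is assumed to hold the data)
na_count <- colSums(is.na(df))
na_pct   <- round(100 * na_count / nrow(df), 2)
data.frame(column = names(df), missing = na_count, missing_pct = na_pct)
```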
5. A starting point to understand the prosperity of the property market is to analyse the
number of transactions made across different time variables. To this end, create the
different time variables, which are year, month, day of the week and day of the month
decomposed from the column “sold_date”, and store this information by adding new
columns named “year”, “month”, “wday”, and “day”.
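A sketch with the lubridate package, assuming sold_date parses as year-month-day (adjust the parser to the file's actual date format):

```r
library(lubridate)
df$sold_date <- ymd(df$sold_date)  # adjust to the actual date format if needed
df$year  <- year(df$sold_date)
df$month <- month(df$sold_date, label = TRUE)
df$wday  <- wday(df$sold_date, label = TRUE)
df$day   <- mday(df$sold_date)
```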
6. Print out the first and last date in the column “sold_date”. As part of exploratory data
analysis (EDA), produce charts to see yearly, monthly, day of week and daily trends
of property transactions in the dataset and discuss the charts. For example, the
monthly trend chart can be drawn by computing the average number of transactions
on properties for each month over years.
7. Produce a chart to visualise the number of transactions on properties of the following
types, Apartment, House, Townhouse and Unit, made across different months in 2022.
What observations can you make?
8. Counting the number of transactions is an important way to understand the prosperity
of the property market and another is to analyse the total amount of money these
transactions account for.
What is the total amount of prices and the average price for transactions made on each
type of property (Apartment, House, Townhouse and Unit) across different months in
2022? What observations can you make?
9. Now let’s zoom in on the transactions made in different suburbs.
9.1. What are the top 10 suburbs with the largest number of transactions made in
2022 (including all types of properties)?
9.2. For each of these top 10 suburbs, what is the most frequent type of property in
the transactions?
9.3. Can you draw a stacked bar chart to visualise the number of transactions
related to each type of property made in these top 10 suburbs in 2022? What
do you observe?
10. Let’s move forward to analyse the characteristics of the Houses sold in the following
suburbs: (i) Kew; (ii) South Yarra; (iii) Caulfield; (iv) Clayton; (v) Glen Waverley;
(vi) Burwood; and (vii) a suburb that is of your interest (e.g., the suburb that you
currently live in).
10.1. For the Houses sold in each of the suburbs mentioned above, calculate the
average value of the following five variables:
■ bedrooms
■ bathrooms
■ parking_spaces
■ land_size
■ price
Before calculating the average values, please first filter out transaction records
with missing values in “land_size” or “price”. Then, for the missing values in
“bedrooms/bathrooms/parking_spaces”, please impute them with the
corresponding median value of “bedrooms/bathrooms/parking_spaces” in
each suburb.
What observations can you make?
10.2. Calculate the Pearson Correlation coefficient for each pair of the five variables
mentioned above, based on ALL the House transaction records made in the
first six suburbs mentioned above (Kew, South Yarra, Caulfield, Clayton, Glen
Waverley, and Burwood). Which of the Pearson Correlation coefficients are
statistically significant? (Please look at this resource for more information.)
How should these significant coefficients be interpreted?
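A sketch of how one pairwise test might be run, where houses is an assumed data frame holding the filtered House records:

```r
# Significance test for one pair; repeat (or loop) over all ten pairs
ct <- cor.test(houses$price, houses$land_size, method = "pearson")
ct$estimate  # Pearson's r
ct$p.value   # conventionally significant if p < 0.05

# All pairwise coefficients at once (complete cases only)
vars <- c("bedrooms", "bathrooms", "parking_spaces", "land_size", "price")
cor(houses[, vars], use = "complete.obs")
```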
11. To get a better understanding of how real estate agents attempt to attract potential
buyers, let’s analyse the text in the column “description”. For this question,
please exclude any NA values and remove any HTML tags from the text.
11.1. Calculate the length of the text contained in each description (measured in
characters) and produce a bar chart to visualise the number of transactions of
the following length groups:
■ [1, 500]
■ [501, 1000]
■ [1001, 1500]
■ [1501, 2000]
■ [2001, 2500]
■ > 2500
11.2. Then,
■ Calculate the average price of the properties with description length
falling into the groups specified above.
■ Which description length group has the maximum average price?
■ Does there exist any relationship between the price of a property and
its description length? Write code to calculate the Pearson Correlation
coefficients between the properties’ price and their description length,
and interpret the results.
Task D: Predictive Data Analysis Using R
People’s verbal and written language (e.g., blogs, the Ed forum on Moodle) can provide rich
information about their beliefs, fears, thinking patterns, social relationships, and personalities.
Text analysis has been widely used to analyse natural language (text) and gain insights, and
there have been many studies in this field. We introduce one such approach, LIWC
(Linguistic Inquiry and Word Count), which extracts various emotional, cognitive, and
structural components from language.
In this task, you are required to analyse posts generated by students in the Moodle discussion
forums of units at Monash University. Specifically, you need to apply machine learning
models to classify different types of forum posts, namely content-relevant vs.
content-irrelevant (i.e., whether the post content is related to knowledge taught in a course).
In particular, we have engineered a set of features (mostly via LIWC) for each forum post
which can be used to empower the training of a machine learning model.
The detailed instructions are explained in the following questions.
You can download the data files, “forum_liwc_train.csv”, and “forum_liwc_test.csv” from
Moodle. Please refer to the table below to know the meaning of each feature/column in the
data files.
Column ID — Column name (Column index) — Description
1. Unique_ID (Column A): Unique ID of a forum post selected from the discussion forum of a unit at Monash University
2. Features (Columns B-CP): Features generated by applying LIWC, a transparent text analysis program that counts words in psychologically meaningful categories. Detailed explanations of the features can be found in Table 4 of the PDF document “Manual_LIWC.pdf” on Moodle.
3. unit_faculty (Column CQ): The faculty to which the unit belongs
4. demographic_sex (Column CR): Sex of the student who made the forum post
5. label: The label to be predicted for the post. There are two classes, i.e., content-relevant (related to the unit content, e.g., "what is data management") and content-irrelevant (not related to the unit content).
Note: Values in the features (Columns B-CP) have been generated by the LIWC program;
larger feature values indicate that words in that category appear more frequently in the text.
1. You are required to build Machine Learning models to determine whether a post is
related to the unit content or not (i.e., predicting the label column values in the
dataset). Here, Columns B-CR in the datasets (i.e., the LIWC features, unit_faculty,
and demographic_sex) are the independent variables, and the column label is the
dependent variable. Use the training dataset (“forum_liwc_train.csv”) to train
machine learning models and the testing dataset (“forum_liwc_test.csv”) to test
them.
a. Build a classification tree model taking all the independent variables as input
(denoted as Model 1) and evaluate the performance of the model on the
training dataset using the relevant evaluation metrics you learned in class.
Describe how you represent the features unit_faculty and demographic_sex
before using them as part of the input, and explain how well the model fits the
training dataset.
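A minimal sketch of Model 1 with the rpart package; the file path, factor encoding, and formula are illustrative choices, not the only acceptable ones:

```r
library(rpart)
train <- read.csv("forum_liwc_train.csv")

# Represent the categorical features and the label as factors
train$unit_faculty    <- as.factor(train$unit_faculty)
train$demographic_sex <- as.factor(train$demographic_sex)
train$label           <- as.factor(train$label)

# Classification tree on all independent variables (excluding the ID column)
model1 <- rpart(label ~ . - Unique_ID, data = train, method = "class")

# Evaluate the fit on the training data, e.g., via a confusion matrix
pred <- predict(model1, train, type = "class")
table(Predicted = pred, Actual = train$label)
```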
b. Do you think the features unit_faculty and demographic_sex should be used
as part of the input to train a model? Why or why not?
c. Now we want to improve the performance of Model 1 (i.e., to get a more
accurate model). For example, you may try some of the following methods to
improve a model:
● Select a subset of the variables (especially the important ones in your
opinions) as input to empower a machine learning model.
● Deal with errors (e.g., filtering out data outliers, handling missing
values).
● Rescale data (i.e., bringing different variables with different scales to a
common scale).
● Transform data (i.e., transforming the distribution of variables).
● Consider interaction terms that we learnt in week 7 tutorial material.
● Try other machine learning algorithms that you know.
Please build predictive models by trying some of the above methods (or other
methods you can think of), evaluate the performance of the models by
measuring the F1 score on the testing dataset, and report which model is the
most accurate.
You need to explain the model building process (data preparation, feature
engineering, model design/building, and model evaluation) by including code,
outputs, and explanations (explain the code or the process), and justify why
you chose those methods to improve the model (e.g., why a particular subset
of the variables was chosen to build a model). Marks will be given based on
the depth of investigation undertaken to improve the model, as well as the
justification provided for the proposed approaches.
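As one way to compute the F1 score on the test set, a sketch follows; the class name used to index the confusion matrix is an assumption about how the labels are spelled in the data:

```r
test <- read.csv("forum_liwc_test.csv")
# ...apply the same factor encoding as for training, then:
pred <- predict(model1, test, type = "class")
cm   <- table(Predicted = pred, Actual = test$label)

pos <- "content-relevant"  # assumed positive-class label
precision <- cm[pos, pos] / sum(cm[pos, ])
recall    <- cm[pos, pos] / sum(cm[, pos])
f1 <- 2 * precision * recall / (precision + recall)
f1
```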