STA220H1 - No Ghostwriting - Assignment 2 | 学霸联盟

Date: 2023-11-29

STA220H1 The Practice of Statistics I (Fall 2023)
Assignment 2 Instructions
Due Date: December 1, 2023 at 11:59 on Crowdmark
Instructions
This is an individual assignment. You are expected to work on this independently. While you
may discuss ideas and concepts, please do not share your code or written answers. It is expected
that all code and written work should be written by yourself. Please note, this assignment is fairly
open, so the context of most of the work completed here should not match that of your peers.
Submission Format and Instructions
Your final submission will be a PDF file. You will submit your solutions on Crowdmark. There
will be a different upload box for each question, so it is recommended that you place each
question on different pages or files.
Your PDF file will need to show (1) R code, (2) R output/figures, and (3) your written answers.
Here are some suggested ways you can create your final submission:
• Use Microsoft Word to type out your answers. Screenshot your R output and place these
images throughout the document. For the R code, either copy/paste as text or screenshot.
• Use an app like Notability, OneNote, etc., where you can write/type your answers and
include screenshots of your R code and output.
• Use RMarkdown and knit to a PDF. Alternatively, you can knit to an HTML file and then
save it as a PDF.
How you create the final file is up to you, as long as it is clear and organized. You don’t want the
TA to be frustrated while marking your work!
Use of Built-In Functions in R
You are allowed to use built-in functions and packages in R. This includes functions that help
with confidence intervals and hypothesis testing. However, if you are going to use built-in
functions for intervals and tests, please be aware that not all built-in functions that we've seen in
class will give you the proper test statistic required in some questions.
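To illustrate the caution above, here is a minimal sketch on made-up data (the vector `x` and the null value `p0` are hypothetical and have nothing to do with the Spotify data). It computes a one-sample Z statistic for a proportion by hand and compares it with `prop.test()`, which reports a chi-squared statistic and applies a continuity correction by default.

```r
# Hypothetical 0/1 data, for illustration only
x <- c(1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1)
n <- length(x)
p_hat <- mean(x)   # sample proportion
p0 <- 0.5          # hypothesized proportion (assumed for this sketch)

# Z statistic and two-sided p-value computed by hand
z_stat <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value <- 2 * pnorm(-abs(z_stat))

# prop.test() returns X-squared, not Z; the default uses a continuity correction
pt_default <- prop.test(sum(x), n, p = p0)
pt_uncorrected <- prop.test(sum(x), n, p = p0, correct = FALSE)

# Without the correction, X-squared equals the square of the Z statistic
all.equal(unname(pt_uncorrected$statistic), z_stat^2)
all.equal(pt_uncorrected$p.value, p_value)
```

With the default correction, the reported statistic and p-value differ from the hand calculation, so it is worth checking which version a question asks for.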
Late Penalty
As described on the course syllabus, late work will be penalized 10% per hour.
Data for this Assignment
In this assignment, you will work with the Spotify dataset that you used in Assignment 1. Recall
that the dataset contains a comprehensive list of the most famous songs of 2023 as listed on Spotify
as of August 2023. It provides insights into each song's attributes, popularity, and presence on
various music platforms.
The following variables are provided in the data:
• track_name: Name of the song
• artist: Name of the artist(s) of the song
• artist_count: Number of artists contributing to the song
• released_year: Year when the song was released
• released_month: Month when the song was released
• released_day: Day of the month when the song was released
• in_spotify_playlists: Number of Spotify playlists the song is included in
• in_spotify_charts: Presence and rank of the song on Spotify charts
• streams: Total number of streams on Spotify
• in_apple_playlists: Number of Apple Music playlists the song is included in
• in_apple_charts: Presence and rank of the song on Apple Music charts
• in_deezer_playlists: Number of Deezer playlists the song is included in
• in_deezer_charts: Presence and rank of the song on Deezer charts
• in_shazam_charts: Presence and rank of the song on Shazam charts
• bpm: Beats per minute, a measure of song tempo
• key: Key of the song
• mode: Mode of the song (major or minor)
• danceability_percent: Percentage indicating how suitable the song is for dancing
• valence_percent: Positivity of the song's musical content
• energy_percent: Perceived energy level of the song
• acousticness_percent: Amount of acoustic sound in the song
• instrumentalness_percent: Amount of instrumental content in the song
• liveness_percent: Presence of live performance elements
• speechiness_percent: Amount of spoken words in the song
In this assignment, you may wish to create new binary variables based on existing variables. For
example, the following code creates a new variable called “c_sharp” in the dataset
“spotify_data”. It is equal to 1 if the key is C# and 0 otherwise.
spotify_data$c_sharp <- as.numeric(spotify_data$key == 'C#')
In another example, the following code creates a new variable called “after_2020”. It is equal to
1 if the release year is after 2020 and 0 otherwise.
spotify_data$after_2020 <- as.numeric(spotify_data$released_year > 2020)
You may also wish to remove observations that have an NA (i.e., missing) in a variable of
interest. For example, this code creates a new dataset called ‘spotify_data2’ such that all the
observations where ‘c_sharp’ was missing are removed.
spotify_data2 <- spotify_data[!is.na(spotify_data$c_sharp),]
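The three snippets above can be tried together on a small made-up data frame. This is only a sketch: the rows below are hypothetical and merely reuse the column names from the assignment data.

```r
# Hypothetical toy data frame that reuses the assignment's column names
spotify_data <- data.frame(
  key = c("C#", "D", NA, "C#", "F"),
  released_year = c(2019, 2021, 2022, 2023, 2020)
)

# Binary indicators, created as in the examples above
spotify_data$c_sharp <- as.numeric(spotify_data$key == "C#")
spotify_data$after_2020 <- as.numeric(spotify_data$released_year > 2020)

# A missing key gives a missing indicator, because NA == "C#" evaluates to NA
spotify_data$c_sharp

# Keep only the rows where the indicator is not missing
spotify_data2 <- spotify_data[!is.na(spotify_data$c_sharp), ]
nrow(spotify_data2)

# The mean of a 0/1 variable is the sample proportion of 1s
mean(spotify_data2$c_sharp)
```

Because the indicators are coded 0/1, `mean()` and `sum()` give the sample proportion and the count directly.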
Question 1 (6 marks)
We are interested in the proportion of songs that are in a major key. Answer the following
questions using the Spotify dataset. For calculations that you complete in R, show your code and
output.
a) Test whether 50% of songs are in a major key using α = 0.1. State the hypotheses,
calculate the Z-statistic and p-value using R, and state the relevant conclusion. (4 marks)
b) Compute a 99% confidence interval for the proportion of songs that are in a major key.
You may use ≈ 0.5. Interpret the interval. (2 marks)
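As a rough sketch of the mechanics of a large-sample confidence interval for a proportion, the code below uses made-up counts (`n` and `successes` are hypothetical, not taken from the Spotify data). It assumes the sample proportion is used in the standard error, which is one common choice and not necessarily the one intended by the question.

```r
# Hypothetical counts, for illustration only
n <- 200
successes <- 116
p_hat <- successes / n

conf_level <- 0.99
# Critical value from the standard normal distribution
z_star <- qnorm(1 - (1 - conf_level) / 2)

# Standard error based on the sample proportion (an assumption of this sketch)
se <- sqrt(p_hat * (1 - p_hat) / n)

# Interval: estimate plus or minus critical value times standard error
ci <- p_hat + c(-1, 1) * z_star * se
ci
```

Changing `conf_level` changes only `z_star`; a higher confidence level gives a wider interval.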
Question 2 (6 marks)
We are interested in the average beats per minute of a song. Answer the following questions using
the Spotify dataset. For calculations that you complete in R, show your code and output.
a) Test whether the average beats per minute of a song is equal to 124 bpm using α = 0.05.
State the hypotheses, calculate the test statistic and p-value using R, and state the relevant
conclusion. (4 marks)
b) Compute a 90% confidence interval for the average beats per minute of a song. Interpret
the interval. (2 marks)
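For reference, a one-sample t-test and the matching confidence interval can be sketched as follows. The tempo values in `bpm` and the null value `mu0` are made up for illustration, and the sketch assumes a t-based procedure is appropriate for the data at hand.

```r
# Hypothetical tempo values, for illustration only
bpm <- c(118, 125, 130, 122, 140, 110, 128, 135, 121, 126)
mu0 <- 124   # hypothesized mean (assumed for this sketch)

n <- length(bpm)
se <- sd(bpm) / sqrt(n)

# t statistic and two-sided p-value computed by hand
t_stat <- (mean(bpm) - mu0) / se
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

# 90% confidence interval computed by hand
ci <- mean(bpm) + c(-1, 1) * qt(0.95, df = n - 1) * se

# The built-in t.test() gives the same statistic, p-value, and interval
res <- t.test(bpm, mu = mu0, conf.level = 0.90)
res$statistic
res$p.value
res$conf.int
```

Comparing the hand calculation with `t.test()` is a quick way to confirm that a built-in function reports the statistic a question asks for.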
Question 3 (40 marks)
In this question you are going to write-up a short analysis based on the Spotify dataset.
Requirements:
• A few sentences introducing the dataset. Assume the reader does not know anything
about the dataset.
• A few sentences introducing the population of interest and the parameters of interest.
• Create at least 2 graphs related to the parameters of interest. Describe the patterns you see
in the graphs.
• Conduct at least 2 hypothesis tests. For each test, state the hypotheses, z-statistic/t-statistic,
p-value, and conclusion.
• Compute at least 2 confidence intervals. For each interval, make sure you specify the
level of confidence. Interpret each of your intervals.
• A few sentences summarizing and concluding the results of your analysis.
• An appendix that shows all code and R output.
Notes:
• Written text and graphs should appear in your write-up. All code and output (excluding
graphs) should be included in an appendix, and should not appear in the main part of your
write-up.
• Write-ups (excluding the appendix) should not exceed 500 words.
• You are welcome to filter through the data before the analysis to adjust the population of
interest.
• For the hypothesis tests and confidence intervals, you are expected to complete all the
calculations in R.
• If you explored a parameter in a hypothesis test, that parameter can still be the subject of
one of your confidence intervals.
• All writing should be in full sentences.
o In the body of the text, use full sentences to describe your test/interval. For
example, “We wish to test the null hypothesis that ____ versus the alternative
hypothesis that ____. The test statistic is ___ with a p-value of ____.”
o All calculations (R code and output) should be in the appendix.
• You are encouraged to use headings to organize your work.
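One possible way to fill the numbers into the sentence template from the notes above is to format them in R and copy the result into the write-up. The values below are hypothetical placeholders, and the rounding (two and three decimal places) is only an assumption of this sketch.

```r
# Hypothetical results, for illustration only
t_stat <- 1.2345
p_value <- 0.2481

# Round and insert the values into a reporting sentence
sentence <- sprintf(
  "The test statistic is %.2f with a p-value of %.3f.",
  t_stat, p_value
)
sentence
```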
Question 3 Rubric
Writing Quality (10 marks)
• Inadequate (0-4 marks): Some written components are not included. Writing is unclear.
• Fair (5-6 marks): Most written components are provided. Written components contain major issues. The descriptions do not accurately describe the methods. Writing is somewhat unclear.
• Good (7-8 marks): All the written components are provided and shows that student is able to properly communicate statistical concepts. Writing is generally clear.
• Excellent (9-10 marks): All the written components are provided. Student exceeds expectations in statistical communication. Writing is clear and compelling.
Plots (10 marks)
• Inadequate (0-4 marks): Does not meet the requirement of 2+ plots.
• Fair (5-6 marks): Required plots are provided, but plots do not highlight the important information related to parameters of interest.
• Good (7-8 marks): Required plots are provided, and mostly shows that the student is able to create a plot relevant for the situation. Plots are labelled properly.
• Excellent (9-10 marks): Required plots are provided, and a lot of thought was put into creating the plot. Plots are interesting, compelling, and communicate well to the viewer.
Hypothesis Tests and Confidence Intervals (10 marks)
• Inadequate (0-4 marks): Very few of the required hypothesis tests and confidence intervals are provided. Contains major errors.
• Fair (5-6 marks): Some of the required hypothesis tests and confidence intervals are provided. Errors with the set-up, calculations, and/or interpretations.
• Good (7-8 marks): The required hypothesis tests and confidence intervals are provided. Interpretations are provided and correct.
• Excellent (9-10 marks): The required hypothesis tests and confidence intervals are provided. Interpretations are provided. Conclusions are well written and provide an interesting discussion to the analysis.
Appendix, R code (5 marks)
• Inadequate (0-2 marks): R code is not shown or has many major errors.
• Fair (3 marks): R code is somewhat provided but is difficult to follow.
• Good (4 marks): R code is provided but contains errors or is hard to follow.
• Excellent (5 marks): R code is provided. Appropriate functions and/or calculations are used.
Formatting and Organization (5 marks)
• Inadequate (0-2 marks): Poorly organized and difficult to follow.
• Fair (3 marks): Sometimes difficult to follow. Code may appear in body of the text.
• Good (4 marks): Organized and formatted well. Code does not appear in the body of the text.
• Excellent (5 marks): Very well organized and presentable. Code does not appear in the body of the text. Proper headings are used.