STA302H1F / 1001HF Autumn 2020 Assignment #2
A Simple Linear Model for Toronto and Mississauga House Prices
Posted by: Dr. Shivon Sue-Chee on Saturday, October 10, 2020
Due: In Quercus by 8pm on Saturday, October 24, 2020.
Late assignments will be subject to a penalty of 20% per day late. Submissions will not
be accepted more than 48 hours after the due date. Email submissions are not allowed.
1 Instructions
• Use RStudio to create two files:
1. An R Markdown file with your code, following the standard format in the A2 a20 RMForm.Rmd
file.
2. The corresponding report in HTML or PDF format.
• Create a video presentation, no longer than 5 minutes, in which you present your written report.
Your face should be shown at least once during your presentation. Save your video presentation as
an MP4 file and upload it to your UofT MyMedia account.
I suggest that you use Zoom to create your video. Here are two demonstration videos of how to
record and download your presentation in Zoom:
– https://www.youtube.com/watch?v=P6cTbnUPwfY
– https://kb.siue.edu/61721
Here is documentation on using UofT MyMedia:
– https://www.oise.utoronto.ca/online/Instructors/Video_server_-_MyMedia/index.html
• Into Quercus Assignment 2, submit the following three items:
1. A MyMedia link to your video presentation
2. An Rmd file with your R Markdown code
3. A portrait picture of yourself with your T-card
• Note that for a separate participation activity, which will be announced later, you will be asked
to upload your video presentation to peerScholar for peer review. For the sake of privacy, please
avoid revealing your full identity in your video presentation. You could use your initials and up to
the last four digits of your student number, if you like.
• Presentation of your report is very important. Do not show R code unless it is required for your
solutions. Only required numbers and plots should be shown. Extraneous output should be hidden.
Use chunk options such as include=FALSE, echo=FALSE, and message=FALSE where necessary.
• Write and present your own work. For instance, personalize your code as much as possible, using
your initials. All plots produced must be given a title that includes the last 4 digits of your student
number.
• Use a benchmark significance level of 5%. Report p-values to 4 decimal places.
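The chunk options named in the list above (include, echo, message) can also be set once for the whole document instead of chunk by chunk. The following is a minimal sketch, assuming the knitr package (which R Markdown uses to run chunks) is installed; setting warning = FALSE is an extra assumption and is not asked for in the instructions.

```r
# Typically placed in a setup chunk whose own header carries include=FALSE,
# so that the setup chunk itself does not appear in the knitted report.
knitr::opts_chunk$set(
  echo    = FALSE,  # hide R code in the report
  message = FALSE,  # hide package start-up and other messages
  warning = FALSE   # assumption: also hide warnings (not required by the assignment)
)
```

An individual chunk can still override a default, for example by setting echo=TRUE in its header where the code is required for a solution.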
2 Grading Scheme
Grading rubrics will be posted in Quercus for the video presentation and the RMarkdown file.
Note that if a portrait picture of yourself with a clear view of your T-card is not received by the due date
or if your picture and T-card do not correspond to our other records, a mark of zero will be given for the
entire assignment.
3 The Data
First-time home buying is currently a major federal issue. Prices for detached houses have been at an all-time
high during the current COVID-19 period. Data for this assignment was obtained from the Toronto
Real Estate Board (TREB) on detached houses in two separate neighbourhoods: one in the city of Toronto
and another in the city of Mississauga. The data is contained in the file “real20.csv” on the Assignment 2 page.
The variables in the dataset are:
• ID: property identification
• sold: the actual sale price of the property in millions of Canadian dollars
• list: the last list price of the property in millions of Canadian dollars
• taxes: previous year’s property tax in Canadian dollars
• location: M- Mississauga Neighbourhood, T- Toronto Neighbourhood
For this assignment, we are interested in establishing a simple linear model that home buyers can use
to determine the expected sale price of detached, single-family homes in the two neighbourhoods in the
Greater Toronto Area.
4 The Analysis
Set the seed of your randomization to be the last 4 digits of your student number. Randomly
select a sample of 200 cases. Based on your sample data, complete an RMarkdown file and the
corresponding report with the following sections.
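The seeding and sampling step described above might be sketched as follows. Because real20.csv is not reproduced here, the block builds a simulated stand-in data frame with the same column names; the simulated values, the row count of 230, the seed 1234, and the object names (real20, rows_1234, sample_1234) are all placeholders for illustration, not part of the assignment.

```r
# Stand-in data with the same columns as real20.csv. The values are simulated
# for illustration only and are NOT the TREB data.
set.seed(1)
n_all <- 230  # hypothetical number of rows in the full file
real20 <- data.frame(
  ID       = seq_len(n_all),
  list     = round(runif(n_all, 0.6, 3.5), 3),
  location = sample(c("M", "T"), n_all, replace = TRUE)
)
real20$sold  <- round(0.98 * real20$list + rnorm(n_all, sd = 0.1), 3)
real20$taxes <- round(3000 + 2500 * real20$list + rnorm(n_all, sd = 800))

# In an actual report the data would be read from the file instead, e.g.
# real20 <- read.csv("real20.csv")

# Seed = last 4 digits of the student number (1234 is a placeholder).
set.seed(1234)
rows_1234   <- sample(nrow(real20), size = 200)  # sampling without replacement
sample_1234 <- real20[rows_1234, ]
```

Setting the seed immediately before the call to sample() means that re-knitting the document selects the same 200 cases each time.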
I. Exploratory Data Analysis section.
• Use a single plot to describe your data. Create a subset of your data by removing at most two
cases and briefly explain your choice.
• Then, using the data subset for this and the remaining parts of this assignment, draw two
scatterplots of the response variable, sale price, against (i) list price and (ii) taxes. In each plot,
distinguish between properties in neighbourhood M and those in neighbourhood T, and include
a legend/key.
• Interpret each of the three plots produced in this part, that is, describe at least one major
highlight from each plot. Each highlight should differ from the others.
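One way to distinguish the two neighbourhoods in a scatterplot, as asked in the list above, is to map location to colour and plotting symbol and then add a legend. This is a sketch in base R graphics on simulated stand-in data (not the TREB data); the colours, symbols, seed, and the "1234" in the title are placeholder choices.

```r
# Simulated stand-in for the sampled data; values are illustrative only.
set.seed(1234)  # placeholder for the last 4 digits of a student number
n <- 200
dat <- data.frame(
  list     = runif(n, 0.6, 3.5),
  location = sample(c("M", "T"), n, replace = TRUE)
)
dat$sold <- 0.98 * dat$list + rnorm(n, sd = 0.1)

# Look-up vectors keyed by neighbourhood code.
loc  <- factor(dat$location, levels = c("M", "T"))
cols <- c(M = "steelblue", T = "firebrick")
pchs <- c(M = 1, T = 17)

plot(dat$list, dat$sold,
     col  = cols[as.character(loc)],
     pch  = pchs[as.character(loc)],
     xlab = "List price (millions of CAD)",
     ylab = "Sale price (millions of CAD)",
     main = "Sale price by list price, 1234")
legend("topleft",
       legend = c("Mississauga (M)", "Toronto (T)"),
       col = cols, pch = pchs, title = "Neighbourhood")
```

The same pattern applies to the taxes plot by swapping the x variable and its axis label.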
II. Methods and Model section.
• Carry out three simple linear regressions (SLRs) of sale price on list price: one for all data, one
for properties in neighbourhood M, and another for properties in neighbourhood T. In a table,
give the values of the following for each of these regressions:
– R²
– the estimated intercept
– the estimated slope
– the estimate of the variance of the error term
– the p-value for the test with null hypothesis that the slope is 0
– a 95% confidence interval for the slope parameter
• Interpret and compare the three R² values. Briefly explain why they appear similar
or different.
• Briefly discuss whether a pooled two-sample t-test can be used to determine if there is a
statistically significant difference between the slopes of the simple linear models for the two
neighbourhoods. (Note: You do not need to carry out a pooled two-sample t-test here.)
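The quantities requested for the regression table in this section can all be read off a fitted lm object. The sketch below collects them into one row per model; it runs on simulated stand-in data (not the TREB data), and the helper name slr_row and the column names of the resulting table are hypothetical.

```r
# Simulated stand-in for the sampled data; values are illustrative only.
set.seed(1234)  # placeholder for the last 4 digits of a student number
n <- 200
dat <- data.frame(
  list     = runif(n, 0.6, 3.5),
  location = sample(c("M", "T"), n, replace = TRUE)
)
dat$sold <- 0.98 * dat$list + rnorm(n, sd = 0.1)

# Summarise one SLR of sold on list as a one-row data frame.
slr_row <- function(d, label) {
  fit <- lm(sold ~ list, data = d)
  s   <- summary(fit)
  ci  <- confint(fit, "list", level = 0.95)
  data.frame(
    model     = label,
    R2        = s$r.squared,
    intercept = unname(coef(fit)["(Intercept)"]),
    slope     = unname(coef(fit)["list"]),
    error_var = s$sigma^2,  # estimate of the error variance: SSE / (n - 2)
    p_slope   = s$coefficients["list", "Pr(>|t|)"],  # test of H0: slope = 0
    ci_lower  = ci[1, 1],
    ci_upper  = ci[1, 2]
  )
}

slr_table <- rbind(
  slr_row(dat, "All"),
  slr_row(dat[dat$location == "M", ], "M"),
  slr_row(dat[dat$location == "T", ], "T")
)
print(slr_table, digits = 4)
```

For reporting, a p-value can be fixed to 4 decimal places with formatC(p, format = "f", digits = 4).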
III. Discussion and Limitations section.
A sensible data science approach is to base inferences or conclusions only on valid models.
• Select one of the three fitted models in part II and give a brief explanation for your choice.
• Discuss whether there are any violations of the normal error SLR assumptions for your selected
model. Use at most two plots.
• Identify two potential numeric predictors (other than those given in the data set) that could be
used to fit a multiple linear regression for sale price.
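For the assumption check in this section, two commonly used plots are residuals against fitted values and a normal Q-Q plot of the standardized residuals. The following is a sketch on simulated stand-in data (not the TREB data); the choice of these two particular plots, the seed, and the "1234" in the titles are assumptions for illustration.

```r
# Simulated stand-in for the sampled data; values are illustrative only.
set.seed(1234)  # placeholder for the last 4 digits of a student number
n <- 200
dat <- data.frame(list = runif(n, 0.6, 3.5))
dat$sold <- 0.98 * dat$list + rnorm(n, sd = 0.1)

fit <- lm(sold ~ list, data = dat)

par(mfrow = c(1, 2))  # two panels side by side

# Residuals vs fitted: curvature suggests non-linearity,
# a fan shape suggests non-constant error variance.
plot(fitted(fit), resid(fit),
     xlab = "Fitted sale price", ylab = "Residual",
     main = "Residuals vs fitted, 1234")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points far from the line suggest non-normal errors.
qqnorm(rstandard(fit), main = "Normal Q-Q, 1234")
qqline(rstandard(fit))
```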