QBUS6810 Statistical Learning and Data Mining
Group Assignment
October 11, 2022
Contents
1 Key information
1.1 Groups
2 Problem description
3 Datasets
3.1 Data description
3.2 Written report
3.3 Suggested outline of the report
3.4 Requirements
4 Jupyter notebook
5 Kaggle competition
5.1 Kaggle marks
5.2 Real world relevance
6 Submission details
6.1 Required submissions
6.2 Late submissions
7 Academic Integrity
1 Key information
Kaggle competition ends: 11:59pm 4th November 2022
Report and notebook due date: 11:59pm 6th November 2022
Weight: 30% of your final grade
Simple extensions: Simple extensions cannot be used for group work. More information on
simple extensions is given here.
Special consideration: If you need to apply for special consideration, you can do so by following
the links on the special considerations page.
1.1 Groups
The assignment is to be completed in groups of up to 5 students. Groups can be formed across
different tutorials and across RE and CC streams. Please make sure that you have registered your
group on Canvas: those groups will be used for identification and assessment purposes.
You are ultimately responsible for forming your own groups. If you would like to be randomly
allocated to a group, please contact qbus6810.admin@sydney.edu.au as early as possible; otherwise
you may find yourself without a group. Additionally, if you are a small group and would like new
members to be randomly allocated to your group, please contact qbus6810.admin@sydney.edu.au.
Groups are expected to be finalised by Sunday the 16th of October.
2 Problem description
Airbnb is a global platform that runs an online marketplace for renting and leasing short-term
lodging. It is interested in developing a pricing service for its users that will compute a recommended
price based on the features of a listing. As a consultant working for a data analytics company, you
are approached by Airbnb to develop a model for predicting nightly prices of Airbnb listings based
on state-of-the-art techniques from statistical learning. The goal of your analytics team is to predict
the price per night of listings for properties in Sydney, Australia. Such information can be used to
estimate the prices of new listings or to guide new hosts in advertising their properties. Airbnb can
also use the information to identify which of their listings produce the most profit.
You are provided with a training dataset containing detailed information on a number of existing
Airbnb listings in Sydney. As part of the contract, you are asked to write a report according to the
instructions given in Section 3.2.
3 Datasets
You have been provided with the following datasets, which can be downloaded from Canvas.
• train.csv: for training and validating your models.
• test.csv: for making predictions.
• sample submission.csv: your predictions must be in the same format as this file.
3.1 Data description
The data correspond to Airbnb listings in Sydney with each row corresponding to a single listing.
You have been provided with a subset of the original dataset.
As a consequence of using real data scraped from Airbnb, a detailed description of all the variables
is not available. However, the names of the variables are generally self-explanatory. An incomplete
data dictionary can be found on Canvas.
The first column in the data provides an identifier for each listing and is included to comply
with the Kaggle format. It should not be used as a predictor in the analysis. The response variable,
price, is the second column in the training dataset. It gives the price per night for each listing in
Australian Dollars (AUD). Variables latitude and longitude specify the geographic location of each
property. Several variables are Boolean, with true recorded as ‘t’ and false recorded as ‘f’. Some
of the listings have missing values under some of the variables. Note that in many cases a missing
value means that the corresponding characteristic does not apply to that particular Airbnb listing.
This is information, rather than lack of information, which you could make use of in your analysis.
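As a minimal sketch of how the 't'/'f' encoding and the missing values described above might be handled, the snippet below maps the Boolean strings to 1/0 and adds an indicator column recording where a value was missing. The column names and toy rows are hypothetical stand-ins, not taken from train.csv, and whether a missing-value indicator is appropriate depends on the variable.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for train.csv; the column names are hypothetical.
listings = pd.DataFrame({
    "host_is_superhost": ["t", "f", None, "t"],
    "security_deposit": [200.0, np.nan, 0.0, np.nan],
})


def encode_tf(column: pd.Series) -> pd.Series:
    """Map 't' -> 1.0 and 'f' -> 0.0; missing entries stay NaN."""
    return column.map({"t": 1.0, "f": 0.0})


def add_missing_flag(df: pd.DataFrame, name: str) -> pd.DataFrame:
    """Return a copy with a 0/1 column marking where `name` was missing."""
    out = df.copy()
    out[f"{name}_missing"] = out[name].isna().astype(int)
    return out


listings["host_is_superhost"] = encode_tf(listings["host_is_superhost"])
listings = add_missing_flag(listings, "security_deposit")
print(listings)
```

Keeping the indicator as a separate column lets a model treat "not applicable" as a signal in its own right, independently of how the original column is later imputed.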
3.2 Written report
The purpose of the report is to describe, explain, and justify your solution to the client. You
can assume that the client is trained in business analytics but is not an expert in statistical
learning.
Your report should be a maximum of 15 pages (single spaced, 11pt font). Note that the cover
page, reference list and appendix do not count towards the page limit.
3.3 Suggested outline of the report
1. Introduction
2. Data processing
3. Exploratory data analysis
4. Feature engineering
5. Methodology
6. Validation and comparisons
7. Conclusion
More detailed information is provided in the report scaffold, which you can download from
Canvas. Additionally, a guide for the page length is provided in the marking rubric.
3.4 Requirements
1. Your report must provide the validation scores (those from the Public Leaderboard on Kaggle)
for five different sets of predictions, including your final model. These should generally be your
best performing models within the model requirements specified below. You will need to make
a submission on Kaggle (see Section 5 for instructions) to get each validation score.
2. The five sets of predictions must come from different statistical learning methods. At least one
of the five models should be an interpretable linear model (OLS, Lasso, etc); at least one
should be an interpretable model specified by a single regression tree; at least one should be
an advanced tree-based model (bagging, random forests or boosting); and at least one should
be a model stack (or model average).
3. In the methodology section you will discuss three of the five models in detail (including both
the description of the methods/algorithms and the interpretation of the estimated models).
The remaining two models do not need to be discussed in detail (you can just provide one brief
descriptive sentence for each of them).
4. One of the three models that you discuss in detail must be your final model; one of the three
models is required to be an interpretable linear model (OLS, Lasso, etc); and one is required to
be an interpretable model specified by a single regression tree. Please note that the description
of the methods/algorithms for the three models should take up at most 3 pages.
5. You must pay special attention to, and report on, the relationship between the location and
the price, both during the exploratory data analysis and during the model interpretation. You
must comment on the patterns in pricing around Sydney and its constituent suburbs. As
part of feature engineering, you must create (and describe in the report) at least one new
location-related variable by using the existing variables and, if you wish, external information.
6. You are expected to hold at least three group meetings during the course of the assignment.
You will need to take meeting minutes as outlined in the appendix of the assignment template.
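For the location-related variable asked for in requirement 5, one possible candidate (an illustration only, not a prescribed feature) is the great-circle distance from each listing to a reference point, computed from latitude and longitude with the haversine formula. The reference coordinates below are an assumed approximation of central Sydney, and the function names are hypothetical.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

# Assumed reference point: approximate coordinates of central Sydney.
CBD_LAT, CBD_LON = -33.8688, 151.2093


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars or NumPy arrays of degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def distance_to_cbd_km(latitude, longitude):
    """Distance from each listing to the assumed reference point."""
    return haversine_km(latitude, longitude, CBD_LAT, CBD_LON)


# Two made-up listings: one at the reference point, one several km east of it.
lat = np.array([-33.8688, -33.8915])
lon = np.array([151.2093, 151.2767])
dist = distance_to_cbd_km(lat, lon)
print(dist)
```

Because the function is vectorised, it can be applied directly to the latitude and longitude columns of a data frame. Other reference points, such as beaches or transport hubs, could be substituted in the same way.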
4 Jupyter notebook
You must provide a Jupyter notebook containing all of the relevant code used to produce the results
in your report. The notebook should be well formatted and easy to understand. A notebook scaffold
has been provided for you on Canvas.
Once you are ready to submit your notebook, you can use Ed to check that your notebook runs
without error.
5 Kaggle competition
You will participate in the Kaggle competition that will be run on www.kaggle.com. This competition
will allow you to incorporate feedback into your model building process and compare your
performance with that of other groups. Participation in the competition is part of the assessment,
so please make sure that your final submission is correct. Your ranking in the competition will affect
your mark.
You will need to create a Kaggle account, identifiable by your name, to access the competition
and make submissions. Please note that you can significantly simplify your registration with Kaggle
by using social logins (Facebook, Yahoo, Google) to sign in. Those options are available on the
Kaggle sign-in page. After you have created an account and logged into Kaggle, you should be able
to access the competition here (you need to be logged in to get to the competition page via the link).
For convenience, this link is also available on the Canvas Assignment page.
On this page you will click on the ‘Join Competition’ link, located in a dark box near the top
right corner of the page. After you accept the competition rules, you will have joined the Kaggle
competition for the group project. Each group will need to create a team on Kaggle. The group
leader can create a team by joining the competition and then going into the ‘Team’ tab, which will
appear near the top of the competition page. The leader can then invite other group members using
their Kaggle names (they need to first join the competition before they are able to be invited).
Kaggle team composition must be identical to that of the groups you formed on Canvas, and the
team number must match the group number. Each student in the group is required to sign up and
be identifiable as a member of a Kaggle team.
Kaggle randomly splits (just once) the listings in the test.csv file into validation (50%) and test
(50%) cases, but you will not know which ones are which. When you make a submission during the
Kaggle competition, you get a score equal to the RMSE computed on the validation listings. These
scores are displayed on the ‘Public Leaderboard’ and provide an ongoing ranking of teams. You can
use the scores of your submissions to help you select the best predictive model.
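To make the scoring concrete, the sketch below computes the RMSE (root mean squared error) for a handful of made-up prices and then evaluates it separately on two random halves, mimicking the one-off validation/test split described above. The numbers and the seed are illustrative assumptions; Kaggle's actual split is not known to participants.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up nightly prices (AUD) and predictions for four listings.
actual = np.array([100.0, 250.0, 80.0, 400.0])
predicted = np.array([110.0, 240.0, 100.0, 380.0])
score = rmse(actual, predicted)

# Mimic a single random 50/50 split into "validation" and "test" listings.
rng = np.random.default_rng(0)
is_public = rng.permutation(len(actual)) < len(actual) // 2
public_score = rmse(actual[is_public], predicted[is_public])
private_score = rmse(actual[~is_public], predicted[~is_public])
print(score, public_score, private_score)
```

The two halves generally give different scores for the same predictions, which is why a small improvement on the Public Leaderboard does not guarantee an improvement on the Private Leaderboard.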
You will need to manually select one of your Kaggle submissions to be used as your final model at
the end of the competition. Once the competition is over, Kaggle will rank teams’ final submissions
based on the test cases only, and those will be displayed on the ‘Private Leaderboard’. Your goal
is to do as well as possible on the Private Leaderboard at the end of the competition, so please be
careful not to overfit the validation cases in an attempt to improve your public ranking. Please note
that the competition ends at 11:59pm on the 4th of November, which is exactly 2 days before the
due time for the assignment report.
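One common way to reduce reliance on the Public Leaderboard when choosing a final model is to compare candidates by k-fold cross-validation on the training data. The sketch below does this with scikit-learn for one model from each family named in the requirements (a Lasso, a single regression tree, a random forest, and a stack of the three). The data are synthetic and the hyperparameters are arbitrary assumptions, so the numbers it prints say nothing about the Airbnb data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the processed training data (not the Airbnb data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 150 + 40 * X[:, 0] - 25 * X[:, 1] + rng.normal(scale=10, size=200)

# One illustrative model per required family; settings are arbitrary.
base_models = {
    "lasso": LassoCV(cv=5),
    "tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
candidates = dict(base_models)
candidates["stack"] = StackingRegressor(
    estimators=list(base_models.items()),
    final_estimator=LinearRegression(),
    cv=5,
)

# 5-fold cross-validated RMSE for each candidate (lower is better).
cv_rmse = {}
for name, model in candidates.items():
    scores = cross_val_score(
        model, X, y, cv=5, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = -scores.mean()

for name, value in sorted(cv_rmse.items(), key=lambda item: item[1]):
    print(f"{name}: {value:.2f}")
```

scikit-learn reports the score as a negative RMSE so that larger is always better, hence the sign flip before storing it.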
Rank    Mark
1       10
20      9
40      7.5
80      5
160     0
Table 1: Examples of rankings in the Kaggle competition and their corresponding awarded mark
(out of 10). This table assumes that there are a total of 160 groups, which may change as the group
registration is finalised.
5.1 Kaggle marks
Your ranking on the Private Leaderboard of the Kaggle competition will count towards 10% of this
assignment. We will look at the rank of your group and your mark will be:
\[
\text{mark} = 10 \times \frac{\text{number of groups} - \text{your rank} + 1}{\text{number of groups}} \tag{1}
\]
rounded to the nearest half mark. Examples of mark calculations from rank are given in Table 1.
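The calculation in equation (1) can be sketched as a short function. The function name is hypothetical, and rounding exact quarter marks upwards is an assumption, since the text only says "rounded to the nearest half mark". With 160 groups, it reproduces the example values in Table 1.

```python
import math


def kaggle_mark(rank: int, n_groups: int) -> float:
    """Mark out of 10 from equation (1), rounded to the nearest half mark."""
    if not 1 <= rank <= n_groups:
        raise ValueError("rank must lie between 1 and the number of groups")
    raw = 10 * (n_groups - rank + 1) / n_groups
    # Round to the nearest 0.5; exact ties go up (an assumption).
    return math.floor(2 * raw + 0.5) / 2


# Ranks used as examples in Table 1, assuming 160 groups.
marks = {rank: kaggle_mark(rank, 160) for rank in (1, 20, 40, 80, 160)}
print(marks)
```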
5.2 Real world relevance
The ability to perform in a Kaggle competition is highly valued by employers. Some employers go
as far as to set up a Kaggle competition just for recruitment.
6 Submission details
6.1 Required submissions
• Written report (one .pdf file per group)
• Jupyter notebook (one .ipynb notebook per group)
Your report and notebook files should be named:
• QBUS6810 GroupXXX report.pdf
• QBUS6810 GroupXXX notebook.ipynb
where XXX is your group number. For example, if you were group 32, this would be Group032.
Your assignment should be submitted on Canvas. To find the submission page go to Modules >
Group Assignment. You may submit multiple times but only your last submission will be marked.
6.2 Late submissions
In accordance with University policy, these penalties apply when written work is submitted after
11:59pm on the due date:
• Deduction of 5% of the maximum mark for each calendar day after the due date.
• After ten calendar days late, a mark of zero will be awarded.
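As a rough illustration of the penalty arithmetic above, the function below deducts 5% of the maximum mark per calendar day late. It is a sketch under stated assumptions: the function name is hypothetical, the number of days late is supplied as a whole number, and a submission exactly ten days late is taken to receive the 50% deduction rather than zero, which the policy wording does not settle.

```python
def late_penalised_mark(raw_mark: float, days_late: int, max_mark: float = 100.0) -> float:
    """Apply a deduction of 5% of max_mark per calendar day late."""
    if days_late <= 0:
        return raw_mark  # submitted on time
    if days_late > 10:
        return 0.0  # more than ten days late (boundary is an assumption)
    deduction = max_mark * days_late * 5 / 100
    return max(0.0, raw_mark - deduction)


# Example: a report marked 80/100 and submitted two calendar days late.
example = late_penalised_mark(80.0, 2)
print(example)
```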
7 Academic Integrity
We take academic integrity issues seriously in QBUS6810. If you are suspected of dishonest behaviour,
you will be referred to the Academic Integrity Office, which will process your case. This may result in
delayed results, mark reduction, failure of the unit or expulsion.
Please refer to University policy for more details.