Linear Models Assignment Writing - QBUS 6810

Date: 2022-10-28

HD EDUCATION

QBUS 6810
Assignment Extension Class

TUTOR: Burger

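Course characteristics and study methods

Course characteristics:
1. QBUS6810 extends BUSS6002, adding many new models on top of the regression models covered there.
2. It is also a prerequisite for QBUS6850, laying the theoretical and coding groundwork for later study.
3. It is one of the flagship business analytics courses at USYD, and is very helpful for students who want to become
business analysts or to work in data analysis and risk management at financial institutions.

Study methods:
1. There is maths, but it is not hard, so don't be afraid. Focus on understanding; rote memorisation does little.
2. The course is dense, so pay attention every week. The topics build on one another and the first few weeks are the
foundation; if you fall behind early, it is hard to keep up later. Don't leave everything until just before the exam!
3. Summarise the key points and common mistakes yourself, and get your reasoning straight.
4. Resolve questions every week where possible: ask your tutor in the tutorial, by email, or on Ed.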

1. Key information
• Kaggle competition ends: 11:59pm 4th Nov 2022 (Friday of Week 13)
• Report and notebook due date: 11:59pm 6th Nov 2022 (Sunday of Week 13)
• Weight: 30% of your final grade
• Simple extensions: Simple extensions cannot be used for group work. More information on simple extensions
is given here.
• Special consideration: If you need to apply for special consideration, you can do so by following the links on
the special considerations page.

2. Forming groups
The assignment is to be completed in groups of up to 5 students. Groups can be formed across different tutorials and
across RE and CC streams. Please make sure that you have registered your group on Canvas: those groups will be used
for identification and assessment purposes. Each group has at most 5 students; remember to register your group on Canvas.
You are ultimately responsible for forming your own groups. If you would like to be randomly allocated to a group,
please contact qbus6810.admin@sydney.edu.au as early as possible, otherwise you may find yourself without a group.
Additionally if you are a small group and would like new members to be randomly allocated to your group, please contact
qbus6810.admin@sydney.edu.au. Groups are expected to be finalised by Sunday the 16th of October.
3. Assignment overview
Airbnb is a global platform that runs an online marketplace for renting and leasing short-term lodging. It is interested
in developing a pricing service for its users that will compute a recommended price based on the features of a listing.
As a consultant working for a data analytics company, you are approached by Airbnb to develop a model for
predicting nightly prices of Airbnb listings based on state-of-the-art techniques from statistical learning. The goal of your
analytics team is to predict the price per night of listings for properties in Sydney, Australia. Such information can be
used to estimate the prices of new listings or to guide new hosts in advertising their properties. Airbnb can also use
the information to identify which of their listings produce the most profit.
You are provided with a training dataset containing detailed information on a number of existing Airbnb listings in
Sydney. As part of the contract, you are asked to write a report according to the instructions given in Section 3.2.
4. Data overview
Data description
The data correspond to Airbnb listings in Sydney with each row corresponding to a single listing. You have been
provided with a subset of the original dataset.
As a consequence of using real data scraped from Airbnb, a detailed description of all the variables is not available.
However, the names of the variables are generally self-explanatory. An incomplete data dictionary can be found on
Canvas.
The first column in the data provides an identifier for each listing and is included to comply with the Kaggle format. It
should not be used as a predictor in the analysis. The response variable, price, is the second column in the training
dataset. It gives the price per night for each listing in Australian Dollars (AUD). Variables latitude and longitude specify
the geographic location of each property. Several variables are Boolean, with true recorded as ‘t’ and false recorded
as ‘f’. Some of the listings have missing values under some of the variables. Note that in many cases a missing value
means that the corresponding characteristic does not apply to that particular Airbnb listing. This is information, rather
than lack of information, which you could make use of in your analysis.
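As a minimal sketch of the two data quirks described above, the Python snippet below converts 't'/'f' strings to 1/0 and records missingness as its own indicator column, so that "does not apply" is kept as information. The column names `host_is_superhost` and `review_scores_rating` and the helper `encode_booleans_and_missing` are hypothetical, chosen only for illustration; check the data dictionary on Canvas for the actual variable names.

```python
import numpy as np
import pandas as pd


def encode_booleans_and_missing(df, bool_cols, flag_cols):
    """Map 't'/'f' strings to 1/0 and add missing-value indicator columns."""
    out = df.copy()
    for col in bool_cols:
        # 't' -> 1.0, 'f' -> 0.0; any other value (including NaN) becomes NaN
        out[col] = out[col].map({"t": 1.0, "f": 0.0})
    for col in flag_cols:
        # 1 where the original value was missing, 0 otherwise
        out[col + "_missing"] = df[col].isna().astype(int)
    return out


# Toy data with hypothetical column names (not the assignment dataset)
toy = pd.DataFrame({
    "host_is_superhost": ["t", "f", np.nan],
    "review_scores_rating": [4.8, np.nan, 4.5],
})
encoded = encode_booleans_and_missing(
    toy,
    bool_cols=["host_is_superhost"],
    flag_cols=["review_scores_rating"],
)
print(encoded)
```

Whether to impute the remaining NaN values, and with what, is a separate modelling decision that the report should justify.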
5. Report requirements
Written report
The purpose of the report is to describe, explain, and justify your solution to the client. You can assume that the client
is trained in business analytics but is not an expert in statistical learning.
Your report should be a maximum of 15 pages (single spaced, 11pt font). Note that the cover page, reference list and
appendix do not count towards the page limit.
o Suggested outline of the report
1. Introduction
2. Data processing
3. Exploratory data analysis
4. Feature engineering
5. Methodology
6. Validation and comparisons
7. Conclusion
More detailed information is provided in the report scaffold, which you can download from Canvas. Additionally, a
guide for the page length is provided in the marking rubric.
6. Assignment requirements
Requirements
1. Your report must provide the validation scores (those from the Public Leaderboard on Kaggle) for five
different sets of predictions, including your final model. These should generally be your best performing
models within the model requirements specified below. You will need to make a submission on Kaggle (see
Section 5 for instructions) to get each validation score. The report must give the Kaggle competition scores for 5 models.
2. The five sets of predictions must come from different statistical learning methods. At least one of the five
models should be an interpretable linear model (OLS, Lasso, etc); at least one should be an interpretable
model specified by a single regression tree; at least one should be an advanced tree-based model (bagging,
random forests or boosting); and at least one should be a model stack (or model average). You need at least 5
models: one linear model, one basic tree model, one advanced tree model, one model stack, and a final one of your own choice.
3. In the methodology section you will discuss three of the five models in detail (including both the description
of the methods/algorithms and the interpretation of the estimated models). The remaining two models do
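not need to be discussed in detail (you can just provide one brief descriptive sentence for each of them). In the
report, describe 3 models in detail (introducing the model, explaining the model and its principles, and
interpreting the results); the other two need only a brief introduction.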
4. One of the three models that you discuss in detail must be your final model; one of the three models is
required to be an interpretable linear model (OLS, Lasso, etc); and one is required to be an interpretable
model specified by a single regression tree. Please note that the description of the methods/algorithms for
the three models should take up at most 3 pages. The 3 models you must discuss are: the linear model, the tree
model, and the final model you select on Kaggle. Their description can take at most 3 pages.
5. You must pay special attention to, and report on, the relationship between the location and the price, both
during the exploratory data analysis and during the model interpretation. You must comment on the patterns
in pricing around Sydney and its constituent suburbs. As part of feature engineering, you must create (and
describe in the report) at least one new location-related variable by using the existing variables and, if you
wish, external information. When doing the analysis, pay attention to the relationship between location and price, and create a brand-new variable as well.
6. You are expected to hold at least three group meetings during the course of the assignment. You will need to
take meeting minutes as outlined in the appendix of the assignment template.
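
The model categories in requirement 2 can be sketched with scikit-learn as follows. This is only an illustration on synthetic data: the estimator choices, hyperparameters, and the 70/30 hold-out split are assumptions, not values prescribed by the assignment, and a real solution would tune each model on the Airbnb features.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the listing features and prices
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    # interpretable linear model
    "lasso": Lasso(alpha=1.0),
    # interpretable single regression tree (kept shallow so it can be read)
    "single_tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    # advanced tree-based models
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "boosting": GradientBoostingRegressor(random_state=0),
    # model stack: base learners combined by a linear meta-learner
    "stack": StackingRegressor(
        estimators=[
            ("lasso", Lasso(alpha=1.0)),
            ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
        ],
        final_estimator=LinearRegression(),
        cv=5,
    ),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_val)

# A plain model average is the simpler alternative to stacking
average = np.mean(
    [predictions["lasso"], predictions["random_forest"]], axis=0
)
print({name: pred[:3].round(1) for name, pred in predictions.items()})
```

Each entry in `predictions` corresponds to one set of predictions that could be submitted to Kaggle for a validation score.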

1. Intro (0.5,5)
This is the introduction. Write a few paragraphs stating the business problem, how your report addresses the problem
and a brief summary of your results. Use plain English and avoid technical language as much as possible in this section
(it should be intended for a wide audience).
If you reference external sources make sure you cite them in your report. For example, you may refer to the textbook,
Introduction to Statistical Learning (James et al., 2013).

• A detailed summary of the business context is provided alongside a complete description of the business
problem.
• A thorough description of the contents and aim of the report is provided and describes how the report will
address the business problem.
• A concise description of the results contained in the report is provided.
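Explain what your report does: which problem your project solves for the client (that is, the problem the teaching
staff set us) and the aim of the report, together with the main methods used and the most important results.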
2. Data processing (1,5)
• A concise and complete description of the data that is provided.
• A concise and complete description of how the data is processed is provided.
• A thorough description of how missing values are handled is provided, and the way in which missing values
are handled is appropriate
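You can describe the rough make-up of the data, for example integer, string and float columns, plus any date columns.
Then explain how you processed the data, how you handled missing values, and why you did it that way.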

3. EDA (4-5,10)
In this section you will need to describe your exploratory data analysis (EDA) process. You should provide key information
about the data, discuss potential issues, and highlight interesting and important facts about the data and the
relationships among the variables that are useful for the rest of your analysis.

You should study key variables individually and pairwise using appropriate figures and descriptive statistics. You will
need to note any features of the data that are relevant to model building. You should note the presence of outliers and
any other anomalies that can affect your analysis. It is important that you explain how your EDA relates to subsequent
feature engineering and modelling.

It is likely you will need to include figures in this section. Please make sure that all the text in your figure is readable. For
every figure that you include you should include a figure caption, which provides a brief summary of the figure and you
should refer to the figure in your report text where you provide a more detailed analysis. For example, Figure 1 shows
the due dates for the assignment. It’s important to remember that the Kaggle competition closes 2 days before your
reports are due. This is to give you extra time to work on your reports.

You should be selective with what you include in your EDA, only including results that are relevant to the report. Learning
to recognise what is important is an essential skill. Thus, we have left it to you to formulate the structure and section
headings.
Don’t forget that somewhere in this section you will need to report on the relationship between location and price.
• A thorough description of the EDA process is included and has been carefully curated to present a selection of
key results. The EDA studies key variables individually and pairwise using appropriate figures and descriptive
statistics.
• The EDA clearly identifies features in the data that are relevant to model building.
• The EDA clearly identifies outliers or anomalies that can affect analysis and provides a detailed description of
how these are handled.
• The report details the relevance of the EDA results to subsequent feature engineering and model building.
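You don't need to cover everything; pick a few key findings. You can split the analysis into univariate and bivariate
parts, and you need to include figures. What effect do the outliers have, and how do you deal with them? What do
these findings imply for, or suggest about, the later model building?

Note: if you include a figure, you must analyse it! Don't use the wrong analysis technique or draw the wrong kind of plot, and remember to discuss the relationship between location and price!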
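
One common way to flag outliers during EDA is the 1.5 × IQR rule, sketched below on toy prices. The rule, the threshold `k=1.5`, and the log transform are illustrative assumptions rather than anything the assignment prescribes; `flag_iqr_outliers` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def flag_iqr_outliers(values, k=1.5):
    """Return a boolean Series marking points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy nightly prices in AUD (not taken from the assignment data)
prices = pd.Series([80, 100, 120, 150, 200, 5000], name="price")
is_outlier = flag_iqr_outliers(prices)

# A log transform is one option for taming a long right tail
log_price = np.log1p(prices)

print(pd.DataFrame({"price": prices, "is_outlier": is_outlier, "log_price": log_price}))
```

Flagging is only the first step; whether to drop, cap, or keep flagged listings should be justified in the report.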

4. Feature Engineering (2,5)
In this section you should describe and justify your feature engineering process. Your choices need to be justified by
data analysis, domain knowledge, logic and trial and error (if necessary). Data-driven choices are better than
opinion-based choices. One of your engineered features must be a new location-related variable.
Don’t forget that any feature engineering you perform on the training data must also be performed on the test data!

• A complete and thorough description of the feature engineering process is provided. It is clear that
substantial effort has gone into designing new features, which have been appropriately engineered.
• Choices made during the feature engineering process are well justified.
• A new location-related variable has been created using existing variables and is well justified.
Next, how did you do the encoding? What purpose do the new features serve? Create a location-related variable and
explain why you built it that way.
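
One possible location-related variable is the great-circle distance from each listing to the Sydney CBD, computed from the existing `latitude` and `longitude` columns. The sketch below assumes an approximate CBD reference point of (-33.8688, 151.2093) and uses a hypothetical column name `dist_to_cbd_km`; neither comes from the assignment brief. Applying the same function to both frames keeps the training and test features consistent.

```python
import numpy as np
import pandas as pd

SYDNEY_CBD = (-33.8688, 151.2093)  # assumed reference point (approximate)
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat, lon, ref_lat, ref_lon):
    """Great-circle distance in km between (lat, lon) and a reference point."""
    lat, lon = np.radians(lat), np.radians(lon)
    ref_lat, ref_lon = np.radians(ref_lat), np.radians(ref_lon)
    dlat = lat - ref_lat
    dlon = lon - ref_lon
    a = np.sin(dlat / 2) ** 2 + np.cos(lat) * np.cos(ref_lat) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def add_distance_to_cbd(df):
    """Return a copy of df with a new dist_to_cbd_km column."""
    out = df.copy()
    out["dist_to_cbd_km"] = haversine_km(
        out["latitude"], out["longitude"], *SYDNEY_CBD
    )
    return out


# Toy coordinates: the reference point itself and a point near Bondi Beach
train = pd.DataFrame({"latitude": [-33.8688, -33.8908],
                      "longitude": [151.2093, 151.2743]})
test = pd.DataFrame({"latitude": [-33.8000], "longitude": [151.2800]})

# The same transformation is applied to both training and test data
train = add_distance_to_cbd(train)
test = add_distance_to_cbd(test)
print(train)
```

Other candidates, such as distance to the nearest beach or a suburb-level price summary, would follow the same pattern but need external information or careful handling to avoid leaking the response.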

5. Methodology (4-5,30)
Here you will focus on describing three of your models as outlined under the Requirements section of the assignment
description and double check the requirements against the methodology section in the marking rubric. This should
include enough detail that a data scientist could reproduce your models. For example, you need to identify which
predictors and what (if any) regularisation methods you are using. You should also provide a very brief description of
the other two models.

You must clearly describe and justify the models, methods and algorithms in your analysis. You should include your
rationale for choosing the models and why they make sense for the data. The construction of your models may involve
systematic trial and error, but your report should focus on your final models.

You must provide interpretations of the estimated models (in the context of the business problem). You should also
report crucial assumptions (in Section 4.5) and whether they are potentially violated.

The description of the methods and algorithms can be more technical than the rest of the report. However, you are not
to include code. Please use your own words in the description. You may want to create different subsections under which
you describe each of your models. Note that the description of the methods/algorithms for the three models should
take up at most 3 pages.

• The choice of models is clear and well justified.
• A correct and detailed description of the fitting process for model 1 is provided.
• A correct and detailed interpretation of model 1 with reference to the business context is provided.
• A correct and detailed description of the fitting process for model 2 is provided.
• A correct and detailed interpretation of model 2 with reference to the business context is provided.
• A correct and detailed description of the fitting process for model 3 is provided.
• A correct and detailed interpretation of model 3 with reference to the business context is provided.
• Brief description of model 4 is provided.
• Brief description of model 5 is provided.
• Crucial assumptions are correctly discussed with reference to the business context.

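You need to describe three of the five models in detail (linear model, tree model / advanced tree model, model
average); a brief introduction is enough for the other two. The focus is on the models: which one performs better
and why. Explain to your audience what each model does, how it works, how its parameters are obtained, and whether
they can be trusted. Don't make it too mathematical, but it is advisable to include the model's form (its
mathematical expression) and to convey your own understanding of the model. What assumptions does the model make,
and are any of them violated in building it?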

6. Model validation and comparisons (0.5-1,7)
You may want to provide your results in a table. For example, Table 1 provides the RMSE for
both the training data and the validation data.
You should discuss whether your validation results are consistent with your expectations for each model. For example,
do more complex models perform better than your linear regression model?
In this section you should also provide a comparison of the performance of your models. You should compare the
performance of your models on the raw validation scores, and also compare the interpretability and complexity of each
model. Additionally, comment on the limitations of each model. Remember to relate everything back to the business
context.

• The report includes training and validation scores (which have been correctly interpreted from Kaggle).
• A detailed and insightful comparison of the models is provided with reference to the business context.
• A detailed and insightful analysis on the limitations of the models is provided with reference to the business
context.
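You could use a table to show each model's predictive performance and compare the models: where is each one suitable? How do their assumptions compare? What are each model's weaknesses?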
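
A table of training and validation RMSE can be assembled along the following lines. This is a sketch on synthetic data with a local hold-out split standing in for the Kaggle validation score; the two models and all settings are assumptions chosen to show how a large gap between training and validation RMSE signals overfitting.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Synthetic stand-in for the listing data
X, y = make_regression(n_samples=400, n_features=6, noise=20.0, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=1)

candidates = {
    "linear regression": LinearRegression(),
    # a fully grown tree memorises the training set
    "unpruned tree": DecisionTreeRegressor(random_state=1),
}

rows = []
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    rows.append({
        "model": name,
        "train_rmse": rmse(y_tr, model.predict(X_tr)),
        "validation_rmse": rmse(y_va, model.predict(X_va)),
    })

results = pd.DataFrame(rows).set_index("model")
print(results.round(2))
```

In the report, the validation column would instead hold the Public Leaderboard scores, as the requirements specify.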

7. Conclusion (0.5,3)
• A brief and concise summary of the work contained in the report is provided and includes all key findings.
• The conclusion clearly relates to the business context.
What business conclusions did you reach?

The report also needs to include: References, Statement of Contribution, meeting minutes

7. Kaggle competition

You will participate in the Kaggle competition that will be run on www.kaggle.com. This competition will allow you to
incorporate feedback into your model building process and compare your performance with that of other groups.
Participation in the competition is part of the assessment, so please make sure that your final submission is correct. Your
ranking in the competition will affect your mark.

You will need to create a Kaggle account, identifiable by your name, to access the competition and make submissions. Please
note that you can significantly simplify your registration with Kaggle by using social logins (Facebook, Yahoo, Google) to sign
in. Those options are available on the Kaggle sign-in page. After you have created an account and logged into Kaggle, you
should be able to access the competition here (you need to be logged in to get to the competition page via the link). For
convenience, this link is also on the Canvas Assignment page.

On this page you will click on the ‘Join Competition’ link, located in a dark box near the top right corner of the page. After you
accept the competition rules, you will have joined the Kaggle competition for the group project. Each group will need to create
a team on Kaggle. The group leader can create a team by joining the competition and then going into the ‘Team’ tab, which
will appear near the top of the competition page. The leader can then invite other group members using their Kaggle names
(they need to first join the competition before they are able to be invited). Kaggle team composition must be identical to that
of the groups you formed on Canvas, and the team number must match the group number. Each student in the group is
required to sign up and be identifiable as a member of a Kaggle team. Every student must create an account and join their own group's team!

Kaggle randomly splits (just once) the listings in the test.csv file into validation (50%) and test (50%) cases, but you will not
know which ones are which. When you make a submission during the Kaggle competition, you get a score equal to the RMSE
computed on the validation listings. These scores are displayed on the ‘Public Leaderboard’ and provide an ongoing ranking
of teams. You can use the scores of your submissions to help you select the best predictive model.

You will need to manually select one of your Kaggle submissions to be used as your final model at the end of the competition.
Once the competition is over, Kaggle will rank teams’ final submissions based on the test cases only, and those will be
displayed on the ‘Private Leaderboard’. Your goal is to do as well as possible on the Private Leaderboard at the end of the
competition, so please be careful not to overfit the validation cases in an attempt to improve your public ranking. Please note
that the competition ends at 11:59pm on the 4th of November, which is exactly 2 days before the due time for the assignment
report.
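
The public/private mechanism described above can be mimicked with a small simulation. The sketch below uses simulated prices and predictions and an arbitrary random seed; the actual partition Kaggle uses is unknown to participants, so this only illustrates why the two leaderboard scores for the same submission generally differ.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))


rng = np.random.default_rng(0)
n = 1000

# Simulated nightly prices and predictions (toy data, not real listings)
y_true = rng.uniform(50, 500, size=n)
y_pred = y_true + rng.normal(0, 40, size=n)

# One fixed random 50/50 split, mimicking the validation/test partition
order = rng.permutation(n)
public_idx, private_idx = order[: n // 2], order[n // 2:]

public_rmse = rmse(y_true[public_idx], y_pred[public_idx])
private_rmse = rmse(y_true[private_idx], y_pred[private_idx])

print(f"public (validation) RMSE: {public_rmse:.2f}")
print(f"private (test) RMSE:      {private_rmse:.2f}")
```

Because the two halves are different samples, repeatedly tuning a model to lower the public score can fit noise in that half without improving the private score.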

8. Submission
Submission details
• Written report (one .pdf file per group)
• Jupyter notebook (one .ipynb notebook per group)
Your report and notebook files should be named:
• QBUS6810 GroupXXX report.pdf
• QBUS6810 GroupXXX notebook.ipynb
where XXX is your group number. For example, if you were group 32, this would be Group032. Your assignment should be
submitted on Canvas. To find the submission page go to Modules: Group Assignment. You may submit multiple times but only
your last submission will be marked.