
Date: 2021-06-02

BUSINESS SCHOOL

QBUS6810 Statistical Learning and Data Mining
Semester 1, 2021

Group Project: Airbnb Pricing Predictions

1. Key information

Required submissions:
• Team responsibilities outline (one pdf file per group; Canvas submission tool will be made available in Week 9; due by the end of the day on May 17)
• Written report (one pdf file per group)
• Kaggle predictions (via www.kaggle.com, please see Section 5 for more information)
• Python code (one file per group).

Submission instructions for the report and the code will be posted on Canvas in Week 12. The deadline for submitting the written report and the code is Friday, June 4 at 5PM.

Weight: 30% of your final grade.

Groups: Complete the assignment in groups of four or five students. Make sure to sign into your group on Canvas: those groups will be used for identification and assessment purposes.

Length: Your written report should have a maximum of 15 pages (single spaced, 11pt, cover page and references not counted towards the maximum).

Marking and key rules:
• A separately posted rubric indicates the marking criteria for the report.
• Please read the requirements for each part of the assignment carefully.
• Please follow any further instructions announced on Canvas, particularly for submissions.
• You must use Python for this assignment. It is OK to use Excel for data manipulation; however, this approach is generally not recommended due to its inefficiency.
• The predictions on Kaggle must come from your own analysis in Python. An examination of some of the code will be conducted for verification purposes.

2. Problem description

Airbnb (www.airbnb.com) is a global platform that runs an online marketplace for renting and leasing short-term lodging. It is interested in developing a pricing service for its users that will compute a recommended price based on the features of a listing.
As a consultant working for a data analytics company, you are approached by Airbnb to develop a model for predicting nightly prices of Airbnb listings based on state-of-the-art techniques from statistical learning. The focus of your analytics team is on the properties in Sydney, Australia. You are provided with a training dataset containing detailed information on a number of existing Airbnb listings in Sydney. As part of the contract, you are asked to write a report according to the instructions given below. The client will use a test set to evaluate your work.

3. Understanding the data

A training dataset (train.csv) and a test dataset (test.csv) are posted on Canvas (the same files are also posted on Kaggle). The test dataset omits the price values.

Data Description: Each row corresponds to a separate Airbnb listing in Sydney. As a consequence of using real data scraped from Airbnb, a detailed description of all the variables is not available. However, the names of the variables are self-explanatory.

The first column in the data provides an identifier for each listing and is included to comply with the Kaggle format. It should not be used as a predictor in the analysis.

The response variable, price, is the second column in the training dataset. It gives the price per night for each listing in Australian Dollars (AUD). Variables security_deposit, cleaning_fee and extra_people are also measured in AUD and correspond to surcharges. Variables latitude and longitude specify the geographic location of each property. Several variables are Boolean, with the word true recorded as “t” and false recorded as “f”.

Some of the listings have missing values under some of the variables. Note that in many cases a missing value means that the corresponding characteristic does not apply to that particular Airbnb listing. This is information, rather than lack of information, and you could make use of this information in your analysis.

4. Written report

The purpose of the report is to describe, explain, and justify your solution to the client. You can assume that the client is trained in business analytics but is not an expert in statistical learning.

Suggested outline of the report:
1. Introduction: write a few paragraphs stating the business problem, summarising your final solution and highlighting your key insights. Use plain English and avoid technical language as much as possible in this section (it should be for a wide audience).
2. Data processing and exploratory data analysis: provide key information about the data, discuss potential issues, and highlight interesting and important facts about the data and the relationships among the variables that are useful for the rest of your analysis.
3. Feature engineering: describe and justify your process of feature engineering.
4. Methodology (model building): here you will focus on the three models as outlined below (your rationale for choosing the models and why they make sense for the data, description of how these models are fitted, interpretations of the estimated models in the context of the business problem at hand). The description of the methods and algorithms can be more technical than the rest of the report (however, please use your own words in the description).
5. Validation scores from Kaggle (see requirements below) and comparison of the models.
6. Conclusions and final remarks (non-technical).

Requirements:
• Your report must provide the validation scores (those from the Public Leaderboard on Kaggle) for five different sets of predictions, including your final model. These should generally be your best performing models within the model requirements specified below. You will need to make a submission on Kaggle (see Section 5 for instructions) to get each validation score.
• The five sets of predictions should come from different statistical learning methods. At least one of the five models should be an interpretable linear model (OLS, Lasso, etc.); at least one should be an interpretable model specified by a single regression tree; at least one should be an advanced tree-based model (bagging, random forests or boosting); and at least one should be a model stack (or model average).
• In the methodology section you will discuss three of the five models in detail (including both the description of the methods/algorithms and the interpretation of the estimated models). The remaining two models do not need to be discussed in detail (you can just provide one brief descriptive sentence for each of them).
• One of the three models that you discuss in detail must be your final model; one of the three models is required to be an interpretable linear model (OLS, Lasso, etc.); and one is required to be an interpretable model specified by a single regression tree. Please note that the description of the methods/algorithms for the three models should take up at most 3 pages.
• You will pay special attention to and report on the relationship between the location and the price, both during the exploratory data analysis and during the model interpretation. You will comment on the patterns in pricing around Sydney and its constituent suburbs. As part of feature engineering, you will create (and describe in the report) at least one new location-related variable by using the existing variables and, if you wish, external information.

5. Kaggle Competition

You will participate in the Kaggle competition that will be run on www.kaggle.com. This competition will allow you to incorporate feedback into your model building process and compare your performance with that of other groups. Participation in the competition is part of the assessment, so please make sure that your final submission is correct.
Your ranking in the competition will typically not directly affect your marks (apart from the bonus marks and the benchmark requirement, as explained below). However, we will assess whether your participation represents a genuine effort to make good predictions and improve them (please make sure to beat the “Benchmark” score on the Public Leaderboard).

You will need to create a Kaggle account, identifiable by your name, to access the competition and make submissions. Please note that you can significantly simplify your registration with Kaggle by using social logins (Facebook, Yahoo, Google) to sign in. Those options are available on the Kaggle sign-in page.

After you have created an account and logged into Kaggle, use the following link to get to the competition page (you need to be logged in to get to the competition page via the link):

https://www.kaggle.com/t/932020c58110783854baf5a0f6931377

On this page, click the “Join Competition” link, located in a dark box near the top right corner of the page. After you accept the competition rules, you will have joined the Kaggle competition for the group project.

Each group will need to create a team on Kaggle. The group leader can create a team by joining the competition and then going into the “Team” tab, which will appear near the top of the competition page. The leader can then invite other group members using their Kaggle names (they need to join the competition before they can be invited). Kaggle team composition must be identical to that of the groups you formed on Canvas, and the team number must match the group number. Each student in the group is required to sign up and be identifiable as a member of a Kaggle team.

Kaggle randomly splits (just once) the listings in the test.csv file into validation (30%) and test (70%) cases, but you will not know which ones are which.
When you make a submission during the Kaggle competition, you get a score equal to the RMSE computed on the validation listings. These scores are displayed on the “Public Leaderboard” and provide an ongoing ranking of teams. You can use the scores of your submissions to help you select the best predictive model.

You will need to manually select one of your Kaggle submissions to be used as final at the end of the competition. Once the competition is over, Kaggle will rank teams’ final submissions based on the test cases only, and those will be displayed on the “Private Leaderboard”. Your goal is to do as well as possible on the Private Leaderboard at the end of the competition, so please be careful not to overfit the validation cases in an attempt to improve your public ranking.

Please note that the competition ends at 4PM on June 4, which is exactly 1 hour before the due time for the assignment report.

Real world relevance: The ability to perform in a Kaggle competition is highly valued by employers. Some employers go as far as to set up a Kaggle competition just for recruitment.

Bonus marks: The five teams with the best performance on the Private Leaderboard will receive bonus marks for the assignment (with the total Group Project score capped at 100). The best performing team will receive 10 bonus marks, the second team will get 8 marks, the third will get 6 marks, the fourth will get 4 marks, and the fifth will get 2 marks.

Please note that your choice of the final model must be well justified in the report, and the corresponding Kaggle predictions must come from your own analysis in Python. An examination of the code will be conducted for verification purposes. Your code is required to reproduce the winning Kaggle predictions.
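The scoring described in Section 5 can be illustrated with a small sketch. It computes the RMSE of one set of predictions on a hidden 30% "public" subset and on the remaining 70% "private" subset. The helper names (`rmse`, `split_scores`), the seed, and the synthetic prices are assumptions made for illustration; Kaggle's actual partition is not known to participants.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def split_scores(y_true, y_pred, public_fraction=0.3, seed=0):
    """Score one submission on a fixed, hidden public/private partition.

    The partition is drawn once from a seeded generator to mimic a split
    the participant cannot see. The seed is an arbitrary assumption.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    order = np.random.default_rng(seed).permutation(len(y_true))
    n_public = int(round(public_fraction * len(y_true)))
    public_idx, private_idx = order[:n_public], order[n_public:]
    return (rmse(y_true[public_idx], y_pred[public_idx]),
            rmse(y_true[private_idx], y_pred[private_idx]))


# Synthetic prices and noisy predictions, for illustration only.
rng = np.random.default_rng(42)
actual = rng.uniform(50, 500, size=1000)
predicted = actual + rng.normal(0, 40, size=1000)

public_rmse, private_rmse = split_scores(actual, predicted)
print(f"public RMSE:  {public_rmse:.2f}")
print(f"private RMSE: {private_rmse:.2f}")
```

The two scores generally differ even for the same predictions, which is why tuning a model against the public score alone risks overfitting the validation cases.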

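The data conventions in Section 3 and the location requirement in Section 4 can be sketched as follows. This is a minimal illustration on an invented toy table, not a prescribed pipeline. The identifier column name `Id`, the Boolean column `instant_bookable`, the choice to fill missing surcharges with 0, and the CBD reference coordinates are all assumptions; only `price`, `security_deposit`, `cleaning_fee`, `latitude` and `longitude` are named in the brief.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv; all values and two column names are invented.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "price": [120.0, 85.0, 300.0],
    "security_deposit": [200.0, np.nan, 500.0],
    "cleaning_fee": [50.0, np.nan, 120.0],
    "instant_bookable": ["t", "f", "t"],  # hypothetical Boolean column
    "latitude": [-33.8688, -33.8915, -33.7969],
    "longitude": [151.2093, 151.2767, 151.2870],
})

SURCHARGES = ["security_deposit", "cleaning_fee"]
CBD_LAT, CBD_LON = -33.8688, 151.2093  # approximate Sydney CBD (assumed)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))


def prepare(df):
    """Apply the Section 3 conventions and add one location feature."""
    # The first column is an identifier and is not used as a predictor.
    out = df.drop(columns=[df.columns[0]])

    # Recode columns that contain only "t"/"f" as 1/0.
    for col in list(out.columns):
        values = set(out[col].dropna().unique())
        if values and values <= {"t", "f"}:
            out[col] = out[col].map({"t": 1, "f": 0})

    # Treat a missing surcharge as information: keep a flag, then fill
    # with 0 (assumes "missing" means the surcharge does not apply).
    for col in SURCHARGES:
        out[f"{col}_missing"] = out[col].isna().astype(int)
        out[col] = out[col].fillna(0.0)

    # One possible location-related variable: distance to the CBD.
    out["dist_to_cbd_km"] = haversine_km(
        out["latitude"], out["longitude"], CBD_LAT, CBD_LON)
    return out


prepared = prepare(toy)
print(prepared)
```

Keeping the `_missing` flags alongside the filled values lets a model distinguish "no surcharge applies" from a surcharge that happens to be zero. Distance to the CBD is only one candidate; suburb-level or coastline-based variables would satisfy the same requirement.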