QBUS2820 Assignment 1: Predicting Restaurant Revenue

1) Background

You are a data analyst for a national quick-service restaurant (QSR) chain. Leadership needs reliable daily revenue forecasts for each outlet to support staffing, inventory, and promotional decisions. Your task is to build a regression model that predicts Revenue for each outlet-day, using the features described below.

2) Data Provided

You will be given three CSV files (two to students, one used only by the marker):
• Training.csv: labeled training data containing the target column Revenue.
• Test_noLabel.csv: the same columns without Revenue; generate predictions for these rows.
• Test.csv: the same rows as Test_noLabel.csv but with Revenue; provided only to the marker for evaluation.

3) Variables

Target variable
• Revenue (Continuous): daily outlet revenue, measured in thousands of dollars.

Features
• OutletID (Identifier): unique outlet identifier (not a feature unless engineered).
• Date (Date): calendar date of observation.
• Month (Integer, 1–12): month number.
• Weekday (Binary): 1 if Monday–Friday; 0 if Saturday–Sunday.
• Downtown (Binary): 1 if the outlet is located in the CBD; else 0.
• Mall (Binary): 1 if the outlet is inside or adjacent to a mall, airport, or campus; else 0.
• HighIncomeArea (Binary): 1 if the catchment area is high income; else 0.
• OfficesNearby (Numeric): nearby office density (roughly a "hundreds of offices" scale).
• CompetitorsNearby (Numeric): number of nearby competing QSRs (~5 km radius).
• Promo (Binary): 1 if a promotion was active that day; else 0.
• EventNearby (Binary): 1 if a local event occurred near the outlet that day; else 0.
• Rain_mm (Numeric): daily rainfall in millimetres (mm).
• LagHigh (Binary): 1 if the previous month's mean revenue for this outlet exceeded the global median; else 0.

Important: The dataset contains realistic complications: a small number of outliers (in some explanatory features as well as the response Revenue) and missing values.
Your preprocessing should handle these appropriately (e.g. imputation, robust choices, and sensible feature engineering). The dataset has been 'anonymized' or obfuscated. This process may alter the realism or the strength of the variable relationships; therefore, prioritize statistical principles over domain knowledge when making modelling decisions (e.g. do not assume a variable is or is not influential based purely on 'common sense', but pay attention to variables that may make the modelling spurious).

4) Your Tasks

1. Exploratory Data Analysis (EDA): Explore distributions, identify outliers, check missingness patterns, and examine relationships with the target. Be careful with modelling decisions made at the exploratory stage; your goal should be to maximize predictive accuracy.
2. Modeling: Fit appropriate regression models to predict Revenue. You may compare multiple methods covered in class (e.g. linear/regularized models, KNN, variable transformations). Use only methods covered in class. Keep model selection consistent (e.g. compare on the same metric and scale).
3. Prediction file: Generate predictions for Test_noLabel.csv from your chosen model and save them as SID_Assignment1_prediction.csv with a single column named Revenue.
4. Reproducibility: Ensure your notebook runs end-to-end without errors when started from a clean Python session, and that the last cell prints the test MSE when Test.csv is present. Assume the training data is in the same folder as the notebook. The notebook should always produce the same results (remember to set a random seed). Running time should not be too long; aim for a maximum of 10 minutes (we will be flexible given hardware variability, just make sure it does not take much more than that). This may force you to make decisions based on theoretical tradeoffs to speed up the process.
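The tasks above can be sketched as a single scikit-learn pipeline that handles imputation inside cross-validation, fixes the random seed, and compares candidate models on the same metric (MSE). This is only one possible structure, not the required one; the synthetic frame below merely stands in for Training.csv, and the column subset and hyperparameters are illustrative assumptions.

```python
# Hedged sketch: preprocessing + model comparison with a fixed seed.
# The synthetic data stands in for Training.csv; column names follow
# the assignment's data dictionary, values are made up.
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)  # fixed seed for reproducibility
n = 200
train = pd.DataFrame({
    "Month": rng.integers(1, 13, n),
    "Weekday": rng.integers(0, 2, n),
    "Promo": rng.integers(0, 2, n),
    "Rain_mm": rng.exponential(5.0, n),
    "CompetitorsNearby": rng.integers(0, 10, n).astype(float),
})
# inject some missing values, as in the real data
train.loc[rng.choice(n, 10, replace=False), "Rain_mm"] = np.nan
train["Revenue"] = (20 + 3 * train["Promo"]
                    - 0.1 * train["Rain_mm"].fillna(0)
                    + rng.normal(0, 1, n))

X, y = train.drop(columns=["Revenue"]), train["Revenue"]

candidates = {
    "ridge": Ridge(alpha=1.0),
    "knn": KNeighborsRegressor(n_neighbors=5),
}
scores = {}
for name, model in candidates.items():
    pipe = Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # median is robust to outliers
        ("scale", StandardScaler()),                   # needed for KNN distances
        ("model", model),
    ])
    # same metric and scale for every candidate: cross-validated MSE
    mse = -cross_val_score(pipe, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    scores[name] = mse
best = min(scores, key=scores.get)
print(scores, "->", best)
```

Fitting the imputer and scaler inside the pipeline keeps the cross-validation honest (no information leaks from validation folds), which is one of the tradeoffs worth discussing in the report.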
5) Evaluation Metric

We will use Mean Squared Error (MSE) on the hidden test set (Test.csv) to evaluate the performance of your selected model. Your submitted notebook will be executed with Test.csv in the same folder to compute and print this MSE.

6) What to Submit

• SID_Assignment1_document.pdf: a clear report (≤ 15 pages, font size 12) describing EDA, modeling, selection rationale, and conclusions. Report numerical results to four decimal places. Focus on the important aspects, documenting the reasoning behind your decisions and the tradeoffs expected. Tradeoffs: every decision has pros and cons, and you should state them (e.g. using holdout validation with a certain validation size trades off accuracy of the error estimate against fidelity to the original training set size). Descriptions should be detailed enough for data analysts in your field to understand the process and the decisions made along the way.
• SID_Assignment1_implementation.ipynb: your Python notebook that produces the results in the report. Please make sure the notebook executes cleanly end-to-end (restart and run all).
• SID_Assignment1_prediction.csv: a single column named Revenue with predictions for Test_noLabel.csv, created from your selected model.

Last cell template (the marker will run this):

    import pandas as pd
    from sklearn.metrics import mean_squared_error

    QSR_test = pd.read_csv("Test.csv")   # provided by the marker
    y_true = QSR_test["Revenue"].values

    # YOUR CODE: load your trained pipeline/model here and predict:
    X_hidden = QSR_test.drop(columns=["Revenue"])
    # replace below with how you call the model; should reproduce your submitted csv
    y_pred = my_model.predict(X_hidden)

    # code below should run as is
    test_error = mean_squared_error(y_true, y_pred)
    print(test_error)

7) Additional info on formatting

• The report should be well-structured and easy to read; include clear figures/tables. Maximum 15 pages (including appendices).
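One way to produce SID_Assignment1_prediction.csv in the required shape (a single column named Revenue, no index column) is sketched below. The `_StandInModel` class and the small frame are placeholders so the snippet runs on its own; in the real notebook you would use your fitted pipeline and `pd.read_csv("Test_noLabel.csv")`.

```python
# Hedged sketch: writing the prediction file in the required format.
import pandas as pd

class _StandInModel:
    """Placeholder for the trained pipeline; returns a constant prediction."""
    def predict(self, X):
        return [21.5] * len(X)

my_model = _StandInModel()

# In the real notebook: X_new = pd.read_csv("Test_noLabel.csv")
X_new = pd.DataFrame({"Promo": [0, 1, 1], "Rain_mm": [0.0, 2.5, 11.0]})

pred = pd.DataFrame({"Revenue": my_model.predict(X_new)})
pred.to_csv("SID_Assignment1_prediction.csv", index=False)  # no index column
```

Passing `index=False` matters: the marker expects exactly one column, and pandas would otherwise write the row index as an extra unnamed column.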
• Make sure figures and tables are readable, have captions, and are referenced in the text. Do not rely on appendices for completeness; all main points should be in the body of the report.
• The report should not be the notebook converted to PDF; it should be a Word, LaTeX, or similar document containing the essential information.

8) Marking Criteria

This assignment is worth 25 marks in total, with 14 marks allocated to the content of the document.pdf and 11 marks to the Python implementation. The marking breakdown is as follows:

Prediction accuracy: Your test error will be compared against a baseline model developed by the teaching team.
• The marker first runs SID_Assignment1_implementation.ipynb.
• If the file runs smoothly and produces a test MSE, up to 11 marks will be awarded based on prediction accuracy relative to the baseline model (outperforming the baseline earns the full 11 marks; below the baseline, marks are deducted proportionally).
• If the marker cannot run SID_Assignment1_implementation.ipynb, or if no test MSE is produced, partial marks (maximum 3, meaning a potential loss of 8 marks) may be awarded based on the appropriateness of the file.

Report described in SID_Assignment1_document.pdf: Up to 14 marks are allocated based on:
• The appropriateness of the chosen prediction method.
• The detail, discussion, and explanation of your data analysis procedure.
• See the Marking Criteria for more details.

CSV file submission: Up to 2 marks will be deducted if you fail to upload the CSV file.

9) Final Notes

• If you believe there are errors in the assignment, please contact the teaching team as soon as possible. We encourage you to read the instructions carefully and to seek clarification early if you are unsure about any requirements.