QBUS2820 Assignment 1: Predicting Restaurant Revenue

1) Background

You are a data analyst for a national quick-service restaurant (QSR) chain. Leadership needs reliable daily revenue forecasts for each outlet to support staffing, inventory, and promotional decisions. Your task is to build a regression model that predicts Revenue for each outlet-day, using the features described below.

2) Data Provided

You will be given three CSV files (two to students, one used only by the marker):
• Training.csv: labeled training data containing the target column Revenue.
• Test_noLabel.csv: the same columns without Revenue; generate predictions for these rows.
• Test.csv: the same rows as Test_noLabel.csv but with Revenue; provided only to the marker for evaluation.

3) Variables

Target variable
• Revenue (Continuous): daily outlet revenue, measured in thousands of dollars.

Features
• OutletID (Identifier): unique outlet identifier (not a feature unless engineered).
• Date (Date): calendar date of observation.
• Month (Integer, 1–12): month number.
• Weekday (Binary): 1 if Monday–Friday; 0 if Saturday–Sunday.
• Downtown (Binary): 1 if the outlet is located in the CBD; else 0.
• Mall (Binary): 1 if the outlet is inside or adjacent to a mall, airport, or campus; else 0.
• HighIncomeArea (Binary): 1 if the catchment area is high income; else 0.
• OfficesNearby (Numeric): nearby office density (roughly a "hundreds of offices" scale).
• CompetitorsNearby (Numeric): number of nearby competing QSRs (~5 km radius).
• Promo (Binary): 1 if a promotion was active that day; else 0.
• EventNearby (Binary): 1 if a local event occurred near the outlet that day; else 0.
• Rain_mm (Numeric): daily rainfall in millimetres (mm).
• LagHigh (Binary): 1 if the previous month's mean revenue for this outlet exceeded the global median; else 0.

Important: The dataset contains realistic complications: a small number of outliers (in some explanatory features as well as the response Revenue) and missing values.
Your preprocessing should handle these appropriately (e.g. imputation, robust choices, and sensible feature engineering). The dataset has been 'anonymized' or obfuscated. This process may alter the realism or the strength of the variable relationships; therefore, prioritize statistical principles over domain knowledge when making modelling decisions (e.g. do not assume a variable is or is not influential based purely on 'common sense', but pay attention to variables that may make the modelling spurious).

4) Your Tasks

1. Exploratory Data Analysis (EDA): Explore distributions, identify outliers, check missingness patterns, and examine relationships with the target. Be careful with modelling decisions made at the exploratory stage; your goal should be to maximize predictive accuracy.
2. Modeling: Fit appropriate regression models to predict Revenue. You may compare multiple methods covered in class (e.g. linear/regularized models, KNN, variable transformations). Use only methods covered in class. Keep model selection consistent (e.g. compare on the same metric and scale).
3. Prediction file: Generate predictions for Test_noLabel.csv from your chosen model and save them as SID_Assignment1_prediction.csv with a single column named Revenue.
4. Reproducibility: Ensure your notebook runs end-to-end without errors when started from a clean Python session, and that the last cell prints the test MSE when Test.csv is present. Assume the training data is in the same folder as the notebook. The notebook should always produce the same results (remember to set a random seed). Running time should not be too long; aim for a maximum of 10 minutes (we will be flexible given hardware variability, just make sure it does not take much more than that). This may force you to make decisions based on theoretical tradeoffs to speed up the process.
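The tasks above can be sketched as a single scikit-learn pipeline that handles imputation inside cross-validation, fixes the random seed, and compares candidate models on the same metric (MSE). This is only one possible structure, not the required one; the synthetic frame below merely stands in for Training.csv, and the column subset and hyperparameters are illustrative assumptions.

```python
# Hedged sketch: preprocessing + model comparison with a fixed seed.
# The synthetic data stands in for Training.csv; column names follow
# the assignment's data dictionary, values are made up.
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)  # fixed seed for reproducibility
n = 200
train = pd.DataFrame({
    "Month": rng.integers(1, 13, n),
    "Weekday": rng.integers(0, 2, n),
    "Promo": rng.integers(0, 2, n),
    "Rain_mm": rng.exponential(5.0, n),
    "CompetitorsNearby": rng.integers(0, 10, n).astype(float),
})
# inject some missing values, as in the real data
train.loc[rng.choice(n, 10, replace=False), "Rain_mm"] = np.nan
train["Revenue"] = (20 + 3 * train["Promo"]
                    - 0.1 * train["Rain_mm"].fillna(0)
                    + rng.normal(0, 1, n))

X, y = train.drop(columns=["Revenue"]), train["Revenue"]

candidates = {
    "ridge": Ridge(alpha=1.0),
    "knn": KNeighborsRegressor(n_neighbors=5),
}
scores = {}
for name, model in candidates.items():
    pipe = Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # median is robust to outliers
        ("scale", StandardScaler()),                   # needed for KNN distances
        ("model", model),
    ])
    # same metric and scale for every candidate: cross-validated MSE
    mse = -cross_val_score(pipe, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    scores[name] = mse
best = min(scores, key=scores.get)
print(scores, "->", best)
```

Fitting the imputer and scaler inside the pipeline keeps the cross-validation honest (no information leaks from validation folds), which is one of the tradeoffs worth discussing in the report.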
5) Evaluation Metric

We will use Mean Squared Error (MSE) on the hidden test set (Test.csv) to evaluate the performance of your selected model. Your submitted notebook will be executed with Test.csv in the same folder to compute and print this MSE.

6) What to Submit

• SID_Assignment1_document.pdf: a clear report (≤ 15 pages, font size 12) describing EDA, modeling, selection rationale, and conclusions. Report numerical results to four decimal places. Focus on the important aspects, documenting the reasoning behind your decisions and the tradeoffs expected. Tradeoffs: every decision has pros and cons, and you should state them (e.g. using holdout validation with a certain validation size trades off accuracy of the error estimate against fidelity to the original training set size). Descriptions should be detailed enough for data analysts in your field to understand the process and the decisions made along the way.
• SID_Assignment1_implementation.ipynb: your Python notebook that produces the results in the report. Please make sure the notebook executes cleanly end-to-end (restart and run all).
• SID_Assignment1_prediction.csv: a single column named Revenue with predictions for Test_noLabel.csv, created from your selected model.

Last cell template (the marker will run this):

    import pandas as pd
    from sklearn.metrics import mean_squared_error

    QSR_test = pd.read_csv("Test.csv")   # provided by the marker
    y_true = QSR_test["Revenue"].values

    # YOUR CODE: load your trained pipeline/model here and predict:
    X_hidden = QSR_test.drop(columns=["Revenue"])
    # replace below with how you call the model; should reproduce your submitted csv
    y_pred = my_model.predict(X_hidden)

    # code below should run as is
    test_error = mean_squared_error(y_true, y_pred)
    print(test_error)

7) Additional info on formatting

• The report should be well-structured and easy to read; include clear figures/tables. Maximum 15 pages (including appendices).
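One way to produce SID_Assignment1_prediction.csv in the required shape (a single column named Revenue, no index column) is sketched below. The `_StandInModel` class and the small frame are placeholders so the snippet runs on its own; in the real notebook you would use your fitted pipeline and `pd.read_csv("Test_noLabel.csv")`.

```python
# Hedged sketch: writing the prediction file in the required format.
import pandas as pd

class _StandInModel:
    """Placeholder for the trained pipeline; returns a constant prediction."""
    def predict(self, X):
        return [21.5] * len(X)

my_model = _StandInModel()

# In the real notebook: X_new = pd.read_csv("Test_noLabel.csv")
X_new = pd.DataFrame({"Promo": [0, 1, 1], "Rain_mm": [0.0, 2.5, 11.0]})

pred = pd.DataFrame({"Revenue": my_model.predict(X_new)})
pred.to_csv("SID_Assignment1_prediction.csv", index=False)  # no index column
```

Passing `index=False` matters: the marker expects exactly one column, and pandas would otherwise write the row index as an extra unnamed column.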
• Make sure figures and tables are readable, have captions, and are referenced in the text. Do not rely on appendices for completeness; all main points should be in the body of the report.
• The report should not be the notebook converted to PDF; it should be a Word, LaTeX, or similar document containing the essential information.

8) Marking Criteria

This assignment is worth 25 marks in total, with 14 marks allocated to the content of the document.pdf and 11 marks to the Python implementation. The marking breakdown is as follows:

Prediction accuracy: Your test error will be compared against a baseline model developed by the teaching team.
• The marker first runs SID_Assignment1_implementation.ipynb.
• If the file runs smoothly and produces a test MSE, up to 11 marks will be awarded based on prediction accuracy relative to the baseline model (outperforming the baseline earns the full 11 marks; below the baseline, marks are deducted proportionally).
• If the marker cannot run SID_Assignment1_implementation.ipynb, or if no test MSE is produced, partial marks (maximum 3, meaning a potential loss of 8 marks) may be awarded based on the appropriateness of the file.

Report described in SID_Assignment1_document.pdf: Up to 14 marks are allocated based on:
• The appropriateness of the chosen prediction method.
• The detail, discussion, and explanation of your data analysis procedure.
• See the Marking Criteria for more details.

CSV file submission: Up to 2 marks will be deducted if you fail to upload the CSV file.

9) Final Notes

• If you believe there are errors in the assignment, please contact the teaching team as soon as possible. We encourage you to read the instructions carefully and to seek clarification early if you are unsure about any requirements.