AD654: Marketing Analytics
Date: 2020-11-14
Once you have completed this assignment, you will upload two files to Blackboard: the
.ipynb file that you create in Jupyter Notebook, and an .html file that was generated from your
.ipynb file. If you run into any trouble with submitting the .html file to Blackboard, you can
submit it as a PDF instead.
For any question that asks you to perform some particular task, you just need to show your
input and output in Jupyter Notebook. Tasks will always be written in regular, non-italicized
font.
For any question that asks you to include interpretation, write your answer in a Markdown cell
in Jupyter Notebook. Any homework question that needs interpretation will be written in
italicized font. Do not simply write your answer in a code cell as a comment, but use a
Markdown cell instead.
Remember to be resourceful! There are many helpful resources available to you, including the
video library, the class slides, the recitation sessions, the Zoom office hours sessions, and the
web.
Determining CLV at the Gold Zone
This homework will have a somewhat different feel than the other ones that we have done this
semester -- instead of using a step-by-step set of directions, it will be a bit more general. As
always, the only steps that must be answered in a Markdown cell are the parts for which the
prompt uses italic sentences or questions.
Exploratory Data Analysis (EDA): Read goldzoneplayers.csv into your environment.
Then, clean up any NaNs that may be hiding out in the dataset -- remove them completely.
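As a rough sketch of the loading and cleaning step, the snippet below reads a CSV and drops every row that contains a NaN. The rows are invented stand-ins for goldzoneplayers.csv (only the column names come from the variable descriptions), and treating "remove them completely" as row-wise deletion with dropna() is an assumption.

```python
import io

import pandas as pd

# Hypothetical stand-in for goldzoneplayers.csv: four invented rows,
# two of which are missing a value.
fake_csv = io.StringIO(
    "PlayerID,Gender,MarStatus,Age,Education,Employment,Consecutive,TotalSpend\n"
    "1,M,S,34,C,S,3,512\n"
    "2,F,M,,G,S,1,430\n"
    "3,F,S,22,HS,H,2,\n"
    "4,M,M,51,C,U,7,298\n"
)

# With the real file this would be pd.read_csv("goldzoneplayers.csv").
players = pd.read_csv(fake_csv)

# Count missing values per column before removing anything.
print(players.isna().sum())

# Drop every row with at least one NaN, then confirm that none remain.
players_clean = players.dropna().reset_index(drop=True)
print(players_clean.isna().sum().sum())
print(players_clean.shape)
```

Checking isna().sum() before and after makes it easy to report how many records were removed.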
Generate up to 8 different visualizations for this dataset. For each visualization that you make,
describe what it’s showing in 2-3 sentences. As you consider visualizations for this dataset,
think about relationships among variables. In this assignment, you will create a model that
predicts total spending, so it may help to see how that variable relates to certain other features,
and combinations of features.
The dataset contains the following variables:
PlayerID: An arbitrary number from 1 to 170, uniquely assigned to each Gold Zone member.
Gender: A categorical variable with M for Male, and F for Female.
MarStatus: M for Married; S for Single.
Age: A numeric variable representing the person’s age, in years, at the time of 2019 sign-up or renewal. 18 is the minimum age for Gold Zone membership.
Education: An “HS” in this category indicates that high school is the highest education completed. A “C” indicates that the person has at least some college education. A “G” indicates grad school education.
Employment: An “S” in this category indicates a salaried employee. An “H” in this category indicates an hourly-wage employee. A “U” in this category indicates that the person is unemployed.
Consecutive: The number of years that the person has been a Gold Zone member. A “1” in this column means that 2019 was the person’s first season as a member. Even though 18 is the minimum membership age, an 18-year-old could have a number greater than one, as families may pass down memberships (there is a discount for returning members that becomes larger as the consecutive years number goes higher).
TotalSpend: The person’s total discretionary spending for the 2019 season, rounded to the nearest dollar.
Pass any two categories to the groupby() function in pandas, and then use describe() to learn
about how TotalSpend varies among the groups formed by combining the levels for these
categories. Write 2-3 sentences about what these results show.
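The groupby() and describe() combination might look something like the sketch below. The eight rows are invented for illustration, and Gender and Employment are just one possible pair of categories.

```python
import pandas as pd

# Invented rows; only the column names come from the assignment.
players = pd.DataFrame({
    "Gender": ["M", "M", "F", "F", "M", "F", "M", "F"],
    "Employment": ["S", "S", "S", "H", "H", "H", "S", "S"],
    "TotalSpend": [500, 540, 620, 310, 280, 350, 460, 580],
})

# One row of summary statistics per (Gender, Employment) combination.
summary = players.groupby(["Gender", "Employment"])["TotalSpend"].describe()
print(summary)
```

Each row of the result describes one combination of levels, so the mean and 50% columns can be compared across groups directly.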
Linear Regression Model:
Create a linear regression model that aims to predict the amount of money that a person will
spend at the Gold Zone in one summer (even though the ‘LV’ in ‘CLV’ stands for ‘Lifetime
Value’, it is often calculated over a shorter period). Use all the variables in the dataset, but be
sure to drop one level from any categorical input variables. Before you create the model,
create a data partition that sends ⅓ of your data to ‘test’ and ⅔ to ‘train’. Use LinearRegression
from scikit-learn.
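One way the dummy coding, partition, and model fit could be put together is sketched below. The data comes from a made-up generator (make_fake_players is a hypothetical helper, not part of the assignment), the random_state value is arbitrary, and leaving PlayerID out of the inputs is an assumption based on its being an arbitrary identifier.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def make_fake_players(n=170, seed=0):
    """Invented stand-in for goldzoneplayers.csv: real column names, fake values."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "PlayerID": np.arange(1, n + 1),
        "Gender": rng.choice(["M", "F"], n),
        "MarStatus": rng.choice(["M", "S"], n),
        "Age": rng.integers(18, 70, n),
        "Education": rng.choice(["HS", "C", "G"], n),
        "Employment": rng.choice(["S", "H", "U"], n),
        "Consecutive": rng.integers(1, 15, n),
    })
    # Made-up spending rule so the example has something to recover.
    noise = rng.normal(0, 50, n)
    df["TotalSpend"] = (200 + 5 * df["Age"] + 20 * df["Consecutive"] + noise).round()
    return df


players = make_fake_players()

# drop_first=True removes one level per categorical variable, which
# becomes the reference level for that variable's coefficients.
X = pd.get_dummies(
    players.drop(columns=["PlayerID", "TotalSpend"]),
    columns=["Gender", "MarStatus", "Education", "Employment"],
    drop_first=True,
    dtype=float,
)
y = players["TotalSpend"]

# One third of the records go to test, two thirds to train.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=654
)

# Fit on the training partition only.
model = LinearRegression().fit(X_train, y_train)

# Pair each coefficient with its column name for easier interpretation.
coefs = pd.Series(model.coef_, index=X.columns)
print(model.intercept_)
print(coefs)
```

A dummy coefficient such as Gender_M is read relative to the dropped level (here F), holding the other inputs fixed.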
Write a paragraph that explains the meaning of each of your model’s coefficients.
Generate the following statistics for your linear regression model: RMSE, r-squared, median
absolute error. Generate these stats for your model’s performance against both train and test
(but remember, you will only build the model with train).
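The three statistics could be collected with a small helper along these lines. regression_report is a hypothetical name, and the two-feature dataset is synthetic, used only so that the sketch runs on its own. RMSE is computed as the square root of mean_squared_error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, median_absolute_error, r2_score
from sklearn.model_selection import train_test_split


def regression_report(model, X, y):
    """Return RMSE, r-squared, and median absolute error for one partition."""
    pred = model.predict(X)
    return {
        "rmse": float(np.sqrt(mean_squared_error(y, pred))),
        "r2": float(r2_score(y, pred)),
        "median_ae": float(median_absolute_error(y, pred)),
    }


# Synthetic data, invented for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(150, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, 150)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=654
)

# The model only ever sees the training partition.
model = LinearRegression().fit(X_train, y_train)

# Score the same fitted model against both partitions.
reports = {
    "train": regression_report(model, X_train, y_train),
    "test": regression_report(model, X_test, y_test),
}
for name, stats in reports.items():
    print(name, stats)
```

The same helper works for any fitted scikit-learn regressor, so it can be reused unchanged for the tree model.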
Decision Tree Regressor Model:
Using the DecisionTreeRegressor class from scikit-learn, build a regression tree model for
this dataset. Convert categorical variables in the dataset with astype('category') beforehand.
Pass all levels of all your input variables to the model this time -- in a tree, there’s no need to
drop a level.
Create a visualization of your regression tree model. Write a paragraph that describes any four
rules generated by your tree. (A rule is formed by tracing the path of a record from the top of
the tree to the bottom).
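A possible sketch of building the tree and reading its rules follows. The data is again produced by a made-up generator (make_fake_players is hypothetical), max_depth=3 is an arbitrary choice to keep the output readable, and running pd.get_dummies() after astype('category') is an assumption about how the category columns reach the model, since scikit-learn trees expect numeric inputs. Here export_text() gives a plain-text view of the tree; plot_tree() from sklearn.tree draws the same structure as a figure.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, export_text


def make_fake_players(n=170, seed=0):
    """Invented stand-in for goldzoneplayers.csv: real column names, fake values."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "PlayerID": np.arange(1, n + 1),
        "Gender": rng.choice(["M", "F"], n),
        "MarStatus": rng.choice(["M", "S"], n),
        "Age": rng.integers(18, 70, n),
        "Education": rng.choice(["HS", "C", "G"], n),
        "Employment": rng.choice(["S", "H", "U"], n),
        "Consecutive": rng.integers(1, 15, n),
    })
    # Made-up spending rule so the tree has something to split on.
    noise = rng.normal(0, 50, n)
    df["TotalSpend"] = (200 + 5 * df["Age"] + 20 * df["Consecutive"] + noise).round()
    return df


players = make_fake_players()

# Mark the categorical inputs as category dtype.
cat_cols = ["Gender", "MarStatus", "Education", "Employment"]
players = players.astype({col: "category" for col in cat_cols})

# Keep every level this time: no drop_first for a tree.
X = pd.get_dummies(players.drop(columns=["PlayerID", "TotalSpend"]), dtype=float)
y = players["TotalSpend"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=654
)

# A shallow tree keeps the printed rules short enough to read.
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)

# Each root-to-leaf path in this text is one rule.
rules = export_text(tree, feature_names=list(X.columns))
print(rules)
```

Reading one indented branch from the first line down to a "value" line gives a rule of the form "if these conditions hold, predict this TotalSpend".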
Generate the following statistics for your tree model: RMSE, r-squared, median absolute error.
Generate these stats for your model’s performance against both train and test (but remember,
you will only build the model with train).
Model Comparison:
Use a histogram to visualize the distribution of the difference between linear regression model
predictions against the test set and actual y_test values.
Then, do the same thing for the difference between the regression tree model predictions
against the test set and the actual y_test values.
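The two histograms might be drawn side by side as in the sketch below. The data is synthetic and the tree depth is arbitrary, so the shapes it produces say nothing about the Gold Zone results. The difference is taken as prediction minus actual.

```python
import matplotlib

# Headless backend so the sketch runs without a display; not needed in Jupyter.
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, invented for illustration only.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(150, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, 150)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=654
)

linreg = LinearRegression().fit(X_train, y_train)
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_train, y_train)

# Difference between each model's test-set predictions and the actual values.
diff_linreg = linreg.predict(X_test) - y_test
diff_tree = tree.predict(X_test) - y_test

# A shared x-axis makes the two spreads directly comparable.
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True)
axes[0].hist(diff_linreg, bins=15)
axes[0].set_title("Linear regression: prediction - actual")
axes[1].hist(diff_tree, bins=15)
axes[1].set_title("Regression tree: prediction - actual")
for ax in axes:
    ax.set_xlabel("Difference")
    ax.set_ylabel("Count")
fig.tight_layout()
```

A tree predicts one constant per leaf, so its differences can look lumpier than those of a linear model; whether that shows up depends on the data.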
How are these plots different? What does this suggest?
How did the model statistics (r-squared, RMSE, median absolute error) differ as you switched
from linear regression to the regression tree?
In 3-4 sentences, why might Lobster Land care about a model like this? What could it do with
a model that could effectively make such a prediction?
Write a thoughtful paragraph that speculates about the relative performance of linear
regression vs. a tree model for this particular dataset. To answer this, you don’t need any
particularly sophisticated knowledge of either type of model -- all you need to do is take a
close look at what you have seen and found in the steps above. Think about what the linear
regression coefficients indicate, and then consider what your tree model shows you, as well as
what you learned during the EDA phase. Now wait, one last thing -- stop and think for a
moment about the model type whose performance was worse in this assignment -- could you
ever think of a scenario where it might be better than the other one?
Side Note: Comparisons between these two modeling types tend to come up from time to time in data science interviews. If you can remember this example (or another one like it) you will have a great answer for a question about instances in which you may want to use one vs. the other.