Version 1: Updated June 9, 2023
# TODO: fill in the below
[First name, Last name]
[Student number]
[Section number]
[Favorite book]
[Date]
MMA/MMAB/MMAI 869 2024: Individual Assignment
This assignment contains four questions. The questions are fully contained in this Google Colab
Notebook.
You are to make a copy of this Notebook and edit the copy to provide your answers. You are to
complete the assignment entirely within Google Colab. Why?
It gives you practice using cloud-based interactive notebook environments (which is a popular
workflow)
It is easier for you to manage the environment (e.g., installing packages, etc.)
Google Colab has nice, beefy machines, so you don't have to worry about running out of
memory on your local computer.
It will be easier for the TA to help you debug your code if you need help
It will be easier for the TA to mark/run your code
Some parts of this assignment require you to write code. Use Python or R. For Python, you may use
standard Python libraries, including scikit-learn , pandas , numpy , and scipy . For R, you may use
dplyr , caret , ggplot2 , rpart and other standard libraries.
Some parts of this assignment require text responses. In these cases, type your response in the
Notebook cell indicated. Use English. Use proper grammar, spelling, and punctuation. Be
professional and clear. Be complete, but not overly-verbose. Feel free to use Markdown syntax to
format your answer (i.e., add bold, italics, lists, tables).
What to Submit to the Course Portal
Export your completed Notebook as a PDF file by clicking File->Print->Save as PDF.
Please do not submit the Notebook file ( .ipynb ) to the course portal.
Please submit the PDF export of the Notebook.
Please name the PDF file 2024_869_FirstnameLastName.pdf
E.g., 2024_869_StephenThomas.pdf
Please make sure you have run all the cells so we can see the output!
Best practice: Before exporting to PDF click Runtime->Restart and run all.
Assignment Instructions
Preliminaries: Inspect and Set up environment
No action is required on your part in this section. These cells print out helpful information about the
environment, just in case.
import datetime
import pandas as pd
import numpy as np
print(datetime.datetime.now())
2021-06-08 12:40:41.157837
!which python
/usr/local/bin/python
!python --version
Python 3.7.10
!echo $PYTHONPATH
/env/python
# TODO: install any packages you need here. For example:
#pip install unidecode
Question 1: Uncle Steve's Diamonds
You work at a local jewelry store named Uncle Steve's Diamonds. You started as a janitor, but you’ve
recently been promoted to senior data analyst. Congratulations!
Uncle Steve, the store's owner, needs to better understand the store's customers. In particular, he
wants to know what kind of customers shop at the store. He wants to know the main types of
customer personas. Once he knows these, he will contemplate ways to better market to each
persona, better satisfy each persona, better cater to each persona, increase the loyalty of each
persona, etc. But first, he must know the personas.
You want to help Uncle Steve. Using sneaky magic (and the help of Environics), you've collected
four useful features for a subset of the customers: age, income, spending score (i.e., a score based
on how much they’ve spent at the store in total), and savings (i.e., how much money they have in
their bank account).
Instructions
Your tasks
1. Pick a clustering algorithm (the sklearn.cluster module has many good choices, including
KMeans , DBSCAN , and AgglomerativeClustering ). (Note that another popular implementation
of the hierarchical algorithm can be found in SciPy's scipy.cluster.hierarchy.linkage .)
Don't spend a lot of time thinking about which algorithm to choose - just pick one. Cluster the
customers as best as you can, within reason. That is, try different feature preprocessing steps,
hyperparameter values, and/or distance metrics. You don't need to try every possible
combination, but try a few at least. Measure how good each model configuration is by
calculating an internal validation metric (e.g., calinski_harabasz_score or
silhouette_score ). (A minimal, illustrative sketch of this workflow appears after this list.)
2. You have some doubts - you're not sure if the algorithm you chose in part 1 is the best
algorithm for this dataset/problem. Neither is Uncle Steve. So, choose a different algorithm
(any!) and do it all again.
3. Which clustering algorithm is "better" in this case? Think about characteristics of the algorithm,
like quality of results, ease of use, speed, interpretability, etc. Choose a "winner" and justify your
choice to Uncle Steve.
4. Interpret the clusters of the winning model. That is, describe, in words, a persona that
accurately depicts each cluster. Use statistics (e.g., cluster means/distributions), examples
(e.g., exemplar instances from each cluster), and/or visualizations (e.g., relative importance
plots, snakeplots) to get started. Human judgement and creativity will be necessary. This is
where it all comes together. Be descriptive and help Uncle Steve understand his customers
better. Please!
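To make the workflow concrete, here is a minimal, illustrative sketch of task 1 (not a model answer). It assumes the data has been loaded into df1 as in section 1.0 below, arbitrarily picks KMeans , scales the features, and scores a few values of k with the silhouette score. Your own choices of algorithm, preprocessing, and metric may well differ.

# Illustrative sketch only -- assumes df1 has been loaded as in section 1.0 below.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X1 = df1[["Age", "Income", "SpendingScore", "Savings"]]
scaler = StandardScaler()
X1_scaled = scaler.fit_transform(X1)

# Try a few cluster counts and report an internal validation metric for each
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X1_scaled)
    print(k, silhouette_score(X1_scaled, labels))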
Marking
The coding parts (i.e., 1 and 2) will be marked based on:
Correctness. Code clearly and fully performs the task specified.
Reproducibility. Code is fully reproducible. I.e., you (and I) are able to run this Notebook again
and again, from top to bottom, and get the same results each time.
Style. Code is organized. All parts are commented with clear reasoning and rationale. No old code
lying around. Code is easy to follow.
Parts 3 and 4 will be marked on:
Quality. Response is well-justified and convincing. Response uses facts and data where
possible.
Style. Response uses proper grammar, spelling, and punctuation. Response is clear and
professional. Response is complete, but not overly-verbose. Response follows length
guidelines.
Tips
Since clustering is an unsupervised ML technique, you don't need to split the data into
training/validation/test or anything like that. Phew!
On the flip side, since clustering is unsupervised, you will never know the "true" clusters, and
so you will never know if a given algorithm is "correct." There really is no notion of
"correctness" - only "usefulness."
Many online clustering tutorials (including some from Uncle Steve) create flashy
visualizations of the clusters by plotting the instances on a 2-D graph and coloring each point
by the cluster ID. This is really nice and all, but it only works if your dataset has exactly
two features - no more, no less. This dataset has more than two features, so you cannot use
this technique. (But that's OK - you don't need to use this technique.)
Must you use all four features in the clustering? Not necessarily, no. But "throwing away"
quality data, for no reason, is unlikely to improve a model.
Some people have success applying a dimensionality reduction technique (like
sklearn.decomposition.PCA ) to the features before clustering. You may do this if you wish,
although it may not be as helpful in this case because there are only four features to begin
with.
If you apply a transformation (e.g., MinMaxScaler or StandardScaler ) to the features before
clustering, you may have difficulty interpreting the means of the clusters (e.g., what is a mean
Age of 0.2234??). There are two options to fix this: first, you can always reverse a
transformation with the inverse_transform method. Second, you can just use the original
dataset (i.e., before any preprocessing) during the interpretation step. (A short sketch of the
inverse_transform approach appears after these tips.)
You cannot change the distance metric for K-Means. (This is for theoretical reasons: K-Means
only works/makes sense with Euclidean distance.)
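As mentioned in the tips above, scaled cluster centers can be mapped back to the original units for interpretation. Here is a short sketch, assuming a fitted StandardScaler named scaler and a fitted KMeans named km as in the earlier sketch (hypothetical names; adapt to whatever you actually used).

# Sketch: recover cluster centers in the original units for interpretation.
# Assumes `scaler` and `km` were fitted as in the earlier sketch.
import pandas as pd

centers = pd.DataFrame(scaler.inverse_transform(km.cluster_centers_),
                       columns=["Age", "Income", "SpendingScore", "Savings"])
print(centers)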
1.0: Load data
# DO NOT MODIFY THIS CELL
df1 = pd.read_csv("https://drive.google.com/uc?export=download&id=1thHDCwQK3GijytoSSZNekAsItN_
df1.info()
RangeIndex: 505 entries, 0 to 504
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 505 non-null int64
1 Income 505 non-null int64
2 SpendingScore 505 non-null float64
3 Savings 505 non-null float64
dtypes: float64(2), int64(2)
memory usage: 15.9 KB
1.1: Clustering Algorithm #1
# TODO: delete this comment and insert code here. Feel free to add more code cells as appropriate.
1.2: Clustering Algorithm #2
# TODO: delete this comment and insert code here. Feel free to add more code cells as appropriate.
1.3 Model Comparison
TODO: Delete this text and insert your answer here.
1.4 Personas
TODO: Delete this text and insert your answer here.
Question 2: Uncle Steve's Fine Foods
Uncle Steve runs a small, local grocery store in Ontario. The store sells all the normal food staples
(e.g., bread, milk, cheese, eggs, more cheese, fruits, vegetables, meat, fish, waffles, ice cream, pasta,
cereals, drinks), personal care products (e.g., toothpaste, shampoo, hair goo), medicine, and cakes.
There's even a little section with flowers and greeting cards! Normal people shop here, and buy
normal things in the normal way.
Business is OK but Uncle Steve wants more. He's thus on the hunt for customer insights. Given your
success at the jewelry store, he has asked you to help him out.
He has given you a few years' worth of customer transactions, i.e., sets of items that customers
have purchased. You have applied an association rules learning algorithm (like Apriori) to the data,
Instructions
and the algorithm has generated a large set of association rules of the form {X} -> {Y} , where
{X} and {Y} are item-sets.
Now comes a thought experiment. For each of the following scenarios, state what one of the
discovered association rules might be that would meet the stated condition. (Just make up the rule,
using your human experience and intuition.) Also, describe whether and why each rule would be
considered interesting or uninteresting for Uncle Steve (i.e., is this insight new to him? Would he be
able to use it somehow?).
Keep each answer to 600 characters or less (including spaces).
To get those brain juices going, an example condition and answer is provided below:
Condition: A rule that has high support.
Answer: The rule {milk} -> {bread} would have high support, since milk and bread
are household staples and a high percentage of transactions would include both
{milk} and {bread} . Uncle Steve would likely not find this rule interesting, because
these items are so common, he would have surely already noticed that so many
transactions contain them.
Marking
Your responses will be marked as follows:
Correctness. Rule meets the specified condition, and seems plausible in an Ontario grocery
store.
Justification of interestingness. Response clearly describes whether and why the rule would be
considered interesting to Uncle Steve.
Tips
There is no actual data for this question. This question is just a thought exercise. You need to
use your intuition, creativity, and understanding of the real world. I assume you are familiar
with what happens inside normal grocery stores. We are not using actual data and you do
not need to create/generate/find any data. I repeat: there is no data for this question.
The reason this question asks you to do a thought experiment, rather than write and
run code to find actual association rules on an actual dataset, is that writing code to
find association rules is actually pretty easy. Using your brain to come up with rules that
meet certain criteria, on the other hand, is a true test of whether you understand how the
algorithm works, what support and confidence mean, and the applicability of rules. The
question uses the grocery store context because most, if not all, students should be familiar
with it from personal experience. (The standard definitions of support and confidence are
repeated below for reference.)
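For reference, the standard definitions: for a rule {X} -> {Y} over $N$ transactions,

$$\mathrm{support}(X \Rightarrow Y) = \frac{\#\{\text{transactions containing both } X \text{ and } Y\}}{N}, \qquad \mathrm{confidence}(X \Rightarrow Y) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}$$

For example, if 40 of 1,000 transactions contain both milk and bread, and 80 contain milk, then {milk} -> {bread} has support 40/1000 = 4% and confidence 40/80 = 50%.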
2.1: A rule that might have high support and high confidence.
TODO: Delete this text and insert your answer here.
2.2: A rule that might have reasonably high support but low confidence.
TODO: Delete this text and insert your answer here.
2.3: A rule that might have low support and low confidence.
TODO: Delete this text and insert your answer here.
2.4: A rule that might have low support and high confidence.
TODO: Delete this text and insert your answer here.
Question 3: Uncle Steve's Credit Union
Uncle Steve has recently opened a new credit union in Kingston, named Uncle Steve's Credit Union.
He plans to disrupt the local market by instantaneously providing credit to customers.
The first step in Uncle Steve's master plan is to create a model to predict whether an applicant
is a good risk or a bad risk. He has outsourced the creation of this model to you.
You are to create a classification model to predict whether a loan applicant is a good risk or a bad
risk. You will use data that Uncle Steve bought from another credit union (somewhere in Europe, he
thinks?) that has around 6000 instances and a number of demographics features (e.g., Sex ,
DateOfBirth , Married ), loan details (e.g., Amount , Purpose ), credit history (e.g., number of loans),
as well as an indicator (called BadCredit in the dataset) as to whether that person was a bad risk.
Instructions
Your tasks
To examine the effects of the various ML stages, you are to create the model several times, each
time adding more sophistication, and measuring how much the model improved (or not). In
particular, you will:
0. Split the data into training and testing sets. Don't touch the testing data again, for any reason, until
step 5. We are pretending that the testing data is "future, unseen data that our model won't
see until production." I'm serious, don't touch it. I'm watching you!
1. Build a baseline model - no feature engineering, no feature selection, no hyperparameter
tuning (just use the default settings), nothing fancy. (You may need to do some basic feature
transformations, e.g., encoding of categorical features, or dropping of features you do not
think will help or do not want to deal with yet.) Measure the performance using K-fold cross
validation (recommended: sklearn.model_selection.cross_val_score ) on the training data.
Use at least 5 folds, but more are better. Choose a scoring parameter (i.e., classification
metric) that you feel is appropriate for this task. Don't use accuracy. Print the mean score of
your model. (A rough, illustrative sketch of a baseline appears after this list.)
2. Add a bit of feature engineering. The sklearn.preprocessing module contains many useful
transformations. Engineer at least three new features. They don't need to be especially
ground-breaking or complicated. Dimensionality reduction techniques like
sklearn.decomposition.PCA are fair game but not required. (If you do use dimensionality
reduction techniques, it would only count as "one" new feature for the purposes of this
assignment, even though I realize that PCA creates many new "features" (i.e., principal
components).) Re-train your baseline model. Measure performance. Compare to step 1.
3. Add feature selection. The sklearn.feature_selection module has some algorithms for you to
choose from. After selecting features, re-train your model, measure performance, and
compare to step 2.
4. Add hyperparameter tuning. Make reasonable choices and try to find the best (or at least
better) hyperparameters for your estimator and/or transformers. It's probably a good idea to
stop using cross_val_score at this point and start using
sklearn.model_selection.GridSearchCV , as it is specifically built for this purpose and is more
convenient to use. Measure performance and compare to step 3.
5. Finally, using your findings from the previous steps, estimate how well your model will work in
production. Use the testing data (our "future, unseen data") from step 0. Transform the data as
appropriate (easy if you've built a pipeline, a little more difficult if not), use the model from
step 4 to get predictions, and measure the performance. How well did we do?
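To make steps 0 and 1 concrete, here is a rough, illustrative sketch of a baseline (not a model answer). It assumes the train/test split from section 3.0 below and, purely for illustration, a RandomForestClassifier scored with F1; the estimator, the handling of non-numeric columns, and the scoring metric are all choices you should make and justify yourself.

# Illustrative baseline sketch -- assumes X_train and y_train from section 3.0 below.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Crude baseline-only shortcut: keep numeric columns, drop the rest for now
X_train_numeric = X_train.select_dtypes("number")

clf = RandomForestClassifier(random_state=42)  # default hyperparameters
scores = cross_val_score(clf, X_train_numeric, y_train, cv=5, scoring="f1")
print(scores.mean())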
Marking
Each part will be marked for:
Correctness. Code clearly and fully performs the task specified.
Reproducibility. Code is fully reproducible. I.e., you (and I) should be able to run this Notebook
again and again, from top to bottom, and get the same results each and every time.
Style. Code is organized. All parts are commented with clear reasoning and rationale. No old code
lying around. Code is easy to follow.
Tips
The origins of the dataset are a bit of a mystery. Assume the data set is recent (circa 2022)
and up-to-date. Assume that column names are correct and accurate.
You don't need to experiment with more than one algorithm/estimator. Just choose one (e.g.,
sklearn.tree.DecisionTreeClassifier , sklearn.ensemble.RandomForestClassifier ,
sklearn.linear_model.LogisticRegression , sklearn.svm.LinearSVC , whatever) and stick
with it for this question.
There is no minimum accuracy/precision/recall for this question. I.e., your mark will not be
based on how good your model is. Rather, your mark will be based on how good your process is.
Watch out for data leakage and overfitting. In particular, be sure to fit() any estimators and
transformers (collectively, objects) only to the training data, and then use the objects'
transform() methods on both the training and testing data. Data School has a helpful video
about this. Pipelines are very helpful here and make your code shorter and more robust (at the
expense of making it harder to understand), and I recommend using them, but they are not
required for this assignment. (A sketch of a pipeline wrapped in a grid search appears after these tips.)
Create as many code cells as you need. In general, each cell should do one "thing."
Don't print large volumes of output. E.g., don't do: df.head(100)
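As noted in the tips, a Pipeline keeps all fitting confined to the training folds, and GridSearchCV (step 4) can tune the transformers and the estimator together. Here is a sketch, assuming the numeric-only training frame from the earlier baseline sketch; the pipeline steps and the parameter grid are placeholders, not recommendations.

# Sketch: Pipeline + GridSearchCV. Steps and grid are placeholders to adapt.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", RandomForestClassifier(random_state=42)),
])

param_grid = {
    "select__k": [3, 5, "all"],
    "clf__n_estimators": [100, 300],
    "clf__max_depth": [None, 10],
}

# Each candidate is re-fit on each training fold, so scaling/selection never see the validation fold
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_train_numeric, y_train)  # numeric-only frame from the earlier sketch
print(search.best_params_, search.best_score_)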
3.0: Load data and split
# DO NOT MODIFY THIS CELL
# First, we'll read the provided labeled training data
df3 = pd.read_csv("https://drive.google.com/uc?export=download&id=1wOhyCnvGeY4jplxI8lZ-bbYN3z
df3.info()
from sklearn.model_selection import train_test_split
X = df3.drop('BadCredit', axis=1) #.select_dtypes(['number'])
y = df3['BadCredit']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
RangeIndex: 6000 entries, 0 to 5999
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 UserID 6000 non-null object
1 Sex 6000 non-null object
2 PreviousDefault 6000 non-null int64
3 FirstName 6000 non-null object
4 LastName 6000 non-null object
5 NumberPets 6000 non-null int64
6 PreviousAccounts 6000 non-null int64
7 ResidenceDuration 6000 non-null int64
8 Street 6000 non-null object
9 LicensePlate 6000 non-null object
10 BadCredit 6000 non-null int64
11 Amount 6000 non-null int64
12 Married 6000 non-null int64
13 Duration 6000 non-null int64
14 City 6000 non-null object
15 Purpose 6000 non-null object
16 DateOfBirth 6000 non-null object
dtypes: int64(8), object(9)
memory usage: 797.0+ KB
3.1: Baseline model
# TODO: Insert code here. Feel free to create additional code cells if necessary.
3.2: Adding feature engineering
# TODO: Insert code here. Feel free to create additional code cells if necessary.
3.3: Adding feature selection
# TODO: Insert code here. Feel free to create additional code cells if necessary.
3.4: Adding hyperparameter tuning
# TODO: Insert code here. Feel free to create additional code cells if necessary.
3.5: Performance estimation on testing data
# TODO: Insert code here. Feel free to create additional code cells if necessary.
Question 4: Uncle Steve's Wind Farm
Uncle Steve has invested in wind. He's built a BIG wind farm with a total of 700 turbines. He's been
running the farm for a couple of years now and things are going well. He sells the power generated
by the farm to the Kingston government and makes a tidy profit. And, of course, he has been
gathering data about the turbines' operations.
One area of concern, however, is the cost of maintenance. While the turbines are fairly robust, it
seems like one breaks/fails every couple of days. When a turbine fails, it usually costs around
$20,000 to repair it. Yikes!
Currently, Uncle Steve is not doing any preventative maintenance. He just waits until a turbine fails,
and then he fixes it. But Uncle Steve has recently learned that if he services a turbine before it fails,
it will only cost around $2,000.
Obviously, there is a potential to save a lot of money here. But first, Uncle Steve would need to figure
out which turbines are about to fail. Uncle Steve being Uncle Steve, he wants to use ML to build a
predictive maintenance model. The model will alert Uncle Steve to potential turbine failures before
they happen, giving Uncle Steve a chance to perform an inspection on the turbine and then fix the
turbine before it fails. Uncle Steve plans to run the model every morning. For all the turbines that the
model predicts will fail, Uncle Steve will order an inspection (which costs a flat $500, whether the
turbine was in good health or not; the $500 would not be part of the $2,000 service cost). For the
rest of the turbines, Uncle Steve will do nothing.
Uncle Steve has used the last few years' worth of operation data to build and assess a model to
predict which turbines will fail on any given day. (The data includes useful features like sensor
readings, power output, weather, and many more, but those are not important for now.) In fact, he
didn't stop there: he built and assessed two models. One model uses deep learning (in this
case, RNNs), and the other uses random forests.
He's tuned the bejeebers out of each model and is comfortable that he has found the best-
performing version of each. Both models seem really good: both have accuracy scores > 99%. The
RNN has better recall, but Uncle Steve is convinced that the random forest model will be better for
him since it has better precision. Just to be sure, he has hired you to double check his calculations.
Instructions
Your task
Which model will save Uncle Steve more money? Justify.
In addition to the details above, here is the assessment of each model:
Confusion matrix for the random forest:

|                | Predicted Fail | Predicted No Fail |
| -------------- | -------------- | ----------------- |
| Actual Fail    | 201            | 55                |
| Actual No Fail | 50             | 255195            |

Confusion matrix for the RNN:

|                | Predicted Fail | Predicted No Fail |
| -------------- | -------------- | ----------------- |
| Actual Fail    | 226            | 30                |
| Actual No Fail | 1200           | 254045            |
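One way to organize the comparison is to attach a dollar cost to each cell of the confusion matrices. The mapping sketched below is an assumption you must examine and justify yourself, not a given: a caught failure costs an inspection plus the preventive service, a false alarm costs a wasted inspection, and a missed failure costs a full repair.

# Sketch of a cost comparison. The cell-to-cost mapping is an assumption to justify.
def total_cost(tp, fp, fn, inspection=500, service=2_000, repair=20_000):
    caught = tp * (inspection + service)  # predicted fail, actually about to fail
    false_alarms = fp * inspection        # predicted fail, actually healthy
    missed = fn * repair                  # predicted no fail, actually fails
    return caught + false_alarms + missed

# Counts taken from the confusion matrices above
print("Random forest:", total_cost(tp=201, fp=50, fn=55))
print("RNN:          ", total_cost(tp=226, fp=1200, fn=30))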
Marking
Quality. Response is well-justified and convincing.
Style. Response uses proper grammar, spelling, and punctuation. Response is clear and
professional. Response is complete, but not overly-verbose. Response follows length
guidelines.