COMM5007 Python ghostwriting
Date: 2023-04-25
Lecture Slides
Week 8 – Machine Learning
UNSW Business School
COMM5007 Coding for business
Term 1, 2023
Lecturer-in-Charge: Dr. Henry KF Cheung (kf.cheung@unsw.edu.au)
Copyright
• There are some file-sharing websites that specialise in buying and selling
academic work to and from university students.
• If you upload your original work to these websites, and if another student
downloads and presents it as their own either wholly or partially, you
might be found guilty of collusion — even years after graduation.
• These file-sharing websites may also accept purchase of course
materials, such as copies of lecture slides and tutorial handouts. By law,
the copyright on course materials, developed by UNSW staff in the
course of their employment, belongs to UNSW. It constitutes copyright
infringement, if not academic misconduct, to trade these materials.
Country
• UNSW Business School acknowledges the
Bidjigal (Kensington campus) and Gadigal
(City campus), the traditional custodians of
the lands where each campus is located.
• We acknowledge all Aboriginal and Torres
Strait Islander Elders, past and present and
their communities who have shared and
practiced their teachings over thousands of
years including business practices.
• We recognize Aboriginal and Torres Strait
Islander people’s ongoing leadership and
contributions, including to business,
education and industry.
UNSW Business School. (2022, August 18). Acknowledgement of Country [online video].
Retrieved from https://vimeo.com/369229957/d995d8087f
Predictive Modelling on Classification
Revision
•Regression Analysis using Python – An Example
Simple Linear Regression Example
•A real estate agent wishes to examine the relationship
between the selling price of a home and its size (measured in
square feet)
• "kc_house_data.csv" dataset
Simple Linear Regression Example: Regression Line
•House price vs living area: regression line
import pandas as pd
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
df = pd.read_csv("kc_house_data.csv")
regr = linear_model.LinearRegression()
df['price_inMillion'] = df["price"]/1000000
price = df[['price_inMillion']]
livingArea = df[['sqft_living']]
regr.fit(livingArea, price)
predictPrice = regr.coef_[0][0]*livingArea + regr.intercept_[0]
plt.plot(np.array(livingArea), np.array(predictPrice), '-r')
plt.xlabel("Living Area-Square Footage of the Room")
plt.ylabel("Price (in Million)")
plt.title('Price vs Living Area', fontsize = 14)
Fitted regression line: Estimated_Price = 0.00028 * sqft_living - 0.04358 (price in millions)
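To show how the fitted line is read, the sketch below plugs a living area into the equation above. The helper name estimated_price_millions and the 2,000 sq ft input are illustrative assumptions; the coefficients are the rounded values quoted on the slide, so the result is approximate.

```python
# Rounded coefficients copied from the fitted line on the slide
SLOPE = 0.00028       # change in price (in millions) per extra square foot
INTERCEPT = -0.04358  # fitted intercept (in millions)

def estimated_price_millions(sqft_living):
    """Return the predicted price in millions for a given living area."""
    return SLOPE * sqft_living + INTERCEPT

# Hypothetical home with 2,000 square feet of living area
example = estimated_price_millions(2000)
print(round(example, 5))
```

With these rounded coefficients, each extra 1,000 square feet adds roughly 0.28 million to the predicted price.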
Simple Linear Regression Example: Regression Line
•House price vs living area: scatter plot and regression line
import pandas as pd
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
df = pd.read_csv("kc_house_data.csv")
regr = linear_model.LinearRegression()
df['price_inMillion'] = df["price"]/1000000
price = df[['price_inMillion']]
livingArea = df[['sqft_living']]
regr.fit(livingArea, price)
predictPrice = regr.coef_[0][0]*livingArea + regr.intercept_[0]
plt.plot(np.array(livingArea), np.array(predictPrice), '-r')
plt.scatter(df.sqft_living, df.price_inMillion, color = 'Black')
plt.xlabel("Living Area-Square Footage of the Room")
plt.ylabel("Price (in Million)")
plt.title('Price vs Living Area', fontsize = 14)
Simple Linear Regression Example: Regression Line
•House price vs living area: use the model to predict
•House price vs living area: scatter plot and prediction line
import numpy as np
#Provide a list of living area values of interest
x = np.array([10300, 11400, 12500, 13550, 14600, 15650]).reshape(-1, 1)
y_pred = regr.predict(x) # we have created the model called regr
print('Predicted price (in Million):', y_pred, sep = '\n')
plt.scatter(df.sqft_living, df.price_inMillion, color = 'Black')
predictPrice = regr.coef_[0][0]*livingArea + regr.intercept_[0]
plt.plot(np.array(livingArea), np.array(predictPrice), '-r')
plt.xlabel("Living Area-Square Footage of the Room")
plt.ylabel("Price (in Million)")
plt.title('Price vs Living Area', fontsize=14)
plt.scatter(x, y_pred, color = 'blue')
plt.show()
Simple Linear Regression with statsmodels
• There are multiple libraries supporting regression analysis in
Python
• statsmodels is a Python module that provides classes and
functions for the estimation of many different statistical
models, as well as for conducting statistical tests, and
statistical data exploration
•We can use it for linear regression as well
import statsmodels.api as sm
lm = sm.OLS.from_formula('price ~ sqft_living', df) # sqft_living is the living-area column in df
result = lm.fit()
print(result.summary())
• In order to fit a simple linear regression model using least
squares, we use the function from_formula().
• The syntax from_formula(y ∼ x) is used to fit a model with a response y and a
predictor x. The summary() function outputs the regression
coefficients for all the predictors.
Simple Linear Regression: Statistical Interpretation
Annotations on the summary() output:
• No. Observations: number of observations in the dataset
• R-squared: the measurement of how much of the variance in the dependent variable is explained by the model
• coef: value of the slope and intercept
• P>|t|: significance of the slope and intercept
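The quantities annotated on the summary output can be reproduced by hand for a single predictor. The following is a minimal library-free sketch, assuming made-up toy data (not kc_house_data.csv) and hypothetical helper names fit_simple_ols and r_squared; it computes the least-squares slope and intercept and the share of variance explained.

```python
def fit_simple_ols(x, y):
    """Least-squares slope and intercept for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def r_squared(x, y, slope, intercept):
    """Share of the variance in y explained by the fitted line."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Toy data, made up for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

slope, intercept = fit_simple_ols(x, y)
r2 = r_squared(x, y, slope, intercept)
print("observations:", len(x))
print("slope:", round(slope, 4), "intercept:", round(intercept, 4))
print("R-squared:", round(r2, 4))
```

The significance values (p-values) need the standard errors of the coefficients as well, which this sketch leaves to statsmodels.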
Multiple Linear Regression
• In order to fit a multiple linear regression model using least
squares, we again use the function from_formula().
• The syntax from_formula(y ∼ x1 + x2 + x3) is used to fit a
model with three predictors, x1, x2, and x3.
• The summary() function now outputs the regression
coefficients for all the predictors.
import statsmodels.api as sm
lm2 = sm.OLS.from_formula('price ~ sqft_living + waterfront + condition', df) # predictors are columns of df
result2 = lm2.fit()
print(result2.summary())
Outline
•Classification
•Predictive Modeling on a Classification Task using Python
⎻ Train test dataset split
⎻ Select and fit a model
⎻ Prediction
⎻ Evaluate the model
Classification
Classification vs. Regression
Supervised machine learning applications
Regression
• Predict continuous output values given input variables
Classification
• Predicting categories given input variables
Binary vs Multi-class Classification
•A classification problem with only 2 classes is referred to as
binary classification
⎻ The output labels are 0 or 1
⎻ E.g., fake news or not, spam or no-spam email
•A problem with 3 or more classes is referred to as multi-class
classification
Predictive Modeling on a
Classification Task using Python
Machine Learning
•Machine Learning focuses on methods that learn from data
and make predictions on unseen data
Training: Labeled Data → Machine Learning algorithm → Learned model
Prediction: New Data → Learned model → Prediction
Workflow Diagram for Machine Learning in Predictive Modeling
https://www.researchgate.net/figure/Training-and-testing-our-machine-learning-approach_fig2_318132501
• Step 1: split labeled dataset
into training and test datasets
• Step 2: select and fit a
predictive model to data
• Step 3: predict test dataset
• Step 4: evaluate the model
performance
Labeled Data
Train Test Split Procedure
• Train-test split is a procedure that allows you to split the data
set into two sets – a training set and a testing set.
Image Source: https://builtin.com/data-science/train-test-split
• The training set (in-sample data or training data) is the set of data we analyze (train on) to design the rules in the model.
• The test set is a set of data that we did not use to train our model, nor as a validation set to inform our choice of parameters/input features.
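The split itself is just a shuffle followed by a cut. Here is a minimal library-free sketch, assuming a hypothetical helper simple_train_test_split and a made-up list of ten row indices; scikit-learn's train_test_split does the same job for real datasets.

```python
import random

def simple_train_test_split(rows, test_size=0.2, seed=0):
    """Shuffle a copy of rows, then cut it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

# Ten made-up row indices standing in for observations
rows = list(range(10))
train, test = simple_train_test_split(rows, test_size=0.2, seed=0)
print(len(train), len(test))
```

Every row lands in exactly one of the two sets, so the model is never evaluated on data it was trained on.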
Dataset – Social_Network_Ads.csv
•Purchased: binary variable (1 indicates purchase; 0 indicates
not purchase)
•UserID
•Gender
•Age
•EstimatedSalary
• 400 observations
Create Dummy Variable
• Gender is a text variable (male; female), which scikit-learn models cannot use directly.
• The pandas function get_dummies() converts categorical
variables into dummy variables.
• Dummy variables are binary variables that take a value of either 0 or 1.
pd.get_dummies(datasets.Gender, drop_first = True)
https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html
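What drop_first = True does can be seen in a small library-free sketch. The helper dummy_columns and the four sample values are illustrative assumptions; the sketch mimics the idea behind get_dummies (one 0/1 column per category, first category dropped), not its exact return type.

```python
def dummy_columns(values, drop_first=True):
    """Build one 0/1 indicator column per category, as a dict of lists."""
    categories = sorted(set(values))
    if drop_first:
        # The dropped category becomes the baseline (all remaining columns are 0)
        categories = categories[1:]
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

# Made-up sample of the Gender column
gender = ["Male", "Female", "Female", "Male"]
dummies = dummy_columns(gender)
print(dummies)
```

With two categories, one column is enough: a 0 in the Male column already implies Female. Depending on the pandas version, get_dummies may return True/False values rather than 1/0.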
Problem Setting
• To predict purchase behaviors by Age and EstimatedSalary
import pandas as pd
datasets = pd.read_csv('Social_Network_Ads.csv')
X = datasets.iloc[:, [2,3]].values #Age and EstimatedSalary in 2D format
Y = datasets.iloc[:, 4].values #Purchased
Train Test Split in Python
• Split the dataset into training and test datasets with 80%:20%
• The train_test_split() function randomly splits the given dataset into train and test subsets
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
from sklearn.model_selection import train_test_split
X_Train, X_Test, Y_Train, Y_Test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
print(X_Train.shape) #2D format; training set size
print(X_Test.shape)
Feature Scaling
• Feature scaling is the process of normalizing the range of
features in a dataset
• After scaling, the training and test datasets are on a comparable scale
•Standardize features by removing the mean and scaling to
unit variance:
$z = \frac{x - \mu}{\sigma}$
$\mu$: mean of the training samples
$\sigma$: standard deviation of the training samples
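Standardization subtracts the training mean and divides by the training standard deviation, and the same two numbers are reused for the test set. Below is a minimal library-free sketch, assuming hypothetical helpers fit_scaler and transform and made-up ages; it uses the population standard deviation, which is what StandardScaler uses.

```python
import statistics

def fit_scaler(train_values):
    """Learn the mean and (population) standard deviation from training data only."""
    mean = statistics.mean(train_values)
    std = statistics.pstdev(train_values)
    return mean, std

def transform(values, mean, std):
    """Apply z = (x - mean) / std with parameters learned from the training set."""
    return [(v - mean) / std for v in values]

# Made-up ages for illustration
train_age = [20.0, 30.0, 40.0]
test_age = [50.0]

mean, std = fit_scaler(train_age)
train_scaled = transform(train_age, mean, std)
test_scaled = transform(test_age, mean, std)  # reuses the training mean and std
print(train_scaled, test_scaled)
```

Fitting the scaler on the training set only keeps information about the test set out of the model.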
Feature Scaling in Python
•Standardization
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_Train = sc_X.fit_transform(X_Train)
X_Test = sc_X.transform(X_Test)
Popular Algorithms for Classification
• Logistic Regression
•Decision Tree
•Support Vector Machine
•Artificial Neural Network
Logistic Regression
• Linearly separable classes
• A linear model for binary classification
• Predict the probability of outcomes by fitting the relationships among inputs
Spam detection example
Decision boundary
•Suppose we want to output the probability of an email being
spam/ham instead of just 0 or 1
•Use a sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$ that maps the linear
combination of features $z$ to a value between 0 and 1
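The sigmoid maps any real number to the interval (0, 1) by computing 1 / (1 + e^(-z)). The sketch below applies it to a linear combination of two features; the weights, bias, and feature values are hypothetical numbers chosen only for illustration.

```python
import math

def sigmoid(z):
    """Map any real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights, bias, and feature values (not fitted to any data)
weights = [0.8, -0.5]
bias = 0.1
features = [2.0, 1.0]

# Linear combination of the features, then squashed into a probability
z = bias + sum(w * x for w, x in zip(weights, features))
p = sigmoid(z)
print(round(z, 4), round(p, 4))
```

A value of z = 0 gives a probability of exactly 0.5; large positive z pushes the output towards 1 and large negative z towards 0.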
Logistic Regression
• Linear regression
• Logistic regression
Odds ratio
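For reference, the standard textbook relationship between the two models can be sketched as follows. The symbols $\beta_0$, $\beta_1$, $x$, $y$, and $p$ are assumed notation for a single predictor, not taken from the slide; $p$ is the probability of the positive class.

```latex
\begin{aligned}
\text{Linear regression:}\quad & y = \beta_0 + \beta_1 x \\
\text{Logistic regression:}\quad & p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} \\
\text{Odds:}\quad & \frac{p}{1 - p} = e^{\beta_0 + \beta_1 x} \\
\text{Log-odds (logit):}\quad & \ln\frac{p}{1 - p} = \beta_0 + \beta_1 x
\end{aligned}
```

In this form, logistic regression is a linear model for the log-odds rather than for the outcome itself.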
Logistic Regression in Python
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(solver = 'liblinear', random_state = 0) #initialize a model object
classifier.fit(X_Train, Y_Train) #after fit(), classifier is an instance of a trained scikit-learn model
print(classifier.classes_) #class labels
print(classifier.intercept_) #fitted intercept
print(classifier.coef_) #fitted coefficients for Age and EstimatedSalary
Logistic Regression in Python
•Predict the probability of future purchase behaviors
Y_Probs = classifier.predict_proba(X_Test)
• Label future purchase behaviors
Y_Pred = classifier.predict(X_Test)
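The two calls are closely related: for a binary classifier, the predicted label is roughly the result of comparing the class-1 probability with 0.5. The sketch below illustrates this with made-up probability rows laid out as [P(class 0), P(class 1)]; the helper label_from_probability is a hypothetical name.

```python
def label_from_probability(prob_positive, threshold=0.5):
    """Turn the probability of class 1 into a 0/1 label."""
    return 1 if prob_positive >= threshold else 0

# Made-up rows in the layout [P(class 0), P(class 1)]
probs = [[0.9, 0.1], [0.3, 0.7], [0.45, 0.55]]

# Column 1 holds the probability of the positive class (Purchased = 1)
labels = [label_from_probability(row[1]) for row in probs]
print(labels)
```

Keeping the probabilities, not just the labels, makes it possible to try other thresholds later.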
No Free Lunch
•The No Free Lunch (NFL) Theorems (Wolpert, 1992a, 1996b;
Schaffer, 1994) state that we can't get learning “for free”
•Each classification algorithm has its inherent biases
•No single classification model enjoys superiority if we don't
make any assumptions about the task
•Compare different algorithms and select the best performing
model
•Decide upon a metric to measure performance
Model Evaluation
•Model evaluation is the process of using different evaluation
metrics to understand a machine learning model's
performance, as well as its strengths and weaknesses.
•A classifier/classification algorithm assigns an instance to one
of a predefined set of categories or classes.
•Binary classification evaluation metrics
⎻ Confusion Matrix
⎻ Calculate Precision, Recall and F1 score
Binary Classification Evaluation Metrics: Confusion matrix
The standard evaluation measure is
classification accuracy:
$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
Is accuracy an adequate measure of predictive performance?
Accuracy may not be a useful measure in cases where
• there is a large class skew
⎻ Is 98% accuracy good if 97% of the instances are negative?
• there are different misclassification costs, e.g. getting a positive
wrong costs more than getting a negative wrong
• we are most interested in a subset of high-confidence
predictions
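The class-skew question can be made concrete with a small sketch. It assumes a made-up test set of 100 instances, 97 negative and 3 positive, and a trivial classifier that always predicts negative.

```python
# Made-up skewed test set: 97 negatives (0) and 3 positives (1)
y_true = [0] * 97 + [1] * 3

# A trivial "classifier" that always predicts the majority class
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Share of actual positives that were found
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_pos = sum(t == 1 for t in y_true)
recall = true_pos / actual_pos

print(accuracy, recall)
```

The accuracy looks high even though the classifier never finds a single positive instance, which is why a 98% score is only a small improvement over this baseline.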
Confusion Matrix
True Positive Rate (TPR) (recall): $TPR = \frac{TP}{TP + FN}$
False Positive Rate (FPR): $FPR = \frac{FP}{FP + TN}$
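Both rates come straight from the four cells of the confusion matrix: TPR is TP / (TP + FN) and FPR is FP / (FP + TN). The following library-free sketch counts the cells for made-up labels; the helper confusion_counts is a hypothetical name.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for 0/1 labels."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

# Made-up actual and predicted labels
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
tpr = tp / (tp + fn)  # share of actual positives predicted positive
fpr = fp / (fp + tn)  # share of actual negatives predicted positive
print(tp, fp, tn, fn, accuracy, tpr, fpr)
```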
ROC Curves
• A receiver operating characteristic
(ROC) curve plots the TP-rate vs.
the FP-rate as a threshold on the
confidence of an instance being
positive is varied
• The area under the ROC curve (AUC)
is the entire two-dimensional
area underneath the ROC curve.
Figure labels: ideal point (top-left corner); expected curve for random guessing (the diagonal)
ROC Curves
• Most classifiers predict a score (a real-valued number or a probability)
• Can set a threshold to decide what to call positive and what to call
negative
• Can adjust the threshold to control the TPR and FPR
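The effect of moving the threshold can be traced by hand. The sketch below lowers the threshold through every distinct score, records the (FPR, TPR) pair each time, and sums the area under the resulting curve with the trapezoid rule. The helpers roc_points and auc_trapezoid and the four scores are illustrative assumptions; roc_curve and roc_auc_score in scikit-learn are the library equivalents.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs as the threshold is lowered step by step."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is called positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc_trapezoid(points):
    """Area under the curve through the given (FPR, TPR) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up labels and predicted scores
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
auc = auc_trapezoid(points)
print(points)
print(auc)
```

Lowering the threshold can only add positive predictions, so both rates move from 0 towards 1 along the curve.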
Binary Classification Evaluation Metrics
•Various other metrics are also used to evaluate
classification performance
•Precision; Recall; F1-score
https://en.wikipedia.org/wiki/Precision_and_recall
Binary Classification Evaluation in Python
•Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_Test, Y_Pred) #rows: actual class (0, 1); columns: predicted class (0, 1)
print('Confusion matrix\n\n', cm)
print('\nTrue Negatives(TN) = ', cm[0,0]) #Actual Negative:0 and Predict Negative:0
print('\nFalse Positives(FP) = ', cm[0,1]) #Actual Negative:0 but Predict Positive:1
print('\nFalse Negatives(FN) = ', cm[1,0]) #Actual Positive:1 but Predict Negative:0
print('\nTrue Positives(TP) = ', cm[1,1]) #Actual Positive:1 and Predict Positive:1
Binary Classification Evaluation in Python
•Accuracy
from sklearn.metrics import accuracy_score
print('Model accuracy score: {0:0.4f}'.format(accuracy_score(Y_Test, Y_Pred)))
•AUC
from sklearn.metrics import roc_auc_score
ROC_AUC = roc_auc_score(Y_Test, Y_Probs[:,1]) #score with the predicted probability of class 1, not the 0/1 labels
print('ROC AUC : {:.4f}'.format(ROC_AUC))
Binary Classification Evaluation in Python
•Plot ROC curve
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
fpr, tpr, thresholds = roc_curve(Y_Test,
Y_Probs[:,1], pos_label=1)
plt.figure(figsize = (6,4))
plt.plot(fpr, tpr, linewidth = 2)
plt.plot([0,1], [0,1], 'k--')
plt.rcParams['font.size'] = 12
plt.title('ROC curve for Purchase Classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.show()
Binary Classification Evaluation in Python
•Precision: accuracy of positive predictions
•Recall: the ability of a classifier to find all positive instances
• F1-score: a weighted harmonic mean of precision and recall
such that the best score is 1.0 and the worst is 0.0
from sklearn.metrics import precision_score, recall_score
print(precision_score(Y_Test, Y_Pred))
print(recall_score(Y_Test, Y_Pred))
from sklearn.metrics import f1_score
print(f1_score(Y_Test, Y_Pred))
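The three scores above reduce to simple ratios of confusion-matrix counts. Here is a minimal sketch, assuming made-up counts (8 true positives, 2 false positives, 4 false negatives) and a hypothetical helper precision_recall_f1; it does not guard against division by zero.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)  # of the predicted positives, how many are right
    recall = tp / (tp + fn)     # of the actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Made-up counts for illustration
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Because the harmonic mean is pulled towards the smaller value, F1 is high only when precision and recall are both high.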
Popular Algorithms for Classification
• Logistic Regression
•Decision Tree (Self-reading)
•Support Vector Machine
•Artificial Neural Network
Python scikit-learn
•Popular machine learning toolkit in Python (http://scikit-learn.org/stable/)
•Supervised learning models using scikit-learn library
(https://scikit-learn.org/stable/supervised_learning.html#supervised-learning)
Questions