Date: 2021-04-10
HKBU – ECON7940
Data Driven Decision Making:
Bootcamp Day 1 & Day 2
Please respect our intellectual property and do not
distribute the material to anyone outside the class


Problem Definition (Cont’d)
Problem Definition: Impact Assessment
There are two ways to define the success of your data project, from
which we can define our success metric.
1. Success in business terms
• E.g. increasing the sale of X product by 15%
2. Success in technical terms
• E.g. 10% increase in the F1 score of the churn model
The success metric makes the impact measurable.
A good starting point for the discussion with business owners:
• Quantify the possible improvement in the metric as
business value
o For example: with a 1% improvement in conversion, how
much revenue will your business earn?
o Find the threshold that marks the “break-even”/significant point
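The break-even computation above can be sketched in a few lines of Python. All figures below (traffic, order value, project cost) are hypothetical, chosen only to illustrate the arithmetic:

```python
# Hypothetical figures: translate a metric improvement into business value.
monthly_visitors = 100_000          # assumed site traffic
revenue_per_sale = 50.0             # assumed average order value
project_cost = 30_000.0             # assumed cost of the data project

def extra_monthly_revenue(uplift):
    """Revenue gained from an absolute uplift in conversion rate."""
    return monthly_visitors * uplift * revenue_per_sale

# With a 1% absolute improvement in conversion:
gain = extra_monthly_revenue(0.01)

# Break-even uplift: the improvement needed to cover the project cost
breakeven_uplift = project_cost / (monthly_visitors * revenue_per_sale)
print(gain, breakeven_uplift)
```

Any uplift above the break-even point (here 0.6 percentage points) makes the project worth discussing in business terms.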

Problem Definition: Timeline Estimation
A reasonable timeline estimate can largely resolve the tension
between senior management and data specialists

Challenges in Timeline Estimation
1. Iterative nature of DS processes
2. Unknown in legacy systems
3. Uncertainty in model performance
4. Communication with data owners
5. Domain specific nature of tasks
6. Uncertain complexity of the tasks

Influential Factors for Time Estimation
1. Data Collection
a. Are the data assets owned by different parties? (Data Silos)
b. Do you need to consolidate different data sources?
c. Is there a data catalog that tells you where to find the
data?
2. Data Exploration
a. Do you have the domain knowledge in the data?
3. Modeling
a. The maturity level of the technology?
(mature/experimental/cutting-edge)
b. Is the model computationally expensive?
4. Delivery
a. Is there any compliance process in the data request?
5. Action
a. Do you need to deploy the model on premise? Or cloud
platforms?
b. How many people use the model simultaneously?
c. How frequently does the model run?
6. Feedback-loop Improvement
a. Is the monitoring infrastructure already established?
b. Does any potential bias need to be handled beforehand?


Data Collection
Where is the data?
• Internal data
• External open source data
• Purchased data
• Collecting data from scratch
Where is the relevant data?
1. Most granular level data
2. The intrinsic characteristics of the key features
3. All other associated data


How is the data situation in your organization?


Ways to collect data internally from scratch
• Transactional Data
o The most common and basic data you should have in your
organization
• Registration & Subscription data
o Require some basic information from customers or visitors
who want to sign up for your email list, rewards program,
etc.
• Online Tracking
o In your organization’s website or the app
o Allows you to see how many people visited your site, how
long they were on it, what they clicked on and more
• Surveys
o The way in which you can directly ask customers for
information
• Online Marketing Analytics
o From online marketing campaigns, for example Google
search, Facebook, YouTube, web traffic, email, or elsewhere
• Social Media Monitoring
o Data in your follower list to see who follows you and what
characteristics they have in common to enhance your
understanding of who your target audience should be
• In-Store Traffic Monitoring
o Monitor the foot traffic in the stores.

What if your data is inconsistent?
• Inconsistent data should be handled through Master Data
Management (MDM)

The common scenario – Data Silos
The common data scenario in organization is Data Silos.
“These silos are isolated islands of data, and they make it prohibitively
costly to extract data and put it to other uses” (excerpted from HBR)
The reasons for Data Silos (from HBR):
• Structural
• Political
• Legacy
• Vendor lock-in

Practical advice for getting the data when it is all over the place
• Be nice
• Start with a small request first
• Prepare some technical knowledge beforehand
• Allow enough time; it takes time

If you are in heaven, just access the data lake with official
processes
The common official processes:
1. Checking the suitable datasets in Data Mart
2. Select the ones you need and request access in the system (or
through internal procedure), giving a solid reason why you need it.
3. Wait for the approval.




Source: https://www.lotame.com/what-are-the-methods-of-data-
collection/




Data Exploration
Data Exploration
• Data Exploration includes the iterative processes of:
– Data Understanding
• Knowing the characteristics of data and spotting the
strange cases
– Data Processing
• Handling the data right to produce meaningful
information
• After these processes, commonly we will deliver a Data Quality
Report

Major Steps of Data Exploration
• Feature meaning and relationships
• Data type check
• Statistical report
• Target check
• Missing value detection/ handling
• Duplication detection/ handling
• Outlier detection/ handling
• Re-formatting
• Encoding
• Feature extraction
• Feature transformation
• Feature creation
• Feature selection

Feature meaning and relationships
Feature Understanding is the most critical task in data exploration
1. Understand the meaning of each feature
2. Understand the relationships between features
3. Understand the relationships between different data sources

Data type check
Check the data type for each feature as it determines the possible
processing steps in later stages

Statistical Report
Describe the data
1. Summary statistics
provide a quick and simple description of the data
2. Unique value check
check the unique values for each feature
3. Distribution check
plot the distribution for each feature
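The three checks above each map to a short pandas call. The DataFrame below is a made-up stand-in for project data:

```python
import pandas as pd

# Toy dataset standing in for your project data (all values illustrative)
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "city": ["HK", "HK", "SZ", "HK", "SZ"],
    "spend": [120.0, 80.5, 300.0, 150.0, 95.0],
})

summary = df.describe()    # 1. summary statistics for numeric columns
uniques = df.nunique()     # 2. unique value count for each feature
# 3. distribution check: df["spend"].hist() would plot it in a notebook
print(summary.loc["mean", "age"], uniques["city"])
```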

Target check
Check the shape of target variable
1. Classification
a. Is it balanced?
2. Regression
a. Is it normal, skewed or multimodal?
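Both target checks can be sketched with toy pandas Series (the labels and values below are invented for illustration):

```python
import pandas as pd

# Classification target: check class balance
y_cls = pd.Series([0, 0, 0, 0, 1])
balance = y_cls.value_counts(normalize=True)   # share of each class

# Regression target: check shape via skewness
y_reg = pd.Series([1.0, 1.2, 0.9, 1.1, 10.0])  # toy long-tailed values
skewness = y_reg.skew()                        # > 0 means right-skewed
print(balance[0], skewness > 0)
```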
Missing Data Detection and Handling
1. Detection
a. Check any column without value
b. Check any strange value in the records e.g. empty string, very
large/small value
2. Handling
a. Impute the missing data with reasonable value
b. Remove the feature if it provides very limited information to
the model
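A pandas sketch of the detection step and both handling options. The data and column names are invented for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [50_000, np.nan, 62_000, np.nan, 58_000],
    "mostly_empty": [np.nan, np.nan, np.nan, np.nan, 1.0],
})

# Detection: count missing values per column
missing = df.isna().sum()

# Handling (a): impute with a reasonable value, e.g. the median
df["income"] = df["income"].fillna(df["income"].median())

# Handling (b): drop a feature that carries almost no information
df = df.drop(columns=["mostly_empty"])
print(missing["income"], df["income"].isna().sum())
```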
Outlier Detection and Handling
1. Detection
a. Calculate the standard score (the standard deviation from
the mean) for each data point
b. Set a threshold (e.g. -3 & 3) and label the data points outside
this range as outliers
2. Handling
a. Remove the outlier
b. cap the outlier to a certain value
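The z-score procedure can be sketched with NumPy on synthetic data; capping to percentiles is shown as one possible handling choice:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.append(rng.normal(10, 1, 50), 50.0)   # one extreme point at 50

# Detection: standard score = deviation from the mean, in units of std
z = (x - x.mean()) / x.std()
outliers = np.abs(z) > 3                     # threshold of +/- 3

# Handling (b): cap values to the 1st/99th percentile instead of removing
capped = np.clip(x, np.percentile(x, 1), np.percentile(x, 99))
print(outliers.sum(), capped.max() < 50)
```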
Duplication Detection and Handling
1. Detection
a. Ask yourself: does it make sense to have the same value in
every column?
b. Compare the value in every column
2. Handling
a. Remove the duplications
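A short pandas sketch of both steps, on toy rows:

```python
import pandas as pd

df = pd.DataFrame({"customer_id": [1, 2, 2, 3],
                   "city": ["HK", "SZ", "SZ", "HK"]})

# Detection: rows identical in every column
dupes = df.duplicated()          # True for the second copy of a row
# Handling: keep the first occurrence, drop the rest
df_clean = df.drop_duplicates()
print(dupes.sum(), len(df_clean))
```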
Re-formatting
1. Check any mismatch on data type and the corresponding ”real
meaning”
2. Change the format to force the algorithm to understand:
a. Text data with only a limited number of unique values
→ change to categorical data
b. Text data that is obviously a datetime → change to date and
time
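Both re-formatting cases map to a single pandas call each (toy data, invented column names):

```python
import pandas as pd

df = pd.DataFrame({
    "plan": ["basic", "pro", "basic", "pro"],
    "signup": ["2021-01-05", "2021-02-11", "2021-03-02", "2021-03-15"],
})

# Text with few unique values -> categorical
df["plan"] = df["plan"].astype("category")
# Text that is obviously a datetime -> datetime type
df["signup"] = pd.to_datetime(df["signup"])
print(df["plan"].dtype, df["signup"].dt.year.iloc[0])
```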
Encoding
1. Identify the features which are not yet numeric
2. To decide the type of encoding, you need to know the data type
and algorithm you are going to use:
a. If the feature is categorical and you are going to use
decision tree, you can use label encoding
b. If the feature is categorical and you are going to use linear
model or neural network, you can use one-hot encoding
c. If the feature is text, you can use Bag of Words or TF-IDF
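A pandas-only sketch of the first two encoding options on a toy categorical feature (scikit-learn offers equivalent encoders, and TF-IDF would come from its text tools):

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "blue", "red", "green"]})

# Label encoding (fine for tree-based models): one integer per category
df["colour_label"] = df["colour"].astype("category").cat.codes

# One-hot encoding (for linear models / neural networks)
one_hot = pd.get_dummies(df["colour"], prefix="colour")
print(df["colour_label"].tolist(), list(one_hot.columns))
```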

Feature Extraction
Feature extraction is a process of dimensionality reduction which
reduces the number of features by creating the new features from
the existing ones.
Some common feature extraction techniques are:
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis
• Latent Dirichlet Allocation
• t-SNE
• etc.
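As one concrete instance, PCA can be sketched directly with NumPy’s SVD on centred data. The features below are synthetic (two strongly correlated columns plus one independent one); in practice you would likely use scikit-learn’s PCA:

```python
import numpy as np

rng = np.random.default_rng(1)
base = rng.normal(size=(100, 1))
# 3 features: base, an almost-duplicate of base, and an independent one
X = np.hstack([base,
               2 * base + 0.01 * rng.normal(size=(100, 1)),
               rng.normal(size=(100, 1))])

Xc = X - X.mean(axis=0)                  # centre each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:2].T                # keep the top 2 principal components
explained = (S ** 2) / (S ** 2).sum()    # variance share per component
print(X_reduced.shape, explained[:2].sum() > 0.95)
```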

Feature Transformation
1. Identify the numeric features which the range or scale are not
desirable
2. To decide the type of transformation, you need to know the
algorithm you are going to use:
a. If you are going to use decision tree, no need to do
transformation
b. If you are going to use linear model or neural network,
standardization is always a good choice
c. If the feature distribution has long tail, you can use log
transformation
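A NumPy sketch of the two transformations on toy long-tailed data (`log1p` is used so zeros are handled safely):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # long tail: one huge value

# Standardization (for linear models / neural networks)
standardized = (x - x.mean()) / x.std()      # mean 0, std 1

# Log transformation (for long-tailed distributions)
logged = np.log1p(x)                         # compresses the tail
print(round(standardized.mean(), 10), logged.max())
```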
Feature Creation
Feature Creation is to create features from data using domain knowledge

Feature Selection
Feature Selection is removing non-informative or redundant predictors
from the model
1. Select well-performing features
2. Select based on feature importance
3. Select based on correlation

Separation of “X & y”
The variables for prediction: X – also called:
• Features
• Independent variables
• Attributes
• Fields
• Etc.
The predicted values: y – also called:
• Target variables
• Dependent variables
• Ground truth
• Etc.
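In code, the separation is usually a one-liner per side. The DataFrame and the target name "churned" below are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 32, 47],
                   "spend": [120.0, 80.5, 300.0],
                   "churned": [0, 1, 0]})

X = df.drop(columns=["churned"])   # features / independent variables
y = df["churned"]                  # target / dependent variable / ground truth
print(list(X.columns), y.sum())
```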

Data Leakage
A leak is a situation where a variable (feature) collected in historical data
gives information on the target variable (predicted value)



Modeling
Choosing a proper algorithm is not easy as there are two challenges:
1. “No Free Lunch” Theorem
there is no one model that works best for every problem
2. Balance between different objectives
accuracy is not the only objective
4-Objective Principle
There are 4 main objectives to balance when choosing the algorithm:
1. Accuracy
2. Latency
3. Resource
4. Explainability

Modeling Workflow
1. Setup Baseline
Evaluate the difficulty of the problem
2. Select algorithms
Find a set of candidate algorithms
3. Development
Develop the selected algorithms
4. Hyperparameter Tuning
Find the best configuration of the model

Setup Baseline
Baseline Model is a naïve model that evaluates the difficulty of the
problem and assesses the feasibility of reaching a production standard

Common baseline:
1. Random prediction
2. Human Performance
3. Simple ML algorithm

Procedure:
1. Split the dataset into training and testing sets. Training set is used
to train the model and the testing set is withheld until the
evaluation
2. Select a baseline algorithm to train a baseline model with the
training set
3. Use the trained baseline model to predict the result in the testing
set
4. Compare the prediction result in the testing set with the ground
truth and evaluate the performance of the baseline model
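The four-step procedure can be sketched with a majority-class baseline, the simplest possible naïve model. The labels below are toy data standing in for a real split:

```python
# 1. Assume the dataset is already split into training and testing labels
train_y = [0, 0, 0, 1, 0, 1, 0, 0]
test_y  = [0, 1, 0, 0]

# 2. "Train": the naive model just memorises the most common class
majority = max(set(train_y), key=train_y.count)

# 3. Predict the majority class for every testing point
preds = [majority] * len(test_y)

# 4. Compare predictions with the ground truth
accuracy = sum(p == t for p, t in zip(preds, test_y)) / len(test_y)
print(majority, accuracy)
```

Any candidate model must beat this score before it is worth further work.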

Select Algorithms
1. Follow the algorithm roadmap
2. Disqualify some candidates using the 4-objective principle

Procedure:
1. Replace the baseline model with the candidate in the pool one by
one
2. Evaluate the candidate model and compare the model with the
baseline score
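That candidate loop can be sketched with scikit-learn classifiers on the built-in iris data; the 0.75 baseline score here is a placeholder, not a result from the course:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

baseline_score = 0.75   # placeholder baseline accuracy
candidates = {"logreg": LogisticRegression(max_iter=500),
              "tree": DecisionTreeClassifier(random_state=0)}

scores = {}
for name, model in candidates.items():       # swap candidates in one by one
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)   # accuracy on the held-out set

better = {n: s for n, s in scores.items() if s > baseline_score}
print(scores)
```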

Evaluation
Evaluation is a process to estimate the model performance by
simulating the real-world environment

1. Offline Evaluation
Evaluate the model performance by a withheld unseen
dataset
2. Online Evaluation
Evaluate the model performance by allocating a small
proportion of real-world traffic in the production
environment

Offline Evaluation
Split the dataset into training and validation sets so that the
validation set can serve as the unseen dataset for evaluation

Online Evaluation
Compares model A and model B by allocating traffic to two models

Regression Metrics

MSE (Mean Squared Error)
Interpretation: the average squared error between the truth and the prediction
Characteristics:
1. It puts higher weight on bigger errors than on smaller errors
2. It is sensitive to outliers

MAE (Mean Absolute Error)
Interpretation: the average of the absolute value of the error
Characteristics:
1. The weight of the error is the same for big and small errors
2. It is difficult to optimize

MAPE (Mean Absolute Percentage Error)
Interpretation: the average error in percentage terms
Characteristics:
1. It is easy to interpret
2. The magnitude of the error does not count
3. It is insensitive to outliers

R-squared
Interpretation: the percentage of variance that can be explained by the model
Characteristics:
1. Easy to understand
2. Lacks information on the magnitude of the error
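All four regression metrics are easy to compute directly with NumPy; the predictions below are toy numbers:

```python
import numpy as np

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

err = y_true - y_pred
mse  = np.mean(err ** 2)                     # penalises big errors heavily
mae  = np.mean(np.abs(err))                  # equal weight for every error
mape = np.mean(np.abs(err / y_true)) * 100   # average error in percent
r2   = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
print(mse, mae, mape, r2)
```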

Classification Metrics

Accuracy
Interpretation: the percentage of correct predictions across all classes
Characteristics:
1. Easy to understand
2. Considers all classes
3. Weights are proportional to class size, which may not always be desirable

Precision
Interpretation: for every 100 points predicted positive by the model, how many of them are actually positive
Characteristics:
1. Focuses on the positive class
2. Applicable when the cost of acting on a prediction is high

Recall
Interpretation: for every 100 positive data points, how many of them are predicted positive
Characteristics:
1. Focuses on the positive class
2. Applicable when the cost of missing a positive is high

F1-Score
Interpretation: a balance between Precision and Recall
Characteristics:
1. Focuses on the positive class
2. Applicable when the positive class is much more important and is the minority class

*AUC (Area Under Curve)
Interpretation: how well the positive class can be separated from the negative class
Characteristics:
1. Robust to imbalanced distributions
2. Focuses on the ranking rather than the actual predictions

*Reference links for AUC concept:
https://github.com/dariyasydykova/open_projects/tree/master/ROC_animation (Dynamic
illustration)
https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
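Apart from AUC, these metrics can all be derived from the four confusion-matrix counts. A pure-Python sketch with illustrative counts:

```python
# Illustrative counts: true/false positives and negatives
tp, fp, fn, tn = 5, 8, 2, 85

accuracy  = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # of all predicted positives, how many are right
recall    = tp / (tp + fn)   # of all actual positives, how many are found
f1        = 2 * precision * recall / (precision + recall)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```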



Confusion Matrix
Confusion matrix is a commonly used metric for classification
problems

Precision and Recall
Precision: for those model predicted as positive, how many of them
are truly positive?
Recall: for those which are truly positive, how many of them are
predicted as positive?
[Confusion matrix figure: predicted class vs. actual class]

F1 Score
F1 Score is the balance of Precision and Recall




Model Refinement
Hyper-parameter is a configuration to the model
Hyper-parameter tuning is a process to try different configurations
for the same model to get the best performance

Worked example (with TP = 5, FP = 8, FN = 2 from the confusion matrix):
Precision = TP / (TP + FP) = 5/13 ≈ 0.38
Recall = TP / (TP + FN) = 5/7 ≈ 0.71
F1 Score = 2 × Precision × Recall / (Precision + Recall)
= 2 × 0.38 × 0.71 / (0.38 + 0.71) ≈ 0.5
The ways to get a better result

If you want to improve the trained model, you can try the following:
• Hyperparameter-tuning for the current model
• Trying the other models (even can ensemble this one with the
others)
• Going back to work on the data again, including getting more
data, trying different feature engineering, etc.

Hyper-parameter searching methods

Grid search
• Exhaustively searches over a pre-defined grid of
hyperparameter values
Random search
• Samples random combinations of hyperparameter values from
the search space
Bayesian optimization
• By evaluating hyperparameters that appear more promising
from past results, Bayesian methods can find better model
settings than random search in fewer iterations.
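A minimal grid-search sketch with scikit-learn’s GridSearchCV, using the built-in iris data and an arbitrary parameter grid (random and Bayesian search follow the same fit/score pattern):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
param_grid = {"max_depth": [2, 3, 5], "min_samples_leaf": [1, 5]}

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X, y)   # tries every combination with 3-fold cross-validation
print(search.best_params_, search.best_score_)
```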



Delivery
Data Storytelling
Six key concepts in “Storytelling with Data” (by Cole Nussbaumer
Knaflic)

• Understand the context
• Choose an appropriate visualization
• Decluttering
• Focus your audience’s attention
• Think like a designer
• Make it a story

1. Understand the context

Who: Who are you communicating to?
What: What do you need them to know or do?
How: How will you communicate with them?

2. Choose an appropriate visualization

Simple text – When you have just a number or two to share.
Tables – To communicate various units of measure.
Heat map – To visualize data in tabular format but leveraging color to
convey the relative magnitude of the numbers.
Scatterplot – Showing the relationship between two things
(measures).
Lines – Most commonly used to plot data in time.
Slope graph – Useful when you have two time periods to compare.
Column chart (vertical) – Easy to see quickly which category is the
biggest, smallest, and the incremental difference between categories.
Stacked Column chart (vertical) – Used to show the totals across
different categories but also subcomponent pieces vertically.
Bar chart (horizontal) – Similar to column chart but horizontal, use if
your category names are long.
Stacked bar chart (horizontal) – Used to show the totals across
different categories but also subcomponent pieces.
Waterfall chart – Used to pull apart the pieces of a stacked bar chart
to show a starting point, differences, and the ending point.

3. Decluttering

• Less is more
• Alignment, emphasis & white space

4. Focus your audience’s attention

• Size
• Color
o “One level” focus & “Two level” focus
o Use color sparingly
• Position
o Always put the important information at the top

5. Think like a designer

Make your graph as simple as possible
Consider affordances – Make things easy to understand
• Highlight important stuff
• Eliminate distractions
• Create a clear hierarchy of information
• Use annotations

6. Make it a story (in short)

Make your presentation as a story by breaking it into 3 parts:
• Beginning (plot) - “Building the context for your audience”
• Middle (twists) - “Addressing how they can solve the problem
you introduced. You’ll work to convince them why they should
accept the solution you are proposing or act in the way you
want them to”
• End (call to action) - “Make it totally clear to your audience what
you want them to do with the new understanding or knowledge
that you’ve imparted to them.”

Result Communication
Storytelling

Storytelling elements
• Repetition
• Beginning, middle and end (“Plot, twist, call to action”, or
“setup, conflict and resolution”)
• Emotional
The plot
• Using “setup, conflict and resolution” and gathering what you
have done

Outline the content

2 levels of presentations:
• Whole story (long)
• Elevator pitch (very short)
Tool:
• Storyboard with post-it or on the paper
• create headlines
• play with the orders

Know your audience and refine your visualizations

• Know clearly about “Who, What & How”
• Make everything on point

Wrap them up in a good story

Beginning (plot)
• Background
• Users
• Unresolved problem
• Desired outcome
And think of the conflict or the audience’s problem, focusing on
“What is happening, why is it important and what you want to change”

Middle (twist)
• Tell them how you can help them solve the problem you
mentioned
• Tell what will happen if no action is taken
• Give them options to solve the problem

End (Call to action)

• Make it very clear what you want the stakeholders to do with
the new information/insights you provided


Source:
• “Storytelling with Data” by Cole N. Knaflic
• https://vizard.co/storytelling-with-data/


Action
Production is the final stage of model development. A productionized
ML service is a model ready to be consumed by end users/other
systems
Deployment is the process of transforming the POC model into a
productionized model

Decisions in Deployment
1. When
a. What is the deployment mode for training and
prediction?
2. How
a. How can we implement the deployment mode?
b. What is the deployment architecture?
c. Can our infrastructure support it?
3. Where
a. Where can the computation take place? Edge device,
server or database?

Deployment Mode

Train
• One Off
only train once
• Batch
train in a certain frequency
• Online
train continuously
Predict
• Batch
predict in a certain frequency
• Real time
predict whenever an event is triggered

Considerations of Deployment Mode

Train
1. Is the application time sensitive?
2. How rapidly should the application adapt to the latest trend?
3. Does the distribution (event patterns) change drastically?
Predict
1. Is the prediction impossible to pre-compute?
2. Is the volume of predictions so large that batch
prediction cannot be finished in a reasonable time?
3. Can the prediction be finished in a reasonable duration?

Advantages of Real Time
• Fast adaptation to distribution change
• Instant response to various inputs

Challenges of Real Time
• Data
Features must be extracted on demand, which imposes extra
design complexity on the database
• Infrastructure
Infrastructure should be able to handle the load peak and be
available anytime
• Development
Real-time applications are much more difficult to develop and
debug
• Cost
More computational power is needed and therefore the cost is
higher

Batch Deployment
Batch deployment usually uses a deployment platform to schedule
and monitor jobs

• You only need to define the logic and dependencies of jobs
• The deployment platform will take care of the infrastructure
issues

Real Time Deployment
Three components of real time deployment
1. Event
The event is the trigger of the prediction. So, what is the
source of the event?
2. Input
Where can the model find its input? Does it come from the
event or from somewhere else?
3. Model
Where is the model deployed? E.g. server, database, big data
streaming platform, edge device, etc.


Feedback loop improvement
Two components to continuous improvement:
1. Monitoring
Keep track of the state of the model to prevent performance
drops
2. Closed Feedback Loop
Build the closed feedback loop so that the model can continue
learning

Four Aspects of Monitoring
1. Outlier Detection
Identify potential outliers from the data and infrastructure
perspectives
2. Drift Detection
Identify the potential drift in data distribution which may
make the prediction fail
3. Performance Monitoring
Monitor the performance of the model and infrastructure
4. Health Check
Monitor the state of jobs and infrastructure

Closed Feedback Loop

Closed Feedback Loop is the source of power of ML as it enables
continuous improvement:
1. Model predicts the result based on training data from the
clients
2. Clients use the model prediction, and their response serves
as new training data

Negative Feedback Loop
A closed feedback loop can also have negative effects:
1. It may accumulate bias
2. The model may capture a pattern and reinforce it even
when the pattern is unethical
