MANG 6554
Advanced Analytics
Libo Li
2021-2022
Libo.li@soton.ac.uk
Learning objectives
• Gain a basic understanding of Bayes theorem
• Become familiar with Bayesian classifiers (naïve Bayes and Bayesian belief networks)
• Develop a philosophical understanding of Bayesian inference and its relation to statistical model formulations.
Tan, Pang-Ning, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Pearson Education India, 2016.
Bayes Classifier
A probabilistic framework for solving classification problems
Conditional probability:
  P(Y | X) = P(X, Y) / P(X)
  P(X | Y) = P(X, Y) / P(Y)
Bayes theorem:
  P(Y | X) = P(X | Y) P(Y) / P(X)
Example of Bayes Theorem
Given:
• A salesperson knows that if a customer has already bought a keyboard, there is a 50% chance he/she will also buy a mouse.
• The prior probability of any customer purchasing a keyboard is 1/50.
• The prior probability of any customer purchasing a mouse is 1/20.
If a customer has bought a mouse, what is the probability he/she buys a keyboard?
P(k | m) = P(m | k) P(k) / P(m) = (0.5 × 1/50) / (1/20) = 0.2
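A quick numeric check of this calculation — a minimal Python sketch (the variable names are my own):

```python
# Bayes theorem: P(k | m) = P(m | k) * P(k) / P(m)
p_m_given_k = 0.5   # P(mouse | keyboard)
p_k = 1 / 50        # prior P(keyboard)
p_m = 1 / 20        # prior P(mouse)

print(p_m_given_k * p_k / p_m)  # 0.2
```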
Using Bayes Theorem for Classification
Consider each attribute and class label as random variables
Given a record with attributes (X1, X2,…, Xd)
• Goal is to predict class Y
• Specifically, we want to find the value of Y that maximizes P(Y| X1,
X2,…, Xd )
Can we estimate P(Y| X1, X2,…, Xd ) directly from data?
LL
6
Example Data
Given a test record:
X = (Refund = No, Divorced, Income = 120K)
Can we estimate
P(Evade = Yes | X) and P(Evade = No | X)?
In the following we will replace
Evade = Yes by Yes, and
Evade = No by No
Id  Refund  Marital Status  Taxable Income (K)  Evade (Y)
1   Yes     Single          125                 No
2   No      Married         100                 No
3   No      Single          70                  No
4   Yes     Married         120                 No
5   No      Divorced        95                  Yes
6   No      Married         60                  No
7   Yes     Divorced        220                 No
8   No      Single          85                  Yes
9   No      Married         75                  No
10  No      Single          90                  Yes
Using Bayes Theorem for Classification
Approach:
• Compute the posterior probability P(Y | X1, X2, …, Xd) using Bayes theorem
• Maximum a-posteriori: choose the Y that maximizes P(Y | X1, X2, …, Xd)
• Equivalent to choosing the value of Y that maximizes P(X1, X2, …, Xd | Y) P(Y)
How to estimate P(X1, X2, …, Xd | Y)?
P(Y | X1, X2, …, Xd) = P(X1, X2, …, Xd | Y) P(Y) / P(X1, X2, …, Xd)
Example Data
Given a test record:
X = (Refund = No, Divorced, Income = 120K)
(Training data table repeated from the Example Data slide.)
Naïve Bayes Classifier
Assume independence among the attributes Xi when the class is given:
• P(X1, X2, …, Xd | Yj) = P(X1 | Yj) P(X2 | Yj) … P(Xd | Yj)
• Now we can estimate P(Xi | Yj) for all Xi and Yj combinations from the training data
• A new point is classified as Yj if P(Yj) ∏ P(Xi | Yj) is maximal.
Naïve Bayes on Example Data
Given a test record:
X = (Refund = No, Divorced, Income = 120K)
P(X | Yes) = P(Refund = No | Yes)
           × P(Divorced | Yes)
           × P(Income = 120K | Yes)

P(X | No) = P(Refund = No | No)
          × P(Divorced | No)
          × P(Income = 120K | No)
(Training data table repeated from the Example Data slide.)
Estimate Probabilities from Data
Class: P(Y) = Nc/N
– e.g., P(No) = 7/10,
P(Yes) = 3/10
For categorical attributes:
  P(Xi | Yk) = |Xik| / Nc
– where |Xik| is the number of instances having attribute value Xi and belonging to class Yk
– Examples:
  P(Status = Married | No) = 4/7
  P(Refund = Yes | Yes) = 0
(Training data table repeated from the Example Data slide.)
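These counts can be reproduced directly from the table — a minimal Python sketch of the categorical estimates, with the training data hard-coded from the slide:

```python
# (refund, marital status, taxable income, evade) rows from the table
records = [
    ("Yes", "Single", 125, "No"),   ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),     ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),  ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),    ("No", "Single", 90, "Yes"),
]

def p_attr_given_class(attr_idx, value, cls):
    """P(Xi = value | Y = cls) estimated as |Xik| / Nc."""
    in_class = [r for r in records if r[3] == cls]
    return sum(r[attr_idx] == value for r in in_class) / len(in_class)

print(p_attr_given_class(1, "Married", "No"))  # 4/7
print(p_attr_given_class(0, "Yes", "Yes"))     # 0.0
```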
Estimate Probabilities from Data
For continuous attributes:
– Discretization: Partition the range into bins:
Replace continuous value with bin value
Attribute changed from continuous to ordinal
– Probability density estimation:
Assume attribute follows a normal distribution
Use data to estimate parameters of distribution
(e.g., mean and standard deviation)
Once the probability distribution is known, use it to estimate the conditional probability P(Xi | Y)
Estimate Probabilities from Data
Normal distribution:
– One for each (Xi, Yj) pair
For (Income, Class = No):
– sample mean = 110
– sample variance = 2975
P(Xi | Yj) = (1 / √(2π σij²)) exp(−(Xi − μij)² / (2 σij²))

P(Income = 120 | No) = (1 / (√(2π) · 54.54)) exp(−(120 − 110)² / (2 · 2975)) ≈ 0.0072
(Training data table repeated from the Example Data slide.)
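A quick check of the density value above — a minimal Python sketch:

```python
import math

def gaussian_pdf(x, mean, variance):
    """Normal density used for continuous attributes in naive Bayes."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

print(gaussian_pdf(120, mean=110, variance=2975))  # ≈ 0.0072
```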
Example of Naïve Bayes Classifier
Given a test record:
X = (Refund = No, Divorced, Income = 120K)

P(X | No) = P(Refund = No | No) × P(Divorced | No) × P(Income = 120K | No)
          = 4/7 × 1/7 × 0.0072 = 0.0006

P(X | Yes) = P(Refund = No | Yes) × P(Divorced | Yes) × P(Income = 120K | Yes)
           = 1 × 1/3 × 1.2 × 10^-9 = 4 × 10^-10

Since P(X | No) P(No) > P(X | Yes) P(Yes), we have P(No | X) > P(Yes | X)
=> Class = No
Naïve Bayes Classifier:
P(Refund = Yes | No) = 3/7
P(Refund = No | No) = 4/7
P(Refund = Yes | Yes) = 0
P(Refund = No | Yes) = 1
P(Marital Status = Single | No) = 2/7
P(Marital Status = Divorced | No) = 1/7
P(Marital Status = Married | No) = 4/7
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0
For Taxable Income:
If class = No: sample mean = 110
sample variance = 2975
If class = Yes: sample mean = 90
sample variance = 25
P(No) = 7/10
P(Yes) = 3/10
P(Yes | X) = P(X | Yes) P(Yes) / P(X)
P(No | X) = P(X | No) P(No) / P(X)
P(Yes | X) + P(No | X) = 1
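Putting the pieces together, a minimal Python sketch of this naive Bayes decision for the test record, with all probabilities hard-coded from the slide:

```python
import math

def gaussian_pdf(x, mean, variance):
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

prior = {"No": 7 / 10, "Yes": 3 / 10}
p_refund_no = {"No": 4 / 7, "Yes": 1.0}        # P(Refund = No | class)
p_divorced = {"No": 1 / 7, "Yes": 1 / 3}       # P(Divorced | class)
income = {"No": (110, 2975), "Yes": (90, 25)}  # (mean, variance) per class

# Unnormalized posterior P(Y) * P(X | Y) for X = (Refund = No, Divorced, 120K)
scores = {}
for cls in ("No", "Yes"):
    mean, var = income[cls]
    scores[cls] = prior[cls] * p_refund_no[cls] * p_divorced[cls] * gaussian_pdf(120, mean, var)

print(scores)                       # {'No': ~4.1e-04, 'Yes': ~1.2e-10}
print(max(scores, key=scores.get))  # 'No'
```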
Example of Naïve Bayes Classifier
Given a test record:
X = (Refund = No, Divorced, Income = 120K)
P(Yes) = 3/10
P(No) = 7/10
P(Yes | Divorced) = 1/3 × 3/10 / P(Divorced)
P(No | Divorced) = 1/7 × 7/10 / P(Divorced)
P(Yes | Refund = No, Divorced) = 1 × 1/3 × 3/10 / P(Divorced, Refund = No)
P(No | Refund = No, Divorced) = 4/7 × 1/7 × 7/10 / P(Divorced, Refund = No)
Naïve Bayes Classifier:
P(Refund = Yes | No) = 3/7
P(Refund = No | No) = 4/7
P(Refund = Yes | Yes) = 0
P(Refund = No | Yes) = 1
P(Marital Status = Single | No) = 2/7
P(Marital Status = Divorced | No) = 1/7
P(Marital Status = Married | No) = 4/7
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0
For Taxable Income:
If class = No: sample mean = 110
sample variance = 2975
If class = Yes: sample mean = 90
sample variance = 25
Issues with Naïve Bayes Classifier
P(Yes) = 3/10
P(No) = 7/10
P(Yes | Married) = 0 × 3/10 / P(Married)
P(No | Married) = 4/7 × 7/10 / P(Married)
Naïve Bayes Classifier:
P(Refund = Yes | No) = 3/7
P(Refund = No | No) = 4/7
P(Refund = Yes | Yes) = 0
P(Refund = No | Yes) = 1
P(Marital Status = Single | No) = 2/7
P(Marital Status = Divorced | No) = 1/7
P(Marital Status = Married | No) = 4/7
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0
For Taxable Income:
If class = No: sample mean = 110
sample variance = 2975
If class = Yes: sample mean = 90
sample variance = 25
Issues with Naïve Bayes Classifier
Naïve Bayes Classifier:
P(Refund = Yes | No) = 2/6
P(Refund = No | No) = 4/6
P(Refund = Yes | Yes) = 0
P(Refund = No | Yes) = 1
P(Marital Status = Single | No) = 2/6
P(Marital Status = Divorced | No) = 0
P(Marital Status = Married | No) = 4/6
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0/3
For Taxable Income:
If class = No: sample mean = 91, sample variance = 685
If class = Yes: sample mean = 90, sample variance = 25
Consider the table with Tid = 7 deleted
Given X = (Refund = Yes, Divorced, 120K):
P(X | No) = 2/6 × 0 × 0.0083 = 0
P(X | Yes) = 0 × 1/3 × 1.2 × 10^-9 = 0
Naïve Bayes will not be able to classify X as Yes or No!
(Training data table repeated from the Example Data slide.)
Issues with Naïve Bayes Classifier
If one of the conditional probabilities is zero, then the entire expression
becomes zero
Need to use other estimates of conditional probabilities than simple fractions
Probability estimation:

Original:   P(Ai | C) = Nic / Nc
Laplace:    P(Ai | C) = (Nic + 1) / (Nc + c)
m-estimate: P(Ai | C) = (Nic + m p) / (Nc + m)
c: number of classes
p: prior probability of the class
m: parameter
Nc: number of instances in the class
Nic: number of instances having attribute value Ai in class c
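To see the effect numerically, a minimal Python sketch of the three estimators, applied to P(Marital Status = Divorced | No) with Tid = 7 deleted (Nic = 0, Nc = 6). The settings c = 3, m = 3, and p = 1/3 are illustrative assumptions:

```python
def original_estimate(n_ic, n_c):
    return n_ic / n_c

def laplace_estimate(n_ic, n_c, c):
    return (n_ic + 1) / (n_c + c)

def m_estimate(n_ic, n_c, m, p):
    return (n_ic + m * p) / (n_c + m)

print(original_estimate(0, 6))       # 0.0 -> wipes out the whole product
print(laplace_estimate(0, 6, c=3))   # ≈ 0.111, no longer zero
print(m_estimate(0, 6, m=3, p=1/3))  # ≈ 0.111
```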
Example of Naïve Bayes Classifier
Name Give Birth Can Fly Live in Water Have Legs Class
human yes no no yes mammals
python no no no no non-mammals
salmon no no yes no non-mammals
whale yes no yes no mammals
frog no no sometimes yes non-mammals
komodo no no no yes non-mammals
bat yes yes no yes mammals
pigeon no yes no yes non-mammals
cat yes no no yes mammals
leopard shark yes no yes no non-mammals
turtle no no sometimes yes non-mammals
penguin no no sometimes yes non-mammals
porcupine yes no no yes mammals
eel no no yes no non-mammals
salamander no no sometimes yes non-mammals
gila monster no no no yes non-mammals
platypus no no no yes mammals
owl no yes no yes non-mammals
dolphin yes no yes no mammals
eagle no yes no yes non-mammals
Give Birth Can Fly Live in Water Have Legs Class
yes no yes no ?
P(A | M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
P(A | N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042
P(A | M) P(M) = 0.06 × 7/20 = 0.021
P(A | N) P(N) = 0.0042 × 13/20 ≈ 0.0027
A: attributes
M: mammals
N: non-mammals
P(A|M)P(M) > P(A|N)P(N)
=> Mammals
Naïve Bayes (Summary)
Robust to isolated noise points
Handle missing values by ignoring the instance during probability
estimate calculations
Robust to irrelevant attributes
Independence assumption may not hold for some attributes
– Use other techniques such as Bayesian Belief Networks (BBN)
Conditional Independence
X and Y are conditionally independent given Z if P(X | Y, Z) = P(X | Z)
Example: Arm length and reading skills
– A young child has shorter arm length and more limited reading skills than an adult
– If age is fixed, there is no apparent relationship between arm length and reading skills
– Arm length and reading skills are conditionally independent given age
Bayesian Belief Networks
Provides graphical representation of probabilistic relationships among a
set of random variables
Consists of:
– A directed acyclic graph (DAG)
  Each node corresponds to a variable
  Each arc corresponds to a dependence relationship between a pair of variables
– A probability table associating each node with its immediate parents
[Figure: example DAG over nodes A, B, and C]
Conditional Independence
A node in a Bayesian network is conditionally independent of all of its nondescendants if its parents are known
[Figure: DAG over nodes A, B, C, D illustrating the relationships below]
D is parent of C
A is child of C
B is descendant of D
D is ancestor of A
Conditional Independence
Naïve Bayes assumption:
[Figure: naïve Bayes as a Bayesian network — a class node y with an arc to each attribute node X1, X2, X3, X4, …, Xd]
Probability Tables
If X does not have any parents, table contains prior probability P(X)
If X has only one parent (Y), table contains conditional probability
P(X|Y)
If X has multiple parents (Y1, Y2,…, Yk), table contains conditional
probability P(X|Y1, Y2,…, Yk)
[Figure: node Y with an arc to its child X]
Example of Bayesian Belief Network
[Figure: Bayesian belief network with arcs Exercise → Heart Disease, Diet → Heart Disease, Heart Disease → Chest Pain, and Heart Disease → Blood Pressure]
P(E):
  Exercise = Yes   0.7
  Exercise = No    0.3

P(D):
  Diet = Healthy     0.25
  Diet = Unhealthy   0.75

P(HD | E, D):
           E=Yes        E=Yes          E=No         E=No
           D=Healthy    D=Unhealthy    D=Healthy    D=Unhealthy
  HD=Yes   0.25         0.45           0.55         0.75
  HD=No    0.75         0.55           0.45         0.25

P(CP | HD):
           HD=Yes   HD=No
  CP=Yes   0.8      0.01
  CP=No    0.2      0.99

P(BP | HD):
            HD=Yes   HD=No
  BP=High   0.85     0.2
  BP=Low    0.15     0.8
Example of Inferencing using BBN
Given: X = (E = No, D = Healthy, CP = Yes, BP = High)
– Compute P(HD | E, D, CP, BP)
P(HD = Yes | E = No, D = Healthy) = 0.55
P(CP = Yes | HD = Yes) = 0.8
P(BP = High | HD = Yes) = 0.85
– P(HD = Yes | E = No, D = Healthy, CP = Yes, BP = High)
  ∝ 0.55 × 0.8 × 0.85 = 0.374
P(HD = No | E = No, D = Healthy) = 0.45
P(CP = Yes | HD = No) = 0.01
P(BP = High | HD = No) = 0.2
– P(HD = No | E = No, D = Healthy, CP = Yes, BP = High)
  ∝ 0.45 × 0.01 × 0.2 = 0.0009
Classify X as Yes
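A minimal Python sketch of this inference, with the CPT entries hard-coded from the previous slide; the two products are unnormalized posteriors, normalized at the end:

```python
# CPT entries for E = No, D = Healthy from the belief-network slide
p_hd = {"Yes": 0.55, "No": 0.45}  # P(HD | E = No, D = Healthy)
p_cp = {"Yes": 0.8, "No": 0.01}   # P(CP = Yes | HD)
p_bp = {"Yes": 0.85, "No": 0.2}   # P(BP = High | HD)

scores = {hd: p_hd[hd] * p_cp[hd] * p_bp[hd] for hd in ("Yes", "No")}
print(scores)  # {'Yes': 0.374, 'No': 0.0009}

total = sum(scores.values())
print({hd: s / total for hd, s in scores.items()})  # normalized posterior
print(max(scores, key=scores.get))                  # 'Yes'
```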
https://support.sas.com/resources/papers/proceedings14/SAS400-2014.pdf
Bayesian inference
From Bayes theorem, we recall:
P(Y | X) = P(X | Y) P(Y) / P(X)
If we generalize X to the observed data D and Y to a hypothesis (the parameters θ),
the following holds:
P(θ | D) = P(D | θ) P(θ) / P(D) ∝ P(D | θ) P(θ)
Posterior ∝ prior × likelihood
Bayesian inference
In a simple example, take a sales record D = {x1, x2, x3, …}.
We could estimate a simple parametric model using D.
We assume this is a normal distribution N(μ, σ²)
– given the data, μ = ?, σ² = ?
P(D | θ) – given the parameter, what is the likelihood of the data?
Or more generally, we could test a set of θ1, θ2, … and see how the
likelihood P(D | θ) changes under the different parameter settings.
An easy question if you know how to calculate the mean and variance.
Bayesian inference
P(θ | D) – given the data, what are the plausible parameter values?
P(D | θ) – given the parameter, what is the likelihood of the data?
Or more generally, we could test a set of θ1, θ2, … and see how the
likelihood P(D | θ) changes under the different parameter settings.
P(θ | D) ∝ P(D | θ) P(θ)
A difficult question if you do not know how to calculate P(θ | D).
Bayesian inference
P(θ | D) = P(θ, D) / P(D) = P(D | θ) P(θ) / P(D) ∝ P(D | θ) P(θ)

Formulate a prior distribution P(θ) to express your beliefs about θ.
Often, we do not know the exact form of the posterior distribution, hence
simulation techniques (Markov chain Monte Carlo, MCMC) are needed to draw
samples that approximate the posterior distribution from a target distribution.
Popular MCMC algorithms: Metropolis–Hastings, Gibbs sampling, …
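To make this concrete, here is a minimal Metropolis–Hastings sketch for the earlier sales example: sampling the posterior of the mean μ of a normal model, assuming a known variance and a flat prior. The data values and tuning settings are illustrative assumptions:

```python
import math
import random

data = [12.0, 9.5, 11.2, 10.4, 8.9]  # illustrative 'sales record' D
sigma2 = 4.0                         # assumed known observation variance

def log_post(mu):
    # log P(D | mu) under Normal(mu, sigma2); with a flat prior on mu,
    # the log posterior equals this up to an additive constant
    return sum(-(x - mu) ** 2 / (2 * sigma2) for x in data)

mu, samples = 0.0, []
for _ in range(10_000):
    proposal = mu + random.gauss(0, 1.0)  # symmetric random-walk proposal
    log_accept = log_post(proposal) - log_post(mu)
    if log_accept >= 0 or random.random() < math.exp(log_accept):
        mu = proposal                     # Metropolis acceptance rule
    samples.append(mu)

kept = samples[2_000:]        # discard burn-in
print(sum(kept) / len(kept))  # ≈ 10.4, the sample mean of the data
```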
Bolstad, W.M. and Curran, J.M., 2016. Introduction to Bayesian Statistics. John Wiley & Sons.
Bayesian inference
A Bayesian linear regression case:

y = α + βx + ε,  ε ~ N(0, σ²)

Inference over α, β, and σ² allows us to construct the regression model.
Often, there are hierarchies within the parametric structure, e.g.,
P(α, β | σ²) ∝ 1,  P(σ²) ∝ 1/σ²

y | α, β, x, σ² ~ N(α + βx, σ²)

P(y | α, β, x, σ²) = (1 / √(2πσ²)) exp(−(y − (α + βx))² / (2σ²))
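A minimal sketch of the likelihood term above, evaluating P(y | α, β, x, σ²) for two candidate parameter settings; all numbers are illustrative assumptions:

```python
import math

def regression_likelihood(y, x, alpha, beta, sigma2):
    """Gaussian likelihood of y under y = alpha + beta * x + eps, eps ~ N(0, sigma2)."""
    mean = alpha + beta * x
    return math.exp(-(y - mean) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# How plausible is the observation (x = 2, y = 5.1) under two parameter settings?
print(regression_likelihood(5.1, 2.0, alpha=1.0, beta=2.0, sigma2=1.0))  # near the line: ~0.40
print(regression_likelihood(5.1, 2.0, alpha=0.0, beta=0.5, sigma2=1.0))  # far from it: ~9e-05
```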
Bayesian vs frequentist
Should we trust the p value?
Uncertainty in decision making, e.g., in hypothesis testing
https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading20.pdf
Hackenberger, B.K., 2019. Bayes or not Bayes, is this the question? Croatian Medical Journal, 60(1), p.50.
https://cxl.com/blog/bayesian-frequentist-ab-testing/