Python代写-CMPUT397

This work is licensed under a Creative Commons Attribution 4.0 International License CMPUT397 IIR 13: Text Classification & Naive Bayes Instructor: Denilson Barbosa University of Alberta January 11, 2021 Origial slides by Hinrich Schütze of the Ludwig-Maximilians-Universität München http://www.cis.lmu.de/~hs/teach/14s/ir/ http://informationretrieval.org 1 / 52 Overview Text classification Naive Bayes NB theory Evaluation of TC 2 / 52 Take-away today Text classification: definition & relevance to information retrieval Naive Bayes: simple baseline text classifier Theory: derivation of Naive Bayes classification rule & analysis Evaluation of text classification: how do we know it worked / didn’t work? 3 / 52 Text classification A text classification task: Email spam filtering From: ``''

Subject: real estate is the only way... gem oalvgkay
Anyone can buy real estate with no money down
Stop paying rent TODAY !
There is no need to spend hundreds or even thousands for similar courses
I am 22 years old and I have already purchased 6 properties using the
methods outlined in this truly INCREDIBLE ebook.
=================================================
Click Below to order:
http://www.wholesaledaily.com/sales/nmd.htm
=================================================
How would you write a program that would automatically detect and delete this type of
message?
5 / 52
Formal definition of TC: Training
Given:
- A document space X
Documents are represented in this space – typically some type of
high-dimensional space.
- A fixed set of classes C = {c1, c2, . . . , cJ}
The classes are human-defined for the needs of an application
(e.g., spam vs. nonspam).
- A training set D of labeled documents. Each labeled document
⟨d, c⟩ ∈ X× C
Using a learning method or learning algorithm, we then wish to learn a
classifier γ that maps documents to classes:
γ : X→ C
6 / 52
Formal definition of TC: Application/Testing
Given: a description d ∈ X of a document Determine: γ(d) ∈ C, that is,
the class that is most appropriate for d
7 / 52
Topic classification
classes:
training
set:
test
set:
regions industries subject areas
γ(d′) =China
first
private
Chinese
airline
UK China poultry coffee elections sports
London
congestion
Big Ben
Parliament
the Queen
Windsor
Beijing
Olympics
Great Wall
tourism
communist
Mao
chicken
feed
ducks
pate
turkey
bird flu
beans
roasting
robusta
arabica
harvest
Kenya
recount
run­off
seat
campaign
baseball
diamond
soccer
forward
captain
team
d′
8 / 52
Exercise
Find examples of uses of text classification in information retrieval
9 / 52
Examples of how search engines use classification
Language identification (classes: English vs. French etc.)
The automatic detection of spam pages (spam vs. nonspam)
Sentiment detection: is a movie or product review positive or
negative (positive vs. negative)
Topic-specific or vertical search – restrict search to a “vertical” like
“related to health” (relevant to vertical vs. not)
10 / 52
Classification methods: 1. Manual
Manual classification was used by Yahoo in the beginning of the web.
Also: ODP, PubMed
Very accurate if job is done by experts
Consistent when the problem size and team is small
Scaling manual classification is difficult and expensive.
→ We need automatic methods for classification.
11 / 52
Classification methods: 2. Rule-based
There are IDE-type development enviroments for writing very complex
rules efficiently. (e.g., Verity)
Accuracy is very high if a rule has been carefully refined over time by a
subject expert.
Building and maintaining rule-based classification systems is
cumbersome and expensive.
12 / 52
A Verity topic (a complex classification rule)
13 / 52
Classification methods: 3. Statistical/Probabilistic
This was our definition of the classification problem – text
classification as a learning problem
(i) Supervised learning of a the classification function γ and (ii)
application of γ to classifying new documents
We will look at two methods for doing this: Naive Bayes and SVMs
No free lunch: requires hand-classified training data
But this manual classification can be done by non-experts.
14 / 52
Naive Bayes
The Naive Bayes classifier
• The Naive Bayes classifier is a probabilistic classifier.
• We compute the probability of a document d being in a class c as
follows:
P (c|d) ∝ P (c)

1≤k≤nd
P (tk|c)
• nd is the length of the document. (number of tokens)
• P (tk|c) is the conditional probability of term tk occurring in a
document of class c
• P (tk|c) as a measure of how much evidence tk contributes that c is
the correct class.
• P (c) is the prior probability of c.
• If a document’s terms do not provide clear evidence for one class vs.
another, we choose the c with highest P (c).
16 / 52
Maximum a posteriori class
Our goal in Naive Bayes classification is to find the “best” class.
The best class is the most likely or maximum a posteriori (MAP) class
cmap:
cmap = argmax
c∈C
Pˆ (c|d) = argmax
c∈C
Pˆ (c)

1≤k≤nd
Pˆ (tk|c)
17 / 52
Taking the log
Multiplying lots of small probabilities can result in floating point
underflow.
Since log(xy) = log(x) + log(y), we can sum log probabilities
Since log is a monotonic function, the class with the highest score
does not change.
So what we usually compute in practice is:
cmap = argmax
c∈C
[log Pˆ (c) +

1≤k≤nd
log Pˆ (tk|c)]
18 / 52
Naive Bayes classifier
Classification rule:
cmap = argmax
c∈C
[log Pˆ (c) +

1≤k≤nd
log Pˆ (tk|c)]
Simple interpretation:
Each conditional parameter log Pˆ (tk|c) is a weight that indicates
how good an indicator tk is for c.
The prior log Pˆ (c) is a weight that indicates the relative
frequency of c.
The sum of log prior and term weights is then a measure of how
much evidence there is for the document being in the class.
We select the class with the most evidence.
19 / 52
Parameter estimation take 1: Maximum likelihood
Estimate parameters Pˆ (c) and Pˆ (tk|c) from train data: How?
Prior:
Pˆ (c) = Nc
N
Nc: number of docs in class c;N : total number of docs
Conditional probabilities:
Pˆ (t|c) = Tct∑
t′∈V Tct′
Tct is the number of tokens of t in training documents from class c
(includes multiple occurrences)
We’ve made a Naive Bayes independence assumption here:
Pˆ (tk|c) = Pˆ (tk|c), independent of position
20 / 52
The problem with maximum likelihood estimates: Zeros
C=China
X1=Beijing X2=and X3=Taipei X4=join X5=WTO
P (China|d) ∝ P (China) · P (Beijing|China) · P (and|China)
· P (Taipei|China) · P (join|China) · P (WTO|China)
If WTO never occurs in class China in the train set:
Pˆ (WTO|China) =
TChina,WTO∑
t′∈V TChina,t′
= 0∑
t′∈V TChina,t′
= 0
21 / 52
The problem with maximum likelihood estimates: Zeros (cont)
If there are no occurrences of WTO in documents in class China, we get
a zero estimate:
Pˆ (WTO|China) =
TChina,WTO∑
t′∈V TChina,t′
= 0
→ We will get P (China|d) = 0 for any document that contains WTO!
22 / 52
Before:
Pˆ (t|c) = Tct∑
t′∈V Tct′
Now: Add one to each count to avoid zeros:
Pˆ (t|c) = Tct + 1∑
t′∈V (Tct′ + 1)
= Tct + 1(∑t′∈V Tct′) +B
B is the number of bins – in this case the number of different words
or the size of the vocabulary |V | = M
23 / 52
Naive Bayes: Summary
Estimate parameters from the training corpus using add-one
smoothing
For a new document, for each class, compute sum of (i) log of prior
and (ii) logs of conditional probabilities of the terms
Assign the document to the class with the largest score
24 / 52
Naive Bayes: Training
TRAINMULTINOMIALNB(C,D)
1 V ← EXTRACTVOCABULARY(D)
2 N ← COUNTDOCS(D)
3 for each c ∈ C
4 doNc ← COUNTDOCSINCLASS(D, c)
5 prior[c] ← Nc/N
6 textc ← CONCATENATETEXTOFALLDOCSINCLASS(D, c)
7 for each t ∈ V
8 do Tct ← COUNTTOKENSOFTERM(textc, t)
9 for each t ∈ V
10 do condprob[t][c] ← Tct+1∑
t′ (Tct′+1)11 return V, prior, condprob
25 / 52
Naive Bayes: Testing
APPLYMULTINOMIALNB(C, V, prior, condprob, d)
1 W ← EXTRACTTOKENSFROMDOC(V, d)
2 for each c ∈ C
3 do score[c] ← log prior[c]
4 for each t ∈W
5 do score[c]+ = log condprob[t][c]
6 return argmax
c∈C
score[c]
26 / 52
Exercise: Estimate parameters, classify test set
docID words in document in c = China?
training set 1 Chinese Beijing Chinese yes
2 Chinese Chinese Shanghai yes
3 Chinese Macao yes
4 Tokyo Japan Chinese no
test set 5 Chinese Chinese Chinese Tokyo Japan ?
Pˆ (c) = Nc
N
Pˆ (t|c) = Tct + 1∑
t′∈V (Tct′ + 1)
= Tct + 1(∑t′∈V Tct′) +B
(B is the number of bins – in this case the number of different words or the size of the
vocabulary |V | = M )
cmap = argmax
c∈C
[Pˆ (c) ·

1≤k≤nd
Pˆ (tk|c)]
27 / 52
Example: Parameter estimates
Priors: Pˆ (c) = 3/4 and Pˆ (c) = 1/4 Conditional probabilities:
Pˆ (Chinese|c) = (5 + 1)/(8 + 6) = 6/14 = 3/7
Pˆ (Tokyo|c) = Pˆ (Japan|c) = (0 + 1)/(8 + 6) = 1/14
Pˆ (Chinese|c) = (1 + 1)/(3 + 6) = 2/9
Pˆ (Tokyo|c) = Pˆ (Japan|c) = (1 + 1)/(3 + 6) = 2/9
The denominators are (8 + 6) and (3 + 6) because the lengths of textc
and textc are 8 and 3, respectively, and because the constant B is 6 as the
vocabulary consists of six terms.
28 / 52
Example: Classification
Pˆ (c|d5) ∝ 3/4 · (3/7)3 · 1/14 · 1/14 ≈ 0.0003
Pˆ (c|d5) ∝ 1/4 · (2/9)3 · 2/9 · 2/9 ≈ 0.0001
Thus, the classifier assigns the test document to c = China. The reason for
this classification decision is that the three occurrences of the positive
indicator Chinese in d5 outweigh the occurrences of the two negative
indicators Japan and Tokyo.
29 / 52
Time complexity of Naive Bayes
mode time complexity
training Θ(|D|Lave + |C||V |)
testing Θ(La + |C|M a) = Θ(|C|M a)
- Lave: average length of a training doc, La: length of the test doc,M a:
number of distinct terms in the test doc, D: training set, V :
vocabulary, C: set of classes
- Θ(|D|Lave) is the time it takes to compute all counts.
- Θ(|C||V |) is the time it takes to compute the parameters from the
counts.
- Generally: |C||V | < |D|Lave
- Test time is also linear (in the length of the test document).
- Thus: Naive Bayes is linear in the size of the training set (training) and
the test document (testing). This is optimal.
30 / 52
NB theory
Naive Bayes: Analysis
Now we want to gain a better understanding of the properties of
Naive Bayes.
We will formally derive the classification rule …
…and make our assumptions explicit.
32 / 52
Derivation of Naive Bayes rule
We want to find the class that is most likely given the document:
cmap = argmax
c∈C
P (c|d)
Apply Bayes rule P (A|B) = P (B|A)P (A)P (B) :
cmap = argmax
c∈C
P (d|c)P (c)
P (d)
Drop denominator since P (d) is the same for all classes:
cmap = argmax
c∈C
P (d|c)P (c)
33 / 52
Too many parameters / sparseness
cmap = argmax
c∈C
P (d|c)P (c)
= argmax
c∈C
P (⟨t1, . . . , tk, . . . , tnd⟩|c)P (c)
There are too many parameters P (⟨t1, . . . , tk, . . . , tnd⟩|c), one for
each unique combination of a class and a sequence of words.
We would need a very, very large number of training examples to
estimate that many parameters.
This is the problem of data sparseness.
34 / 52
Naive Bayes conditional independence assumption
To reduce the number of parameters to a manageable size, we make the
Naive Bayes conditional independence assumption:
P (d|c) = P (⟨t1, . . . , tnd⟩|c) =

1≤k≤nd
P (Xk = tk|c)
We assume that the probability of observing the conjunction of attributes
is equal to the product of the individual probabilities P (Xk = tk|c). Recall
from earlier the estimates for these conditional probabilities:
Pˆ (t|c) = Tct+1(∑
t′∈V Tct′ )+B
35 / 52
Generative model
C=China
X1=Beijing X2=and X3=Taipei X4=join X5=WTO
P (c|d) ∝ P (c)∏1≤k≤nd P (tk|c)
Generate a class with probability P (c)
Generate each of the words (in their respective positions), conditional
on the class, but independent of each other, with probability P (tk|c)
To classify docs, we “reengineer” this process and find the class that is
most likely to have generated the doc.
36 / 52
Second independence assumption
Pˆ (Xk1 = t|c) = Pˆ (Xk2 = t|c)
For example, for a document in the class UK, the probability of
generating queen in the first position of the document is the same as
generating it in the last position.
The two independence assumptions amount to the bag of words
model.
37 / 52
A different Naive Bayes model: Bernoulli model
UAlaska=0 UBeijing=1 U India=0 U join=1 UTaipei=1 UWTO=1
C=China
38 / 52
Violation of Naive Bayes independence assumptions
Conditional independence:
P (⟨t1, . . . , tnd⟩|c) =

1≤k≤nd
P (Xk = tk|c)
Positional independence:
Pˆ (Xk1 = t|c) = Pˆ (Xk2 = t|c)
The independence assumptions do not really hold of documents
written in natural language.
Exercise
- Examples for why conditional independence assumption is not
really true?
- Examples for why positional independence assumption is not
really true?
How can Naive Bayes work if it makes such inappropriate
assumptions?
39 / 52
Why does Naive Bayes work?
Naive Bayes can work well even though conditional independence
Example:
c1 c2 class selected
true probability P (c|d) 0.6 0.4 c1
Pˆ (c)∏1≤k≤nd Pˆ (tk|c) 0.00099 0.00001
NB estimate Pˆ (c|d) 0.99 0.01 c1
Double counting of evidence causes underestimation (0.01) and
overestimation (0.99).
accurately estimating probabilities.
Naive Bayes is terrible for correct estimation …
…but if often performs well at accurate prediction (choosing the
correct class).
40 / 52
Naive Bayes is not so naive
- Naive Bayes has won some bakeoffs (e.g., KDD-CUP 97)
- More robust to nonrelevant features than some more complex
learning methods
- More robust to concept drift (changing of definition of class over time)
than some more complex learning methods
- Better than methods like decision trees when we have many equally
important features
- A good dependable baseline for text classification (but not the best)
- Optimal if independence assumptions hold (never true for text, but
true for some domains)
- Very fast
- Low storage requirements
41 / 52
Evaluation of TC
Evaluation on Reuters
classes:
training
set:
test
set:
regions industries subject areas
γ(d′) =China
first
private
Chinese
airline
UK China poultry coffee elections sports
London
congestion
Big Ben
Parliament
the Queen
Windsor
Beijing
Olympics
Great Wall
tourism
communist
Mao
chicken
feed
ducks
pate
turkey
bird flu
beans
roasting
robusta
arabica
harvest
Kenya
recount
run­off
seat
campaign
baseball
diamond
soccer
forward
captain
team
d′
43 / 52
Example: The Reuters collection
symbol statistic value
N documents 800,000
L avg. # word tokens per document 200
M word types 400,000
type of class number examples
region 366 UK, China
industry 870 poultry, coffee
subject area 126 elections, sports
44 / 52
A Reuters document
45 / 52
Evaluating classification
Evaluation must be done on test data that are independent of the
training data, i.e., training and test sets are disjoint.
It’s easy to get good performance on a test set that was available to
the learner during training (e.g., just memorize the test set).
Measures: Precision, recall, F1, classification accuracy
46 / 52
Precision P and recallR
in the class not in the class
predicted to be in the class true positives (TP) false positives (FP)
predicted to not be in the class false negatives (FN) true negatives (TN)
TP, FP, FN, TN are counts of documents. The sum of these four counts is the
total number of documents.
precision:P = TP/(TP + FP )
recall:R = TP/(TP + FN)
47 / 52
A combined measure: F
F1 allows us to trade off precision against recall.
F1 =
1
1
2
1
P +
1
2
1
R
= 2PR
P +R
This is the harmonic mean of P and R: 1F = 12( 1P + 1R)
48 / 52
Averaging: Micro vs. Macro
We now have an evaluation measure (F1) for one class.
But we also want a single number that measures the aggregate
performance over all classes in the collection.
Macroaveraging
Compute F1 for each of the C classes
Average these C numbers
Microaveraging
Compute TP, FP, FN for each of the C classes
Sum these C numbers (e.g., all TP to get aggregate TP)
Compute F1 for aggregate TP, FP, FN
49 / 52
F1 scores for Naive Bayes vs. other methods
(a) NB Rocchio kNN SVM
micro-avg-L (90 classes) 80 85 86 89
macro-avg (90 classes) 47 59 60 60
(b) NB Rocchio kNN trees SVM
earn 96 93 97 98 98
acq 88 65 92 90 94
money-fx 57 47 78 66 75
grain 79 68 82 85 95
crude 80 70 86 85 89
trade 64 65 77 73 76
interest 65 63 74 67 78
ship 85 49 79 74 86
wheat 70 69 77 93 92
corn 65 48 78 92 90
micro-avg (top 10) 82 65 82 88 92
micro-avg-D (118 classes) 75 62 n/a n/a 87
Naive Bayes does pretty well, but some methods beat it consistently (e.g., SVM).
50 / 52
Take-away today
Text classification: definition & relevance to information retrieval
Naive Bayes: simple baseline text classifier
Theory: derivation of Naive Bayes classification rule & analysis
Evaluation of text classification: how do we know it worked / didn’t
work?
51 / 52
Resources
Chapter 13 of IIR
Resources at http://cislmu.org
Weka: A data mining software package that includes an
implementation of Naive Bayes
Reuters-21578 – text classification evaluation set
Vulgarity classifier fail
52 / 52 