COMP5046
Natural Language Processing
Lecture 9: Named Entity Recognition and
Coreference Resolution
Semester 1, 2022
School of Computer Science,
University of Sydney
Dr. Caren Han
1. Information Extraction
2. Named Entity Recognition (NER) and Evaluation
3. Traditional NER
4. Sequence Model for NER
5. Coreference Resolution
6. Coreference Model
7. Coreference Evaluation
8. Preview
Lecture 9: Named Entity Recognition and Coreference Resolution
0 LECTURE PLAN
Information Extraction
“The task of automatically extracting structured information from
unstructured and/or semi-structured machine-readable documents”
Here are some questions:
• How can we allow computation to be done on the unstructured data?
• How can we extract clear, factual information?
• How can we put it in a semantically precise form that allows further inferences
to be made by computer algorithms?
1 Information Extraction
(Example: filling in Who / What / Where / When / How from the document shown in the figure)
1 Information Extraction
How to extract the structured clear, factual information
• Find and understand limited relevant parts of texts
• Gather information from many pieces of text
• Produce a structured representation of relevant information
relations (in the database sense) or a knowledge base
“5W1H”
who, what, where, when, why, how
Textual abstract: Summary for human
Extracting ↓
Subject | Relation | Object
Sydney | IS-A | Capital of New South Wales
Sydney | IS-A | Australia's largest cities
Sydney | KNOWN FOR | Sydney Opera House
… | … | …
Structured information: Summary for machine
Information Extraction Pipeline with NLP
1 Information Extraction
NLP Stack:
• Tokenisation: I love my cats → [I] [love] [my] [cats]
• Stemming: I love my cats → [I] [love] [my] [cat]
• PoS Tagging: I love my cats → [I/PRP] [love/VBP] [my/PRP$] [cats/NNS]
• Parsing and Entity Extraction: I love my cats (parse tree and extracted entities shown in the slide figure)
Understanding:
• Coreference Resolution: Caren loves cats, and she likes playing with them
Application:
• Sentiment Analysis: Caren loves cats, and she likes playing with them
→ [positive: 90.10%] [neutral: 4.70%] [negative: 5.10%]
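A minimal sketch of this pipeline, assuming the spaCy library and its small English model (neither is prescribed by the lecture):

```python
# Minimal sketch of the NLP stack above using spaCy (an assumption; the lecture
# does not prescribe a library). Requires: pip install spacy
# and: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Caren loves cats, and she likes playing with them")

# Tokenisation, lemmatisation (spaCy's counterpart to stemming), PoS tagging, parsing
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)

# Entity extraction
for ent in doc.ents:
    print(ent.text, ent.label_)
```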
What is Named Entity Recognition?
“The subtask of information extraction that seeks to locate and classify
named entity mentions in unstructured text into pre-defined categories
such as person names, organizations, locations, medical codes,
time expressions, quantities, monetary values, percentages, etc.”
Why recognise Named Entities?
• Named entities can be indexed, linked off, etc.
• Sentiment can be attributed to companies or products
• A lot of relations are associations between named entities
• For question answering, answers are often named entities.
2 Named Entity Recognition (NER)
How to recognize Named Entities?
Identify and classify names in text
• The University of Sydney (informally USYD, Sydney, Sydney Uni) is an
Australian public research university in Sydney, Australia. Founded in 1850, it
was Australia's first university and is regarded as one of the world's leading
universities. (Wikipedia, University of Sydney)
Different types of named entity classes
2 Named Entity Recognition (NER)
Type Classes
3 class Location, Person, Organization
4 class Location, Person, Organization, Misc
7 class Location, Person, Organization, Money, Percent, Date, Time
*classes can be different based on annotated dataset
2 Named Entity Recognition (NER)
How to recognize Named Entities?
Identify and classify names in text
Upenn CogComp-NLP
Stanford CoreNLP 3.9.2
http://macniece.seas.upenn.edu:4004/
http://nlp.stanford.edu:8080/corenlp/process
How to evaluate the NER performance?
The goal: predicting entities in a text
*Standard evaluation is per entity, not per token
2 Named Entity Recognition (NER)
Caren Soyeon Han is working at Google at Sydney, Australia
(gold vs. predicted PER / ORG / LOC / O tags for each token are shown in the slide figure)
2 Named Entity Recognition (NER)
How to evaluate the NER performance? Precision and recall
Precision = (detected as ‘PERSON’ correctly) / (total number detected as ‘PERSON’)
Recall = (detected as ‘PERSON’ correctly) / (total number of actual ‘PERSON’ entities)
2 Named Entity Recognition (NER)
How to evaluate the NER performance? Precision and recall
(Confusion matrix: actual class PERSON / NOT PERSON vs. detected as ‘PERSON’ / not detected)
True positives: The ‘PERSON’s that the model detected as ‘PERSON’
False positives: The NOT ‘PERSON’s that the model detected as ‘PERSON’
False negatives: The ‘PERSON’s that the model detected as NOT ‘PERSON’
True negatives: The ‘NOT PERSON’s that the model detected as NOT ‘PERSON’
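A minimal sketch (not from the lecture) of entity-level precision, recall and F1, where gold and predicted entities are represented as (type, start, end) spans; the example spans below are illustrative:

```python
def entity_prf(gold, pred):
    """Entity-level precision / recall / F1.

    gold, pred: sets of (entity_type, start_token, end_token) tuples.
    An entity counts as correct only if type and boundaries match exactly.
    """
    tp = len(gold & pred)   # detected as an entity correctly
    fp = len(pred - gold)   # detected, but not a gold entity
    fn = len(gold - pred)   # gold entity that was missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Toy example with 3 gold entities, 2 of which are predicted (2 TP, 0 FP, 1 FN);
# the spans themselves are made up for illustration.
gold = {("PER", 0, 2), ("ORG", 6, 6), ("LOC", 8, 9)}
pred = {("ORG", 6, 6), ("LOC", 8, 9)}
print(entity_prf(gold, pred))  # -> (1.0, 0.666..., 0.8)
```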
How to evaluate the NER performance?
The goal: predicting entities in a text
*Standard evaluation is per entity, not per token
2 Named Entity Recognition (NER)
Caren Soyeon Han is working at Google at Sydney, Australia
(gold vs. predicted PER / ORG / LOC / O tags for each token are shown in the slide figure)

             | correct                | not correct
selected     | 2 (True Positive, TP)  | 0 (False Positive, FP)
not selected | 1 (False Negative, FN) | 0 (True Negative, TN)
How to evaluate the NER performance?
The goal: predicting entities in a text
*Standard evaluation is per entity, not per token
2 Named Entity Recognition (NER)
Precision and Recall are straightforward for text categorization or
web search, where there is only one grain size (documents)
Caren Soyeon Han is working at Google at Sydney, Australia
(gold vs. predicted PER / ORG / LOC / O tags for each token are shown in the slide figure)
Quick Exercise: F measure Calculation
Let’s calculate Precision, Recall, and F-measure together!
P = ?? R = ?? F1 = ??
2 Named Entity Recognition (NER)
F1 = 2 * P * R / (P + R)
correct not correct
selected 2 (TP) 0 (FP)
not selected 1 (FN) 0 (TN)
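For reference, the worked answer from the counts above:
P = TP / (TP + FP) = 2 / (2 + 0) = 1.0
R = TP / (TP + FN) = 2 / (2 + 1) ≈ 0.67
F1 = 2 × P × R / (P + R) = 2 × (1.0 × 0.67) / (1.0 + 0.67) ≈ 0.80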
Data for learning named entity
• Training counts joint frequencies in a corpus
• The more training data, the better
• Annotated corpora are small and expensive
Corpora | Source | Size | Class Type
muc-7 | New York Times | 164k tokens | per, org, loc, dates, times, money, percent
conll-03 | Reuters | 301k | per, org, loc, misc
bbn | Wall Street Journal | 1174k | https://catalog.ldc.upenn.edu/docs/LDC2005T33/BBN-Types-Subtypes.html
2 Named Entity Recognition (NER)
https://aclweb.org/aclwiki/MUC-7_(State_of_the_art)
Data for learning named entity
• Models trained on one corpus perform poorly on others
F-score (training corpus in rows, test corpus in columns)
train \ test | muc | conll | bbn
muc | 82.3 | 54.9 | 69.3
conll | 69.9 | 86.9 | 60.2
bbn | 80.2 | 58.0 | 88.0
2 Named Entity Recognition (NER)
CoNLL 2003 NER dataset
• Performance measure: F = 2 * Precision * Recall / (Recall + Precision)
2 Named Entity Recognition (NER)
https://paperswithcode.com/sota/named-entity-recognition-ner-on-conll-2003
Datasets for NER in English
2 Named Entity Recognition (NER)
https://paperswithcode.com/task/named-entity-recognition-ner
https://github.com/juand-r/entity-recognition-datasets
The following table shows the list of datasets for English entity recognition.
DUA: Data Use Agreement
LDC: Linguistic Data Consortium
CC-BY 4.0: Creative Commons Attribution 4.0
Three standard approaches to NER
3 Traditional NER
• Rule-based NER
• Classifier-based NER
• Sequence Model for NER
Traditional Approaches
Rule-based NER
• Entity references have internal and external language cues
Mr. [per Scott Morrison] flew to [loc Beijing]
• Can recognise names using lists (or gazetteers):
– Personal titles: Mr, Miss, Dr, President
– Given names: Scott, David, James
– Corporate suffixes: & Co., Corp., Ltd.
– Organisations: Microsoft, IBM, Telstra
• and rules:
– personal title X ⇒ per
– X, location ⇒ loc or org
– travel verb to X ⇒ loc
• Effectively regular expressions plus PoS tagger output
3 Traditional NER
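A minimal sketch (illustrative only, not the lecture's system) of a gazetteer-plus-rules recogniser along these lines:

```python
import re

# Tiny gazetteers, following the lists on the slide
TITLES = {"Mr", "Mr.", "Miss", "Dr", "President"}
TRAVEL_VERBS = {"flew", "travelled", "went"}

def rule_based_ner(tokens):
    """Apply two simple rules: 'personal title X => per', 'travel verb to X => loc'."""
    labels = ["O"] * len(tokens)
    for i, tok in enumerate(tokens):
        # Rule: personal title followed by capitalised word(s) => person
        if tok in TITLES and i + 1 < len(tokens) and tokens[i + 1][0].isupper():
            j = i + 1
            while j < len(tokens) and re.match(r"^[A-Z][a-z]+$", tokens[j]):
                labels[j] = "per"
                j += 1
        # Rule: travel verb + "to" + capitalised word => location
        if tok in TRAVEL_VERBS and i + 2 < len(tokens) and tokens[i + 1] == "to" \
                and tokens[i + 2][0].isupper():
            labels[i + 2] = "loc"
    return list(zip(tokens, labels))

print(rule_based_ner("Mr. Scott Morrison flew to Beijing".split()))
# [('Mr.', 'O'), ('Scott', 'per'), ('Morrison', 'per'), ('flew', 'O'), ('to', 'O'), ('Beijing', 'loc')]
```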
Rule-based NER
• Determining which person holds what office in what organization
– [person] , [office] of [org]
• Michael Spence, the vice-chancellor and principal of the University of Sydney
– [org] (named, appointed, etc.) [person] Prep [office]
• WHO appointed Tedros Adhanom as Director-General
• Determining where an organization is located
– [org] in [loc]
• Google headquarters in California
– [org] [loc] (division, branch, headquarters, etc.)
• Google London headquarters
3 Traditional NER
Statistical approaches are more portable
• Learn NER from annotated text
– weights (≈ rules) calculated from the corpus
– same machine learner, different language or domain
• Token-by-token classification (with any machine learning)
• Each token may be:
– not part of an entity (tag o)
– beginning an entity (tag b-per, b-org, etc.)
– continuing an entity (tag i-per, i-org, etc.)
• What about N-gram model?
3 Traditional NER
Various features for statistical NER
3 Traditional NER
Unigram Mr. Scott Morrison flew to Beijing
Lowercase unigram mr. scott morrison flew to beijing
POS tag nnp nnp nnp vbd to nnp
length 3 5 4 4 2 7
In first-name gazetteer no yes no no no no
In location gazetteer no no no no no yes
3-letter suffix Mr. ott son lew - ing
2-letter suffix r. tt on ew to ng
1-letter suffix . t n w o g
Tag predictions O B-per I-per O O B-loc
Predictive model: Mr. Scott Morrison lives in Sydney → O B-PER I-PER O O B-LOC
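A small sketch (illustrative) of how these per-token features could be extracted; the gazetteers below are toy examples:

```python
FIRST_NAMES = {"scott", "david", "james"}       # toy first-name gazetteer
LOCATIONS = {"beijing", "sydney", "australia"}  # toy location gazetteer

def token_features(tokens, pos_tags, i):
    """Features for token i, roughly following the table above."""
    w = tokens[i]
    return {
        "unigram": w,
        "lower": w.lower(),
        "pos": pos_tags[i],
        "length": len(w),
        "in_first_name_gazetteer": w.lower() in FIRST_NAMES,
        "in_location_gazetteer": w.lower() in LOCATIONS,
        "suffix3": w[-3:],
        "suffix2": w[-2:],
        "suffix1": w[-1:],
        # context features (used by the sequence models later in this lecture)
        "prev_word": tokens[i - 1] if i > 0 else "<BOS>",
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
    }

tokens = "Mr. Scott Morrison flew to Beijing".split()
pos = ["NNP", "NNP", "NNP", "VBD", "TO", "NNP"]
print(token_features(tokens, pos, 1))   # features for "Scott"
```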
Traditional NER Approaches - Pros and Cons
Rule-based approaches
• Can be high-performing and efficient
• Require experts to make rules
• Rely heavily on gazetteers that are always incomplete
• Are not robust to new domains and languages
Statistical approaches
• Require (expert-)annotated training data
• May identify unforeseen patterns
• Can still make use of gazetteers
• Are robust for experimentation with new features
• Are largely portable to new languages and domains
3 Traditional NER
Sequence Model (N to N)
4 Sequence Model for NER
Sequence 2 Sequence Learning
Output: Part of Speech
Input: Text
How is the weather today
ADV VERB DET NOUN NOUN
Sequence Model
4 Sequence Model for NER
Sequence 2 Sequence Learning
Output: NE tag
Entity class or other (O)
Input: Text
Scott Morrison is a prime minister of Australia
PER PER O O O O O LOC
Encoding classes for sequence labeling
4 Sequence Model for NER
The IOB (short for inside, outside, beginning) is a common tagging format
• I- prefix before a tag indicates that the tag is inside a chunk.
• B- prefix before a tag indicates that the tag is the beginning of a chunk.
• An O tag indicates that a token belongs to no chunk (outside).
Sequence 2 Sequence Learning
Output: NE tag
Entity class or other (O)
Input: Text: Scott Morrison is a prime minister of Australia
PER PER O O O O O LOC
Encoding classes for sequence labeling
4 Sequence Model for NER
Josiah tells Caren John Smith is a student
IO encoding:  PER O PER PER PER O O O
IOB encoding: B-PER O B-PER B-PER I-PER O O O
IO encoding vs IOB encoding (for n entity classes, IO needs n+1 labels, IOB needs 2n+1)
• Computation time? Efficiency?
• Note: from the IO labels alone, "Caren John Smith" would be decoded as a single
entity (even B-PER I-PER I-PER); the IOB encoding keeps the two entities apart.
The IO and IOB (inside, outside, beginning) encodings are common tagging formats
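A small sketch (illustrative) that converts IO labels into IOB labels; note how, from IO input alone, the two adjacent PER entities in the example above get merged, which is exactly the ambiguity IOB avoids:

```python
def io_to_iob(io_labels):
    """Convert IO labels (e.g. PER, O) to IOB labels (B-PER, I-PER, O).

    Note: with plain IO input, two adjacent entities of the same type
    cannot be told apart, so they are merged into one IOB chunk.
    """
    iob = []
    prev = "O"
    for label in io_labels:
        if label == "O":
            iob.append("O")
        elif label == prev:
            iob.append("I-" + label)
        else:
            iob.append("B-" + label)
        prev = label
    return iob

print(io_to_iob(["PER", "O", "PER", "PER", "PER", "O", "O", "O"]))
# ['B-PER', 'O', 'B-PER', 'I-PER', 'I-PER', 'O', 'O', 'O']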
Features for sequence labeling
Words
• Current word (essentially like a learned dictionary)
• Previous/next word (context)
Other kinds of inferred linguistic classification
• Part-of-speech tags
Label context
• Previous (and perhaps next) label
4 Sequence Model for NER
N to N Sequence model
• There are different NLP tasks that use the N to N sequence model
4 Sequence Model for NER
POS tagging Named Entity Recognition Word Segmentation
Sequence Model (MEMM, CRF)
4 Sequence Model for NER
HMM MEMM CRF
Sequence Inference for NER
• For a Maximum Entropy Markov Model (MEMM), the classifier
makes a single decision at a time, conditioned on evidence from
observations and previous decisions
Features
4 Sequence Model for NER
Position: -3 | -2 | -1 | 0 | +1
Word:     Scott | Morrison | lives | in | Australia
POS:      NN | NN | VBZ | IN | NN
Features for the current word (position 0, "in"):
W0 = in
W+1 = Australia
W-1 = lives
POS-1 = VBZ
POS-2-POS-1 = NN-VBZ
hasDigit? = 0
… …
(Toutanova et al. 2003, etc.)
Sequence Inference for NER
4 Sequence Model for NER
(Figure: for the sentence "Scott Morrison lives in Australia", local features (W0, W+1, W-1, POS-1, ...) are extracted for each token and fed to a classifier (e.g. MEMM, CRF, or RNN) trained with gradient-based optimisation; at the local level the prediction for "in" is "O", and at the sequence level this is repeated for every position.)
The goal: predicting named entity mentions in unstructured text into pre-defined
categories such as person names, organizations, and locations
Named Entity Recognition
Caren Soyeon Han is working at Google at Sydney, Australia
(gold vs. predicted PER / ORG / LOC / O tags for each token are shown in the slide figure)
Upenn CogComp-NLP http://macniece.seas.upenn.edu:4004/
4 Sequence Model for NER
We can easily apply Bi-LSTM (N to N Seq2Seq) Model to predict Named Entities
Named Entity Recognition with Bi-LSTM
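A minimal PyTorch sketch (not the lecture's code) of such a Bi-LSTM tagger; the dimensions and tag inventory below are assumptions:

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Minimal Bi-LSTM sequence tagger: one NE tag prediction per input token."""
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)  # 2x: forward + backward states

    def forward(self, token_ids):            # (batch, seq_len)
        emb = self.embed(token_ids)           # (batch, seq_len, emb_dim)
        hidden, _ = self.lstm(emb)             # (batch, seq_len, 2 * hidden_dim)
        return self.out(hidden)                # per-token tag scores (logits)

model = BiLSTMTagger(vocab_size=10000, num_tags=9)   # e.g. IOB tags for 4 classes + O
logits = model(torch.randint(0, 10000, (1, 8)))       # fake batch of 8 token ids
pred_tags = logits.argmax(-1)   # independent per-token decisions (no tag-to-tag constraints)
```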
4 Sequence Model for NER
We can easily apply Bi-LSTM (N to N Seq2Seq) Model to predict Named Entities
Named Entity Recognition with Bi-LSTM
*The model can clearly produce invalid predictions:
an I- tag cannot appear on the first word, I-PER can only appear after B-PER (or I-PER),
and I-ORG can only appear after B-ORG (or I-ORG).
4 Sequence Model for NER
What if we teach the
dependency between
predicted entity names
4 Sequence Model for NER
Hidden Markov Models (HMMs) are a class of probabilistic graphical model that
allow us to predict a sequence of unknown (hidden) variables from a set of
observed variables.
Wait? What about HMM?
• States are hidden
• Observable outcome linked to states
• Each state has observation probabilities
to determine the observable event
(HMM diagram: hidden states X1, X2, X3 and observations Y1, Y2, Y3)
x: states (hidden)
y: possible observations
a: state transition probabilities
b: output (emission) probabilities
4 Sequence Model for NER
• The CRF model addresses the label bias problem of the MEMM and relaxes the
unreasonably strong independence assumptions made by the HMM.
• The MEMM applies local (per-state) normalisation, while the CRF applies global
(whole-sequence) normalisation.
Advanced HMM (MEMM or CRF)
HMM
Maximum-entropy
Markov model (MEMM)
Conditional random
field (CRF)
4 Sequence Model for NER
What if we put a CRF on top of the Bi-LSTM model? By adding a CRF layer, the
model can handle the dependencies between predicted entity tags
Named Entity Recognition with Bi-LSTM-CRF
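One way to add such a CRF layer on top of the Bi-LSTM's per-token scores is the third-party pytorch-crf package; this is only an illustrative sketch, and the lecture does not prescribe a library:

```python
import torch
from torchcrf import CRF   # third-party package (assumption): pip install pytorch-crf

num_tags = 9
crf = CRF(num_tags, batch_first=True)

emissions = torch.randn(1, 8, num_tags)     # per-token scores, e.g. from the Bi-LSTM above
tags = torch.randint(0, num_tags, (1, 8))   # gold tag sequence (toy data)

loss = -crf(emissions, tags)       # negative log-likelihood of the whole tag sequence
best_path = crf.decode(emissions)  # Viterbi decoding that respects tag-to-tag transitions
```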
4 Sequence Model for NER
Remember?
4 Sequence Model for NER
Greedy Inference
• Greedy inference:
– We just start at the left, and use our classifier at each position to assign a label
– The classifier can depend on previous labeling decisions as well as observed data
• Advantages:
– Fast, no extra memory requirements
– Very easy to implement
– With rich features including observations to the right, it may perform quite well
• Disadvantage:
– Greedy: we may commit errors that we cannot recover from
4 Sequence Model for NER
Scott Morrison lives in Australia
Beam Inference
• Beam inference:
– At each position keep the top k complete sequences.
– Extend each sequence in each local way.
– The extensions compete for the k slots at the next position.
• Advantages:
– Fast; beam sizes of 3–5 are almost as good as exact inference in many cases.
– Easy to implement (no dynamic programming required).
• Disadvantage:
– Inexact: the globally best sequence can fall off the beam.
4 Sequence Model for NER
Scott Morrison lives in Australia
Viterbi Inference
• Viterbi inference:
– Dynamic programming or memoisation.
– Requires small window of state influence (e.g., past two states are relevant).
• Advantage:
– Exact: the global best sequence is returned.
• Disadvantage:
– Harder to implement long-distance state-state interactions
4 Sequence Model for NER
Scott Morrison lives in Australia
3 Probabilistic Approaches
Viterbi Algorithm
(Figure, built up over several slides: a Viterbi trellis for POS-tagging the four-word example sentence "John will Pin Will" with hidden states N (noun), M (modal), and V (verb). Each cell stores the highest probability of any path reaching that state, with values such as 2/9, 3/4, 1/6, 1/486, 1/1152 and 1/2592 in the original figure; lower-probability paths are pruned at each step, and the best tag sequence is read off by backtracking.)
https://web.stanford.edu/~jurafsky/slp3/8.pdf
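A compact sketch of the Viterbi recursion itself; the states follow the N/M/V example above, but the sentence, transition and emission probabilities below are made up for illustration:

```python
import numpy as np

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for an observation sequence."""
    V = np.zeros((len(states), len(obs)))               # best path probability per state/step
    back = np.zeros((len(states), len(obs)), dtype=int)  # backpointers

    for s in range(len(states)):                          # initialisation
        V[s, 0] = start_p[s] * emit_p[s][obs[0]]
    for t in range(1, len(obs)):                          # recursion
        for s in range(len(states)):
            scores = V[:, t - 1] * trans_p[:, s] * emit_p[s][obs[t]]
            back[s, t] = np.argmax(scores)
            V[s, t] = np.max(scores)

    path = [int(np.argmax(V[:, -1]))]                     # termination + backtrace
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, back[path[0], t])
    return [states[s] for s in path]

# Toy HMM with states N (noun), M (modal), V (verb); all probabilities are made up.
states = ["N", "M", "V"]
start_p = np.array([0.6, 0.3, 0.1])
trans_p = np.array([[0.2, 0.5, 0.3],    # from N
                    [0.4, 0.1, 0.5],    # from M
                    [0.7, 0.2, 0.1]])   # from V
emit_p = [{"John": 0.4, "will": 0.1, "see": 0.0, "Will": 0.5},   # N
          {"John": 0.0, "will": 0.9, "see": 0.0, "Will": 0.1},   # M
          {"John": 0.0, "will": 0.1, "see": 0.8, "Will": 0.1}]   # V
print(viterbi(["John", "will", "see", "Will"], states, start_p, trans_p, emit_p))
# -> ['N', 'M', 'V', 'N']
```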
4 Sequence Model for NER
What if there is a language that does not have any annotations?
NER in Low Resource Language
Current State of the Art model: Han et al. from Usyd NLP Research Group
NER and Coreference Resolution
NER only produces a list of entities in a text.
• “I voted for Scott because he was most aligned with my values”
Then, how do we trace them?
Coreference Resolution is the task of finding all expressions that
refer to the same entity in a text
• “I voted for Scott because he was most aligned with my values”
– Scott ↔ he
– I ↔ my
5 Coreference Resolution
Donald Trump said he considered nominating Ivanka Trump to be president
of the World Bank because “she is very good with numbers,” according to a
new interview.
What is Coreference Resolution?
Finding all mentions that refer to the same entity
5 Coreference Resolution
What is Coreference Resolution?
Finding all mentions that refer to the same entity
Donald Trump said he considered nominating Ivanka Trump to be president
of the World Bank because “she is very good with numbers,”
5 Coreference Resolution
What is Coreference Resolution?
Finding all mentions that refer to the same entity
Donald said he considered nominating Ivanka Trump to be president of the
World Bank because “she is very good with numbers,”
5 Coreference Resolution
How to conduct Coreference Resolution?
1. Detect the mentions
* Mention: span of text referring to same entity
• Pronouns
e.g. I, your, it, she, him, etc.
• Named entities
e.g. people, places, organisation etc.
• Noun phrases
e.g. a cat, a big fat dog, etc.
5 Coreference Resolution
The difficulty in coreference resolution
1. Detect the mentions
* Mention: span of text referring to same entity
Tricky mentions…
• It was very interesting
• No staff
• The best university in Australia
5 Coreference Resolution
How to handle these tricky mentions? Classifiers!
How to conduct Coreference Resolution?
1. Detect the mentions
Donald Trump said he considered nominating Ivanka Trump to be president
of the World Bank because “she is very good with numbers,”
2. Cluster the mentions
Donald Trump said he considered nominating Ivanka Trump to be president
of the World Bank because “she is very good with numbers,”
5 Coreference Resolution
How to cluster the mentions and find the coreference
Coreference
It occurs when two or more expressions in a text refer to the same
person or thing.
• “Donald Trump is a president of the United States. Trump was
born and raised in the New York City borough of Queens”
Anaphora
The use of a word referring back to a word used earlier in a text or
conversation. Mostly noun phrases
• a word (anaphor) refers to another word (antecedent)
• “Donald Trump is a president of the United States. Before entering
politics, he was a businessman and television personality”
(antecedent: "Donald Trump"; anaphor: "he")
5 Coreference Resolution
Coreference vs Anaphora
5 Coreference Resolution
Coreference: Donald Trump ↔ Trump (both expressions refer to the same entity)
Anaphora: he → Donald Trump (the anaphor "he" points back to the antecedent "Donald Trump")
Not all anaphoric relations are coreferential
1. Not all noun phrases have reference
• Every student likes his speech
• No student likes his speech
2. Not all anaphoric relations are co-referential (bridging anaphora)
• I attended the meeting yesterday. The presentation was awesome!
5 Coreference Resolution
(Diagram: the overlap of coreference and anaphora. Coreference: multiple expressions refer to the same person or thing. Anaphora types: pronominal anaphora, adjectival anaphora, bridging anaphora. Cataphora is the reverse direction, where the pronoun comes before its referent, e.g. "I almost stepped on it. It was a big snake…")
How to Cluster Mentions?
After detecting all the mentions in a text, we need to cluster them!
6 Coreference Model
Ivanka was happy that Donald said he considered nominating her because she is
very good with numbers
Detected mentions: Ivanka, Donald, he, her, she
Gold cluster 1: {Ivanka, her, she}
Gold cluster 2: {Donald, he}
How to Cluster Mentions?
• Train a binary classifier that assigns every pair of mentions a probability of
being coreferent: p(mi, mj)
6 Coreference Model
Ivanka was happy that Donald said he considered nominating her because she is
very good with numbers
• p(mi, mj) ranges from 0 (absolute negative) to 1 (absolute positive)
Mention Pair Training
6 Coreference Model
• N mentions in a document
• yij = 1 if mentions mi and mj are coreferent, −1 otherwise
• Just train with a regular cross-entropy loss (it looks a bit different because it is
binary classification):
Loss = − Σ (i = 2 … N) Σ (j = 1 … i−1) yij · log p(mj, mi)
– the outer sum iterates through mentions; the inner sum iterates through the
candidate antecedents (previously occurring mentions)
– coreferent mention pairs should get high probability, others should get low probability
Mention Pair Testing
• Coreference resolution is a clustering task, but we are only scoring
pairs of mentions… what to do?
6 Coreference Model
• Pick some threshold (e.g., 0.5) and add coreference links between
mention pairs where p(mi, mj) is above the threshold
Ivanka was happy that Donald said he considered nominating her because she is
very good with numbers
(pairwise scores p(mi, mj) between the detected mentions are shown in the slide figure)
Mention Pair Testing
• Pick some threshold (e.g., 0.5) and add coreference links between
mention pairs where p(mi, mj) is above the threshold
• Take the transitive closure to get the clustering
6 Coreference Model
Ivanka was happy that Donald said he considered nominating her because she is
very good with numbers
Even though the model did not predict a direct link between Ivanka and her,
they end up coreferent due to transitivity
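A small sketch (illustrative, not the lecture's code) of this thresholding-plus-transitive-closure step using union-find; the mention names and pair scores below are made up:

```python
def cluster_mentions(mentions, pair_prob, threshold=0.5):
    """Cluster mentions: link pairs with p > threshold, then take the transitive closure."""
    parent = {m: m for m in mentions}

    def find(m):                       # union-find with path compression
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    def union(a, b):
        parent[find(a)] = find(b)

    for i, mi in enumerate(mentions):
        for mj in mentions[:i]:        # candidate antecedents (earlier mentions)
            if pair_prob(mi, mj) > threshold:
                union(mi, mj)

    clusters = {}
    for m in mentions:                 # group mentions by their root = transitive closure
        clusters.setdefault(find(m), []).append(m)
    return list(clusters.values())

# Toy scores: links she-Ivanka and she-her imply Ivanka-her by transitivity.
scores = {("she", "Ivanka"): 0.8, ("she", "her"): 0.9, ("her", "Ivanka"): 0.4,
          ("he", "Donald"): 0.95}
prob = lambda a, b: scores.get((a, b), scores.get((b, a), 0.0))
print(cluster_mentions(["Ivanka", "Donald", "he", "her", "she"], prob))
# e.g. [['Ivanka', 'her', 'she'], ['Donald', 'he']]
```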
Mention Pair Testing: Issue
• Assume that we have a long document with the following mentions
• Michael… he … his … him …
• … won the game because he …
6 Coreference Model
Many mentions only have one clear antecedent but we
are asking the model to predict all of them
Alternative solution: instead train the model to predict
only one antecedent for each mention
Mention Ranking
6 Coreference Model
Coreference Models: Training
• The current mention mi should be linked to any one of the candidate
antecedents it is coreferent with.
• Mathematically, maximize this probability:
Σ (j = 1 … i−1) 1(yij = 1) · p(mj, mi)
6 Coreference Model
– the sum iterates through the candidate antecedents mj (previously occurring mentions)
– for the ones that are coreferent with mi, we want the model to assign a high
probability p(mj, mi)
The model could produce 0.9 probability for one of the correct antecedents
and low probability for everything else
Mention Ranking Models: Test Time
• Similar to mention-pair model except each mention is assigned only
one antecedent
6 Coreference Model
(Figure: for the current mention "she", the candidate antecedents are NA, Ivanka,
Donald, her and he; the dummy NA option means "no antecedent".)
How do we compute the probabilities?
• Non-neural statistical classifier
• Simple neural network
• More advanced model using LSTMs, attention
How do we compute the probabilities?
End to End Model (Lee et al., 2017)
• Current state-of-the-art model for coreference resolution (before 2019)
• Mention ranking model
• Improvements over simple feed-forward NN
• Use an LSTM
• Use attention (will learn about this in Lecture 10)
• Do mention detection and coreference end-to-end
• No mention detection step
6 Coreference Model
End to End Model (Lee et al., 2017)
• First embed the words in the document using a word embedding
matrix and a character-level embedding
6 Coreference Model
End to End Model (Lee et al., 2017)
• Then run a bidirectional LSTM over the document
6 Coreference Model
End to End Model (Lee et al., 2017)
6 Coreference Model
• Next, represent each span of text i going from START(i) to END(i) as a vector
– "General", "General Electric", "General Electric said", …, "Electric", "Electric said", …
will all get their own vector representation
• Span Representation (for example, for "the postal service") concatenates:
– the Bi-LSTM hidden states for the span's start and end
– an attention-based representation of the span (Lecture 10)
– additional features
• These span vectors are then scored: Are spans i and j coreferent mentions?
Is i a mention? Is j a mention? Do they look coreferent?
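In Lee et al. (2017) these three questions correspond to the pairwise coreference score
s(i, j) = sm(i) + sm(j) + sa(i, j)
where sm(i) scores how likely span i is to be a mention, sm(j) does the same for span j, and sa(i, j) scores whether j looks like an antecedent of i; the dummy "no antecedent" option is fixed at score 0.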
How to evaluate coreference?
There are different types of metrics available for evaluating coreference,
such as B-CUBED, MUC, CEAF, LEA, and BLANC. Often the average over a few
different metrics is reported.
Predicted Cluster 1 and Predicted Cluster 2 (shown in the figure)
7 Coreference Evaluation
Actual clusters:
Gold cluster 1 (Trump): Donald Trump, Trump, he, his, him, Donald
Gold cluster 2 (Clinton): Hillary Clinton, her, She
How to evaluate coreference?
Let’s evaluate with B-CUBED metrics
• Compute Precision and Recall for each mention.
7 Coreference Evaluation
(Same predicted and gold clusters as above.) Per-mention B-CUBED values from the figure:
• Trump mentions in predicted cluster 1: P = 4/5, R = 4/6
• Clinton mention in predicted cluster 1: P = 1/5, R = 1/3
• Trump mentions in predicted cluster 2: P = 2/4, R = 2/6
• Clinton mentions in predicted cluster 2: P = 2/4, R = 2/3
How to evaluate coreference?
Let’s evaluate with B-CUBED metrics
• Compute precision and recall for each mention.
• Average the individual Ps and Rs
7 Coreference Evaluation
(Same clusters and per-mention precision/recall values as on the previous slide.)
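For reference, averaging the per-mention values above (assuming, as implied by the figure, 4 Trump + 1 Clinton mention in predicted cluster 1 and 2 Trump + 2 Clinton mentions in predicted cluster 2, i.e. 9 mentions in total):
B-CUBED Precision = (4 × 4/5 + 1 × 1/5 + 2 × 2/4 + 2 × 2/4) / 9 = 5.4 / 9 = 0.60
B-CUBED Recall = (4 × 4/6 + 1 × 1/3 + 2 × 2/6 + 2 × 2/3) / 9 = 5 / 9 ≈ 0.56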
Performance Comparison
OntoNotes dataset: ~3000 documents labeled by humans
• English and Chinese data
7 Coreference Evaluation
Model | Approach | English | Chinese
Lee et al. (2010) | Rule-based system | ~55 | ~50
Chen & Ng (2012) [CoNLL 2012 Chinese winner] | Non-neural machine learning model | 54.5 | 57.6
Fernandes (2012) [CoNLL 2012 English winner] | Non-neural machine learning model | 60.7 | 51.6
Wiseman et al. (2015) | Neural mention ranker | 63.3 | —
Lee et al. (2017) | Neural mention ranker (end-to-end) | 67.2 | —
UsydNLP (2019) | Neural mention ranker with lemma cross validation | 74.87 | —
Attention and Reading Comprehension
8 Preview: Week 10
(Figure: a sequence-to-sequence model with attention. The encoder embedding layer and encoder recurrent layer process the one-hot vectors for "How are you ?"; attention scores are turned into an attention distribution with a softmax and combined into an attention output; the decoder embedding layer and decoder recurrent layer then generate "I am fine".)
Transformer and Machine Translation
8 Preview: Week 11
Reference for this lecture
• Deng, L., & Liu, Y. (Eds.). (2018). Deep Learning in Natural Language Processing. Springer.
• Rao, D., & McMahan, B. (2019). Natural Language Processing with PyTorch: Build Intelligent Language
Applications Using Deep Learning. O'Reilly Media, Inc.
• Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing.
MIT Press.
• Manning, C 2018, Natural Language Processing with Deep Learning, lecture notes, Stanford University
• Li, J., Galley, M., Brockett, C., Gao, J., & Dolan, B. (2015). A diversity-promoting objective function for neural
conversation models. arXiv preprint arXiv:1510.03055.
• Jiang, S., & de Rijke, M. (2018). Why are Sequence-to-Sequence Models So Dull? Understanding the Low-
Diversity Problem of Chatbots. arXiv preprint arXiv:1809.01941.
• Liu, C. W., Lowe, R., Serban, I. V., Noseworthy, M., Charlin, L., & Pineau, J. (2016). How not to evaluate your
dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation.
arXiv preprint arXiv:1603.08023.
