COMP6490
Document Analysis
2022
Evaluation of IR Systems
School of Computing, ANU
So far...
1. Boolean Retrieval
2. Ranked Retrieval
Table of Contents
• Evaluation of IR systems
– Purpose of evaluation
– Test collection
– Evaluation of unranked retrieval sets
– Evaluation of ranked retrieval sets
Why do we need evaluation?
• To build IR systems that satisfy users' information needs
• Given multiple candidate systems, which one is the best?
What do we want to evaluate?
System Efficiency
• Speed
• Storage
• Memory
• Cost

System Effectiveness
• Quality of search results
• Does it find what I'm looking for?
• Does it return lots of junk?
To improve system effectiveness
• IR system design:
– Which tokenizer? Which stemmer?
– Which scoring method?
– tf-idf or wf-idf?
– Length normalization or not?
– Remove stop words?
What are the best choices?
Table of Contents
• Evaluation of IR systems
– Purpose of evaluation
– Test collection
– Evaluation of unranked retrieval sets
– Evaluation of ranked retrieval sets
For example
A test collection is a collection of relevance judgments on (query, document) pairs.
• Query 1
– Doc 1: relevant
– Doc 2: irrelevant
– Doc 3: irrelevant
– Doc 4: relevant
– Doc 5: irrelevant
• Query 2
– Doc 1: irrelevant
– Doc 2: irrelevant
– Doc 3: relevant
– Doc 4: irrelevant
– Doc 5: relevant
This relevance information is known as the ground truth. It is typically constructed by trained human annotators.
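As a minimal sketch, the ground truth above could be stored as a simple Python mapping from query to the set of relevant document IDs (the name qrels and the exact representation are illustrative, not part of the slide):

```python
# A hypothetical encoding of the relevance judgments above: only the
# relevant (query, document) pairs are stored; any pair not listed is
# treated as irrelevant.
qrels = {
    "Query 1": {"Doc 1", "Doc 4"},
    "Query 2": {"Doc 3", "Doc 5"},
}

# Example lookup: is Doc 3 relevant to Query 2?
print("Doc 3" in qrels["Query 2"])  # True
```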
Three Components of Test Collections
1. A collection of documents
2. A test suite of information needs,
expressible as queries
3. A set of relevance judgments: a binary assessment of either relevant or irrelevant for each query-document pair
Relevance Judgment
• Relevance is assessed relative to an
information need, not a query.
– For example, if our information need is:
• Information on whether drinking red wine is more
effective at reducing your risk of heart attacks than
white wine
• Candidate query: wine red white heart attack
effective
• A document is relevant if it addresses the
information need.
• The document does not need to contain all, or even any, of the query terms.
Standard Test Collections
[Table of standard test collections shown on the original slide]
Building Large Test Collections
• Recent collections do not have relevance judgments on all possible (query, document) pairs.
– It is just impossible!
– Given a query, try multiple IR systems and pool the returned documents into a set of candidate documents
– Multiple judges assess the relevance of each candidate doc given the information need (Chapter 8.5)
Table of Contents
• Evaluation of IR systems
– Purpose of evaluation
– Test collection
– Evaluation of unranked retrieval sets
– Evaluation of ranked retrieval sets
Evaluating Retrieval Results
Two evaluation settings:
– Evaluation of unranked retrieval sets (Boolean retrieval)
• Ranks of retrieved documents are not important
• Retrieved (returned) documents vs. not retrieved documents
– Evaluation of ranked retrieval sets
• Ranks of retrieved documents are important
• Relevant documents should be ranked above irrelevant documents
Evaluation of unranked retrieval sets
Unranked example:
– Say we have 10 documents.
– Given query q, the system returns 4 documents (the retrieved set is shown as a figure on the original slide).
– The system has decided these 4 documents are probably relevant to the query.
Contingency Table
• Contingency table: a summary table of retrieval results

                  Relevant   Irrelevant
  Retrieved          tp          fp
  Not retrieved      fn          tn

• tp: number of relevant documents returned by the system
• fp: number of irrelevant documents returned by the system
• fn: number of relevant documents not returned by the system
• tn: number of irrelevant documents not returned by the system
Precision, Recall, and Accuracy
• Precision: fraction of retrieved documents that are relevant
  Precision = tp / (tp + fp)
• Recall: fraction of relevant documents that are retrieved
  Recall = tp / (tp + fn)
• Accuracy: fraction of documents that are classified correctly (as relevant or irrelevant)
  Accuracy = (tp + tn) / (tp + fp + fn + tn)
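A minimal sketch of these measures in Python, assuming the retrieved set and the relevance judgments are given as sets of document IDs (the function and variable names are illustrative, and which of the 4 returned documents are actually relevant is assumed here, since the slide shows it only as a figure):

```python
def contingency(retrieved, relevant, all_docs):
    """Return (tp, fp, fn, tn) for one query."""
    tp = len(retrieved & relevant)
    fp = len(retrieved - relevant)
    fn = len(relevant - retrieved)
    tn = len(all_docs - retrieved - relevant)
    return tp, fp, fn, tn

def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

# The 10-document example: 4 documents retrieved, with a hypothetical
# set of relevance judgments (not given explicitly on the slide).
all_docs = {f"Doc {i}" for i in range(1, 11)}
retrieved = {"Doc 1", "Doc 2", "Doc 3", "Doc 4"}
relevant = {"Doc 1", "Doc 4", "Doc 7"}
tp, fp, fn, tn = contingency(retrieved, relevant, all_docs)
print(precision(tp, fp), recall(tp, fn), accuracy(tp, fp, fn, tn))
# 0.5 0.6666666666666666 0.7
```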
Precision and Recall
From the previous example: [worked precision and recall calculation shown as a figure on the original slide]
Accuracy is not appropriate for IR
Assume we have 100 documents, and only 1 document is relevant to a certain query q.
• Accuracy of System 1: 0.99
• Accuracy of System 2: 0.96
• System 1 performs better in terms of accuracy, yet it retrieved no relevant documents.
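One scenario consistent with these numbers (the exact retrieved sets are not given on the slide): System 1 returns nothing, while System 2 returns five documents including the single relevant one.
• System 1: tp = 0, fp = 0, fn = 1, tn = 99 → accuracy = 99/100 = 0.99, but recall = 0
• System 2: tp = 1, fp = 4, fn = 0, tn = 95 → accuracy = 96/100 = 0.96, precision = 0.2, recall = 1
The higher-accuracy system is the useless one, which is why precision and recall (or F-measure) are preferred for IR.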
F-Measure
F-measure: a single measure that trades off precision and recall.
It is the weighted harmonic mean of precision (P) and recall (R):
  F = 1 / (α · (1/P) + (1 − α) · (1/R)),   with 0 ≤ α ≤ 1
– α > 0.5 emphasises precision; e.g., α = 1 gives F = P (precision)
– α < 0.5 emphasises recall; e.g., α = 0 gives F = R (recall)
F1-measure: the harmonic mean of precision and recall (α = 0.5):
  F1 = 2PR / (P + R)
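As a quick worked example with hypothetical values (not from the slides): if P = 1.0 and R = 0.1, then F1 = 2 × 1.0 × 0.1 / (1.0 + 0.1) ≈ 0.18, far below the arithmetic mean of 0.55. The harmonic mean punishes systems that do well on only one of the two measures.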
Table of Contents
• Evaluation of IR systems
– Purpose of evaluation
– Test collection
– Evaluation of unranked retrieval sets
– Evaluation of ranked retrieval sets
Evaluation of ranked retrieval sets
Scenario: with the same set of documents (10 docs) and the same query, an IR system now generates a ranked result list (shown as a figure on the original slide).
• Unlike the previous example, the system retrieved all documents in our collection.
• Set-based precision and recall cannot be directly applied in this case.
• We need a metric that measures the performance of a ranked list!
How can we quantify the performance of this result?
Precision-Recall Curve
• What are the precision and recall when only the top k docs are treated as retrieved?
Precision-Recall Curve
• Compute recall and precision at each rank k (i.e. using the top k docs as the retrieved set)
• Plot the (recall, precision) points until recall reaches 1 (a sketch of this computation follows below)
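A minimal sketch of this computation, assuming the ranked result is a Python list of document IDs and the relevance judgments are a set (the data below are illustrative, not from the slides):

```python
def pr_points(ranking, relevant):
    """(recall, precision) at each rank k of a ranked result list."""
    points = []
    hits = 0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))
    return points

# Hypothetical ranked list over 10 documents, 3 of which are relevant.
ranking = [f"Doc {i}" for i in [3, 7, 1, 9, 2, 5, 4, 10, 6, 8]]
relevant = {"Doc 3", "Doc 9", "Doc 6"}
for r, p in pr_points(ranking, relevant):
    print(f"recall={r:.2f}  precision={p:.2f}")
```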
Precision-Recall Curve
[Series of figures on the original slides: the (recall, precision) points plotted rank by rank, building up the precision-recall curve]
Interpolated Precision-Recall
At a given recall level r, use the maximum precision at that or any higher recall level:
  p_interp(r) = max over r′ ≥ r of p(r′)
This makes the curve easier to interpret.
Intuition: there is no disadvantage to retrieving more documents if both precision and recall improve.
Interpolated Precision-Recall Curve
For system evaluation we need to average across many queries, and it is not easy to average a PR curve in its raw form.
Solution: the 11-point interpolated PR curve
– the interpolated precision at 11 fixed recall levels: 0.0, 0.1, 0.2, …, 1.0
[Table of 11-point interpolated precision values shown on the original slide]
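A sketch of the interpolation step (the 11 standard recall levels 0.0, 0.1, …, 1.0 are assumed; the hard-coded points correspond to the hypothetical ranking used in the earlier snippet):

```python
def interpolated_precision(points, recall_levels=None):
    """Interpolated precision at fixed recall levels.

    points: (recall, precision) pairs for one query. For each recall
    level r, take the maximum precision achieved at any recall >= r
    (0.0 if no such point exists).
    """
    if recall_levels is None:
        recall_levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    interp = []
    for r in recall_levels:
        candidates = [p for rec, p in points if rec >= r]
        interp.append(max(candidates) if candidates else 0.0)
    return interp

# (recall, precision) points from the earlier pr_points example.
points = [(0.33, 1.00), (0.33, 0.50), (0.33, 0.33), (0.67, 0.50),
          (0.67, 0.40), (0.67, 0.33), (0.67, 0.29), (0.67, 0.25),
          (1.00, 0.33), (1.00, 0.30)]
print(interpolated_precision(points))
# [1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.33, 0.33, 0.33, 0.33]
```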
Interpolated Average Precision
• For system evaluation
– Each point in the 11-point interpolated
precision is averaged across all queries in
the test collection
– A perfect system will have a straight line
from (0,1) to (1,1)
Single Number Metrics
Precision-Recall curves can be useful but
sometimes we would like to use a single
number to compare systems.
Average Precision and MAP
Average Precision (AP) is the area under the uninterpolated PR curve for a single query; it is commonly computed as the average of the precision values obtained each time a relevant document is retrieved.
Mean Average Precision (MAP) is the mean of the average precision over many queries.
MAP is a single-figure metric across all recall levels.
Good if we care about all recall levels. (A sketch follows below.)
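A sketch of AP and MAP under the same assumptions as the earlier snippets (rankings as lists of document IDs, relevance judgments as sets; the names are illustrative):

```python
def average_precision(ranking, relevant):
    """Average of the precision values at the ranks where relevant
    documents appear; relevant documents never retrieved contribute 0."""
    hits = 0
    precisions = []
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranking, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# With the hypothetical ranking from before, the relevant documents sit
# at ranks 1, 4 and 9, so AP = (1/1 + 2/4 + 3/9) / 3 ≈ 0.61.
```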
Mean Reciprocal Rank (MRR)
• MRR = the inverse (reciprocal) rank of the first relevant document, averaged over queries.
Useful if we only care about how high in the ranking the first relevant document is.
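A corresponding sketch (illustrative names again; a query whose ranking contains no relevant document is assumed to contribute 0):

```python
def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1 / k
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (ranking, relevant_set) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```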
Other ranking measures
• Precision at k
– precision computed over the top k retrieved documents, averaged over queries
• Recall at k
– recall computed over the top k retrieved documents, averaged over queries
• Receiver Operating Characteristic (ROC) curve
• Normalized Discounted Cumulative Gain (NDCG)
– requires graded relevance judgements (see the sketch below)
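NDCG is only named on the slide; as a hedged illustration, one common formulation (several variants exist) discounts each graded relevance value by the logarithm of its rank and normalises by the ideal ordering:

```python
import math

def dcg(gains):
    """Discounted cumulative gain of graded relevance values listed in
    ranked order (variant: gain at rank r divided by log2(r + 1))."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Hypothetical graded judgments for the top 5 results
# (0 = irrelevant ... 3 = highly relevant).
print(round(ndcg([3, 2, 0, 1, 0]), 3))
```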
Summary
• Evaluation of IR systems
– Purpose of evaluation
– Test collection
– Evaluation of unranked retrieval sets
– Evaluation of ranked retrieval sets
References
• Some lecture slides are from: Pandu Nayak and Prabhakar Raghavan, CS276 Information Retrieval and Web Search, Stanford University