python代写-COMP4650/6490|学霸联盟

python代写-COMP4650/6490

时间：2022-06-06

COMP4650/6490
Document Analysis
2021
Introduction to Information Retrieval
School of Computing, ANU
Introduction to Information
Retrieval
1
IR Module Overview
Information retrieval (IR) part consists of
four lectures:
1. Introduction to IR + Boolean model
2. Ranked retrieval model
3. Evaluation of IR systems
4. Web search basics
2Introduction to Information Retrieval
Textbook
• Introduction to
information retrieval
• https://nlp.stanford.edu
/IR-
book/pdf/irbookonlinere
ading.pdf
• Chapters: 1, 2, 4, 6, 8,
19
Introduction to Information Retrieval 3
Table of Contents
• Lecture Overview
• Introduction to Boolean Retrieval
– Information Retrieval
– Term-Document Matrix
– Inverted Index
– Boolean Retrieval with Inverted Index
– Document Tokenization
4Introduction to Information Retrieval
What is Information Retrieval
5Introduction to Information Retrieval
What is information retrieval?
6Introduction to Information Retrieval
Information Retrieval
• “Information Retrieval (IR) is finding material (usually
documents) of an unstructured nature (usually text)
that satisfies an information need from within large
collections (usually stored on computers).”
Manning et al.
• You may think of web search first, but there are many
other cases
– E-mail search
– Searching your laptop
– Corporate knowledge bases
– Image search, video search
7Introduction to Information Retrieval
Why information retrieval
• Information overload
– “It refers to the difficulty a person can have
understanding an issue and making decisions
that can be caused by the presence of too
much information.” - wiki
8Introduction to Information Retrieval
Why information retrieval
• An essential tool to deal with information
overload
9
You are
here!
Introduction to Information Retrieval
How to perform information retrieval
• Information retrieval when we did not have
a computer
10Introduction to Information Retrieval
Starting Point
• Collection: A set of documents
– Assume it as a static collection for the
moment
• Goal: Retrieve documents with information
that is relevant to the user’s information
need and helps the user complete a task
– User’s information need is often
underspecified
11Introduction to Information Retrieval
Classic Search Model
12
1. User task
2. Info need
3. Query
4. Search
engine
5. Result
6. Query
Refinement
Collection
Introduction to Information Retrieval
Classic Search Model
13
1. User task
2. Info need
3. Query
4. Search
engine
5. Result
6. Query
Refinement
Collection
Listen to music using a
Bluetooth headset
Info about connecting a
Bluetooth headset
CONNECT BLUETOOTH
HEADSET
PAIRING BLUETOOTH
HEADPHONES
Refinement!
Repeat!
Introduction to Information Retrieval
Key Objectives
• Every good IR system needs to achieve
– Scalability
• More than 40 billion pages are indexed by Google
– Accuracy
• Top 10 pages from 40 billion pages?
14Introduction to Information Retrieval
IR vs. NLP
• Information
retrieval
– Computational
approaches
– Statistical (shallow)
understanding of
language
– Handle large scale
problems
• Natural language
processing
– Cognitive, symbolic
and computational
approaches
– Semantic (deep)
understanding of
language
– (often) smaller
scale problems
15Introduction to Information Retrieval
IR and NLP are getting closer
• IR => NLP
– Larger data
collections
– Scalable/robust
NLP techniques,
e.g., translation
models
• NLP => IR
– Deep analysis of
text documents and
queries
– Information
extraction for
structured IR tasks
16Introduction to Information Retrieval
Queries and Indexing
Indexing: Storing a mapping from terms to
documents.
Querying: Looking up terms in the index and
returning documents.
Introduction to Information Retrieval 17
Search with Boolean query
• Boolean query
– E.g., “canberra” AND “healthcare” NOT “covid”
• Procedures
– Lookup query term in the dictionary
– Retrieve the posting lists
– Operation
• AND: intersect the posting lists
• OR: union the posting list
• NOT: diff the posting list
18Introduction to Information Retrieval
Retrieval procedure in modern IR
• Boolean model provides all the ranking
candidates
– Locate documents satisfying Boolean
condition
• E.g., “obama healthcare” -> “obama” OR
“healthcare”
• Rank candidates by relevance
– Important: the notation of relevance
• Efficiency consideration
– Top-k retrieval (Google)
19Introduction to Information Retrieval
Term-document incidence matrix
20Introduction to Information Retrieval
Incidence vectors
• We have a 0/1 vector for each term.
• To answer query: take the vectors for
Brutus, Caesar and Calpurnia
(complemented) -> bitwise AND
• 110100 AND 110111 AND 101111 = 100100
21Introduction to Information Retrieval
Efficiency
• Bigger Collections
– 1 million documents
– Each 1,000 words long
• Avg 6 bytes/word including spaces/punctuation
– 6GB of data in the documents.
• Assume there are M = 500K distinct terms among
these.
– Corresponds to a matrix with 500 billion entries
– But it has no more than one billion 1’s
– Extremely sparse matrix!
22
Efficient data structure tailored for boolean retrieval
=> Inverted index!
Introduction to Information Retrieval
Inverted Index
• Inverted index consists of a dictionary and
postings
– Dictionary: a set of unique terms
– Posting: variable-size array that keeps the list of
documents given term (sorted)
• INDEXER: Construct inverted index from raw text
23Introduction to Information Retrieval
Inverted Index
• For each term t, we must store a list of all documents that
contain t.
- Identify each doc by a docID, a document serial number
• We need variable-size postings lists
- On disk, a continuous run of postings is normal and best
- In memory, can use linked lists or variable length arrays
24Introduction to Information Retrieval
Inverted Index Construction
25
More in NLP
module!
Introduction to Information Retrieval
Initial Stages of Text Processing
• Tokenization
– Cut character sequence into word tokens
– Deal with “John’s”, "can't" -> "can" "not" ?
• Normalization
– Map text and query term to same form
– You want U.S.A. and USA to match
• Stemming
– We may wish different forms of a root to match
– E.g. authorize, authorization; run, running
• Stop words
– We may omit very common words (or not)
– E.g. the, a, to, of
26Introduction to Information Retrieval
Indexer Step 1: Token sequence
• Scan documents for indexable terms
• Keep list of (token, docID) pairs.
27Introduction to Information Retrieval
Indexer Step 2:
• Sort tuples by terms (and then docID)
28Introduction to Information Retrieval
Indexer Step 3:
• Multiple term entries in a single document are merged.
• Split into Dictionary and Postings
• Doc. frequency information is added.
29Introduction to Information Retrieval
Boolean Retrieval with Inverted Index
• Easy to retrieve all documents containing term t
• How to find documents containing BRUTUS AND
CEASAR using postings?
à Linear time intersection algorithm for sorted lists
30Introduction to Information Retrieval
Intersection
31
The picture can't be displayed.
Introduction to Information Retrieval
Boolean Retrieval
• Can answer any query which is a Boolean
expression: AND, OR, NOT
• Precise: each document matches, or not
• Extended Boolean allows more complex
queries
• Primary commercial search for 30+ years,
and is still used
32Introduction to Information Retrieval
A Step back
• So far we assumed that we can easily
scan terms from a document.
• But the scanning consists of following
steps:
– Tokenization
– Stopwords
– Normalization
– Stemming and Lemmatization
33Introduction to Information Retrieval
Tokenization
• Task of chopping a document into tokens
• Chopping by whitespace and throwing punctuation are not
enough.
• Examples:
– New York, San Francisco
– Phone numbers ((800) 234 2333) and dates (Mar 11, 1983)
– LOWERCASE and LOWER CASE
– Hyphenation
• Co-education, Hewlett-Packard, doc-ument (line end)
• Regex tokenizer: simple and efficient
• Whatever method you use, always do the same tokenization
of document and query text
34Introduction to Information Retrieval
Stopword Removal & Normalization
• Stopword removal
– Stop words usually refer to the most common words in a
language.
– E.g. the, a, an, and, or, will, would, could
– Reduce the number of postings that a system has to store
– These words are not very useful in keyword search.
• Normalization
– Keep equivalence class of terms: U.S.A = USA = united
states
– Synonym list: car = automobile
– Capitalization: ferrarià Ferrari
– Case-folding: Automobile à automobile , CAT != cat
35Introduction to Information Retrieval
Stemming and Lemmatization
• {run, running, ran} à run
• Stemming
– Stemming turns tokens into
stems, which are the same
regardless of inflection.
– Stems need not be real
words
36
• Lemmatization
Lemmatization turns words into lemmas, which
are dictionary entries.
The word “better” has “good” as its lemma.
(depending on the dictionary used)
Introduction to Information Retrieval
Summary
• Lecture Overview
• Introduction to Boolean Retrieval
– Information Retrieval
– Term-Document Matrix
– Inverted Index
– Boolean Retrieval with Inverted Index
– Document Tokenization
• Any questions?
37Introduction to Information Retrieval
References
• Some lecture slides are from:
- Hongning Wang (2017) CS 4501/6501:
Information retrieval, CS@Uva
- Pandu Nayak and Prabhakar Raghavan,
CS276, Information Retrieval and Web
Search, Stanford University
38Introduction to Information Retrieval