xuebaunion@vip.163.com

3551 Trousdale Rkwy, University Park, Los Angeles, CA

留学生论文指导和课程辅导

无忧GPA：https://www.essaygpa.com

工作时间：全年无休-早上8点到凌晨3点

微信客服：xiaoxionga100

微信客服：ITCS521

Python代写 - COMP6714 Information Retrieval and Web Search

时间：2020-11-23

Question 1 (20 marks)
Briefly answer the following questions (1-4 sentences) in your script book. Lengthy but
irrelevant answers will be penalized.
(a) How does stemming typically affect recall? Why?
(b) Given at least two reasons why language identification is important when indexing
documents.
(c) Why specialized algorithms are needed to construct inverted index for large document collections?
(d) What are the largest gap that can be encoded in 2 bytes using variable byte code?
You also need to show the encoded two bytes.
(e) Why is cosine a better similarity metric than the inverse of Euclidean distance in
vector space model?
(f) Why is vector space model generally considered a better retrieval model than the
boolean model?
(g) List the advantage(s) of using NDCG to evaluate Web search results over measures
such as MAP.
(h) List one problem with the probabilistic ranking principle.
(i) In the early age of Web search engines (definitely pre-Google era), some system uses
the following term frequency weighting 2·tf
2+tf to fight spam Web pages. Explain why
this worked.
(j) What is a “shingle”, and describe briefly the shingling method to detect near duplicate documents.
(k) List at least three requirements that complicate the design and implementation of
an industrial strength crawler.
(l) Define the terms “hub” and “authority” in the context of the HITS algorithm. Can
a page be both a hub and authority page at the same time?
COMP6714 Page 1 of 11
Question 2 (5 marks)
Consider the algorithm (from the textbook) to intersect two postings lists p1 and p2.
Algorithm 1: Intersect(p1, p2)
answer ← ∅; 1
while p1 = nil and p2 = nil do 2
if docID(p1) = docID(p2) then 3
Add(answer, docID(p1); 4 p1 ← next(p1); 5 p2 ← next(p2); 6
else if docID(p1) < docID(p2) then 7 p1 ← next(p1); 8
else 9 p2 ← next(p2); 10
return answer; 11
(a) What is the time complexity of the algorithm?
(b) Modify the algorithm so that it can answer queries like A AND NOT B in time
O(|p1| + |p2|), where A and B are two terms.
(c) Is it possible to modify the algorithm so that it can answer queries like A OR NOT
B in time O(|p1| + |p2|)? If not, what complexity can you achieve?
COMP6714 Page 2 of 11
Question 3 (10 marks)
Consider a casual user who input the boolean query “A OR B AND C”. Our system deems
the query as ambiguous, as either the OR or the AND operator can be executed first. To
be on the safe side, the system decides to retrieve those results that belong to both
interpretations only (i.e., no matter which interpretation the user intended, it will include
our system’s result). Describe how to support such query efficiently by accessing the
inverted lists of tokens A, B, and C at most once.
COMP6714 Page 3 of 11
Question 4 (10 marks)
From the following sequence of γ-coded gaps, reconstruct first the gap sequence and
then the postings sequence (assume that docid starts from 1). Note that spaces were
deliberately added for clarity purpose only. You need to illustrate your steps.
1110 1101 1111 1001 0111 1111 1110 1000 1111 1001
COMP6714 Page 4 of 11
Question 5 (10 marks)
The figure below shows the output of two information retrieval systems on the same two
queries in a competitive evaluation. The top 15 ranks are shown. Crosses correspond
to a document which has been judged relevant by a human judge; dashes correspond to
irrelevant documents. There are no relevant documents in lower ranks.
System 1: System 2:
Rank Q1 Q2
1 - X 2 X - 3 X - 4 X - 5 - - 6 - - 7 - - 8 X - 9 X -
10 X -
11 X -
12 - -
13 - X
14 - X
15 X -
Rank Q1 Q2
1 X X 2 X - 3 X - 4 - X 5 X X 6 X - 7 - - 8 - - 9 - -
10 - -
11 X -
12 X -
13 - -
14 - -
15 X -
(a) Explain the following evaluation metrics and give results for query Q1 for both
systems.
1. Precision at rank 10.
2. Recall at precision 0.5.
(b) The metrics in part (a) above are not adequate measures of system performance for
arbitrary queries. Why not? What other disadvantages do these metrics have?
(c) Give the formula for mean average precision (MAP), and calculating MAP for both
systems.
(d) For each system, draw a precision-recall curve. Explain how you arrived at your
result.
COMP6714 Page 5 of 11
Question 6 (10 marks)
Determine the new query vector determined by the Rocchio relevant feedback algorithm
(α = β = γ = 1.0), given that the initial query is “t1 t3” and we have the following
documents and user feedback.
docid t1 t2 t3 t4 feedback
1 2 1 0 0 R 2 3 2 1 0 NR
3 0 3 0 3 R 4 2 1 2 2 NR
5 0 1 2 3 NR
Note: “R” standards for relevant and “NR” stands for non-relevant.
COMP6714 Page 6 of 11
Question 7 (10 marks)
(a) State and justify briefly the assumptions made to derive Equations (3) from (2) and
Equation (6) from (5) in the Binary Independence Model.
(b) State which values need to be estimated for a document collection in the final
Equation (8) (i.e., other parts can be discarded safely without affecting the ranking).
Let ~x be the binary term incidence vector representing document D, O(p) be the odd
ratio of probability p, Q be the query, R and NR stand for “relevant” and “non-relevant”,
respectively, V is the vocabulary.
In addition, we use the shorthand notations: pi = p(xi = 1|R, Q) and ri = p(xi = 1|NR, q).
O(R|Q, ~x) = p(R|Q, ~x) p(NR|Q, ~x)
(1)
= p(R|Q) p(NR|Q) · p(~x|R, Q) p(~x|NR, Q)
(2)
= O(p(R|Q)) · |V | Yi=1
p(xi|R, Q) p(xi|NR, Q)
(3)
= O(p(R|Q)) · Yxi=1
p(xi = 1|R, Q) p(xi = 1|NR, Q) · Yxi=0
p(xi = 0|R, Q) p(xi = 0|NR, Q)
(4)
= O(p(R|Q)) · Yxi=1
pi ri · Yxi=0
1 鮶 pi 1 鮶 ri
(5)
= O(p(R|Q)) · Yxi=1,xi∈Q pi ri · Yxi=0,xi∈Q 1 鮶 pi 1 鮶 ri
(6)
= O(p(R|Q)) · Yxi=1,xi∈Q pi ri · Qxi∈Q 1鮶pi 1鮶ri Qxi=1,xi∈Q 1鮶pi 1鮶ri !
(7)
= O(p(R|Q)) · Yxi=1,xi∈Q pi(1 鮶 ri) ri(1 鮶 pi) · Yxi∈Q 1 鮶 pi 1 鮶 ri
(8)
COMP6714 Page 7 of 11
Question 8 (5 marks)
Suppose we have a document collection with an extremely small vocabulary with only
6 words w1, w2, . . . , w6. The following table shows the estimated background language
model p(w|C) using the whole collection of documents (2nd column) and the word counts
for document d1 (3rd column) and d2 (4th column), where c(w, di) is the count of word
w in document di
. Let Q = {w1, w2, w3, w4} be a query.
Word p(w|C) c(w, d1) c(w, d2) w1 0.800 2 7 w2 0.100 3 1 w3 0.025 1 1 w4 0.025 2 1 w5 0.025 2 0 w6 0.025 0 0
(a) Suppose we do not smooth the language model for d1 and d2. Compute the likelihood of the query for both d1 and d2, i.e., p(Q|d1) and p(Q|d2) (Do not compute
the log-likelihood. You should use the scientific notation (e.g., 0.0061 should be
6.1 × 1003
) Which document would be ranked higher?
(b) Suppose we now smooth the language model for d1 and d2 using the Jelinek-Mercer
smoothing method with λ = 0.8 (i.e., p(w|d) = λ·pmle(w|Md)+(11λ)·pmle(w|Mc)).
Recompute the likelihood of the query for both d1 and d2, i.e., p(Q|d1) and p(Q|d2)
(Do not compute the log-likelihood. You should use the scientific notation) Which
document would be ranked higher?
COMP6714 Page 8 of 11
Question 9 (10 marks)
Consider the following web graph:
Page A points to page B, C, and D.
Page B points to C and D.
Page C points to A and E.
Page D points to E and F.
Page E points to G.
Page F points to G and H.
Consider a crawler that starts from page A
(a) Give the order of the indexing, assuming the crawler uses a URL frontier with
duplicate detection, and all the pages are at different web sites.
(b) Assume pages B, C, F, H are on web site α, pages D, E, G are on web site β, and
page A is on web site γ. The politeness policies on these three web sites all specify
at least 3 seconds between each visit (i.e., if the crawler visit a web site at the i
second, the earliest time it can revisit the web site is the i + 3 second). We assume
that (1) the crawler can only fetch a page every one second, and all the processing
(including physically getting the page, extracting and processing the links, etc.)
can be completed before the next fetch. (2) the crawler process links in the order
mentioned above.
The crawler still uses a ULR frontier with duplicate detection, and also uses back
queues to adhere to the politeness policies. Give the order of the indexing. (If two
pages can be visited at the same time, we always choose the smaller one according
to the alphabetical order)
COMP6714 Page 9 of 11
Question 10 (10 marks)
V W X Y Z
(a) Explain the concept of PageRank, and how it is calculated in practice.
(b) Why is it relevant for Web search?
(c) Give, and briefly explain, the corresponding matrix notation of the PageRank computation.
(d) Show the final matrix that will be used for the PageRank calculation for the above
graph, if the random teleporting probability is 0.2.
(e) Perform two iterations starting from the initial probability distribution vector of
(0.2, 0.2, 0.2, 0.2, 0.2).
COMP6714 Page 10 of 11
BONUS
Question 11 (5 marks)
Explain analytically why galloping search (aka. double binary search) is preferred to the
normal binary search when implementing the skipTo(docid) method on a sorted list of
docids. Make sure you state clearly the meaning of variables and any assumption you
use.
END OF EXAM PAPER