Python代写 - COMP6714 Information Retrieval and Web Search
Question 1 (20 marks) Briefly answer the following questions (1-4 sentences) in your script book. Lengthy but irrelevant answers will be penalized. (a) How does stemming typically affect recall? Why? (b) Given at least two reasons why language identification is important when indexing documents. (c) Why specialized algorithms are needed to construct inverted index for large docu￾ment collections? (d) What are the largest gap that can be encoded in 2 bytes using variable byte code? You also need to show the encoded two bytes. (e) Why is cosine a better similarity metric than the inverse of Euclidean distance in vector space model? (f) Why is vector space model generally considered a better retrieval model than the boolean model? (g) List the advantage(s) of using NDCG to evaluate Web search results over measures such as MAP. (h) List one problem with the probabilistic ranking principle. (i) In the early age of Web search engines (definitely pre-Google era), some system uses the following term frequency weighting 2·tf 2+tf to fight spam Web pages. Explain why this worked. (j) What is a “shingle”, and describe briefly the shingling method to detect near du￾plicate documents. (k) List at least three requirements that complicate the design and implementation of an industrial strength crawler. (l) Define the terms “hub” and “authority” in the context of the HITS algorithm. Can a page be both a hub and authority page at the same time? COMP6714 Page 1 of 11 Question 2 (5 marks) Consider the algorithm (from the textbook) to intersect two postings lists p1 and p2. Algorithm 1: Intersect(p1, p2) answer ← ∅; 1 while p1 = nil and p2 = nil do 2 if docID(p1) = docID(p2) then 3 Add(answer, docID(p1); 4 p1 ← next(p1); 5 p2 ← next(p2); 6 else if docID(p1) < docID(p2) then 7 p1 ← next(p1); 8 else 9 p2 ← next(p2); 10 return answer; 11 (a) What is the time complexity of the algorithm? (b) Modify the algorithm so that it can answer queries like A AND NOT B in time O(|p1| + |p2|), where A and B are two terms. (c) Is it possible to modify the algorithm so that it can answer queries like A OR NOT B in time O(|p1| + |p2|)? If not, what complexity can you achieve? COMP6714 Page 2 of 11 Question 3 (10 marks) Consider a casual user who input the boolean query “A OR B AND C”. Our system deems the query as ambiguous, as either the OR or the AND operator can be executed first. To be on the safe side, the system decides to retrieve those results that belong to both interpretations only (i.e., no matter which interpretation the user intended, it will include our system’s result). Describe how to support such query efficiently by accessing the inverted lists of tokens A, B, and C at most once. COMP6714 Page 3 of 11 Question 4 (10 marks) From the following sequence of γ-coded gaps, reconstruct first the gap sequence and then the postings sequence (assume that docid starts from 1). Note that spaces were deliberately added for clarity purpose only. You need to illustrate your steps. 1110 1101 1111 1001 0111 1111 1110 1000 1111 1001 COMP6714 Page 4 of 11 Question 5 (10 marks) The figure below shows the output of two information retrieval systems on the same two queries in a competitive evaluation. The top 15 ranks are shown. Crosses correspond to a document which has been judged relevant by a human judge; dashes correspond to irrelevant documents. There are no relevant documents in lower ranks. System 1: System 2: Rank Q1 Q2 1 - X 2 X - 3 X - 4 X - 5 - - 6 - - 7 - - 8 X - 9 X - 10 X - 11 X - 12 - - 13 - X 14 - X 15 X - Rank Q1 Q2 1 X X 2 X - 3 X - 4 - X 5 X X 6 X - 7 - - 8 - - 9 - - 10 - - 11 X - 12 X - 13 - - 14 - - 15 X - (a) Explain the following evaluation metrics and give results for query Q1 for both systems. 1. Precision at rank 10. 2. Recall at precision 0.5. (b) The metrics in part (a) above are not adequate measures of system performance for arbitrary queries. Why not? What other disadvantages do these metrics have? (c) Give the formula for mean average precision (MAP), and calculating MAP for both systems. (d) For each system, draw a precision-recall curve. Explain how you arrived at your result. COMP6714 Page 5 of 11 Question 6 (10 marks) Determine the new query vector determined by the Rocchio relevant feedback algorithm (α = β = γ = 1.0), given that the initial query is “t1 t3” and we have the following documents and user feedback. docid t1 t2 t3 t4 feedback 1 2 1 0 0 R 2 3 2 1 0 NR 3 0 3 0 3 R 4 2 1 2 2 NR 5 0 1 2 3 NR Note: “R” standards for relevant and “NR” stands for non-relevant. COMP6714 Page 6 of 11 Question 7 (10 marks) (a) State and justify briefly the assumptions made to derive Equations (3) from (2) and Equation (6) from (5) in the Binary Independence Model. (b) State which values need to be estimated for a document collection in the final Equation (8) (i.e., other parts can be discarded safely without affecting the ranking). Let ~x be the binary term incidence vector representing document D, O(p) be the odd ratio of probability p, Q be the query, R and NR stand for “relevant” and “non-relevant”, respectively, V is the vocabulary. In addition, we use the shorthand notations: pi = p(xi = 1|R, Q) and ri = p(xi = 1|NR, q). O(R|Q, ~x) = p(R|Q, ~x) p(NR|Q, ~x) (1) = p(R|Q) p(NR|Q) · p(~x|R, Q) p(~x|NR, Q) (2) = O(p(R|Q)) · |V | Yi=1 p(xi|R, Q) p(xi|NR, Q) (3) = O(p(R|Q)) · Yxi=1 p(xi = 1|R, Q) p(xi = 1|NR, Q) · Yxi=0 p(xi = 0|R, Q) p(xi = 0|NR, Q) (4) = O(p(R|Q)) · Yxi=1 pi ri · Yxi=0 1 鮶 pi 1 鮶 ri (5) = O(p(R|Q)) · Yxi=1,xi∈Q pi ri · Yxi=0,xi∈Q 1 鮶 pi 1 鮶 ri (6) = O(p(R|Q)) · Yxi=1,xi∈Q pi ri · Qxi∈Q 1鮶pi 1鮶ri Qxi=1,xi∈Q 1鮶pi 1鮶ri ! (7) = O(p(R|Q)) · Yxi=1,xi∈Q pi(1 鮶 ri) ri(1 鮶 pi) · Yxi∈Q 1 鮶 pi 1 鮶 ri (8) COMP6714 Page 7 of 11 Question 8 (5 marks) Suppose we have a document collection with an extremely small vocabulary with only 6 words w1, w2, . . . , w6. The following table shows the estimated background language model p(w|C) using the whole collection of documents (2nd column) and the word counts for document d1 (3rd column) and d2 (4th column), where c(w, di) is the count of word w in document di . Let Q = {w1, w2, w3, w4} be a query. Word p(w|C) c(w, d1) c(w, d2) w1 0.800 2 7 w2 0.100 3 1 w3 0.025 1 1 w4 0.025 2 1 w5 0.025 2 0 w6 0.025 0 0 (a) Suppose we do not smooth the language model for d1 and d2. Compute the likeli￾hood of the query for both d1 and d2, i.e., p(Q|d1) and p(Q|d2) (Do not compute the log-likelihood. You should use the scientific notation (e.g., 0.0061 should be 6.1 × 1003 ) Which document would be ranked higher? (b) Suppose we now smooth the language model for d1 and d2 using the Jelinek-Mercer smoothing method with λ = 0.8 (i.e., p(w|d) = λ·pmle(w|Md)+(11λ)·pmle(w|Mc)). Recompute the likelihood of the query for both d1 and d2, i.e., p(Q|d1) and p(Q|d2) (Do not compute the log-likelihood. You should use the scientific notation) Which document would be ranked higher? COMP6714 Page 8 of 11 Question 9 (10 marks) Consider the following web graph: Page A points to page B, C, and D. Page B points to C and D. Page C points to A and E. Page D points to E and F. Page E points to G. Page F points to G and H. Consider a crawler that starts from page A (a) Give the order of the indexing, assuming the crawler uses a URL frontier with duplicate detection, and all the pages are at different web sites. (b) Assume pages B, C, F, H are on web site α, pages D, E, G are on web site β, and page A is on web site γ. The politeness policies on these three web sites all specify at least 3 seconds between each visit (i.e., if the crawler visit a web site at the i second, the earliest time it can revisit the web site is the i + 3 second). We assume that (1) the crawler can only fetch a page every one second, and all the processing (including physically getting the page, extracting and processing the links, etc.) can be completed before the next fetch. (2) the crawler process links in the order mentioned above. The crawler still uses a ULR frontier with duplicate detection, and also uses back queues to adhere to the politeness policies. Give the order of the indexing. (If two pages can be visited at the same time, we always choose the smaller one according to the alphabetical order) COMP6714 Page 9 of 11 Question 10 (10 marks) V W X Y Z (a) Explain the concept of PageRank, and how it is calculated in practice. (b) Why is it relevant for Web search? (c) Give, and briefly explain, the corresponding matrix notation of the PageRank com￾putation. (d) Show the final matrix that will be used for the PageRank calculation for the above graph, if the random teleporting probability is 0.2. (e) Perform two iterations starting from the initial probability distribution vector of (0.2, 0.2, 0.2, 0.2, 0.2). COMP6714 Page 10 of 11 BONUS Question 11 (5 marks) Explain analytically why galloping search (aka. double binary search) is preferred to the normal binary search when implementing the skipTo(docid) method on a sorted list of docids. Make sure you state clearly the meaning of variables and any assumption you use. END OF EXAM PAPER