UNIVERSITY OF EDINBURGH
COLLEGE OF SCIENCE AND ENGINEERING
SCHOOL OF INFORMATICS
INFR11145 TEXT TECHNOLOGIES FOR DATA SCIENCE
Tuesday 27th April 2021
13:00 to 15:00
INSTRUCTIONS TO CANDIDATES
1. Note that ALL QUESTIONS ARE COMPULSORY.
2. DIFFERENT QUESTIONS MAY HAVE DIFFERENT NUMBERS
OF TOTAL MARKS. Take note of this in allocating time to questions.
3. This is an OPEN BOOK examination.
MSc Courses
Convener: A. Pieris
External Examiners: W. Knottenbelt, M. Dunlop, E. Vasilaki.
THIS EXAMINATION WILL BE MARKED ANONYMOUSLY
Please provide answers to ALL of the following questions. It is more important
to have a clear answer than a long one. When calculations are needed, please
show your steps and give answers to three decimal places.
1. What is a positional index? Is it always needed for all search scenarios? Why? [4 marks ]
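For reference, a minimal sketch of the data structure in question: a positional
inverted index maps each term to the documents it occurs in and the token
positions within each document (the helper below is illustrative, not a
prescribed implementation).

    # Minimal sketch of a positional inverted index:
    # term -> {doc_id: [positions of the term in that document]}
    from collections import defaultdict

    def build_positional_index(docs):
        """docs: dict mapping doc_id -> list of tokens."""
        index = defaultdict(dict)
        for doc_id, tokens in docs.items():
            for pos, term in enumerate(tokens):
                index[term].setdefault(doc_id, []).append(pos)
        return index

    # Positions are what enable phrase and proximity queries:
    docs = {1: "hop frog nice frog".split(), 2: "frog pond watch hop".split()}
    index = build_positional_index(docs)
    # index["frog"] == {1: [1, 3], 2: [0]}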
2. A startup built a search engine for a collection of documents. A set of queries
was prepared to evaluate the effectiveness of the system, but there was not
enough budget to build an extensive qrels set. For each of the following evaluation
metrics, is it an appropriate choice to use given the limited qrels, and why? [6 marks ]
(a) P@10.
(b) R@10.
(c) R-Precision.
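For reference on question 2, a minimal sketch of the three metrics (the helper
functions are illustrative assumptions, not a prescribed implementation); note
how each one uses the qrels.

    # `ranked` is the ranked list of retrieved doc ids; `relevant` is the set
    # of relevant doc ids taken from the qrels.
    def precision_at_k(ranked, relevant, k):
        return sum(1 for d in ranked[:k] if d in relevant) / k

    def recall_at_k(ranked, relevant, k):
        return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

    def r_precision(ranked, relevant):
        r = len(relevant)  # R = total number of relevant documents
        return sum(1 for d in ranked[:r] if d in relevant) / r

    # P@k only needs judgements for the top-k results, while R@k and
    # R-Precision both depend on the total number of relevant documents,
    # which requires reasonably complete qrels.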
3. In a permuterm index: [6 marks ]
(a) How would the term “exertion” be indexed? (Assume no stemming is applied.)
(b) If a user submitted the wild-card search term “ex*tion”, what query should
be executed to find possible matching terms?
(c) Why is wild-card search slower than complete word search?
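For reference on question 3, a minimal sketch of the permuterm idea using a
made-up term (not the one in the question): every rotation of term + '$' is
stored, and a single-wildcard query X*Y is rewritten as a prefix lookup on Y$X.

    def permuterm_rotations(term):
        s = term + "$"
        return [s[i:] + s[:i] for i in range(len(s))]

    def wildcard_to_prefix(pattern):
        # single '*' wildcard: X*Y -> look up rotations starting with Y$X
        left, right = pattern.split("*")
        return right + "$" + left

    print(permuterm_rotations("hop"))   # ['hop$', 'op$h', 'p$ho', '$hop']
    print(wildcard_to_prefix("h*p"))    # 'p$h'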
4. Two web search engines, system X and system Y, both use the same retrieval
models. Search engine X indexed a random, unbiased sample of 30% of the web,
while system Y indexed 100% of the web. Assume the same query is run on both
search engines, and that each system has complete qrels for its own collection. Which
of the systems should achieve a higher score for each of the following metrics?
Or, if both systems should achieve a similar score, write “similar”. [4 marks ]
(a) P@10 (b) R@20
(c) AP (d) Precision at R=50%
5. For each of the following preprocessing steps, give an example of a situation
where performing that step would likely reduce the effectiveness of a downstream
system and provide a justification for why that might happen: [5 marks ]
(a) Stopword removal
(b) Stemming
(c) Lowercasing/case-folding
(d) Removing URLs
(e) Removing non-alphanumeric characters
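For reference on question 5, a rough sketch of a pipeline containing these
steps (the regexes, stopword list, and stemmer choice are illustrative
assumptions, not a prescribed pipeline):

    import re
    from nltk.stem import PorterStemmer  # assumes NLTK is available

    STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "or", "not", "be", "is"}
    stemmer = PorterStemmer()

    def preprocess(text):
        text = re.sub(r"https?://\S+", " ", text)    # (d) remove URLs
        text = text.lower()                          # (c) case-folding
        text = re.sub(r"[^a-z0-9\s]", " ", text)     # (e) drop non-alphanumerics
        tokens = [t for t in text.split() if t not in STOPWORDS]  # (a) stopwords
        return [stemmer.stem(t) for t in tokens]     # (b) stemming

    # e.g. preprocess("To be or not to be http://example.com") returns [],
    # showing how a single step can erase a meaningful query entirely.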
6. Assume that you are designing a classifier to predict whether a news article
belongs to one of the following categories: politics, sport, science, or entertainment.
The data are distributed as shown in the following table:
category        num docs
politics             648
sport                151
science              115
entertainment        224
(a) Assume you want to achieve the highest possible recall score for the class
science: [4 marks ]
i. What is the optimal approach you can take?
ii. What recall (R) and precision (P) scores for the science class would you
achieve with this approach?
iii. What would be the Macro-averaged recall score across all classes?
(b) Now assume that you want to create a simple baseline that can achieve a
high overall accuracy. [5 marks ]
i. What could this baseline approach be? What would the accuracy be?
ii. What would be the macro F1-score for this approach? (Show your steps.)
(c) A classification model was built and resulted in the following confusion
matrix: [6 marks ]

predicted \ actual   politics   sport   science   entertainment
politics                  500       1         9               2
sport                      37      86         5             100
science                    62      14        95               2
entertainment              49      50         6             120
i. Compute the precision, recall, and F1-score for the class “entertainment”
given the results above.
ii. Given the above classification output, what would be the advantages
and disadvantages of combining all instances of the classes “sport” and
“entertainment” into a single new category called “sport & entertainment”?
Would you recommend this modification to the dataset, and
why?
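For reference on question 6, a generic sketch of per-class and macro-averaged
scores for a confusion matrix whose rows are predicted labels and columns are
actual labels (the small matrix below is hypothetical, not the one above):

    def per_class_scores(matrix, labels, target):
        i = labels.index(target)
        tp = matrix[i][i]
        fp = sum(matrix[i]) - tp                 # predicted target, actually other
        fn = sum(row[i] for row in matrix) - tp  # actually target, predicted other
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    def macro_f1(matrix, labels):
        return sum(per_class_scores(matrix, labels, c)[2] for c in labels) / len(labels)

    labels = ["A", "B"]
    matrix = [[8, 2],   # predicted A
              [1, 9]]   # predicted B
    print(per_class_scores(matrix, labels, "B"))  # (0.9, 0.818..., 0.857...)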
7. Given the following set of documents:
d1: hop frog nice frog
d2: frog pond watch hop
d3: good good good frog
d4: good nice frog go
d5: pond good good frog
d6: go good good pond
d7: watch pond hop frog
d8: nice nice quiet pond
(a) Suppose documents are classified into two classes, where C1 = {d1, d2, d3,
d4} and C2 = {d5, d6, d7, d8}. Using Mutual Information (MI), [4 marks ]
i. What is the most distinctive term for each class?
ii. What is the least distinctive term for each class?
Show the steps taken to reach this conclusion. (If there is a tie, list all terms
that are tied)
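For reference on question 7(a), a minimal sketch of document-level Mutual
Information between a term and a class, written in terms of the 2x2
contingency counts N11, N10, N01, N00 (first index: term present/absent,
second index: in class/not in class); the example counts are made up, not
taken from the corpus above.

    from math import log2

    def mutual_information(n11, n10, n01, n00):
        n = n11 + n10 + n01 + n00
        def part(nab, na_, n_b):
            # contribution of one cell; empty cells contribute nothing
            return (nab / n) * log2(n * nab / (na_ * n_b)) if nab else 0.0
        return (part(n11, n11 + n10, n11 + n01) +
                part(n10, n11 + n10, n10 + n00) +
                part(n01, n01 + n00, n11 + n01) +
                part(n00, n01 + n00, n10 + n00))

    # e.g. a term present in 3 of the 4 documents of the target class and in
    # none of the 4 documents of the other class:
    print(round(mutual_information(3, 0, 1, 4), 3))  # 0.549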
(b) Assume the following topics have been learned from the data using an LDA
model with 3 topics. The probability for each word belonging to each topic
given the learned model parameters is: [4 marks ]
word      topic 1 prob.   topic 2 prob.   topic 3 prob.
quiet          .09             .40             .05
good           .03             .25             .10
nice           .03             .20             .05
go             .05             .10             .15
pond           .15             .01             .35
frog             ?               ?               ?
hop            .25             .01             .10
watch          .10             .02             .10
And the probability of each topic for document d1 (from above) is:
document   topic 1 prob.   topic 2 prob.   topic 3 prob.
d1              0.4               ?              0.4
Where ‘?’ represents an unknown value that you may need to calculate based
on the other information provided. What probability would be assigned
to the document d1 by this LDA model assuming p(θ|α) = 0.4 and z =
{t1, t1, t2, t3} (that is, the first two words of the document come from topic
1, the third from topic 2, and the fourth from topic 3)? Show your steps.
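For reference on question 7(b), a sketch of the quantity being asked for, with
the numeric inputs left as arguments: for a fixed topic assignment z, LDA
assigns the document the probability p(θ|α) multiplied, for each word position
n, by p(z_n|θ) and p(w_n|z_n).

    def lda_doc_probability(p_theta_alpha, topic_probs, word_given_topic, words, z):
        """p_theta_alpha:    p(theta | alpha)
           topic_probs:      dict  topic -> p(topic | theta) for this document
           word_given_topic: dict  topic -> {word: p(word | topic)}
           words, z:         the document's words and their assigned topics"""
        prob = p_theta_alpha
        for w, t in zip(words, z):
            prob *= topic_probs[t] * word_given_topic[t][w]
        return prob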
(c) Consider the following two corpora: C1 contains a variety of newspaper
articles from all sections of the newspaper, while C2 contains only articles
from the fashion section of the newspaper. If you run an LDA model on
each corpus, which do you expect will have a set of topics that is easier to
distinguish between and more coherent for a human inspecting the topics?
Explain why you think that will be the case. [2 marks ]