CS1210 Computer Science I: Foundations
Homework 3: Presidents!
Due Friday, May 13 at 11:59PM
Late with max 75% credit until Monday, May 17 at 11:59PM
Introduction
In this project, we’re going to do some text analysis. The goal is to build the computational machinery required to efficiently compare the language used in a collection of documents. Ultimately, we’d like to be able to characterize documents by the language they use, which means being able to measure whether one document is similar to another. Here, you will be asked to find the most similar documents, and then to characterize some new text by the document it is most similar to. The
basis we will use for comparison is called the vector space model, and was originally introduced by
Gerard Salton in 1975 — when there was precious little electronic text around to characterize! It is still
one of the models used by information retrieval and/or text analysis systems even today, although there
are more sophisticated methods available.
The basic idea is that each document can be represented by an (ordered) list of frequencies, where each
frequency corresponds to the number of times its corresponding word appears in the text. If you think of
this list of N numbers as an N-dimensional vector, then the notion of a "vector space" makes more sense:
a document is just a vector in this N dimensional space. We’ll worry about exactly which words are
counted and appear in the vector in just a minute, but assuming that each vector is normalized to unit length, two documents with very similar word content will have vectors that point in approximately the same direction.
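To make the idea concrete, here is a minimal sketch in Python of turning a text into a unit-length vector over a fixed vocabulary (doc_vector and vocab are illustrative names, not part of the assignment's template):

import math

def doc_vector(text, vocab):
    # Count how often each vocabulary word occurs in the text.
    words = text.lower().split()
    counts = [words.count(w) for w in vocab]
    # Scale to unit length so that only the direction of the
    # vector, not the size of the document, matters.
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm else counts

vocab = ['union', 'liberty', 'tax']
print(doc_vector('liberty and union now and forever', vocab))
# [0.7071..., 0.7071..., 0.0]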
Now this leaves a lot of details to be pinned down. First, which words are included in the vector? How do
we normalize vectors to unit length? And are these raw frequencies? Or are they also normalized in some
way? How do I compare two vectors? These details are really important, and we’ll have to settle them
before we can start coding.
Data
I have obtained from an online archive the text of 894 presidential speeches. Some of these speeches are
very short (President Reagan’s address at Moscow State University on May 31, 1988 is only 72 words
long) and some are very long (President Lincoln’s three hour long speech arguing against the
Kansas/Nebraska Act on October 16, 1854 clocks in at 32,991 words). I have put the text through a regex
filter to remove most of the cruft, but there are still likely to be some unrecognized or missing characters
(some of these files were generated by optical character recognition software, which can make this kind of
mistake).
I have provided you a zip file containing 164 selected speeches from 41 presidents (each president has 4
selected speeches) as well as a set of 5 "unknown" speeches to identify by president.
Document Similarity
We will be using the tf/idf (term frequency/inverse document frequency) weighting scheme originally
proposed by Salton and his students for our vectors. Each vector v can be represented as a list of N terms,
where a term is a word and N is a prespecified number of terms we will target in characterizing a
document. In general, one picks the N most frequently used words or word stems over the entire document collection, or corpus; as N gets larger, performance generally improves, at least until the added words become so infrequent that they contribute little.
The tf/idf weighting scheme is defined as follows:
$$v = [w_0, w_1, w_2, \ldots, w_{N-1}]$$

and, for a given document, each weight $w_i$ is computed as follows:

$$w_i = tf_i \times \log \frac{|D|}{1 + |\{d \in D : t_i \in d\}|}$$

Here, for term $t_i$, $tf_i$ is related to the number of times the term appears in the current document, $|D|$ is the number of documents in the entire corpus $D$, and $|\{d \in D : t_i \in d\}|$ is the number of documents that contain the term $t_i$. The 1 in the denominator saves us from a divide-by-zero issue; the logarithm can be to any base and serves only as a scaling factor. Note that $tf_i$ is not usually just a word count, but is normalized by the length of the document, hence:

$$tf_i = \frac{\text{number of times the } i\text{th term appears in the document}}{\text{length of the document}}$$
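As a rough sketch, the weight for a single term could be computed like this in Python (the function name is illustrative, and the doc_freq and num_docs arguments are assumed to come from wherever your program tracks corpus-wide counts):

import math

def tfidf(term_count, doc_length, doc_freq, num_docs):
    # tf: occurrences of the term, normalized by document length.
    tf = term_count / doc_length
    # idf: log of corpus size over (1 + documents containing the
    # term); the 1 guards against division by zero.
    return tf * math.log(num_docs / (1 + doc_freq))

# Using the 'america' counts shown in the transcript below: 3
# occurrences in a 5644-word speech, in 119 of 164 documents.
print(tfidf(3, 5644, 119, 164))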
If we characterize each document by its vector v, we can compute a measure of similarity by measuring
the cosine similarity of the two vectors, which is simply the dot product of the vectors divided by the
product of their respective magnitudes. The dot product of two vectors $v_1$ and $v_2$ represented as indexable tuples or lists is simply:

$$\sum_{0 \le i < N} v_1[i] \times v_2[i]$$
A dot product of 1 means the two documents have exactly the same word tf/idf values, while a dot
product of 0 means the two documents share no words in common.
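A sketch of the comparison itself, assuming both vectors are plain Python lists of the same length (cosine is an illustrative name):

import math

def cosine(v1, v2):
    # Dot product of the two vectors.
    dot = sum(x * y for x, y in zip(v1, v2))
    # Product of the two magnitudes; for unit-length vectors this
    # denominator is 1 and the dot product alone suffices.
    mag = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(y * y for y in v2))
    return dot / mag if mag else 0.0

print(cosine([1, 0, 1], [1, 1, 0]))  # 0.5: the vectors share one word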
To a first approximation, this model does a pretty good job of ranking similar documents. Of course, it
does have its limitations. For example, the model loses any idea of the order in which the words actually
occur, which is why this is sometimes called a bag-of-words approach.
What To Do
You have been provided a template file for HW3 with nine incomplete methods or functions. In each case, the intended behavior of the missing method or function is described in the comments. Your job is to complete each method or function to obtain a working system.
What It Does
When completed, the system reads in a collection of speeches, constructing representations for each
speech.
>>> c = Corpus()
>>> for p in P:
        for i in range(4):
                c.addSpeech(p + str(i))
>>> len(c.speeches)
164
>>> len(c.speeches['obama0'].text)
31743
>>> len(c.speeches['obama0'].sentences)
278
>>> len(c.speeches['obama0'].words)
5644
>>> c.speeches['obama0'].wfreq['america']
3
>>> c.wfreq['america']
903
>>> c.dfreq['america']
119
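For orientation, the transcript implies roughly the following shape for the template's classes. This is only an inference from the attribute names shown above; the template file itself is authoritative:

class Speech:
    def __init__(self, name):
        self.name = name
        self.text = ''        # the full text of the speech
        self.sentences = []   # the text broken into sentences
        self.words = []       # the text broken into words
        self.wfreq = {}       # word -> count within this speech

class Corpus:
    def __init__(self):
        self.speeches = {}    # speech name -> Speech object
        self.wfreq = {}       # word -> count over the whole corpus
        self.dfreq = {}       # word -> number of speeches containing the word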
Given an unknown speech, 'unknown0', also located in the same directory as the rest of the presidential speeches, I can ask the system to identify the 5 closest speeches in the corpus using a vector length of 1000 words:
>>> c.identify('unknown0', 1000, 5)
[(0.6386678582486209, 'hoover3'),
 (0.5416029092305259, 'coolidge0'),
 (0.5109394742920141, 'coolidge2'),
 (0.504369226557463, 'coolidge1'),
 (0.49166195737076746, 'bharrison2')]
which indicates that the speech 'hoover3' is the closest by this measure (the unknown speech was also, in fact, given by Herbert Hoover).