UCL DEPARTMENT OF SECURITY AND CRIME SCIENCE
Week 4 – Text Mining II
SECU0057: Applied Data Science
Nilufer Tuptuk
• N-grams
• Keywords-in-context analysis
• Parts-of-speech
• Sentiment analysis
• Trajectory analysis
Plan for today
Department of Security and Crime Science
• Sometimes a sequence of words may contain more information
• “ice cream”, “crime science”, “bus stop”, “stolen bicycle”, “metal theft”
• “was not”, “not good”, “not helpful” -> retain more information than
individual words “was”, “not”, “good”
• “I am going out tonight to see a play” vs “I play tennis every weekend”
• By tying a word with its surrounding words, we may retain more information
Sequence of words
• An n-gram is a consecutive sequence of n elements extracted from a text
• Elements can be words, syllables, characters or symbols
• In this module we are interested in n-grams of words
“Not all crime is reported to police”
• Unigram: n = 1
• Bigram: n = 2
• Trigram: n = 3
• four-gram, five-gram, and so on
N-grams
DNA-Representation with N-grams
"not all crime is reported to police"
• Bigrams: not all | all crime | crime is | is reported | reported to | to police
• Trigrams: not all crime | all crime is | crime is reported | is reported to | reported to police
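The bigram and trigram splits above can be reproduced with a few lines of base R. A minimal sketch (not how quanteda implements it), assuming whitespace-separated tokens; the function name `ngrams` is made up for illustration:

```r
# Build word n-grams by sliding a window of size n over the tokens
ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = "_"))
}

ngrams("Not all crime is reported to police", 2)
# "not_all" "all_crime" "crime_is" "is_reported" "reported_to" "to_police"
```

Seven tokens yield six bigrams and five trigrams: each n-gram starts one word later than the previous one, so a text of length L has L - n + 1 n-grams.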
library(quanteda)
sentence <- tokens("Man jailed for life after knife attacks")
# set n = 2 for bigram, n = 3 for trigram, n = 4 for four-gram, etc.
token_sentence <- tokens_ngrams(sentence, n = 2)
dfm(x = token_sentence, tolower = TRUE)
Document-feature matrix of: 1 document, 6 features (0.00% sparse) and 0 docvars.
features
docs man_jailed jailed_for for_life life_after after_knife knife_attacks
text1 1 1 1 1 1 1
N-grams with Quanteda
# remove punctuation and stopwords
library(readtext)
youth <- readtext("PreventingYouthViolence.txt")
gang <- readtext("EndingGangViolence.txt")
youth_violence <- corpus(c(youth = youth$text, gang = gang$text))
prevent_biagrams <- youth_violence %>%
  tokens(remove_punct = TRUE) %>%
  tokens_remove(stopwords("english")) %>%
  tokens_ngrams(2) %>%  # can give a range, e.g. n = c(2, 4)
  dfm()
Corpus created using text from UK government publications on youth and gang violence:
See Reference 1 and Reference 2
With some pre-processing
       features
docs    vast_majority majority_young young_people people_education education_establishments establishments_affected affected_serious
  youth             1              1            6                1                        1                       1                1
  gang              0              0           16                0                        0                       0                0
# create n-grams for all US presidential inaugural speeches (i.e.,
# data_corpus_inaugural) in the corpus after removing punctuation and
# stopwords. The dimensions of the resulting dfms are:
dim(presidents_unigrams)
## [1]    59  9285
dim(presidents_bigrams)
## [1]    59 57727
dim(presidents_multigrams)  # i.e., n = c(2, 4)
## [1]     59 123108
• Selecting n
• generally, smaller n (bigrams or trigrams) works well
• four-grams and five-grams might be useful when you have large data sets
• How important is a specific n-gram to a specific document within a corpus?
What happens when we increase n?
dfm_tfidf(prevent_biagrams, scheme_tf = 'count', scheme_df = 'inverse')
TFIDF with n-grams
• A bag-of-n-grams model: preserves more context
• It is common to create tf-idf weights with n-grams as features for use in predictive modelling (e.g., machine learning)
       features
docs    vast_majority majority_young young_people people_education education_establishments establishments_affected affected_serious
  youth       0.30103        0.30103            0          0.30103                  0.30103                 0.30103          0.30103
  gang              0              0            0                0                        0                       0                0
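The 0.30103 values above come from quanteda's default scheme: raw term count multiplied by the base-10 inverse document frequency. A base-R sketch on two of the bigram columns from the table (the matrix is hand-filled for illustration):

```r
# tf-idf by hand: tf = raw count, idf = log10(N / document frequency)
counts <- matrix(c(1, 0,    # vast_majority: only in "youth"
                   6, 16),  # young_people: in both documents
                 nrow = 2,
                 dimnames = list(c("youth", "gang"),
                                 c("vast_majority", "young_people")))
idf   <- log10(nrow(counts) / colSums(counts > 0))  # log10(2/1), log10(2/2)
tfidf <- sweep(counts, 2, idf, `*`)
tfidf
```

A bigram that appears in every document (young_people) gets an idf of zero, which is why its whole column is zero in the slide's output despite its high counts.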
Key Words in Context (KWiC)
• Keywords-in-context (kwic()) displays concordance lines with the keyword in the middle along with nearby words
• Helps to gain insight into how a word or phrase is used in a corpus, how frequently, and in which context
kwic_word <- kwic(x, pattern, window=5)
Key Words in Context (KWiC)
violence_tokens <- tokens(youth_violence, remove_punct = TRUE)
kwic_word <- kwic(violence_tokens, pattern = "school", window = 3)
head(kwic_word)
[youth, 258] order for a | school | to be judged
[youth, 271] feel safe at | school | all the time
[youth, 396] or specialist interventions | School | and college leaders
[youth, 591] The guidance signposts | school | and college staff
[gang, 1885] smooth transition from | school | to work training
[gang, 1918] action to improve | school | attendance as regular
[gang, 1961] message that attending | school | should be a
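The concordance format itself can be sketched in base R (quanteda's kwic() handles patterns and whole corpora; the function name `kwic_base` here is made up for illustration):

```r
# Show each match with up to `window` words of context on either side
kwic_base <- function(text, pattern, window = 3) {
  words <- strsplit(text, "\\s+")[[1]]
  hits  <- which(tolower(words) == tolower(pattern))
  sapply(hits, function(i) {
    pre  <- if (i > 1) words[max(1, i - window):(i - 1)] else character(0)
    post <- if (i < length(words))
              words[(i + 1):min(i + window, length(words))] else character(0)
    paste(paste(pre, collapse = " "), "|", words[i], "|",
          paste(post, collapse = " "))
  })
}

kwic_base("children feel safe at school all the time", "school")
# "feel safe at | school | all the time"
```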
• To process consecutive sequences of words: n-grams
• To weigh words: TFIDF
• To identify the immediate context: KWiC
• To identify the lexical category of words: Parts-of-Speech (POS)
• Classifies each word by its corresponding part of speech
• (e.g., nouns, verbs, pronouns, adjectives, adverbs, and many more)
Parts of Speech (POS)
• The process of assigning each word a tag corresponding to its part of speech
• the context of a word is required to identify its POS
• Example: in "the attack" the word attack is a common noun; in "they attack" it is a verb
• POS tagging is used to compare the grammar of different texts, for grammar correction and auto-complete, to help translate text from one language to another, and to create more specific features in a document
POS tagging
Created using CoreNLP: https://corenlp.run
• The Penn tagset developed by the Penn Treebank Project is a standard tagset
for speech tagging in English
• The latest Penn treebank available here
• Consists of 36 part-of-speech tags (45 including punctuation and symbol tags)
• Paper: Marcus, Santorini and Marcinkiewicz, "Building a Large Annotated
Corpus of English: The Penn Treebank", 1993.
• The Google Universal Part-of-Speech tagset
• Contains 11 parts of speech tags
• Paper: Petrov, Das and McDonald, "A Universal Part-of-Speech Tagset", 2011
• Language-specific tagsets available in other languages
POS Tagsets
Penn Treebank tagset (1/2)
POS Tag Description Example
CC coordinating conjunction and, or, for, nor, but
CD cardinal number 1, two, three
DT determiner a, the
EX existential there there is
FW foreign word les
IN preposition, subordinating conjunction in, of, like
IN/that that as subordinator that
JJ adjective green
JJR adjective, comparative greener
JJS adjective, superlative greenest
LS list marker 1)
MD modal could, will
NN noun, singular or mass table
NNS noun plural tables
NP proper noun, singular John
NPS proper noun, plural Vikings
PDT predeterminer both the boys
POS possessive ending friend’s
Penn Treebank tagset (2/2)
PP personal pronoun I, he, it
PPZ possessive pronoun my, his
RB adverb however, usually, naturally, here
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
SENT sentence-break punctuation . ! ?
SYM symbol / [ = *
TO infinitive ‘to’ to go, to give
UH interjection uhhuhhuhh
VB verb be, base form be
VBD verb be, past tense was, were
VBG verb be, gerund/present participle being
VBN verb be, past participle been
VBP verb be, sing. present, non-3d am, are
VBZ verb be, 3rd person sing. present is
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb where, when
The Universal part-of-speech tagset
VERB verbs (all tenses and modes)
NOUN nouns (common and proper)
PRON pronouns
ADJ adjectives
ADV adverbs
ADP adpositions (prepositions and postpositions)
CONJ conjunctions
DET determiners
NUM cardinal numbers
PRT particles or other function words
X other: foreign words, typos, abbreviations
. punctuation
• POS tags can be generated using the Python libraries spaCy and NLTK, or the Java libraries OpenNLP and CoreNLP
• Homework: Use R wrapper package spacyr
• you need a spaCy installation
• A guide to using and installing spacyr
• Uses Universal Dependencies (UD) PoS Tagset based on the Google
universal part-of-speech tagset.
POS tagging in R
library(spacyr)
parsed <- spacy_parse(youth_violence, pos = TRUE, tag = FALSE)
POS tagging in R
doc_id sentence_id token_id token pos
youth 1 1 The DET
youth 1 2 vast ADJ
youth 3 10 college NOUN
youth 5 36 educational ADJ
youth 8 6 recognised VERB
youth 10 9 , PUNCT
youth 8 10 been AUX
youth 16 23 of ADP
gang 5 41 future ADJ
gang 7 38 gang NOUN
gang 7 39 And CCONJ
Universal POS tags used by spaCy available here
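Since spacy_parse() returns an ordinary token-level data frame like the one above, summarising POS usage per document reduces to a contingency table. A sketch with a few mocked-up rows in the same shape as the output (the rows here are illustrative, not the real parse):

```r
# Mock of spacy_parse() output: one row per token, with doc_id and pos
parsed <- data.frame(
  doc_id = c("youth", "youth", "youth", "gang", "gang"),
  token  = c("The", "vast", "college", "future", "gang"),
  pos    = c("DET", "ADJ", "NOUN", "ADJ", "NOUN"))

table(parsed$doc_id, parsed$pos)  # POS counts per document
```

Such per-document POS counts are one way to turn tagging output into features for comparing the grammar of different texts.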
parsedtxt <- spacy_parse(youth_violence, entity = TRUE, nounphrase = TRUE)
entity_extract(parsedtxt)
## doc_id entity entity_type
youth 10 Families ORG
youth 26 Pupil ORG
youth 30 Ending_Gang ORG
youth 30 Youth NORP
gang 91 England GPE
gang 89 BIS ORG
gang 91 Cabinet_Office ORG
gang 91 Power_Up_London WORK
gang 91 Power_Up_Liverpool WORK
gang 78 the_Home_Office ORG
gang 79 DWP ORG
NORP: nationalities, religious or political groups; GPE: countries, cities, states; ORG: organisations
• For more functions, including dependency parsing, see the spacyr documentation
Extracting entities
Sentiment Analysis
• Sentiment is “a thought, opinion, or idea based on a feeling about a
situation, or a way of thinking about something.” (Cambridge Dictionary)
• Sentiment analysis (sometimes called opinion mining) studies the opinions, attitudes and emotions of a writer towards a subject matter (e.g., an entity or event)
• Some potential applications for security and crime science
• Examining discussions, public opinion or reactions related to security and
crime
• Predicting crime patterns, potential signals for online radicalisation
Sentiment Analysis
1. Tokenise text
2. Create a lexicon of sentiment words
3. Judge the sentiment of words
4. Match tokens with sentiment lexicon
Main steps in sentiment analysis
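The four steps above can be sketched in base R with a tiny hand-made lexicon (the words and scores here are illustrative only; real lexicons such as Bing or AFINN are shown on the following slides):

```r
text    <- "the corners are ruined and the cover is damaged"
tokens  <- strsplit(text, "\\s+")[[1]]              # 1. tokenise
lexicon <- c(ruined = -1, damaged = -1, great = 1)  # 2.-3. judged sentiment words
scores  <- lexicon[tokens]                          # 4. match tokens to lexicon
sum(scores, na.rm = TRUE)                           # -2: two negative hits
```

Tokens with no lexicon entry ("the", "corners", ...) return NA and are simply ignored by the sum, which is exactly how a bag-of-words sentiment score treats neutral words.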
my_tokens <- tokens("We also know of cases where gang members have been waiting outside schools to meet children…")
my_tokens
Tokens consisting of 1 document.
text1 :
[1] "We" "also" "know" "of" "cases" "where" "gang"
"members" "have" "been" "waiting" "outside"
[ ... and 4 more ]
1. Tokenise text
• Do all words have a potential sentiment?
• Not all words (e.g., neutral words) may have a sentiment
• You may wish to focus on adjectives/adverbs
• Sentiment lexicons exist
• Lists of various lexicons available here
• Lexicons might be designed for a specific intended usage (e.g., crime,
humanities, financial text, etc. )
2. Create a lexicon of sentiment words
Examples of Sentiment Lexicons
accusation negative
bless positive
erase negative
fancy positive
impatient negative
motivated positive
Bing (6789 words)
tidytext package provides four general-purpose lexicons
get_sentiments("bing")
accusation negative
cyberattack negative
deviate uncertainty
lowest strong modal
apparently weak modal
bankrupt complexity
Loughran (4150 words - financial text)
breathtaking 5
elegant 2
desire 1
bullying -2
cover-up -3
prick -5
AFINN (2477 words) – numerical scores for negative and positive words from -5 to +5
Examples of Sentiment Lexicons II
The NRC Emotion Lexicon
• Developed by National Research Council Canada
• Categorises words as positive or negative and by emotion (anger, anticipation, disgust, fear, joy, sadness, surprise, trust) – contains 14182 words
bless positive
worried negative
accusation anger
enjoying anticipation
bullying fear
blitz surprise
establish trust
wonderfully joy
dreadfully disgust
penalty sadness
Examples of Sentiment Lexicons III
lexicon package
head(lexicon::hash_sentiment_slangsd, 10)
## x y
## a a -0.5
## a bad amount of money -0.5
## a bad mother -0.5
## a bad taste in my mouth -0.5
## a batman 0.5
## a bitch -0.5
## a heavy 0.5
## achievement hunter 1.0
head(lexicon::hash_sentiment_socal_google, 10)
## x y
## a pillar 2.9045431
## ab liva -0.9578700
## able 2.6393740
## above average 3.2150018
## above mentioned 2.5815803
## abrasive 1.6751913
## absent 0.4850649
## absurd -3.7778744
3. Judge the sentiment of words
• Manually by experts (e.g., authors of the lexicons)
• Crowdsourcing (for example, using services such as Amazon’s
Mechanical Turk)
• Decide on a judgement scale (e.g., binary or a numerical scale)
• Asking each annotator to judge each word (multiple judgments)
• Assess their inter-rater reliability and compute an overall sentiment
value
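Aggregating the multiple judgements can be sketched in base R. Here three hypothetical annotators rate each word on a -1/0/+1 scale, and a simple majority-agreement rate stands in for a proper inter-rater statistic (e.g., Krippendorff's alpha); all values are illustrative:

```r
# One row per word, one column per annotator (scores are made up)
judgements <- rbind(
  bullying = c(-1, -1, -1),
  fancy    = c( 1,  1,  0))

overall   <- rowMeans(judgements)  # overall sentiment value per word
agreement <- apply(judgements, 1,
                   function(x) max(table(x)) / length(x))  # share agreeing

overall["bullying"]  # -1: unanimous negative
agreement["fancy"]   # 2/3: one annotator disagreed
```

Low-agreement words are exactly the ones a lexicon author would flag for re-annotation or exclusion.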
4. Match tokens with sentiment lexicon
• syuzhet package provides one sentiment score for each text
library(syuzhet)
example_text <- "I’m so upset. I’ve just opened my parcel and my book is damaged. The corners are ruined and there’s a large scratch mark across the back cover."
get_sentiment(example_text)
## -2.6
• Sentiment for each sentence
example_text <- get_sentences("I’m so upset. I’ve just opened my parcel and my book is damaged. The corners are ruined and there’s a large scratch mark across the back cover.")
get_sentiment(example_text)
## [1] -0.75 -0.50 -1.35
4. Match tokens with sentiment lexicon II
• A limitation of syuzhet
library(syuzhet)
get_sentiment("This book is great.", method = "syuzhet")
## 0.5
get_sentiment("This book is not great.", method = "syuzhet")
## 0.5
get_sentiment("great", method = "syuzhet")
## 0.5
get_sentiment("This book is great.", method = "bing")
## 1
get_sentiment("This book is not great.", method = "bing")
## 1
get_sentiment("This book is great.", method = "afinn")
## 3
get_sentiment("This book is not great.", method = "afinn")
## 3
4. Match tokens with sentiment lexicon III
• A more nuanced approach that accounts for negators is the sentimentr package
library(sentimentr)
sentiment("not great") # takes into account negation
## -0.3535534
example_text <- "I’m so upset. I’ve just opened my parcel and my book is damaged. The corners are ruined and there’s a large scratch mark across the back cover."
sentiment(example_text)
## element_id sentence_id word_count sentiment
## 1: 1 1 4 -0.3750000
## 2: 1 2 11 -0.1507557
## 3: 1 3 15 -0.3485685
Valence shifters
• Nearly all sentiment analysis packages in R fail to recognise valence shifters; sentimentr is an exception:
• Negators (e.g., I do not like it)
• Amplifiers (intensifiers) (e.g., “I really like it.”)
• De-amplifiers (downtoners) (e.g., “I hardly like it.”)
• Adversative conjunctions (e.g., “I like it but it’s not worth it.”)
Text                      Negator  Amplifier  Deamplifier  Adversative
Cannon reviews            21%      23%        8%           12%
2012 presidential debate  23%      18%        1%           11%
Trump speeches            12%      14%        3%           10%
Trump tweets              19%      18%        4%           4%
Dylan songs               4%       10%        0%           4%
Austen books              21%      18%        6%           11%
Hamlet                    26%      17%        2%           16%
Source and code available at the CRAN sentimentr page
A limitation of nuanced approaches
Sentiment for each sentence
• needs punctuated data, but data is often unpunctuated (e.g., texts on social media) or badly punctuated
• without punctuation the whole text is seen as one sentence
• requires accurate sentence boundary disambiguation (sentence breaking), and the data might not be suitable (e.g., Twitter data → too short)
Sentiment Trajectory Analysis
• Researchers are seeking better ways to analyse and understand sentiments
• UCL Department of Security and Crime Science (Dr. Bennett Kleinberg and his PhD students) examined sentiment trajectories: the flow of sentiment over narrative time
• YouTube vloggers, YouTube news channels and UK drill music
• Inspired by the work of Matthew L. Jockers and the idea that sentiment is
dynamic within texts
• Possible benefits: understanding the stylistic structure of texts and what makes a speech popular, detecting misinformation, etc.
Sentiment Trajectory Analysis
• Kurt Vonnegut proposed a Master’s thesis arguing that “stories have shapes which can be drawn on graph paper”; it was rejected for being too much fun and too simple
• Inspired by Vonnegut’s ideas, data scientists used sentiment analysis to examine stories
• the emotional arcs of stories are dominated by six basic shapes, defined as story types (rags to riches, riches to rags, Icarus, Oedipus, Cinderella, man in a hole)
Story: Frankenstein (Shelley)
Identified type: Oedipus
(Image generated using R packages; source: BBC)
Sentiment Trajectories - UCL researchers
Example and images in this and following slides courtesy of Dr. Bennett Kleinberg.
Steps in Sentiment Trajectories
1. Parse text into tokens
2. Match sentiment lexicon to each token
• Match valence shifters to each context
• Apply valence shifter weights
• Build a naïve context around the sentiment
• Return a modified sentiment
Steps in Sentiment Trajectories
• R script developed by Bennett Kleinberg: https://github.com/ben-aaron188/naive_context_sentiment/blob/master/ncs_evaluation.R
• sentence_text <- “…it was especially terrifying because her husband passed away from lung cancer after receiving….“
Parse text into tokens
Text Index
it 25
was 26
especially 27
terrifying 28
her 29
husband 30
passed 31
away 32
from 33
lung 34
Match sentiment
Check whether any of these words are in a sentiment list
Text Index y
it 25 NA
was 26 NA
especially 27 NA
terrifying 28 -1
her 29 NA
husband 30 NA
passed 31 NA
away 32 NA
from 33 NA
lung 34 NA
Match valence shifters
Are there any valence shifters?
Text Index Sentiment Valence
it 25 NA NA
was 26 NA NA
especially 27 NA 2
terrifying 28 -1 NA
her 29 NA NA
husband 30 NA NA
passed 31 NA NA
away 32 NA NA
from 33 NA NA
lung 34 NA NA
1 = negator (not, never, …)
2 = amplifier (very, totally, …)
3 = deamplifier (hardly, barely, …)
4 = adversative conjunction (but,
however, …)
Valence shifters
Weights defined:
1 = negator (not, never, …): -1.00
2 = amplifier (very, totally, …): 1.50
3 = deamplifier (hardly, barely, …): 0.50
4 = adversative conjunction (but, however, …): 0.25
Sentiment trajectory
Build ‘naive’ context around the sentiment
2 words before the sentiment word and 2 words after
Text Index Sentiment Valence weights
was 26 NA NA 1.0
especially 27 NA 2 1.5
terrifying 28 -1 NA 1.0
her 29 NA NA 1.0
husband 30 NA NA 1.0
Calculate Modified Sentiment
Text Index Sentiment Valence Weights Sentiment_score_mod
was 26 NA NA 1.0 0.0
especially 27 NA 2 1.5 0.0
terrifying 28 -1 NA 1.0 -1.5
her 29 NA NA 1.0 0.0
husband 30 NA NA 1.0 0.0
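The modified-sentiment calculation above can be reproduced with a short base-R sketch of the naive-context idea (a simplification for illustration, not the ncs_evaluation.R script itself):

```r
tokens    <- c("was", "especially", "terrifying", "her", "husband")
sentiment <- c(NA, NA, -1, NA, NA)        # "terrifying" matched the lexicon
weights   <- c(1.0, 1.5, 1.0, 1.0, 1.0)  # "especially" = amplifier (1.5)

# Multiply each sentiment hit by the valence-shifter weights found in a
# naive context of 2 words before and 2 words after it
modified <- sapply(seq_along(tokens), function(i) {
  if (is.na(sentiment[i])) return(0)
  ctx <- weights[max(1, i - 2):min(length(tokens), i + 2)]
  sentiment[i] * prod(ctx)
})
modified[3]  # -1.5: "terrifying" amplified by "especially"
```

A negator in the same window (weight -1) would instead flip the sign, and a deamplifier (weight 0.5) would halve the score, matching the weight table two slides back.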
Trajectories
Sentiment score plotted
Trajectories
Sentiment score plotted, scaled between -1 and 1.
Problem: Every document has different lengths.
How to standardise the length?
Solution: Length Standardisation
• Aim: transform the sentiment values of each text into a vector
• a standard vector length allows comparisons between texts
• Example: 100 values with Discrete Cosine Transformation (DCT)
(explainer on Fourier Transformation)
• 100 was chosen to interpret the results as percentages of vlog
progression time
• Each of the 100 points corresponds to 1% of the length of the text
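A base-R sketch of the idea (syuzhet's get_dct_transform does something similar; the helper names `dct` and `dct_resample` are made up): take the DCT of the raw sentiment series, keep only the low-frequency components, and evaluate the cosine series at 100 evenly spaced points.

```r
# Forward DCT-II of a numeric series
dct <- function(x) {
  n <- length(x)
  sapply(seq_len(n) - 1, function(k)
    sum(x * cos(pi * (seq_len(n) - 0.5) * k / n)))
}

# Evaluate the first few DCT components at n_out points (length standardisation)
dct_resample <- function(X, n_orig, n_out) {
  sapply(seq_len(n_out), function(t)
    X[1] / n_orig + (2 / n_orig) *
      sum(X[-1] * cos(pi * seq_along(X[-1]) * (t - 0.5) / n_out)))
}

raw    <- sin(seq(0, pi, length.out = 37))       # a toy sentiment series
coefs  <- dct(raw)[1:5]                          # keep 5 low-frequency components
scaled <- dct_resample(coefs, length(raw), 100)  # 100 points = 1% steps
length(scaled)  # 100
```

Because only the low-frequency components are kept, the 100-point output is both length-standardised and smoothed, which is why every text can then be plotted on the same 0-100% narrative-time axis.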
After Discrete Cosine Transformation
Trajectories
Beware of the “filter” size: increasing the filter parameter will add more granularity. The initial default filter size is 5; compare with filter size 20.
Further reading: recommended texts in the preparation reading and M. Jockers’ blogpost on revealing sentiment and plot arcs, available here
What’s next
• Today’s Tutorial
• n-grams, KWiC analysis, sentiment analysis, trajectory analysis
Homework:
• POS tagging, trajectory parameters, string processing
• Sentiment trajectories of the Russian-Ukrainian War
• Putin’s and Zelenskyy’s speeches (submit this homework to receive feedback)
