MA331-example-exploration-code-for-the-assignment.R
dbrawn
2022-12-13
# This code contains some example exploration with the
# assignment in mind. We start with steps which almost everyone
# will need to perform.
# note that the use of "print()" below is purely for rendering into a pdf
# and is not needed otherwise
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(stringr)
library(tidytext)
##
## Attaching package: 'tidytext'
## The following object is masked _by_ '.GlobalEnv':
##
## sentiments
# set your own working directory here
setwd("C:/Users/dbrawn/Desktop/MA331/assignment 2022")
# a modern word list (on moodle) derived from google and the internet,
# so these are genuinely modern words (not Shakespeare?);
# you could modify this list as a further
# development or take it as it comes
modern_words <- read.delim("modern_word_count.txt")
nrow(modern_words)
## [1] 333333
str(modern_words) # it's big, maybe too big? investigate, but don't ONLY analyse this file
## 'data.frame': 333333 obs. of 2 variables:
## $ word : chr "the" "of" "and" "to" ...
## $ word_count: num 2.31e+10 1.32e+10 1.30e+10 1.21e+10 9.08e+09 ...
names(modern_words) # the word counts are huge themselves
## [1] "word" "word_count"
modern_words%>%arrange(word_count)%>%head(30) # some not really words
## word word_count
## 1 yyt 12711
## 2 yuaoo 12711
## 3 yoooog 12711
## 4 yoohl 12711
## 5 yooal 12711
## 6 yhaoop 12711
## 7 yaoob 12711
## 8 wwwsexdrugs 12711
## 9 wwwphotogalleries 12711
## 10 wwwlips 12711
## 11 wwwhealth 12711
## 12 wwwgoldenpalace 12711
## 13 wwwdebt 12711
## 14 wiseonline 12711
## 15 vrie 12711
## 16 vogole 12711
## 17 vogoe 12711
## 18 triose 12711
## 19 tooolo 12711
## 20 toooga 12711
## 21 tooog 12711
## 22 toogel 12711
## 23 tolloe 12711
## 24 tollge 12711
## 25 tgolle 12711
## 26 referencedesk 12711
## 27 poolge 12711
## 28 platypusmaximus 12711
## 29 oyogo 12711
## 30 otoool 12711
modern_words%>%arrange(desc(word_count))%>%head(30) # many are like "the" or "of" etc
## word word_count
## 1 the 23135851162
## 2 of 13151942776
## 3 and 12997637966
## 4 to 12136980858
## 5 a 9081174698
## 6 in 8469404971
## 7 for 5933321709
## 8 is 4705743816
## 9 on 3750423199
## 10 that 3400031103
## 11 by 3350048871
## 12 this 3228469771
## 13 with 3183110675
## 14 i 3086225277
## 15 you 2996181025
## 16 it 2813163874
## 17 not 2633487141
## 18 or 2590739907
## 19 be 2398724162
## 20 are 2393614870
## 21 from 2275595356
## 22 at 2272272772
## 23 as 2247431740
## 24 your 2062066547
## 25 all 2022459848
## 26 have 1564202750
## 27 new 1551258643
## 28 more 1544771673
## 29 an 1518266684
## 30 was 1483428678
King_Lear <- read.csv("King_Lear_words_and_players_only.csv")
str(King_Lear)
## 'data.frame': 3676 obs. of 2 variables:
## $ text : chr " I thought the King had more affected the Duke" "of Albany than Cornwall." " It did always seem so to us, but now in" "the division of the kingdom, it appears not which" ...
## $ player: chr "KENT" "KENT" "GLOUCESTER" "GLOUCESTER" ...
King_Lear%>%head(30)
## text player
## 1 I thought the King had more affected the Duke KENT
## 2 of Albany than Cornwall. KENT
## 3 It did always seem so to us, but now in GLOUCESTER
## 4 the division of the kingdom, it appears not which GLOUCESTER
## 5 of the dukes he values most, for equalities are so GLOUCESTER
## 6 weighed that curiosity in neither can make choice GLOUCESTER
## 7 of either's moiety. GLOUCESTER
## 8 Is not this your son, my lord? KENT
## 9 His breeding, sir, hath been at my GLOUCESTER
## 10 charge. I have so often blushed to acknowledge GLOUCESTER
## 11 him that now I am brazed to 't. GLOUCESTER
## 12 I cannot conceive you. KENT
## 13 Sir, this young fellow's mother could, GLOUCESTER
## 14 whereupon she grew round-wombed and had indeed, GLOUCESTER
## 15 sir, a son for her cradle ere she had a husband GLOUCESTER
## 16 for her bed. Do you smell a fault? GLOUCESTER
## 17 I cannot wish the fault undone, the issue of it KENT
## 18 being so proper. KENT
## 19 But I have a son, sir, by order of law, GLOUCESTER
## 20 some year elder than this, who yet is no dearer in GLOUCESTER
## 21 my account. Though this knave came something GLOUCESTER
## 22 saucily to the world before he was sent for, yet was GLOUCESTER
## 23 his mother fair, there was good sport at his making, GLOUCESTER
## 24 and the whoreson must be acknowledged.--Do you GLOUCESTER
## 25 know this noble gentleman, Edmund? GLOUCESTER
## 26 No, my lord. EDMUND
## 27 My lord of Kent. Remember him hereafter GLOUCESTER
## 28 as my honorable friend. GLOUCESTER
## 29 My services to your Lordship. EDMUND
## 30 I must love you and sue to know you better. KENT
data(stop_words) # a reference tibble of stop words in tidy format
stop_words
## # A tibble: 1,149 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # ... with 1,139 more rows
nrow(stop_words) # the list of stop words is not huge; is it big enough?
## [1] 1149
King_Lear_tidy <- King_Lear %>% # the standard unnest_tokens() converts
# to a tidy format
unnest_tokens(word, text)%>%
anti_join(stop_words) # removal of reference stop words done with anti_join
## Joining, by = "word"
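# As a quick toy illustration of what unnest_tokens() does (a sketch on
# made-up text, not part of the assignment data): it lowercases, strips
# punctuation and returns one row per token.
tibble(text = c("To be, or not to be:", "that is the question.")) %>%
  unnest_tokens(word, text) # yields to, be, or, not, to, be, that, is, the, question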
King_Lear_tidy %>%count(word, sort = TRUE)%>%head(50) # count most common words
## word n
## 1 thou 220
## 2 thy 159
## 3 thee 138
## 4 sir 109
## 5 lord 97
## 6 king 67
## 7 father 66
## 8 hath 55
## 9 tis 53
## 10 love 52
## 11 speak 50
## 12 poor 48
## 13 heart 47
## 14 fool 44
## 15 art 41
## 16 eyes 39
## 17 life 36
## 18 nature 36
## 19 madam 32
## 20 edmund 31
## 21 sister 31
## 22 night 30
## 23 daughter 29
## 24 time 29
## 25 daughters 28
## 26 son 28
## 27 dost 27
## 28 fellow 27
## 29 gods 26
## 30 hand 26
## 31 hear 26
## 32 pray 26
## 33 master 25
## 34 thine 25
## 35 letter 24
## 36 gloucester 23
## 37 dear 22
## 38 death 22
## 39 duke 22
## 40 france 22
## 41 hast 22
## 42 mine 22
## 43 ay 21
## 44 call 21
## 45 cordelia 21
## 46 grace 21
## 47 hold 21
## 48 stand 21
## 49 brother 20
## 50 head 20
# decide which words not included in the reference stop words should be
# added as stop words. I add just "thou", "thy", "thee" and "tis"
# as stop words since these are very common
custom_stop_words <- bind_rows(tibble(word = c("thou","thy","thee","tis"),
lexicon = c("custom")),
stop_words) # one syntax for adding to the list
# define your own extra stop words
head(custom_stop_words)
## # A tibble: 6 x 2
## word lexicon
## <chr> <chr>
## 1 thou custom
## 2 thy custom
## 3 thee custom
## 4 tis custom
## 5 a SMART
## 6 a's SMART
# removal of custom stop words in King_Lear
King_Lear_tidy <- King_Lear_tidy %>%
anti_join(custom_stop_words)
## Joining, by = "word"
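# An equivalent alternative syntax for extending the stop word list
# (a sketch only; tibble::add_row() appends the new rows at the end,
# which makes no difference to anti_join()):
custom_stop_words_alt <- stop_words %>%
  tibble::add_row(word = c("thou", "thy", "thee", "tis"), lexicon = "custom")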
# sentiment library afinn, one of 3 sentiment libraries; use at least one
get_sentiments("afinn")%>%arrange(value)%>%print(n=30)
## # A tibble: 2,477 x 2
## word value
## <chr> <dbl>
## 1 bastard -5
## 2 bastards -5
## 3 bitch -5
## 4 bitches -5
## 5 cock -5
## 6 cocksucker -5
## 7 cocksuckers -5
## 8 cunt -5
## 9 motherfucker -5
## 10 motherfucking -5
## 11 niggas -5
## 12 nigger -5
## 13 prick -5
## 14 slut -5
## 15 son-of-a-bitch -5
## 16 twat -5
## 17 ass -4
## 18 assfucking -4
## 19 asshole -4
## 20 bullshit -4
## 21 catastrophic -4
## 22 damn -4
## 23 damned -4
## 24 damnit -4
## 25 dick -4
## 26 dickhead -4
## 27 fraud -4
## 28 frauds -4
## 29 fraudster -4
## 30 fraudsters -4
## # ... with 2,447 more rows
get_sentiments("afinn")%>%arrange(desc(value))%>%print(n=30)
## # A tibble: 2,477 x 2
## word value
## <chr> <dbl>
## 1 breathtaking 5
## 2 hurrah 5
## 3 outstanding 5
## 4 superb 5
## 5 thrilled 5
## 6 amazing 4
## 7 awesome 4
## 8 brilliant 4
## 9 ecstatic 4
## 10 euphoric 4
## 11 exuberant 4
## 12 fabulous 4
## 13 fantastic 4
## 14 fun 4
## 15 funnier 4
## 16 funny 4
## 17 godsend 4
## 18 heavenly 4
## 19 lifesaver 4
## 20 lmao 4
## 21 lmfao 4
## 22 masterpiece 4
## 23 masterpieces 4
## 24 miracle 4
## 25 overjoyed 4
## 26 rapturous 4
## 27 rejoice 4
## 28 rejoiced 4
## 29 rejoices 4
## 30 rejoicing 4
## # ... with 2,447 more rows
table(get_sentiments("afinn")[,2]) # look at the distribution of values
## value
## -5 -4 -3 -2 -1 0 1 2 3 4 5
## 16 43 264 966 309 1 208 448 172 45 5
get_sentiments("afinn")%>%pull(value)%>%mean() # slight negative mean
## [1] -0.5894227
print(hist(get_sentiments("afinn")%>%pull(value))) # bimodal ?
[Figure: histogram of get_sentiments("afinn") %>% pull(value); x axis: value, y axis: Frequency. The list printed below is the histogram object returned by hist().]
## $breaks
## [1] -5 -4 -3 -2 -1 0 1 2 3 4 5
##
## $counts
## [1] 59 264 966 309 1 208 448 172 45 5
##
## $density
## [1] 0.0238191361 0.1065805410 0.3899878886 0.1247476786 0.0004037142
## [6] 0.0839725474 0.1808639483 0.0694388373 0.0181671377 0.0020185709
##
## $mids
## [1] -4.5 -3.5 -2.5 -1.5 -0.5 0.5 1.5 2.5 3.5 4.5
##
## $xname
## [1] "get_sentiments(\"afinn\") %>% pull(value)"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
# count most common words in the play after stop word removal
King_Lear_tidy %>%count(word, sort = TRUE)%>%head()
## word n
## 1 sir 109
## 2 lord 97
## 3 king 67
## 4 father 66
## 5 hath 55
## 6 love 52
# now join with afinn and find weighted effect
# which accounts for the value and the number of times the word appears
King_Lear_words <- King_Lear_tidy %>%count(word, sort = TRUE)%>%
inner_join(get_sentiments("afinn"),"word")%>%
mutate(weighted=n*value)
King_Lear_words%>%head(50)
## word n value weighted
## 1 love 52 3 156
## 2 poor 48 -2 -96
## 3 fool 44 -2 -88
## 4 pray 26 1 26
## 5 dear 22 2 44
## 6 death 22 -2 -44
## 7 grace 21 1 21
## 8 noble 19 2 38
## 9 mad 17 -3 -51
## 10 fear 16 -2 -32
## 11 true 16 2 32
## 12 fire 15 -2 -30
## 13 matter 15 1 15
## 14 dead 14 -3 -42
## 15 ha 14 2 28
## 16 leave 12 -1 -12
## 17 cry 11 -1 -11
## 18 die 11 -3 -33
## 19 fit 11 1 11
## 20 peace 11 2 22
## 21 shame 11 -2 -22
## 22 strange 11 -1 -11
## 23 strike 11 -1 -11
## 24 fair 10 2 20
## 25 hard 10 -1 -10
## 26 honor 10 2 20
## 27 pity 10 -2 -20
## 28 sweet 10 2 20
## 29 worst 10 -3 -30
## 30 justice 9 2 18
## 31 kill 9 -3 -27
## 32 loved 9 3 27
## 33 pardon 9 2 18
## 34 tears 9 -2 -18
## 35 weep 9 -2 -18
## 36 worse 9 -3 -27
## 37 worth 9 2 18
## 38 blame 8 -2 -16
## 39 bless 8 2 16
## 40 fools 8 -2 -16
## 41 heaven 8 2 16
## 42 lost 8 -3 -24
## 43 rage 8 -2 -16
## 44 sorrow 8 -2 -16
## 45 abused 7 -3 -21
## 46 care 7 2 14
## 47 grief 7 -2 -14
## 48 honest 7 2 14
## 49 ill 7 -2 -14
## 50 murder 7 -2 -14
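# A natural per-player extension of the same idea (a sketch only, output
# not shown): total and mean AFINN-weighted sentiment for each player.
King_Lear_tidy %>%
  inner_join(get_sentiments("afinn"), "word") %>%
  group_by(player) %>%
  summarise(total_sentiment = sum(value), mean_sentiment = mean(value)) %>%
  arrange(total_sentiment)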
print(hist(King_Lear_words%>%pull(weighted))) # no longer bimodal?
[Figure: histogram of King_Lear_words %>% pull(weighted); x axis: weighted, y axis: Frequency.]
## $breaks
## [1] -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160
##
## $counts
## [1] 2 0 3 12 252 164 7 1 0 0 0 0 1
##
## $density
## [1] 0.0002262443 0.0000000000 0.0003393665 0.0013574661 0.0285067873
## [6] 0.0185520362 0.0007918552 0.0001131222 0.0000000000 0.0000000000
## [11] 0.0000000000 0.0000000000 0.0001131222
##
## $mids
## [1] -90 -70 -50 -30 -10 10 30 50 70 90 110 130 150
##
## $xname
## [1] "King_Lear_words %>% pull(weighted)"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
str(King_Lear_words)
## 'data.frame': 442 obs. of 4 variables:
## $ word : chr "love" "poor" "fool" "pray" ...
## $ n : int 52 48 44 26 22 22 21 19 17 16 ...
## $ value : num 3 -2 -2 1 2 -2 1 2 -3 -2 ...
## $ weighted: num 156 -96 -88 26 44 -44 21 38 -51 -32 ...
King_Lear_sentiments <- King_Lear_words$weighted # looking at the weights
mean(King_Lear_sentiments) # slightly neg
## [1] -1.21267
t.test(King_Lear_sentiments) # test if significantly neg
##
## One Sample t-test
##
## data: King_Lear_sentiments
## t = -1.9018, df = 441, p-value = 0.05785
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -2.46585802 0.04051865
## sample estimates:
## mean of x
## -1.21267
# p-value = 0.05785 > 0.05, so the mean is not significantly non-zero at the 5% level
# the nrc sentiment library is an alternative sentiment library
nrc<- get_sentiments("nrc")
table(nrc$sentiment) # 10 sentiment labels for words
##
## anger anticipation disgust fear joy
## 1245 837 1056 1474 687
## negative positive sadness surprise trust
## 3316 2308 1187 532 1230
nrow(nrc) # again, is this sufficient to match the play's words?
## [1] 13872
nrow(King_Lear_tidy)
## [1] 9306
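# One way to answer that coverage question directly (a sketch; output
# not shown): the proportion of the play's distinct words that appear
# in the nrc lexicon at all.
mean(unique(King_Lear_tidy$word) %in% unique(nrc$word))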
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy") # select joy sentiment words
nrc_joy
## # A tibble: 687 x 2
## word sentiment
## <chr> <chr>
## 1 absolution joy
## 2 abundance joy
## 3 abundant joy
## 4 accolade joy
## 5 accompaniment joy
## 6 accomplish joy
## 7 accomplished joy
## 8 achieve joy
## 9 achievement joy
## 10 acrobat joy
## # ... with 677 more rows
king_lear_joy_words <- King_Lear_tidy %>%inner_join(nrc_joy)%>%
count(word, sort = TRUE)%>%head(30) # most popular joy words
## Joining, by = "word"
print(plot(king_lear_joy_words$n))
## NULL
abline(h=10) # one might say there are only about ten main joyful words in the play
[Figure: index plot of king_lear_joy_words$n with a horizontal reference line at 10.]
king_lear_joy_words%>%arrange(desc(n))%>%head(10) # really joyful ?
## word n
## 1 love 52
## 2 art 41
## 3 daughter 29
## 4 pray 26
## 5 friend 19
## 6 true 16
## 7 fortune 14
## 8 child 12
## 9 found 12
## 10 peace 11
bing <- get_sentiments("bing") # a third sentiment library
head(bing)
## # A tibble: 6 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
table(bing$sentiment) # check how many pos or neg
##
## negative positive
## 4781 2005
# list of negative words from the Bing lexicon.
bingnegative <- get_sentiments("bing") %>%
filter(sentiment == "negative")
head(bingnegative)
## # A tibble: 6 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
table(bingnegative$sentiment) # confirming all negative
##
## negative
## 4781
King_Lear_tidy %>%
inner_join(get_sentiments("bing"),"word") %>%count(player,sentiment)
## player sentiment n
## 1 ALBANY negative 51
## 2 ALBANY positive 35
## 3 BURGUNDY negative 2
## 4 BURGUNDY positive 3
## 5 CAPTAIN positive 1
## 6 CORDELIA negative 37
## 7 CORDELIA positive 33
## 8 CORNWALL negative 36
## 9 CORNWALL positive 26
## 10 DOCTOR negative 11
## 11 DOCTOR positive 1
## 12 EDGAR negative 207
## 13 EDGAR positive 86
## 14 EDMUND negative 104
## 15 EDMUND positive 80
## 16 FIRST SERVANT negative 4
## 17 FOOL negative 66
## 18 FOOL positive 36
## 19 FRANCE negative 13
## 20 FRANCE positive 14
## 21 GENTLEMAN negative 25
## 22 GENTLEMAN positive 14
## 23 GLOUCESTER negative 139
## 24 GLOUCESTER positive 68
## 25 GONERIL negative 68
## 26 GONERIL positive 42
## 27 HERALD negative 1
## 28 HERALD positive 1
## 29 KENT negative 138
## 30 KENT positive 88
## 31 KNIGHT negative 3
## 32 KNIGHT positive 4
## 33 LEAR negative 316
## 34 LEAR positive 142
## 35 MESSENGER negative 6
## 36 MESSENGER positive 3
## 37 OLD MAN negative 5
## 38 OSWALD negative 30
## 39 OSWALD positive 13
## 40 REGAN negative 56
## 41 REGAN positive 35
## 42 SECOND SERVANT negative 3
## 43 SECOND SERVANT positive 1
## 44 THIRD SERVANT negative 2
## 45 THIRD SERVANT positive 1
# now pivot wider to separate positive and negative
# then split pos and neg into a non tidy format
# then add to above with a pos:neg measure
King_Lear_sentiment_long <- King_Lear_tidy %>%
inner_join(get_sentiments("bing"),"word") %>%count(player,sentiment) %>%
# up to this point count separately + and - per player
# now pivot wider to put positive and negative in separate cols
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
# now define my own measure range -1 to +1
mutate(sentiment_score = (positive-negative)/(positive+negative))%>%
arrange(sentiment_score)
King_Lear_sentiment_long%>%print(n=30) # is there any correspondence here
## # A tibble: 24 x 4
## player negative positive sentiment_score
## <chr> <int> <int> <dbl>
## 1 FIRST SERVANT 4 0 -1
## 2 OLD MAN 5 0 -1
## 3 DOCTOR 11 1 -0.833
## 4 SECOND SERVANT 3 1 -0.5
## 5 EDGAR 207 86 -0.413
## 6 OSWALD 30 13 -0.395
## 7 LEAR 316 142 -0.380
## 8 GLOUCESTER 139 68 -0.343
## 9 MESSENGER 6 3 -0.333
## 10 THIRD SERVANT 2 1 -0.333
## 11 FOOL 66 36 -0.294
## 12 GENTLEMAN 25 14 -0.282
## 13 GONERIL 68 42 -0.236
## 14 REGAN 56 35 -0.231
## 15 KENT 138 88 -0.221
## 16 ALBANY 51 35 -0.186
## 17 CORNWALL 36 26 -0.161
## 18 EDMUND 104 80 -0.130
## 19 CORDELIA 37 33 -0.0571
## 20 HERALD 1 1 0
## 21 FRANCE 13 14 0.0370
## 22 KNIGHT 3 4 0.143
## 23 BURGUNDY 2 3 0.2
## 24 CAPTAIN 0 1 1
# between the measure and the play;
# consider sample size = 1 for CAPTAIN!
wordcounts <- King_Lear_tidy %>% # how many (non-stop) words are spoken by each player
count(player,sort=TRUE)
names(wordcounts)[2] <- "total_words" # use clear variable names
wordcounts # I use this and left_join below
## player total_words
## 1 LEAR 2044
## 2 EDGAR 1253
## 3 KENT 969
## 4 GLOUCESTER 901
## 5 EDMUND 854
## 6 FOOL 591
## 7 GONERIL 516
## 8 REGAN 443
## 9 ALBANY 367
## 10 CORDELIA 316
## 11 CORNWALL 285
## 12 GENTLEMAN 223
## 13 OSWALD 181
## 14 FRANCE 89
## 15 MESSENGER 54
## 16 DOCTOR 45
## 17 KNIGHT 34
## 18 BURGUNDY 26
## 19 CURAN 25
## 20 FIRST SERVANT 21
## 21 OLD MAN 19
## 22 HERALD 18
## 23 THIRD SERVANT 12
## 24 CAPTAIN 9
## 25 SECOND SERVANT 8
## 26 ALBANY/CORNWALL 3
# calc ratio of negative to total words per player
King_Lear_tidy %>%
semi_join(bingnegative) %>%
group_by(player) %>%
summarize(negativewords = n()) %>%
left_join(wordcounts, by = c("player"))%>%
mutate(ratio = negativewords/total_words) %>%
arrange(ratio) %>%
ungroup()%>% print(n=24)
## Joining, by = "word"
## # A tibble: 23 x 4
## player negativewords total_words ratio
## <chr> <int> <int> <dbl>
## 1 HERALD 1 18 0.0556
## 2 BURGUNDY 2 26 0.0769
## 3 KNIGHT 3 34 0.0882
## 4 MESSENGER 6 54 0.111
## 5 FOOL 66 591 0.112
## 6 GENTLEMAN 25 223 0.112
## 7 CORDELIA 37 316 0.117
## 8 EDMUND 104 854 0.122
## 9 CORNWALL 36 285 0.126
## 10 REGAN 56 443 0.126
## 11 GONERIL 68 516 0.132
## 12 ALBANY 51 367 0.139
## 13 KENT 138 969 0.142
## 14 FRANCE 13 89 0.146
## 15 GLOUCESTER 139 901 0.154
## 16 LEAR 316 2044 0.155
## 17 EDGAR 207 1253 0.165
## 18 OSWALD 30 181 0.166
## 19 THIRD SERVANT 2 12 0.167
## 20 FIRST SERVANT 4 21 0.190
## 21 DOCTOR 11 45 0.244
## 22 OLD MAN 5 19 0.263
## 23 SECOND SERVANT 3 8 0.375
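# The ratios for the minor players rest on tiny samples. As a rough
# sketch of the uncertainty (treating word tokens as independent, which
# is questionable), a binomial confidence interval for SECOND SERVANT:
binom.test(x = 3, n = 8)$conf.int # a very wide interval, so the 0.375 ratio is barely informative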
# now I compare the modern word list with Lear, first words in common
King_Lear_tidy_modern_words <- King_Lear_tidy%>%count(player,word)%>%
inner_join(modern_words,"word")
names(King_Lear_tidy_modern_words)[3] <- "times_spoken"
King_Lear_tidy_modern_words%>%arrange(desc(word_count))%>%head(50)
## player word times_spoken word_count
## 1 EDMUND home 1 1276852170
## 2 GLOUCESTER home 1 1276852170
## 3 GONERIL home 1 1276852170
## 4 KENT home 2 1276852170
## 5 LEAR home 2 1276852170
## 6 REGAN home 3 1276852170
## 7 CORDELIA search 1 1024093118
## 8 EDGAR free 3 1014107316
## 9 LEAR free 2 1014107316
## 10 ALBANY time 2 908705570
## 11 CORDELIA time 1 908705570
## 12 CURAN time 1 908705570
## 13 DOCTOR time 1 908705570
## 14 EDGAR time 4 908705570
## 15 EDMUND time 5 908705570
## 16 FOOL time 3 908705570
## 17 FRANCE time 1 908705570
## 18 GLOUCESTER time 2 908705570
## 19 GONERIL time 1 908705570
## 20 KENT time 5 908705570
## 21 LEAR time 2 908705570
## 22 REGAN time 1 908705570
## 23 ALBANY news 1 755424983
## 24 CORNWALL news 1 755424983
## 25 CURAN news 1 755424983
## 26 EDMUND news 1 755424983
## 27 GLOUCESTER news 1 755424983
## 28 GONERIL news 1 755424983
## 29 LEAR news 1 755424983
## 30 MESSENGER news 1 755424983
## 31 ALBANY business 2 637134177
## 32 CORDELIA business 1 637134177
## 33 EDGAR business 2 637134177
## 34 EDMUND business 5 637134177
## 35 GLOUCESTER business 2 637134177
## 36 LEAR business 1 637134177
## 37 OSWALD business 1 637134177
## 38 EDGAR web 1 619571575
## 39 EDMUND view 1 602279334
## 40 EDMUND services 2 562206804
## 41 GONERIL services 2 562206804
## 42 LEAR services 1 562206804
## 43 EDGAR service 1 519537222
## 44 FIRST SERVANT service 1 519537222
## 45 KENT service 2 519537222
## 46 LEAR service 2 519537222
## 47 OSWALD service 1 519537222
## 48 LEAR price 1 501651226
## 49 EDGAR top 1 484213771
## 50 EDMUND top 1 484213771
# removal of reference stop words in King_Lear_tidy_modern_words
# (these were already removed upstream, so this changes nothing here)
King_Lear_tidy_modern_words <- King_Lear_tidy_modern_words %>%
anti_join(stop_words)
## Joining, by = "word"
King_Lear_tidy_modern_words%>%arrange(desc(word_count))%>%head(100)
## player word times_spoken word_count
## 1 EDMUND home 1 1276852170
## 2 GLOUCESTER home 1 1276852170
## 3 GONERIL home 1 1276852170
## 4 KENT home 2 1276852170
## 5 LEAR home 2 1276852170
## 6 REGAN home 3 1276852170
## 7 CORDELIA search 1 1024093118
## 8 EDGAR free 3 1014107316
## 9 LEAR free 2 1014107316
## 10 ALBANY time 2 908705570
## 11 CORDELIA time 1 908705570
## 12 CURAN time 1 908705570
## 13 DOCTOR time 1 908705570
## 14 EDGAR time 4 908705570
## 15 EDMUND time 5 908705570
## 16 FOOL time 3 908705570
## 17 FRANCE time 1 908705570
## 18 GLOUCESTER time 2 908705570
## 19 GONERIL time 1 908705570
## 20 KENT time 5 908705570
## 21 LEAR time 2 908705570
## 22 REGAN time 1 908705570
## 23 ALBANY news 1 755424983
## 24 CORNWALL news 1 755424983
## 25 CURAN news 1 755424983
## 26 EDMUND news 1 755424983
## 27 GLOUCESTER news 1 755424983
## 28 GONERIL news 1 755424983
## 29 LEAR news 1 755424983
## 30 MESSENGER news 1 755424983
## 31 ALBANY business 2 637134177
## 32 CORDELIA business 1 637134177
## 33 EDGAR business 2 637134177
## 34 EDMUND business 5 637134177
## 35 GLOUCESTER business 2 637134177
## 36 LEAR business 1 637134177
## 37 OSWALD business 1 637134177
## 38 EDGAR web 1 619571575
## 39 EDMUND view 1 602279334
## 40 EDMUND services 2 562206804
## 41 GONERIL services 2 562206804
## 42 LEAR services 1 562206804
## 43 EDGAR service 1 519537222
## 44 FIRST SERVANT service 1 519537222
## 45 KENT service 2 519537222
## 46 LEAR service 2 519537222
## 47 OSWALD service 1 519537222
## 48 LEAR price 1 501651226
## 49 EDGAR top 1 484213771
## 50 EDMUND top 1 484213771
## 51 GLOUCESTER top 1 484213771
## 52 LEAR top 1 484213771
## 53 GENTLEMAN people 1 480303376
## 54 GONERIL people 1 480303376
## 55 LEAR people 1 480303376
## 56 REGAN people 2 480303376
## 57 EDGAR list 1 472590641
## 58 REGAN list 1 472590641
## 59 EDMUND day 1 446236148
## 60 FOOL day 1 446236148
## 61 GLOUCESTER day 1 446236148
## 62 GONERIL day 1 446236148
## 63 LEAR day 2 446236148
## 64 FOOL health 1 440416431
## 65 GONERIL health 1 440416431
## 66 LEAR health 1 440416431
## 67 EDGAR world 4 431934249
## 68 EDMUND world 3 431934249
## 69 GENTLEMAN world 1 431934249
## 70 GLOUCESTER world 5 431934249
## 71 KENT world 1 431934249
## 72 LEAR world 5 431934249
## 73 REGAN world 1 431934249
## 74 DOCTOR music 1 414028837
## 75 FOOL buy 1 410780176
## 76 FRANCE buy 1 410780176
## 77 CORNWALL post 1 392956436
## 78 EDGAR post 1 392956436
## 79 KENT post 1 392956436
## 80 GONERIL add 1 387231739
## 81 GLOUCESTER policy 1 384401868
## 82 EDGAR support 1 373512569
## 83 LEAR support 1 373512569
## 84 OSWALD support 1 373512569
## 85 KENT message 1 373081242
## 86 ALBANY rights 1 352051342
## 87 KENT rights 1 352051342
## 88 REGAN rights 1 352051342
## 89 EDGAR books 1 347710184
## 90 FOOL school 1 343057316
## 91 CORNWALL company 1 324272258
## 92 GLOUCESTER company 1 324272258
## 93 GONERIL company 1 324272258
## 94 LEAR company 1 324272258
## 95 ALBANY read 3 322331766
## 96 EDMUND read 1 322331766
## 97 GENTLEMAN read 1 322331766
## 98 GONERIL read 1 322331766
## 99 KENT read 1 322331766
## 100 LEAR read 2 322331766
nrow(King_Lear_tidy_modern_words%>%count(word))/
nrow(King_Lear_tidy%>%count(word)) # proportion of the play's distinct words found in the modern list
## [1] 0.9
# now looking at old words by excluding modern words with anti_join
King_Lear_tidy_old_words <- King_Lear_tidy%>%anti_join(modern_words,"word")
King_Lear_tidy_old_words%>%count(word)%>%arrange(desc(n))%>%head(30)
## word n
## 1 nuncle 17
## 2 man's 15
## 3 father's 9
## 4 o'er 9
## 5 tom's 7
## 6 know'st 6
## 7 brother's 5
## 8 whoreson 5
## 9 e'er 4
## 10 gloucester's 4
## 11 lady's 4
## 12 ne'er 4
## 13 one's 4
## 14 sister's 4
## 15 coxcombs 3
## 16 duke's 3
## 17 fortune's 3
## 18 gav'st 3
## 19 ingrateful 3
## 20 think'st 3
## 21 treasons 3
## 22 woman's 3
## 23 albany's 2
## 24 all's 2
## 25 aroint 2
## 26 contemned 2
## 27 cornwall's 2
## 28 day's 2
## 29 dowerless 2
## 30 dowers 2
# remove apostrophe cases using stringr package commands
plural <- King_Lear_tidy_old_words$word%>%str_detect("'s") # really possessives rather than plurals
appos <- King_Lear_tidy_old_words$word%>%str_detect("'")
table(plural,appos) # e.g. as expected, every "'s" case also contains an apostrophe
## appos
## plural FALSE TRUE
## FALSE 272 42
## TRUE 0 179
# now I remove the words containing ' or 's from the list
King_Lear_tidy_old_words <- King_Lear_tidy_old_words[(!plural)&(!appos),]
King_Lear_tidy_old_words%>%count(word)%>%arrange(desc(n))%>%head(30)
## word n
## 1 nuncle 17
## 2 whoreson 5
## 3 coxcombs 3
## 4 ingrateful 3
## 5 treasons 3
## 6 aroint 2
## 7 contemned 2
## 8 dowerless 2
## 9 dowers 2
## 10 houseless 2
## 11 pillicock 2
## 12 revenges 2
## 13 saucily 2
## 14 sayst 2
## 15 scanted 2
## 16 aidant 1
## 17 alarumed 1
## 18 amities 1
## 19 anatomize 1
## 20 apish 1
## 21 arbitrament 1
## 22 asquint 1
## 23 attaint 1
## 24 atwain 1
## 25 avouch 1
## 26 avouched 1
## 27 awork 1
## 28 ballow 1
## 29 balmed 1
## 30 barbermonger 1
nrow(King_Lear_tidy_old_words)/nrow(King_Lear_tidy%>%count(word))
## [1] 0.07431694
# the proportion of old words is now 7.4%
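# An alternative to dropping those rows would be to strip the suffixes
# and keep the stems (a sketch using stringr; whether stems like "oer"
# from "o'er" still count as old words is a judgement call):
King_Lear_tidy %>%
  anti_join(modern_words, "word") %>% # rebuild the list before filtering
  mutate(word = word %>%
           str_remove("'s$") %>% # drop the possessive 's
           str_remove_all("'")) # drop any remaining apostrophes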
library(wordcloud) # a word cloud is just a nice graphic, not an analysis
## Warning: package 'wordcloud' was built under R version 4.2.2
## Loading required package: RColorBrewer
print(King_Lear_tidy%>%
anti_join(stop_words) %>%
count(word)%>%
with(wordcloud(word, n, scale=c(1.,.5),random.order = FALSE,
rot.per = 0, max.words = 50)))
## Joining, by = "word"
[Figure: word cloud of the 50 most common words in the play after stop-word removal (sir, lord, king, father, hath, love, ...).]
## NULL
# another use of the nrc sentiment library
nrc <- get_sentiments("nrc")
table(nrc$sentiment) # nrc has 10 sentiment labels for words
##
## anger anticipation disgust fear joy
## 1245 837 1056 1474 687
## negative positive sadness surprise trust
## 3316 2308 1187 532 1230
nrc_labels <- King_Lear_tidy %>%inner_join(nrc) # sentiments by player and word
## Joining, by = "word"
nrc_labels%>%head(30)
## player word sentiment
## 1 KENT king positive
## 2 KENT duke positive
## 3 GLOUCESTER curiosity anticipation
## 4 GLOUCESTER curiosity positive
## 5 GLOUCESTER curiosity surprise
## 6 GLOUCESTER choice positive
## 7 KENT lord disgust
## 8 KENT lord negative
## 9 KENT lord positive
## 10 KENT lord trust
## 11 GLOUCESTER sir positive
## 12 GLOUCESTER sir trust
## 13 GLOUCESTER sir positive
## 14 GLOUCESTER sir trust
## 15 GLOUCESTER mother anticipation
## 16 GLOUCESTER mother joy
## 17 GLOUCESTER mother negative
## 18 GLOUCESTER mother positive
## 19 GLOUCESTER mother sadness
## 20 GLOUCESTER mother trust
## 21 GLOUCESTER sir positive
## 22 GLOUCESTER sir trust
## 23 GLOUCESTER cradle anticipation
## 24 GLOUCESTER cradle joy
## 25 GLOUCESTER cradle positive
## 26 GLOUCESTER cradle trust
## 27 GLOUCESTER smell anger
## 28 GLOUCESTER smell disgust
## 29 GLOUCESTER smell negative
## 30 GLOUCESTER fault negative
table(nrc_labels$sentiment) # sentiments in the play as a whole
##
## anger anticipation disgust fear joy
## 613 543 648 744 560
## negative positive sadness surprise trust
## 1396 1626 642 323 1043
# collect the data for Lear and Kent to compare these characters
lear <- nrc_labels%>%filter(player=="LEAR")%>%group_by(sentiment)%>%
count()%>%pull(n)
kent <- nrc_labels%>%filter(player=="KENT")%>%group_by(sentiment) %>%
count()%>%pull(n)
print(plot(lear~kent)) # crude plot for quick visualisation
[Figure: scatter plot of lear against kent sentiment counts.]
## NULL
sentiments <- nrc_labels%>%filter(player=="LEAR")%>%group_by(sentiment)%>%
count()%>%pull(sentiment)
sentiments # same as the list in nrc as a whole
## [1] "anger" "anticipation" "disgust" "fear"
## [5] "joy" "negative" "positive" "sadness"
## [9] "surprise" "trust"
n_kent <- sum(kent) # for normalization purposes
n_lear <- sum(lear)
n_kent;n_lear
## [1] 874
## [1] 1712
print(barplot(lear))
[Figure: barplot of lear sentiment counts; the matrix printed below gives the bar midpoints returned by barplot().]
## [,1]
## [1,] 0.7
## [2,] 1.9
## [3,] 3.1
## [4,] 4.3
## [5,] 5.5
## [6,] 6.7
## [7,] 7.9
## [8,] 9.1
## [9,] 10.3
## [10,] 11.5
print(barplot(kent)) # looking for differences between these two
[Figure: barplot of kent sentiment counts.]
## [,1]
## [1,] 0.7
## [2,] 1.9
## [3,] 3.1
## [4,] 4.3
## [5,] 5.5
## [6,] 6.7
## [7,] 7.9
## [8,] 9.1
## [9,] 10.3
## [10,] 11.5
probs_lear <- lear/n_lear # these are the probabilities for Lear's sentiments
sum(probs_lear) # check normalization i.e. the sum is 1
## [1] 1
# the following lines (and similar below) are an example of
# a "bootstrap" method, which is essentially re-sampling
# with replacement; the objective is to randomly build a distribution
# like the kent distribution using re-sampling based on lear probabilities
boot_sentiments <- sample(sentiments,size=n_kent,
replace = TRUE,prob = probs_lear)%>%
table()%>%as.numeric()
boot_sentiments
## [1] 56 56 78 69 67 152 170 93 37 96
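# One caveat with table() %>% as.numeric(): if a sentiment happened never
# to be drawn, the table would silently have fewer than 10 entries (and
# the names are dropped). A safer sketch fixes the category set first:
boot_draw <- sample(sentiments, size = n_kent, replace = TRUE, prob = probs_lear)
table(factor(boot_draw, levels = sentiments)) %>% as.numeric() # always length 10, in the order of sentiments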
print(barplot(as.matrix(cbind(kent,boot_sentiments)),beside = TRUE))
[Figure: side-by-side barplot comparing kent counts with one bootstrap sample.]
## [,1] [,2]
## [1,] 1.5 12.5
## [2,] 2.5 13.5
## [3,] 3.5 14.5
## [4,] 4.5 15.5
## [5,] 5.5 16.5
## [6,] 6.5 17.5
## [7,] 7.5 18.5
## [8,] 8.5 19.5
## [9,] 9.5 20.5
## [10,] 10.5 21.5
# kent counts based on lear proportions/probabilities
exp_kent <- probs_lear*n_kent
print(plot(exp_kent,kent))
## NULL
abline(lm(kent~exp_kent)) # fit a best-fit line using lm(); the response must be the y-axis variable (kent)
[Figure: scatter plot of kent against exp_kent with the fitted line.]
# the chisq test tests for a significant
# difference between a set of categorical distributions,
# in this case kent and lear
kent;lear
## [1] 64 44 72 69 47 150 210 63 31 124
## [1] 138 115 134 162 138 306 307 168 80 164
observed_table <- cbind(kent,lear) # create table and rename the cols
colnames(observed_table) <- c('kent', 'lear')
observed_table
## kent lear
## [1,] 64 138
## [2,] 44 115
## [3,] 72 134
## [4,] 69 162
## [5,] 47 138
## [6,] 150 306
## [7,] 210 307
## [8,] 63 168
## [9,] 31 80
## [10,] 124 164
chisq.test(observed_table) # here it's the p-value to look for
##
## Pearson's Chi-squared test
##
## data: observed_table
## X-squared = 38.662, df = 9, p-value = 1.325e-05
# p-value = 1.325e-05; this very small p-value indicates
# strong evidence against kent and lear being essentially
# the same distribution apart from random variation
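# For intuition, the expected counts that the test compares against can
# be rebuilt by hand from the table margins (a sketch):
expected <- outer(rowSums(observed_table), colSums(observed_table)) / sum(observed_table)
round(expected, 1) # identical to chisq.test(observed_table)$expected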
# now I do the same for fool and lear which from the play one
# might think are more similar
fool <- nrc_labels%>%filter(player=="FOOL")%>%group_by(sentiment)%>%
count()%>%pull(n)
n_fool <- sum(fool)
n_fool;n_lear # note the fool has only about a quarter of lear's sample size
## [1] 404
## [1] 1712
print(barplot(lear))
[Figure: barplot of lear sentiment counts (repeated for comparison).]
## [,1]
## [1,] 0.7
## [2,] 1.9
## [3,] 3.1
## [4,] 4.3
## [5,] 5.5
## [6,] 6.7
## [7,] 7.9
## [8,] 9.1
## [9,] 10.3
## [10,] 11.5
print(barplot(fool))
[Figure: barplot of fool sentiment counts.]
## [,1]
## [1,] 0.7
## [2,] 1.9
## [3,] 3.1
## [4,] 4.3
## [5,] 5.5
## [6,] 6.7
## [7,] 7.9
## [8,] 9.1
## [9,] 10.3
## [10,] 11.5
# bootstrap as above; try repeating these bootstrap code segments
# to see if one can get a fool-like distribution from lear probabilities
# (a repeated-resampling sketch follows the barplot below)
boot_sentiments <- sample(sentiments,size=n_fool,
replace = TRUE,prob = probs_lear)%>%
table()%>%as.numeric()
print(barplot(as.matrix(cbind(fool,boot_sentiments)),beside = TRUE))
[Figure: side-by-side barplot comparing fool counts with one bootstrap sample.]
## [,1] [,2]
## [1,] 1.5 12.5
## [2,] 2.5 13.5
## [3,] 3.5 14.5
## [4,] 4.5 15.5
## [5,] 5.5 16.5
## [6,] 6.5 17.5
## [7,] 7.5 18.5
## [8,] 8.5 19.5
## [9,] 9.5 20.5
## [10,] 10.5 21.5
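# Rather than eyeballing single draws, one can repeat the resampling many
# times and build a reference distribution for a distance statistic
# (a sketch; B and the chi-square-style statistic are my own choices):
B <- 1000
expected_fool <- probs_lear * n_fool
boot_dist <- replicate(B, {
  draw <- sample(sentiments, size = n_fool, replace = TRUE, prob = probs_lear)
  counts <- table(factor(draw, levels = sentiments)) %>% as.numeric()
  sum((counts - expected_fool)^2 / expected_fool)
})
mean(boot_dist >= sum((fool - expected_fool)^2 / expected_fool)) # bootstrap p-value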
# a chisq test can tell us if there is a significant difference
observed_table <- cbind(fool,lear)
colnames(observed_table) <- c('fool', 'lear')
observed_table # this is the table to be tested for independence
## fool lear
## [1,] 20 138
## [2,] 28 115
## [3,] 41 134
## [4,] 24 162
## [5,] 33 138
## [6,] 79 306
## [7,] 84 307
## [8,] 28 168
## [9,] 24 80
## [10,] 43 164
chisq.test(observed_table) # p-value = 0.04369
##
## Pearson's Chi-squared test
##
## data: observed_table
## X-squared = 17.337, df = 9, p-value = 0.04369
# this p-value is only just significant at 5% and is
# not significant at 1% (0.044 > 0.01), so there is only weak
# evidence for a difference between the fool and lear,
# consistent with the suspicion from the play that they are similar
# the following attempts to create a metric or measure of similarity
# between players
lear_words <- King_Lear_tidy%>%filter(player=="LEAR")%>%count(word)
fool_words <- King_Lear_tidy%>%filter(player=="FOOL")%>%count(word)
head(lear_words);head(fool_words)
## word n
## 1 abated 1
## 2 abjure 1
## 3 abode 1
## 4 abuse 1
## 5 abused 1
## 6 accuser's 1
## word n
## 1 albion 1
## 2 alive 1
## 3 ant 1
## 4 apish 1
## 5 apple 1
## 6 arrant 1
sum_words_lear <- sum(lear_words$n)
sum_words_fool <- sum(fool_words$n)
# look at the words in common lear and fool
fool_common <- fool_words%>%inner_join(lear_words,by="word")
head(fool_common)
## word n.x n.y
## 1 art 3 18
## 2 banished 1 1
## 3 bear 1 5
## 4 bed 1 1
## 5 beg 1 2
## 6 beggars 1 1
sum_fool_common <- sum(fool_common[,2:3]) # adding the counts from both players
sum_words_lear;sum_words_fool;sum_fool_common
## [1] 2044
## [1] 591
## [1] 784
# define my own arbitrary metric
metric <- sum_fool_common/(sum_words_fool+sum_words_lear)
metric
## [1] 0.2975332
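# For comparison, a standard alternative is the Jaccard index on the two
# vocabularies (shared distinct words over all distinct words); a sketch:
length(intersect(fool_words$word, lear_words$word)) /
  length(union(fool_words$word, lear_words$word))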
# I want to put this into a general function to compare all players
player_match <- function(PA,PB){
PA_words <- King_Lear_tidy%>%filter(player==PA)%>%count(word)
PB_words <- King_Lear_tidy%>%filter(player==PB)%>%count(word)
sum_words_A <- sum(PA_words$n)
sum_words_B <- sum(PB_words$n)
PA_common <- PA_words%>%inner_join(PB_words,by="word")
sum_A_common <- sum(PA_common[,2:3])
metric <- sum_A_common/(sum_words_A+sum_words_B)
return(metric)}
player_match("FOOL","LEAR") # testing same as above
## [1] 0.2975332
player_match("KENT","LEAR")
## [1] 0.3873216
player_match("EDGAR","LEAR")
## [1] 0.3742796
main_players <- King_Lear_tidy %>%count(player,sort=TRUE)%>%filter(n>300)%>%
pull(player)
main_players # focus only on the main players, avoiding small sample sizes
## [1] "LEAR" "EDGAR" "KENT" "GLOUCESTER" "EDMUND"
## [6] "FOOL" "GONERIL" "REGAN" "ALBANY" "CORDELIA"
metrics <- as.data.frame(NULL,col.names=c("player1","player2","metric"))
metrics # just a blank data frame to collect results below
## data frame with 0 columns and 0 rows
for(i in 1:9){
for(j in (i+1):10){
metrics <- rbind(metrics,c(main_players[i],main_players[j],
player_match(main_players[i],main_players[j])))
}
}
names(metrics) <- c("player1","player2","metric")
metrics$metric <- as.numeric(metrics$metric ) # coerce to numeric
head(metrics)
## player1 player2 metric
## 1 LEAR EDGAR 0.3742796
## 2 LEAR KENT 0.3873216
## 3 LEAR GLOUCESTER 0.3969440
## 4 LEAR EDMUND 0.3388544
## 5 LEAR FOOL 0.2975332
## 6 LEAR GONERIL 0.2500000
metrics%>%arrange(metric) # so who is like whom and does it
## player1 player2 metric
## 1 FOOL ALBANY 0.1711900
## 2 FOOL CORDELIA 0.1742007
## 3 FOOL REGAN 0.1779497
## 4 EDMUND FOOL 0.2062284
## 5 EDGAR CORDELIA 0.2154238
## 6 EDGAR REGAN 0.2169811
## 7 EDMUND CORDELIA 0.2324786
## 8 FOOL GONERIL 0.2330623
## 9 GONERIL CORDELIA 0.2331731
## 10 LEAR CORDELIA 0.2343220
## 11 ALBANY CORDELIA 0.2357247
## 12 EDGAR GONERIL 0.2357264
## 13 LEAR REGAN 0.2360273
## 14 EDGAR ALBANY 0.2370370
## 15 GLOUCESTER FOOL 0.2446381
## 16 EDGAR FOOL 0.2472885
## 17 LEAR ALBANY 0.2488594
## 18 LEAR GONERIL 0.2500000
## 19 GONERIL ALBANY 0.2536806
## 20 KENT ALBANY 0.2544910
## 21 KENT CORDELIA 0.2560311
## 22 GLOUCESTER CORDELIA 0.2645850
## 23 REGAN ALBANY 0.2666667
## 24 REGAN CORDELIA 0.2687747
## 25 EDMUND ALBANY 0.2784603
## 26 GLOUCESTER GONERIL 0.2822865
## 27 KENT GONERIL 0.2835017
## 28 EDGAR EDMUND 0.2857143
## 29 KENT REGAN 0.2932011
## 30 KENT FOOL 0.2948718
## 31 LEAR FOOL 0.2975332
## 32 EDGAR KENT 0.2988299
## 33 GLOUCESTER ALBANY 0.3028391
## 34 EDMUND GONERIL 0.3029197
## 35 EDMUND REGAN 0.3068620
## 36 KENT EDMUND 0.3274822
## 37 GONERIL REGAN 0.3305527
## 38 LEAR EDMUND 0.3388544
## 39 GLOUCESTER REGAN 0.3444940
## 40 EDGAR GLOUCESTER 0.3579387
## 41 GLOUCESTER EDMUND 0.3709402
## 42 LEAR EDGAR 0.3742796
## 43 KENT GLOUCESTER 0.3743316
## 44 LEAR KENT 0.3873216
## 45 LEAR GLOUCESTER 0.3969440
# correspond to the players' roles in the play in some general way?
# finally see if lear is closest to the fool by this method
metrics%>%filter(player1=="LEAR"|player2=="LEAR")
## player1 player2 metric
## 1 LEAR EDGAR 0.3742796
## 2 LEAR KENT 0.3873216
## 3 LEAR GLOUCESTER 0.3969440
## 4 LEAR EDMUND 0.3388544
## 5 LEAR FOOL 0.2975332
## 6 LEAR GONERIL 0.2500000
## 7 LEAR REGAN 0.2360273
## 8 LEAR ALBANY 0.2488594
## 9 LEAR CORDELIA 0.2343220
# so the answer is: not really, and this measure has a limited range!
# However, this is fine for the assignment as we acknowledge the
# very limited nature of the analysis; limited in which ways?

