Durham University
Department of Computer Science
Master of Data Science
COMP42415
Text Mining and Language Analytics
Workshops
Author
Dr Stamos Katsigiannis
2021-22

Contents
1 Workshop 1: Text query with Regular Expressions 1
1.1 Regular Expressions Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 The “re” Python package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2.2 RegEx functions in “re” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2.3 Regular expression metacharacters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.4 Regular expression special sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.5 Repeating regular expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Using Regular Expressions to match strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.1 Matching example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.2 Matching to validate strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.3 Credit card number validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.4 Validation of United Kingdom’s National Insurance numbers (NINO) . . . . . . . . . . . 7
1.3.5 Validation of hexadecimal numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Using Regular Expressions to search elements in files . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4.1 Parsing XML files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4.2 Parsing HTML files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4.3 Parsing raw text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Using Regular Expressions to substitute strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5.1 Substitution example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5.2 Email domain substitution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2 Workshop 2: Text pre-processing 15
2.1 NLTK corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.1 Import NLTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.2 Corpus file IDs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.3 Corpus file categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.4 Corpus words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.5 Selecting files from specific category . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.6 Corpus sentences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.7 Accessing sentences from specific file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.8 Number of documents in each category . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.9 Corpus raw text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Input text from text file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Text pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.1 Sentence Tokenisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.2 Word Tokenisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.3 Lowercasing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.4 Stop words removal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.5 Punctuation removal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.6 Stemming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.7 Lemmatisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.8 Part of Speech (POS) tagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3 Workshop 3: Text representation 27
3.1 Load corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 Vocabulary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.1 Words in corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.2 Unique words in corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.3 Vocabulary of multiple documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 One-hot encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3.1 One-hot encoding of words in vocabulary . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3.2 One-hot encoding of text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 Term Frequency (TF) representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.1 Compute TF of words in text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.2 TF representation of documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.5 Term Frequency - Inverse Document Frequency (TF-IDF) . . . . . . . . . . . . . . . . . . . . . . 33
3.5.1 Document Frequency (DF) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5.2 Inverse Document Frequency (IDF) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5.3 Term Frequency - Inverse Document Frequency (TF-IDF) . . . . . . . . . . . . . . . . . . 34
3.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4 Workshop 4: N-Grams 37
4.1 N-grams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 Unigrams (1-Grams) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2.1 Compute unigrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2.2 Unigram probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.3 Sentence probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.4 Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3 Bigrams (2-Grams) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3.1 Compute bigrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3.2 Bigram probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3.3 Sentence probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3.4 Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.4 Trigrams (3-Grams) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.4.1 Compute trigrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.4.2 Trigram probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4.3 Sentence probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4.4 Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.5 The number underflow issue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5 Workshop 5: Word embeddings 49
5.1 Word context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.1.1 Load text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.1.2 Compute context words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.1.3 Other contexts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.2 Word-word co-occurrence matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2.1 Word-word co-occurrence matrix (Context size = 1) . . . . . . . . . . . . . . . . . . . . . 51
5.2.2 Word-word co-occurrence matrix (Context size = 2) . . . . . . . . . . . . . . . . . . . . . 51
5.2.3 Compute word-word co-occurrence matrix as numpy array . . . . . . . . . . . . . . . . . . 52
5.2.4 Word-word co-occurrence matrix visualisation . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.3 Word embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.3.1 Word embeddings computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.3.2 Word embeddings visualisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.3.3 Word embeddings distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6 Workshop 6: Document embeddings for machine learning 57
6.1 Loading data with pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.1.1 Load dataset file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.1.2 Available words in dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.1.3 Access specific row in pandas dataframe . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.1.4 Access specific columns in pandas dataframe . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.1.5 Convert dataframe to numpy array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.2 Word embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.2.1 Load word embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.2.2 Distance of word embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.3 Document embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.3.1 Compute document embedding (Mean word embedding) . . . . . . . . . . . . . . . . . . . 62
6.3.2 Cosine distance of document embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.4 Classification using document embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.4.1 Classification with the k Nearest Neighbour algorithm (kNN) . . . . . . . . . . . . . . . . 65
6.4.2 Classification with Linear Support Vector Machines (SVM) . . . . . . . . . . . . . . . . . 66
6.4.3 Classification using Feed-Forward Neural Networks . . . . . . . . . . . . . . . . . . . . . . 66
6.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
7 Workshop 7: Text Classification Using Traditional Classifiers 69
7.1 Introduction to Python classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.1.1 A simple Python class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.1.2 Definition of class methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.1.3 Class initialisation and method overloading . . . . . . . . . . . . . . . . . . . . . . . . . . 70
7.1.4 Definition of a custom class for text documents . . . . . . . . . . . . . . . . . . . . . . . . 71
7.2 Preparation of spam detection dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
7.2.1 Dataset loading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
7.2.2 Word tokenisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
7.2.3 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
7.2.4 Join email words list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
7.3 Text classification for spam detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
7.3.1 Splitting of dataset into training and test sets . . . . . . . . . . . . . . . . . . . . . . . . . 75
7.3.2 Text classification using Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
7.3.3 Computation and plotting of Naive Bayes’s classification performance . . . . . . . . . . . 76
7.3.4 Text classification using kNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
7.3.5 Computation and plotting of kNN’s classification performance . . . . . . . . . . . . . . . 77
7.4 Saving and loading a trained machine learning model . . . . . . . . . . . . . . . . . . . . . . . . . 78
7.4.1 Save trained model in a file for future use . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
7.4.2 Loading and use of saved trained model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
7.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
8 Workshop 8: Text classification using Recurrent Neural Networks (RNNs) 81
8.1 Dataset preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
8.1.1 Load fake news dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
8.1.2 Dataset pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
8.1.3 Create PyTorch dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
8.1.4 Divide dataset into training and test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
8.1.5 Create vocabulary using the training set . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
8.1.6 Create iterators for the training and test data . . . . . . . . . . . . . . . . . . . . . . . . . 84
8.2 Create LSTM architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
8.2.1 Define network architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
8.2.2 Define hyperparameters and initialise model . . . . . . . . . . . . . . . . . . . . . . . . . . 85
8.2.3 Define the optimiser, loss function and performance metric . . . . . . . . . . . . . . . . . 86
8.2.4 Define training function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
8.2.5 Define evaluation function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
8.3 Train LSTM model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
8.4 Classify text using trained model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
8.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
A Appendix: Test files 89
A.1 movies.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
A.2 links.html . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
A.3 emails.txt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
A.4 cds.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
A.5 alice.txt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
A.6 dune.txt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Workshop 1: Text query with Regular Expressions
1.1 Regular Expressions Definition
A regular expression (shortened as regex or regexp) is a sequence of characters that define a search pattern.
Usually such patterns are used by string-searching algorithms for “find” or “find and replace” operations on
strings, or for input validation.
1.2 The “re” Python package
Python has a built-in package called “re”, which can be used to work with Regular Expressions. Let’s import
this package and use a regular expression to check whether the sentence “I started studying this year at Durham
University” ends with the word “University” or with the word “school”.
1.2.1 Example
import re # Import the re package
txt = "I started studying this year at Durham University"
x1 = re.search("University$", txt) # Returns a Match object if there is a match anywhere in the string with the regex
x2 = re.search("school$", txt)
print("x1:",x1)
print("x2:",x2)
if(x1):
    print("The text ends with 'University'")
else:
    print("The text does not end with 'University'")
if(x2):
    print("The text ends with 'school'")
else:
    print("The text does not end with 'school'")
The output will look like:
x1: <_sre.SRE_Match object; span=(39, 49), match='University'>
x2: None
The text ends with 'University'
The text does not end with 'school'
1.2.2 RegEx functions in “re”
The re module offers a set of functions that allows us to search a string for a match:
Function Description
findall(args) Returns a list containing all matches
search(args) Returns a Match object if there is a match anywhere in the string
split(args) Returns a list where the string has been split at each match
sub(args) Replaces one or many matches with a string
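As a quick illustration of the four functions (the sentence below is just an example string, not one of the workshop files):
import re # Import the re package
txt = "The rain in Spain stays mainly in the plain"
print(re.findall("ai", txt)) # List of all matches: ['ai', 'ai', 'ai', 'ai']
print(re.search("ai", txt)) # Match object for the first "ai" (inside "rain")
print(re.split("\s", txt)) # List of the words, split at every white space
print(re.sub("\s", "-", txt)) # The text with every white space replaced by "-"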
1.2.3 Regular expression metacharacters
As you can see in our first example, we used the character “$” in order to indicate that a text matching the
regular expression should end with the string preceding the character “$”. For example, the regular expression
“car$” indicates that the text should end with the string “car”. In this case, “$” is considered a metacharacter,
i.e. a character with a special meaning. Below are the metacharacters supported by the “re” package:
Character Description Example
[ ] A set of characters “[a-f]”
\ Signals a special sequence (can also be used to escape special characters) “\s”
. Any character (except newline character) “uni..rsity”
^ Starts with “^She”
$ Ends with “John$”
* Zero or more occurrences “o*”
+ One or more occurrences “l+”
? Matches 0 or 1 repetitions of the preceding regex “ab?” will match either “a” or “ab”
{} Exactly the specified number of occurrences “o{2}”
| Either or “he|she”
() Capture and group
The first metacharacters we’ll look at are [ and ]. They’re used for specifying a character class, which is a set
of characters that you wish to match. Characters can be listed individually, or a range of characters can be
indicated by giving two characters and separating them by a “-”. For example, [abc] will match any of the
characters a, b, or c; this is the same as [a-c], which uses a range to express the same set of characters. If you
wanted to match only lowercase letters, your regex would be [a-z].
Metacharacters are not active inside classes. For example, [akm$] will match any of the characters “a”, “k”,
“m”, or “$”. “$” is usually a metacharacter, but inside a character class it is stripped of its special nature.
You can match the characters not listed within the class by complementing the set. This is indicated by
including a “^” as the first character of the class. For example, [^5] will match any character except “5”. If the
caret appears elsewhere in a character class, it does not have special meaning. For example: [5^] will match
either a “5” or a “^”.
Perhaps the most important metacharacter is the backslash, \. As in Python string literals, the backslash can be
followed by various characters to signal various special sequences. It is also used to escape all the metacharacters
so you can still match them in patterns. For example, if you need to match a [ or \, you can precede them with
a backslash to remove their special meaning: \[ or \\.
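As a small illustration (the strings here are made up for the example), escaping a metacharacter lets us match it literally:
import re
print(re.findall("\$[0-9]+", "Tickets cost $15 or $20")) # ['$15', '$20'] - the escaped "$" matches a literal dollar sign
print(re.findall("\[[a-z]+\]", "see [note] and [ref]")) # ['[note]', '[ref]'] - escaped brackets match literal "[" and "]"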
Some of the special sequences beginning with “\” represent predefined sets of characters that are often useful,
such as the set of digits, the set of letters, or the set of anything that isn’t a white space.
1.2.4 Regular expression special sequences
Let’s see some of the main regular expression special sequences. For a more detailed list, please refer to
https://docs.python.org/3/library/re.html#re-syntax.
These sequences can be included inside a character class. For example, [\s,.] is a character class that will match
any white space character, or “,” or “.”.
Sequence Description
\d Matches any decimal digit. Equivalent to the class [0-9].
\D Matches any non-digit character. Equivalent to the class [^0-9].
\s Matches any white space character. Equivalent to the class [ \t\n\r\f\v].
\S Matches any non-white space character. Equivalent to the class [^ \t\n\r\f\v].
\w Matches any alphanumeric character. Equivalent to the class [a-zA-Z0-9_].
\W Matches any non-alphanumeric character. Equivalent to the class [^a-zA-Z0-9_].
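For instance, a small example (the string is made up for illustration) of these sequences in action:
import re
txt = "Room 101, building B-2"
print(re.findall("\d", txt)) # ['1', '0', '1', '2'] - every single digit
print(re.findall("[\s,.]", txt)) # Every white space character, comma, or full stop
print(re.findall("\w+", txt)) # ['Room', '101', 'building', 'B', '2'] - runs of alphanumeric characters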
1.2.5 Repeating regular expressions
Being able to match varying sets of characters is the first thing regular expressions can do that isn’t already
possible with the methods available on strings. However, if that was the only additional capability of regexes,
they wouldn’t be much of an advance. Another capability is that you can specify that portions of the regular
expression must be repeated a certain number of times.
The first metacharacter for repeating things that we’ll look at is “*”. “*” does not match the literal character
“*”, but it specifies that the previous character can be matched zero or more times, instead of exactly once.
Example: “do*g” will match “dg” (zero “o” characters), “dog” (one “o” character), “doooog” (four “o”
characters), and so on.
Repetitions such as “*”,“+”, and “?” are greedy. When repeating a regular expression, the matching engine
will try to repeat it as many times as possible. If later portions of the pattern don’t match, the matching engine
will then back up and try again with fewer repetitions. If this behaviour is undesirable, you can add “?” after
the qualifier (“*?”,“+?”, “??”) to make it perform the match in non-greedy or minimal fashion, i.e. as few
characters as possible will be matched.
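A minimal illustration of the difference, using a made-up HTML-like string:
import re
txt = "<p>first</p><p>second</p>"
print(re.findall("<.*>", txt)) # Greedy: one match covering the whole string
print(re.findall("<.*?>", txt)) # Non-greedy: ['<p>', '</p>', '<p>', '</p>']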
Step-by-step example
Let’s consider the expression a[bcd]*b. This matches the letter “a”, zero or more letters from the class [bcd],
and finally ends with a “b”. Now imagine matching this regular expression against the string “abcbd”.
Step Matched Explanation
1 a The “a” in the regex matches.
2 abcbd The engine matches “[bcd]*”, going as far as it can, which is to the end of the string.
3 FAILED The engine tries to match “b”, but the current position is at the end of the string, so it fails.
4 abcb Back up, so that “[bcd]*” matches one less character.
5 FAILED Try “b” again, but the current position is at the last character, which is a “d”.
6 abc Back up again, so that “[bcd]*” is only matching “bc”.
7 abcb Try “b” again. This time the character at the current position is “b”, so it succeeds.
The end of the regular expression has now been reached, and it has matched “abcb”. This demonstrates how
the matching engine goes as far as it can at first, and if no match is found it will then progressively back up
and retry the rest of the regular expression again and again. It will back up until it has tried zero matches for
“[bcd]*”, and if that subsequently fails, the engine will conclude that the string does not match the regex at all.
Another repeating metacharacter is “+”, which matches one or more times. Pay careful attention to the
difference between “*” and “+”. “*” matches zero or more times, so whatever’s being repeated may not be
present at all, while “+” requires at least one occurrence.
Example: “do+g” will match “dog” (one “o” character), “dooog” (three “o” characters), and so on, but will
not match “dg” (zero “o” characters).
There are two more repeating qualifiers. The question mark character “?” matches either once or zero times.
Think of it as marking something as being optional.
Example: “pre-?processing” matches either “preprocessing” or “pre-processing”.
The most complicated repeated qualifier is “{m,n}”, where m and n are decimal integers. This qualifier means
there must be at least m repetitions, and at most n. For example, “a/{1,3}b” will match “a/b”, “a//b”, and
“a///b”, but it will not match “ab”, which has no slashes, or “a////b”, which has four slashes. You can omit
either m or n. In this case, default values for m or n are used. Omitting m is interpreted as a lower limit of 0,
while omitting n results in an upper bound of infinity.
Note: Some qualifiers are interchangeable. For example “{0,}” is the same as “*”, “{1,}” is the same as “+”,
and “{0,1}” is the same as “?”. “*”, “+”, and “?” make the regular expression easier to read, so try to use
them if possible.
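For example, a short sketch of the “{m,n}” qualifier and its “{0,}” form (the strings are illustrative only):
import re
txt = "ab a/b a//b a///b a////b"
print(re.findall("a/{1,3}b", txt)) # ['a/b', 'a//b', 'a///b'] - between one and three slashes
print(re.findall("a/{0,}b", txt)) # Same as "a/*b": zero or more slashes, so "ab" and "a////b" also match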
1.3 Using Regular Expressions to match strings
1.3.1 Matching example
Let’s use the text “I started studying this year at Durham University” again and find out whether the string
“at” or the string “in” exists in the text.
txt = "I started studying this year at Durham University"
x = re.search("at|in", txt)
print(x.string) # Returns the string passed into the function
print(x.span()) # Returns a tuple containing the start, and end positions of the match
print(x.group()) # Returns the part of the string where there was a match
print(txt[x.span()[0]:x.span()[1]]) # Print the content of the string at the positions of the match
The output will look like:
I started studying this year at Durham University
(15, 17)
in
in
As you can see, there was a match to our regex at the character with index 15 (counting starts from 0), ending
at the character with index 17. Indeed, the string “in” was found within the word “studying”.
However, if you read the input text, there should have been a second match for the word “at” but only the first
match was returned. Note that if there is more than one match, only the first occurrence of the match will be
returned by the search() function! We can use the findall() function to get a list of all matches in the order
they are found.
x = re.findall("at|in", txt)
for match in x:
print(match)
The output will look like:
in
at
As expected, the findall() function returned two matches, “in” and “at”.
Consider the string “stp stop stoop stoooooop stooooooooooop”. Let’s find matches where the character “o”
appears one or more times and matches where the character “o” appears two times only.
x = re.findall("o+", "stp stop stoop stoooooop stooooooooooop")
print("One or more occurrences of 'o':")
for match in x:
print(match)
x = re.findall("o{2}", "stp stop stoop stoooooop stooooooooooop")
print("\nTwo occurrences of 'o':")
for match in x:
print(match)
The output will look like:
One or more occurrences of 'o':
o
oo
oooooo
ooooooooooo
Two occurrences of 'o':
oo
oo
oo
oo
oo
oo
oo
oo
oo
Note that when looking for two occurrences of “o”, the findall() function returned 9 matches that correspond
to the 9 non-overlapping matches from the start (left) to the end (right) of the input text: “stp stop st(oo)p
st(oo)(oo)(oo)p st(oo)(oo)(oo)(oo)(oo)op”
1.3.2 Matching to validate strings
Let’s use regular expressions to check the validity of various strings. Consider an identification number that
should consist of exactly 10 digits (from 0 to 9), and the strings: “0123456789”, “12345”, “0000a00005”,
“+000001111”, “00000011115”, “2030405060”. How can we check whether these strings are valid identification
numbers?
text = list()
text.append("0123456789")
text.append("12345")
text.append("0000a00005")
text.append("+000001111")
text.append("00000011115")
text.append("2030405060")
regex = "[0-9]{10}"
result = list()
for t in text:
    x = re.match(regex, t)
    if(x != None):
        print(t,"->",x.group())
    else:
        print(t,"-> No match")
    result.append(x)
for i in range(len(text)):
    if(result[i]!=None and result[i].group()==text[i]):
        print(text[i],"is a valid identification number")
    else:
        print(text[i],"is NOT a valid identification number")
The output will look like:
0123456789 -> 0123456789
12345 -> No match
0000a00005 -> No match
+000001111 -> No match
00000011115 -> 0000001111
2030405060 -> 2030405060
0123456789 is a valid identification number
12345 is NOT a valid identification number
0000a00005 is NOT a valid identification number
+000001111 is NOT a valid identification number
00000011115 is NOT a valid identification number
2030405060 is a valid identification number
Notice that string “00000011115” consists of 11 numerical digits, thus the regular expression matches the subset
“0000001111”. However, this is not a valid identification number according to the specification above. When
validating input, remember to check whether the matched string is equal to the query string.
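As an alternative to comparing the matched string with the original string, Python's re module also offers the fullmatch() function, which only returns a Match object when the whole string matches the pattern. A brief sketch, reusing the text list from above:
for t in text:
    if re.fullmatch("[0-9]{10}", t): # fullmatch() requires the entire string to match the regex
        print(t, "is a valid identification number")
    else:
        print(t, "is NOT a valid identification number")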
1.3.3 Credit card number validation
Let’s try to validate whether the following strings are valid VISA or Mastercard credit card numbers: “1000000000000”,
“4000000000000”, “5000000000000”, “5000000000000000”, “50000a0000000c000”, “40123456789”. VISA credit
card numbers should start with a 4 and have 13 or 16 digits. Mastercard credit card numbers start with a 5
and have 16 digits.
text = list()
text.append("1000000000000") # 13 digits - Not valid
text.append("4000000000000") # 13 digits - Valid VISA
text.append("5000000000000") # 13 digits - Not valid
text.append("5000000000000000") # 16 digits - Valid Mastercard
text.append("50000a0000000c000") # Not valid, contains letters
text.append("40123456789") # 11 digits - Not valid
regex = "(5[0-9]{15})|(4([0-9]{12}|[0-9]{15}))" # Number 5 followed by 15 digits OR number 4 followed
by either 12 or 15 digits
result = list()
for t in text:
x = re.match(regex, t)
if(x != None):
print(t,"->",x.group())
else:
print(t,"-> No match")
result.append(x)
print("")
for i in range(len(text)):
if(result[i]!=None and result[i].group()==text[i]):
print(text[i],"is a valid VISA or Mastercard number")
else:
print(text[i],"is NOT a valid VISA or Mastercard number")
The output will look like:
1000000000000 -> No match
4000000000000 -> 4000000000000
5000000000000 -> No match
5000000000000000 -> 5000000000000000
50000a0000000c000 -> No match
40123456789 -> No match
1000000000000 is NOT a valid VISA or Mastercard number
4000000000000 is a valid VISA or Mastercard number
5000000000000 is NOT a valid VISA or Mastercard number
5000000000000000 is a valid VISA or Mastercard number
50000a0000000c000 is NOT a valid VISA or Mastercard number
40123456789 is NOT a valid VISA or Mastercard number
Let’s analyse the regex that we used. Both VISA and Mastercard numbers start with a specific digit but
Mastercard has exactly 16 digits in total, while VISA can have either 13 or 16 digits. Let’s first create a regex
for each case separately. For Mastercard, it should be “5[0-9]{15}”, the digit 5 followed by exactly 15 digits
(0-9), for a total of 16 digits. For VISA, it should be “4([0-9]{12}|[0-9]{15})”, the digit 4 followed by either
exactly 12 digits, for a total of 13 digits, or exactly 15 digits, for a total of 16 digits. Then, to include both the
VISA and the Mastercard cases in our final regex, we can enclose each regex within parentheses and combine
them with the OR (“|”) operator.
Note that the expression [0-9] could be switched to [\d]:
regex = "(5[\d]{15})|(4([\d]{12}|[\d]{15}))" # Number 5 followed by 15 digits OR number 4 followed by either 12 or 15 digits
result = list()
for t in text:
    x = re.match(regex, t)
    if(x != None):
        print(t,"->",x.group())
    else:
        print(t,"-> No match")
    result.append(x)
print("")
for i in range(len(text)):
    if(result[i]!=None and result[i].group()==text[i]):
        print(text[i],"is a valid VISA or Mastercard number")
    else:
        print(text[i],"is NOT a valid VISA or Mastercard number")
The output will look like:
1000000000000 -> No match
4000000000000 -> 4000000000000
5000000000000 -> No match
5000000000000000 -> 5000000000000000
50000a0000000c000 -> No match
40123456789 -> No match
1000000000000 is NOT a valid VISA or Mastercard number
4000000000000 is a valid VISA or Mastercard number
5000000000000 is NOT a valid VISA or Mastercard number
5000000000000000 is a valid VISA or Mastercard number
50000a0000000c000 is NOT a valid VISA or Mastercard number
40123456789 is NOT a valid VISA or Mastercard number
1.3.4 Validation of United Kingdom’s National Insurance numbers (NINO)
According to the rules for validating UK national insurance numbers [1], a NINO is made up of 2 letters, 6
numbers and a final letter, which is always A, B, C, or D. It looks something like this: QQ 12 34 56 A. The
characters D, F, I, Q, U, and V are not used as either the first or second letter of a NINO prefix. The letter O
is not used as the second letter of a prefix.
Let’s create the required regex step-by-step and validate the following strings: “AA 123456 B”, “AO 123456 B”,
“AQ123456 B”, “QQ 123456 B”, “AA 123456 X”, “AA 12 34 56 B”, “AA123456B”, “AA12345B”, “A 123456
B”, “A 123456 B”. We must also take white spaces into consideration. Let’s consider the following two ways
of writing a NINO: AA123456A and AA 123456 A.
1. The first letter should be any of A, B, C, E, G, H, J, K, L, M, N, O, P, R, S, T, W, X, Y:
(A|B|C|E|G|H|J|K|L|M|N|O|P|R|S|T|W|X|Y)
2. The second letter should be any of A, B, C, E, G, H, J, K, L, M, N, P, R, S, T, W, X, Y:
(A|B|C|E|G|H|J|K|L|M|N|O|P|R|S|T|W|X|Y)(A|B|C|E|G|H|J|K|L|M|N|P|R|S|T|W|X|Y)
3. The third letter can optionally be a white space character:
(A|B|C|E|G|H|J|K|L|M|N|O|P|R|S|T|W|X|Y)(A|B|C|E|G|H|J|K|L|M|N|P|R|S|T|W|X|Y)[\s]?
4. Then, exactly 6 digits (0-9) are required:
(A|B|C|E|G|H|J|K|L|M|N|O|P|R|S|T|W|X|Y)(A|B|C|E|G|H|J|K|L|M|N|P|R|S|T|W|X|Y)[\s]?[0-9]{6}
5. The next letter can optionally be a white space character:
(A|B|C|E|G|H|J|K|L|M|N|O|P|R|S|T|W|X|Y)(A|B|C|E|G|H|J|K|L|M|N|P|R|S|T|W|X|Y)[\s]?[0-9]{6}[\s]?
[1] https://www.gov.uk/hmrc-internal-manuals/national-insurance-manual/nim39110
6. The final letter must be one of A, B, C, or D:
(A|B|C|E|G|H|J|K|L|M|N|O|P|R|S|T|W|X|Y)(A|B|C|E|G|H|J|K|L|M|N|P|R|S|T|W|X|Y)[\s]?[0-9]{6}[\s]?(A|B|C|D)
text = list()
text.append("AA 123456 B") # Valid
text.append("AO 123456 B") # Not valid
text.append("AQ123456 B") # Not valid
text.append("QQ 123456 B") # Not valid
text.append("AA 123456 X") # Not valid
text.append("AA 12 34 56 B") # Not valid
text.append("AA123456B") # Valid
text.append("AA12345B") # Not valid
text.append("A 123456 B") # Not valid
text.append("A 123456 B") # Not valid
regex = "(A|B|C|E|G|H|J|K|L|M|N|O|P|R|S|T|W|X|Y)(A|B|C|E|G|H|J|K|L|M|N|P|R|S|T|W|X|Y)[\s]?[0-9]{6}[\s]?(A|B|C|D)"
result = list()
for t in text:
    x = re.match(regex, t)
    if(x != None):
        print(t,"->",x.group())
    else:
        print(t,"-> No match")
    result.append(x)
print("")
for i in range(len(text)):
    if(result[i]!=None and result[i].group()==text[i]):
        print("VALID\t",text[i])
    else:
        print("-----\t",text[i])
The output will look like:
AA 123456 B -> AA 123456 B
AO 123456 B -> No match
AQ123456 B -> No match
QQ 123456 B -> No match
AA 123456 X -> No match
AA 12 34 56 B -> No match
AA123456B -> AA123456B
AA12345B -> No match
A 123456 B -> No match
A 123456 B -> No match
VALID AA 123456 B
----- AO 123456 B
----- AQ123456 B
----- QQ 123456 B
----- AA 123456 X
----- AA 12 34 56 B
VALID AA123456B
----- AA12345B
----- A 123456 B
----- A 123456 B
1.3.5 Validation of hexadecimal numbers
Let’s use regular expressions to check whether a string corresponds to a hexadecimal number. Consider the
strings “xAF1400BD”, “1299ab32”, “xFF00FF5R”, “0xaa00bb”. How can we check if these strings are represen-
tations of hexadecimal numbers? Remember that valid hexadecimal digits are [0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F,
a,b,c,d,e,f] and that in computers, hexadecimal numbers may be denoted with an “x” or “0x” (either lowercase
or uppercase) in the beginning. For example, the hexadecimal number FFF can also be written as fff, 0xfff,
0XFFF, xfff, XFFF and can also have mixed lowercase and uppercase letters.
text = list()
text.append("xAF1400BD") # Valid
text.append("1299ab32") # Valid
text.append("xFF00FF5R") # Not valid - Character R is not a valid headecimal digit
text.append("0xaa00bb") # Valid
text.append("xaa00bb4657AB000922334bce111A") # Valid
text.append("0xfff") # Valid
text.append("xFfF") # Valid
text.append("AAA") # Valid
text.append("ALA") # Not valid - Character L is not a valid headecimal digit
regex = "(0x|0X|x|X)?[0-9a-fA-F]+" # One optional occurence of 0x, 0X, x, or X, followed by at least
one digit from 0 to 9, or lowercase letter from a to f, or uppercase letter from A to F
result = list()
for t in text:
x = re.match(regex, t)
if(x != None):
print(t,"->",x.group())
else:
print(t,"-> No match")
result.append(x)
print("")
for i in range(len(text)):
if(result[i]!=None and result[i].group()==text[i]):
print("VALID HEX\t",text[i])
else:
print("---------\t",text[i])
The output will look like:
xAF1400BD -> xAF1400BD
1299ab32 -> 1299ab32
xFF00FF5R -> xFF00FF5
0xaa00bb -> 0xaa00bb
xaa00bb4657AB000922334bce111A -> xaa00bb4657AB000922334bce111A
0xfff -> 0xfff
xFfF -> xFfF
AAA -> AAA
ALA -> A
VALID HEX xAF1400BD
VALID HEX 1299ab32
--------- xFF00FF5R
VALID HEX 0xaa00bb
VALID HEX xaa00bb4657AB000922334bce111A
VALID HEX 0xfff
VALID HEX xFfF
VALID HEX AAA
--------- ALA
Note that in the case of the “ALA” string, the regex found the match “A”, which is a valid hexadecimal number,
but the full string “ALA” is not a valid hexadecimal number. When validating input, remember to check that
the matched string is equal to the query string.
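The same effect can also be achieved by anchoring the pattern with the “$” metacharacter (or by using fullmatch(), as noted earlier), so that a partial match such as the “A” in “ALA” is rejected. A short sketch, reusing the text list from above:
regex = "(0x|0X|x|X)?[0-9a-fA-F]+$" # Same pattern as before, but the match must now reach the end of the string
for t in text:
    if re.match(regex, t):
        print("VALID HEX\t", t)
    else:
        print("---------\t", t)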
1.4 Using Regular Expressions to search elements in files
1.4.1 Parsing XML files
Let’s use regular expressions to parse an XML file. We will use the movies.xml file which contains the titles and
release dates of 5 movies. We will use a regular expression to retrieve all the movie titles from the file. First,
copy the file ”movies.xml” to your current working directory or retrieve the absolute path of the file. Then
open the file, load its contents into a variable and close the file.
f = open("movies.xml", "r") # Opens the file for reading only ("r")
text = f.read() # Store the contents of the file in variable "text". read() returns all the contents
of the file
f.close() # Close the file
print(text) # Print the contents of variable "text"
The output will look like:
<movies>
<movie>
<title>And Now for Something Completely Different</title>
<year>1971</year>
</movie>
<movie>
<title>Monty Python and the Holy Grail</title>
<year>1974</year>
</movie>
<movie>
<title>Monty Python's Life of Brian</title>
<year>1979</year>
</movie>
<movie>
<title>Monty Python Live at the Hollywood Bowl</title>
<year>1982</year>
</movie>
<movie>
<title>Monty Python's The Meaning of Life</title>
<year>1983</year>
</movie>
</movies>
As you can see above, all movie titles in the movies.xml XML file are enclosed within the <title> and </title>
tags. Let's use a regular expression to match all the strings that are enclosed within these tags.
regex = "" #
x = re.findall(regex, text) # Find all matches of the regex
print(x,"\n")
print("Movie titles from movies.xml:")
for title in x:
title = title.replace("","") # Remove by replacing it with empty string
print(title)
The output will look like:
['<title>And Now for Something Completely Different</title>', '<title>Monty Python and the Holy Grail</title>', "<title>Monty Python's Life of Brian</title>", '<title>Monty Python Live at the Hollywood Bowl</title>', "<title>Monty Python's The Meaning of Life</title>"]
Movie titles from movies.xml:
And Now for Something Completely Different
Monty Python and the Holy Grail
Monty Python's Life of Brian
Monty Python Live at the Hollywood Bowl
Monty Python's The Meaning of Life
1.4.2 Parsing HTML files
Let’s read the links.html HTML file and use regular expressions to find all links within the file. Remember
that in HTML, links are denoted using the <a> and </a> tags and the link URL is provided using the “href”
attribute within the <a> tag. For example, a link for the main website of Durham University would be:
<a href="https://www.durham.ac.uk/">Durham University</a>
f = open("links.html", "r") # Opens the file for reading only ("r")
text = f.read() # Store the contents of the file in variable "text". read() returns all the contents
of the file
f.close() # Close the file
print(text,"\n") # Print the contents of variable "text"
regex = 'x = re.findall(regex, text) # Find all matches of the regex
print(x,"\n")
print("Links from links.xml:")
for link in x:
link = link.replace('
print(link)
The output will look like:
<html>
<body>
Test page for Text Mining and Language Analytics
<br/>
<a href="https://www.durham.ac.uk/departments/academic/computer-science/">Link 1</a>
<br/>
<a href="https://www.gov.uk/government/organisations/hm-revenue-customs">Link 2</a>
<br/>
<a href="https://www.gov.uk/">Link 3</a>
<br/>
<a href="https://www.dur.ac.uk/">Link 4</a>
<br/>
<a href="https://www.nhs.uk/">Link 5</a>
</body>
</html>
['href="https://www.durham.ac.uk/departments/academic/computer-science/"', 'href="https://www.gov.uk/government/organisations/hm-revenue-customs"', 'href="https://www.gov.uk/"', 'href="https://www.dur.ac.uk/"', 'href="https://www.nhs.uk/"']
Links from links.html:
https://www.durham.ac.uk/departments/academic/computer-science/
https://www.gov.uk/government/organisations/hm-revenue-customs
https://www.gov.uk/
https://www.dur.ac.uk/
https://www.nhs.uk/
Note that we used single quotes to denote strings that included a double quote character.
As you can see above, we successfully retrieved the URLs from links.html. Nevertheless, please note that using
regular expressions is not the best approach for parsing HTML files due to the flexibility of HTML syntax.
Solutions like XPath (https://developer.mozilla.org/en-US/docs/Web/API/XPathExpression) are more
suitable for HTML parsing.
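For comparison, here is a minimal sketch (not part of the original workshop code, and the class name LinkExtractor is just illustrative) of how the same links could be collected with Python's built-in html.parser module, which works on the structure of the HTML tags rather than on raw text:
from html.parser import HTMLParser # Python's built-in HTML parser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = [] # Collected "href" values

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute, value) tuples for the current start tag
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

f = open("links.html", "r")
parser = LinkExtractor()
parser.feed(f.read()) # Parse the whole file
f.close()
print(parser.links)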
1.4.3 Parsing raw text
The file emails.txt contains a list of emails from various domains. Let’s use regular expressions to find the
emails from the durham.ac.uk domain.
f = open("emails.txt", "r") # Opens the file for reading only ("r")
text = f.read() # Store the contents of the file in variable "text". read() returns all the contents
of the file
f.close() # Close the file
print(text,"\n") # Print the contents of variable "text"
regex = "[0-9a-zA-z!#$%&'*+-/=?^_`{|}~.]+@durham.ac.uk"
x = re.findall(regex, text) # Find all matches of the regex
print(x,"\n")
print("Emails from durham.ac.uk:")
for email in x:
print(email)
The output will look like:
john+acme.co@hotmail.com
bob@gmail.com
tom@durham.ac.uk
jerry@durham.ac.uk
scrooge@durham.ac.uk
donald@yahoo.co.uk
huey@yahoo.co.uk
dewey@gmail.com
louie.duck@durham.ac.uk
gyro.gearloose@yahoo.co.uk
bart@yahoo.co.uk
homer@gmail.com
stan@hotmail.com
kyle-broflovski@durham.ac.uk
eric@yahoo.co.uk
kenny@gmail.com
butters@durham.ac.uk
wendy@hotmail.com
randy_marsh@durham.ac.uk
chef@gmail.com
['tom@durham.ac.uk', 'jerry@durham.ac.uk', 'scrooge@durham.ac.uk', 'louie.duck@durham.ac.uk',
'kyle-broflovski@durham.ac.uk', 'butters@durham.ac.uk', 'randy_marsh@durham.ac.uk']
Emails from durham.ac.uk:
tom@durham.ac.uk
jerry@durham.ac.uk
scrooge@durham.ac.uk
louie.duck@durham.ac.uk
kyle-broflovski@durham.ac.uk
butters@durham.ac.uk
randy_marsh@durham.ac.uk
As you can see above, we retrieved all emails from the durham.ac.uk domain.
1.5 Using Regular Expressions to substitute strings
1.5.1 Substitution example
Let’s now replace any string that matches the “at|in” regex in the text “I started studying this year at Durham
University” with the string “FOO”. To achieve this, we are going to use the sub() function.
txt = "I started studying this year at Durham University"
x = re.sub("at|in","FOO", txt)
print(x)
The output will look like:
I started studyFOOg this year FOO Durham University
As expected, two matches of the regex were converted to “FOO”. What if we wanted only the first match to
be substituted with “FOO”? We can add an additional argument in the sub() function indicating the number
of substitutions we would like to make.
x = re.sub("at|in","FOO", txt,1)
print(x)
The output will look like:
I started studyFOOg this year at Durham University
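The sub() function can also use capture groups in the replacement string. A small sketch (purely illustrative, reusing the txt variable from above) that wraps every match in square brackets using the backreference "\1":
x = re.sub("(at|in)", r"[\1]", txt) # \1 refers to whatever the first group matched
print(x) # I started study[in]g this year [at] Durham University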
1.5.2 Email domain substitution
Let’s load again emails.txt and change the domain to “new.ac.uk” for all emails in the “durham.ac.uk”,
“gmail.com”, and “yahoo.co.uk” domains.
f = open("emails.txt", "r") # Opens the file for reading only ("r")
text = f.read() # Store the contents of the file in variable "text". read() returns all the contents of the file
f.close() # Close the file
print("OLD EMAILS:")
print(text,"\n") # Print the contents of variable "text"
regex = "@durham.ac.uk|@gmail.com|@yahoo.co.uk"
print("NEW EMAILS:")
x = re.sub(regex,"@new.ac.uk", text)
print(x)
The output will look like:
OLD EMAILS:
john+acme.co@hotmail.com
bob@gmail.com
tom@durham.ac.uk
jerry@durham.ac.uk
scrooge@durham.ac.uk
donald@yahoo.co.uk
huey@yahoo.co.uk
dewey@gmail.com
louie.duck@durham.ac.uk
gyro.gearloose@yahoo.co.uk
bart@yahoo.co.uk
homer@gmail.com
stan@hotmail.com
kyle-broflovski@durham.ac.uk
eric@yahoo.co.uk
kenny@gmail.com
butters@durham.ac.uk
wendy@hotmail.com
randy_marsh@durham.ac.uk
chef@gmail.com
NEW EMAILS:
john+acme.co@hotmail.com
bob@new.ac.uk
tom@new.ac.uk
jerry@new.ac.uk
scrooge@new.ac.uk
donald@new.ac.uk
huey@new.ac.uk
dewey@new.ac.uk
louie.duck@new.ac.uk
gyro.gearloose@new.ac.uk
bart@new.ac.uk
homer@new.ac.uk
stan@hotmail.com
kyle-broflovski@new.ac.uk
eric@new.ac.uk
kenny@new.ac.uk
butters@new.ac.uk
wendy@hotmail.com
randy_marsh@new.ac.uk
chef@new.ac.uk
1.6 Exercises
Exercise 1.1 Validate the following identification numbers: “500011110000” (valid), “5000 1111 0000” (valid),
“500001110000” (valid), “5000 0111 0000” (valid), “500021110000” (not valid), “5000 2111 0000”
(not valid), “400001110000” (not valid), “4000 0111 0000” (not valid), “500011110009” (not
valid), “5000 1111 0009” (not valid). A valid number should be 12 digits long, the first digit
should always be 5, the 5th digit should be 0 or 1, and the last digit cannot be 8 or 9. The
numbers can also be grouped in groups of four digits with a white space between groups.
Exercise 1.2 Redo the activity from Section 1.3.4 in order to also support the following writing of NINO
strings: “AA 12 34 56 A”
Exercise 1.3 Use regular expressions to print a list of all the album titles, a list of all the artists, and the
average price of all CDs in the cds.xml file.
Exercise 1.4 Create a regular expression that matches all the strings in the first column but none of those in
the second column
affgfking    fgok
rafgkahe     a fgk
bafghk       affgm
baffgkit     afffhk
affgfking    fgok
rafgkahe     afg.K
bafghk       aff gm
baffgkit     afffhgk
Exercise 1.5 Use a regular expression to substitute the quantities of the bought items with “XX” in the
following sentence: “Yesterday we bought 120 packs of A4 paper, 5 bottles of ink, 10 boxes of
paperclips, 200 notebooks, and 5.35 litres of fuel. The total cost for order #1290 was £1000.43.”.
Note that the order number and cost must not be substituted. The resulting sentence must be:
“Yesterday we bought XX packs of A4 paper, XX bottles of ink, XX boxes of paperclips, XX
notebooks, and XX litres of fuel. The total cost for order #1290 was £1000.43.”
Workshop 2: Text pre-processing
2.1 NLTK corpora
The NLTK python package comes pre-loaded with a lot of corpora that can be used to experiment with text
pre-processing.
2.1.1 Import NLTK
Let’s load one of the corpora available in NLTK, called movie_reviews. The movie_reviews corpus contains
movie review documents annotated in relation to their sentiment.
import nltk # Import the NLTK library
nltk.download('movie_reviews') # Download movie_reviews from NLTK
from nltk.corpus import movie_reviews # Import the movie_reviews corpus from NLTK
2.1.2 Corpus file IDs
Let’s check the file IDs in the movie_reviews corpus:
movie_reviews.fileids() # List file-ids in the corpus
The output will look like:
['neg/cv000_29416.txt', 'neg/cv001_19502.txt', 'neg/cv002_17424.txt', 'neg/cv003_12683.txt',
'neg/cv004_12641.txt', ...]
2.1.3 Corpus file categories
Let’s examine what classes (categories) are included in the corpus:
movie_reviews.categories() # List categories in the corpus
The output will look like:
['neg', 'pos']
This means that the movie_reviews corpus contains documents that have been annotated as belonging to two
distinct classes, more specifically the Negative class (neg) and the Positive class (pos), in relation to the sentiment
of the respective movie review.
2.1.4 Corpus words
Let’s see the list of words in the movie_reviews corpus and print their number:
print(movie_reviews.words())
length = len(movie_reviews.words())
print("Number of words in corpus: ", length)
The output will look like:
['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]
Number of words in corpus: 1583820
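As a small aside (a sketch that is not part of the original workshop text), NLTK's FreqDist class can be used to count how often each word appears in the corpus:
from nltk import FreqDist # Frequency distribution of the words
fdist = FreqDist(movie_reviews.words()) # Count the occurrences of every word in the corpus
print(fdist.most_common(10)) # The 10 most frequent words (typically punctuation and stop words)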
2.1.5 Selecting files from specific category
Use the following in order to view the file IDs for files belonging to a specific category. Note that you can
provide a list with categories, e.g. ['neg','pos','other']
movie_reviews.fileids(['neg']) # List file ids with 'neg' category
movie_reviews.fileids(['pos']) # List file ids with 'pos' category
The output will look like:
['neg/cv000_29416.txt', 'neg/cv001_19502.txt', 'neg/cv002_17424.txt', 'neg/cv003_12683.txt',
'neg/cv004_12641.txt', 'neg/cv005_29357.txt', 'neg/cv006_17022.txt', 'neg/cv007_4992.txt', ...]
['pos/cv000_29590.txt', 'pos/cv001_18431.txt', 'pos/cv002_15918.txt', 'pos/cv003_11664.txt',
'pos/cv004_11636.txt', 'pos/cv005_29443.txt', 'pos/cv006_15448.txt', 'pos/cv007_4968.txt', ...]
2.1.6 Corpus sentences
To access the tokenised sentences included in the corpus:
nltk.download('punkt') # Download the Punkt sentence tokenizer from NLTK
movie_reviews.sents()
The output will look like:
[['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and',
'then', 'drive', '.'], ['they', 'get', 'into', 'an', 'accident', '.'], ...]
2.1.7 Accessing sentences from specific file
We can use the “fileids” argument in order to access the sentences from a specific file in the corpus:
movie_reviews.sents(fileids='pos/cv004_11636.txt')
The output will look like:
[['moviemaking', 'is', 'a', 'lot', 'like', 'being', 'the', 'general', 'manager', 'of', 'an', 'nfl',
'team', 'in', 'the', 'post', '-', 'salary', 'cap', 'era', '--', 'you', "'", 've', 'got', 'to',
'know', 'how', 'to', 'allocate', 'your', 'resources', '.'], ['every', 'dollar', 'spent', 'on',
'a', 'free', '-', 'agent', 'defensive', 'tackle', 'is', 'one', 'less', 'dollar', 'than', 'you',
'can', 'spend', 'on', 'linebackers', 'or', 'safeties', 'or', 'centers', '.'], ...]
2.1.8 Number of documents in each category
Let’s see how many documents (files) the movie_reviews corpus contains, and then how many documents
of the “neg” category and how many of the “pos” category it contains:
documents = len(movie_reviews.fileids())
documents_neg = len(movie_reviews.fileids(['neg']))
documents_pos = len(movie_reviews.fileids(['pos']))
print("Number of documents: ", documents)
print("Number of documents in neg category: ", documents_neg)
print("Number of documents in pos category: ", documents_pos)
The output will look like:
Number of documents: 2000
Number of documents in neg category: 1000
Number of documents in pos category: 1000
As you can see here, the corpus is perfectly balanced between negative and positive sentiment reviews, having
1000 documents in each category.
2.1.9 Corpus raw text
To access the raw text of a specific file, use the “.raw()” function:
rawtext = movie_reviews.raw('neg/cv002_17424.txt').strip()[:500] # strip() removes blank spaces in the beginning and end of a string. [:500] is used to only retrieve the first 500 characters of the string
print(rawtext)
The output will look like:
it is movies like these that make a jaded movie viewer thankful for the invention of the timex
indiglo watch .
based on the late 1960's television show by the same name , the mod squad tells the tale of three
reformed criminals under the employ of the police to go undercover .
however , things go wrong as evidence gets stolen and they are immediately under suspicion .
of course , the ads make it seem like so much more .
quick cuts , cool music , claire dane's nice hair and cute outfits , car
2.2 Input text from text file
NLTK corpora are very useful for testing NLP and text processing algorithms. But what if we wanted to load
text from a text file? Let’s try to load the text from the “alice.txt” file. First, copy the file “alice.txt” to your
current working directory or retrieve the absolute path of the file. Then open the file, load its contents into a
variable and close the file.
f = open("alice.txt", "r") # Opens the file for reading only ("r")
text = f.read() # Store the contents of the file in variable "text". read() returns all the contents of the file
f.close() # Close the file
print(text) # Print the contents of variable "text"
The output will look like:
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to
do: once or twice she had peeped into the book her sister was reading, but it had no pictures or
conversations in it, "and what is the use of a book," thought Alice "without pictures or
conversations?"
So she was considering in her own mind (as well as she could, for the hot day made her feel very
sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of
getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.
There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to
hear the Rabbit say to itself, "Oh dear! Oh dear! I shall be late!" (when she thought it over
afterwards, it occurred to her that she ought to have wondered at this, but at the time it all
seemed quite natural); but when the Rabbit actually took a watch out of its waistcoat-pocket, and
looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that
she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it,
and burning with curiosity, she ran across the field after it, and fortunately was just in time
to see it pop down a large rabbit-hole under the hedge.
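As a side note, the same file can also be read with a with statement, an equivalent idiom that closes the file automatically:
with open("alice.txt", "r") as f: # The file is closed automatically when the block ends
    text = f.read()
print(text)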
2.3 Text pre-processing
In the text output above, you can see that the text contains several sentences, a mix of lowercase and uppercase
letters, some parentheses, and some punctuation marks.
2.3.1 Sentence Tokenisation
Let’s tokenise the text, i.e. chop the text into pieces. In this case, the text is in the English language and
punctuation marks are used to separate sentences from each other and a blank space is used to separate words
from each other. For splitting a string into sentences, NLTK has the sent_tokenize() default tokeniser function.
Let’s divide the text from alice.txt into sentences:
from nltk import sent_tokenize # Import the sent_tokenize function from NLTK
sent_tokenize(text) # Tokenise "text" into sentences and print the output
The output will look like:
['Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing
to do: once or twice she had peeped into the book her sister was reading, but it had no pictures
or conversations in it, "and what is the use of a book," thought Alice "without pictures or
conversations?"', 'So she was considering in her own mind (as well as she could, for the hot day
made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be
worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink
eyes ran close by her.', 'There was nothing so very remarkable in that; nor did Alice think it so
very much out of the way to hear the Rabbit say to itself, "Oh dear!', 'Oh dear!', 'I shall be
late!"', '(when she thought it over afterwards, it occurred to her that she ought to have
wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took
a watch out of its waistcoat-pocket, and looked at it, and then hurried on, Alice started to her
feet, for it flashed across her mind that she had never before seen a rabbit with either a
waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the
field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the
hedge.']
2.3.2 Word Tokenisation
Now let’s divide the text from alice.txt into words using the word_tokenize() NLTK function:
from nltk import word_tokenize # Import the word_tokenize function from NLTK
word_tokenize(text) # Tokenise "text" into words and print the output
The output will look like:
['Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister',
'on', 'the', 'bank', ',', 'and', 'of', 'having', 'nothing', 'to', 'do', ':', 'once', 'or',
'twice', 'she', 'had', 'peeped', 'into', 'the', 'book', 'her', 'sister', 'was', 'reading', ',',
'but', 'it', 'had', 'no', 'pictures', 'or', 'conversations', 'in', 'it', ',', '``', 'and',
'what', 'is', 'the', 'use', 'of', 'a', 'book', ',', "''", 'thought', 'Alice', '``', 'without',
'pictures', 'or', 'conversations', '?', "''", 'So', 'she', 'was', 'considering', 'in', 'her',
'own', 'mind', '(', 'as', 'well', 'as', 'she', 'could', ',', 'for', 'the', 'hot', 'day', 'made',
'her', 'feel', 'very', 'sleepy', 'and', 'stupid', ')', ',', 'whether', 'the', 'pleasure', 'of',
'making', 'a', 'daisy-chain', 'would', 'be', 'worth', 'the', 'trouble', 'of', 'getting', 'up',
'and', 'picking', 'the', 'daisies', ',', 'when', 'suddenly', 'a', 'White', 'Rabbit', 'with',
'pink', 'eyes', 'ran', 'close', 'by', 'her', '.', 'There', 'was', 'nothing', 'so', 'very',
'remarkable', 'in', 'that', ';', 'nor', 'did', 'Alice', 'think', 'it', 'so', 'very', 'much',
'out', 'of', 'the', 'way', 'to', 'hear', 'the', 'Rabbit', 'say', 'to', 'itself', ',', '``', 'Oh',
'dear', '!', 'Oh', 'dear', '!', 'I', 'shall', 'be', 'late', '!', "''", '(', 'when', 'she',
'thought', 'it', 'over', 'afterwards', ',', 'it', 'occurred', 'to', 'her', 'that', 'she',
'ought', 'to', 'have', 'wondered', 'at', 'this', ',', 'but', 'at', 'the', 'time', 'it', 'all',
'seemed', 'quite', 'natural', ')', ';', 'but', 'when', 'the', 'Rabbit', 'actually', 'took', 'a',
'watch', 'out', 'of', 'its', 'waistcoat-pocket', ',', 'and', 'looked', 'at', 'it', ',', 'and',
'then', 'hurried', 'on', ',', 'Alice', 'started', 'to', 'her', 'feet', ',', 'for', 'it',
'flashed', 'across', 'her', 'mind', 'that', 'she', 'had', 'never', 'before', 'seen', 'a',
'rabbit', 'with', 'either', 'a', 'waistcoat-pocket', ',', 'or', 'a', 'watch', 'to', 'take',
'out', 'of', 'it', ',', 'and', 'burning', 'with', 'curiosity', ',', 'she', 'ran', 'across',
'the', 'field', 'after', 'it', ',', 'and', 'fortunately', 'was', 'just', 'in', 'time', 'to',
'see', 'it', 'pop', 'down', 'a', 'large', 'rabbit-hole', 'under', 'the', 'hedge', '.']
What if we wanted to tokenise each individual sentence of the text one by one?
for sentence in sent_tokenize(text): # Iterate the sentences in "text"
    print(word_tokenize(sentence)) # Print the word tokenisation results. Don't forget the four spaces indentation that Python expects!
The output will look like:
['Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister',
'on', 'the', 'bank', ',', 'and', 'of', 'having', 'nothing', 'to', 'do', ':', 'once', 'or',
'twice', 'she', 'had', 'peeped', 'into', 'the', 'book', 'her', 'sister', 'was', 'reading', ',',
'but', 'it', 'had', 'no', 'pictures', 'or', 'conversations', 'in', 'it', ',', '``', 'and',
'what', 'is', 'the', 'use', 'of', 'a', 'book', ',', "''", 'thought', 'Alice', '``', 'without',
'pictures', 'or', 'conversations', '?', "''"]
['So', 'she', 'was', 'considering', 'in', 'her', 'own', 'mind', '(', 'as', 'well', 'as', 'she',
'could', ',', 'for', 'the', 'hot', 'day', 'made', 'her', 'feel', 'very', 'sleepy', 'and',
'stupid', ')', ',', 'whether', 'the', 'pleasure', 'of', 'making', 'a', 'daisy-chain', 'would',
'be', 'worth', 'the', 'trouble', 'of', 'getting', 'up', 'and', 'picking', 'the', 'daisies', ',',
'when', 'suddenly', 'a', 'White', 'Rabbit', 'with', 'pink', 'eyes', 'ran', 'close', 'by', 'her',
'.']
['There', 'was', 'nothing', 'so', 'very', 'remarkable', 'in', 'that', ';', 'nor', 'did', 'Alice',
'think', 'it', 'so', 'very', 'much', 'out', 'of', 'the', 'way', 'to', 'hear', 'the', 'Rabbit',
'say', 'to', 'itself', ',', '``', 'Oh', 'dear', '!']
['Oh', 'dear', '!']
['I', 'shall', 'be', 'late', '!', "''"]
['(', 'when', 'she', 'thought', 'it', 'over', 'afterwards', ',', 'it', 'occurred', 'to', 'her',
'that', 'she', 'ought', 'to', 'have', 'wondered', 'at', 'this', ',', 'but', 'at', 'the', 'time',
'it', 'all', 'seemed', 'quite', 'natural', ')', ';', 'but', 'when', 'the', 'Rabbit', 'actually',
'took', 'a', 'watch', 'out', 'of', 'its', 'waistcoat-pocket', ',', 'and', 'looked', 'at', 'it',
',', 'and', 'then', 'hurried', 'on', ',', 'Alice', 'started', 'to', 'her', 'feet', ',', 'for',
'it', 'flashed', 'across', 'her', 'mind', 'that', 'she', 'had', 'never', 'before', 'seen', 'a',
'rabbit', 'with', 'either', 'a', 'waistcoat-pocket', ',', 'or', 'a', 'watch', 'to', 'take',
'out', 'of', 'it', ',', 'and', 'burning', 'with', 'curiosity', ',', 'she', 'ran', 'across',
'the', 'field', 'after', 'it', ',', 'and', 'fortunately', 'was', 'just', 'in', 'time', 'to',
'see', 'it', 'pop', 'down', 'a', 'large', 'rabbit-hole', 'under', 'the', 'hedge', '.']
2.3.3 Lowercasing
As you can see in the list of tokens above, some words have uppercase letters. Let’s convert all characters in
our tokens to lowercase. We can use the lower() function to convert all characters in a string to lowercase. For
example:
test_string = "uNIverSIty"
test_string_lowercase = test_string.lower()
print(test_string_lowercase)
The output will look like:
university
Let’s now convert all words in the text from alice.txt to lowercase, iterating through each sentence and word,
and saving the lowercase words in a list:
words_lowercase = []
for sentence in sent_tokenize(text): # Iterate the sentences in "text"
    for word in word_tokenize(sentence): # Iterate the words in "sentence"
        words_lowercase.append(word.lower()) # Convert word to lowercase and add it to the "words_lowercase" list
print(words_lowercase) # Print the list of lowercase words
The output will look like:
['alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister',
'on', 'the', 'bank', ',', 'and', 'of', 'having', 'nothing', 'to', 'do', ':', 'once', 'or',
'twice', 'she', 'had', 'peeped', 'into', 'the', 'book', 'her', 'sister', 'was', 'reading', ',',
'but', 'it', 'had', 'no', 'pictures', 'or', 'conversations', 'in', 'it', ',', '``', 'and',
'what', 'is', 'the', 'use', 'of', 'a', 'book', ',', "''", 'thought', 'alice', '``', 'without',
'pictures', 'or', 'conversations', '?', "''", 'so', 'she', 'was', 'considering', 'in', 'her',
'own', 'mind', '(', 'as', 'well', 'as', 'she', 'could', ',', 'for', 'the', 'hot', 'day', 'made',
'her', 'feel', 'very', 'sleepy', 'and', 'stupid', ')', ',', 'whether', 'the', 'pleasure', 'of',
'making', 'a', 'daisy-chain', 'would', 'be', 'worth', 'the', 'trouble', 'of', 'getting', 'up',
'and', 'picking', 'the', 'daisies', ',', 'when', 'suddenly', 'a', 'white', 'rabbit', 'with',
'pink', 'eyes', 'ran', 'close', 'by', 'her', '.', 'there', 'was', 'nothing', 'so', 'very',
'remarkable', 'in', 'that', ';', 'nor', 'did', 'alice', 'think', 'it', 'so', 'very', 'much',
'out', 'of', 'the', 'way', 'to', 'hear', 'the', 'rabbit', 'say', 'to', 'itself', ',', '``', 'oh',
'dear', '!', 'oh', 'dear', '!', 'i', 'shall', 'be', 'late', '!', "''", '(', 'when', 'she',
'thought', 'it', 'over', 'afterwards', ',', 'it', 'occurred', 'to', 'her', 'that', 'she',
'ought', 'to', 'have', 'wondered', 'at', 'this', ',', 'but', 'at', 'the', 'time', 'it', 'all',
'seemed', 'quite', 'natural', ')', ';', 'but', 'when', 'the', 'rabbit', 'actually', 'took', 'a',
'watch', 'out', 'of', 'its', 'waistcoat-pocket', ',', 'and', 'looked', 'at', 'it', ',', 'and',
'then', 'hurried', 'on', ',', 'alice', 'started', 'to', 'her', 'feet', ',', 'for', 'it',
'flashed', 'across', 'her', 'mind', 'that', 'she', 'had', 'never', 'before', 'seen', 'a',
'rabbit', 'with', 'either', 'a', 'waistcoat-pocket', ',', 'or', 'a', 'watch', 'to', 'take',
'out', 'of', 'it', ',', 'and', 'burning', 'with', 'curiosity', ',', 'she', 'ran', 'across',
'the', 'field', 'after', 'it', ',', 'and', 'fortunately', 'was', 'just', 'in', 'time', 'to',
'see', 'it', 'pop', 'down', 'a', 'large', 'rabbit-hole', 'under', 'the', 'hedge', '.']
As you can see, there are no uppercase characters left in the words list.
2.3.4 Stop words removal
Stop words are common words in a language that carry little semantic meaning on their own. In many NLP
applications, it is useful to remove stop words from a text. To achieve this, we use stop word lists that have
been compiled for each language. We can access the list of English stop words from the NLTK library as
follows:
nltk.download('stopwords') # Download the stopwords lists from NLTK
from nltk.corpus import stopwords # Import the stop words lists from NLTK
stopwords_english = stopwords.words('english') # Load the stop words list for English in variable "stopwords_english"
print(stopwords_english) # Print the "stopwords_english" list
The output will look like:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll",
"you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
"she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',
'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these',
'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having',
'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as',
'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in',
'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when',
'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some',
'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can',
'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've',
'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',
"hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't",
'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn',
"wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
As you can see, the list of words from alice.txt contains a number of stop words. Let’s use NLTK’s English stop
words list to remove them:
words_lowercase_nostopwords = [] # Create empty list for remaining words
words_removed = [] # Create empty list for removed words
for word in words_lowercase: # Iterate through the list of lowercase words
    if word not in stopwords_english: # Keep the word only if it is not in the stop words list
        words_lowercase_nostopwords.append(word)
    else:
        words_removed.append(word)
print(words_lowercase_nostopwords) # Print list of remaining words
print(words_removed) # Print list of removed words
The output will look like:
['alice', 'beginning', 'get', 'tired', 'sitting', 'sister', 'bank', ',', 'nothing', ':', 'twice',
'peeped', 'book', 'sister', 'reading', ',', 'pictures', 'conversations', ',', '``', 'use',
'book', ',', "''", 'thought', 'alice', '``', 'without', 'pictures', 'conversations', '?', "''",
'considering', 'mind', '(', 'well', 'could', ',', 'hot', 'day', 'made', 'feel', 'sleepy',
'stupid', ')', ',', 'whether', 'pleasure', 'making', 'daisy-chain', 'would', 'worth', 'trouble',
'getting', 'picking', 'daisies', ',', 'suddenly', 'white', 'rabbit', 'pink', 'eyes', 'ran',
'close', '.', 'nothing', 'remarkable', ';', 'alice', 'think', 'much', 'way', 'hear', 'rabbit',
'say', ',', '``', 'oh', 'dear', '!', 'oh', 'dear', '!', 'shall', 'late', '!', "''", '(',
'thought', 'afterwards', ',', 'occurred', 'ought', 'wondered', ',', 'time', 'seemed', 'quite',
'natural', ')', ';', 'rabbit', 'actually', 'took', 'watch', 'waistcoat-pocket', ',', 'looked',
',', 'hurried', ',', 'alice', 'started', 'feet', ',', 'flashed', 'across', 'mind', 'never',
'seen', 'rabbit', 'either', 'waistcoat-pocket', ',', 'watch', 'take', ',', 'burning',
'curiosity', ',', 'ran', 'across', 'field', ',', 'fortunately', 'time', 'see', 'pop', 'large',
'rabbit-hole', 'hedge', '.']
['was', 'to', 'very', 'of', 'by', 'her', 'on', 'the', 'and', 'of', 'having', 'to', 'do', 'once',
'or', 'she', 'had', 'into', 'the', 'her', 'was', 'but', 'it', 'had', 'no', 'or', 'in', 'it',
'and', 'what', 'is', 'the', 'of', 'a', 'or', 'so', 'she', 'was', 'in', 'her', 'own', 'as', 'as',
'she', 'for', 'the', 'her', 'very', 'and', 'the', 'of', 'a', 'be', 'the', 'of', 'up', 'and',
'the', 'when', 'a', 'with', 'by', 'her', 'there', 'was', 'so', 'very', 'in', 'that', 'nor',
'did', 'it', 'so', 'very', 'out', 'of', 'the', 'to', 'the', 'to', 'itself', 'i', 'be', 'when',
'she', 'it', 'over', 'it', 'to', 'her', 'that', 'she', 'to', 'have', 'at', 'this', 'but', 'at',
'the', 'it', 'all', 'but', 'when', 'the', 'a', 'out', 'of', 'its', 'and', 'at', 'it', 'and',
'then', 'on', 'to', 'her', 'for', 'it', 'her', 'that', 'she', 'had', 'before', 'a', 'with', 'a',
'or', 'a', 'to', 'out', 'of', 'it', 'and', 'with', 'she', 'the', 'after', 'it', 'and', 'was',
'just', 'in', 'to', 'it', 'down', 'a', 'under', 'the']
As you can see in the list of the remaining words and the list of the removed words, we removed multiple words
from the alice.txt text, including “was”, “to”, “she”, “it”, “for”, etc.
2.3.5 Punctuation removal
If you look at the list of the remaining words, you can see that some of its elements are not actual words but
punctuation marks. Let’s remove these punctuation marks from the words list:
from string import punctuation # Import the punctuation marks string
print(punctuation)
print("Variable type: ", type(punctuation))
The output will look like:
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Variable type:  <class 'str'>
As you can see, the variable “punctuation” contains all the potential punctuation marks. However, the variable
is a string. To help us iterate and compare with the list of words, we will first convert the string “punctuation”
to a list with one character (punctuation mark) per element.
punctuation_list = list(punctuation) # Convert punctuation to a list
print(punctuation_list)
words_lowercase_nostopwords_no_punctuation = []
for word in words_lowercase_nostopwords:
    if word not in punctuation_list:
        words_lowercase_nostopwords_no_punctuation.append(word)
print(words_lowercase_nostopwords_no_punctuation)
The output will look like:
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>',
'?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
['alice', 'beginning', 'get', 'tired', 'sitting', 'sister', 'bank', 'nothing', 'twice', 'peeped',
'book', 'sister', 'reading', 'pictures', 'conversations', '``', 'use', 'book', "''", 'thought',
'alice', '``', 'without', 'pictures', 'conversations', "''", 'considering', 'mind', 'well',
'could', 'hot', 'day', 'made', 'feel', 'sleepy', 'stupid', 'whether', 'pleasure', 'making',
'daisy-chain', 'would', 'worth', 'trouble', 'getting', 'picking', 'daisies', 'suddenly', 'white',
'rabbit', 'pink', 'eyes', 'ran', 'close', 'nothing', 'remarkable', 'alice', 'think', 'much',
'way', 'hear', 'rabbit', 'say', '``', 'oh', 'dear', 'oh', 'dear', 'shall', 'late', "''",
'thought', 'afterwards', 'occurred', 'ought', 'wondered', 'time', 'seemed', 'quite', 'natural',
'rabbit', 'actually', 'took', 'watch', 'waistcoat-pocket', 'looked', 'hurried', 'alice',
'started', 'feet', 'flashed', 'across', 'mind', 'never', 'seen', 'rabbit', 'either',
'waistcoat-pocket', 'watch', 'take', 'burning', 'curiosity', 'ran', 'across', 'field',
'fortunately', 'time', 'see', 'pop', 'large', 'rabbit-hole', 'hedge']
As you can see, the majority of the punctuation marks were removed. But what about the remaining “``”
and “''” tokens? The list of punctuation marks that we used does not contain the “``” and “''” tokens that
the NLTK tokeniser produces for opening and closing double quotes, so they were treated as valid words. We
can address this issue by adding the respective double-quote tokens to the list of punctuation marks, as shown
in the sketch below.
2.3.6 Stemming
Consider the words “walk”, “walks”, “walking”, and “walked”. It is evident that all these words are different
forms of the word “walk”. We can use stemming to reduce each word to its respective stem, i.e. the core
meaning-bearing unit of a word. NLTK comes with some common stemming algorithms. Let’s use the Porter
Stemming algorithm to reduce these four words to their stems:
from nltk.stem import PorterStemmer # Import the Porter stemmer from NLTK
porter = PorterStemmer() # Create a Porter stemmer object
for word in ['walk','walks','walking','walked']:
print(word,"->",porter.stem(word))
The output will look like:
walk -> walk
walks -> walk
walking -> walk
walked -> walk
Let’s now stem the words from the alice.txt text using the Porter stemmer:
words_stemmed = []
for word in words_lowercase_nostopwords_no_punctuation:
    words_stemmed.append(porter.stem(word))
print(words_stemmed)
The output will look like:
['alic', 'begin', 'get', 'tire', 'sit', 'sister', 'bank', 'noth', 'twice', 'peep', 'book', 'sister',
'read', 'pictur', 'convers', 'use', 'book', 'thought', 'alic', 'without', 'pictur', 'convers',
'consid', 'mind', 'well', 'could', 'hot', 'day', 'made', 'feel', 'sleepi', 'stupid', 'whether',
'pleasur', 'make', 'daisy-chain', 'would', 'worth', 'troubl', 'get', 'pick', 'daisi', 'suddenli',
'white', 'rabbit', 'pink', 'eye', 'ran', 'close', 'noth', 'remark', 'alic', 'think', 'much',
'way', 'hear', 'rabbit', 'say', 'oh', 'dear', 'oh', 'dear', 'shall', 'late', 'thought',
'afterward', 'occur', 'ought', 'wonder', 'time', 'seem', 'quit', 'natur', 'rabbit', 'actual',
'took', 'watch', 'waistcoat-pocket', 'look', 'hurri', 'alic', 'start', 'feet', 'flash', 'across',
'mind', 'never', 'seen', 'rabbit', 'either', 'waistcoat-pocket', 'watch', 'take', 'burn',
'curios', 'ran', 'across', 'field', 'fortun', 'time', 'see', 'pop', 'larg', 'rabbit-hol', 'hedg']
As you can see, the Porter stemming algorithm reduced the words from alice.txt to their stems. However, it is
evident that a lot of the stems do not correspond to real words from the English language.
2.3.7 Lemmatisation
Lemmatisation can address this issue by reducing each word to the respective dictionary headword. Let’s
lemmatise the words “walk”, “walks”, “walking”, and “walked” using NLTK’s WordNetLemmatizer:
nltk.download('wordnet') # Download the WordNetLemmatizer package
from nltk.stem import WordNetLemmatizer # Import the WordNetLemmatizer
wnl = WordNetLemmatizer() # Create a WordNetLemmatizer object
for word in ['walk','walks','walking','walked']:
print(word,"->",wnl.lemmatize(word))
The output will look like:
walk -> walk
walks -> walk
walking -> walking
walked -> walked
As you can see, the words “walk” and “walks” were converted to the lemma “walk”, but “walking” and “walked”
were not changed. The reason for this is that the WordNetLemmatizer considers by default all inputs as nouns,
thus “walks” is considered as the plural form of “walk” and is converted to “walk”, but “walking” and “walked”
are valid lemmas and remain unchanged. To lemmatise all words to their base verb form, we must indicate that
we are inputting verb forms:
for word in ['walk','walks','walking','walked']:
print(word,"->",wnl.lemmatize(word,pos='v'))
The output will look like:
walk -> walk
walks -> walk
walking -> walk
walked -> walk
Using the “pos” argument, the input words were handled as verb forms by the WordNetLemmatizer, which
returned the base form “walk” for all of them. The options for “pos” are noun (n), verb (v), adverb (r), and
adjective (a). More information about the WordNetLemmatizer is available here: https://www.nltk.org/api/nltk.stem.wordnet.html
Let’s lemmatise the text from alice.txt, treating the input as nouns and then as verbs:
lemmas_noun = []
lemmas_verb = []
for word in words_lowercase_nostopwords_no_punctuation:
    lemmas_noun.append(wnl.lemmatize(word,pos='n'))
    lemmas_verb.append(wnl.lemmatize(word,pos='v'))
print(lemmas_noun)
print(lemmas_verb)
The output will look like:
['alice', 'beginning', 'get', 'tired', 'sitting', 'sister', 'bank', 'nothing', 'twice', 'peeped',
'book', 'sister', 'reading', 'picture', 'conversation', 'use', 'book', 'thought', 'alice',
'without', 'picture', 'conversation', 'considering', 'mind', 'well', 'could', 'hot', 'day',
'made', 'feel', 'sleepy', 'stupid', 'whether', 'pleasure', 'making', 'daisy-chain', 'would',
'worth', 'trouble', 'getting', 'picking', 'daisy', 'suddenly', 'white', 'rabbit', 'pink', 'eye',
'ran', 'close', 'nothing', 'remarkable', 'alice', 'think', 'much', 'way', 'hear', 'rabbit',
'say', 'oh', 'dear', 'oh', 'dear', 'shall', 'late', 'thought', 'afterwards', 'occurred', 'ought',
'wondered', 'time', 'seemed', 'quite', 'natural', 'rabbit', 'actually', 'took', 'watch',
'waistcoat-pocket', 'looked', 'hurried', 'alice', 'started', 'foot', 'flashed', 'across', 'mind',
'never', 'seen', 'rabbit', 'either', 'waistcoat-pocket', 'watch', 'take', 'burning', 'curiosity',
'ran', 'across', 'field', 'fortunately', 'time', 'see', 'pop', 'large', 'rabbit-hole', 'hedge']
['alice', 'begin', 'get', 'tire', 'sit', 'sister', 'bank', 'nothing', 'twice', 'peep', 'book',
'sister', 'read', 'picture', 'conversations', 'use', 'book', 'think', 'alice', 'without',
'picture', 'conversations', 'consider', 'mind', 'well', 'could', 'hot', 'day', 'make', 'feel',
'sleepy', 'stupid', 'whether', 'pleasure', 'make', 'daisy-chain', 'would', 'worth', 'trouble',
'get', 'pick', 'daisies', 'suddenly', 'white', 'rabbit', 'pink', 'eye', 'run', 'close',
'nothing', 'remarkable', 'alice', 'think', 'much', 'way', 'hear', 'rabbit', 'say', 'oh', 'dear',
'oh', 'dear', 'shall', 'late', 'think', 'afterwards', 'occur', 'ought', 'wonder', 'time', 'seem',
'quite', 'natural', 'rabbit', 'actually', 'take', 'watch', 'waistcoat-pocket', 'look', 'hurry',
'alice', 'start', 'feet', 'flash', 'across', 'mind', 'never', 'see', 'rabbit', 'either',
'waistcoat-pocket', 'watch', 'take', 'burn', 'curiosity', 'run', 'across', 'field',
'fortunately', 'time', 'see', 'pop', 'large', 'rabbit-hole', 'hedge']
However, this approach is not practical. Ideally, we would like to know what part of speech each word is and
use the lemmatiser accordingly.
2.3.8 Part of Speech (POS) tagging
Part of Speech (POS) tagging is used to detect which part of speech each word in a sentence belongs to. Let’s use
NLTK’s POS tagging algorithm to assign POS tags to each of the words in the sentence “I had been a student
here for a long time.”:
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag
pos_tagged_sentence = pos_tag(word_tokenize('I had been a student here for a long time'))
print(pos_tagged_sentence)
The output will look like:
[('I', 'PRP'), ('had', 'VBD'), ('been', 'VBN'), ('a', 'DT'), ('student', 'NN'), ('here', 'RB'),
('for', 'IN'), ('a', 'DT'), ('long', 'JJ'), ('time', 'NN')]
As you can see, each word in the sentence has been annotated with a POS tag. These POS tags come from
the Penn Treebank POS tag set (https://catalog.ldc.upenn.edu/docs/LDC95T7/cl93.html). However, the
WordNetLemmatizer expects different tag names. To address this issue, we first have to convert the Penn
Treebank POS tags to the format expected by the WordNetLemmatizer.
def penn_to_wordnet(penn_pos_tag):
    '''Convert Penn Treebank POS tags to WordNet POS tags'''
    tag_dictionary = {'NN':'n', 'JJ':'a', 'VB':'v', 'RB':'r'}
    try:
        # Look up the first two characters of the Penn Treebank POS tag in "tag_dictionary"
        return tag_dictionary[penn_pos_tag[:2]]
    except:
        return 'n' # Default to noun if no mapping is available
lemmas = []
for word, tag in pos_tagged_sentence:
    lemmas.append(wnl.lemmatize(word.lower(), pos=penn_to_wordnet(tag)))
print('I had been a student here for a long time')
print(lemmas)
The output will look like:
I had been a student here for a long time
['i', 'have', 'be', 'a', 'student', 'here', 'for', 'a', 'long', 'time']
As you can see, the sentence was properly lemmatised using POS tagging to inform the lemmatiser about the
part of speech that each word refers to.
Note: Please note that the conversion from Penn Treebank POS tags to WordNet format used here is a
simplification. Full conversion tables should be used for better results. Also, the code above uses exception
handling via the “try” and “except” statements. You can read more about exception handling in Python from
here: https://docs.python.org/3/tutorial/errors.html
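As a small, standalone illustration of the try/except pattern used above (not part of the workshop code): a failed dictionary lookup raises a KeyError, which can be caught and replaced with a default value.
tag_dictionary = {'NN':'n', 'JJ':'a', 'VB':'v', 'RB':'r'}
try:
    print(tag_dictionary['XX']) # 'XX' is not a key in the dictionary, so a KeyError is raised
except KeyError:
    print('n') # The exception is caught and the default value 'n' is printed instead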
Let’s now lemmatise the alice.txt text:
lemmas_alice = []
for sent in sent_tokenize(text): # Tokenise text into sentences
    pos_tagged_sentence_alice = pos_tag(word_tokenize(sent)) # Get POS tags for each sentence
    for word, tag in pos_tagged_sentence_alice: # Iterate through POS tagged words
        if word.lower() not in punctuation_list: # Ignore words that are punctuation marks
            lemmas_alice.append(wnl.lemmatize(word.lower(), pos=penn_to_wordnet(tag))) # Lemmatise word
print(lemmas_alice)
The output will look like:
['alice', 'be', 'begin', 'to', 'get', 'very', 'tired', 'of', 'sit', 'by', 'her', 'sister', 'on',
'the', 'bank', 'and', 'of', 'have', 'nothing', 'to', 'do', 'once', 'or', 'twice', 'she', 'have',
'peep', 'into', 'the', 'book', 'her', 'sister', 'be', 'read', 'but', 'it', 'have', 'no',
'picture', 'or', 'conversation', 'in', 'it', '``', 'and', 'what', 'be', 'the', 'use', 'of', 'a',
'book', "''", 'think', 'alice', '``', 'without', 'picture', 'or', 'conversation', "''", 'so',
'she', 'be', 'consider', 'in', 'her', 'own', 'mind', 'as', 'well', 'a', 'she', 'could', 'for',
'the', 'hot', 'day', 'make', 'her', 'feel', 'very', 'sleepy', 'and', 'stupid', 'whether', 'the',
'pleasure', 'of', 'make', 'a', 'daisy-chain', 'would', 'be', 'worth', 'the', 'trouble', 'of',
'get', 'up', 'and', 'pick', 'the', 'daisy', 'when', 'suddenly', 'a', 'white', 'rabbit', 'with',
'pink', 'eye', 'run', 'close', 'by', 'her', 'there', 'be', 'nothing', 'so', 'very', 'remarkable',
'in', 'that', 'nor', 'do', 'alice', 'think', 'it', 'so', 'very', 'much', 'out', 'of', 'the',
'way', 'to', 'hear', 'the', 'rabbit', 'say', 'to', 'itself', '``', 'oh', 'dear', 'oh', 'dear',
'i', 'shall', 'be', 'late', "''", 'when', 'she', 'think', 'it', 'over', 'afterwards', 'it',
'occur', 'to', 'her', 'that', 'she', 'ought', 'to', 'have', 'wonder', 'at', 'this', 'but', 'at',
'the', 'time', 'it', 'all', 'seem', 'quite', 'natural', 'but', 'when', 'the', 'rabbit',
'actually', 'take', 'a', 'watch', 'out', 'of', 'it', 'waistcoat-pocket', 'and', 'look', 'at',
'it', 'and', 'then', 'hurry', 'on', 'alice', 'start', 'to', 'her', 'foot', 'for', 'it', 'flash',
'across', 'her', 'mind', 'that', 'she', 'have', 'never', 'before', 'see', 'a', 'rabbit', 'with',
'either', 'a', 'waistcoat-pocket', 'or', 'a', 'watch', 'to', 'take', 'out', 'of', 'it', 'and',
'burn', 'with', 'curiosity', 'she', 'run', 'across', 'the', 'field', 'after', 'it', 'and',
'fortunately', 'be', 'just', 'in', 'time', 'to', 'see', 'it', 'pop', 'down', 'a', 'large',
'rabbit-hole', 'under', 'the', 'hedge']
2.4 Exercises
Exercise 2.1 Section 2.3.5: Create a new punctuation marks list to address the issue of the remaining “``”
and “''” tokens.
Exercise 2.2 Section 2.3.5: Remove stop words and punctuation marks without iterating twice through all
words.
Exercise 2.3 Load the text from dune.txt. Compute the number of words in the text, not including punctuation
marks.
Exercise 2.4 Lemmatise the text from dune.txt and print a list of all the lemmas. Remember to convert all
words to lowercase and to remove punctuation marks.
Exercise 2.5 Create a list of the unique lemmas in dune.txt, count their number and print the list and the
number of lemmas.
Exercise 2.6 Create a custom function to divide English text into sentences without using NLTK or other
tokenisers. Consider that a sentence ends when one of the following characters occurs: “.”, “?”,
“!”. Also remember to take into consideration the line change character “\n”. Test your function
on the text from dune.txt.
Workshop 3: Text representation
In this lab we will work on different ways to represent text.
3.1 Load corpus
We will use the Gutenberg corpus from NLTK. Let’s first load the Gutenberg corpus:
import nltk # Import the NLTK library
nltk.download('gutenberg') # Download the gutenberg corpus
from nltk.corpus import gutenberg # Import the gutenberg corpus from NLTK
gutenberg.fileids() # List file-ids in the corpus
The output will look like:
['austen-emma.txt',
'austen-persuasion.txt',
'austen-sense.txt',
'bible-kjv.txt',
'blake-poems.txt',
'bryant-stories.txt',
'burgess-busterbrown.txt',
'carroll-alice.txt',
'chesterton-ball.txt',
'chesterton-brown.txt',
'chesterton-thursday.txt',
'edgeworth-parents.txt',
'melville-moby_dick.txt',
'milton-paradise.txt',
'shakespeare-caesar.txt',
'shakespeare-hamlet.txt',
'shakespeare-macbeth.txt',
'whitman-leaves.txt']
Let’s compute some statistics for each document in the Gutenberg corpus, like the number of words, the number
of sentences, and the number of characters:
print("Chars\tWords\tSents\tFile")
for fileid in gutenberg.fileids(): # Iterate through files in corpus
num_chars = len(gutenberg.raw(fileid))
num_words = len(gutenberg.words(fileid))
num_sents = len(gutenberg.sents(fileid))
print("%7.0f\t%7.0f\t%7.0f\t%s" % (num_sents,num_words,num_chars,fileid))
The output will look like:
Sents Words Chars File
7752 192427 887071 austen-emma.txt
3747 98171 466292 austen-persuasion.txt
4999 141576 673022 austen-sense.txt
30103 1010654 4332554 bible-kjv.txt
438 8354 38153 blake-poems.txt
2863 55563 249439 bryant-stories.txt
1054 18963 84663 burgess-busterbrown.txt
1703 34110 144395 carroll-alice.txt
4779 96996 457450 chesterton-ball.txt
3806 86063 406629 chesterton-brown.txt
3742 69213 320525 chesterton-thursday.txt
10230 210663 935158 edgeworth-parents.txt
10059 260819 1242990 melville-moby_dick.txt
1851 96825 468220 milton-paradise.txt
2163 25833 112310 shakespeare-caesar.txt
3106 37360 162881 shakespeare-hamlet.txt
1907 23140 100351 shakespeare-macbeth.txt
4250 154883 711215 whitman-leaves.txt
3.2 Vocabulary
As we can see above, each document contains anywhere from a few thousand to over a million words. However,
these words are not all unique. Natural language consists of words that convey meaning and are re-used and
combined to form different sentences. The set of unique words used in each document constitutes its vocabulary.
3.2.1 Words in corpus
Consider the text “The new table is red. The blue table is broken.” Let’s compute its vocabulary. First we
should tokenise the text into words, convert them all to lowercase and remove punctuation marks:
from nltk import word_tokenize # Import the word_tokenize function from NLTK
from string import punctuation
punctuation_list = list(punctuation) # Convert string with punctuation marks to list
text = "The new table is red. The blue table is broken."
text_tokens_processed = []
text_tokens = word_tokenize(text) # Tokenise the text into words
for token in text_tokens: # Iterate through the available tokens
    if token not in punctuation_list: # Omit tokens that are punctuation marks
        text_tokens_processed.append(token.lower()) # Add lowercase version of token to list
print("List of processed words:",text_tokens_processed)
The output will look like:
List of processed words: ['the', 'new', 'table', 'is', 'red', 'the', 'blue', 'table', 'is', 'broken']
3.2.2 Unique words in corpus
As you can see in the list of words, the words “the”, “table” and “is” appear two times each in the text. Let’s
now compute the vocabulary of this text, i.e. the list of unique words used in this text. To achieve this, we will
use Python’s set type. A set is similar to a list but allows only unique elements.
vocabulary = set() # Create an empty set
for word in text_tokens_processed: # Iterate through available words
    vocabulary.add(word) # Add word to set
print("Vocabulary:",vocabulary)
print("Vocabulary size:",len(vocabulary))
vocabulary2 = set(text_tokens_processed)
print("\nVocabulary2:",vocabulary2)
print("Vocabulary2 size:",len(vocabulary2))
The output will look like:
Vocabulary: {'the', 'red', 'broken', 'blue', 'table', 'is', 'new'}
Vocabulary size: 7
Vocabulary2: {'the', 'red', 'broken', 'blue', 'table', 'is', 'new'}
Vocabulary2 size: 7
As you can see, the vocabulary used by the text “The new table is red. The blue table is broken.” consists of
the following seven words: is, table, blue, red, broken, new, the. Note that you do not have to add the contents
of a list to a set one by one; you can pass the list directly to set(), as shown for the variable “vocabulary2”.
3.2.3 Vocabulary of multiple documents
Let’s now compute the vocabulary for each document in the Gutenberg corpus:
for fileid in gutenberg.fileids(): # Iterate through documents in corpus
    vocabulary_of_document = set() # Create empty set
    for word in gutenberg.words(fileid): # Iterate through words in document
        if word not in punctuation_list: # Omit tokens that are punctuation marks
            vocabulary_of_document.add(word.lower())
    print("%6.0f\t%s" % (len(vocabulary_of_document),fileid))
The output will look like:
7328 austen-emma.txt
5820 austen-persuasion.txt
6388 austen-sense.txt
12755 bible-kjv.txt
1521 blake-poems.txt
3925 bryant-stories.txt
1547 burgess-busterbrown.txt
2622 carroll-alice.txt
8313 chesterton-ball.txt
7780 chesterton-brown.txt
6335 chesterton-thursday.txt
8432 edgeworth-parents.txt
17215 melville-moby_dick.txt
9007 milton-paradise.txt
3019 shakespeare-caesar.txt
4703 shakespeare-hamlet.txt
3451 shakespeare-macbeth.txt
12437 whitman-leaves.txt
As you can see, we computed the vocabulary for each document in the Gutenberg corpus and printed its size.
However, vocabularies from different documents are expected to have similar words in them since all texts are
in the same language. Let’s compute the vocabulary that covers all documents in the dataset:
vocabulary_of_corpus = set() # Create empty set
for fileid in gutenberg.fileids(): # Iterate through documents in corpus
    for word in gutenberg.words(fileid): # Iterate through words in document
        if word not in punctuation_list: # Omit tokens that are punctuation marks
            vocabulary_of_corpus.add(word.lower())
print("Vocabulary of Gutenberg corpus:",len(vocabulary_of_corpus),"words")
The output will look like:
Vocabulary of Gutenberg corpus: 42314 words
As you can see, the vocabulary that covers all documents in the Gutenberg corpus is much larger than the
individual document vocabularies, but significantly smaller than the sum of all the individual vocabulary sizes,
as the sketch below illustrates.
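A short sketch that makes the comparison explicit, reusing the loops above (the variable sum_of_vocabulary_sizes is illustrative):
sum_of_vocabulary_sizes = 0
for fileid in gutenberg.fileids(): # Iterate through documents in corpus
    vocabulary_of_document = set()
    for word in gutenberg.words(fileid): # Iterate through words in document
        if word not in punctuation_list: # Omit tokens that are punctuation marks
            vocabulary_of_document.add(word.lower())
    sum_of_vocabulary_sizes += len(vocabulary_of_document) # Accumulate the size of each vocabulary
print("Sum of individual vocabulary sizes:", sum_of_vocabulary_sizes)
print("Size of the corpus-wide vocabulary:", len(vocabulary_of_corpus))
Based on the sizes printed above, the sum is approximately 122,600 words, compared to the 42,314 words of the corpus-wide vocabulary.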
3.3 One-hot encoding
3.3.1 One-hot encoding of words in vocabulary
Consider again the text “The new table is red. The blue table is broken.” We have already computed its
vocabulary and would like to compute the One-Hot representation of each word in the vocabulary.
from numpy import array # Import array type from numpy
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
vocabulary = ['is', 'table', 'blue', 'red', 'broken', 'new', 'the']
data = array(vocabulary) # Convert to array because it is required by the LabelEncoder() object
print(data,"\n")
# Integer encoding - Assigns a unique index to each unique word
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(data)
print(integer_encoded,"\n")
# One-Hot encoding - Assigns a One-Hot binary representation to each word
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded,"\n")
for i in range(len(data)):
print(onehot_encoded[i],"->",data[i])
The output will look like:
['is' 'table' 'blue' 'red' 'broken' 'new' 'the']
[2 5 0 4 1 3 6]
[[0. 0. 1. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 1. 0.]
[1. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0. 0.]
[0. 1. 0. 0. 0. 0. 0.]
[0. 0. 0. 1. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 1.]]
[0. 0. 1. 0. 0. 0. 0.] -> is
[0. 0. 0. 0. 0. 1. 0.] -> table
[1. 0. 0. 0. 0. 0. 0.] -> blue
[0. 0. 0. 0. 1. 0. 0.] -> red
[0. 1. 0. 0. 0. 0. 0.] -> broken
[0. 0. 0. 1. 0. 0. 0.] -> new
[0. 0. 0. 0. 0. 0. 1.] -> the
Let’s create a dictionary to easily encode our text and one-hot encode the word “red”:
dictionary = {}
for i in range(len(data)):
    dictionary[data[i]] = onehot_encoded[i]
print(dictionary)
print("\nred =",dictionary['red'])
The output will look like:
{'is': array([0., 0., 1., 0., 0., 0., 0.]), 'table': array([0., 0., 0., 0., 0., 1., 0.]), 'blue':
array([1., 0., 0., 0., 0., 0., 0.]), 'red': array([0., 0., 0., 0., 1., 0., 0.]), 'broken':
array([0., 1., 0., 0., 0., 0., 0.]), 'new': array([0., 0., 0., 1., 0., 0., 0.]), 'the':
array([0., 0., 0., 0., 0., 0., 1.])}
red = [0. 0. 0. 0. 1. 0. 0.]
Consider the one-hot encoded word (0, 0, 0, 0, 1, 0, 0). How can we convert it back to its respective real word?
Let’s create a function to do this:
def get_label_from_dictionary(dictionary,value):
    for word, one_hot in dictionary.items(): # Iterate all (word, one-hot representation) pairs in the dictionary
        if (one_hot == value).all(): # Compare equality between numpy arrays element-wise
            return word
print(get_label_from_dictionary(dictionary,[0,0,0,0,1,0,0]))
The output will look like:
red
3.3.2 One-hot encoding of text
Consider again the text “The new table is red. The blue table is broken.” We have already computed its
vocabulary and the one-hot representation of each word in the vocabulary. How can we one-hot encode the
whole text? To do so, we have to perform a logical OR operation between the one-hot vectors of its constituent
words.
from numpy import logical_or # Import the element-wise logical OR function from numpy
from numpy import zeros # Import the zeros function from numpy
text = "The new table is red. The blue table is broken."
print("List of processed words:",text_tokens_processed)
result = zeros(len(dictionary)) # Start from a zero-valued numpy vector
for word in text_tokens_processed: # Iterate words in text
    print(result.astype(int), "OR", dictionary[word],"= ",end='')
    result = logical_or(result,dictionary[word]) # Compute the element-wise logical OR between the partial result and the one-hot representation of the word
    print(result.astype(int))
print("\nOne-Hot encoded text:",result.astype(int))
The output will look like:
List of processed words: ['the', 'new', 'table', 'is', 'red', 'the', 'blue', 'table', 'is', 'broken']
[0 0 0 0 0 0 0] OR [0. 0. 0. 0. 0. 0. 1.] = [0 0 0 0 0 0 1]
[0 0 0 0 0 0 1] OR [0. 0. 0. 1. 0. 0. 0.] = [0 0 0 1 0 0 1]
[0 0 0 1 0 0 1] OR [0. 0. 0. 0. 0. 1. 0.] = [0 0 0 1 0 1 1]
[0 0 0 1 0 1 1] OR [0. 0. 1. 0. 0. 0. 0.] = [0 0 1 1 0 1 1]
[0 0 1 1 0 1 1] OR [0. 0. 0. 0. 1. 0. 0.] = [0 0 1 1 1 1 1]
[0 0 1 1 1 1 1] OR [0. 0. 0. 0. 0. 0. 1.] = [0 0 1 1 1 1 1]
[0 0 1 1 1 1 1] OR [1. 0. 0. 0. 0. 0. 0.] = [1 0 1 1 1 1 1]
[1 0 1 1 1 1 1] OR [0. 0. 0. 0. 0. 1. 0.] = [1 0 1 1 1 1 1]
[1 0 1 1 1 1 1] OR [0. 0. 1. 0. 0. 0. 0.] = [1 0 1 1 1 1 1]
[1 0 1 1 1 1 1] OR [0. 1. 0. 0. 0. 0. 0.] = [1 1 1 1 1 1 1]
One-Hot encoded text: [1 1 1 1 1 1 1]
Let’s create a function for one-hot encoding text and do the same for the text “the broken table”:
def get_text_one_hot_encoding(words_list, dictionary):
    result = zeros(len(dictionary)) # Start from a zero-valued numpy vector
    for word in words_list: # Iterate words in text
        result = logical_or(result,dictionary[word]) # Compute the element-wise logical OR between the partial result and the one-hot representation of the word
    return result.astype(int)
words_list = ['the','broken','table']
print("\nOne-Hot encoded text:",get_text_one_hot_encoding(words_list,dictionary))
The output will look like:
One-Hot encoded text: [0 1 0 0 0 1 1]
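Once texts are encoded as vectors, they can also be compared numerically. A hedged sketch (assuming scipy is available; cosine distance is only covered further in the exercises) comparing the one-hot encodings of the full example text and of “the broken table”:
from scipy.spatial.distance import cosine # Cosine distance = 1 - cosine similarity
vector_text = get_text_one_hot_encoding(text_tokens_processed, dictionary) # Full example text
vector_short = get_text_one_hot_encoding(['the','broken','table'], dictionary) # "the broken table"
print("Cosine distance:", cosine(vector_text, vector_short)) # Approximately 0.345 for these two vectors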
3.4 Term Frequency (TF) representation
3.4.1 Compute TF of words in text
Let’s now compute the term frequency of each word in the text “The new table is red. The blue table is broken.”
from nltk import FreqDist # Import the FreqDist function from NLTK
text = "The new table is red. The blue table is broken."
text_tokens_processed = ['the', 'new', 'table', 'is', 'red', 'the', 'blue', 'table', 'is', 'broken']
vocabulary = {'new', 'broken', 'the', 'blue', 'table', 'red', 'is'}
tf = FreqDist(text_tokens_processed) # Compute term frequency of words
print(tf,"\n")
vocabulary = sorted(vocabulary) # Sort alphabetically for better presentation
for word in vocabulary:
    print("%5.0f %s" % (tf[word],word))
The output will look like:
<FreqDist with 7 samples and 10 outcomes>
1 blue
1 broken
2 is
1 new
1 red
2 table
2 the
3.4.2 TF representation of documents
Let’s now compute the TF representation of the text “The new table is red. The blue table is broken.” and
the text “The new table is broken”:
text_tf = []
for word in vocabulary:
    text_tf.append(tf[word])
print(text_tf,"->",text)
text2 = "The new table is broken"
text2_tokens_processed = ['the','new','table','is','broken']
tf2 = FreqDist(text2_tokens_processed) # Compute term frequency of words
text2_tf = []
for word in vocabulary:
    text2_tf.append(tf2[word])
print(text2_tf,"->",text2)
The output will look like:
[1, 1, 2, 1, 1, 2, 2] -> The new table is red. The blue table is broken.
[0, 1, 1, 1, 0, 1, 1] -> The new table is broken
3.5 Term Frequency - Inverse Document Frequency (TF-IDF)
Consider a corpus consisting of the three following documents:
1. “The new table is red. The blue table was broken.”
2. “The new movie that we watched yesterday was terrible.”
3. “We raised the red and blue flag yesterday.”
3.5.1 Document Frequency (DF)
Let’s compute the Document Frequency (DF) of each word in the above corpus. Remember that the DF of a
word is equal to the number of documents in a corpus that the word appears in.
# Create a list of the lists of lowercase words without punctuation marks for each document
texts_words_processed = []
texts_words_processed.append(['the','new','table','is','red','the','blue','table','was','broken'])
texts_words_processed.append(['the','new','movie','that','we','watched','yesterday','was','terrible'])
texts_words_processed.append(['we','raised','the','red','and','blue','flag','yesterday'])
print(texts_words_processed)
# Create the vocabulary
vocabulary_texts = set()
for doc in texts_words_processed:
    for word in doc:
        vocabulary_texts.add(word)
vocabulary_texts = sorted(vocabulary_texts) # Sort vocabulary alphabetically for better presentation
print("\nVocabulary:",vocabulary_texts)
DF = dict() # Create an empty dictionary
for word in vocabulary_texts: # Iterate through words in vocabulary
    cnt = 0
    for doc in texts_words_processed: # Iterate through documents
        if word in doc:
            cnt += 1 # cnt += 1 is equivalent to cnt = cnt + 1
    DF[word] = cnt
print("\nDocument frequencies:",DF)
The output will look like:
[['the', 'new', 'table', 'is', 'red', 'the', 'blue', 'table', 'was', 'broken'], ['the', 'new',
'movie', 'that', 'we', 'watched', 'yesterday', 'was', 'terrible'], ['we', 'raised', 'the', 'red',
'and', 'blue', 'flag', 'yesterday']]
Vocabulary: ['and', 'blue', 'broken', 'flag', 'is', 'movie', 'new', 'raised', 'red', 'table',
'terrible', 'that', 'the', 'was', 'watched', 'we', 'yesterday']
Document frequencies: {'and': 1, 'blue': 2, 'broken': 1, 'flag': 1, 'is': 1, 'movie': 1, 'new': 2,
'raised': 1, 'red': 2, 'table': 1, 'terrible': 1, 'that': 1, 'the': 3, 'was': 2, 'watched': 1,
'we': 2, 'yesterday': 2}
3.5.2 Inverse Document Frequency (IDF)
The Inverse Document Frequency (IDF) of a word is the logarithmically scaled inverse fraction of the documents
that contain the word (obtained by dividing the total number of documents by the number of documents
containing the term, and then taking the logarithm of that quotient). IDF is defined as:
IDF(t, D) = log( N / DF(t, D) )    (3.1)
where t is a word (term), D the corpus, and N the number of documents in the corpus.
Let’s compute the IDF for the examined corpus:
import math # Import math library
N = 3 # The corpus contains 3 documents
IDF = dict() # Create an empty dictionary
for word in vocabulary_texts: # Iterate through words in vocabulary
    IDF[word] = math.log( N / DF[word] ) # Compute IDF of word
print("IDF:",IDF)
The output will look like:
IDF: {'and': 1.0986122886681098, 'blue': 0.4054651081081644, 'broken': 1.0986122886681098, 'flag':
1.0986122886681098, 'is': 1.0986122886681098, 'movie': 1.0986122886681098, 'new':
0.4054651081081644, 'raised': 1.0986122886681098, 'red': 0.4054651081081644, 'table':
1.0986122886681098, 'terrible': 1.0986122886681098, 'that': 1.0986122886681098, 'the': 0.0,
'was': 0.4054651081081644, 'watched': 1.0986122886681098, 'we': 0.4054651081081644, 'yesterday':
0.4054651081081644}
As you can see above, the more documents a word appears in, the lower its IDF. Note that the IDF is 0 for
the word “the”, which appears in all documents (a consequence of N = DF ⇒ IDF = log 1 = 0).
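For example, with N = 3, a word that appears in only one document has IDF = log(3/1) ≈ 1.0986, a word that appears in two documents has IDF = log(3/2) ≈ 0.4055, and a word that appears in all three documents has IDF = log(3/3) = 0, which matches the values printed above.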
3.5.3 Term Frequency - Inverse Document Frequency (TF-IDF)
Term Frequency - Inverse Document Frequency (TF-IDF) is defined as the product of TF and IDF:
TF-IDF(t, d, D) = TF(t, d) · IDF(t, D)    (3.2)
where d is a document of corpus D.
Let’s compute the TF-IDF for each word in each document of the examined corpus
from nltk import FreqDist # Import the FreqDist function from NLTK
TF = []
for doc in texts_words_processed: # Iterate through documents
    TF.append(FreqDist(doc)) # Compute word frequencies
print(TF,"\n")
TFIDF = []
for tf_doc in TF:
    tfidf_doc = dict()
    for word in vocabulary_texts: # Iterate through words in vocabulary
        tfidf_doc[word] = tf_doc[word] * IDF[word] # Compute TF-IDF - tf_doc is of type FreqDist and will return 0 for words that don't exist
    TFIDF.append(tfidf_doc)
cnt = 0
for tfidf_doc in TFIDF:
    print("Text",cnt,"TF-IDF:",tfidf_doc,"\n")
    cnt += 1
The output will look like:
[FreqDist({'the': 2, 'table': 2, 'new': 1, 'is': 1, 'red': 1, 'blue': 1, 'was': 1, 'broken': 1}),
FreqDist({'the': 1, 'new': 1, 'movie': 1, 'that': 1, 'we': 1, 'watched': 1, 'yesterday': 1,
'was': 1, 'terrible': 1}), FreqDist({'we': 1, 'raised': 1, 'the': 1, 'red': 1, 'and': 1, 'blue':
1, 'flag': 1, 'yesterday': 1})]
Text 0 TF-IDF: {'and': 0.0, 'blue': 0.4054651081081644, 'broken': 1.0986122886681098, 'flag': 0.0,
'is': 1.0986122886681098, 'movie': 0.0, 'new': 0.4054651081081644, 'raised': 0.0, 'red':
0.4054651081081644, 'table': 2.1972245773362196, 'terrible': 0.0, 'that': 0.0, 'the': 0.0, 'was':
0.4054651081081644, 'watched': 0.0, 'we': 0.0, 'yesterday': 0.0}
Text 1 TF-IDF: {'and': 0.0, 'blue': 0.0, 'broken': 0.0, 'flag': 0.0, 'is': 0.0, 'movie':
1.0986122886681098, 'new': 0.4054651081081644, 'raised': 0.0, 'red': 0.0, 'table': 0.0,
'terrible': 1.0986122886681098, 'that': 1.0986122886681098, 'the': 0.0, 'was':
0.4054651081081644, 'watched': 1.0986122886681098, 'we': 0.4054651081081644, 'yesterday':
0.4054651081081644}
Text 2 TF-IDF: {'and': 1.0986122886681098, 'blue': 0.4054651081081644, 'broken': 0.0, 'flag':
1.0986122886681098, 'is': 0.0, 'movie': 0.0, 'new': 0.0, 'raised': 1.0986122886681098, 'red':
0.4054651081081644, 'table': 0.0, 'terrible': 0.0, 'that': 0.0, 'the': 0.0, 'was': 0.0,
'watched': 0.0, 'we': 0.4054651081081644, 'yesterday': 0.4054651081081644}
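As a hedged aside (not part of the workshop code), scikit-learn can compute TF-IDF representations directly with its TfidfVectorizer. Note that it uses a smoothed IDF formula and L2-normalises each document vector by default, so its values will not match the ones computed above exactly; in older scikit-learn versions, get_feature_names_out() is called get_feature_names().
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["The new table is red. The blue table was broken.",
          "The new movie that we watched yesterday was terrible.",
          "We raised the red and blue flag yesterday."]
vectorizer = TfidfVectorizer(lowercase=True) # Tokenises, lowercases, and computes TF-IDF
tfidf_matrix = vectorizer.fit_transform(corpus) # One row per document, one column per word
print(vectorizer.get_feature_names_out()) # The learned vocabulary
print(tfidf_matrix.toarray()) # TF-IDF values for each document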
3.6 Exercises
Exercise 3.1 Create a dictionary with the one-hot encoding of the vocabulary of the carroll-alice.txt file from
the Gutenberg corpus.
Exercise 3.2 Based on the one-hot encoding of the vocabulary of the carroll-alice.txt from the Gutenberg
corpus, one-hot encode the sentences “this is an old house”, “this is a new house”, and “he left
his house” and compute their cosine distance.
Exercise 3.3 Using the vocabulary of the carroll-alice.txt document from the Gutenberg corpus, compute the
TF-IDF representations and the respective cosine and euclidean distances of the documents in a
corpus containing alice.txt and dune.txt.
Exercise 3.4 Use alice.txt to create a vocabulary and add to it the “unknown” word “<UNK>”. Use this
vocabulary to create the one-hot, the TF, the log normalised TF, and the TF-IDF representations
of the alice.txt and the dune.txt documents.
Workshop 4: N-Grams
4.1 N-grams
We know that the probability of a sequence of words can be computed as:
P(w_1, w_2, w_3, ..., w_n) = P(w_1) · P(w_2|w_1) · P(w_3|w_1, w_2) · ... · P(w_n|w_1, w_2, w_3, ..., w_{n-1})
Unfortunately, we will never have enough data to compute the probability for any given word sequence. However,
we can make some simplification assumptions and use N-grams to approximate the probabilities.
Let’s compute various N-grams for the document “The new table is red. The blue table is broken.” First, we
load the document, compute the list of lowercase words, remove punctuation marks, and compute the vocabulary:
text = "The new table is red. The blue table is broken."
words_processed = ['the', 'new', 'table', 'is', 'red', 'the', 'blue', 'table', 'is', 'broken']
vocabulary = set() # Create an empty set
for word in words_processed: # Iterate through available words
    vocabulary.add(word) # Add word to set
print("Document:",text)
print("Pre-precessed words:",words_processed)
print("Document size:",len(words_processed))
print("Vocabulary:",vocabulary)
print("Vocabulary size:",len(vocabulary))
The output will look like:
Document: The new table is red. The blue table is broken.
Pre-processed words: ['the', 'new', 'table', 'is', 'red', 'the', 'blue', 'table', 'is', 'broken']
Document size: 10
Vocabulary: {'red', 'new', 'the', 'blue', 'broken', 'is', 'table'}
Vocabulary size: 7
As you can see, the pre-processed document consists of 10 words and uses a vocabulary of 7 words.
4.2 Unigrams (1-Grams)
4.2.1 Compute unigrams
Unigrams (1-Grams) make the assumption that the probability of a word in a sequence of words depends only
on the word itself (0th order Markovian assumption). Let’s now compute the unigrams for the document “The
new table is red. The blue table is broken.”. Each unique word in the vocabulary constitutes a unigram of the
document. Let’s compute the counts of each unigram in our document.
import nltk
from nltk import FreqDist # Import the FreqDist function from NLTK
tf = FreqDist(words_processed) # Compute term frequency of words
print(tf,"\n")
vocabulary = sorted(vocabulary) # Sort alphabetically for better presentation
unigrams = dict() # Create empty dictionary for unigrams
for word in vocabulary:
    unigrams[word] = tf[word]
print(unigrams)
The output will look like:
<FreqDist with 7 samples and 10 outcomes>
{'blue': 1, 'broken': 1, 'is': 2, 'new': 1, 'red': 1, 'table': 2, 'the': 2}
4.2.2 Unigram probability
The probability of a unigram for a word w_n is computed as:
P(w_n) = count(w_n) / (total words) = count(w_n) / Σ_{i=1}^{|V|} count(w_i)
where w_n is a word, V the vocabulary, and |V| the size of the vocabulary. Also, remember that when using
unigrams, it is assumed that P(w_n|w_{n-1}) ≈ P(w_n).
Let’s now compute the probability of each word in the vocabulary:
total_words = len(words_processed) # Compute total words in corpus
unigram_probabilities = dict() # Create empty dictionary for unigram probabilities
for word in unigrams:
    unigram_probabilities[word] = unigrams[word] / total_words # Compute P(w_n)
print("Unigram probabilities:",unigram_probabilities)
The output will look like:
Unigram probabilities:
{'blue': 0.1, 'broken': 0.1, 'is': 0.2, 'new': 0.1, 'red': 0.1, 'table': 0.2, 'the': 0.2}
4.2.3 Sentence probability
Let’s now compute the probability of the sentences “the new table is red” and “the black table” using the
unigrams that we have computed.
P(the new table is red) ≈ P(the) · P(new) · P(table) · P(is) · P(red)
P(the black table) ≈ P(the) · P(black) · P(table)
Keep in mind that for words that don’t exist in our corpus, the probability should be
P(“unknown”) = 0 / (total words) = 0.
from collections import defaultdict
pw = defaultdict(lambda: 0, unigram_probabilities) # Create a dictionary that will return 0 for unknown words
print(pw,"\n")
p_text1 = pw["the"]*pw["new"]*pw["table"]*pw["is"]*pw["red"]
p_text2 = pw["the"]*pw["black"]*pw["table"]
print("P(the new table is red)= %f" % p_text1)
print("P(the black table)= %f" % p_text2)
The output will look like:
defaultdict(<function <lambda> at 0x0131CDB0>, {'blue': 0.1, 'broken': 0.1, 'is': 0.2, 'new': 0.1,
'red': 0.1, 'table': 0.2, 'the': 0.2})
P(the new table is red)= 0.000080
P(the black table)= 0.000000
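As a quick check of the first value: P(the new table is red) ≈ 0.2 · 0.1 · 0.2 · 0.2 · 0.1 = 0.00008, which matches the 0.000080 printed above. The second sentence contains the unknown word “black”, whose probability is 0, so the whole product collapses to 0.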
Note that we used the data structure “defaultdict” (https://docs.python.org/3/library/collections.html#collections.defaultdict)
for storing the unigram probabilities. The reason for not using Python’s default dictionary type is that we need
the probability to default to 0 for any unigram that is unknown and thus does not have a probability associated
with it, as the small example below illustrates.
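A minimal standalone illustration of the difference (the variable names are illustrative):
from collections import defaultdict
plain_dict = {'table': 0.2}
safe_dict = defaultdict(lambda: 0, plain_dict)
print(safe_dict['table']) # 0.2 - the word is known
print(safe_dict['black']) # 0 - the word is unknown, so the lambda supplies the default value
# print(plain_dict['black']) would raise a KeyError instead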
4.2.4 Smoothing
Notice that the probability for the sentence “the black table” is 0, as a result of the word “black” not existing
in our corpus. However, this is a valid sentence in the English language. We will apply Add-λ smoothing in
order to address the issue of zero-valued probabilities for unknown words.
P_Add-λ(w_n) = ( count(w_n) + λ ) / ( λ|V| + Σ_{i=1}^{|V|} count(w_i) )
Let’s now compute again the probability of the sentences “the new table is red” and “the black table” using
the unigrams that we have computed and Add-λ smoothing for λ = 0.001. Remember that the probability of
an unknown word when Add-λ smoothing is used will be:
P_Add-λ(“unknown”) = ( 0 + λ ) / ( λ|V| + Σ_{i=1}^{|V|} count(w_i) ) = λ / ( λ|V| + Σ_{i=1}^{|V|} count(w_i) )
V = len(vocabulary) # Compute the vocabulary size |V|
total_words = len(words_processed) # Compute total words in corpus
l = 0.001 # Define lambda for Add-lambda smoothing
p_unknown = (0 + l) / ( (l*V) + total_words) # Compute the probability of unknown words using add-lambda smoothing
print("P(unknown)=%f\n" % p_unknown)
unigram_probabilities_addl = dict() # Create empty dictionary for unigram probabilities
for word in unigrams:
    unigram_probabilities_addl[word] = (unigrams[word] + l) / (total_words + (l*V)) # Compute P(w_n)
print("Unigram probabilities (Add-lambda smoothing):\n",unigram_probabilities_addl,"\n")
plw = defaultdict(lambda: p_unknown, unigram_probabilities_addl) # Create a dictionary that will return p_unknown for unknown words
pl_text1 = plw["the"]*plw["new"]*plw["table"]*plw["is"]*plw["red"]
pl_text2 = plw["the"]*plw["black"]*plw["table"]
print("P(the new table is red)= %f" % pl_text1)
print("P(the black table)= %f" % pl_text2)
The output will look like:
P(unknown)=0.000100
Unigram probabilities (Add-lambda smoothing):
{'blue': 0.10002997901468971, 'broken': 0.10002997901468971, 'is': 0.1999600279804137, 'new':
0.10002997901468971, 'red': 0.10002997901468971, 'table': 0.1999600279804137, 'the':
0.1999600279804137}
P(the new table is red)= 0.000080
P(the black table)= 0.000004
As you can see above, we can now compute the probability of word sequences that contain words that were not
included in the corpus we used for creating our unigrams.
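As a quick check: P(the black table) ≈ 0.19996 · 0.0001 · 0.19996 ≈ 0.000004. The two known words keep probabilities very close to their unsmoothed values, while the unknown word “black” now contributes the small but non-zero probability P(unknown) = 0.0001 instead of 0.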
4.3 Bigrams (2-Grams)
4.3.1 Compute bigrams
Bigrams (2-Grams) make a 1st order Markovian assumption, i.e. that the probability of a word in a sequence
of words depends only on the immediately preceding word.
Let’s now compute the bigrams for the examined corpus:
from nltk.util import ngrams
text = "The new table is red. The blue table is broken."
words_processed = ['the', 'new', 'table', 'is', 'red', 'the', 'blue', 'table', 'is', 'broken']
bigrams = ngrams(words_processed,2) # Compute the bigrams in the text
bigrams_unique = set() # Create empty set for unique bigrams
for bigram in bigrams:
    print(bigram)
    bigrams_unique.add(bigram) # Add bigram to set
print("\nUnique bigrams:\n",bigrams_unique)
The output will look like:
('the', 'new')
('new', 'table')
('table', 'is')
('is', 'red')
('red', 'the')
('the', 'blue')
('blue', 'table')
('table', 'is')
('is', 'broken')
Unique bigrams:
{('the', 'new'), ('is', 'broken'), ('red', 'the'), ('is', 'red'), ('blue', 'table'), ('the',
'blue'), ('table', 'is'), ('new', 'table')}
4.3.2 Bigram probability
The probability of the bigram (w_{n-1}, w_n) is computed as:
P(w_n|w_{n-1}) = count(w_{n-1}, w_n) / count(w_{n-1})
Let’s now compute the probability of each unique bigram in the examined corpus:
bigrams = ngrams(words_processed,2) # Compute the bigrams in the text
bigram_freq = FreqDist(bigrams).items() # Compute frequency distribution for all the bigrams in the text
print(bigram_freq)
The output will look like:
dict_items([(('the', 'new'), 1), (('new', 'table'), 1), (('table', 'is'), 2), (('is', 'red'), 1),
(('red', 'the'), 1), (('the', 'blue'), 1), (('blue', 'table'), 1), (('is', 'broken'), 1)])
4.3.3 Sentence probability
Let’s now compute again the probability of the sentences “the new table is red” and “the black table” using
the bigrams and the unigrams that we have computed. We will use the symbols <s> and </s> to indicate the
start and the end of a sentence, respectively:
P(<s> the new table is red </s>) ≈
P(the|<s>) · P(new|the) · P(table|new) · P(is|table) · P(red|is) · P(</s>|red) ≈
[count(<s>, the) / count(<s>)] · [count(the, new) / count(the)] · [count(new, table) / count(new)] · [count(table, is) / count(table)] · [count(is, red) / count(is)] · [count(red, </s>) / count(red)]
P(<s> the black table </s>) ≈ P(the|<s>) · P(black|the) · P(table|black) · P(</s>|table) ≈
[count(<s>, the) / count(<s>)] · [count(the, black) / count(the)] · [count(black, table) / count(black)] · [count(table, </s>) / count(table)]
Keep in mind that for bigrams that don’t exist in our corpus, the probability should be P(“unknown”) = 0.
Please note that the use of the tokens <s> and </s> is optional!
text = "The new table is red. The blue table is broken."
# Add tokens indicating the start and end of a sentence in the respective position
text2 = "<s> The new table is red. </s> <s> The blue table is broken. </s>"
words_processed = ['<s>', 'the', 'new', 'table', 'is', 'red', '</s>', '<s>', 'the', 'blue', 'table', 'is', 'broken', '</s>']
vocabulary = set() # Create an empty set
for word in words_processed: # Iterate through available words
    vocabulary.add(word) # Add word to set
tf = FreqDist(words_processed) # Compute term frequency of words
vocabulary = sorted(vocabulary) # Sort alphabetically for better presentation
ugf = dict() # Create empty dictionary for unigram counts
for word in vocabulary:
    ugf[word] = tf[word]
ugf = defaultdict(lambda: 0, ugf) # Create a dictionary that will return 0 for unknown unigrams
print("Unigram counts:",ugf,"\n")
bigrams = ngrams(words_processed,2) # Compute the bigrams in the text
bigram_freq = FreqDist(bigrams).items() # Compute frequency distribution for all the bigrams in the text
print("Bigram counts:",bigram_freq,"\n")
bgf = defaultdict(lambda: 0, bigram_freq) # Create a dictionary that will return 0 for unknown bigrams
def p_big(bigram, bigram_frequencies, unigram_frequencies): # Create function to compute bigram probability
    if(bigram_frequencies[bigram]==0):
        return 0
    else:
        return bigram_frequencies[bigram] / unigram_frequencies[bigram[0]]
p_text1 = p_big(('<s>','the'),bgf,ugf)*p_big(('the','new'),bgf,ugf)*p_big(('new','table'),bgf,ugf)*p_big(('table','is'),bgf,ugf)*p_big(('is','red'),bgf,ugf)*p_big(('red','</s>'),bgf,ugf)
p_text2 = p_big(('<s>','the'),bgf,ugf)*p_big(('the','black'),bgf,ugf)*p_big(('black','table'),bgf,ugf)*p_big(('table','</s>'),bgf,ugf)
print("P(<s> the new table is red </s>)= %f" % p_text1)
print("P(<s> the black table </s>)= %f" % p_text2)
The output will look like:
Unigram counts: {'</s>': 2, '<s>': 2, 'blue': 1, 'broken': 1, 'is': 2, 'new': 1, 'red': 1, 'table': 2, 'the': 2}
Bigram counts: dict_items([(('<s>', 'the'), 2), (('the', 'new'), 1), (('new', 'table'), 1), (('table', 'is'), 2), (('is', 'red'), 1), (('red', '</s>'), 1), (('</s>', '<s>'), 1), (('the', 'blue'), 1), (('blue', 'table'), 1), (('is', 'broken'), 1), (('broken', '</s>'), 1)])
P(<s> the new table is red </s>)= 0.250000
P(<s> the black table </s>)= 0.000000
4.3.4 Smoothing
Notice that the probability for the sentence “<s> the black table </s>” is 0, as a result of the bigrams (the, black), (black, table), and (table, </s>) not existing in our corpus. However, this is a valid sentence in the English language. We will apply Add-λ smoothing in order to address the issue of zero-valued probabilities for unknown bigrams.
P_Add-λ(w_n | w_{n-1}) = (count(w_{n-1}, w_n) + λ) / (λ|V| + count(w_{n-1}))
Let’s now compute again the probability of the sentences “<s> the new table is red </s>” and “<s> the black table </s>” using the bigrams that we have computed and Add-λ smoothing for λ = 0.01.
def pl_big(bigram, bigram_frequencies, unigram_frequencies, l): # Create function to compute bigram probability with add-lambda smoothing
    return (bigram_frequencies[bigram] + l) / ( (l*len(unigram_frequencies)) + unigram_frequencies[bigram[0]])
l = 0.01
pl_text1 = pl_big(('<s>','the'),bgf,ugf,l)*pl_big(('the','new'),bgf,ugf,l)*pl_big(('new','table'),bgf,ugf,l)*pl_big(('table','is'),bgf,ugf,l)*pl_big(('is','red'),bgf,ugf,l)*pl_big(('red','</s>'),bgf,ugf,l)
pl_text2 = pl_big(('<s>','the'),bgf,ugf,l)*pl_big(('the','black'),bgf,ugf,l)*pl_big(('black','table'),bgf,ugf,l)*pl_big(('table','</s>'),bgf,ugf,l)
print("P(<s> the new table is red </s>)= %f" % pl_text1)
print("P(<s> the black table </s>)= %f" % pl_text2)
The output will look like:
P(<s> the new table is red </s>)= 0.185455
P(<s> the black table </s>)= 0.000002
4.4 Trigrams (3-Grams)
4.4.1 Compute trigrams
Trigrams (3-Grams) make a 2nd order Markovian assumption, i.e. that the probability of a word in a sequence of words depends only on the previous two words.
Let’s now compute the trigrams for the examined corpus, after adding the tokens “<s> <s>” and “</s> </s>” at the beginning and end of each sentence respectively.
text = "<s> <s> The new table is red. </s> </s> <s> <s> The blue table is broken. </s> </s>"
words_processed = ['<s>', '<s>', 'the', 'new', 'table', 'is', 'red', '</s>', '</s>', '<s>', '<s>', 'the', 'blue', 'table', 'is', 'broken', '</s>', '</s>']
trigrams = ngrams(words_processed,3) # Compute the trigrams in the text
trigrams_unique = set() # Create empty set for unique trigrams
for trigram in trigrams:
    print(trigram)
    trigrams_unique.add(trigram) # Add trigram to set
print("\nUnique trigrams:\n",trigrams_unique)
The output will look like:
('<s>', '<s>', 'the')
('<s>', 'the', 'new')
('the', 'new', 'table')
('new', 'table', 'is')
('table', 'is', 'red')
('is', 'red', '</s>')
('red', '</s>', '</s>')
('</s>', '</s>', '<s>')
('</s>', '<s>', '<s>')
('<s>', '<s>', 'the')
('<s>', 'the', 'blue')
('the', 'blue', 'table')
('blue', 'table', 'is')
('table', 'is', 'broken')
('is', 'broken', '</s>')
('broken', '</s>', '</s>')
Unique trigrams:
{('<s>', 'the', 'blue'), ('is', 'red', '</s>'), ('</s>', '</s>', '<s>'), ('new', 'table', 'is'), ('blue', 'table', 'is'), ('is', 'broken', '</s>'), ('broken', '</s>', '</s>'), ('table', 'is', 'broken'), ('table', 'is', 'red'), ('<s>', '<s>', 'the'), ('red', '</s>', '</s>'), ('the', 'blue', 'table'), ('</s>', '<s>', '<s>'), ('<s>', 'the', 'new'), ('the', 'new', 'table')}
4.4.2 Trigram probability
The probability of the trigram (w_{n-2}, w_{n-1}, w_n) is computed as:
P(w_n | w_{n-2}, w_{n-1}) = count(w_{n-2}, w_{n-1}, w_n) / count(w_{n-2}, w_{n-1})
Let’s now compute the probability of each unique trigram in the examined corpus:
trigrams = ngrams(words_processed,3) # Compute the trigrams in the text
trigram_freq = FreqDist(trigrams).items() # Compute frequency distribution for all the trigrams in the text
print(trigram_freq)
The output will look like:
dict_items([(('<s>', '<s>', 'the'), 2), (('<s>', 'the', 'new'), 1), (('the', 'new', 'table'), 1), (('new', 'table', 'is'), 1), (('table', 'is', 'red'), 1), (('is', 'red', '</s>'), 1), (('red', '</s>', '</s>'), 1), (('</s>', '</s>', '<s>'), 1), (('</s>', '<s>', '<s>'), 1), (('<s>', 'the', 'blue'), 1), (('the', 'blue', 'table'), 1), (('blue', 'table', 'is'), 1), (('table', 'is', 'broken'), 1), (('is', 'broken', '</s>'), 1), (('broken', '</s>', '</s>'), 1)])
4.4.3 Sentence probability
Let’s now compute again the probability of the sentences “<s> <s> the new table is red </s> </s>” and “<s> <s> the black table </s> </s>” using the trigrams and the bigrams that we have computed.
P(<s> <s> the new table is red </s> </s>) ≈
P(the|<s>, <s>) · P(new|<s>, the) · P(table|the, new) · P(is|new, table) · P(red|table, is) · P(</s>|is, red) · P(</s>|red, </s>) ≈
[count(<s>, <s>, the) / count(<s>, <s>)] · [count(<s>, the, new) / count(<s>, the)] · [count(the, new, table) / count(the, new)] · [count(new, table, is) / count(new, table)] · [count(table, is, red) / count(table, is)] · [count(is, red, </s>) / count(is, red)] · [count(red, </s>, </s>) / count(red, </s>)]
P(<s> <s> the black table </s> </s>) ≈
P(the|<s>, <s>) · P(black|<s>, the) · P(table|the, black) · P(</s>|black, table) · P(</s>|table, </s>) ≈
[count(<s>, <s>, the) / count(<s>, <s>)] · [count(<s>, the, black) / count(<s>, the)] · [count(the, black, table) / count(the, black)] · [count(black, table, </s>) / count(black, table)] · [count(table, </s>, </s>) / count(table, </s>)]
Keep in mind that for trigrams that don’t exist in our corpus, the probability should be P(“unknown”) = 0.
Also, please note that the use of the <s> and </s> tokens is optional!
text = "The new table is red. The blue table is broken."
# Add tokens indicating the start and end of a sentence in the respective position
text3 = "<s> <s> The new table is red. </s> </s> <s> <s> The blue table is broken. </s> </s>"
words_processed = ['<s>', '<s>', 'the', 'new', 'table', 'is', 'red', '</s>', '</s>', '<s>', '<s>', 'the', 'blue', 'table', 'is', 'broken', '</s>', '</s>']
bigrams = ngrams(words_processed,2) # Compute the bigrams in the text
bigram_freq = FreqDist(bigrams).items() # Compute frequency distribution for all the bigrams in the text
print("Bigram counts:",bigram_freq,"\n")
trigrams = ngrams(words_processed,3) # Compute the trigrams in the text
trigram_freq = FreqDist(trigrams).items() # Compute frequency distribution for all the trigrams in the text
print("Trigram counts:",trigram_freq,"\n")
bgf = defaultdict(lambda: 0, bigram_freq) # Create a dictionary that will return 0 for unknown bigrams
tgf = defaultdict(lambda: 0, trigram_freq) # Create a dictionary that will return 0 for unknown trigrams
def p_trig(trigram, trigram_frequencies, bigram_frequencies): # Create function to compute trigram probability
    if(trigram_frequencies[trigram]==0):
        return 0
    else:
        return trigram_frequencies[trigram] / bigram_frequencies[(trigram[0],trigram[1])]
p_text1 = p_trig(('<s>','<s>','the'),tgf,bgf)*p_trig(('<s>','the','new'),tgf,bgf)*p_trig(('the','new','table'),tgf,bgf)*p_trig(('new','table','is'),tgf,bgf)*p_trig(('table','is','red'),tgf,bgf)*p_trig(('is','red','</s>'),tgf,bgf)*p_trig(('red','</s>','</s>'),tgf,bgf)
p_text2 = p_trig(('<s>','<s>','the'),tgf,bgf)*p_trig(('<s>','the','black'),tgf,bgf)*p_trig(('the','black','table'),tgf,bgf)*p_trig(('black','table','</s>'),tgf,bgf)*p_trig(('table','</s>','</s>'),tgf,bgf)
print("P(<s> <s> the new table is red </s> </s>)= %f" % p_text1)
print("P(<s> <s> the black table </s> </s>)= %f" % p_text2)
The output will look like:
Bigram counts: dict_items([(('<s>', '<s>'), 2), (('<s>', 'the'), 2), (('the', 'new'), 1), (('new', 'table'), 1), (('table', 'is'), 2), (('is', 'red'), 1), (('red', '</s>'), 1), (('</s>', '</s>'), 2), (('</s>', '<s>'), 1), (('the', 'blue'), 1), (('blue', 'table'), 1), (('is', 'broken'), 1), (('broken', '</s>'), 1)])
Trigram counts: dict_items([(('<s>', '<s>', 'the'), 2), (('<s>', 'the', 'new'), 1), (('the', 'new', 'table'), 1), (('new', 'table', 'is'), 1), (('table', 'is', 'red'), 1), (('is', 'red', '</s>'), 1), (('red', '</s>', '</s>'), 1), (('</s>', '</s>', '<s>'), 1), (('</s>', '<s>', '<s>'), 1), (('<s>', 'the', 'blue'), 1), (('the', 'blue', 'table'), 1), (('blue', 'table', 'is'), 1), (('table', 'is', 'broken'), 1), (('is', 'broken', '</s>'), 1), (('broken', '</s>', '</s>'), 1)])
P(<s> <s> the new table is red </s> </s>)= 0.250000
P(<s> <s> the black table </s> </s>)= 0.000000
4.4.4 Smoothing
Notice that the probability for the sentence “<s> <s> the black table </s> </s>” is 0, as a result of the trigrams (<s>, the, black), (the, black, table), (black, table, </s>), and (table, </s>, </s>) not existing in our corpus. However, this is a valid sentence in the English language. We will apply Add-λ smoothing in order to address the issue of zero-valued probabilities for unknown trigrams.
P_Add-λ(w_n | w_{n-2}, w_{n-1}) = (count(w_{n-2}, w_{n-1}, w_n) + λ) / (λ|V| + count(w_{n-2}, w_{n-1}))
Let’s now compute again the probability of the sentences “<s> <s> the new table is red </s> </s>” and “<s> <s> the black table </s> </s>” using the trigrams that we have computed and Add-λ smoothing for λ = 0.001.
words_processed = ['<s>', '<s>', 'the', 'new', 'table', 'is', 'red', '</s>', '</s>', '<s>', '<s>', 'the', 'blue', 'table', 'is', 'broken', '</s>', '</s>']
vocabulary = set() # Create an empty set
for word in words_processed: # Iterate through available words
    vocabulary.add(word) # Add word to set
V = len(vocabulary) # Get size of vocabulary
def pl_trig(trigram, trigram_frequencies, bigram_frequencies, l, V): # Create function to compute trigram probability with add-lambda smoothing
    return (trigram_frequencies[trigram] + l) / ( (l*V) + bigram_frequencies[(trigram[0],trigram[1])])
l = 0.001
pl_text1 = pl_trig(('<s>','<s>','the'),tgf,bgf,l,V)*pl_trig(('<s>','the','new'),tgf,bgf,l,V)*pl_trig(('the','new','table'),tgf,bgf,l,V)*pl_trig(('new','table','is'),tgf,bgf,l,V)*pl_trig(('table','is','red'),tgf,bgf,l,V)*pl_trig(('is','red','</s>'),tgf,bgf,l,V)*pl_trig(('red','</s>','</s>'),tgf,bgf,l,V)
pl_text2 = pl_trig(('<s>','<s>','the'),tgf,bgf,l,V)*pl_trig(('<s>','the','black'),tgf,bgf,l,V)*pl_trig(('the','black','table'),tgf,bgf,l,V)*pl_trig(('black','table','</s>'),tgf,bgf,l,V)*pl_trig(('table','</s>','</s>'),tgf,bgf,l,V)
print("P(<s> <s> the new table is red </s> </s>)= %f" % pl_text1)
print("P(<s> <s> the black table </s> </s>)= %f" % pl_text2)
The output will look like:
P(<s> <s> the new table is red </s> </s>)= 0.239523
P(<s> <s> the black table </s> </s>)= 0.000001
4.5 The number underflow issue
Let’s use again the unigram model that we computed in Section 4.2.3 to compute the probability for the sentence
“the new table is the broken blue table”.
text = "the new table is the broken blue table"
words_list = ['the','new','table','is','the','broken','blue','table']
print(pw,"\n")
p = 1
for word in words_list:
    p = p * pw[word]
    print("P(%s)=%f" % (word,pw[word]))
print("\nP(%s)=%f" % (text,p))
The output will look like:
defaultdict(<function <lambda> at 0x017334B0>, {'blue': 0.1, 'broken': 0.1, 'is': 0.2, 'new': 0.1, 'red': 0.1, 'table': 0.2, 'the': 0.2, 'black': 0})
P(the)=0.200000
P(new)=0.100000
P(table)=0.200000
P(is)=0.200000
P(the)=0.200000
P(broken)=0.100000
P(blue)=0.100000
P(table)=0.200000
P(the new table is the broken blue table)=0.000000
As you can see above, the probability of the sentence was printed as 0. But this is not correct: all the words in the sentence have a probability higher than 0. If you use a calculator to compute the probability P(the new table is the broken blue table) = P(the) · P(new) · P(table) · P(is) · P(the) · P(broken) · P(blue) · P(table), the result will be 0.00000032. This value is so small that the “%f” format used above displays it as 0, and for longer word sequences the product of many small probabilities will eventually exceed the precision that the float number type in Python supports, causing the number to underflow and become exactly 0.
To avoid this problem, we typically compute probabilities in log space. Remember that in log space:
log( P(A) · P(B) · P(C) · ... · P(Z) ) = log(P(A)) + log(P(B)) + log(P(C)) + ... + log(P(Z))
import math # Import math library
text = "the new table is the broken blue table"
words_list = ['the','new','table','is','the','broken','blue','table']
logp = 0
for word in words_list:
    logp = logp + math.log(pw[word])
    print("log(P(%s))=%f" % (word,math.log(pw[word])))
print("\nlog(P(%s))=%f" % (text,logp))
The output will look like:
log(P(the))=-1.609438
log(P(new))=-2.302585
log(P(table))=-1.609438
log(P(is))=-1.609438
log(P(the))=-1.609438
log(P(broken))=-2.302585
log(P(blue))=-2.302585
log(P(table))=-1.609438
log(P(the new table is the broken blue table))=-14.954945
As long as all the probabilities are computed in log space, a higher log probability will denote a higher probability, since a > b implies log(a) > log(b). For example, from above: P(the) > P(new) (0.2 > 0.1) and log(P(the)) > log(P(new)) (−1.609438 > −2.302585).
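If the actual probability is ever needed, it can be recovered by exponentiating the log probability. A quick check, assuming the logp variable from the snippet above is still in scope:
import math # Import math library
p = math.exp(logp) # Convert the log probability back to a probability
print(p) # Prints approximately 3.2e-07, i.e. the 0.00000032 computed by hand above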
4.6 Exercises
Exercise 4.1 Create the function get_sentence_probability_unigram(words_list, unigram_frequencies), which given a sentence in the form of a list of words (words_list) and a dictionary with the frequencies of each unigram from a corpus (unigram_frequencies) will return the probability of the sentence based on the unigram language model. Use the function to compute the probability of the sentence “The passage in the castle” based on a unigram model trained on the dune.txt text. Note: Remember to address the number underflow issue.
Exercise 4.2 Use the carroll-alice.txt document from the NLTK Gutenberg corpus to train a unigram, a
bigram, and a trigram model. For simplicity, do not use any tokens for the start and end of a
sentence. Use these models to predict the next word in the following sentences:
(a) we went for a
(b) the food was
(c) the weather was
(d) yesterday she had
(e) she was
(f) since yesterday
Exercise 4.3 Create the function get_sentence_probability_bigram(words_list, unigram_frequencies, bigram_frequencies), which given a sentence in the form of a list of words (words_list), a dictionary with the frequencies of each unigram from a corpus (unigram_frequencies), and a dictionary with the frequencies of each bigram from a corpus (bigram_frequencies) will return the probability of the sentence based on the bigram language model. Use the function to compute the probability of the sentence “It was a warm night” based on a bigram model trained on the dune.txt text. Note: Remember to address the number underflow issue.
Workshop 5: Word embeddings
In this workshop we are going to create word embeddings using word-word co-occurrence matrices based on the
context of each word.
5.1 Word context
To compute the word-word co-occurrence matrix we have to count the occurrences of each word from the
vocabulary, within the context of each word. The context of a word can be set as a specific number of words
prior and after the word, or the whole sentence, or the whole text, or even the whole corpus. Let’s first compute
the context of the word “i” in the text “I like playing tennis. I enjoy sports. Do I enjoy tennis?”, in the form
of a word list for a context size equal to one word before and one after the word “i”.
5.1.1 Load text
First, let’s tokenise the text to create the respective words list.
from nltk import word_tokenize # Import the word_tokenize function from NLTK
from string import punctuation
punctuation_list = list(punctuation) # Convert punctuation to a list
text = "I like playing tennis. I enjoy sports. Do I enjoy tennis?"
tokens = word_tokenize(text.lower()) # Tokenise "text" into words
words_list = []
for word in tokens:
    if(word not in punctuation_list):
        words_list.append(word)
print(text,"->",words_list)
The output will look like:
I like playing tennis. I enjoy sports. Do I enjoy tennis? -> ['i', 'like', 'playing', 'tennis', 'i',
'enjoy', 'sports', 'do', 'i', 'enjoy', 'tennis']
5.1.2 Compute context words
Then let’s compute the words within the context of the word “i”:
context_size = 1
query_word = "i"
context = []
for i in range(len(words_list)): # Iterate through word list
    if(words_list[i] == query_word): # Check if word is the query word
        print("Found '%s' at position %.0f. Context:" % (query_word,i))
        for j in range(i-context_size,i+context_size+1): # Iterate through the context
            if( (j != i) and (j>=0) and (j<len(words_list)) ): # Check for valid word indexes
                context.append(words_list[j]) # Add word to context list
                print("[%.0f][%.0f] %s" % (i,j,words_list[j]))
print("\nContext of '%s' -> %s" % (query_word,context))
The output will look like:
Found 'i' at position 0. Context:
[0][1] like
Found 'i' at position 4. Context:
[4][3] tennis
[4][5] enjoy
Found 'i' at position 8. Context:
[8][7] do
[8][9] enjoy
Context of 'i' -> ['like', 'tennis', 'enjoy', 'do', 'enjoy']
Indeed, if you manually inspect the text, you will see that these are the words that appear within the context of the word “i” when the context is defined as the one previous word and the one after.
5.1.3 Other contexts
Let’s now compute the words within the context of the words “i” and “enjoy” for a context size equal to n
words prior and after each word, for n = 1, 2, 3. To avoid writing the same code multiple times, we will define
a function for computing the context words.
def get_context(word, words_list, context_size):
    context = []
    for i in range(len(words_list)): # Iterate through word list
        if(words_list[i] == word): # Check if word is the query word
            for j in range(i-context_size,i+context_size+1): # Iterate through the context
                if( (j != i) and (j>=0) and (j<len(words_list)) ): # Check for valid word indexes
                    context.append(words_list[j]) # Add word to context list
    return context
print("\nContext (size=%.0f) of '%s' -> %s\n" % (1,"i",get_context("i", words_list,1)))
print("\nContext (size=%.0f) of '%s' -> %s\n" % (2,"i",get_context("i", words_list,2)))
print("\nContext (size=%.0f) of '%s' -> %s\n" % (3,"i",get_context("i", words_list,3)))
print("\nContext (size=%.0f) of '%s' -> %s\n" % (1,"enjoy",get_context("enjoy", words_list,1)))
print("\nContext (size=%.0f) of '%s' -> %s\n" % (2,"enjoy",get_context("enjoy", words_list,2)))
print("\nContext (size=%.0f) of '%s' -> %s\n" % (3,"enjoy",get_context("enjoy", words_list,3)))
The output will look like:
Context (size=1) of 'i' -> ['like', 'tennis', 'enjoy', 'do', 'enjoy']
Context (size=2) of 'i' -> ['like', 'playing', 'playing', 'tennis', 'enjoy', 'sports', 'sports',
'do', 'enjoy', 'tennis']
Context (size=3) of 'i' -> ['like', 'playing', 'tennis', 'like', 'playing', 'tennis', 'enjoy',
'sports', 'do', 'enjoy', 'sports', 'do', 'enjoy', 'tennis']
Context (size=1) of 'enjoy' -> ['i', 'sports', 'i', 'tennis']
Context (size=2) of 'enjoy' -> ['tennis', 'i', 'sports', 'do', 'do', 'i', 'tennis']
Context (size=3) of 'enjoy' -> ['playing', 'tennis', 'i', 'sports', 'do', 'i', 'sports', 'do', 'i',
'tennis']
51
5.2 Word-word co-occurrence matrix
5.2.1 Word-word co-occurrence matrix (Context size = 1)
Let’s now compute the word-word co-occurrence matrix for the text, for a context equal to one word before and
one after the word.
vocabulary = set(words_list) # Create vocabulary of unique words
vocabulary = sorted(list(vocabulary)) # Convert vocabulary to list to preserve ordering and sort it for better presentation
print("Vocabulary:",vocabulary,"\n")
context_size = 1
print("%7s" % "", end='')
for word in vocabulary:
    print("\t%7s" % word, end='')
print("\n")
for word in vocabulary:
    print("%7s" % word,end='')
    context = get_context(word, words_list,context_size)
    for context_word in vocabulary:
        print("\t%7.0f" % context.count(context_word),end='') # Prints the number of times that context_word appears in the context list
    print("\n")
The output will look like:
Vocabulary: ['do', 'enjoy', 'i', 'like', 'playing', 'sports', 'tennis']
do enjoy i like playing sports tennis
do 0 0 1 0 0 1 0
enjoy 0 0 2 0 0 1 1
i 1 2 0 1 0 0 1
like 0 0 1 0 1 0 0
playing 0 0 0 1 0 0 1
sports 1 1 0 0 0 0 0
tennis 0 1 1 0 1 0 0
Note that we used the set type to create a vocabulary of unique words, but we then converted it to a list. Sets in Python do not preserve the order of their contents, nor do they have indexes assigned to each of their elements.
5.2.2 Word-word co-occurrence matrix (Context size = 2)
Let’s now compute the word-word co-occurrence matrix for the text, for a context equal to two words before
and two after the word.
context_size = 2
print("%7s" % "", end='')
for word in vocabulary:
    print("\t%7s" % word, end='')
print("\n")
for word in vocabulary:
    print("%7s" % word,end='')
    context = get_context(word, words_list,context_size)
    for context_word in vocabulary:
        print("\t%7.0f" % context.count(context_word),end='')
    print("\n")
The output will look like:
do enjoy i like playing sports tennis
do 0 2 1 0 0 1 0
enjoy 2 0 2 0 0 1 2
i 1 2 0 1 2 2 2
like 0 0 1 0 1 0 1
playing 0 0 2 1 0 0 1
sports 1 1 2 0 0 0 0
tennis 0 2 2 1 1 0 0
5.2.3 Compute word-word co-occurrence matrix as numpy array
Let’s create a function that given a vocabulary, a text in the form of a word list, and the context size, will
return a numpy array with the word-word co-occurrence matrix.
import numpy as np
def compute_word_word_matrix(vocabulary,words_list,context_size):
    word_word_matrix = np.zeros(( len(vocabulary),len(vocabulary) ), dtype=int) # Create empty array of size VxV
    for i in range(len(vocabulary)):
        context = get_context(vocabulary[i], words_list,context_size)
        for j in range(len(vocabulary)):
            word_word_matrix[i,j] = context.count(vocabulary[j])
    return word_word_matrix
context_size = 2
word_word_matrix = compute_word_word_matrix(vocabulary,words_list,context_size)
print(word_word_matrix)
The output will look like:
[[0 2 1 0 0 1 0]
[2 0 2 0 0 1 2]
[1 2 0 1 2 2 2]
[0 0 1 0 1 0 1]
[0 0 2 1 0 0 1]
[1 1 2 0 0 0 0]
[0 2 2 1 1 0 0]]
5.2.4 Word-word co-occurrence matrix visualisation
Let’s visualise the word-word co-occurrence matrix as a heatmap.
import matplotlib
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
im = ax.imshow(word_word_matrix, cmap='viridis') # Create heatmap using the 'viridis' colour map
# Show all ticks
ax.set_xticks(np.arange(len(vocabulary)))
ax.set_yticks(np.arange(len(vocabulary)))
# Label ticks with the respective list entries
ax.set_xticklabels(vocabulary)
ax.set_yticklabels(vocabulary)
# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",rotation_mode="anchor")
# Loop over data dimensions and create text annotations.
for i in range(len(vocabulary)):
    for j in range(len(vocabulary)):
        text = ax.text(j, i, word_word_matrix[i, j], ha="center", va="center", color="w")
ax.set_title("Word-word co-occurrence matrix (Context size=%.0f)" % context_size)
plt.colorbar(im) # Add colour bar with colour range
plt.show() # Show plot
The output will be a heatmap of the word-word co-occurrence matrix, with each cell annotated with the respective count.
5.3 Word embeddings
5.3.1 Word embeddings computation
We will use the word-word co-occurrence matrix that we have computed to create the word embeddings for the
words in our text’s vocabulary.
def get_word_embedding(word,word_word_matrix,vocabulary):
    word_index = vocabulary.index(word) # Get the word's index. Vocabulary must be of list type
    return word_word_matrix[word_index,:] # Return the word_index-th row of the word-word matrix
word_vectors = dict()
for word in vocabulary:
    word_vectors[word] = get_word_embedding(word,word_word_matrix,vocabulary)
    print(word,"->",get_word_embedding(word,word_word_matrix,vocabulary))
print("\n%s" % word_vectors)
The output will look like:
do -> [0 2 1 0 0 1 0]
enjoy -> [2 0 2 0 0 1 2]
i -> [1 2 0 1 2 2 2]
like -> [0 0 1 0 1 0 1]
playing -> [0 0 2 1 0 0 1]
sports -> [1 1 2 0 0 0 0]
tennis -> [0 2 2 1 1 0 0]
{'do': array([0, 2, 1, 0, 0, 1, 0]), 'enjoy': array([2, 0, 2, 0, 0, 1, 2]), 'i': array([1, 2, 0, 1,
2, 2, 2]), 'like': array([0, 0, 1, 0, 1, 0, 1]), 'playing': array([0, 0, 2, 1, 0, 0, 1]),
'sports': array([1, 1, 2, 0, 0, 0, 0]), 'tennis': array([0, 2, 2, 1, 1, 0, 0])}
5.3.2 Word embeddings visualisation
We can visualise word embeddings as vectors in a V-dimensional space, where V is the size of the vocabulary.
Let’s visualise the vectors for the words “i” and “tennis” in the “enjoy” and “do” dimensions.
index_enjoy = vocabulary.index("enjoy") # Get index of word "enjoy" in vocabulary
index_do = vocabulary.index("do") # Get index of word "do" in vocabulary
# Create word embedding using only the values for the dimensions "enjoy" and "do"
embedding_i = word_vectors["i"][[index_enjoy,index_do]]
print("i ->",embedding_i)
embedding_tennis = word_vectors["tennis"][[index_enjoy,index_do]]
print("tennis ->",embedding_tennis,"\n")
fig = plt.subplots()
plt.plot([0,embedding_tennis[0]], [0,embedding_tennis[1]], 'g', label="tennis") # Plot line from (0,0) to the "tennis" coordinates
plt.plot([0,embedding_i[0]], [0,embedding_i[1]], 'b', label="i") # Plot line from (0,0) to the "i" coordinates
plt.xlabel('enjoy') # Set label for x axis
plt.ylabel('do') # Set label for y axis
plt.legend(loc="upper left") # Show plot legend at upper left location
plt.show() # Show plot
The output will look like:
i -> [2 1]
tennis -> [2 0]
5.3.3 Word embeddings distance
Let’s now compute the pairwise cosine distance for all the words in the vocabulary using the word embeddings
that we computed before.
from scipy.spatial import distance
print("Words cosine distance:")
for word1 in vocabulary:
    for word2 in vocabulary:
        print(word1,"->",word2,"=",distance.cosine(word_vectors[word1],word_vectors[word2]))
The output will look like:
Words cosine distance:
do -> do = 0.0
do -> enjoy = 0.6603168897566213
do -> i = 0.42264973081037427
do -> like = 0.7642977396044841
do -> playing = 0.6666666666666667
do -> sports = 0.33333333333333337
do -> tennis = 0.2254033307585167
enjoy -> do = 0.6603168897566213
enjoy -> enjoy = 0.0
enjoy -> i = 0.4770236396315093
enjoy -> like = 0.3594873847796515
enjoy -> playing = 0.32063377951324257
enjoy -> sports = 0.32063377951324257
enjoy -> tennis = 0.6491767922771884
i -> do = 0.42264973081037427
i -> enjoy = 0.4770236396315093
i -> i = 0.0
i -> like = 0.45566894604818275
i -> playing = 0.7113248654051871
i -> sports = 0.7113248654051871
i -> tennis = 0.47825080525004915
like -> do = 0.7642977396044841
like -> enjoy = 0.3594873847796515
like -> i = 0.45566894604818275
like -> like = 0.0
like -> playing = 0.2928932188134524
like -> sports = 0.5285954792089682
like -> tennis = 0.4522774424948339
playing -> do = 0.6666666666666667
playing -> enjoy = 0.32063377951324257
playing -> i = 0.7113248654051871
playing -> like = 0.2928932188134524
playing -> playing = 0.0
playing -> sports = 0.33333333333333337
playing -> tennis = 0.3545027756320972
sports -> do = 0.33333333333333337
sports -> enjoy = 0.32063377951324257
sports -> i = 0.7113248654051871
sports -> like = 0.5285954792089682
sports -> playing = 0.33333333333333337
sports -> sports = 0.0
sports -> tennis = 0.2254033307585167
tennis -> do = 0.2254033307585167
tennis -> enjoy = 0.6491767922771884
tennis -> i = 0.47825080525004915
tennis -> like = 0.4522774424948339
tennis -> playing = 0.3545027756320972
tennis -> sports = 0.2254033307585167
tennis -> tennis = 0.0
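For reference, distance.cosine(u, v) from scipy returns 1 minus the cosine similarity of u and v. The following is a minimal sketch that verifies one of the values above using plain numpy (it assumes the word_vectors dictionary from Section 5.3.1 is still in scope):
import numpy as np
u = word_vectors["do"]
v = word_vectors["tennis"]
cosine_distance = 1 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)) # 1 - cosine similarity
print(cosine_distance) # Should match the value printed above (approximately 0.2254)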
5.4 Exercises
Exercise 5.1 Create the word-word co-occurrence matrix for the dune.txt text for a context size equal to the
3 words prior and after a word. Visualise the word-word co-occurrence matrix as a heatmap.
Exercise 5.2 Use the word-word co-occurrence matrix that you computed in Exercise 5.1 in order to compute
the respective word embeddings for all words in the vocabulary of dune.txt. Then compute the
pairwise cosine distance between all words in the vocabulary and visualise them as a heatmap.
Exercise 5.3 Create the word-word co-occurrence matrix for the gatsby.txt document, compute the respective
word embeddings, compute the pairwise cosine distance for all words in the vocabulary, and
visualise these distances as a heatmap. Then, select the 5 dimensions (words) with the most
counts and compute the word embeddings for the vocabulary using only these 5 dimensions.
Consider the context as the 10 words prior and after a word.
Workshop 6: Document embeddings for
machine learning
In this workshop, we are going to use word embeddings in order to create document embeddings that will be
used for training machine learning models for the task of classification.
6.1 Loading data with pandas
Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation package, built
on top of the Python programming language. You can find more information about the pandas Python package
from here: https://pandas.pydata.org/
6.1.1 Load dataset file
Let’s load the file testvectors.csv, which contains the word embeddings (of size 300) for a subset of the words
from a 2017 dump of the English Wikipedia. The embeddings were created using the skipgram model and are
stored in a comma-separated file, where each row contains the embedding of a word in 301 columns. The first
column contains the word and the rest 300 columns the respective embedding.
We will use pandas to load the testvectors.csv file and store its content to a pandas dataframe object.
import pandas as pd # Import the pandas library
df = pd.read_csv('testvectors.csv', header=None) # Read csv file. Indicate that there is no row with column titles
print( df.head(5), "\n") # Print the first 5 rows of the dataframe
count_row = df.shape[0] # Gives number of rows
count_col = df.shape[1] # Gives number of columns
print("\nTotal words:",count_row)
print("Dimensions:",count_col-1) # Subtract 1 for the word column
The output will look like:
0 1 2 3 4 5 6 \
0 one 0.073525 -0.031703 0.054010 -0.040015 -0.011894 0.002958
1 time 0.061892 0.066106 0.026482 -0.122901 0.016603 0.024152
2 would -0.005455 -0.064055 0.106359 0.000271 -0.005658 0.017313
3 made -0.033224 -0.046773 -0.022644 -0.115277 -0.037984 0.109500
4 well 0.032183 0.058166 0.102063 -0.054714 -0.037364 0.013909
7 8 9 ... 291 292 293 294 \
0 -0.065406 0.079166 0.099932 ... 0.048875 0.014754 -0.038729 0.033155
1 -0.057542 0.119501 0.033247 ... -0.043285 0.045578 -0.139174 0.109938
2 -0.030339 0.014298 0.030399 ... -0.029693 0.083669 -0.089212 0.063672
3 -0.138488 0.038567 0.001758 ... 0.005144 0.002312 -0.043491 0.091255
4 0.025827 0.072022 -0.050159 ... 0.017727 0.056166 -0.106694 0.046184
295 296 297 298 299 300
0 -0.120390 -0.065746 -0.023745 0.012824 0.005162 -0.130008
1 -0.057847 0.010336 0.114048 0.042011 0.032165 0.051062
2 -0.048096 0.028711 0.032499 0.104135 0.001100 0.024307
3 -0.065307 -0.060637 -0.029168 0.034441 0.024149 0.003539
4 -0.091665 -0.088526 0.016668 0.061213 -0.015307 -0.076759
[5 rows x 301 columns]
Total words: 155
Dimensions: 300
6.1.2 Available words in dataset
Let’s create a list with the available words from the word embeddings dataset from testvectors.csv.
available_words = df[0].tolist() # Convert the first column of the dataframe to a list and store to a variable
print(available_words)
The output will look like:
['one', 'time', 'would', 'made', 'well', 'family', 'use', 'took', 'could', 'home', 'served', 'large',
'like', 'day', 'final', 'near', 'much', 'book', 'came', 'late', 'side', 'started', 'way', 'take',
'without', 'old', 'making', 'field', 'never', 'across', 'see', 'features', 'seen', 'mother',
'either', 'get', 'close', 'reached', 'white', 'change', 'female', 'beginning', 'allowed',
'night', 'week', 'natural', 'ran', 'thought', 'woman', 'room', 'nearly', 'sister', 'acquired',
'whether', 'ancient', 'actually', 'feet', 'bank', 'floor', 'occurred', 'stone', 'twice', 'visit',
'say', 'quite', 'castle', 'think', 'pop', 'shape', 'getting', 'reading', 'nothing', 'boy',
'standing', 'mind', 'ahead', 'weather', 'let', 'door', 'feel', 'step', 'eyes', 'hot', 'hair',
'moment', 'worth', 'afterwards', 'departure', 'shall', 'lay', 'passage', 'watch', 'looked',
'seemed', 'bed', 'sitting', 'pictures', 'feeling', 'hear', 'generations', 'trouble', 'warm',
'suddenly', 'considering', 'burning', 'remarkable', 'bore', 'pink', 'hanging', 'pleasure',
'shadow', 'peer', 'darkness', 'picking', 'tired', 'lamp', 'witch', 'conversations', 'ought',
'pile', 'rabbit', 'hedge', 'curiosity', 'stupid', 'jewels', 'dear', 'twenty-six', 'vaulted',
'awakened', 'wondered', 'bulky', 'frenzy', 'hooded', 'hurried', 'oh', 'fortunately',
'unbearable', 'sleepy', 'flashed', 'glittering', 'in', 'by', 'matted', 'the', 'dimmed', 'of',
'and', 'daisies', 'a', 'to', 'scurrying', 'for', 'crone', 'on', 'is']
6.1.3 Access specific row in pandas dataframe
Let’s now attempt to access a specific row from the pandas dataframe that we have stored our embeddings in.
To access the first row from the dataframe we have to do the following:
vector = df.iloc[0] # Access the 0-th row of the dataframe
print(vector)
The output will look like:
0 one
1 0.073525
2 -0.031703
3 0.05401
4 -0.040015
...
296 -0.065746
297 -0.023745
298 0.012824
299 0.005162
300 -0.130008
Name: 0, Length: 301, dtype: object
6.1.4 Access specific columns in pandas dataframe
Let’s now attempt to access only specific columns from a row in the pandas dataframe. The first column of each row (column 0) contains the word, while the respective embedding is stored in columns 1 to 300. To access columns 1 to 4 (slice 1:5) of the first row, and then the full embedding in columns 1 to 300 (slice 1:301), we have to do the following:
vector = df.iloc[0,1:5] # Access columns 1 to 4 (slice 1:5) of the 0-th row of the dataframe
print(vector)
vector = df.iloc[0,1:301] # Access columns 1 to 300 (slice 1:301) of the 0-th row of the dataframe
print(vector)
The output will look like:
1 0.073525
2 -0.031703
3 0.05401
4 -0.040015
Name: 0, dtype: object
1 0.073525
2 -0.031703
3 0.05401
4 -0.040015
5 -0.011894
...
296 -0.065746
297 -0.023745
298 0.012824
299 0.005162
300 -0.130008
Name: 0, Length: 300, dtype: object
Remember that, similar to lists and array types in Python, indexing in pandas dataframes starts from 0. For example, the first row has an index equal to 0, the second an index equal to 1, and the n-th an index equal to n-1.
We have now stored the embedding for the first word in the dataset in the variable “vector”.
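As a quick illustrative check of this 0-based indexing (based on the dataframe preview printed earlier), the row with index 1 holds the second word of the dataset:
print(df.iloc[1,0]) # Second row, first column: prints the word 'time'
print(df.iloc[1,1:4]) # First three elements of the embedding of the second word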
6.1.5 Convert dataframe to numpy array
The majority of analysis and processing functions require numerical input to be of numpy array type. Let’s convert the word embedding that we stored above in the variable “vector” to a numpy array of the same length.
import numpy as np
word_embedding = vector.to_numpy(dtype=float)
print(word_embedding)
The output will look like:
[ 0.073525 -0.031703 0.05401 -0.040015 -0.011894 0.002958 -0.065406
0.079166 0.099932 -0.134534 0.016396 0.056965 -0.057572 0.010251
0.050909 0.072398 -0.061399 0.162734 0.082844 -0.052345 0.019466
0.058028 0.010818 0.030207 0.000889 -0.021392 0.030583 0.038453
-0.165726 0.112191 -0.058096 -0.111807 -0.033361 0.094921 0.013087
-0.010414 0.073115 -0.058642 0.099049 0.07455 0.035721 -0.059482
0.06792 -0.058917 0.004033 0.040625 -0.106188 0.088392 0.009051
0.019659 -0.026113 -0.09077 0.043568 -0.062696 -0.030408 0.03884
0.004035 0.013125 0.02305 -0.041966 -0.025127 -0.034154 0.016506
-0.079488 -0.039066 0.013653 -0.063375 0.091463 0.05381 0.024699
0.062898 0.12055 -0.048261 -0.017775 0.029978 0.022461 0.094113
-0.098662 0.011137 -0.080405 -0.056159 -0.055153 -0.0306 0.108031
0.082347 0.058076 -0.113915 0.003602 0.019997 -0.033987 0.027162
0.007531 -0.019095 -0.048414 -0.024189 0.04996 0.032027 -0.009882
60 Workshop 6: Document embeddings for machine learning
0.045991 0.055946 -0.028849 -0.019265 -0.041771 -0.06572 -0.024059
-0.002533 0.02972 0.041398 0.123978 0.050968 0.072876 -0.091089
-0.011735 -0.014177 0.058902 -0.145468 -0.12125 0.084168 -0.155788
-0.009348 -0.015042 0.063526 0.039869 0.025571 0.044058 0.019486
0.112362 0.031733 0.039299 -0.051543 -0.022537 -0.026686 -0.100046
0.115309 0.008369 0.01551 -0.065277 0.031222 0.109851 -0.006308
-0.016031 -0.038418 0.034439 0.025142 0.142227 0.04277 -0.01852
-0.005247 -0.021285 -0.019829 0.131366 -0.002935 0.018499 0.040565
-0.03535 -0.075773 -0.017759 0.033599 0.023961 -0.106251 -0.040328
-0.012546 0.006421 -0.082573 -0.031654 -0.010218 0.053183 0.068255
-0.027139 -0.062169 0.043021 0.027036 -0.006469 -0.142859 0.022744
0.000512 -0.065334 -0.052299 -0.017929 0.03619 0.030412 0.022339
0.080582 -0.007923 -0.006414 -0.024119 -0.039354 -0.00177 0.01856
0.079291 -0.037962 0.004094 0.057353 -0.126054 0.039407 -0.047057
0.028695 -0.041185 -0.042427 0.063292 -0.015259 -0.012919 0.029772
0.001388 -0.046082 0.112506 -0.004109 0.020585 0.018128 -0.025253
0.016204 0.035294 -0.042431 -0.014868 -0.141065 -0.073506 -0.021315
0.067625 0.073685 0.023866 -0.010576 -0.042903 -0.071802 -0.071728
0.019136 -0.087325 -0.042621 -0.064981 -0.013045 0.039378 -0.029022
-0.054649 -0.008433 0.03112 -0.018196 -0.003567 0.021799 0.094146
-0.017547 0.036818 0.012625 0.053266 -0.078154 -0.069845 0.019453
0.047343 -0.033813 -0.001188 0.04178 0.010779 0.005534 0.010311
0.093365 -0.010763 0.040849 0.021107 0.047443 -0.017278 -0.068958
0.033909 -0.011073 0.067898 -0.00054 -0.012711 -0.042463 -0.015018
0.051556 -0.023581 -0.064241 0.026324 -0.039002 -0.013781 -0.060861
-0.048069 -0.02534 0.079009 0.102468 0.044573 -0.041997 0.081503
-0.010224 -0.028924 0.008187 -0.062565 0.076221 -0.039867 0.029696
0.094235 -0.072826 0.038318 0.048875 0.014754 -0.038729 0.033155
-0.12039 -0.065746 -0.023745 0.012824 0.005162 -0.130008]
Note that when converting the pandas dataframe to a numpy array using the to_numpy() function, we indicated the type of the numbers to be loaded. In this case, the numbers in the dataset are of float type. If, for example, the numbers were integers, we would have passed the “dtype=int” argument to the to_numpy() function.
6.2 Word embeddings
6.2.1 Load word embeddings
Let’s now create a dictionary that will contain all the word embeddings available in the dataset in numpy array
form. To test the dictionary, we will print the first 7 elements of the embeddings for the words “mother” and
“boy”.
word_embeddings = dict()
for i in range(count_row): # Iterate through all rows in dataframe (words)
    word = df.iloc[i,0] # Get word
    embedding = df.iloc[i,1:count_col].to_numpy(dtype=float) # Get embedding and convert to float numpy array
    word_embeddings[word] = embedding
print("mother ->",word_embeddings["mother"][0:7]) # Print the first 7 elements of the embedding for word "mother"
print("boy ->",word_embeddings["boy"][0:7]) # Print the first 7 elements of the embedding for word "boy"
The output will look like:
mother -> [ 0.007815 0.026617 -0.036383 -0.051246 0.000183 0.071259 0.017416]
boy -> [ 0.040253 0.001048 0.023576 0.003103 -0.027837 0.035486 0.04829 ]
6.2.2 Distance of word embeddings
Let’s now compute and print a matrix with the pairwise cosine distances of the embeddings of the words
“mother”, “boy”, “sister”, “family”, “home”, “rabbit”, and “eyes”.
from scipy.spatial import distance
test_words = ["mother","boy","sister","family","home","rabbit","eyes"]
print("%6s" % "", end="")
for word in test_words:
    print("\t%6s" % word, end="")
print("")
for word1 in test_words:
    print("%6s" % word1, end="")
    for word2 in test_words:
        print("\t%1.4f" % distance.cosine(word_embeddings[word1],word_embeddings[word2]), end="")
    print("")
The output will look like:
mother boy sister family home rabbit eyes
mother 0.0000 0.4318 0.3882 0.4893 0.7175 0.7303 0.7059
boy 0.4318 0.0000 0.6213 0.7388 0.8153 0.6080 0.6594
sister 0.3882 0.6213 0.0000 0.6224 0.7767 0.8759 0.8403
family 0.4893 0.7388 0.6224 0.0000 0.7678 0.8327 0.8262
home 0.7175 0.8153 0.7767 0.7678 0.0000 0.8341 0.8851
rabbit 0.7303 0.6080 0.8759 0.8327 0.8341 0.0000 0.6599
eyes 0.7059 0.6594 0.8403 0.8262 0.8851 0.6599 0.0000
Notice that the words “mother”, “sister”, “boy” and “family” have smaller distances among each other, compared to the other words. Considering that these words are contextually related, this is an indication that the word embeddings are able to encode contextual information about the words.
6.3 Document embeddings
Let’s now use the available word embeddings in order to create the document embeddings for the following
documents:
1. My mother was sitting on the bed
2. The night looked remarkable at the beginning
There are various ways to create document embeddings. One of the most common approaches is to compute
the embeddings for each word in a document and then compute the average embedding (applied element-wise)
across the embeddings of its constituent words.
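In other words, using e(w) to denote the embedding of word w (a notation introduced here just for this explanation), a document d consisting of the N words w_1, ..., w_N for which embeddings are available gets the element-wise mean embedding:
e(d) = (1/N) · (e(w_1) + e(w_2) + ... + e(w_N))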
We will first tokenise the two documents:
from nltk import word_tokenize # Import the word_tokenize function from NLTK
text1 = "My mother was sitting on the bed"
text2 = "The night looked remarkable at the beginning"
tokens1 = word_tokenize(text1.lower()) # Tokenise "text1" into words
tokens2 = word_tokenize(text2.lower()) # Tokenise "text2" into words
words_list1 = []
for word in tokens1:
    words_list1.append(word)
print(text1,"->",words_list1)
words_list2 = []
for word in tokens2:
    words_list2.append(word)
print(text2,"->",words_list2)
The output will look like:
My mother was sitting on the bed -> ['my', 'mother', 'was', 'sitting', 'on', 'the', 'bed']
The night looked remarkable at the beginning -> ['the', 'night', 'looked', 'remarkable', 'at', 'the',
'beginning']
We will now use the dictionary of the word embeddings that we created before in order to retrieve the embedding
for each word in the two documents.
print("Text1 word embeddings:")
for word in words_list1:
    try:
        print(word,"->",word_embeddings[word][0:4]) # Print first 4 elements of embedding
    except:
        print(word,"-> n/a")
print("\nText2 word embeddings:")
for word in words_list2:
    try:
        print(word,"->",word_embeddings[word][0:4]) # Print first 4 elements of embedding
    except:
        print(word,"-> n/a")
The output will look like:
Text1 word embeddings:
my -> n/a
mother -> [ 0.007815 0.026617 -0.036383 -0.051246]
was -> n/a
sitting -> [9.7000e-05 4.1846e-02 8.3230e-02 5.5060e-03]
on -> [ 0.005252 -0.002234 -0.0648 -0.001852]
the -> [ 0.016258 -0.013271 -0.007168 -0.083179]
bed -> [ 0.0882 0.056767 -0.021443 0.014364]
Text2 word embeddings:
the -> [ 0.016258 -0.013271 -0.007168 -0.083179]
night -> [ 0.06248 -0.051441 -0.023803 0.038181]
looked -> [-0.011535 0.016638 0.063261 -0.033455]
remarkable -> [ 0.065266 -0.021301 0.060876 0.068074]
at -> n/a
the -> [ 0.016258 -0.013271 -0.007168 -0.083179]
beginning -> [-0.017209 0.131408 0.024707 -0.036714]
As you can see above, some of the words from the two documents are missing from the word embeddings dataset
that we have. In this case, we will ignore the words with the missing embeddings and process each document
as if these words do not exist.
6.3.1 Compute document embedding (Mean word embedding)
Let’s now create a function that given a document in the form of a list of words, a dictionary of word embeddings
and the size k of the requested document embedding, will return a document embedding of size k, computed
as the first k elements of the mean word embedding of the document’s constituent words. Then, we will print
the embedding of the two documents for k = 5 and k = 20.
def get_document_embedding(word_list,word_embeddings,k):
    document_embedding = np.zeros(k,dtype=float) # Create embedding of k zero-valued elements
    valid_words = 0
    for word in word_list:
        try:
            document_embedding = document_embedding + word_embeddings[word][0:k] # Add word embedding to partial sum
            valid_words += 1
        except:
            pass # If word embedding is not available, then ignore the word
    document_embedding = document_embedding / valid_words # Divide all elements by the number of valid words to get the mean
    return document_embedding
print("Text1 embedding (k=5) ->",get_document_embedding(words_list1,word_embeddings,5),"\n")
print("Text2 embedding (k=5) ->",get_document_embedding(words_list2,word_embeddings,5),"\n")
print("Text1 embedding (k=20) ->",get_document_embedding(words_list1,word_embeddings,20),"\n")
print("Text2 embedding (k=20) ->",get_document_embedding(words_list2,word_embeddings,20),"\n")
The output will look like:
Text1 embedding (k=5) -> [ 0.0235244 0.021945 -0.0093128 -0.0232814 0.0077766]
Text2 embedding (k=5) -> [ 0.02191967 0.008127 0.01845083 -0.021712 -0.008242 ]
Text1 embedding (k=20) -> [ 0.0235244 0.021945 -0.0093128 -0.0232814 0.0077766 0.0664142
-0.0294452 0.0385194 0.0415822 0.0027638 0.038138 -0.0464678
0.0299128 -0.0162724 0.0267094 -0.0193898 -0.0579416 0.0045326
0.0272054 0.002882 ]
Text2 embedding (k=20) -> [ 0.02191967 0.008127 0.01845083 -0.021712 -0.008242 0.05405583
-0.00840717 0.07847117 0.017932 -0.01513817 0.035567 -0.04270783
0.05314033 -0.03148433 -0.0126355 -0.016384 -0.08776767 0.01809233
0.06723833 -0.00525483]
6.3.2 Cosine distance of document embeddings
Let’s now compute the cosine distance between the document embeddings of the two documents for k = 5, 20, 150, 300.
print("Cosine distance of text1 and text2 for k = 5:",distance.cosine(get_document_embedding(words_list1,word_embeddings,5),get_document_embedding(words_list2,word_embeddings,5)))
print("Cosine distance of text1 and text2 for k = 20:",distance.cosine(get_document_embedding(words_list1,word_embeddings,20),get_document_embedding(words_list2,word_embeddings,20)))
print("Cosine distance of text1 and text2 for k = 150:",distance.cosine(get_document_embedding(words_list1,word_embeddings,150),get_document_embedding(words_list2,word_embeddings,150)))
print("Cosine distance of text1 and text2 for k = 300:",distance.cosine(get_document_embedding(words_list1,word_embeddings,300),get_document_embedding(words_list2,word_embeddings,300)))
The output will look like:
Cosine distance of text1 and text2 for k = 5: 0.3855626793684508
Cosine distance of text1 and text2 for k = 20: 0.16427167205198012
Cosine distance of text1 and text2 for k = 150: 0.28613327209110107
Cosine distance of text1 and text2 for k = 300: 0.3327962278223764
6.4 Classification using document embeddings
Let’s now use document embeddings based on our word embeddings dataset to classify movie reviews as having
positive or negative sentiment. We will first load the movie reviews dataset, which consists of 20 files, 10 files
named pos XX.txt that each contains one movie review with positive sentiment, and 10 files named neg XX.txt
that each contains one movie review with negative sentiment. XX is an identification number from 01 to 10.
text = []
label = []
for i in range(1,11):
    filename_pos = "pos_%02d.txt" % i # Create string with the filename for positive sentiment reviews
    filename_neg = "neg_%02d.txt" % i # Create string with the filename for negative sentiment reviews
    print(filename_pos)
    print(filename_neg)
    # Open positive sentiment file
    f = open(filename_pos, "r") # Opens the file for reading only ("r")
    text.append(f.read())
    f.close() # Close the file
    label.append("pos") # Add positive sentiment label to labels list
    # Open negative sentiment file
    f = open(filename_neg, "r") # Opens the file for reading only ("r")
    text.append(f.read())
    f.close() # Close the file
    label.append("neg") # Add negative sentiment label to labels list
print("No of texts:",len(text))
print("No of labels:",len(label))
The output will look like:
pos_01.txt
neg_01.txt
pos_02.txt
neg_02.txt
pos_03.txt
neg_03.txt
pos_04.txt
neg_04.txt
pos_05.txt
neg_05.txt
pos_06.txt
neg_06.txt
pos_07.txt
neg_07.txt
pos_08.txt
neg_08.txt
pos_09.txt
neg_09.txt
pos_10.txt
neg_10.txt
No of texts: 20
No of labels: 20
Now let’s compute the document embedding for each document, using the function that we created before.
from nltk import word_tokenize
from string import punctuation
punctuation_list = list(punctuation)
text_embeddings = []
for i in range(len(text)): # Iterate through all texts
    tokens = word_tokenize(text[i].lower()) # Tokenise "text" into words
    words_list = []
    for word in tokens:
        if(word not in punctuation_list):
            words_list.append(word)
    text_embeddings.append(get_document_embedding(words_list,word_embeddings,300))
for i in range(len(text)): # Iterate through all texts
    print(i,text_embeddings[i][0:5],"->",label[i]) # Print the first 5 elements of each document embedding
The output will look like:
0 [-0.00144049 -0.01854695 0.01171497 -0.05752111 -0.03104189] -> pos
1 [-0.00351737 -0.02884196 0.00754981 -0.03615796 -0.028105 ] -> neg
2 [ 0.01211377 -0.01454308 0.01602908 -0.06364362 -0.02417946] -> pos
3 [ 0.0013629 -0.02655979 0.00252164 -0.05645312 -0.02949102] -> neg
4 [-0.025445 -0.03421304 0.012184 -0.04374732 -0.0253342 ] -> pos
5 [-0.0044761 -0.0279336 0.0044961 -0.04975507 -0.0246164 ] -> neg
6 [ 0.01035458 -0.02045229 0.00215542 -0.0624085 -0.04229192] -> pos
7 [-0.00919744 -0.02292037 0.01568922 -0.0474781 -0.02196531] -> neg
8 [-0.01577481 -0.02448513 0.00704729 -0.04966781 -0.02111129] -> pos
9 [ 0.00515638 -0.01661385 0.00617285 -0.04173085 -0.04571008] -> neg
10 [ 0.00058903 -0.01620987 0.0142112 -0.0417965 -0.0315724 ] -> pos
11 [ 0.0067564 -0.01518371 0.00774966 -0.04712666 -0.03590557] -> neg
12 [-0.00779024 -0.02070952 0.00332203 -0.05560338 -0.0289061 ] -> pos
13 [-0.00285736 -0.0158494 0.008223 -0.05303438 -0.02770218] -> neg
14 [-0.011991 -0.02553873 -0.01264482 -0.05226082 -0.02133141] -> pos
15 [-0.01091673 -0.02792297 0.00905532 -0.0435513 -0.03137696] -> neg
16 [-0.00434465 -0.01706052 -0.00136974 -0.04984022 -0.03162339] -> pos
17 [-0.01131559 -0.01907496 -0.00221819 -0.05785556 -0.01828148] -> neg
18 [ 0.00109655 -0.02176745 0.00632607 -0.0452914 -0.03907748] -> pos
19 [ 0.00515789 -0.01992803 0.02873918 -0.04443174 -0.01608311] -> neg
Note that the sentiment labels are of type string. Many machine learning packages and functions require the
labels to be in numerical format. To achieve this, we will encode our sentiment labels to numbers.
from sklearn import preprocessing
le = preprocessing.LabelEncoder() # Create labelEncoder
labels_encoded=le.fit_transform(label) # Encode labels to numbers
print(label,"->",labels_encoded)
The output will look like:
['pos', 'neg', 'pos', 'neg', 'pos', 'neg', 'pos', 'neg', 'pos', 'neg', 'pos', 'neg', 'pos', 'neg',
'pos', 'neg', 'pos', 'neg', 'pos', 'neg'] -> [1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0]
As you can see above, the negative (“neg”) label is now denoted by 0 and the positive (“pos”) label by 1.
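If we later need to map numerical predictions back to the original string labels, the fitted LabelEncoder provides the reverse mapping as well. A minimal sketch using the le object created above:
decoded = le.inverse_transform([1, 0, 1]) # Map the numerical labels back to the original strings
print(decoded) # Expected to print ['pos' 'neg' 'pos']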
6.4.1 Classification with the k Nearest Neighbour algorithm (kNN)
Let’s use the k Nearest Neighbours algorithm for 3 nearest neighbours in order to classify the movie reviews
in positive or negative. We will first divide our dataset into a training set consisting of 14 samples (7 negative
and 7 positive) and a test set consisting of 6 samples (3 negative and 3 positive). We will then train the kNN
model on the training set and compute the classification performance on the test set.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
# Divide dataset to training (14 samples - 7 positive, 7 negative) and test (6 samples - 3 positive, 3 negative)
training_features = text_embeddings[0:14]
training_labels = labels_encoded[0:14]
test_features = text_embeddings[14:20]
test_labels = labels_encoded[14:20]
model = KNeighborsClassifier(n_neighbors=3)
# Train the model using the training set
model.fit(training_features,training_labels)
#Predict Output
predicted= model.predict(test_features)
print("Prediction :",predicted)
print("True labels:",test_labels)
cm = confusion_matrix(test_labels, predicted) # Create confusion matrix
accuracy = accuracy_score(test_labels, predicted) # Compute classification accuracy
print("Confusion matrix:\n%s" % cm)
print("Accuracy: %.2f%s" % (accuracy*100,"%"))
The output will look like:
Prediction : [1 0 1 1 1 1]
True labels: [1 0 1 0 1 0]
Confusion matrix:
[[1 2]
[0 3]]
Accuracy: 66.67%
As you can see above, the kNN classifier for 3 nearest neighbours was able to classify correctly 4 out of the 6
samples in our test set, reaching a classification accuracy of 66.67%.
6.4.2 Classification with Linear Support Vector Machines (SVM)
Let’s repeat our experiment using a Linear Support Vector Machine (SVM) model.
#Import svm model
from sklearn import svm
#Create a svm Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel
#Train the model using the training set
clf.fit(training_features,training_labels)
#Predict the response for test dataset
predicted = clf.predict(test_features)
print("Prediction :",predicted)
print("True labels:",test_labels)
cm = confusion_matrix(test_labels, predicted)
accuracy = accuracy_score(test_labels, predicted)
print("Confusion matrix:\n%s" % cm)
print("Accuracy: %.2f%s" % (accuracy*100,"%"))
The output will look like:
Prediction : [1 0 1 1 0 0]
True labels: [1 0 1 0 1 0]
Confusion matrix:
[[2 1]
[1 2]]
Accuracy: 66.67%
As you can see above, the SVM classifier achieved the same classification accuracy as the kNN classifier. However, note that it misclassified one sample from each class, whereas the kNN classifier misclassified two samples from the same class. Remember that metrics such as the classification accuracy are not sufficient to provide a complete overview of a machine learning model’s performance.
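For a fuller picture than accuracy alone, we could additionally report per-class precision, recall, and F1-score, for example with scikit-learn's classification_report. A short sketch, assuming the test_labels and predicted variables from the SVM example above are still in scope:
from sklearn.metrics import classification_report
print(classification_report(test_labels, predicted, target_names=list(le.classes_))) # Per-class precision, recall and F1-score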
6.4.3 Classification using Feed-Forward Neural Networks
Let’s now use a Feed-Forward Neural Network for the same task. We will define a neural network that has an input layer of size k (k being the size of the embedding), one hidden layer with 5 neurons, and an output layer of 2 neurons (2 being the number of classes). A ReLU activation function will be used for the hidden layer, while a softmax activation function will be used for the output layer in order to convert the network’s output to class probabilities.
import torch
import torch.nn as nn
import torch.nn.functional as F
RANDOM_SEED = 42
torch.manual_seed(RANDOM_SEED)
class Net(nn.Module):
    def __init__(self,k):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(k, 5) # 1st hidden layer takes an input of size k and has a size of 5 neurons
        self.fc2 = nn.Linear(5, 2) # Output layer has a size of 2 neurons and an input of size 5
    def forward(self, x):
        x = F.relu(self.fc1(x)) # ReLU activation for 1st hidden layer
        x = F.softmax(self.fc2(x), dim=1) # Softmax activation for output layer
        return x
model = Net(300) # Create model for an embedding of size k=300
print(model)
The output will look like:
Net(
(fc1): Linear(in_features=300, out_features=5, bias=True)
(fc2): Linear(in_features=5, out_features=2, bias=True)
)
Now let’s train the network for 2000 epochs, using the Adam optimiser, a learning rate of η = 0.001, and cross
entropy loss as the loss function.
import tqdm # Progress bar for training epochs
import matplotlib.pyplot as plt
device = torch.device("cpu")
# Our data was in lists, but we need to transform them into Numpy arrays and then PyTorch's Tensors
# and then we send them to the chosen device
X_train = torch.from_numpy(np.asarray(training_features)).float().to(device)
y_train = torch.from_numpy(np.asarray(training_labels)).long().to(device)
X_test = torch.from_numpy(np.asarray(test_features)).float().to(device)
y_test = torch.from_numpy(np.asarray(test_labels)).long().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001) # Set learning rate to 0.001
loss_fn = nn.CrossEntropyLoss() # Use cross entropy loss as the loss function
EPOCHS = 2000 # Train model for 2000 epochs
loss_list = np.zeros((EPOCHS,)) # Initialise variable to store loss for each epoch
accuracy_list = np.zeros((EPOCHS,)) # Initialise variable to store the test accuracy for each epoch
for epoch in tqdm.trange(EPOCHS):
y_pred = model(X_train) # Forward pass on the training data
loss = loss_fn(y_pred, y_train) # Compute loss on training data
loss_list[epoch] = loss.item() # Save loss to list
optimizer.zero_grad() # Zero gradients
loss.backward() # Use backpropagation to update weights
optimizer.step()
with torch.no_grad():
y_pred = model(X_test) # Evaluate model on the test data
correct = (torch.argmax(y_pred, dim=1) == y_test).type(torch.FloatTensor)
accuracy_list[epoch] = correct.mean() # Save accuracy to list
# Plot training progress
plt.style.use('ggplot')
fig, (ax1, ax2) = plt.subplots(2, figsize=(12, 6), sharex=True)
ax1.set_ylim([0, 1])
ax1.plot(accuracy_list,'b')
ax1.set_ylabel("validation accuracy")
ax2.set_ylim([0, 1])
ax2.plot(loss_list)
ax2.set_ylabel("training loss")
ax2.set_xlabel("epochs");
print("Max test accuracy: %.2f%s" % (max(accuracy_list)*100,"%"))
The output will look like:
Max test accuracy: 83.33%
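To compare the neural network's predictions with those of the kNN and SVM classifiers, one could also inspect its confusion matrix on the test set after training (a short optional sketch using the variables defined above and the confusion_matrix function already imported earlier in this workshop):
with torch.no_grad():
    y_pred = model(X_test)                     # Forward pass on the test set
    predicted_nn = torch.argmax(y_pred, dim=1) # Pick the class with the highest probability
print("Confusion matrix:\n%s" % confusion_matrix(y_test.numpy(), predicted_nn.numpy()))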
6.5 Exercises
Exercise 6.1 Using the word embeddings from testvectors.csv, compute the word embeddings of size k for each
word in the text “Once upon a time, my family was living in an ancient castle.”. Then compute
the pairwise cosine distance between the words in the text and visualise them as a heatmap.
Repeat this for k = 5, 50, 300. Ignore words for which an embedding is not available.
Exercise 6.2 Repeat the classification task from Section 6.4.1 for an embedding of size k = 50 and an embed-
ding of size k = 200.
Exercise 6.3 Improve the feed-forward neural network's architecture from Section 6.4.3 in order to make it
reach the maximum accuracy earlier (in fewer epochs). Note: Try experimenting with adding more
layers, changing the size of the layers, using different activation functions for the hidden layers,
training for more epochs, changing the learning rate, etc.
Workshop 7: Text Classification Using
Traditional Classifiers
In this workshop, we are going to use traditional machine learning algorithms for the task of text classification
and more specifically for the task of classifying emails as “spam” or “ham” (not spam).
7.1 Introduction to Python classes
Classes provide a means of bundling data and functionality together. Creating a new class creates a new type
of object, allowing new instances of that type to be made. Each class instance can have attributes attached
to it for maintaining its state. Class instances can also have methods (defined by its class) for modifying its
state. Compared with other programming languages, Python’s class mechanism adds classes with a minimum
of new syntax and semantics. It is a mixture of the class mechanisms found in C++ and Modula-3. Python
classes provide all the standard features of Object Oriented Programming: the class inheritance mechanism
allows multiple base classes, a derived class can override any methods of its base class or classes, and a method
can call the method of a base class with the same name. Objects can contain arbitrary amounts and kinds of
data. As is true for modules, classes partake of the dynamic nature of Python: they are created at runtime,
and can be modified further after creation.
More information about Python classes here: https://docs.python.org/3/tutorial/classes.html
7.1.1 A simple Python class
Let’s define a class that describes the number five (5). It will contain a numerical variable equal to 5 and a
string variable equal to “five”.
class Five:
value = 5
name = "five"
no = Five()
print("Value:",no.value)
print("Name:",no.name)
The output will look like:
Value: 5
Name: five
As you can see above, we created a new object of type "Five" and stored it in the variable "no". We can now access
the contents of the object using the variable name, followed by a dot and the name of the element we would
like to access. For example, to access the object's variable "value", we type "no.value".
7.1.2 Definition of class methods
Let’s now define a similar class for the number six (6) and add class methods for acquiring the double of the
number 6 and the previous and next integer number of number 6.
class Six:
value = 6
name = "six"
def get_double(self):
return 2*self.value
def get_previous_and_next_number(self):
previous_no = self.value - 1
next_no = self.value + 1
return (previous_no,next_no)
no = Six()
print("no object:",no)
print("Double of no:",no.get_double())
print("Previous and next of no:",no.get_previous_and_next_number())
The output will look like:
no object: <__main__.Six object at 0x7fa64f8514c0>
Double of no: 12
Previous and next of no: (5, 7)
A dot followed by the name of the element is used to access the elements of a class object. Notice that when
we tried to print the class object, the output stated that it is an object of type Six at a specific location in
memory. This happens because we have not defined what the string representation of the class should be.
7.1.3 Class initialisation and method overloading
Let’s now create a class for representing numbers in general. The class should have a numerical value equal to
the number we would like to depict and support addition, subtraction and multiplication between class objects.
We will also define a string representation for the class, as well as a custom method for acquiring the represented
number in the power of 2.
class MyNumber:
value = 0
# Define how a class object will be initialised
def __init__(self,number):
self.value = number
# Define a string representation for the class object
def __str__(self):
return "%f" % (self.value)
# Define the addition operation between two class objects
def __add__(self,other):
result = self.value + other.value
return MyNumber(result)
# Define the subtraction operation between two class objects
def __sub__(self, other):
return MyNumber(self.value - other.value)
# Define the multiplication operation between two class objects
def __mul__(self, other):
return MyNumber(self.value * other.value)
# Custom function that returns the power of 2 of the number
def get_power_of_2(self):
return self.value*self.value
ten = MyNumber(10)
two = MyNumber(2)
print("ten:",ten)
print("two:",two)
print("Addition:",(ten+two))
print("Subtraction:",(ten-two))
print("Multiplication:",(ten*two))
result = (ten * ten) + (ten * two) + (ten - two) # Compute (10*10)+(10*2)+(10-2)=128
print("(10*10)+(10*2)+(10-2) =",result)
print("[(10*10)+(10*2)+(10-2)]^2 =",result.get_power_of_2())
The output will look like:
ten: 10.000000
two: 2.000000
Addition: 12.000000
Subtraction: 8.000000
Multiplication: 20.000000
(10*10)+(10*2)+(10-2) = 128.000000
[(10*10)+(10*2)+(10-2)]^2 = 16384
As you can see above, we defined the class "MyNumber", which contains the variable "value" for storing its
numerical value. By defining a class method named __init__, we can define what actions will happen when we
create an object of class MyNumber. In this case, we indicated that the constructor of the class should take two
arguments, the class object itself and the number that the object will depict. As a result, the initialisation
method assigns the number that we would like the object to depict to the class variable "value".
To define a string representation for the class MyNumber, we define the method __str__, which returns a string,
in this case the string representation of the number that the class object depicts.
To define how objects of the class MyNumber can be added, subtracted or multiplied with each other, we defined
the methods __add__, __sub__, and __mul__ respectively. These methods take as arguments the class object itself,
as well as another object of class MyNumber, and return the sum, difference and product of their "value" variables
respectively.
Methods whose names start and end with a double underscore (e.g. __init__, __str__, __add__) are special methods
in Python classes that are used for various operations. Overloading them, i.e. replacing their standard definition
with a custom one, alters the behaviour of the class for these operations, e.g. addition, multiplication, printing, etc.
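As another small illustration (not part of the original example), the equality operator == can be overloaded in the same way by defining the special method __eq__ in a subclass of MyNumber:
class MyComparableNumber(MyNumber):
    # Define the equality operation (==) between two class objects
    def __eq__(self, other):
        return self.value == other.value

print("Equal:", MyComparableNumber(5) == MyComparableNumber(5)) # True
print("Equal:", MyComparableNumber(5) == MyComparableNumber(7)) # False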
7.1.4 Definition of a custom class for text documents
Let’s define a class for storing text documents that supports some operations that are helpful in Natural
Language processing, such as tokenising the depicted text into words and preprocessing the text.
from nltk import word_tokenize # Import the word_tokenize function from NLTK
import re # Import the re package
from nltk.corpus import stopwords # Import the stop words lists from NLTK
from string import punctuation
class Document:
text = "" # Variable to store raw text
words = [] # Variable to store word tokenised text
# Define how a class object will be initialised
def __init__(self,textstring):
self.text = textstring
self.words = word_tokenize(self.text)
# Define a string representation for the class object
def __str__(self):
return self.text
# Returns list of lowercase words, omitting punctuation and english stopwords
def get_words_preprocessed(self):
punctuation_list = list(punctuation)
stopwords_english = stopwords.words('english') # Load the stop words list for English
result = []
for word in self.words:
if ((word not in punctuation_list) and (word not in stopwords_english)):
result.append(word.lower())
return result
# Returns list of words that match the regex. Results are lowercased
def get_words_preprocessed_regex(self,regex):
result = []
for word in self.words:
regex_check = re.match(regex, word)
if(regex_check!=None):
if(regex_check.group()==word):
result.append(word.lower())
return result
d = Document("Yesterday we went to the cinema to watch a new movie. The movie was amazing but we paid 20 pounds for each ticket!")
print("Text:",d)
print("\nTokens:",d.words)
print("\nTokens preprocessed:",d.get_words_preprocessed())
print("\nTokens preprocessed with regex '[a-zA-Z]+':",d.get_words_preprocessed_regex("[a-zA-Z]+"))
The output will look like:
Text: Yesterday we went to the cinema to watch a new movie. The movie was amazing but we paid 20
pounds for each ticket!
Tokens: ['Yesterday', 'we', 'went', 'to', 'the', 'cinema', 'to', 'watch', 'a', 'new', 'movie', '.',
'The', 'movie', 'was', 'amazing', 'but', 'we', 'paid', '20', 'pounds', 'for', 'each', 'ticket',
'!']
Tokens preprocessed: ['yesterday', 'went', 'cinema', 'watch', 'new', 'movie', 'the', 'movie',
'amazing', 'paid', '20', 'pounds', 'ticket']
Tokens preprocessed with regex '[a-zA-Z]+': ['yesterday', 'we', 'went', 'to', 'the', 'cinema', 'to',
'watch', 'a', 'new', 'movie', 'the', 'movie', 'was', 'amazing', 'but', 'we', 'paid', 'pounds',
'for', 'each', 'ticket']
7.2 Preparation of spam detection dataset
Let’s use a large dataset with 6046 emails annotated as spam (1) or ham (0) in order to train traditional machine
learning models for the task of classifying text between spam and ham.
7.2.1 Dataset loading
First, we will load the dataset from the completeSpamAssassin.csv file using the Pandas package and print the
first 3 rows. The completeSpamAssassin.csv is a comma separated file that contains three columns. The first is
unnamed and contains an index number for each email, the second is named “Body” and contains the email’s
text, and the third is called “Label” and contains the label of the email as 0 or 1, with 1 referring to spam and
0 to ham.
import pandas as pd
df = pd.read_csv("completeSpamAssassin.csv")
df.head(3)
The output will look like:
Unnamed: 0 Body Label
0 0 \nSave up to 70% on Life Insurance.\nWhy Spend... 1
1 1 1) Fight The Risk of Cancer!\nhttp://www.adcli... 1
2 2 1) Fight The Risk of Cancer!\nhttp://www.adcli... 1
Then, let’s count how many emails are contained in the dataset, create a list of the emails from the dataset,
create a list of the labels in the dataset, and check if we loaded an equal number of emails and email labels.
count_row = df.shape[0] # Gives number of rows
print("Total emails in dataframe:",count_row,"\n")
emails = df["Body"].tolist() # Convert column "Body" of the dataframe to a list and store it in a variable
labels = df["Label"].tolist() # Convert column "Label" of the dataframe to a list and store it in a variable
print("Total emails:",len(emails))
print("Total labels:",len(labels))
The output will look like:
Total emails in dataframe: 6046
Total emails: 6046
Total labels: 6046
The emails are now stored in the list "emails", while the respective labels are stored in the list "labels", which
is of equal size; the label at index i corresponds to the i-th email in the "emails" list.
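For example, to quickly check this pairing (a small optional sanity check, not part of the original workshop), the label of an email can be printed next to the beginning of its text:
i = 0 # Index of the email to inspect
print("Label:", labels[i])
print("Email (first 100 characters):", emails[i][:100])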
7.2.2 Word tokenisation
Let’s now tokenise each email to its lower-cased constituent words and print the first email from the dataset to
inspect the result. We will also remove any email that is empty or consists of only one word.
from nltk import word_tokenize # Import the word_tokenize function from NLTK
emails_tokenised = []
labels_final = []
print("Tokenising emails...",end="")
for i in range(len(emails)):
try:
tokens = word_tokenize(emails[i].lower()) # Tokenise email
if(len(tokens)>1): # Discard empty and single-word emails
emails_tokenised.append(tokens) # Add email tokens to list
labels_final.append(labels[i]) # Add label for valid email to labels list
except:
pass
print("[DONE]\n")
print("Total emails:",len(emails_tokenised))
print("Total labels:",len(labels_final),"\n")
print(emails_tokenised[0]) # Print first email
The output will look like:
Tokenising emails...[DONE]
Total emails: 5507
Total labels: 5507
['save', 'up', 'to', '70', '%', 'on', 'life', 'insurance', '.', 'why', 'spend', 'more', 'than',
'you', 'have', 'to', '?', 'life', 'quote', 'savings', 'ensuring', 'your', 'family', "'s",
'financial', 'security', 'is', 'very', 'important', '.', 'life', 'quote', 'savings', 'makes',
'buying', 'life', 'insurance', 'simple', 'and', 'affordable', '.', 'we', 'provide', 'free',
'access', 'to', 'the', 'very', 'best', 'companies', 'and', 'the', 'lowest', 'rates.life',
'quote', 'savings', 'is', 'fast', ',', 'easy', 'and', 'saves', 'you', 'money', '!', 'let', 'us',
'help', 'you', 'get', 'started', 'with', 'the', 'best', 'values', 'in', 'the', 'country', 'on',
'new', 'coverage', '.', 'you', 'can', 'save', 'hundreds', 'or', 'even', 'thousands', 'of',
'dollars', 'by', 'requesting', 'a', 'free', 'quote', 'from', 'lifequote', 'savings', '.', 'our',
'service', 'will', 'take', 'you', 'less', 'than', '5', 'minutes', 'to', 'complete', '.', 'shop',
'and', 'compare', '.', 'save', 'up', 'to', '70', '%', 'on', 'all', 'types', 'of', 'life',
'insurance', '!', 'click', 'here', 'for', 'your', 'free', 'quote', '!', 'protecting', 'your',
'family', 'is', 'the', 'best', 'investment', 'you', "'ll", 'ever', 'make', '!', 'if', 'you',
'are', 'in', 'receipt', 'of', 'this', 'email', 'in', 'error', 'and/or', 'wish', 'to', 'be',
'removed', 'from', 'our', 'list', ',', 'please', 'click', 'here', 'and', 'type', 'remove', '.',
'if', 'you', 'reside', 'in', 'any', 'state', 'which', 'prohibits', 'e-mail', 'solicitations',
'for', 'insurance', ',', 'please', 'disregard', 'this', 'email', '.']
As you can see above, we successfully tokenised the emails. However, the number of valid emails in the dataset
was reduced to 5507 after removing empty emails and emails consisting of only one word. Furthermore, as you
can see above, the word list for the first email contains punctuation, numbers and some symbols.
7.2.3 Pre-processing
We will further pre-process the emails by removing any word that does not consist only of lowercase letters.
Remember that we have already converted all text to lowercase. As a result, this step will eliminate any word
consisting of symbols, numbers and punctuation. In addition, we will first remove hyphens and dots from words
in order to avoid discarding words containing a hyphen (e.g. lower-case) or abbreviations (e.g. U.K.).
from nltk.corpus import stopwords # Import the stop words lists from NLTK
import re # Import the re package
stopwords_english = stopwords.words('english') # Load the stop words list for English in variable
emails_preprocessed = emails_tokenised.copy() # Create a copy of emails_tokenised
print("Preprocessing emails..",end="")
for i in range(len(emails_tokenised)):
new_tokens = []
for word in emails_tokenised[i]:
word = word.replace("-","") # Remove hyphens from words, e.g. lower-case->lowercase
word = word.replace(".","") # Remove dots from words to normalise abbreviations, e.g. U.K.->UK
# Select only tokens that consist of letters from a to z
regex_check = re.match("[a-z]+", word)
if(regex_check!=None):
if(regex_check.group()==word):
new_tokens.append(word)
emails_preprocessed[i] = new_tokens
print("[DONE]\n")
# Check if pre-processing led to any empty emails
for i in range(len(emails_preprocessed)):
if(len(emails_preprocessed[i])==0):
print("Email",i,"is empty!")
print(emails_preprocessed[0]) # Print first email
The output will look like:
Preprocessing emails..[DONE]
['save', 'up', 'to', 'on', 'life', 'insurance', 'why', 'spend', 'more', 'than', 'you', 'have', 'to',
'life', 'quote', 'savings', 'ensuring', 'your', 'family', 'financial', 'security', 'is', 'very',
'important', 'life', 'quote', 'savings', 'makes', 'buying', 'life', 'insurance', 'simple', 'and',
'affordable', 'we', 'provide', 'free', 'access', 'to', 'the', 'very', 'best', 'companies', 'and',
'the', 'lowest', 'rateslife', 'quote', 'savings', 'is', 'fast', 'easy', 'and', 'saves', 'you',
'money', 'let', 'us', 'help', 'you', 'get', 'started', 'with', 'the', 'best', 'values', 'in',
'the', 'country', 'on', 'new', 'coverage', 'you', 'can', 'save', 'hundreds', 'or', 'even',
'thousands', 'of', 'dollars', 'by', 'requesting', 'a', 'free', 'quote', 'from', 'lifequote',
'savings', 'our', 'service', 'will', 'take', 'you', 'less', 'than', 'minutes', 'to', 'complete',
'shop', 'and', 'compare', 'save', 'up', 'to', 'on', 'all', 'types', 'of', 'life', 'insurance',
'click', 'here', 'for', 'your', 'free', 'quote', 'protecting', 'your', 'family', 'is', 'the',
'best', 'investment', 'you', 'ever', 'make', 'if', 'you', 'are', 'in', 'receipt', 'of', 'this',
'email', 'in', 'error', 'wish', 'to', 'be', 'removed', 'from', 'our', 'list', 'please', 'click',
'here', 'and', 'type', 'remove', 'if', 'you', 'reside', 'in', 'any', 'state', 'which',
'prohibits', 'email', 'solicitations', 'for', 'insurance', 'please', 'disregard', 'this', 'email']
7.2.4 Join email words list
We will then join the list of words for each email into a single string, having a white-space character between
the words. This step is not always mandatory but the method that we will use later for creating the TF-IDF
vector for each email requires its input to be a single string.
dataset = []
for i in range(len(emails_preprocessed)):
text = " ".join(emails_preprocessed[i]) # Join words with an empty space between them
dataset.append(text)
print(dataset[0])
The output will look like:
save up to on life insurance why spend more than you have to life quote savings ensuring your family
financial security is very important life quote savings makes buying life insurance simple and
affordable we provide free access to the very best companies and the lowest rateslife quote
savings is fast easy and saves you money let us help you get started with the best values in the
country on new coverage you can save hundreds or even thousands of dollars by requesting a free
quote from lifequote savings our service will take you less than minutes to complete shop and
compare save up to on all types of life insurance click here for your free quote protecting your
family is the best investment you ever make if you are in receipt of this email in error wish to
be removed from our list please click here and type remove if you reside in any state which
prohibits email solicitations for insurance please disregard this email
7.3 Text classification for spam detection
7.3.1 Splitting of dataset into training and test sets
In order to avoid overfitting the machine learning models that we will create and in order to get a fair estimate
of their performance, we will divide our dataset into a training set containing 80% of the dataset’s samples and
a test set containing the remaining 20% of the samples. Note that the 80-20 split is not mandatory. A 70-30,
60-40, 50-50 or any other split can be used, as long as the training and test sets are kept separate.
from sklearn.model_selection import train_test_split
# Split the dataset into a test set with 20% of the emails and a training set with the remaining 80% of the emails
samples_train, samples_test, labels_train, labels_test = train_test_split(dataset, labels_final,
test_size=0.2, random_state=42)
no_of_training_samples = len(samples_train)
no_of_test_samples = len(samples_test)
total_samples = no_of_training_samples+no_of_test_samples
print("Total samples:\t\t%4d" % total_samples)
print("Training samples:\t%4d (%2.2f%s)" %
(no_of_training_samples,(no_of_training_samples/total_samples)*100,"%"))
print("Test samples:\t\t%4d (%2.2f%s)" %
(no_of_test_samples,(no_of_test_samples/total_samples)*100,"%"))
The output will look like:
Total samples: 5507
Training samples: 4405 (79.99%)
Test samples: 1102 (20.01%)
Note that we set the random state to a specific value in order to always acquire the same split. This
is very useful when you are creating and testing your code but it should NOT be used for real evaluations
of model performance.
7.3.2 Text classification using Naive Bayes
We will now use the training set to train a Naive Bayes classifier in order to predict whether the emails in the
test set are spam or ham. To do this, we will define a model pipeline that first computes the TF-IDF vectors
for the input text and then trains a Multinomial Naive Bayes model using the TF-IDF vectors of the input text
and the respective class labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
# Build the Naive Bayes model by setting a pipeline where the input is first converted
# to TF-IDF vectors and then a Multinomial Naive Bayes is used
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(samples_train, labels_train) # Train the model on the training data
predicted_categories = model.predict(samples_test) # Predict the categories of the test data
print("Predicted:",predicted_categories.tolist()[0:10]) # Print the first 10 predictions
print("Ground truth:",labels_test[0:10]) # Print the first 10 ground truth values
The output will look like:
Predicted: [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
Ground truth: [1, 1, 0, 0, 0, 0, 1, 1, 0, 0]
As you can see above, some of the predictions for the first 10 emails of the test set are not correct.
7.3.3 Computation and plotting of Naive Bayes’s classification performance
Let’s compute the confusion matrix, the accuracy, F1-score, precision, and recall for the Naive Bayes model that
we just tested. We will also plot the confusion matrix to offer a visual description of the model’s performance.
Note that the F1-score, precision, and recall metrics are computed for each class. In order to compute a single
value across our two classes, we have to indicate that we would like to compute the mean across the classes by
setting the argument “average” equal to “macro” (for indicating a macro averaging).
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score
from sklearn.metrics import classification_report
import seaborn as sns
import matplotlib.pyplot as plt
sns.set() # use seaborn plotting style
# Plot the confusion matrix
mat = confusion_matrix(labels_test, predicted_categories)
sns.heatmap(mat.T, square = True, annot=True, fmt = "d")
plt.xlabel("True label")
plt.ylabel("Predicted label")
plt.show()
# Compute and print classification performance metrics
print("Accuracy:\t%f" % accuracy_score(labels_test, predicted_categories))
print("F1-score:\t%f" % f1_score(labels_test, predicted_categories, average='macro'))
print("Precision:\t%f" % precision_score(labels_test, predicted_categories, average='macro'))
print("Recall:\t\t%f" % recall_score(labels_test, predicted_categories, average='macro'))
print("\nClassification performance:\n%s" % classification_report(labels_test, predicted_categories))
The output will look like:
Accuracy: 0.891107
F1-score: 0.854633
Precision: 0.932356
Recall: 0.820896
Classification performance:
precision recall f1-score support
0 0.86 1.00 0.93 767
1 1.00 0.64 0.78 335
accuracy 0.89 1102
macro avg 0.93 0.82 0.85 1102
weighted avg 0.91 0.89 0.88 1102
As you can see above, our model successfully classified 89.11% of our test emails and achieved an F1-score of
85.46%.
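As noted in Section 7.3.1, fixing the random state is convenient during development but not appropriate for a real evaluation. A minimal sketch of a more robust estimate, averaging the Naive Bayes accuracy over several random splits (an illustration that is not part of the original workshop; cross-validation would be the more standard approach), could look like:
import numpy as np
accuracies = []
for i in range(5): # Evaluate over 5 different random splits (the number 5 is an arbitrary choice)
    s_train, s_test, l_train, l_test = train_test_split(dataset, labels_final, test_size=0.2)
    nb_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    nb_model.fit(s_train, l_train)
    accuracies.append(accuracy_score(l_test, nb_model.predict(s_test)))
print("Mean accuracy over 5 random splits: %.2f%%" % (np.mean(accuracies)*100))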
7.3.4 Text classification using kNN
Let’s examine the performance of the k Nearest Neighbours (KNN) classifier for the same task, for k = 3.
from sklearn.neighbors import KNeighborsClassifier
# Build the kNN model by setting a pipeline where the input is first converted
# to TF-IDF vectors and then a kNN classifier for k=3 is used
model = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=3))
model.fit(samples_train, labels_train) # Train the model on the training data
predicted_categories = model.predict(samples_test) # Predict the categories of the test data
print("Predicted:",predicted_categories.tolist()[0:10]) # Print the first 10 predictions
print("Ground truth:",labels_test[0:10]) # Print the first 10 ground truth values
The output will look like:
Predicted: [1, 1, 0, 0, 0, 0, 1, 1, 0, 0]
Ground truth: [1, 1, 0, 0, 0, 0, 1, 1, 0, 0]
7.3.5 Computation and plotting of kNN’s classification performance
# Plot the confusion matrix
mat = confusion_matrix(labels_test, predicted_categories)
sns.heatmap(mat.T, square = True, annot=True, fmt = "d")
plt.xlabel("True label")
plt.ylabel("Predicted label")
plt.show()
# Compute and print classification performance metrics
print("Accuracy:\t%f" % accuracy_score(labels_test, predicted_categories))
print("F1-score:\t%f" % f1_score(labels_test, predicted_categories, average='macro'))
print("Precision:\t%f" % precision_score(labels_test, predicted_categories, average='macro'))
print("Recall:\t\t%f" % recall_score(labels_test, predicted_categories, average='macro'))
print("\nClassification performance:\n%s" % classification_report(labels_test, predicted_categories))
The output will look like:
Accuracy: 0.920145
F1-score: 0.909875
Precision: 0.896428
Recall: 0.930865
Classification performance:
precision recall f1-score support
0 0.98 0.90 0.94 767
1 0.81 0.96 0.88 335
accuracy 0.92 1102
macro avg 0.90 0.93 0.91 1102
weighted avg 0.93 0.92 0.92 1102
As you can see above, kNN (k = 3) achieved a higher classification accuracy and F1-score compared to Naive
Bayes.
7.4 Saving and loading a trained machine learning model
We saw how to train some machine learning models for specific classification tasks. However, it would be a
waste of computational resources to retrain a model every time we would like to use it. The solution is to save
the trained model into a file and load it every time it is needed.
7.4.1 Save trained model in a file for future use
Let’s save the trained kNN model from the previous section in a file called “pickle model 3NN spamemails.pkl”.
import pickle # Import pickle package for object serialisation
# Save to file in the current working directory
pkl_filename = "pickle_model_3NN_spamemails.pkl"
with open(pkl_filename, 'wb') as file: # Open file as binary file for writing (wb)
pickle.dump(model, file)
If you check your working directory, you should now have a file named "pickle_model_3NN_spamemails.pkl".
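This can also be verified programmatically (a small optional check, not part of the original workshop):
import os
print("File exists:", os.path.isfile(pkl_filename)) # Should print True if the model was saved
print("File size (bytes):", os.path.getsize(pkl_filename))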
7.4.2 Loading and use of saved trained model
Let’s now load the model from “pickle model 3NN spamemails.pkl” and use it to classify the emails from our
test set. The results should be identical as the ones shown in Section 7.3.5.
# Load model from pickle file
with open("pickle_model_3NN_spamemails.pkl", 'rb') as file: # Open file as binary file for reading
(rb)
pickle_model = pickle.load(file)
# Use loaded model
predicted_categories = pickle_model.predict(samples_test) # Predict the categories of the test data
# Plot the confusion matrix
mat = confusion_matrix(labels_test, predicted_categories)
sns.heatmap(mat.T, square = True, annot=True, fmt = "d")
plt.xlabel("True label")
plt.ylabel("Predicted label")
plt.show()
# Compute and print classification performance metrics
print("Accuracy:\t%f" % accuracy_score(labels_test, predicted_categories))
print("F1-score:\t%f" % f1_score(labels_test, predicted_categories, average='macro'))
print("Precision:\t%f" % precision_score(labels_test, predicted_categories, average='macro'))
print("Recall:\t\t%f" % recall_score(labels_test, predicted_categories, average='macro'))
print("\nClassification performance:\n%s" % classification_report(labels_test, predicted_categories))
The output will look like:
Accuracy: 0.920145
F1-score: 0.909875
Precision: 0.896428
Recall: 0.930865
Classification performance:
precision recall f1-score support
0 0.98 0.90 0.94 767
1 0.81 0.96 0.88 335
accuracy 0.92 1102
macro avg 0.90 0.93 0.91 1102
weighted avg 0.93 0.92 0.92 1102
As you can see, the results are identical to the ones from Section 7.3.5, as expected.
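This can also be verified with a quick check comparing the two sets of predictions (a minimal sketch, assuming the "model" variable from Section 7.3.4 is still in memory):
import numpy as np
# The loaded model should produce exactly the same predictions as the model object that was saved
same = np.array_equal(model.predict(samples_test), pickle_model.predict(samples_test))
print("Predictions identical:", same) # Expected: True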
7.5 Exercises
Exercise 7.1 Run the kNN experiment from Section 7.3.4 for all k from 1 to 20 and plot the achieved classifi-
cation F1-score vs. the k of the kNN. Note: Create a function to compute the F1-score for each
k.
Exercise 7.2 Preprocess the spam email dataset again in a similar way as in Section 7.2.3, but also remove the
English stopwords. Use Naive Bayes to classify the test set and compare the achieved performance
with the performance from Section 7.3.3. Remember to use the same random state (42) as in
Section 7.3.1 in order to get the same split for the training and test sets.
Exercise 7.3 Repeat Exercise 7.2 but do not apply any preprocessing to the input text apart from converting
all characters to lowercase. Remember to use the same random state as in Exercise 7.2 in order
to get the same split for the training and test sets.
Workshop 8: Text classification using
Recurrent Neural Networks (RNNs)
In this workshop, we are going to use a type of Recurrent Neural Network (RNN) called Long Short-term
Memory (LSTM) for the task of text classification and more specifically for the task of classifying documents
as referring to fake news or real news.
8.1 Dataset preparation
8.1.1 Load fake news dataset
Let’s first load the fake news dataset from the news.csv file. The dataset contains four columns. The first one
is unnamed and denotes an identification number for each text. The second is the title of each text and has the
title “title”. The third contains the main body of each text and has the title “text” and the fourth contains the
label of each text (FAKE, REAL) and has the title “label”. We are going to use Pandas to create a dataframe
with the dataset’s data.
import pandas as pd
df = pd.read_csv("news.csv")
df.head(5)
The output will look like:
Unnamed: 0 title text label
0 8476 You Can Smell Hillary's Fear Daniel Greenfield, a Shillman Journalism Fello... FAKE
1 10294 Watch The Exact Moment Paul Ryan Committed Pol... Google Pinterest Digg Linkedin Reddit
Stumbleu... FAKE
2 3608 Kerry to go to Paris in gesture of sympathy U.S. Secretary of State John F. Kerry said
Mon... REAL
3 10142 Bernie supporters on Twitter erupt in anger ag... - Kaydee King (@KaydeeKing) November 9,
2016 T... FAKE
4 875 The Battle of New York: Why This Primary Matters It's primary day in New York and
front-runners... REAL
8.1.2 Dataset pre-processing
We will then concatenate the title and the main body of each text and create a new column in the dataset
containing the concatenated text. Then, we will convert the label “FAKE” to 1 and the label “REAL” to 0.
Finally, we will remove any column from the dataset apart from the new column and the label column.
df['label'] = (df['label'] == 'FAKE').astype('int') # Set value of label to 1 if FAKE, else to 0 for REAL
df['alltext'] = df['title'] + ". " + df['text'] # Concatenate title and text into column alltext
df = df.reindex(columns=['alltext','label']) # Transform the dataset to contain only the label and alltext columns
df.head(5) # Show first 5 rows in dataset
The output will look like:
alltext label
0 You Can Smell Hillary's Fear. Daniel Greenfiel... 1
1 Watch The Exact Moment Paul Ryan Committed Pol... 1
2 Kerry to go to Paris in gesture of sympathy. U... 0
3 Bernie supporters on Twitter erupt in anger ag... 1
4 The Battle of New York: Why This Primary Matte... 0
Then, we will remove any text that is less than 50 characters long and we will truncate each text to its first 200
words. Finally, we will save the processed dataset to a file called fakenews_processed.csv. Note that the reason
for truncating the texts to 200 words is to reduce the time that will be needed for training. Furthermore, at
this stage we could apply any other pre-processing step, such as removing punctuation, removing stop words,
etc.
# Remove texts that are less than 50 characters long
df.drop(df[df.alltext.str.len() < 50].index, inplace=True)
def truncate_text_to_max_words(text,max_words): # Keep only the first max_words of each text
text = text.split(maxsplit=max_words)
text = ' '.join(text[:max_words])
return text
max_words = 200 # Set the maximum number of words to be considered for each document, for performance reasons
# Truncate text to first 200 words
df['alltext'] = df['alltext'].apply(truncate_text_to_max_words,args=(max_words,))
print("Samples:",len(df['alltext']))
print("Labels:",len(df['label']),"\n")
print(df['alltext'].iloc[0]) # Print first text as an example
df.to_csv('fakenews_processed.csv',index=False) # Save processed dataset to csv file
The output will look like:
Samples: 6327
Labels: 6327
You Can Smell Hillary's Fear. Daniel Greenfield, a Shillman Journalism Fellow at the Freedom Center,
is a New York writer focusing on radical Islam. In the final stretch of the election, Hillary
Rodham Clinton has gone to war with the FBI. The word ``unprecedented'' has been thrown around so
often this election that it ought to be retired. But it's still unprecedented for the nominee of
a major political party to go war with the FBI. But that's exactly what Hillary and her people
have done. Coma patients just waking up now and watching an hour of CNN from their hospital beds
would assume that FBI Director James Comey is Hillary's opponent in this election. The FBI is
under attack by everyone from Obama to CNN. Hillary's people have circulated a letter attacking
Comey. There are currently more media hit pieces lambasting him than targeting Trump. It wouldn't
be too surprising if the Clintons or their allies were to start running attack ads against the
FBI. The FBI's leadership is being warned that the entire left-wing establishment will form a
lynch mob if they continue going after Hillary. And the FBI's credibility is being attacked by
the media and the
8.1.3 Create PyTorch dataset
Let’s load the processed dataset in a form that can be used by PyTorch and inherently supports text-related
operations that are needed.
import torch # PyTorch is needed below (e.g. for torch.float); import it here if not already imported
from torchtext.legacy import data # For handling text data
from nltk import word_tokenize # Import the word_tokenize function from NLTK
TEXT = data.Field(tokenize=word_tokenize,batch_first=True,include_lengths=True) # Create text field for dataset
LABEL = data.LabelField(dtype = torch.float,batch_first=True) # Create label field for dataset
fields = [('text',TEXT),('label', LABEL)]
# Load dataset from csv file
dataset = data.TabularDataset(path = 'fakenews_processed.csv', format = 'csv', fields = fields, skip_header = True)
print(vars(dataset.examples[0])) # Print first text as an example
The output will look like:
{'text': ['You', 'Can', 'Smell', 'Hillary', ''', 's', 'Fear', '.', 'Daniel', 'Greenfield', ',', 'a',
'Shillman', 'Journalism', 'Fellow', 'at', 'the', 'Freedom', 'Center', ',', 'is', 'a', 'New',
'York', 'writer', 'focusing', 'on', 'radical', 'Islam', '.', 'In', 'the', 'final', 'stretch',
'of', 'the', 'election', ',', 'Hillary', 'Rodham', 'Clinton', 'has', 'gone', 'to', 'war', 'with',
'the', 'FBI', '.', 'The', 'word', '``', 'unprecedented', '''', 'has', 'been', 'thrown', 'around',
'so', 'often', 'this', 'election', 'that', 'it', 'ought', 'to', 'be', 'retired', '.', 'But',
'it', ''', 's', 'still', 'unprecedented', 'for', 'the', 'nominee', 'of', 'a', 'major',
'political', 'party', 'to', 'go', 'war', 'with', 'the', 'FBI', '.', 'But', 'that', ''', 's',
'exactly', 'what', 'Hillary', 'and', 'her', 'people', 'have', 'done', '.', 'Coma', 'patients',
'just', 'waking', 'up', 'now', 'and', 'watching', 'an', 'hour', 'of', 'CNN', 'from', 'their',
'hospital', 'beds', 'would', 'assume', 'that', 'FBI', 'Director', 'James', 'Comey', 'is',
'Hillary', ''', 's', 'opponent', 'in', 'this', 'election', '.', 'The', 'FBI', 'is', 'under',
'attack', 'by', 'everyone', 'from', 'Obama', 'to', 'CNN', '.', 'Hillary', ''', 's', 'people',
'have', 'circulated', 'a', 'letter', 'attacking', 'Comey', '.', 'There', 'are', 'currently',
'more', 'media', 'hit', 'pieces', 'lambasting', 'him', 'than', 'targeting', 'Trump', '.', 'It',
'wouldn', ''', 't', 'be', 'too', 'surprising', 'if', 'the', 'Clintons', 'or', 'their', 'allies',
'were', 'to', 'start', 'running', 'attack', 'ads', 'against', 'the', 'FBI', '.', 'The', 'FBI',
''', 's', 'leadership', 'is', 'being', 'warned', 'that', 'the', 'entire', 'left-wing',
'establishment', 'will', 'form', 'a', 'lynch', 'mob', 'if', 'they', 'continue', 'going', 'after',
'Hillary', '.', 'And', 'the', 'FBI', ''', 's', 'credibility', 'is', 'being', 'attacked', 'by',
'the', 'media', 'and', 'the'], 'label': '1'}
8.1.4 Divide dataset into training and test
Then, we will divide the dataset into a training set containing 70% of the dataset’s samples and a test set that
contains 30% of the dataset’s samples.
import random
RANDOM_SEED = 42 # Set random seed for reproducibility. Remove for real applications
# Divide dataset into a training set (70%) and a test set (30%)
training_data, test_data = dataset.split(split_ratio=0.7, random_state = random.seed(RANDOM_SEED))
print("Training samples:",len(training_data))
print("Test samples:",len(test_data))
The output will look like:
Training samples: 4429
Test samples: 1898
8.1.5 Create vocabulary using the training set
We will now use the training set in order to create a vocabulary of the available tokens. Each token will be
associated with an index number that will represent the token. Furthermore, the vocabulary will contain the
token "<unk>" for unknown tokens and the token "<pad>" that can be used to pad text to a specific length.
TEXT.build_vocab(training_data,min_freq=1) # Build vocabulary from training set. Consider words that occur at least 1 time
LABEL.build_vocab(training_data) # Build vocabulary for labels
print("Size of TEXT vocabulary:",len(TEXT.vocab)) # Number of unique tokens in vocabulary
print("Size of LABEL vocabulary:",len(LABEL.vocab),"\n") # Number of unique labels
print("Most common tokens:",TEXT.vocab.freqs.most_common(10),"\n") # Print the 10 most common tokens
in the training set
# Print the index number for the unknown token (<unk>) and the token used for padding (<pad>)
print("Index of unknown word :",TEXT.vocab.stoi['<unk>'])
print("Index of padding word :",TEXT.vocab.stoi['<pad>'])
The output will look like:
Size of TEXT vocabulary: 46677
Size of LABEL vocabulary: 2
Most common tokens: [('the', 40300), (',', 39095), ('.', 32674), ('to', 21202), ('of', 19784), ('a',
16798), ('and', 16694), ('in', 14259), ('that', 9423), (''', 9406)]
Index of unknown word : 0
Index of padding word : 1
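To illustrate what the vocabulary does (a small optional sketch, not part of the original workshop), the tokens of a short text can be mapped to their index numbers; any token that is not in the vocabulary is mapped to the index of <unk>:
sample_tokens = word_tokenize("The FBI is under attack")
sample_indices = [TEXT.vocab.stoi[token] for token in sample_tokens]
print("Tokens :", sample_tokens)
print("Indices:", sample_indices)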
8.1.6 Create iterators for the training and test data
Let’s now create two iterators that can be used to iterate through our training and test data.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') # Set device to GPU if CUDA is available, else CPU
print("Device:",device)
torch.manual_seed(RANDOM_SEED)
BATCH_SIZE = 32 #Set batch size for training
# Create data iterator for training and test sets
training_iterator, test_iterator = data.BucketIterator.splits(
(training_data, test_data),
batch_size = BATCH_SIZE,
sort_key = lambda x: len(x.text),
sort_within_batch=True,
device = device)
The output will look like:
Device: cpu
8.2 Create LSTM architecture
8.2.1 Define network architecture
Let’s define a neural network architecture for the task of text classification using the following layers:
• An embedding layer that converts a text, in the form of a list of numbers corresponding to tokens in the
vocabulary, to a list of embeddings of a required size.
• Two stacked bidirectional LSTM layers with a dropout layer in the output of the last LSTM layer.
• One dense layer with one neuron that uses the sigmoid activation function and will be the output of the
network
Note that although we have two output classes, we only used one neuron in the output layer. This works in
binary classification, where the output is either 0 or 1, as the sigmoid activation function provides an output
between 0 and 1. By rounding the output to the nearest integer (0 or 1) we obtain the predicted class.
If we had more than two classes to predict, the size of the output layer should be equal to the number of
classes and a softmax activation function should be used.
import torch.nn as nn
class FakeNewsNet(nn.Module):
def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers,
bidirectional, dropout):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embedding_dim) # Word embedding layer
self.lstm = nn.LSTM(embedding_dim,
hidden_dim,
num_layers=n_layers,
bidirectional=bidirectional,
dropout=dropout,
batch_first=True) # LSTM layer
self.fc1 = nn.Linear(hidden_dim * 2, output_dim) # Dense layer
self.act = nn.Sigmoid()
def forward(self, text, text_lengths):
embedded = self.embedding(text) # Create embedding of the input text
# Handle padding to ignore padding during training of the RNN
packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths,batch_first=True)
packed_output, (hidden, cell) = self.lstm(packed_embedded)
hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1) # Concatenate the final forward and backward hidden states
dense_outputs=self.fc1(hidden)
outputs=self.act(dense_outputs) # Apply sigmoid activation function to output
return outputs
Note that since the input texts are of variable length, they are padded to a common length within each batch; by
packing the padded sequences (pack_padded_sequence), we instruct the LSTM layers to ignore this padding when
training or testing the model.
8.2.2 Define hyperparameters and initialise model
Let’s define the hyperparameters of our network and initialise a model using the network that we defined.
# Set hyperparameters for network architecture and training
vocabulary_size = len(TEXT.vocab)
embedding_dimensions = 10 # Set to 10 for faster computations. Larger numbers typically required
LSTM_no_of_hidden_nodes = 16 # The number of features in the hidden state h of the LSTM
LSTM_no_of_recurrent_layers = 2 # Number of recurrent layers for RNN (to be stacked)
LSTM_bidirection = True # Set to True for bidirectional LSTM (BiLSTM)
LSTM_dropout = 0.2 # If not 0, introduces a dropout layer in the output of the LSTM
output_size = 1 # Size of output layer
# Initialise the model
model = FakeNewsNet(vocabulary_size, embedding_dimensions, LSTM_no_of_hidden_nodes,
output_size, LSTM_no_of_recurrent_layers,
bidirectional = LSTM_bidirection,dropout = LSTM_dropout)
print("Model architecture:\n",model) # Print model's architecture
def count_parameters(model): # Computes the number of trainable parameters in the model
return sum(p.numel() for p in model.parameters() if p.requires_grad)
print("\nThe model has",count_parameters(model),"trainable parameters")
The output will look like:
Model architecture:
FakeNewsNet(
(embedding): Embedding(46677, 10)
(lstm): LSTM(10, 16, num_layers=2, batch_first=True, dropout=0.2, bidirectional=True)
(fc1): Linear(in_features=32, out_features=1, bias=True)
(act): Sigmoid()
)
The model has 476787 trainable parameters
As you can see above, the selected network architecture with the selected hyperparameters contains 476,787
trainable parameters. Increasing the size of the embeddings and the size and number of the layers will lead to
more trainable parameters and, consequently, to higher computational requirements for training the model.
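To see where these parameters come from, one could print the number of parameters per layer (a small optional sketch using PyTorch's named_parameters(), not part of the original workshop); the embedding layer, with 46677 x 10 weights, accounts for most of the total:
for name, param in model.named_parameters():
    if param.requires_grad: # Only count trainable parameters
        print("%-30s %8d" % (name, param.numel()))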
8.2.3 Define the optimiser, loss function and performance metric
Let’s now define the optimiser, the loss function and the performance metric that will be used for training the
model.
import torch.optim as optim
optimizer = optim.Adam(model.parameters()) # Use the Adam optimiser
criterion = nn.BCELoss() # Use Binary Cross Entropy between the target and the output as the loss function
# Define binary accuracy metric
def binary_accuracy(preds, y):
rounded_preds = torch.round(preds) # Round predictions to the closest integer
correct = (rounded_preds == y).float()
acc = correct.sum() / len(correct)
return acc
# Send the model and loss function to the chosen device
model = model.to(device)
criterion = criterion.to(device)
8.2.4 Define training function
We will now define a function for the training step of the model.
def train(model, iterator, optimizer, criterion):
epoch_loss = 0
epoch_acc = 0
model.train() # Set the model in training phase
for batch in iterator:
optimizer.zero_grad() # Reset the gradients after every batch
text, text_lengths = batch.text # Retrieve text and number of words
predictions = model(text, text_lengths).squeeze() # Convert to 1D tensor
loss = criterion(predictions, batch.label) # Compute the loss
acc = binary_accuracy(predictions, batch.label) # Compute the binary accuracy
loss.backward() # Backpropagation
optimizer.step() # Update the weights
# Update epoch's loss and accuracy
epoch_loss += loss.item()
epoch_acc += acc.item()
return epoch_loss / len(iterator), epoch_acc / len(iterator)
8.2.5 Define evaluation function
Let’s also define a function for evaluating the performance of the model.
def evaluate(model, iterator, criterion):
epoch_loss = 0
epoch_acc = 0
model.eval() # Set the model in evaluation phase
with torch.no_grad(): #Deactivates autograd
for batch in iterator:
text, text_lengths = batch.text # Retrieve text and number of words
predictions = model(text, text_lengths).squeeze() # Convert to 1d tensor
loss = criterion(predictions, batch.label) # Compute loss and accuracy
acc = binary_accuracy(predictions, batch.label)
# Update epoch's loss and accuracy
epoch_loss += loss.item()
epoch_acc += acc.item()
return epoch_loss / len(iterator), epoch_acc / len(iterator)
8.3 Train LSTM model
We will now train the LSTM model for 5 epochs using our training set and also save the model’s weights for
the best-performing epoch in a file. Note that depending on your computer's specifications, this step may take
some time to complete.
import time
N_EPOCHS = 5
best_valid_loss = float('inf')
best_valid_acc = float('inf')
best_epoch = 0
for epoch in range(N_EPOCHS):
print("Epoch %3d:" % epoch,end='')
start = time.time()
#train the model
train_loss, train_acc = train(model, training_iterator, optimizer, criterion)
#evaluate the model
valid_loss, valid_acc = evaluate(model, test_iterator, criterion)
#save the best model
if valid_loss < best_valid_loss:
best_valid_loss = valid_loss
best_valid_acc = valid_acc
best_epoch = epoch
torch.save(model.state_dict(), 'saved_weights.pt') # Save weights
print(" Train loss: %.3f | Train acuracy: %3.4f " % (train_loss,train_acc),end='')
print("| Validation loss: %.3f | Validation acuracy: %3.4f" % (valid_loss,valid_acc),end='')
print(" - %3.2f s" % (time.time()-start))
print("\nBest performance at epoch %d | Loss: %.3f | Accuracy: %3.4f" %
(best_epoch,best_valid_loss,best_valid_acc))
The output will look like:
Epoch 0: Train loss: 0.690 | Train accuracy: 0.5405 | Validation loss: 0.679 | Validation accuracy: 0.5920 - 52.94 s
Epoch 1: Train loss: 0.621 | Train accuracy: 0.6632 | Validation loss: 0.563 | Validation accuracy: 0.7247 - 49.39 s
Epoch 2: Train loss: 0.501 | Train accuracy: 0.7728 | Validation loss: 0.493 | Validation accuracy: 0.7714 - 47.55 s
Epoch 3: Train loss: 0.417 | Train accuracy: 0.8220 | Validation loss: 0.439 | Validation accuracy: 0.8025 - 43.75 s
Epoch 4: Train loss: 0.340 | Train accuracy: 0.8619 | Validation loss: 0.419 | Validation accuracy: 0.8223 - 43.82 s
Best performance at epoch 4 | Loss: 0.419 | Accuracy: 0.8223
As you can see above, the model reached an 82.23% classification accuracy for the test set after 5 epochs of
training. Note that we opted to train the model for only 5 epochs in order to reduce the computational time
needed. In real applications, models should be trained for more epochs, ideally until they stop improving.
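As a rough sketch of what training "until the model stops improving" could look like, the training loop above could be replaced with a simple patience-based early-stopping loop (the patience value and epoch limit below are arbitrary choices, not part of the original workshop):
PATIENCE = 3 # Stop if the validation loss has not improved for 3 consecutive epochs (assumed value)
MAX_EPOCHS = 100 # Upper bound on the number of epochs (assumed value)
epochs_without_improvement = 0
best_valid_loss = float('inf')
for epoch in range(MAX_EPOCHS):
    train_loss, train_acc = train(model, training_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, test_iterator, criterion)
    if valid_loss < best_valid_loss: # Validation loss improved
        best_valid_loss = valid_loss
        epochs_without_improvement = 0
        torch.save(model.state_dict(), 'saved_weights.pt') # Save the best weights so far
    else:
        epochs_without_improvement += 1
    if epochs_without_improvement >= PATIENCE:
        print("Stopping early at epoch", epoch)
        break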
8.4 Classify text using trained model
We will now define a function that takes as an input the trained model and a text and uses the model to predict
whether the input text is fake or real news. Then we will use this function to predict whether the text “Obama
to vote for better social care” is fake or real news.
def predict(model, sentence):
tokenised = [token for token in word_tokenize(sentence)] # Tokenise text
indexed = [TEXT.vocab.stoi[token] for token in tokenised] # Convert tokens to integers
length = [len(indexed)] # Compute number of words
tensor = torch.LongTensor(indexed).to(device) # Convert to PyTorch tensor
tensor = tensor.unsqueeze(1).T # Reshape in form of batch,number of words
length_tensor = torch.LongTensor(length) # Convert to PyTorch tensor
prediction = model(tensor, length_tensor) # Predict text
return int(round(prediction.item()))
label_names = {0: "REAL",1:"FAKE"}
news = "Obama to vote for better social care"
print(news,"->",label_names[predict(model,news)])
The output will look like:
Obama to vote for better social care -> REAL
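Note that the prediction above uses the model's weights as they are at the end of training. To use the weights of the best-performing epoch that were saved in Section 8.3, they can first be loaded back into the model (a short sketch):
model.load_state_dict(torch.load('saved_weights.pt')) # Load the weights saved for the best epoch
model.eval() # Set the model to evaluation mode
print(news,"->",label_names[predict(model,news)])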
8.5 Exercises
Exercise 8.1 Adjust the code for training the model in order to store the loss and accuracy for training and
validation for each epoch. Train the LSTM model again and create two plots, one showing the
training loss and the validation loss, and one showing the training accuracy and the validation
accuracy.
Exercise 8.2 Pre-process the dataset again in order to remove punctuation and stop words. Retrain the LSTM
model for 5 epochs using the new pre-processed dataset and compare its performance with the
performance obtained when punctuation and stop words are not removed.
Exercise 8.3 Define a new LSTM model that does not use a dropout layer, is not bidirectional and consists
of an embedding layer with 100 dimensions, 1 LSTM layer with 16 features for its hidden state
h, and two dense layers, one with 10 neurons and one with 1 neuron (output). Train the model
on the pre-processed dataset from Exercise 8.2 and report its performance.
* Parts of this workshop’s source code are based on the following article: Aravind Pai, “Build Your First Text
Classification model using PyTorch”, Analytics Vidhya, January 28, 2020.
Appendix: Test files
A.1 movies.xml



1971



1974



1979



1982



1983


A.2 links.html







Test page for Text Mining and Language Analytics


Link 1

Link 2

Link 3

Link 4

Link 5



A.3 emails.txt
john+acme.co@hotmail.com
bob@gmail.com
tom@durham.ac.uk
jerry@durham.ac.uk
scrooge@durham.ac.uk
donald@yahoo.co.uk
huey@yahoo.co.uk
dewey@gmail.com
louie.duck@durham.ac.uk
gyro.gearloose@yahoo.co.uk
bart@yahoo.co.uk
homer@gmail.com
stan@hotmail.com
kyle-broflovski@durham.ac.uk
eric@yahoo.co.uk
kenny@gmail.com
butters@durham.ac.uk
wendy@hotmail.com
randy marsh@durham.ac.uk
chef@gmail.com
A.4 cds.xml




Bob Dylan
USA
Columbia
10.90
1985



Bonnie Tylor
UK
CBS Records
9.90
1988



Dolly Parton
USA
RCA
9.90
1982



Gary More
UK
Virgin redords
10.20
1990



Eros Ramazzotti
EU
BMG
9.90
1997



Bee Gees
UK
Polydor
10.90
1998



Dr.Hook
UK
CBS
8.10
1973



Rod Stewart
UK
Pickwick
8.50
1990



Andrea Bocelli
EU
Polydor
10.80
1996



Percy Sledge
USA
Atlantic
8.70
1987



Savage Rose
EU
Mega
10.90
1995



Many
USA
Grammy
10.20
1999



Kenny Rogers
UK
Mucik Master
8.70
1995



Will Smith
USA
Columbia
9.90
1997



Van Morrison
UK
Polydor
8.20
1971



Jorn Hoel
Norway
WEA
7.90
1996



Cat Stevens
UK
Island
8.90
1990



Sam Brown
UK
A and M
8.90
1988



T`Pau
UK
Siren
7.90
1987



Tina Turner
UK
Capitol
8.90
1983



Kim Larsen
EU
Medley
7.80
1983



Luciano Pavarotti
UK
DECCA
9.90
1991



Otis Redding
USA
Atlantic
7.90
1987



Simply Red
EU
Elektra
7.20
1985



The Communards
UK
London
7.80
1987



Joe Cocker
USA
EMI
8.20
1987


A.5 alice.txt
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or
twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ”and
what is the use of a book,” thought Alice ”without pictures or conversations?”
So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and
stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the
daisies, when suddenly a White Rabbit with pink eyes ran close by her.
There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the
Rabbit say to itself, ”Oh dear! Oh dear! I shall be late!” (when she thought it over afterwards, it occurred to
her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit
actually took a watch out of its waistcoat-pocket, and looked at it, and then hurried on, Alice started to her
feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a
watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in
time to see it pop down a large rabbit-hole under the hedge.
A.6 dune.txt
In the week before their departure to Arrakis, when all the final scurrying about had reached a nearly unbearable
frenzy, an old crone came to visit the mother of the boy, Paul.
It was a warm night at Castle Caladan, and the ancient pile of stone that had served the Atreides family as
home for twenty-six generations bore that cooled-sweat feeling it acquired before a change in the weather.
The old woman was let in by the side door down the vaulted passage by Paul’s room and she was allowed a
moment to peer in at him where he lay in his bed.
By the half-light of a suspensor lamp, dimmed and hanging near the floor, the awakened boy could see a bulky
female shape at his door, standing one step ahead of his mother. The old woman was a witch shadow - hair like
matted spiderwebs, hooded ’round darkness of features, eyes like glittering jewels.

