GR5067
Natural Language Processing
Quantitative Methods – Social Sciences (QMSS)
Professor: Patrick Houlihan
Flow Diagram
● Product Description: product usage, product expectation(s), what problems the product solves, internal use case, external use or value
● Data: training data required; sourced 1st party, 2nd party, 3rd party, or as aggregate data
● Data steps: Analyze Data (summarize, visualize) → Prepare Data (transformation, feature selection) → Data Wrangling / Validation → Test
● Modeling steps: Parametric Grid Search → Model Tuning → Model Selection → Performance Measures → Monitor / Recalibration
Flow Diagram
● Data Retrieval → Pre-Processing → Tokenization → Stemming → Vectorization → Train/Test & Validation → Classification → Dimension Reduction (sketched in code below)
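A minimal sketch of this pipeline through the vectorization step, using NLTK and scikit-learn. The toy corpus and variable names are illustrative, not course data.

import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("punkt", quiet=True)          # tokenizer models for word_tokenize

docs = ["The team won the game last night",
        "Stocks fell sharply after the report",
        "A late goal decided the final"]

stemmer = PorterStemmer()

def preprocess(text):
    # pre-processing + tokenization + stemming
    tokens = word_tokenize(text.lower())
    return " ".join(stemmer.stem(t) for t in tokens if t.isalpha())

clean_docs = [preprocess(d) for d in docs]

# vectorization: bag-of-words counts, ready for train/test split, classification, etc.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(clean_docs)
print(vectorizer.get_feature_names_out())
print(X.toarray())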
Topical Extraction
● Determine topics/themes from corpus
● Unsupervised machine learning technique to classify documents into a specific topic
● Latent Dirichlet Allocation (LDA)
Bayes Theorem
● Conditional Probabilities
● Predictors are assumed to be independent of one another
● Let’s say we know the following about the past 100 days:
• It was cloudy on 40 days: P(cloudy) = 40/100 = 0.40
• It rained on 30 days: P(rainy) = 30/100 = 0.30
• It was both rainy and cloudy on 25 days: P(rainy|cloudy) = 25/40 = 0.625
● Using Bayes’ theorem, we can solve for the probability it was cloudy given that it is raining:
● P(cloudy|rainy) = P(rainy|cloudy) * P(cloudy) / P(rainy) = (0.625 * 0.40) / 0.30 ≈ 0.833
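A quick numeric check of the example above in Python:

# P(cloudy|rainy) from the counts on this slide
p_cloudy = 40 / 100              # P(cloudy)
p_rainy = 30 / 100               # P(rainy)
p_rainy_given_cloudy = 25 / 40   # P(rainy | cloudy)

# Bayes' theorem: P(cloudy | rainy) = P(rainy | cloudy) * P(cloudy) / P(rainy)
p_cloudy_given_rainy = p_rainy_given_cloudy * p_cloudy / p_rainy
print(round(p_cloudy_given_rainy, 3))   # 0.833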
Bayes Theorem
Word count (Sports) = 11
Word count (Not Sports) = 9
Unique word count = 14
We add 1 to each word count so that no probability is ever zero: this is called Laplace (Laplacian) smoothing.
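A minimal sketch of the smoothed word probabilities, using the counts above (11 words in Sports, 9 in Not Sports, 14 unique words). The test word and its counts are hypothetical.

total_sports, total_not_sports, vocab_size = 11, 9, 14

def p_word(count_in_class, total_in_class):
    # add-one (Laplace) smoothing: never returns zero, even for unseen words
    return (count_in_class + 1) / (total_in_class + vocab_size)

# hypothetical counts of the word "game" in each class
print(p_word(2, total_sports))       # P("game" | Sports)
print(p_word(0, total_not_sports))   # P("game" | Not Sports) -> small but non-zero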
What is Latent Dirichlet Allocation (LDA)?
A generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of latent topics. Each observed word originates from a topic that we do not directly observe. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities.
What is it used for?
The fitted model can be used to estimate the similarity between documents, as well as between a set of specified keywords, using an additional layer of latent variables which are referred to as topics.
How is it related to text mining and other machine learning techniques?
Topic models can be seen as classical text mining or natural language processing tools. Fitting topic models to data structures produced by text mining is usually done by considering the problem of modeling text corpora and other collections of discrete data. One advantage of LDA over related latent variable models is that it provides well-defined inference procedures for previously unseen documents (LSI, by contrast, uses a singular value decomposition).
Latent Dirichlet Allocation
● LDA models each document as a mixture of topics
● p(topic t | document d)
○ Similar to a transformation, except here we determine the probability that a given topic appears in a specific document
■ How would you go about calculating this?
● p(word w | topic t)
○ Captures how many of the assignments to a specific topic, t, across all documents, are due to a specific word, w
● Probability that word w belongs to topic z (a toy numeric sketch follows below):
p(word w with topic z) = p(topic z | document d) * p(word w | topic z)
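A toy numeric sketch of this formula, assuming we already have estimated document-topic and topic-word proportions; the numbers and names are illustrative.

import numpy as np

# toy estimates for one document and a 3-word vocabulary
p_topic_given_doc = np.array([0.7, 0.3])         # p(topic z | document d) for topics z0, z1
p_word_given_topic = np.array([[0.5, 0.4, 0.1],  # p(word w | topic z0)
                               [0.2, 0.2, 0.6]]) # p(word w | topic z1)

word_index = 2   # the word whose topic assignment we want to score

# p(word w with topic z) = p(topic z | document d) * p(word w | topic z)
scores = p_topic_given_doc * p_word_given_topic[:, word_index]
print(scores / scores.sum())   # normalized probability of each topic for this word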
Latent Dirichlet Allocation
● Measures
○ Topic Coherence: scores a single topic by measuring the degree of semantic similarity between the high-scoring words in the topic
■ C_v is based on a sliding window, a one-set segmentation of the top words, and an indirect confirmation measure that uses normalized pointwise mutual information (NPMI) and cosine similarity
■ C_p is based on a sliding window, a one-preceding segmentation of the top words, and the confirmation measure of Fitelson’s coherence
■ C_uci is based on a sliding window and the pointwise mutual information (PMI) of all word pairs of the given top words
■ C_umass is based on document co-occurrence counts, a one-preceding segmentation, and a logarithmic conditional probability as the confirmation measure
■ C_npmi is an enhanced version of the C_uci coherence that uses normalized pointwise mutual information (NPMI)
■ C_a is based on a context window, a pairwise comparison of the top words, and an indirect confirmation measure that uses normalized pointwise mutual information (NPMI) and cosine similarity
○ Perplexity Score: measures how well the model fits held-out data; the lower the score, the better
○ Gensim API: https://radimrehurek.com/gensim/models/ldamodel.html (see the usage sketch below)
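A minimal gensim sketch of fitting an LDA model and scoring it with coherence and perplexity; the toy corpus, number of topics, and variable names are illustrative, not course data.

from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

texts = [["ball", "team", "game", "score"],
         ["stock", "market", "price", "trade"],
         ["team", "game", "win", "goal"],
         ["price", "stock", "rally", "market"]]

dictionary = Dictionary(texts)                   # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words vectors

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               random_state=0, passes=10)
print(lda.print_topics())                        # top words per topic

coherence = CoherenceModel(model=lda, texts=texts,
                           dictionary=dictionary, coherence="c_v")
print(coherence.get_coherence())                 # C_v topic coherence
print(lda.log_perplexity(corpus))                # per-word likelihood bound (related to perplexity)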
NUMPY CHEAT SHEET
Python Math Commands
Command name            Description
abs(value)              absolute value
ceil(value)             rounds up
cos(value)              cosine, in radians
floor(value)            rounds down
log(value)              logarithm, base e
log10(value)            logarithm, base 10
max(value1, value2)     larger of two values
min(value1, value2)     smaller of two values
round(value)            nearest whole number
sin(value)              sine, in radians
sqrt(value)             square root
To use some of these you need to import the math library: from math import *
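A few of these commands in action (abs, max, min, and round are built in; the others come from the math module):

from math import ceil, floor, log, log10, sqrt, sin, cos

print(abs(-7))                 # 7
print(ceil(2.1), floor(2.9))   # 3 2
print(log(2.718281828))        # ~1.0 (natural log)
print(log10(1000))             # 3.0
print(max(3, 9), min(3, 9))    # 9 3
print(round(2.5))              # 2 (Python 3 rounds ties to the nearest even number)
print(sqrt(16))                # 4.0
print(sin(0.0), cos(0.0))      # 0.0 1.0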
Python
Regular Expressions
● Regular Expressions, aka regex
● Cheat Sheet
● Python’s standard library for regex is called re (regular expressions); a small example follows below
○ import re
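A small re sketch: find dates in a string. The pattern and the example text are illustrative.

import re

text = "Reports filed on 2021-03-15 and 2021-04-02 were approved."

# \d{4}-\d{2}-\d{2} matches a yyyy-mm-dd style date
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
print(dates)                   # ['2021-03-15', '2021-04-02']

match = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)
if match:
    print(match.group(1))      # '2021' -- the first captured group (the year)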
Set Logic
● Intersection: s1 & s2
● Union: s1 | s2
● Set Difference: s1 - s2
● Set Symmetric Difference: s1 ^ s2
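The set operators above on two small example sets:

s1 = {1, 2, 3, 4}
s2 = {3, 4, 5}

print(s1 & s2)   # intersection: {3, 4}
print(s1 | s2)   # union: {1, 2, 3, 4, 5}
print(s1 - s2)   # difference: {1, 2}
print(s1 ^ s2)   # symmetric difference: {1, 2, 5}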
DATETIME
● Dates and times are one of the more frustrating parts of most programming languages
● Most typical datetime formats (parsed in the sketch below):
○ from datetime import *
○ yyyymmdd
○ yyyymmdd:hhmmss
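Parsing and formatting those two patterns with Python’s datetime module; the example values are illustrative.

from datetime import datetime

# yyyymmdd
d = datetime.strptime("20210315", "%Y%m%d")
print(d.strftime("%Y%m%d"))               # '20210315'

# yyyymmdd:hhmmss
dt = datetime.strptime("20210315:134500", "%Y%m%d:%H%M%S")
print(dt.strftime("%Y-%m-%d %H:%M:%S"))   # '2021-03-15 13:45:00'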
PANDAS