PUBL0055: Introduction to Quantitative Methods
Lecture 1: Introduction
Jack Blumenau and Benjamin Lauderdale
Lecture Outline
Course Outline
Logistics
Quantitative Methods and Research Design
Introduction to Quantitative Data
Conclusion
Course Outline
What is this course?
• This is not a course on statistics
• A statistics course would focus on the theory and derivation of
statistical methods
• We will discuss some theory at a basic level, but will not concern
ourselves with the derivation
• This is a course on applied quantitative research methods
• Focus on developing intuition about quantitative methods
• Focus on using these methods to answer social science questions
• This course is different to other similar courses
• Stronger focus on causality, data visualisation, and application
• Less focus on sampling, statistical inference and uncertainty
What is in this course?
1. Introduction
2. Causality
3. Describing Quantitative Data
4. Regression I (Prediction)
5. Regression II (Specification)
7. Regression III (Causality)
8. Panel Data
9. Sampling, Uncertainty and Confidence Intervals
10. Hypothesis Testing & Uncertainty in Regression
11. Additional Topics / Summing Up
Week 6 is reading week. There will be no lecture, but you will have a
midterm assessment.
Why should you take research methods?
• This is a course on quantitative methods, not all research methods
• Many of you will also take a qualitative methods module –
PUBL0010, PUBL0085, or PUBL0058
• The science of the ‘social sciences’ comes from the methodological
rigour of the approaches you will learn in these courses
• These courses will…
• …provide you with the tools necessary to conduct social scientific
research (relevant for writing your dissertations)
• …help you to better understand and evaluate quantitative claims
(relevant for evaluating plausibility of current research)
• …help you to think more critically about evidence-based arguments
made in the ‘real world’ (relevant for being a good human being)
Why should you take quantitative research methods?
You will learn…
• …to apply a wide range of quantitative methods to answering your
potential research questions
• …the types of questions that can (and cannot) be answered using
quantitative analysis
• …to make more persuasive arguments using quantitative data
• …to evaluate the quantitative evidence others present in their work
• …some transferable skills
Logistics
Course Website
• The course website has several important resources for this module
• Weekly class assignments and datasets
• The website can be found at https://uclspp.github.io/PUBL0055/
Moodle
• In addition to the course website, Moodle access is essential for this
course
• Lecture slides and recordings
• Links to office hours signups
• Assessments
• Students will be automatically enrolled in Week 1, but you can get
access sooner by enrolling manually:
• Enrollment key for PUBL0055 – regression
Lecturers
Jack Blumenau
• E-mail: j.blumenau@ucl.ac.uk
• Office Hours: Sign up via link on Moodle
Benjamin Lauderdale
• E-mail: b.lauderdale@ucl.ac.uk
• Office Hours: Sign up via link on Moodle
Teaching fellows
• Chris Butler
• Lorenzo Crippa
• Julia De Romémont
• Eleanor Iob
• Thiago Rodrigues-Oliveira
• Stephanie Thiehoff
Please check the Moodle page for their office hour times and links to sign
up.
Introduction or Advanced?
We offer two quantitative methods modules at the MSc level:
                          Introduction             Advanced
Term                      One                      Two
Pre-requisites (methods)  None                     One prior course
Pre-requisites (R)        None                     None
Substantive focus         Intro to quant methods   Causal inference
On most of the MSc programmes, it is possible for you to take both this
course and the advanced course.
Which course should I take?
• This course has no pre-requisites: we will assume that you have no
prior experience in either quantitative methods, or in coding
• The Advanced course requires you to have at least one prior course
in quantitative methods/econometrics up to the level that we cover on
this course
• If you are unsure which course to take
1. Take this quiz
2. Book an office hour appointment to speak to Jack
Learning objectives
1. Understand the key tools used in modern quantitative methods
2. Understand which questions are and are not amenable to
quantitative analysis
3. Improve ability to critically evaluate published work
4. Learn to implement key skills in R
Teaching philosophy
1. Building intuition is central to understanding statistical concepts
2. Examples and applications are central to building intuition
3. You cannot learn statistics or quantitative methods without
analysing data on your own
Textbook
• New book which includes many
social science examples and
focuses on R code
implementations
• We recommend buying this
book, although some copies
are available in the library
• We will provide additional
notes on some topics
Advice on reading for this course
Statistical readings can be intimidating. On this course, you should
focus on an in-depth reading of the textbook rather than a broad and
shallow reading of multiple sources.
1. Do the required reading before lecture
2. Do not expect to understand everything the first time
3. If overwhelmed, focus on the text, not the equations
4. After lecture, re-read to maximize understanding
R and Rstudio
• R is a statistical programming language and software for data analysis
• Rstudio is a software package that makes R more straightforward to use
• Why do we use R/Rstudio on this course? R is…
• …free!
• …more flexible than some alternatives – e.g. Excel, SPSS
• …widely used by researchers, companies, governments, non-profits,
etc
• …also used on the Advanced Quantitative Methods module
• Learning to use R is essential to do well in this course
• You should install R and Rstudio on your personal computers
• Don’t worry if you have trouble the first few weeks!
Lectures and classes
Lectures
• Lecturers will alternate weeks
• Lectures will be 60-80 minutes per week
• Lecture recordings will be uploaded by Friday of the preceding week.
Seminars
• One hour seminar slots
• Wednesday 4pm & 5pm
• Thursday 9am, 10am, 11am, 3pm, 4pm, 5pm
• Friday 9am, 10am
• Seminar attendance is mandatory
Homework
• The instructions and code you need for the seminars and homework
will be available on the course website
• You should work through the exercises before your scheduled
seminar time
• The site also includes useful information about the course,
quantitative methods, and coding in R
• Each seminar includes a homework exercise which focusses on
implementing the skills you have learned on new data
• Solutions will be posted on the Wednesday following the Friday class
• These homeworks are not assessed, but they will be very similar in
style to the assessments!
Assessment
• 25% of the course mark is based on a midterm coursework (1000
words, issued November 6, due November 11)
• 75% of the course mark is based on a final coursework (3000 words,
issued December 18, due January 11)
• The two courseworks will require you to:
• understand the theoretical concepts
• answer applied questions
• work with R
• Details will follow during the term
Quantitative Methods and Research Design
Which part of the research process are we working on?
[Figure: the research process, with stages Research Question, Theorize, Hypothesize, Data Collection, and Data Analysis]
Research Question
• A question that identifies the problem or puzzle one seeks to answer
• E.g. Does economic development cause democratization?
Theorize
• An explanation of why or how something happens
• E.g. Economically developed countries are more likely to be
democratic because they have a large middle class that moderates
political conflicts (Lipset 1959; Moore 1966)
Hypothesize
• A theory-based statement about a relationship we expect to observe
• E.g. Economically developed countries 1) are more likely to be
democratic, 2) will have a large middle class, 3) will have more
moderate political parties
Data Collection
• Process of systematically gathering and measuring information on
variables of interest
• E.g. For many countries, record the level of democracy; level of
development; size of middle class; etc
Data Analysis
• Use the data you have collected to provide evidence either for or
against your theory

• We will only focus on the final two stages, with most emphasis on
the analysis stage!
• PUBL0054 will focus on other parts of this process
• PUBL0010, PUBL0085, and PUBL0086 will introduce other types of
data analysis
Description, prediction and causation
Within this scope, we will cover different types of research questions.
1. Description
• Aims to describe differences in attributes across different units
• E.g. Do men and women have different political preferences? Do
politicians have the same priorities as their constituents?
2. Prediction
• Aims to forecast likely outcomes of social processes
• E.g. Who will win the next general election? What predicts civil war
outbreaks? When will the next recession occur?
3. Causation
• Aims to establish the causal effects of one phenomenon on another
• E.g. Did austerity cause Brexit? Does education increase income?
What are the effects of immigration on employment?
Break
Introduction to Quantitative Data
Example
Who voted in the 2015 general election?
An important question in studies of representation is whether those
who vote are similar to those who do not vote. This descriptive question
can only be answered empirically: we need to look at data on the
composition of voters and non-voters in an election.
We will use the 2015 British Election Study for this purpose.
• Survey conducted at each general election in the UK
• Face-to-face interviews of a representative sample of the population
Units and variables
There are 2 organising features of any data that we study
1. Units (\(i \in 1, \dots, N\))
• The objects that we are studying
• Usually these are the rows of the dataset
• E.g. individuals; countries; companies; Members of Parliament; etc
• We usually use \(i\) to indicate a unit, and \(N\) to mean the total
number of units
2. Variables
• Measurements of characteristics that vary across units
• Usually these are the columns of the dataset
• E.g. age; income; vote choice; profit/loss; GDP; etc
The first question we should ask when given data is “what are the units
and variables in this data?”
Dependent and independent variables
An important conceptual distinction between types of variable:
• Dependent variable (\(Y\))
• Variable to be explained
• Also called the outcome or response variable
• Independent variables (\(X\))
• Determinant(s) of the dependent variable
• Also called the explanatory or predictor variables
• Sometimes (somewhat confusingly) expressed as or
Units and variables (example)
In our British Election Study, the units are 1669 individuals who
responded to the survey, and the variables are listed in the table below.
Variable    Description
turnout     1 if voted in 2015, 0 otherwise
age         Age in years
gender      Female/Male
left_right  Self-placement on left (0) to right (10) scale
education   Highest level of education achieved
Question: Which are the dependent and independent variables?
Looking at our data
We can load this data using:
bes <- read.csv("data/bes.csv")
where
• read.csv tells R we want to read data from a .csv file
• "data/bes.csv" is the location in which our file is saved
• <- is the “assignment operator” which tells R that we want to save
our data in memory
• bes is the name of the object we have saved (we can choose any
name for objects)
The head() function shows the top 6 rows (units) in our data:
head(bes)
## turnout age gender left_right education
## 1 Voted 67 Female 5 GCSE
## 2 Voted 65 Female 5 Degree
## 3 Voted 65 Male 3 Degree
## 4 Voted 83 Male 5 None
## 5 Voted 56 Female 3 GCSE
## 6 Did not vote 40 Female 5 GCSE
The dim() function shows the number of rows (units) and columns (variables) in our data:
dim(bes)
## [1] 1669 5
We have 1669 units, and 5 variables.
The str() function gives information on the structure of our data:
str(bes)
## 'data.frame': 1669 obs. of 5 variables:
## $ turnout : Factor w/ 2 levels "Did not vote",..: 2 2 2 2 2 1 1 1 1 1 ...
## $ age : int 67 65 65 83 56 40 44 39 30 68 ...
## $ gender : Factor w/ 2 levels "Female","Male": 1 1 2 2 1 1 1 2 1 2 ...
## $ left_right: num 5 5 3 5 3 5 5 5 5 1 ...
## $ education : Factor w/ 4 levels "None","GCSE",..: 2 4 4 1 2 2 2 3 2 2 ...
As the str() function revealed, R calls this bes object a “data frame”.
A data frame is a data set with any number of variables (columns) measured
for each of any number of units (rows).
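As a minimal sketch of what a data frame is, the snippet below builds a small one by hand from the first three rows printed by head(bes). The object name bes_small is an illustrative assumption and is not part of the course data.

```r
# Toy data frame built from the first three rows shown by head(bes);
# bes_small is a hypothetical name used only for illustration
bes_small <- data.frame(
  turnout = c("Voted", "Voted", "Voted"),
  age = c(67, 65, 65),
  gender = c("Female", "Female", "Male")
)

dim(bes_small)    # number of rows (units), then number of columns (variables)
names(bes_small)  # the variable names
```

Each argument to data.frame() becomes one column, and every column must have the same number of entries, one per unit.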
Levels of measurement
• Continuous/Interval
• Values indicate precise differences between categories
• Differences (intervals) have the same meaning anywhere on the scale
• E.g. age
• Categorical/Nominal
• Values indicate different, mutually exclusive categories
• No relative information in the categories
• E.g. gender
• Ordinal
• Values indicate relative differences between categories
• Imply a ranking, but difference between categories may be unknown
• E.g. educational achievement
Determining the correct level of measurement is important for making
decisions about how to analyse your data.
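One way these three levels might be represented in R is sketched below, using values from the rows shown by head(bes). The object names are illustrative, and storing education as an ordered factor is an assumption made here for the example (the str() output in this lecture shows it as a plain factor).

```r
# Toy values taken from the rows shown by head(bes); names are illustrative
age <- c(67, 65, 65, 83, 56, 40)   # continuous: stored as numbers

# categorical: an unordered factor
gender <- factor(c("Female", "Female", "Male", "Male", "Female", "Female"))

# ordinal: an ordered factor (the ordering is an assumption for this sketch)
education <- factor(
  c("GCSE", "Degree", "Degree", "None", "GCSE", "GCSE"),
  levels = c("None", "GCSE", "Alevel", "Degree"),
  ordered = TRUE
)

is.numeric(age)               # is age stored as a number?
is.ordered(gender)            # do the gender categories have a ranking?
is.ordered(education)         # do the education categories have a ranking?
education[1] < education[2]   # does GCSE rank below Degree?
```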
Sums and Sigma notation
• \(N\) is the number of units or the sample size
• If \(N = 100\), we have 100 measurements of each variable
(\(Y_1, Y_2, Y_3, \dots, Y_N\))
• We will often want to refer to the sum of a variable:
\[ Y_1 + Y_2 + Y_3 + \dots + Y_N \]
But this gets cumbersome if \(N\) is large!
• Instead, we will often use Sigma notation:
\[ \sum_{i=1}^{N} Y_i = Y_1 + Y_2 + Y_3 + \dots + Y_N \]
where \(\sum_{i=1}^{N} Y_i\) means “sum up all instances of \(Y_i\) starting
from \(i = 1\) and ending at \(i = N\)”.
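In R, this kind of sum corresponds to the sum() function. The sketch below uses the ages of the first five respondents shown by head(bes); the object names Y and N are chosen here to mirror the notation.

```r
# Ages of the first five respondents shown by head(bes)
Y <- c(67, 65, 65, 83, 56)
N <- length(Y)   # number of units

Y[1] + Y[2] + Y[3] + Y[4] + Y[5]   # writing the sum out term by term
sum(Y)                             # the same sum, computed in one call
```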
Measures of central tendency
To compare voters to non-voters, we need some way of summarising their
characteristics. The most common summaries for most variables are
those that measure the central tendency of the variable.
Central Tendency
The value of a “typical” observation, or the value of the observation at
the center of a variable’s distribution.
We will consider three measures of central tendency:
1. Mean
2. Median
3. Mode
Mean
The mean is the “average” or expected value of a variable
It is denoted \(\bar{Y}\) or \(\bar{X}\), which can be read as “Y bar” or “X bar”
\[ \bar{Y} = \frac{\sum_{i=1}^{N} Y_i}{N} = \frac{1}{N} \sum_{i=1}^{N} Y_i \]
I.e. we add up the values of \(Y\) and divide by the sample size.
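As a small check of this definition, the sketch below computes the mean by hand and with R's built-in mean(), again using the ages of the first five respondents from head(bes) as toy data.

```r
# Ages of the first five respondents shown by head(bes)
Y <- c(67, 65, 65, 83, 56)

sum(Y) / length(Y)   # add up the values and divide by the sample size
mean(Y)              # built-in equivalent
```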
Median
The median is the value of a variable that divides the data into two
groups such that there are an equal number above and below.
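\[ \text{median} = \begin{cases} Y_{((N+1)/2)} & \text{when } N \text{ is odd} \\ \frac{1}{2}\left(Y_{(N/2)} + Y_{(N/2+1)}\right) & \text{when } N \text{ is even} \end{cases} \]
where \(Y_{(i)}\) is the \(i\)th smallest value of variable \(Y\).
I.e. the median is the middle value when the total number of
observations is odd, and the average of the two middle values when the
total number of observations is even.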
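The two cases can be sketched in R as follows. The helper median_by_hand() is hypothetical and written only to mirror the definition; in practice R's built-in median() does the same job. The ages are taken from the rows shown by head(bes).

```r
# Hypothetical helper that follows the definition of the median step by step
median_by_hand <- function(y) {
  y_sorted <- sort(y)   # y_sorted[i] is the i-th smallest value
  n <- length(y_sorted)
  if (n %% 2 == 1) {
    # odd number of observations: take the middle value
    y_sorted[(n + 1) / 2]
  } else {
    # even number of observations: average the two middle values
    (y_sorted[n / 2] + y_sorted[n / 2 + 1]) / 2
  }
}

ages_odd <- c(67, 65, 65, 83, 56)        # N = 5
ages_even <- c(67, 65, 65, 83, 56, 40)   # N = 6

median_by_hand(ages_odd)    # compare with median(ages_odd)
median_by_hand(ages_even)   # compare with median(ages_even)
```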
Mode
The mode is simply the most common value of a variable.
For example, in the BES data:
• 1282 respondents voted
• 387 respondents did not vote
The modal outcome for this variable is voted.
Mean, Median or Mode?
Which measure we use depends on the level of measurement:
• The mean is most appropriate for continuous variables
• The median is most appropriate for ordinal variables
• The mode is most appropriate for categorical variables
We will see many examples of these throughout the course.
Implementing in R
Fortunately, all of these are easily implemented in R for a given variable.
• mean() is a function that calculates the mean
• the $ sign allows us to select a variable from our data
## Mean
mean(bes$age)
## [1] 53.54763
• median() is a function that calculates the median
• here we use bes$left_right to select the left_right variable
## Median
median(bes$left_right)
## [1] 5
• table() counts the number of times each value of a variable appears
• The modal value here is “Degree”
## Mode
table(bes$education)
##
## None GCSE Alevel Degree
## 383 392 339 555
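Rather than reading the mode off the table by eye, one possible approach is to ask R for the category with the largest count. The sketch below copies the counts from the table(bes$education) output into a named vector (education_counts is an illustrative name), so it runs without the data file.

```r
# Counts copied from the table(bes$education) output
education_counts <- c(None = 383, GCSE = 392, Alevel = 339, Degree = 555)

which.max(education_counts)          # position of the largest count
names(which.max(education_counts))   # the modal category
```

Note that which.max() returns only the first maximum, so it would hide a tie between two equally common categories.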
Subsetting data
We will frequently want to subset our data in order to make statements
about different groups of observations (e.g. voters v non-voters).
We can denote subsets of a variable using subscripts. For instance:
\[ \bar{Y}_{X=1} \]
means “the average value of Y when X is equal to 1.”
We can then compare the average of Y in this subset to the average of Y in
another subset (i.e. \(\bar{Y}_{X=0}\)).
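A minimal sketch of this idea in R, using toy values from the six rows shown by head(bes): here Y is age and X is coded 1 if the respondent voted and 0 otherwise. This coding is an assumption for the example.

```r
# Toy values from the six rows shown by head(bes)
Y <- c(67, 65, 65, 83, 56, 40)   # age
X <- c(1, 1, 1, 1, 1, 0)         # 1 = voted, 0 = did not vote (assumed coding)

mean(Y[X == 1])   # average of Y among units with X equal to 1
mean(Y[X == 0])   # average of Y among units with X equal to 0
```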
Subsetting data
We can subset our data in R using the [,] brackets, which allow us to
select certain rows and columns from the data.
To select rows use the space before the comma
bes[1:3,]
## turnout age gender left_right education
## 1 Voted 67 Female 5 GCSE
## 2 Voted 65 Female 5 Degree
## 3 Voted 65 Male 3 Degree
To select columns use the space after the comma
bes[1:3,1:3]
## turnout age gender
## 1 Voted 67 Female
## 2 Voted 65 Female
## 3 Voted 65 Male
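Columns can also be selected by name rather than by position, which tends to be easier to read. The sketch below assumes a small hand-built data frame (bes_small, a hypothetical name) containing the three rows and three columns shown in the output.

```r
# Toy data frame; bes_small is a hypothetical name used only for illustration
bes_small <- data.frame(
  turnout = c("Voted", "Voted", "Voted"),
  age = c(67, 65, 65),
  gender = c("Female", "Female", "Male")
)

bes_small[2, ]                     # second row, all columns
bes_small[, c("turnout", "age")]   # all rows, two columns selected by name
bes_small[1:2, "age"]              # first two rows of the age column
```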
Brackets and braces and parentheses
R makes different use of ( ), [ ], and { } characters, and many new
user errors arise from confusing these.
• Parentheses ( ) are used when calling a named function to do
something to some objects.
• As in mean(bes$age), where we are using the mean() function on
the data bes$age.
• Brackets [ ] are used to access a subset of an object.
• As in bes[1,], where we are accessing the first row (unit) in bes.
• Braces { } are used for grouping multiple lines of code so that they
act like a single line of code.
• We will see these later in the module.
Logical values and operators
We can also use logical values and logical operators to select
rows/columns of interest.
For instance, we can ask R to check, for each row in our data, whether the
respondent’s value for turnout is “Voted”:
bes$turnout == "Voted"
Where
• the $ says that we would like to access the turnout variable from
the bes data
• the == tests whether each element of that variable is equal to the
value “Voted”, returning TRUE or FALSE for each unit
We will learn more logical operators (such as <, >, >=) in the seminar.
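To make the result of such a comparison concrete, the sketch below applies == and >= to toy vectors taken from the six rows shown by head(bes). The standalone vectors turnout and age are illustrative stand-ins for bes$turnout and bes$age.

```r
# Toy vectors from the six rows shown by head(bes)
turnout <- c("Voted", "Voted", "Voted", "Voted", "Voted", "Did not vote")
age <- c(67, 65, 65, 83, 56, 40)

turnout == "Voted"        # one TRUE or FALSE per unit
sum(turnout == "Voted")   # TRUE counts as 1, so this counts the voters
age >= 65                 # other comparison operators work the same way
age[turnout == "Voted"]   # keep the ages of units where the test is TRUE
```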
Logical values and operators
We can combine == and [ ] to select rows that match a criterion:
bes_voters <- bes[bes$turnout == "Voted",]
head(bes_voters)
## turnout age gender left_right education
## 1 Voted 67 Female 5 GCSE
## 2 Voted 65 Female 5 Degree
## 3 Voted 65 Male 3 Degree
## 4 Voted 83 Male 5 None
## 5 Voted 56 Female 3 GCSE
## 11 Voted 33 Female 5 Alevel
bes_non_voters <- bes[bes$turnout == "Did not vote",]
head(bes_non_voters)
## turnout age gender left_right education
## 6 Did not vote 40 Female 5 GCSE
## 7 Did not vote 44 Female 5 GCSE
## 8 Did not vote 39 Male 5 Alevel
## 9 Did not vote 30 Female 5 GCSE
## 10 Did not vote 68 Male 1 GCSE
## 19 Did not vote 36 Male 5 GCSE
• bes_voters includes units who voted
• bes_non_voters includes units who did not vote
We can use these new datasets to characterise the central tendency of
voters and non-voters for different variables.
Subsetting data
## Age
mean(bes_non_voters$age)
## [1] 47.86563
mean(bes_voters$age)
## [1] 55.26287
→ Voters are on average roughly 7.4 years older than non-voters
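The size of the age gap follows directly from the two means reported above. As a small worked example, with the values typed in by hand so that it runs without the data file:

```r
mean_voters <- 55.26287       # from mean(bes_voters$age)
mean_non_voters <- 47.86563   # from mean(bes_non_voters$age)

mean_voters - mean_non_voters   # difference in average age, in years
```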
## Education
table(bes_voters$education)
##
## None GCSE Alevel Degree
## 276 260 258 488
table(bes_non_voters$education)
##
## None GCSE Alevel Degree
## 107 132 81 67
→ The modal qualification for voters is a degree, for non-voters it is GCSE
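Because there are many more voters than non-voters, raw counts are hard to compare across the two groups. One possible next step, sketched below with the counts copied from the two tables above, is to convert each set of counts into proportions. The vector names are illustrative.

```r
# Counts copied from table(bes_voters$education) and
# table(bes_non_voters$education)
voters <- c(None = 276, GCSE = 260, Alevel = 258, Degree = 488)
non_voters <- c(None = 107, GCSE = 132, Alevel = 81, Degree = 67)

round(voters / sum(voters), 2)           # share of voters at each level
round(non_voters / sum(non_voters), 2)   # share of non-voters at each level
```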
## Left-right placement
median(bes_voters$left_right)
## [1] 5
median(bes_non_voters$left_right)
## [1] 5
→ Voters and non-voters are similar in terms of left-right placement
Example summary
Who voted in the 2015 general election?
Using data on 1669 individuals from the BES, we used measures of the
mean, median and mode to investigate differences between voters and
non-voters.
1. Voters are older, on average, than non-voters
2. Voters are more educated, on average, than non-voters
3. Voters and non-voters are similar in terms of ideology
Conclusion
What have we covered?
• Quantitative methods are a collection of tools we can use to
investigate research questions and theories
• Quantitative data is a collection of information structured in terms
of units and variables
• We can summarise variables by examining measures of central
tendency
• We can compare groups of observations using these measures of
central tendency
Recap of functions and notation
Code:
• read.csv() – load data into R from a .csv file
• head() – look at the first 6 rows of the data
• mean(), median() and table()
• data_object[row_indexes, column_indexes] – subsetting data
• data_object$variable_name – selecting variables from the data
Notation:
• \(i\) – a given unit
• \(N\) – the total number of units (sample size)
• \(\sum_{i=1}^{N} Y_i\) – add up all the numbers in \(Y\), from the first to the \(N\)th
• \(\bar{Y} = \frac{\sum_{i=1}^{N} Y_i}{N}\) – the mean, or expected value, of \(Y\)
Seminar
In seminars this week, you will learn about …
1. … the Rstudio interface to R
2. … objects and assignment
3. … vectors and data.frames
4. … subsetting
• Before coming to the seminar, install R and then Rstudio on your
computer.
• https://cran.r-project.org
• https://rstudio.com/products/rstudio/download/#download