UCL DEPARTMENT OF SECURITY AND CRIME SCIENCE
Week 2 – Web Data Collection II
SECU0050/SECU0057
Nilufer Tuptuk
Plan for today
• Limitations of APIs
• More HTML and CSS
• Web scraping in R with rvest
• RSelenium
• Web scraping best practices
• Tutorial
Application Programming Interface
Limitations
• Lack of an API: Not all platforms provide an API
• Availability of data: APIs do not always expose all of a platform's data
• Freshness of data: New data might not be available immediately
• Rate limits: Restrictions on how much data each call returns, how long you must wait between calls, etc.
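Rate limits are usually respected on the client side by spacing out calls. The following is a minimal sketch in Python (chosen only for illustration; the course itself works in R). The `Throttle` class, the `fetch_page` function, its fake records and the 0.05-second interval are all hypothetical and do not describe any real API.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between successive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the logic can be checked without waiting
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Sleep just long enough to respect min_interval, then record the call time."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


def fetch_page(page):
    # Hypothetical stand-in for a real API call; returns made-up records.
    return [f"record-{page}-{i}" for i in range(3)]


def collect(pages, throttle):
    """Request each page in turn, pausing between calls."""
    records = []
    for page in pages:
        throttle.wait()
        records.extend(fetch_page(page))
    return records


print(collect([1, 2, 3], Throttle(min_interval=0.05)))
```

The first call goes through immediately; every later call waits until the interval since the previous call has elapsed.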
Web scraping
• The process of automatically extracting unstructured data from a webpage and transforming it into a structured dataset that we can analyse
• Two main steps:
1) Fetch (download) the HTML pages (source code) that contain the data
2) Extract the relevant data from the HTML pages
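The two steps, fetch and extract, can be sketched as follows. This is an illustration in Python using only the standard library, not the rvest workflow used in the course. The `fetch` function is a stand-in that returns a hard-coded page instead of downloading one, and the URL, the list markup and the item names are made-up placeholders.

```python
from html.parser import HTMLParser


def fetch(url):
    # Step 1 (stubbed): a real scraper would download the page at `url` over HTTP.
    # Assumption: this hard-coded page stands in for the downloaded source code.
    return """<html><body>
    <ul>
      <li class="item">Alpha</li>
      <li class="item">Beta</li>
    </ul>
    </body></html>"""


class ItemExtractor(HTMLParser):
    """Collect the text inside every <li> element."""

    def __init__(self):
        super().__init__()
        self._in_li = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_li = False

    def handle_data(self, data):
        if self._in_li:
            self.items[-1] += data


def extract(html):
    # Step 2: turn the unstructured page into structured rows.
    parser = ItemExtractor()
    parser.feed(html)
    parser.close()
    return [{"position": i, "name": text.strip()}
            for i, text in enumerate(parser.items, start=1)]


rows = extract(fetch("https://example.org/list"))
print(rows)
```

The result is a list of records, one per list item, which is the kind of structured dataset that can then be analysed.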
A simple HTML Document
Welcome to Data Science
Here we write paragraphs of text.
<!DOCTYPE html>: Declaration of the document type
<html>: The root element, parent to all other HTML elements
<head>: Contains metadata about the HTML document, e.g. document title, character set, styles, links, scripts
<body>: Contains all the content of an HTML document: headings, paragraphs, tables, lists, images, hyperlinks, etc.
HTML Elements
• HTML documents consist of elements
• HTML elements generally have a start tag, content and an end tag
• Start tags are written in angle brackets, for example <html>, <head>, <body>, <h1>, <p>
• Content is inserted between the start tag and the end tag
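To make the start tag, content and end tag structure concrete, the sketch below walks through a small fragment and records each piece as the parser meets it. It is written in Python with the standard library purely for illustration. The `<h1>` and `<p>` tags around the example text are an assumption, since the original markup is not reproduced here.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record start tags, content and end tags in the order they appear."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Ignore whitespace-only text between elements.
        if data.strip():
            self.events.append(("content", data.strip()))


logger = EventLogger()
# Assumed markup for the example text.
logger.feed("<h1>Welcome to Data Science</h1>"
            "<p>Here we write paragraphs of text.</p>")
logger.close()

for kind, value in logger.events:
    print(kind, value)
```

Each element shows up as three events in a row: its start tag, its content, then its end tag.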
Document Object Model (DOM)
• In the HTML DOM, every element is a node
Reference: https://www.w3schools.com/js/js_htmldom_navigation.asp
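The idea that every element is a node with a parent and children can be sketched by building a small tree by hand. This is an illustrative Python example using the standard library; the `Node` and `TreeBuilder` classes are hypothetical helpers, the sample page (including its title text) is made up, and the builder assumes every start tag has a matching end tag.

```python
from html.parser import HTMLParser


class Node:
    """One element in the tree, linked to its parent and children."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Build a tree of Node objects; assumes well-formed, fully closed markup."""

    def __init__(self):
        super().__init__()
        self.root = None
        self._current = None

    def handle_starttag(self, tag, attrs):
        node = Node(tag, parent=self._current)
        if self._current is None:
            self.root = node
        else:
            self._current.children.append(node)
        self._current = node  # descend into the new element

    def handle_endtag(self, tag):
        if self._current is not None:
            self._current = self._current.parent  # climb back up

    def handle_data(self, data):
        if self._current is not None:
            self._current.text += data.strip()


def outline(node, depth=0):
    """Return the tag names as an indented list, one line per node."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


builder = TreeBuilder()
builder.feed("<html><head><title>My title</title></head>"
             "<body><h1>Welcome to Data Science</h1>"
             "<p>Here we write paragraphs of text.</p></body></html>")
builder.close()
print("\n".join(outline(builder.root)))
```

Navigating the tree then amounts to following `parent` and `children` links, for example from the `<h1>` node up to `<body>`.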
Cascading Style Sheets (CSS)
id: an attribute that gives an element a unique id; each id value can be used by only one HTML element
class: multiple HTML elements can belong to the same class
Why is CSS useful for us?
• We will be identifying elements via CSS selector notation
• Selecting by id: #myHeader
• Selecting by class: .language
Example content of elements with class "language": English, Welsh
CSS Selectors
Selector          Example         Description
element           p               Selects all <p> elements
element.class     p.intro         Selects all <p> elements with class "intro"
.class            .title          Selects all elements with class "title"
#id               #contact        Selects the element with the id attribute "contact"
element element   div p           Selects all <p> elements inside <div> elements
:first-child      p:first-child   Selects every <p> element that is the first child of its parent
Reference and more CSS selectors: https://www.w3schools.com/cssref/css_selectors.asp
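The six selector forms in the table can be tried out on a small page. In rvest the selector string is simply passed to `html_elements()`. The sketch below instead hand-rolls a tiny matcher in Python with the standard library, only to show what each selector picks out. The `matches` and `select` helpers, the sample page and its text are hypothetical, and the parser used here requires well-formed markup, which real web pages often are not.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample page.
DOC = """<html><body>
<div><p class="intro">First</p><p>Second</p></div>
<h2 class="title">Heading</h2>
<p id="contact">Email us</p>
</body></html>"""


def matches(el, simple):
    """Match one simple selector: tag, .class, #id or tag.class."""
    if simple.startswith("#"):
        return el.get("id") == simple[1:]
    tag, _, cls = simple.partition(".")
    if tag and el.tag != tag:
        return False
    return not cls or cls in el.get("class", "").split()


def select(root, selector):
    """Return elements matching a selector of one of the six forms in the table."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    if selector.endswith(":first-child"):
        simple = selector[: -len(":first-child")]
        return [el for el in root.iter()
                if el in parent_of and parent_of[el][0] is el and matches(el, simple)]
    parts = selector.split()
    if len(parts) == 2:  # descendant form, e.g. "div p"
        ancestor, target = parts
        found = []
        for el in root.iter():
            if not matches(el, target):
                continue
            node = parent_of.get(el)
            while node is not None:  # walk up looking for a matching ancestor
                if matches(node, ancestor):
                    found.append(el)
                    break
                node = parent_of.get(node)
        return found
    return [el for el in root.iter() if matches(el, selector)]


root = ET.fromstring(DOC)
for sel in ["p", "p.intro", ".title", "#contact", "div p", "p:first-child"]:
    print(sel, [el.text for el in select(root, sel)])
```

On this sample page, `p` returns all three paragraphs, while `div p` returns only the two inside the `<div>` and `p:first-child` returns only the first of those.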
Web scraping in R: rvest
• rvest is an R package that helps you scrape data from web pages
• It is very popular, with a lot of online material and help available
• More information on rvest can be found on CRAN: https://cran.r-project.org/web/packages/rvest/index.html
Web scraping with rvest: FBI’s Cyber’s Most Wanted
Target URL: https://www.fbi.gov/wanted/cyber
Core steps for web scraping:
• Examine the webpage
• Decide which data you want to scrape from the webpage
• Identify the CSS selectors:
  • Use the browser's Inspect tool
  • Or use other tools (e.g. SelectorGadget)
• Write a program using the rvest package
SelectorGadget
• Use SelectorGadget to identify relevant CSS selectors
• See the short tutorial video at https://selectorgadget.com
• Search for and install the extension from https://chrome.google.com/webstore/category/extensions
Web scraping with rvest: FBI’s Cyber’s Most Wanted
Target URL: https://www.fbi.gov/wanted/cyber
Decide the data you want to scrape from the webpage:
1. A list of all names
2. The bio details for each name
FBI’s Cyber’s Most Wanted: Identify CSS selectors
Key here: look for the