UCL DEPARTMENT OF SECURITY AND CRIME SCIENCE
Week 2 – Web Data Collection II
SECU0050/SECU0057
Nilufer Tuptuk
Plan for today
Department of Security and Crime Science
• Limitations of APIs
• More HTML and CSS
• Web scraping in R with rvest
• RSelenium
• Web scraping best practices
• Tutorial
Application Programming Interface
Department of Security and Crime Science
Limitations
• Lack of API: Not all platforms provide an API
• Availability of data: Not all APIs share all their data
• Freshness of data: New data might not be available immediately
• Rate limits: Restrictions on how much data you can get per call, the time
between calls, the number of calls allowed per time window, etc.
Web scraping
Department of Security and Crime Science
• The process of extracting unstructured data automatically from
a webpage and transforming it into a structured dataset
that we can analyse
• Two main steps:
1) Fetch (download) the HTML pages (source code) that
contain the data
2) Extract the relevant data from the HTML pages
A simple HTML Document
Department of Security and Crime Science
<!DOCTYPE html>
<html>
<head>
<title>Web Scraping</title>
</head>
<body>
<h1>Welcome to Data Science</h1>
<p>Here we write paragraphs of text.</p>
<ul>
  <li>Security</li>
  <li>Crime</li>
  <li>Science</li>
</ul>
</body>
</html>
<!DOCTYPE html>: declaration of the document type
<html>: the root element, parent to all
other HTML elements
<head>: contains metadata about the
HTML document: e.g. document title,
character set, styles, links, scripts
<body>: contains all the content of an HTML
document: headings, paragraphs, tables,
lists, images, hyperlinks, etc.
HTML Elements
Department of Security and Crime Science
• HTML documents consist of elements
• HTML elements generally have a start tag, content,
and an end tag, e.g. <p>content</p>
• start tags: <html>, <head>, <body>, <title>,
<h1>, <p>, <ul>, <li>
• content: inserted in between the tags
Document Object Model (DOM)
Department of Security and Crime Science
• In the HTML DOM, every element is a node – see the short sketch below
Reference: https://www.w3schools.com/js/js_htmldom_navigation.asp
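Since every element is a node, we can walk the tree programmatically. A minimal sketch (not from the slides) using rvest's navigation helpers on a tiny inline document:

library(rvest)

# a tiny inline document (illustration only)
page <- minimal_html('
  <body>
    <h1>Welcome to Data Science</h1>
    <p>Here we write paragraphs of text.</p>
  </body>')

body_node <- html_element(page, 'body')
html_children(body_node)             # the child nodes: the <h1> and <p> elements
html_name(html_children(body_node))  # their tag names: "h1" "p"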
Cascading Style Sheets (CSS)
Department of Security and Crime Science
id: attribute used to give a unique id to an element –
an id can be used by only one HTML element
class: multiple HTML elements can belong to the same
class
Why is CSS useful for us?
• We will be identifying elements via CSS selector
notation
• Selecting by id: #myHeader
• Selecting by class: .language (see the example below)

<html>
<body>
<h1 id="myHeader">Languages</h1>
<p class="language">English</p>
<p class="language">Welsh</p>
</body>
</html>
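As a quick illustration (not from the slides), rvest can select elements from the snippet above by id and by class:

library(rvest)

page <- minimal_html('
  <h1 id="myHeader">Languages</h1>
  <p class="language">English</p>
  <p class="language">Welsh</p>')

html_element(page, '#myHeader') %>% html_text()   # "Languages"
html_elements(page, '.language') %>% html_text()  # "English" "Welsh"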
CSS Selectors
Department of Security and Crime Science
Selector         Example        Description
element          p              selects all <p> elements
element.class    p.intro        selects all <p> elements with class "intro"
.class           .title         selects all elements with class "title"
#id              #contact       selects the element with the id attribute "contact"
element element  div p          selects all <p> elements inside <div> elements
:first-child     p:first-child  selects every <p> element that is the first child of its parent
Reference and more CSS selectors: https://www.w3schools.com/cssref/css_selectors.asp
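A short sketch (not from the slides) trying some of the selectors above on a made-up snippet:

library(rvest)

page <- minimal_html('
  <div><p class="intro">A paragraph inside a div</p></div>
  <p class="title">A paragraph with class title</p>')

html_elements(page, 'p.intro')        # <p> elements with class "intro"
html_elements(page, '.title')         # any element with class "title"
html_elements(page, 'div p')          # <p> elements inside <div> elements
html_elements(page, 'p:first-child')  # <p> elements that are first children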
Web scraping in R: rvest
Department of Security and Crime Science
• rvest is an R package that helps to scrape data from web pages
• Very popular, with a lot of online material and help available
• More information on rvest is available on CRAN at:
https://cran.r-project.org/web/packages/rvest/index.html
Web scraping with rvest: FBI’s Cyber’s Most Wanted
Department of Security and Crime Science
Target URL: https://www.fbi.gov/wanted/cyber
Core steps for web scraping:
• Examine the webpage
• Decide the data you want to scrape from
the webpage
• Identify the CSS selectors:
• Use the Inspect element in the
browser
• Other tools (e.g. SelectorGadget)
• Write a program using the rvest package
SelectorGadget
Department of Security and Crime Science
• Use SelectorGadget to identify relevant CSS selectors.
• See a short tutorial video available at https://selectorgadget.com
• Search and download from
https://chrome.google.com/webstore/category/extensions
Web scraping with rvest: FBI’s Cyber’s Most Wanted
Department of Security and Crime Science
Target URL: https://www.fbi.gov/wanted/cyber
Decide the data you want to scrape from the webpage
1. Get a list of all names
2. Get bio details of all names
FBI’s Cyber’s Most Wanted: Identify CSS selectors
Department of Security and Crime Science
Key here: look for the <h3> heading with class "title" for the names
Steps for web scraping with rvest
Department of Security and Crime Science
1. Install and load the rvest library
2. Read a webpage into R by specifying the URL of the web page you
would like to scrape, using the function read_html()
3. Extract specified elements out of HTML documents using the
functions (html_element(), html_elements()) and CSS selectors.
Web scraping with rvest
Department of Security and Crime Science
4. Extract components of elements using functions like:
- html_text(): raw underlying text
- html_text2(): simulates how text looks in a browser
- html_table(): parse an html table into a data frame
- html_attr() and html_attrs(): get element attributes
• Attributes are special words used within a tag to provide additional
information about HTML elements:
<a href="https://www.ucl.ac.uk">University College London</a>

The <a> HTML element (or anchor element), with its href attribute, creates a hyperlink to web pages, files, email addresses, locations in the same page, or anything else a URL can address.
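For example (a minimal sketch, not from the slides), html_attr() pulls out the href while html_text() returns the link text:

library(rvest)

page <- minimal_html('<a href="https://www.ucl.ac.uk">University College London</a>')
link <- html_element(page, 'a')
html_attr(link, 'href')  # "https://www.ucl.ac.uk"
html_text(link)          # "University College London"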
FBI’s Cyber’s Most Wanted: Read HTML
Department of Security and Crime Science
# Step 1: Install and load the rvest library
install.packages("rvest")
library(rvest)
# Step 2: read the webpage into R
target_url <- "https://www.fbi.gov/wanted/cyber"
cyber_pages <- read_html(target_url)
cyber_pages
## {html_document}
## <html ...>
## [1] <head> ... </head>
## [2] <body> ... </body>

# Step 3: extract the name headings using the CSS selector
cyber_titles <- cyber_pages %>%
  html_elements('h3.title')
cyber_titles

## {xml_nodeset (40)}
## [1] <h3 class="title"> ... </h3>
## [2] "MANSOUR AHMADI"
## [3] "AHMAD KHATIBI AGHDA"
## [4] "AMIR HOSSEIN NICKAEIN RAVARI"
## [5] "EMILIO JOSE CORREDOR LOPEZ"
...
## [37] "MOHAMMAD BAYATI"
## [38] "MEHDI FARHADI"
## [39] "HOOMAN HEIDARIAN"
## [40] "MARWAN ABUSROUR"
FBI’s Cyber’s Most Wanted: Get URL for bio details
Department of Security and Crime Science
# get the URL of each wanted person
all_names_links <- cyber_pages %>%
  html_elements('h3.title') %>%
  html_elements('a') %>%
  html_attr('href')
all_names_links
## [1] "https://www.fbi.gov/wanted/cyber/iranian-cyber-actors"
## [2] "https://www.fbi.gov/wanted/cyber/mansour-ahmadi"
## [3] "https://www.fbi.gov/wanted/cyber/ahmad-khatibi-aghda"
## [4] "https://www.fbi.gov/wanted/cyber/amir-hossein-nickaein-ravari"
...
## [38] "https://www.fbi.gov/wanted/cyber/mehdi-farhadi"
## [39] "https://www.fbi.gov/wanted/cyber/hooman-heidarian"
## [40] "https://www.fbi.gov/wanted/cyber/marwan-abusrour"
FBI’s Cyber’s Most Wanted: Individual Pages
Department of Security and Crime Science
FBI’s Cyber’s Most Wanted: Inspect HTML table
Department of Security and Crime Science
FBI’s Cyber’s Most Wanted: Get bio details
Department of Security and Crime Science
# get details of one cyber wanted fugitive
one_target_url <- all_names_links[2]
one_target_page <- read_html(one_target_url)
description <- one_target_page %>%
  html_elements('div.wanted-person-description') %>%
  html_table()
description
## [[1]]
## # A tibble: 6 x 2
##   X1                    X2
##   <chr>                 <chr>
## 1 Date(s) of Birth Used July 7, 1988
## 2 Place of Birth Tehran Province, Iran
## 3 Hair Dark Brown
## 4 Eyes Brown
## 5 Sex Male
## 6 Nationality Iranian
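If you want one row per person rather than a two-column table, a possible next step (not shown in the slides) is to widen the table with tidyr; the column names X1/X2 come from html_table() above:

library(tidyr)

bio <- description[[1]]
bio_row <- pivot_wider(bio, names_from = X1, values_from = X2)
# bio_row is a 1 x 6 tibble with columns like "Place of Birth", "Sex", ...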
FBI’s Cyber’s Most Wanted: Get all bios using for loop
Department of Security and Crime Science
# get details of all cyber wanted fugitives and store them in a list
all_descriptions <- list()
for (i in seq_along(all_names_links)) {
  temp_target_page <- read_html(all_names_links[i])
  description <- temp_target_page %>%
    html_elements('div.wanted-person-description') %>%
    html_table()
  all_descriptions[[i]] <- description
  Sys.sleep(runif(1, 1, 3))  # polite pause between requests (see best practices)
}
FBI’s Cyber’s Most Wanted: Access data in list
Department of Security and Crime Science
• Now we have a list of tables
• Each table contains the details of one wanted person
all_descriptions[[2]]
## [[1]]
## # A tibble: 6 x 2
##   X1                    X2
##   <chr>                 <chr>
## 1 Date(s) of Birth Used July 7, 1988
## 2 Place of Birth Tehran Province, Iran
## 3 Hair Dark Brown
## 4 Eyes Brown
## 5 Sex Male
## 6 Nationality Iranian
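To analyse all fugitives together, one option (not from the slides) is to stack the tables into a single data frame, assuming each person's page yielded exactly one table:

library(dplyr)

all_bios <- bind_rows(
  lapply(all_descriptions, function(x) x[[1]]),  # first (only) table per person
  .id = "person_index"
)
head(all_bios)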
Static vs Dynamic Web-Scraping
Department of Security and Crime Science
• What if content requires some interaction to load?
• scrolling down for new content, filling in forms, clicking
buttons, making a search, animations, embedded media, etc.
• Dynamic content
• We need a way to automate a browser
• We need a way to simulate human-user interaction
• We need more advanced tools than rvest
Selenium
Department of Security and Crime Science
• Provides a range of tools and libraries for automating web browsers
• Emulates human user interactions with browsers, such as typing, clicking,
selecting, scrolling, etc.
• Open source and free: https://www.selenium.dev
• RSelenium (https://github.com/ropensci/RSelenium) – a set of R bindings
for Selenium WebDriver to automate browsers
• More information available at:
https://cran.r-project.org/web/packages/RSelenium
Installing RSelenium
Department of Security and Crime Science
• install.packages("RSelenium") from CRAN
• Often requires a recent version of the Java JDK installed on your computer
to work
• Download the current version from here:
https://www.oracle.com/java/technologies/downloads/
• Most common browsers used are Chrome, Firefox and Internet Explorer
• You may need to update your version of the browser
RSelenium: Setup
Department of Security and Crime Science
# create a browser instance and start a driver
# If you get an error like "Selenium server signals port = 4583 is already
# in use", use a new port, e.g. port = 4557L
library(RSelenium)

# create a browser instance
selenium_firefox <- rsDriver(browser = "firefox", port = 4583L, chromever = NULL)
driver <- selenium_firefox$client

# set target URL
target_url <- 'https://www.fbi.gov/wanted/cyber'

# navigate to the FBI Wanted Cyber page using the target URL
driver$navigate(target_url)
Timeouts when scraping web pages
Department of Security and Crime Science
• Controlling the rate of scraping is important
• Avoid overloading the server with tens of requests per second – don’t
disrupt/harm the activity of the website
• Fast-paced requests coming from the same IP address are likely to get
banned
• Gather data during the off-peak hours of the website
• Mimic the normal behaviour of a human user
RSelenium – Timeouts when scraping web pages
Department of Security and Crime Science
# To mimic human behaviour, we introduce random wait intervals between
# requests using the function Sys.sleep()
list_for_requests <- list()
for (i in 1:5) {
  parsed_pagesource <- driver$getPageSource()[[1]]
  result <- read_html(parsed_pagesource) %>%
    html_elements('h3.title') %>%
    html_element('a')
  list_for_requests[[i]] <- result
  print(result)
  print(paste('Sent request at:', Sys.time(), sep = " "))
  Sys.sleep(runif(1, 1, 5))  # wait between 1 and 5 seconds
}
RSelenium – Scrolling up/down web pages
Department of Security and Crime Science
# Navigate the driver to the FBI Wanted Cyber page using the target URL
driver$navigate(target_url)

# Find the HTML body
page_body <- driver$findElement("css", "body")

# To scroll down/up a web page, use the sendKeysToElement function
page_body$sendKeysToElement(list(key = "page_down"))
Sys.sleep(5)
page_body$sendKeysToElement(list(key = "page_up"))
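Scrolling is only one kind of interaction: clicking and typing follow the same find-then-act pattern. A hedged sketch below; the CSS selectors are hypothetical and not taken from the FBI page:

# type a query into a (hypothetical) search box and press enter
search_box <- driver$findElement(using = "css", value = "input[name='q']")
search_box$sendKeysToElement(list("ransomware", key = "enter"))

# click a (hypothetical) button
button <- driver$findElement(using = "css", value = "button.submit")
button$clickElement()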
RSelenium – Scroll and get details of all cyber wanted
Department of Security and Crime Science
# simulate multiple scrolls
# navigate the driver (= simulated browser) to the target URL
driver$navigate(target_url)

# find the HTML body
page_body <- driver$findElement("css", "body")

# send multiple scroll commands in a loop
for (i in 1:15) {
  page_body$sendKeysToElement(list(key = "page_down"))
  # allow some time for this to happen
  Sys.sleep(runif(1, 1, 5))
}
# now access the page source
parsed_pagesource <- driver$getPageSource()[[1]]

# now we can scrape from the page after the simulation
full_results <- read_html(parsed_pagesource) %>%
  html_elements('h3.title') %>%
  html_elements('a') %>%
  html_attr('href')
length(full_results)
RSelenium – Close connections
Department of Security and Crime Science
# close the driver and the server
driver$close()
selenium_firefox$server$stop()
Web scraping best practices
Department of Security and Crime Science
• Understand and exploit the structure of web pages.
• Check the terms and conditions of the website for the legality of scraping.
• Use robotstxt to check whether you are allowed to scrape and what you are
allowed to scrape.
• Observe rate limits – don't disrupt the operation of a website.
• Only scrape data if you see value in it – problem first, then the method.
• You may need ethics approval:
• UCL REC defines "scraping data" as high-risk: see "Applying to the
UCL REC" on the UCL Research Ethics pages.
Robotstxt
Department of Security and Crime Science
• robots.txt – a file found at the root of a domain, describing which sections of the
website robots can access and under what conditions (e.g. delays between calls)
• https://en.wikipedia.org/robots.txt
Understanding Robotstxt
Department of Security and Crime Science
• User-agent: the name of the web robot or scraper
• User-agent: * = applies to all robots/users
• To specify a particular robot: e.g. User-agent: Googlebot, User-agent: AdsBot-Google
• Allow: scraping is okay for the given page or directory (defined with a /, e.g. /path/)
• Disallow: scraping is NOT okay for the given page or directory (defined with a /,
e.g. /path/)
• Crawl-delay: N, the minimum waiting time between each request to the website

User-agent: Googlebot-news
Allow: /

User-agent: *
Disallow: /

• You can usually access the robots.txt file by taking the domain of the website and
adding "/robots.txt"
• https://www.google.com/robots.txt, https://www.yahoo.com/robots.txt
• For more information visit the following sites:
• http://www.robotstxt.org/robotstxt.html
• https://developers.google.com/search/docs/advanced/robots/intro
Using robotstxt from R
Department of Security and Crime Science
install.packages("robotstxt")
library(robotstxt)
target_url <- "https://www.fbi.gov/"
get_robotstxt(target_url)

[robots.txt]
--------------------------------------
Sitemap: https://www.fbi.gov/sitemap.xml.gz

# Define access-restrictions for robots/spiders
# http://www.robotstxt.org/wc/norobots.html
# By default we allow robots to access all areas of our site
# already accessible to anonymous users
User-agent: *
Disallow:

For more information visit the following site:
https://cran.r-project.org/web/packages/robotstxt/vignettes/using_robotstxt.html
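Beyond printing the file, the robotstxt package can answer the question directly; a small sketch (not from the slides) using paths_allowed():

library(robotstxt)

# TRUE if the default user agent may scrape the given path
paths_allowed(paths = "/wanted/cyber", domain = "www.fbi.gov")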
What’s next?
Department of Security and Crime Science
Tutorial:
• Scraping data from FBI's website
Homework:
• Scraping data from a pet for sale website
• Submit code by next Tuesday to receive feedback
• Scraping data from a fact-checking website: www.politifact.com
• Permission obtained from politifact.com for you to do so!