GY476 - Summative Assessment 2022/23 – MSc GDS
Overview and Instructions: Computational Essay
Due Date: 15th December 2022, by 12 noon
Overview
Here’s the premise. You will take the role of a real-world GIS analyst or spatial data scientist tasked
with exploring datasets on the San Francisco Bay Area (often just called the Bay Area) and finding useful
insights for a variety of city decision-makers. It does not matter if you have never been to the Bay
Area. In fact, this will help you focus on what you can learn about the city through the data, without
the influence of prior knowledge. Furthermore, the assessment will not be marked based on how
much you know about the San Francisco Bay Area but instead about how much you can show you
have learned through analysing data. You will need to contextualise your project by highlighting the
opportunities and limitations of ‘old’ and ‘new’ forms of spatial data and reference relevant
literature.
Format
A computational essay using R-markdown. The assignment should be carried out fully in
R-markdown.
What is a Computational Essay?
A computational essay is an essay whose narrative is supported by code and computational results
that are included in the essay itself. This piece of assessment is equivalent to 4,000 words.
However, this is the overall weight. Since you will need to create not only narrative but also code
and figures, here are the requirements:
• Maximum of 3,000 words (ordinary text) (references do not contribute to the word
count). You should answer the specified questions within the narrative. The questions
should be included within a wider analysis.
• Up to five maps or figures (a figure may include more than one map and will only count as
one but needs to be integrated in the same overall output)
• Up to one table
There are three kinds of elements in a computational essay.
1. Ordinary text (in English)
2. Computer input (R-markdown code)
3. Computer output
These three elements all work together to express what’s being communicated.
Submission
You must submit one electronic copy of your summative assessment via SharePoint by the
published deadline. The format of the file must be a .zip (zipped folder), including an
HTML document AND an R-markdown document AND any additional data or JPG images. Please do not
include your name anywhere in the documents.
• Please name your file as follows: Course_Candidate number (e.g., GY476_34567.zip).
Don’t worry if your file gets renamed, and please do not tell the course teacher if it does,
as files should remain anonymous.
• Please refer to the GY476 Summative Assessment criteria. This document includes the
parts you should include in your Computational Essay.
Data
The assignment has two parts and relies on the datasets listed here. Each dataset is explained in more
detail below.
• Data made available on Murray Cox’s website as part of his “Inside Airbnb” project which
you can download (http://insideairbnb.com/). The website periodically publishes
snapshots of Airbnb listings around the world. You should download the San Francisco
data, the San Mateo data and the Oakland data. These are all part of the Bay Area.
Please note that, for best results, you will need to drop some of the outliers.
• Socio-economic variables for the Bay Area. Source: American Community Survey (ACS)
2016-2020, US Census Bureau. Observations: 1039; Variables: 472; Years: 2016-2020.
▪ A subset of variables from the latest ACS has already been retrieved for you in
ACS_2016_2020_vars.csv. However, you have access to ALL variables in the
American Community Survey (ACS) 2016-2020 through the R package
tidycensus.
▪ You are strongly recommended to use the Census API in the R package
tidycensus to extract your variables of interest instead of the CSV. For more
information about the ACS (2016-2020) you can have a look at:
https://www.census.gov/data/developers/data-sets/acs-5year.html and
https://api.census.gov/data/2020/acs/acs5/variables.html.
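As an illustration of the note on outliers in the Inside Airbnb data, here is a minimal base-R sketch of one possible approach, the 1.5 × IQR rule. It is not the required method. The data frame, its values and the assumption that price arrives as text such as "$1,200.00" are hypothetical; check the format of your own download and justify whichever rule you adopt.

```r
# Hypothetical listings, for illustration only. The price format is an
# assumption: check how price is stored in the file you downloaded.
listings <- data.frame(
  id = 1:8,
  price = c("$95.00", "$120.00", "$150.00", "$180.00",
            "$210.00", "$250.00", "$1,200.00", "$9,999.00"),
  stringsAsFactors = FALSE
)

# Strip the currency symbol and thousands separator, then convert to numeric.
listings$price_num <- as.numeric(gsub("[$,]", "", listings$price))

# Fences at 1.5 * IQR below the first quartile and above the third quartile.
q <- unname(quantile(listings$price_num, probs = c(0.25, 0.75), na.rm = TRUE))
iqr <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Keep only listings with a non-missing price inside the fences.
keep <- !is.na(listings$price_num) &
  listings$price_num >= lower &
  listings$price_num <= upper
listings_trimmed <- listings[keep, ]
```

With these made-up values the two most expensive listings fall above the upper fence and are dropped. Other cut-offs (for example percentile trimming) may suit the real distribution better.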
If you want to visualise some aspects at different subnational administrative boundaries, you can
download USA boundaries from GADM. You can also find other geodata for the Bay Area in the
Berkeley Library.
You can use additional datasets, if you so choose, for Part 2. If you need some inspiration, have
a look at:
• Geodata for the Bay Area in the Berkeley Library.
• San Francisco Open Data Portal: https://datasf.org/opendata/
• Data World: https://data.world/datasets/san-francisco
• NASA Data: https://earthdata.nasa.gov/earth-observation-data/near-real-time/hazards-and-disasters/air-quality
Part 1 – Common
1.1 Collecting and importing the data
1.1.1 Import and explore
1.2 Preparing the data
1.2.1 What CRS are you going to use? Justify your answer.
1.3 Discussion of the data
• Present and describe the data sets used for this project.
1.4 Mapping and Data visualisation
1.4.1 Airbnb in the BAY AREA at Neighbourhood Level
• Summarise the data using Bay Area zipcodes obtained from the Berkeley Library. This
zipcode file is slightly different from the Airbnb neighbourhood file. Obtain a count of
listings by neighbourhood.
• Map 1.1: Number of listings per zipcode. Explore the spatial distribution of the data using
choropleths. Style the layers using a colour ramp.
• Map 1.2: Average price per zipcode. Explore the spatial distribution of the data using
choropleths. Style the layers using a colour ramp.
Justify your data classification methods and visualization choices. You should include these maps
in your assessment submission. The maps should be well-presented and include a short
description.
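The summaries behind Maps 1.1 and 1.2 can be sketched in base R as follows. This is a minimal illustration that assumes each listing has already been assigned a zipcode (for example through a spatial join with the zipcode polygons). The data frame, the zipcode values and the column names are hypothetical.

```r
# Hypothetical listings that already carry a zipcode, for illustration only.
listings <- data.frame(
  zipcode = c("94103", "94103", "94110", "94110", "94110", "94607"),
  price   = c(100, 200, 150, 250, 200, 90),
  stringsAsFactors = FALSE
)

# Number of listings per zipcode. The formula interface drops rows with a
# missing price, so this counts listings that have a price.
n_listings <- aggregate(price ~ zipcode, data = listings, FUN = length)
names(n_listings)[2] <- "n_listings"

# Average price per zipcode.
mean_price <- aggregate(price ~ zipcode, data = listings, FUN = mean)
names(mean_price)[2] <- "mean_price"

# One row per zipcode, ready to be joined to the zipcode polygons for mapping.
by_zip <- merge(n_listings, mean_price, by = "zipcode")
```

The resulting table has one row per zipcode with a count and a mean, which are the two quantities the choropleths display.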
Questions to answer within your analysis: How does the Inside Airbnb data compare to other ‘new’
forms of spatial data? Discuss the potential insights and biases, as well as opportunities and
limitations of the Airbnb data.
1.4.2. Socio-economic variables from the ACS data
Select two variables from American Community Survey data. These could be but are not limited
to population density, median income, median age, unemployed, percentage of black population,
percentage of Hispanic population or education level. See the Appendix in this document for help.
If you choose to calculate population percentages, make sure you standardise the table by the
population size of each tract.
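Standardising by tract population amounts to dividing a group count by the tract total. A minimal base-R sketch follows, using the column names black and all_ppl_race described in the Appendix. The tract identifiers and counts are hypothetical, and treating empty tracts as missing is an assumption you may wish to handle differently.

```r
# Hypothetical tract-level counts, for illustration only.
acs <- data.frame(
  tract        = c("A", "B", "C"),
  black        = c(120, 0, 45),
  all_ppl_race = c(2400, 0, 900)
)

# Percentage of the tract population in the group. Tracts with zero
# population are set to NA to avoid dividing by zero.
acs$pct_black <- ifelse(
  acs$all_ppl_race > 0,
  100 * acs$black / acs$all_ppl_race,
  NA_real_
)
```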
• Map 2: Explore the spatial distribution of your chosen variables using choropleths. Style the
variables using a colour ramp. Justify your data classification methods and visualization
choices. You should include these maps in your assessment submission. The maps should
be well-presented and include a short description.
Questions to answer within your analysis: Comment on the details of your map and analyse the
results. What are the main types of neighbourhoods you identify? Which characteristics help you
delineate this typology? What can you say about the spatial distribution of your socio-economic
variable of interest? If you had to use this classification to evaluate where Airbnbs would cluster,
what would your hypothesis be? Why?
For some stylised (not necessarily accurate) facts about the Bay Area see here.
1.4.3. Combining Data sets
• Map 3: Plot the natural logarithm of price (ln of price) of Airbnbs in the San Francisco Bay
Area together (point plot) with one of your chosen socio-economic variables of interest
at zipcode level using ggplot or tmap or mapsf (polygon plot). There are various ways of
doing this. The maps should be well-presented.
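For the price variable in Map 3, note that log() in R is the natural logarithm by default. A minimal sketch follows with hypothetical prices. Setting zero or missing prices to NA is an assumption here, since the logarithm is undefined at zero.

```r
# Hypothetical nightly prices, including a zero and a missing value.
prices <- c(80, 150, 0, 420, NA)

# The natural log is only defined for positive values, so flag the valid ones.
valid <- !is.na(prices) & prices > 0

# Compute ln(price) where valid and leave NA elsewhere.
ln_price <- rep(NA_real_, length(prices))
ln_price[valid] <- log(prices[valid])
```

Taking logs compresses the long right tail of prices, so a colour scale on the points is less dominated by a few very expensive listings.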
Questions to answer within your analysis: Comment on the details of your map and analyse the
results. Does this map tell you more about the relationship between Airbnb location/price and
your socio-economic variable of choice? Explain your answer.
Part 2 – Choose your own analysis
Please note: This part of the assignment can be done on the Bay Area as a whole or you can
zoom in on one of the counties. For example, you could just focus on San Francisco.
2.1. Discuss which potential raster data set could add to or improve your analysis in a maximum of
300 words. We have looked at various ones in class. You do not need to obtain the data, just
discuss it. You can also look for other ideas at earthdata.nasa.gov.
2.2. Query OpenStreetMap data (for example bars, restaurants, subway stations)
2.1.1 Choose an amenity to query in OpenStreetMap (for example bars, restaurants, subway
stations). Source your amenity of choice and save the data.
• Map 4: Create a heatmap of your amenity of choice and analyse it. The maps should be
well-presented.
2.1.2 Create buffers around your chosen amenity. Find out which Airbnbs are 200 metres (or
less) from your amenity of choice. How many Airbnbs are within this spatial range? Would
this help you decide where to choose an Airbnb if you were going to San Francisco?
Justify by referring to the opportunities and limitations of OSM data.
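The 200-metre question can be checked without any spatial package by computing great-circle (haversine) distances directly from longitude and latitude. The base-R sketch below swaps that technique in for geometric buffers purely for illustration. In the essay itself, buffering on a projected CRS with a spatial package such as sf would be the more usual route. All coordinates and names are hypothetical, and the Earth radius of 6,371 km is the usual spherical approximation.

```r
# Great-circle distance in metres between lon/lat points (haversine formula).
haversine_m <- function(lon1, lat1, lon2, lat2, radius = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Hypothetical coordinates, for illustration only.
airbnb <- data.frame(
  id  = 1:3,
  lon = c(-122.4194, -122.4180, -122.4000),
  lat = c(37.7749, 37.7755, 37.7900)
)
amenity <- data.frame(
  name = c("bar_a", "bar_b"),
  lon  = c(-122.4190, -122.2712),
  lat  = c(37.7750, 37.8044),
  stringsAsFactors = FALSE
)

# Distance matrix: one row per listing, one column per amenity.
dist_m <- outer(
  seq_len(nrow(airbnb)), seq_len(nrow(amenity)),
  function(i, j) haversine_m(airbnb$lon[i], airbnb$lat[i],
                             amenity$lon[j], amenity$lat[j])
)

# A listing qualifies if its nearest amenity is 200 m away or less.
airbnb$near_amenity <- apply(dist_m, 1, min) <= 200
n_near <- sum(airbnb$near_amenity)
```

The full distance matrix grows with listings × amenities, so for the real data a spatial index or a package-based buffer and intersection is likely to be faster.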
2.3 Descriptive Spatial Analysis. You need to pick one of the following three options. Only one,
and make the most of it. You must include one map (Map 5) to support your analysis.
Option 1: Smoothing & Interpolation (IDW, Heatmaps or Point Patterns of Airbnb or OSM data)
• Choose which data to focus on. If you use OSM data it must be different from the amenity
chosen in 2.1.1.
• Visualise the dataset appropriately and discuss why you have taken your specific
smoothing or interpolation approach
• How did you define nearest neighbours? Distance and other parameters?
• What do the clusters help you learn about areas of interest in the city? How are clusters
distributed geographically? What are the main characteristics of each cluster? Can you
identify some groups concentrated on particular areas?
• In what research contexts would your chosen research approach be useful? What would
you advise city decision-makers from your findings?
Option 2: Network Analysis or Routing
• For this option, you can either choose to calculate routing between different points from
data you are already working with in your project by using the R package sfnetworks or
you can download some trip data from
https://data.sfgov.org/browse?category=Transportation&page=2
• Visualise the dataset appropriately and discuss why you have taken your specific
network analysis or routing approach
• Report your travel time findings and minimum and maximum distances to relevant
locations OR create an origin-destination matrix stating the number of trips and average
duration and explore the spatial distribution of your trips.
• In what research contexts would calculating travel times or inspecting a travel network
be useful? What would you advise city decision-makers from your findings?
Option 3: Plotting relationships between Spatial Variables
• For this option you will be using the Inside Airbnb price data together with four socio-
economic variables of your choice.
• Visualise the datasets and discuss why you have chosen your four socio-economic
variables of interest.
• Create a minimum of four scatter plots between average price and 2-4 socio-economic
variables of interest and/or a cross-correlation matrix of your variables.
• What do these relationships help you learn about areas of interest in the city?
• In what research contexts would your chosen research approach be useful? What would
you advise city decision-makers from your findings? Feel free to play around with more
variables if you think it can support/enhance your findings.
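For Option 3, a cross-correlation matrix can be obtained in base R with cor(). The sketch below assumes one row per zipcode. All values are hypothetical, and pct_degree is an illustrative variable name rather than one supplied with the assessment data.

```r
# Hypothetical zipcode-level summaries, for illustration only.
zips <- data.frame(
  avg_price  = c(120, 180, 150, 240, 200),
  hh_income  = c(62000, 88000, 79000, 125000, 97000),
  pct_degree = c(30, 45, 40, 60, 50)
)

# Pearson correlations between every pair of columns. Pairwise deletion
# keeps as many zipcodes as possible when some values are missing.
cor_mat <- cor(zips, use = "pairwise.complete.obs")

# Rounded copy for reporting in a table.
cor_table <- round(cor_mat, 2)
```

A scatter plot for any single pair can be produced with plot(zips$hh_income, zips$avg_price). Remember that correlation between zipcode averages does not by itself describe individual listings.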
Resources to help you. See also suggested bibliography in slides throughout the course.
• https://www.r-bloggers.com/2017/11/programming-meh-lets-teach-how-to-write-computational-essays-instead/
• https://rmarkdown.rstudio.com/
• https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf
• https://vizual-statistix.tumblr.com/post/114850050736/i-find-the-spread-of-airbnb-to-be-as-fascinating
• https://carto.com/blog/airbnb-impact/
• https://cran.r-project.org/web/packages/biscale/vignettes/biscale.html
Appendix
American Community Survey (ACS) 2016-2020, US Census Bureau. Observations: 1039; Variables:
472; Years: 2016-2020
Variable                  Description
B19013_001E               Median household income in the past 12 months (in 2020 inflation-adjusted dollars). Coded as hh_income.
B02001 (list of vars)     Population by race. I have already recoded black (number of black people) and all_ppl_race (total population by census tract).
B23006 (list of vars)     Population by education
C15002A (list of vars)    Population by Sex by Education
C27012 (list of vars)     Population by Health insurance
B08006 (list of vars)     Commuting variable
B09010 (list of vars)     Supplementary income variables
B09019 (list of vars)     Household type counts
B17001 (list of vars)     Poverty Status
B28011 (list of vars)     Internet Access
B99084 (list of vars)     Work From Home
For every entry marked (list of vars), see https://api.census.gov/data/2020/acs/acs5/variables.html