SSCI 574 Project 2 – Exploratory Spatial Data Analysis & Multiple
Linear Regression
Due Date: Sunday, March 13th @ 11:59 pm Pacific Time
Submit Project 2 into the corresponding assignment link on D2L
Value: 7% of the course grade
Penalty for late delivery: 2-point deduction for submissions up to 4 days late; no points will
be given for submissions more than 4 days late.
The purpose of this project is for you to apply the concepts and skills you have learned to
explore the datasets and questions in spatial economics that interest you. Having done some
preliminary research in Project 1, you should now identify available datasets that suit the
scope and scale of your area of interest and be ready to dive into some analysis.
In this project, you will identify and import data of your own interest in spatial economics
into R and conduct exploratory data analysis, exploratory spatial data analysis (i.e. spatial
weights and global spatial autocorrelation), and multiple linear regression.
Read through the entire document first. Next, if you have not already done so, go through the
hands-on R practices from previous weeks to become familiar with the libraries, functions, and
arguments required in R to complete this project.
Learning Objectives
• To identify available spatial datasets for investigating the spatial economic topic area
of your interest
• To explore spatial autocorrelation using global Moran’s I and the Moran scatterplot
• To conduct multiple linear regression, including diagnostics and estimation
• To interpret the outputs of spatial autocorrelation and multiple linear regression
Project Description
This project extends your topic of interest into practical exercises in spatial and
statistical analysis in R. To complete it, follow the instructions below:
1. From your chosen spatial economic research topic and variables in Project 1, identify
available datasets appropriate for investigation with spatial statistics. Import the data
into R. Start by focusing on the main variable that you are interested in.
Consider the spatial extent and unit of analysis so the data size is not too large to
manage (e.g. between 50 and 500 units is ideal). More units are okay; they will just
take more time to process.
Often your spatial location data (e.g. county boundaries) and attributes (e.g.
employment rates) come from different sources and need to be joined
together before you run spatial autocorrelation. Some might find it easier to keep attributes as
a .csv and import it into R as a data frame for regression purposes; in this case, you can
merge the spatial and non-spatial data in R when you need them together. Alternatively,
you can pre-process your data in ArcGIS, joining them together (as a shapefile)
before importing into R.
For importing shapefiles, use readOGR( ) in the rgdal package. Use ??readOGR to open
the Help file in RStudio. If your data is not projected, you will have to retrieve the
geographic coordinates from the polygons and then use the spTransform method in the rgdal
library. If your non-spatial data contains latitude and longitude, you can use read.csv( )
or read.table( ) to import the non-spatial data first, then make your data spatial by
creating a Spatial* object (see the R handout in Week 5 for how to promote the data to a
spatial object).
For any remaining questions about data import, search online resources first (e.g.
https://rdocumentation.org) and post your questions/issues on the Discussion Forum on
Blackboard if you still have problems. It is fine to discuss your projects with your
classmates; in fact, you will find it beneficial to learn from each other.
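The merge of spatial and non-spatial data mentioned in step 1 can be sketched in base R as follows. This is only an illustration under stated assumptions: the key column GEOID, the column names, and the employment-rate values are made up, and the units table merely stands in for the attribute table of an imported shapefile.

```r
# Hypothetical attribute table, as if read with read.csv("attributes.csv").
# The emp_rate values are invented for illustration only.
attrs <- data.frame(
  GEOID    = c("06037", "06059", "06073"),
  emp_rate = c(0.61, 0.64, 0.59)
)

# Hypothetical table standing in for the attribute table of a shapefile.
units <- data.frame(
  GEOID = c("06037", "06059", "06073", "06111"),
  NAME  = c("Los Angeles", "Orange", "San Diego", "Ventura")
)

# Left join: keep every spatial unit, even those without attribute data.
joined <- merge(units, attrs, by = "GEOID", all.x = TRUE)
print(joined)

# List the units that found no match, so missing data is caught early.
unmatched <- joined$NAME[is.na(joined$emp_rate)]
print(unmatched)
```

Keeping the key as character text (not numeric) preserves leading zeros in codes such as county identifiers, which is a common reason joins silently fail.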
2. Explore the distribution of your imported data by conducting exploratory data analysis
(EDA) in R. Run descriptive statistics to provide at least: sample size, minimum, mean,
median, maximum, and standard deviation, and make a scatterplot, a histogram, and a
boxplot for your main variable(s). Doing all of the EDA described here for one main
variable is sufficient. More (e.g. running EDA for both variables whose association you
want to study) is fine as well. We practice EDA here because you should always
examine and know your data well before any statistical or spatial analysis. Consider a
transformation if the data is not normally distributed, and show its normality after
the transformation (e.g. taking ln() if highly right-skewed).
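A minimal EDA sketch for step 2 in base R might look like the following. The data are simulated, and the variable names income and rent are hypothetical stand-ins for your own variables; the log transform is shown only because the simulated variable is right-skewed by construction.

```r
set.seed(574)

# Simulated right-skewed variable standing in for your main variable,
# plus a second simulated variable for the scatterplot.
income <- rlnorm(200, meanlog = 10.8, sdlog = 0.6)
rent   <- 0.02 * income + rnorm(200, sd = 300)

# Descriptive statistics requested in step 2.
eda <- c(n = length(income), min = min(income), mean = mean(income),
         median = median(income), max = max(income), sd = sd(income))
print(round(eda, 1))

# Scatterplot, histogram, and boxplot of the raw data.
plot(income, rent, main = "Rent vs. income")
hist(income, main = "Histogram of income", xlab = "income")
boxplot(income, main = "Boxplot of income")

# Natural-log transform, then re-check the shape of the distribution.
log_income <- log(income)
hist(log_income, main = "Histogram of ln(income)", xlab = "ln(income)")
qqnorm(log_income)
qqline(log_income)
print(shapiro.test(log_income))  # formal normality test on transformed data
```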
3. Explore your spatial dataset by conducting exploratory spatial data analysis (ESDA),
specifically Moran’s I and the Moran scatterplot, for your main variable(s). You will
build a spatial weights matrix first, then apply it to compute global Moran’s I and draw
the Moran scatterplot. When you run Moran’s I, use the Monte Carlo approach. You
may run other ESDA to learn about your data, such as kernel density estimation
(KDE) if you have a point dataset, but it is not required in this project.
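The ESDA workflow in step 3 can be sketched with the spdep package as below. The assumptions are significant: a 10 x 10 grid built with cell2nb( ) stands in for the polygons you would import, queen contiguity and row-standardized weights are just one possible choice, and the variable x is simulated with a built-in spatial trend.

```r
library(spdep)
set.seed(574)

# Hypothetical study area: a 10 x 10 grid (100 units) with queen contiguity.
nb <- cell2nb(10, 10, type = "queen")
summary(nb)  # neighbor list object details

# Row-standardized spatial weights.
lw <- nb2listw(nb, style = "W")

# Simulated variable with a spatial trend (row index + column index + noise).
grid <- expand.grid(a = 1:10, b = 1:10)
x <- grid$a + grid$b + rnorm(100)

# Visualize the neighbor links on the grid coordinates.
coords <- as.matrix(grid)
plot(nb, coords)

# Global Moran's I with a Monte Carlo (permutation) test.
mc <- moran.mc(x, lw, nsim = 999)
print(mc)

# Moran scatterplot: each value against its spatially lagged value.
moran.plot(x, lw, labels = FALSE)
```

With your own polygons, the neighbor list would typically come from poly2nb( ) applied to the imported spatial object, as in the class demo.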
4. Run a standard linear regression to investigate the association among the variables in
your topic of interest using the lm( ) function. The number of independent variables can vary,
but make sure that your final model contains only explanatory variables whose
partial coefficients are statistically significant. If you decide to keep insignificant
explanatory variable(s) in your OLS regression, you will have to justify your decision in
the report.
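One possible way to approach step 4 is sketched below in base R. All data are simulated, and the variable names (income, educ, unemp, noise_var) are hypothetical; noise_var is generated independently of the response, so it will usually, though not always, turn out insignificant.

```r
set.seed(574)
n <- 150

# Simulated explanatory variables; noise_var is unrelated to the response.
educ      <- rnorm(n, mean = 13, sd = 2)
unemp     <- rnorm(n, mean = 6, sd = 1.5)
noise_var <- rnorm(n)

# Simulated response that depends on educ and unemp only.
income <- 20 + 3 * educ - 1.5 * unemp + rnorm(n, sd = 4)
dat <- data.frame(income, educ, unemp, noise_var)

# Full model: inspect the partial coefficients and their p-values.
full <- lm(income ~ educ + unemp + noise_var, data = dat)
print(summary(full))

# Reduced model without the candidate variable.
reduced <- lm(income ~ educ + unemp, data = dat)
print(summary(reduced))

# Partial F-test comparing the reduced and full models.
print(anova(reduced, full))

# Standard diagnostic plots (residuals, Q-Q, scale-location, leverage).
par(mfrow = c(2, 2))
plot(reduced)
par(mfrow = c(1, 1))
```

Whether to drop a variable should rest on the p-values you actually observe and on your research question, not on this simulated outcome.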
5. Write a report that includes the following items:
a. Introduction (1pt): A brief description of the spatial economic topic you are interested in
and your research questions, your chosen variables (what they are, their spatial granularity,
units, and extent), and the datasets and their sources.
b. Exploratory Data Analysis (1pt): R code, the resulting tables/plots, and a short
paragraph describing the data distribution (i.e. central tendency and dispersion) and whether
a data transformation was applied in light of that distribution.
c. Exploratory Spatial Data Analysis (2pts): R code, the resulting displays, and 1-2
paragraphs describing and interpreting the results. Here your results should
consist of the neighbor list object details, a visualization of your spatial weights object,
the Moran’s I results, and the Moran scatterplot. Describe what each of these
results tells you about your data.
d. Standard linear regression (2pts): R code, the results, and a paragraph that
interprets the results.
e. Reflection (1pt): A short paragraph reflecting on the experience you had
working on this project. What did you find easy? What did you find challenging?
What questions do you still have after completing the project? What adjustments
might you consider, either to the data or the operations, to improve your experience?
Deliverables
Submit a project report with the components requested above as a Word document by the
due date. Include the class number (SSCI 574), semester (Spring 2022),
project number, title, and your name. Save your Project 2 report document as
Project2_[YourLastName].docx and submit it via the appropriate assignment link in D2L.
Additional Resources I: Data Hubs
Below is a list of commonly used data hubs for your reference. If you have a hard time finding
appropriate datasets, you may consider the following sources and adopt the datasets
mentioned here for use in your project. USC Visualization Librarian Andy Rutkowski also
mentioned several databases and programs that contain spatial datasets of various scales that
might suit your needs.
1. City of Los Angeles GeoHub: https://geohub.lacity.org. Datasets you may consider
include, but are not limited to, the Los Angeles index of displacement pressure and traffic
collision or traffic accident data.
2. COVID-19 GIS Hub: https://coronavirus-resources.esri.com. If you are interested in
understanding the impact of COVID-19 on the social and economic aspects of our lives, you
might find this data hub useful. Additionally, as I want you to make a story map for the final
presentation that combines the analysis and information from all of your projects this
semester, you might also check out how Esri utilizes its ArcGIS Story Map to tell the
story of its work on COVID-19
(https://www.esri.com/about/newsroom/blog/gis-to-achieve-equitable-speedy-vaccine-distribution/).
3. The U.S. Census: The Census Bureau not only offers spatial data (TIGER/Line data), but also
provides various socio-economic and demographic variables that are surveyed every year
at various census administrative levels and that you can download for use (e.g. American
Community Survey 5-year estimates). Use the advanced data search to find the variables
at the right scale (spatial resolution) and spatial extent to answer the
questions you ask. https://data.census.gov/cedsci/advanced
4. IPUMS: https://ipums.org. As a part of the Institute for Social Research and Data
Innovation at the University of Minnesota, IPUMS provides census and survey data
from the U.S. and around the world. IPUMS integrates census-type data to make it
easy to study and research. You may also want to check the ‘ABOUT’ tab if you are looking for
data analysis jobs in the near future
(https://www.ipums.org/about/jobs).
Additional Resources II: Creating a neighbor list object for point data
Suppose your data contains three columns: latitude, longitude, and the average math score
of schools in one district. We can import this data
(.csv), transform/promote it to a spatial object, and assign it the WGS84 datum:
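A minimal sketch of this step, assuming the sp package, is shown below. The column names and all values are hypothetical; in practice the data frame would come from read.csv( ) on your own file.

```r
library(sp)

# Hypothetical school data, as if read with read.csv("schools.csv").
# Coordinates and scores are invented for illustration only.
schools <- data.frame(
  latitude   = c(34.05, 34.10, 34.02, 34.15, 33.98, 34.08),
  longitude  = c(-118.25, -118.30, -118.40, -118.20, -118.35, -118.15),
  math_score = c(72, 65, 80, 58, 77, 69)
)

# Promote the data frame to a SpatialPointsDataFrame (x = longitude, y = latitude).
coordinates(schools) <- ~ longitude + latitude

# Declare geographic coordinates on the WGS84 datum.
proj4string(schools) <- CRS("+proj=longlat +datum=WGS84")

print(class(schools))
summary(schools)
```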
To create an object that describes the neighbor relationship for a point dataset, consider
using a different spatial relationship than the contiguity we used in the class demo when
constructing the spatial weights matrix. The code here shows you how to apply the k nearest
neighbor (knn) method:
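A minimal sketch using knearneigh( ) from the spdep package follows. The coordinates are hypothetical, and k = 2 is an arbitrary choice for such a small example; longlat = TRUE tells the function to treat the columns as longitude and latitude and use great-circle distances.

```r
library(spdep)

# Hypothetical point coordinates: longitude first, then latitude.
coords <- cbind(
  longitude = c(-118.25, -118.30, -118.40, -118.20, -118.35, -118.15),
  latitude  = c(34.05, 34.10, 34.02, 34.15, 33.98, 34.08)
)

# Find the k = 2 nearest neighbors of each point.
knn <- knearneigh(coords, k = 2, longlat = TRUE)

print(class(knn))
print(knn$nn)  # row i holds the indices of point i's nearest neighbors
```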
The resulting neighbor object is of class ‘knn’. You can then convert it into the more
generic neighbor object class ‘nb’ before converting that to a listw object, which serves as
the spatial weights matrix, using nb2listw( ):
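A minimal sketch of the conversion, again with hypothetical coordinates and an arbitrary k = 2, might look like this. Row-standardized weights (style = "W") are assumed here as one common choice.

```r
library(spdep)

# Hypothetical point coordinates: longitude first, then latitude.
coords <- cbind(
  longitude = c(-118.25, -118.30, -118.40, -118.20, -118.35, -118.15),
  latitude  = c(34.05, 34.10, 34.02, 34.15, 33.98, 34.08)
)

# knn object -> generic neighbor list (class "nb").
knn <- knearneigh(coords, k = 2, longlat = TRUE)
nb  <- knn2nb(knn)
summary(nb)

# Neighbor list -> spatial weights (class "listw"), row-standardized.
lw <- nb2listw(nb, style = "W")
print(class(lw))

# Visualize the neighbor links.
plot(nb, coords)
```

Note that k nearest neighbor relationships are not necessarily symmetric: point A can be among B's nearest neighbors without B being among A's.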