Part 0 (10 Points):
You get the full 10 points if your notebook runs from start to finish without errors.
HINT: Runtime -> Disconnect and delete runtime, then Runtime -> Run all.

Part 1: (60 Points [3 sections])
NYPD Crime Reports Dataset
The goal is to analyze the dataset for frequencies across different timeframes to address the questions listed below, using all the valid felony, misdemeanor, and violation crimes reported to the New York City Police Department (NYPD) from 2006 through the end of 2016.
The dataset is available at: https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i
Attribute Description
Below are the descriptions of the attributes and their data types.

Column | Description | Data Type
CMPLNT_NUM | Randomly generated persistent ID for each complaint | bigint(20)
CMPLNT_FR | Exact datetime of occurrence for the reported event (or starting date of occurrence, if CMPLNT_TO_DT exists) | datetime
CMPLNT_TO | Ending datetime of occurrence for the reported event, if the exact time of occurrence is unknown | datetime
RPT_DT | Date the event was reported to police | date
KY_CD | Three-digit offense classification code | char(3)
PD_CD | Three-digit internal classification code (more granular than Key Code) | char(3)
PD_DESC | Description of internal classification corresponding with the PD code (more granular than Offense Description) | varchar(60)
CRM_ATPT_CPTD_CD | Indicator of whether the crime was successfully completed or attempted but failed or was interrupted prematurely | enum('COMPLETED','ATTEMPTED')
LAW_CAT_CD | Level of offense: felony, misdemeanor, violation | enum('FELONY','MISDEMEANOR','VIOLATION')
JURIS_DESC | Jurisdiction responsible for the incident; either internal, like Police, Transit, and Housing, or external, like Correction, Port Authority, etc. | varchar(80)
BORO_NM | Name of the borough in which the incident occurred | varchar(20)
ADDR_PCT_CD | Precinct in which the incident occurred | char(3)
LOC_OF_OCCUR_DESC | Specific location of occurrence in or around the premises: inside, opposite of, front of, rear of | enum('FRONT OF','INSIDE','OPPOSITE OF','OU…
PREM_TYP_DESC | Specific description of the premises: grocery store, residence, street, etc. | varchar(80)
PARKS_NM | Name of NYC park, playground, or greenspace of occurrence, if applicable (state parks are not included) | varchar(80)
HADEVELOPT | Name of NYCHA housing development of occurrence, if applicable | varchar(80)
X_COORD_CD | X-coordinate for the New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104) | char(9)
Y_COORD_CD | Y-coordinate for the New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104) | char(9)
Latitude | Latitude coordinate for the Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326) | double
Longitude | Longitude coordinate for the Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326) | double

Additional Notes: https://www1.nyc.gov/assets/nypd/downloads/pdf/analysis_and_planning/incident_level_data_footnotes.pdf
# Install the database drivers and SQL libraries, if they are not already installed
!sudo apt-get install python3-dev libmysqlclient-dev > /dev/null
!pip install mysqlclient > /dev/null
!sudo pip3 install -U sql_magic > /dev/null
!pip install psycopg2-binary  > /dev/null
!pip install -q geopandas sqlalchemy > /dev/null
MySQL Database
For your convenience, the dataset is stored in the nypd database on the MySQL server jsedocc7.scrc.nyu.edu . You can connect and see the dataset using your standard username and password ( student , 0b+5bJWvXoA ) using the following code:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import pandas as pd
from sqlalchemy import create_engine
import geopandas as gpd
import seaborn as sns

conn_string = 'mysql://{user}:{password}@{host}:{port}/{db}?charset=utf8'.format(
    user='students',
    password='0b+5bJWvXoA',
    host='jsedocc7.scrc.nyu.edu',
    port=3306,
    db='nypd'
)
engine = create_engine(conn_string)

# Prepare the sql_magic library, which lets us query the database easily
%reload_ext sql_magic
%config SQL.conn_name = 'engine'
%%read_sql
show tables;

Query started at 03:59:42 AM UTC; Query executed in 0.02 m

   Tables_in_nypd
0  assault
1  felony_info
2  gl
3  glm
4  gurglary
5  jurisdiction_codes
6  murder
7  newTable
8  nypd
9  nypd_n
10 offense_codes
11 penal_codes
12 rape
13 roberry

%%read_sql
SELECT * FROM offense_codes;

Query started at 04:00:02 AM UTC; Query executed in 0.01 m

   KY_CD  OFNS_DESC
0  101    MURDER & NON-NEGL. MANSLAUGHTER
1  102    HOMICIDE-NEGLIGENT-VEHICLE
2  103    HOMICIDE-NEGLIGENT,UNCLASSIFIE
3  104    RAPE
4  105    ROBBERY
...    ...    ...
69 676    NEW YORK CITY HEALTH CODE
70 677    OTHER STATE LAWS
71 678    MISCELLANEOUS PENAL LAW
72 685    ADMINISTRATIVE CODES
73 881    OTHER TRAFFIC INFRACTION

74 rows × 2 columns
%%read_sql
SELECT * FROM nypd LIMIT 10;

Query started at 04:00:06 AM UTC; Query executed in 0.01 m
[Output: the first 10 rows of the nypd table, 10 rows × 22 columns]

%%read_sql
SELECT * FROM nypd INNER JOIN offense_codes ON nypd.KY_CD = offense_codes.KY_CD LIMIT 10;
Query started at 04:00:10 AM UTC; Query executed in 0.01 m
[Output: 10 rows × 24 columns — the nypd columns plus KY_CD and OFNS_DESC from offense_codes]
You are asked to analyze the dataset and provide answers to the questions below. You can answer the questions using Python, SQL, or a combination of the two. You are strongly encouraged, though, to do most of the aggregation with SQL: fetching the whole dataset into memory and doing all the aggregation in Pandas may cause significant problems, as you may run out of memory.
Questions
The NYPD refers to the following types of felonies as the "big-7" crimes, as they tend to affect quality of life the most:
GRAND LARCENY
ROBBERY
FELONY ASSAULT
BURGLARY
GRAND LARCENY OF MOTOR VEHICLE
RAPE
MURDER & NON-NEGL. MANSLAUGHTER
Question 1: Analysis of the "Big-7" felonies (20 points)
Focus on the big-7 felonies. Report the number of these crimes over time, from 2006 through 2016, broken down by type of felony. Report the total number of crimes per category on a per-month and on a per-year basis, and generate the associated plot.
Hint 1: The type of felony is included in the OFNS_DESC column. You can use the IN operator in SQL to limit your results to these offenses, or use an OR clause. Alternatively, you can use the .isin() command in Pandas.
Hint 2: If you would like to move the legend outside the plot, you can use the following command: df.plot(...).legend(bbox_to_anchor=(1, 0.75)) . The two values of the bbox_to_anchor parameter describe the x/y location of the legend relative to the plot. Experiment with different values to understand the effect.
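Below is a minimal sketch of one way to approach the yearly breakdown; it is not the required solution. It assumes the big-7 names match the OFNS_DESC values listed above and that CMPLNT_FR is a datetime column; a per-month version can additionally group by MONTH(CMPLNT_FR).

# Sketch only (assumptions: OFNS_DESC values match the big-7 list above,
# and CMPLNT_FR is a datetime column). Counts big-7 felonies per year.
big7 = ("'GRAND LARCENY', 'ROBBERY', 'FELONY ASSAULT', 'BURGLARY', "
        "'GRAND LARCENY OF MOTOR VEHICLE', 'RAPE', 'MURDER & NON-NEGL. MANSLAUGHTER'")

query = f"""
    SELECT YEAR(CMPLNT_FR) AS year, OFNS_DESC, COUNT(*) AS complaints
    FROM nypd INNER JOIN offense_codes ON nypd.KY_CD = offense_codes.KY_CD
    WHERE OFNS_DESC IN ({big7})
      AND YEAR(CMPLNT_FR) BETWEEN 2006 AND 2016
    GROUP BY year, OFNS_DESC
"""
df_big7 = pd.read_sql(query, con=engine)

# Pivot so each felony type becomes a column, then plot one line per felony
pivot = df_big7.pivot(index='year', columns='OFNS_DESC', values='complaints')
pivot.plot(figsize=(12, 6)).legend(bbox_to_anchor=(1, 0.75))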
Question 2: Analyze the Big-7 felonies by hour of the day. (20 points)
Next, break down the different big-7 felonies by hour of day.
Question 2a: Show the number of big-7 felonies, broken down by hour of day (e.g., 6049 reported robberies at 5am). Generate a bar plot with the results.
Question 2b: Show the number of big-7 felonies, broken down by hour of day. To allow for easier comparison among felonies, normalize the data by dividing by the total number of incidents for each type of felony (e.g., 2.8% of the reported robberies happen at 5am). Generate a bar plot with the results.
Hint 1: If you have a datetime column in your dataframe (say df.CMPLNT_FR ), you can extract the day of the week and the hour using the following commands:

  categs = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
  df['DAY_OF_WEEK'] = pd.Categorical(df.CMPLNT_FR.dt.day_name(), categories=categs, ordered=True)
  df['HOUR'] = df.CMPLNT_FR.dt.hour

These commands create two new columns, DAY_OF_WEEK and HOUR . DAY_OF_WEEK is encoded as an ordered categorical variable so that it sorts properly; HOUR is a numeric variable. Alternatively, you can use the HOUR() and DAYOFWEEK() functions in SQL to extract these values.
Hint 2: When creating plots, you may prefer to extract the plot for each column of the dataframe into a separate plot. The subplots=True option of the df.plot() command provides that capability (see the related documentation). If you use subplots=True , you can also use the layout=(num_rows, num_columns) option to specify the number of rows ( num_rows ) and the number of columns ( num_columns ) of the grid arrangement.
Hint 3: When normalizing data and generating plots, you may find the .T command very handy (shorthand for .transpose() ). The .T command returns the transpose of the dataframe, i.e., a dataframe where the columns become rows and vice versa.
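A minimal sketch of the per-hour breakdown and the normalization, assuming a hypothetical dataframe df_hours with columns HOUR, OFNS_DESC, and complaints (e.g., produced by a GROUP BY HOUR(CMPLNT_FR), OFNS_DESC query similar to the one sketched for Question 1):

# Sketch only; df_hours is an assumed dataframe with columns
# ['HOUR', 'OFNS_DESC', 'complaints'].
counts = df_hours.pivot(index='HOUR', columns='OFNS_DESC', values='complaints')

# Question 2a: raw counts per hour, one subplot per felony type
counts.plot.bar(subplots=True, layout=(4, 2), figsize=(14, 10), legend=False)

# Question 2b: divide each felony type by its total, so that each column sums to 1
normalized = counts / counts.sum()
normalized.plot.bar(subplots=True, layout=(4, 2), figsize=(14, 10), legend=False)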
Question 3: Spatial Analysis (20 points)
You will analyze the spatial density for the following types of felonies:
GRAND LARCENY
GRAND LARCENY OF MOTOR VEHICLE
FELONY ASSAULT
MURDER & NON-NEGL. MANSLAUGHTER
Use the Latitude and Longitude variables from the dataset to identify the location of each crime. (FYI, for victim privacy, reported rapes do not have geocoding variables associated with them.) Limit your analysis to Manhattan only (see the BORO_NM attribute), to felonies, and to dates between 2006 and 2016 (inclusive on both ends).
Hint 1: Remember that if you do a 2D density estimation, you can speed up the computation by taking just a sample of the data that you analyze, using the dataframe.sample() command.
Hint 2: If you would like to plot a map of Manhattan under your density plot, you can use the shapefile of NYC neighborhoods from NYC Open Data and keep only the Manhattan neighborhoods. See the code below.
# Dataset from NYC Open Data: https://data.cityofnewyork.us/City-Government/Neighborhood-Tabulation-Areas/cpf4-rkhq
shapefile = 'https://data.cityofnewyork.us/api/geospatial/cpf4-rkhq?method=export&format=GeoJSON'
# Load the shapefile
df_nyc = gpd.GeoDataFrame.from_file(shapefile)
# Limit the data to only Manhattan neighborhoods 
df_manhattan = df_nyc.query( "boro_name =='Manhattan' ")
# Create a plot of the Manhattan neighborhoods
manhattan_plot = df_manhattan.plot(linewidth=0.5, color='White', edgecolor='Black', figsize=(15, 10))

# If you want to plot on top of the Manhattan map,
# you need to recreate the Manhattan plot in each cell
manhattan_plot = df_manhattan.plot(linewidth=0.5, color='White', edgecolor='Black', figsize=(15, 10))
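A minimal sketch of the density plot, assuming a hypothetical dataframe df_geo that holds the Longitude/Latitude of the selected felonies in Manhattan (fetched with a query similar to the ones above); seaborn's kdeplot is one way to do the 2D density estimation:

# Sketch only; df_geo is an assumed dataframe with Longitude/Latitude columns
# for the selected felonies in Manhattan between 2006 and 2016.
sample = df_geo.sample(n=min(len(df_geo), 10_000), random_state=0)

# Redraw the Manhattan basemap in this cell, then overlay the density estimate
ax = df_manhattan.plot(linewidth=0.5, color='White', edgecolor='Black', figsize=(15, 10))
sns.kdeplot(x=sample['Longitude'], y=sample['Latitude'], fill=True, cmap='Reds', alpha=0.5, ax=ax)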
Part 2 (30 points)
Yelp + NYC Restaurant Inspection Data
In this question, we want to understand the relationship between inspections and restaurant ratings on Yelp. This requires us to get the two data sources and merge them together.
Go to https://www.yelp.com/developers/ , create an app, and get the key:

import requests

key = 'iI4dsdXt-....'  # <- replace with your key
auth_header = {'Authorization': 'Bearer ' + key}

Use the "Business Search" part of the Yelp API. See https://www.yelp.com/developers/documentation/v3/business_search for the documentation.
url = 'https://api.yelp.com/v3/businesses/search'
# We search for restaurants within 2 miles of Stern
# Check the documentation for other parameters
parameters = {
    "location": "44 West 4th Street, New York, NY 10012",
    "radius": int(2*1609.34),
    "sort_by": "distance",
    "limit" : 50
}
# Issue the authenticated request
resp = requests.get(url, headers=auth_header, params=parameters)
data = resp.json()
You will need to create a Pandas DataFrame from data['businesses'] . While you could create the DataFrame using df = pd.DataFrame(data['businesses']) , the format is nicer if you use df = pd.json_normalize(data['businesses']) . The json_normalize command parses the nested JSON response and creates a flat dataframe. For more documentation see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html.
After merging the two sources, you should have the restaurant name, address, phone number, and zipcode.
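As a minimal sketch, the flattened column names below (name, phone, rating, location.address1, location.zip_code) are assumptions about what the Business Search response normalizes to; verify them against your own df.columns before relying on them:

# Sketch only; the selected column names are assumptions about the flattened Yelp response.
df_yelp = pd.json_normalize(data['businesses'])
df_yelp = df_yelp[['name', 'phone', 'rating', 'location.address1', 'location.zip_code']]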
# ADD YOUR CODE HERE
!curl 'https://data.cityofnewyork.us/api/views/43nn-pn8j/rows.csv?accessType=DOWNLOAD' -o restaurant.csv
import pandas as pd
restaurants = pd.read_csv("restaurant.csv", 
                          encoding='utf_8', 
                          dtype = 'unicode',
                          parse_dates = True,
                          infer_datetime_format = True,
                          low_memory=False)
# ADD YOUR CODE TO MERGE HERE
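One possible merge strategy, sketched below, matches on phone numbers; it assumes df_yelp has a phone column in the '+1XXXXXXXXXX' format and that the inspection file has a PHONE column holding the digits only. Matching on name and zipcode is an alternative; this is an illustration, not the required approach.

# Sketch only; a phone-number-based merge (column names are assumptions).
restaurants['phone_clean'] = '+1' + restaurants['PHONE'].astype(str).str.replace(r'\D', '', regex=True)
merged = df_yelp.merge(restaurants, left_on='phone', right_on='phone_clean', how='inner')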
Finally, evaluate the relationship between the Yelp rating and the inspection results.
You may want to use 10 different locations.
Note that this is an open-ended question, so you are expected to explore it.
# ADD YOUR CODE HERE
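A minimal sketch of one possible exploration, assuming the merged dataframe from the previous step and that the inspection file's SCORE column holds the numeric inspection score (lower is better):

# Sketch only; compares the average inspection score per Yelp rating.
merged['SCORE'] = pd.to_numeric(merged['SCORE'], errors='coerce')
merged.groupby('rating')['SCORE'].mean().plot.bar(xlabel='Yelp rating', ylabel='Average inspection score')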