DATA2001-无代写
时间:2023-04-04
The University of Sydney Page 1
DATA2001: Data Science,
Big Data and Data Diversity
W4: Declarative Data Analysis
with SQL
Presented by Ali Anaissi
School of Computer Science
The University of Sydney Page 2
Overview of Last Week
The University of Sydney Page 3
Where Do We Get Data From?
– You or your organization might have it already, or a colleagues
provides you access to data.
– Typical exchange formats: CSV, Excel, XML/JSON
– Or: Download from an online data server
– Still typically in CSV or Excel etc, but now problems with meta-data
– Or: Scrap the web yourself or use APIs of resources
– Cf. textbook, chapter 9
The University of Sydney Page 4
Murray River Basin Data Set
– Water measurements:
– automatic monitoring stations
– that are distributed over a larger area
• (say the Murray River catchment basin)
– that periodically send their measured values to a central authority:
– Time-series data of water level, water flow, water temperature, salinity
(via measuring electric conductivity) or other hydraulic properties
The University of Sydney Page 5
Relational Databases
– Informal Definition:
A relation is a named, two-dimensional table of data
– Table consists of rows (record) and columns (attribute or field)
– Example:
Attributes (also: columns, fields)
Tuples
(rows,
records)
5312666
5366668
5309650
Jones
Smith
Jin
ajon1121@cs
smith@mail
ojin4536@it
Student
namesid login
m
m
f
123 Main St
45 George
19 City Rd
addressgender
The University of Sydney Page 6
This Week: RDBMS+Python and Data Analysis with SQL
– Today's goal is to store the data available as CSV files in a
relational database using Python
– and to use SQL to summarise and analyse that data
– Relational data model is the most widely used model today
– Main concept: relation, basically a table with rows and columns
– This sounds like a spreadsheet, but as we have seen last week,
it has some differences
The University of Sydney Page 7
SQL (Structured Query Language)
The University of Sydney Page 8
SQL – The Structured Query Language
– SQL is the standard declarative query language for RDBMS
– Describing what data we are interested in, but not how to retrieve it.
– Supported commands from roughly two categories:
– DDL (Data Definition Language)
• Create, drop, or alter the relation schema
• Example:
CREATE TABLE name ( list_of_columns )
– DML (Data Manipulation Language)
• for retrieval of information also called query language
• INSERT, DELETE, UPDATE
• SELECT … FROM … WHERE
The University of Sydney Page 9
Table Constraints and Relational Keys
– When creating a table, we can also specify Integrity Constraints for columns
– eg. domain types per attribute, or NULL / NOT NULL constraints
– Primary key: unique, minimal identifier of a relation.
– Examples include employee numbers, social security numbers, etc. This is how we
can guarantee that all rows are unique.
– Foreign keys are identifiers that enable a dependent relation (on the many side
of a relationship) to refer to its parent relation (on the one side of the
relationship)
– Must refer to a candidate key of the parent relation
– Like a `logical pointer’
– Keys can be simple (single attribute) or composite (multiple attributes)
The University of Sydney Page 10
SQL Domain Constraints
• SQL supports various domain constraints to restrict attribute to valid domains
• NULL / NOT NULL whether an attribute is allowed to become NULL (unknown)
• DEFAULT to specify a default value
• CHECK( condition ) a Boolean condition that must hold for every tuple
in the db instance
Example:
CREATE TABLE Student
(
sid INTEGER PRIMARY KEY,
name VARCHAR(20) NOT NULL,
gender CHAR CHECK (gender IN ('M,'F','T')),
birthday DATE,
country VARCHAR(20),
level INTEGER DEFAULT 1 CHECK (level BETWEEN 1 and 5)
);
The University of Sydney Page 11
Example: Relational Keys
sid
Student Enroll Units_of_study
31013 I2005 CR I2120 DB Intro 431013 John 120
Primary key identifies each tuple of a
relation.
Foreign key is a (set of) attribute(s) in one relation that
`refers’ to a tuple in another relation (like a `logical
pointer’).
name gradesid ucode titleucode credit_pts
Composite Primary Key consisting of
more than one attribute.
The University of Sydney Page 12
SQL DML Statements
– Insertion of new data into a table / relation
– Syntax:
INSERT INTO table [“(”list-of-columns“)”] VALUES “(“ list-of-expression “)”
– Example:
INSERT INTO Students (sid, name) VALUES (53688, ‘Smith’)
– Updating of tuples in a table / relation
– Syntax:
UPDATE table SET column“=“expression {“,”column“=“expression}
[ WHERE search_condition ]
– Example: UPDATE students
SET gpa = gpa - 0.1
WHERE gpa >= 3.3
– Deleting of tuples from a table / relation
– Syntax:
DELETE FROM table [ WHERE search_condition ]
– Example:
DELETE FROM Students WHERE name = ‘Smith’
The University of Sydney Page 13
DB Creation / Data Loading
from Python with Jupyter
The University of Sydney Page 14
Data Storing
– Data is ready?
– We have analysed our given data set
– Cleaned it
– Transformed it and created a corresponding relational database
– Next, we want to store the given data in our database.
– Main approaches:
1. Command line tools
2. Python loader
3. (Combination of Python loader and stored procedures)
The University of Sydney Page 15
Approach 1: PSQL Data Loader
– Postgresql 's psql provides a command to load data directly
from a CSV file into a database table
\COPY tablename FROM filename CSV [HEADER] [NULL '…']
– Many further options
– Try \help COPY
– Pros:
– Relatively fast and straight-forward
– No programming needed
– Cons:
– Only 1:1 mapping of CSV to tables; no data cleaning or transformation
– Stops at the first error…
The University of Sydney Page 16
– Use pgconnect() to create a connection conn
– Example: Creating a table and loading some data
– Define a string with the appropriate CREATE TABLE command
– Use conn.execute(string) to do this in the dbms
– Obtain data as dataframe df (eg df = pd.read_csv(file))
– Execute df.to_sql on conn
– Details covered in labs
– Pros: flexibility; use pandas data cleaning and transformation tools
– Cons: needs to be hand-coded
Approach 2: Python Loading Code
The University of Sydney Page 17
– Example: Creating a table and loading some data
– Pros: Full flexibility; data cleaning and transformation possible
– Cons: Has to be hand-coded for each case; requires good-quality CSV data
Approach 2: Python Loading Code
utility function which we will provide in the tutorial
The University of Sydney Page 18
Sidenote on Database Connections
– In order to connect Python with a database server, three things are needed:
– A database driver such as sqlalchemy and/or psycopg2
– The hostname and user login details to the database server
– Calling the database driver’s connect function with those parameters
– This results in a database connection, which is then the context for any further
SQL queries and database calls
– Eg. the ‘con=conn’ parameter in the data.to_sql() call on the prev slide
– Important: a database server supports only a limited number of connections
– Every database connection should be closed at the end of your program!
– Do not create multiple connections in parallel, e.g. by running more than one
Jupyter notebook in different browser tabs.
The University of Sydney Page 22
Recap Week 3 – Joins
The University of Sydney Page 23
JOIN: Querying Multiple Tables
– Often data that is stored in multiple different relations must be combined
– We say that the relations are joined
– FROM clause lists all relations involved in the query
– join-predicates can be explicitly stated in the where clause; do not forget it!
– Examples:
– Produces the cross-product Measurements x Stations (that is, every combination of
rows from the two tables)
SELECT *
FROM Measurements, Stations;
– Find the combinations of rows with filter to make sure that particular fields match
SELECT *
FROM Measurements, Stations
WHERE Measurements.stationId = Stations.stationId;
The University of Sydney Page 24
Joining: Combining Data from Multiple Tables
For example, the following query lists all stations belong to Vi c t or i a
Gove r nme nt
set search_path to WaterInfo;
SELECT *
FROM Stations, Organisations
WHERE Stations.orga = Organisations.code
AND Organisations.name = 'Victoria Government';
All tables accessed by the query are listed in the FROMclause, separated by
comma.
The University of Sydney Page 25
Recap from Wk3: Join Operator
SQL offers join operators to directly formulate joins.
- makes queries a bit better readable
Example: Same query than on previous slide using JOIN operator
SELECT *
FROM Stations JOIN Organisations ON (orga=code)
WHERE Organisations.name = 'Victoria Government';
The University of Sydney Page 26
Recap from Wk3: Natural Joins
A natural join between two tables combines those rows whose values agree in all
common attributes, ie. those that are named the same. These common columns are
also shown only once in the result.
SELECT *
FROM Measurements NATURAL JOIN Sensors;
Careful: If there is no common attribute in two tables; the query produces
a cross join, also called the cross product of both tables (undesired result).
The University of Sydney Page 27http://dieswaytoofast.blogspot.com.au/2013/05/sql-joins-visualized.html
The University of Sydney Page 28
SQL Subqueries
• A subquery is a query within a query. You can create subqueries within your SQL
statements. These subqueries can reside in the WHERE clause, the FROM clause,
or the SELECT clause.
SELECT *
FROM Measurements s
WHERE sensor IN ( SELECT sensor FROM Sensors );
SELECT *
FROM Measurements
WHERE sensor='temp’ AND obsvalue>( SELECT AVG(obsvalue)
FROM Measurements
WHERE sensor='temp' );
The University of Sydney Page 29
Summarising Data with SQL
The University of Sydney Page 30
Summarising a Database with SQL
– Many useful features to summarise data with SQL
– Complex filtering
– Data categorization and aggregation
– Grouping
– Ranking
– Etc.
The University of Sydney Page 31
Recap from Wk3: SQL Aggregate Functions
SQL Aggregate Function Meaning
COUNT(attr) ; COUNT(*) Number of Not-null-attr ; or of all values
MIN(attr) Minimum value of attr
MAX(attr) Maximum value of attr
AVG(attr) Average value of attr (arithmetic mean)
MODE() WITHIN GROUP (ORDER BY attr) mode function over attr
PERCENTILE_DISC(0.5) WITHIN GROUP
(ORDER BY attr)
median of the attr values
… …
SELECT AVG(obsvalue)
FROM Measurements
WHERE stationid = 409001 AND sensor = ‘level’;
Example:
The University of Sydney Page 32
SQL Grouping
– So far, we’ve applied aggregate operators to all (qualifying) tuples. Sometimes,
we want to apply them to each of several groups of tuples.
– Example: Find the average temperature measured at each day
SELECT obsdate, AVG(temp)
FROM Measurements
Measurements
SELECT obsdate, AVG(temp)
FROM Measurements
GROUP BY obsdate
obsdate temp
20 Mar 2019 26 C
20 Mar 2019 24 C
21 Mar 2019 28 C
obsdate temp
20 Mar 2019 26 C
20 Mar 2019 26 C
21 Mar 2019 26 C
obsdate temp
20 Mar 2019 25 C
21 Mar 2019 28 C ✓
The University of Sydney Page 33
Queries with GROUP BY and HAVING
– In SQL, we can “partition” a relation into groups according to the value(s) of one
or more attributes:
SELECT [DISTINCT] target-list
FROM relation-list
WHERE qualification
GROUP BY grouping-list
HAVING group-qualification
– A group is a set of tuples that have the same value for all attributes in grouping-
list.
– Note: Attributes in select clause outside of aggregate functions must appear in
the grouping-list
– Intuitively, each answer tuple corresponds to a group, and these attributes must have
a single value per group.
The University of Sydney Page 34
Group By Overview
[Kifer/Bernstein/Lewis 2006]
The University of Sydney Page 35
Example: Filtering Groups with HAVING Clause
– GROUP BY Example:
– What was the average temperature at each date?
SELECT obsdate, AVG(temp)
FROM Measurements
GROUP BY obsdate
– HAVING clause: can further filter groups to fulfil a predicate
– Example:
SELECT obsdate, AVG(temp)
FROM Measurements
GROUP BY obsdate
HAVING COUNT(temp) >= 10
– Note: Predicates in the having clause are applied after the formation of groups
whereas predicates in the where clause are applied before forming groups
The University of Sydney Page 36
Query-Clause Evaluation Order
FROM
identifies involved tables & joins
WHERE
filtering rows fulfilling given condition(s)
GROUP BY
organise rows in groups wrt. attribute(s)
HAVING
filter groups meeting condition(s)
SELECT
identifies attributes / aggregates
ORDER BY
sort result rows
Result
….
The University of Sydney Page 37
uos_code sid emp_id mark title
COMP5138
COMP5138
COMP5138
COMP5138
ISYS3207
ISYS3207
SOFT3000
INFO2120

1001
1002
1003
1004
1002
1004
1001
1005

10500
10500
10500
10500
10500
10505
10505
10500

60
55
78
93
67
80
56
63
….
RDBMS
RDBMS
RDBMS
RDBMS
IS Project
IS Project
C Prog.
DBS 1

cpts.
6
6
6
6
4
4
6
4

lecturer
10500
10500
10500
10500
10500
10505
10505
10500

Query Evaluation Example
SELECT uos_code as unit_of_study, AVG(mark)
FROM Assessment NATURAL JOIN UnitOfStudy
WHERE credit_points = 6
GROUP BY uos_code
HAVING COUNT(*) > 2
– Find the average marks of 6-credit point courses with more than 2 results
1. Assessment and UnitOfStudy are joined
2. Tuples that fail the
WHERE condition
are discarded
The University of Sydney Page 38
Query Evaluation Example (cont’d)
3. remaining tuples are partitioned into groups
by the value of attributes in the grouping-list.
4. Groups which fail
the HAVING condition
are discarded.
5. ONE answer tuple is generated per group uos_code AVG(..)
COMP5138
INFO5990
56
40.5
uos_code sid emp_id mark title
COMP5138
COMP5138
COMP5138
COMP5138
1001
1002
1003
1004
10500
10500
10500
10500
60
55
78
93
RDBMS
RDBMS
RDBMS
RDBMS
cpts.
6
6
6
6
lecturer
10500
10500
10500
10500
SOFT3000 1001 10505 56 C Prog. 6 10505
INFO5990

1001

10505

67
….
IT Practice

6

10505

Question: What happens if we have NULL values in grouping attributes?
The University of Sydney Page 39
Transforming and Cleaning Data
using Python
The University of Sydney Page 40
Issues encountered at previous exercises
– Interpretation of data format and meta-data
– Differences in naming conventions
– Excel headers with spaces and quotes, which both are not allowed to
DBMS
– Inconsistent or missing data entries
– 'shape' of data
The University of Sydney Page 41
Data Cleaning in Python
– Yes, there are powerful ETL tools out there, but we do it for free in Python:
(1) type and name conversion
(2) filtering of missing or inconsistent data
(3) unifying semantic data representations
(4) matching of entries from different sources
– Last week's clean() function deals with Tasks (1) and (2)
– int() creates integer objects, e.g., -1, 101
– float() creates floating point object, e.g., 3.14, 2.71
– datetime.strptime() creates datetime objects from strings
– Filters missing / wrongly formatted data and replaces with default value
– For more complex cases (3) and (4), you would need special code though
The University of Sydney Page 42
A function to convert values
Use “not a number” as default value
numpy knows to ignore for some stats
The University of Sydney Page 44
Review
The University of Sydney Page 45
Tips and Tricks
– Real data is 'dirty' – data cleaning and transformation essential
– Database systems are great for shared, persistent storage of
structured data, and also for consistent updating ('live' data)
– But some caveats:
– Schema-first
– Relational model quite restrictive (1NF, no lists, collections etc)
– Not too intuitive; 1:1 mapping from spreadsheets doomed to fail
– Type-mismatches between programming languages and SQL
– Needs to be installed and maintained
(though much better nowadays for SQLite and PostgreSQL)
– What's the benefit?
– haring! Large data sets! Querying will give us some leverage too.
The University of Sydney Page 46
Online Book
don't forget to end session
when finished reading
The University of Sydney Page 47
Many Good Python Resources
– Hard to make recommendations given different backgrounds
– Look online, there are many free resources and example code
– A few lists:
– https://www.fullstackpython.com/best-python-resources.html
– https://www.quora.com/Learning-Python/How-should-I-start-learning-
Python-1

essay、essay代写