DATA2001: Data Science
Introduction
Background
Data Exploration With Python
Types of Data and Levels of Measurement
Descriptive Statistics with Pandas
Accessing Data in RDBMS, SQL
Approaches
SQL
Declarative Data Analysis with SQL
SQL
DB Creation / Data Loading
Indexing and Data Partitioning
Indexing
Distributed Data Processing
Scraping Web Data
Extracting data from HTML
Storing Scraped Web Data
Semi Structured Data and Web Services and NoSQL
Web Services
Storing Semi-Structured Data
NoSQL
Text Data Processing
Geospatial Data
Database Support for Geo-Spatial Data
Geopandas
Data Exchange of Spatial Data Using XML and JSON
Time Series Data
Temporal Databases
Temporal Support in Current Database Systems
Image Data Processing
ROI Segmentation by Thresholding and Histogram
Feature Extraction
Big Data
The 3 Big Vs:
Analysing Big Data:
Dataflow Oriented Analytics Platforms
1. Introduction
Background
Structured data is clearly defined and searchable, while unstructured data is usually stored in its
native format.
Structured data is usually text only. Unstructured data consists more of images, videos, audio,
emails, etc.
The 3 big V’s of big data:
- Data Volume (The amount of data)
- There is a certain volume from which onwards manual processing is not an
option anymore.
- Data Velocity (How quickly the data is coming in)
- Data Variety (Data different from one application to another)
- Data can be very different from one application to another, much of it being
unstructured
More Vs of big data:
- Validity (Data quality)
- Variability (Data consistency)
- Veracity (Data accuracy/trustworthiness)
- Value
Sources of big data:
- Human generated (Photos, posts, likes, etc)
- Machine generated (Communication logs, Internet of things, etc.)
Data Exploration With Python
Types of Data and Levels of Measurement
Categorical Data:
- A categorical variable is also known as a discrete or qualitative variable and can have
two or more categories.
- It is further divided into two variants, nominal and ordinal.
- These variables are sometimes coded as numerical values, or strings.
Nominal Data:
- This is unordered categorical data. This type of variable may be label coded in numeric
form, but these numerical values have no mathematical interpretation and are just
labels to denote categories. For example: black, red, white can be coded as 1, 2, 3.
- Values are names.
- No ordering implied
Dichotomous Data:
- A dichotomous variable is a type of nominal data that can only have two possible values, e.g. true
or false, or present and absent. These are sometimes referred to as binary or boolean
variables. Not always boolean though.
- Correct / Incorrect
- True or False
Ordinal Data:
- This is ordered categorical data in which there is strict order for comparing the values, so
labelling as numbers is not completely arbitrary. For example, human height (small,
medium, high) can be coded into numbers. Small = 1, Medium = 2, High = 3.
- Values are ordered
- No distance is implied.
Interval Data:
- It is a variable in which the interval between values has meaning and there is no true
zero value.
- Values encode differences
- Equal intervals between values
- No true zero
- Addition is defined
- Examples: Celsius temperature scale
- Can't express 'no temperature'.
Ratio Data:
- A variable that has a true zero point, where zero represents the total absence of the
quantity being measured. For example, it makes sense to say a kelvin temperature of
100 is twice as hot as a kelvin temperature of 50 because it represents twice as much
thermal energy. (Can't do this with Fahrenheit temperatures of 100 and 50).
- Values encode differences
- Zero is defined
- Ratio meaningful
- Example: Length, weight, pressure, income
Text data or images require additional interpretation.
Data Acquisition:
- File Access
- You or your organization may already have a data set or you can download it
from the web from a data server.
- Typical formats include CSV, Excel, XML
- Programmatically
- Scraping the web (HTML)
- Or using APIs of web services (XML/JSON)
- Database Access
Real data is often dirty. So it's important to do some data cleaning and transforming first.
Typical cleaning steps involved:
- Type and name conversion
- Filtering of missing or inconsistent data
- Unifying semantic data representations
- Potentially also: Scaling, Normalization, Dimensionality Reduction
Approach 1: Specific Data Cleaning Tools
- Tools developed by third parties that make it easy to inspect and clean data.
- Very helpful, especially for smaller data sets.
Approach 2: Jupyter Notebooks and Python
- Write code with Python and its libraries to check for and deal with dirty data
- Before running any cleaning code, always manually inspect the data first.
- Always keep a copy of the original data just in case.
Pandas - Python Data Analysis Library
- Open source library providing data import and analysis functionality to python.
- Optimized data structures for data analysis: DataFrames, Time series data, Matrices.
- Provides loads of handling functions, as well as reader functions for the file formats.
Pandas - Data Structures:
- Two main data structures
- Series (1- dimensional, labeled, homogenous typed)
- DataFrame (2-dimensional, labeled, potentially heterogeneous columns)
- CSV reader imports a dataset as a DataFrame and most functions produce a DataFrame
as output.
- If a dataset is huge, it may be too much to want to process it all in a python project, and
may need to just take a subset of the data.
Pandas - Missing Data Handling
- Pandas functions for handling missing/ wrong data
- read_csv(), where missing values are automatically replaced with NA (e.g. via na_values)
- DataFrame.dropna(), DataFrame.fillna(), DataFrame.replace()
Best to replace missing value placeholders during import to avoid later problems. Always deal
with null or None values right at the start.
The standard Python csv module reads everything as strings. Pandas is a bit better, but will still
fall back to strings if it can't deduce the type from all values in a column. If so, we need to convert
to the appropriate type.
Converting Approach 1: Use Pandas
- astype() function on data series to convert to new types
- This fails if any entry in the series violates the new type
Converting Approach 2: Write a function to convert the values in a given column
- We are more flexible by introducing our own clean() function in Python (see the sketch below).
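A minimal sketch of both approaches, assuming a hypothetical file weather.csv with a city and an air_temp column in which missing temperatures are written as 'N/A' (all names here are illustrative):

import pandas as pd

# Replace missing-value placeholders already at import time
df = pd.read_csv('weather.csv', na_values=['N/A', 'missing'])

# Approach 1: astype() fails if any entry violates the new type
# df['air_temp'] = df['air_temp'].astype(float)

# Approach 2: our own clean() function gives us more flexibility
def clean(value):
    try:
        return float(value)
    except (TypeError, ValueError):
        return None    # map anything unparsable to a missing value

df['air_temp'] = df['air_temp'].apply(clean)

# Deal with the remaining missing values right at the start
df = df.dropna(subset=['air_temp'])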
Descriptive Statistics with Pandas
DataFrame supports a wide variety of data analysis functions.
You can filter entries in a DataFrame using loc[]
- Allows you to specify a boolean predicate; only those entries in the DataFrame for which
the predicate is true are selected
- Optionally also allows selecting specific columns to keep in the result
Frequency distributions can also be computed with groupby() and size().
- Entries in a pandas dataframe can be grouped by a column.
Pandas is built on Numpy, which is another useful python library.
- Numpy offers various statistics for numerical data.
Matplotlib provides functionality for creating various plots, and Pandas offers some easy-to-use
shortcut functions. Being able to use these plots is essential (a sketch of these functions follows below).
We can use boxplots to compare distributions:
- Mean and stdev are not informative when data is skewed.
- Boxplots summarise data based on 5 numbers
- Lower inner fence = Q1 - 1.5*IQR
- First Quartile (Q1) - Equivalent to 25th percentile
- Median (Q2) - Equivalent to 50th percentile
- Third Quartile (Q3) - Equivalent to 75th percentile
- Upper inner fence = Q3 + 1.5*IQR
- Values outside fences are outliers
- Sometimes include outer fences at 3*IQR
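A small sketch of these functions, assuming a hypothetical DataFrame df with a city and an air_temp column (as in the CSV example above):

import matplotlib.pyplot as plt

# loc[] with a boolean predicate, optionally keeping only selected columns
hot_days = df.loc[df['air_temp'] > 30, ['city', 'air_temp']]

# Frequency distribution: group entries by a column and count group sizes
counts = df.groupby('city').size()

# Numeric summaries (Pandas builds on NumPy for these)
print(df['air_temp'].describe())

# Shortcut plotting functions, e.g. one boxplot of air_temp per city to compare distributions
df.boxplot(column='air_temp', by='city')
plt.show()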
Accessing Data in RDBMS, SQL
Approaches
File Based:
- Just put the data in files (CSV or spreadsheet files)
- Analysis might be done by writing formulas or creating charts in a spreadsheet or by
writing python code (example: with pandas)
Quality issues with file based approaches:
- Data quality is left to the users to do the right thing.
- Example: keep metadata somewhere and keep it up to date
- Example: keep backups (redundancy), manage sharing
- Example: prevent changes that introduce inconsistency or violate integrity properties
Alternative: The Database Approach
- Central repository of shared data
- Data is managed by software: a database management system (DBMS)
- Data is accessed by applications or users, through DBMS always
A DBMS manages data like an operating system manages hardware resources (PostgreSQL is one example of a DBMS).
A database is a shared collection of logically related data and its description. The database
represents entities (real world things), the attributes (their relevant properties), and the logical
relationships between the entities.
Advantages of database approach:
- Data is managed so quality can be enforced by the DBMS.
- Improved Data Sharing
- Different users get different views of the data
- Efficient concurrent access
- Enforcement of standards
- All data access is done in the same way
- Integrity constraints, data validation rules
- Better data accessibility/ responsiveness
- Use of SQL
- Security, backup/ recovery, concurrency
- Disaster recovery is easy
- Program data independence
- Metadata stored in DBMS, so applications don't need to worry about data
formats.
- Data queries/ updates managed by DBMS so programs don't need to process
data access routines
- Results in:
- Reduced application development time
- Increased maintenance productivity
- Efficient access.
Key Database Concepts:
- Table is an arrangement of related information stored in columns and rows.
- Field/ Attribute is a column in a table, containing a homogeneous set of data.
- A field's data type is the kind of data that can be stored in the field. For example, a field whose
data type is text can store data consisting of either text or number characters, but a
number field can store only numerical data.
- Primary Key is a field in a table whose value uniquely identifies each record in the table.
A PK cannot be null.
- A record is a row in a table.
Relational Database Management System:
- An RDBMS stores data in tables as rows with multiple attributes
- Rows of the same format form a table
- Every relation has a schema, which describes the columns or fields and their types
- A relational database is a collection of such tables
A primary key is a unique attribute which the database uses to identify a row in a table. It is a
unique identifier (never null), often generated automatically by the database.
When we need to refer to a record in a separate table we reference its ID as a foreign key. A
foreign key is defined in a second table but it refers to the primary key in the first table.
Entity Relationship Diagram (ERD)
- Normalised relational schemas try to avoid redundancies.
- Every fact is ideally stored only once.
- Many different notations are used.
Declarative Data Analysis with SQL
SQL
SQL is the standard declarative query language for RDBMS. It describes what data we are
interested in, not how to retrieve it.
Supported commands come from two categories:
- DDL (Data Definition Language): Create, drop or alter the relation schema.
- Example: CREATE TABLE name( list_of_columns)
- DML (Data Manipulation Language): For retrieval of information also called query
language.
- Example: INSERT, DELETE, UPDATE. SELECT..FROM..WHERE
When creating a table we can also specify integrity constraints for columns. Example: domain
types per attribute or NULL/NOT NULL constraints. Example: CHECK(condition), which makes sure
the condition holds for every tuple in the table, e.g. CHECK (gender IN ('M', 'F')).
Check SQL notes below for more extensive information.
In SQL we can partition a relation into groups according to the values of one or more attributes.
A group is a set of tuples that have the same value for all attributes in a grouping list. Attributes
in select clauses outside of aggregate functions must appear in the grouping list.
- Intuitively, each answer tuple corresponds to a group, and these attributes must have a
single value per group.
Note: Predicates in the having clause are applied after the formation of groups whereas
predicates in the where clause are applied before forming groups.
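A sketch of such a grouping query run from Python, assuming an Observations(city, air_temp) table like the one used in the SQL notes below and a placeholder SQLAlchemy connection string:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@localhost/data2001')  # placeholder credentials

# city appears outside an aggregate, so it must be in the GROUP BY list;
# the HAVING predicate is applied after the groups have been formed.
query = """
    SELECT city, AVG(air_temp) AS avg_temp
      FROM Observations
     WHERE air_temp IS NOT NULL
     GROUP BY city
    HAVING AVG(air_temp) > 20
"""
df = pd.read_sql(query, engine)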
DB Creation / Data Loading
Once data is ready and cleaned it needs to be stored in our database.
Approach 1: PSQL Data Loader
- Postgresql’s psql provides a command to load data directly from a CSV file into a
database table.
- Pros: Fast and straightforward, no programming needed.
- Cons: Only 1:1 mapping of CSV to tables, no data cleaning or transformation,
stops at the first error.
Approach 2: Python Loading Code
- Use pgconnect() to create a connection conn.
- Define a string with the appropriate CREATE TABLE command
- Use conn.execute(string) to do this in the DBMS
- Obtain data as a dataframe df
- Execute df.to_sql on conn
- Pros: Flexibility, use pandas to clean data
- Cons: Needs to be hand coded
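A sketch of Approach 2, with SQLAlchemy's create_engine standing in for the course's pgconnect() helper (connection string, table and file names are illustrative):

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://user:password@localhost/data2001')  # placeholder credentials

create_stmt = """
    CREATE TABLE IF NOT EXISTS observations (
        city     VARCHAR(50),
        obsdate  DATE,
        air_temp FLOAT
    )
"""

# Clean the data with pandas first, then create the table and bulk-load the DataFrame
df = pd.read_csv('weather.csv', na_values=['N/A'])
with engine.begin() as conn:                  # commits automatically on success
    conn.execute(text(create_stmt))
    df.to_sql('observations', conn, if_exists='append', index=False)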
Indexing and Data Partitioning
Data is stored in different ways on different storage media.
HDD: Disk storage is persistent, but access to it is expensive (slow). Read (disk to RAM) and
write (RAM to disk) operations are high cost relative to in-memory operations, so they must be
planned carefully.
Solid State Drive (SSD):
- Pros:
- SSD have about same access characteristics
- 100x faster than HDDs
- Higher input output rates than HDDs
- Cons:
- SSDs 4-6 times more expensive than HDDs
- SSDs currently max out at about 4TB, can get 16TB HDDs
- Being 100x faster than HDDs still means SSDs are significantly slower than main memory
Secondary storage is needed for sheer data volume (and persistence) but it’s slow.
How to speed up access to secondary storage:
- Block wise transfer: transfer data in fixed size chunks between storage layers
- Caching/Buffering: keep hot data in memory and only use secondary storage for cold
data
- Optimised file organisation: heap files vs sorted files, row stores vs column stores
- Indexing
- Partitioning
The goal to improve performance consists of wanting to:
- Minimise latency
- Maximise throughput
- Minimise “cost”, which is typically expressed in the number of I/O operations needed.
The lower the better.
We use a buffer to hide slow disk access. A buffer holds copies of most useful pages of data.
Missing pages must be fetched from disk and stored in the buffer.
Logical data design - A database is a collection of relations. Each relation is a set of records (or
tuples). A record is a sequence of fields (or attributes).
Alternative file organisations:
- Heap Files - A record can be placed anywhere in the file where there is space (random
order)
- Suitable when typical access is a file scan retrieving all records
- Sorted Files - Store records in sequential order, based on the value of the search key of
each record.
- Best if records must be retrieved in some order, or only a range of records is
needed.
- Indexes - Data structures to organize records via trees or hashing
- Like sorted files, they speed up searches for a subset of records, based on
values in certain search fields.
- Updates are much faster than in sorted files
(Unordered) Heap Files:
- Simplest file structure contains records in no particular order.
- Access method is a linear scan
- On average half of the pages in a file must be read, in the worst case even the
whole file.
- Efficient if all rows are requested
- Very inefficient if a few rows are requested
- Rows appended to end of the file as they are inserted, hence unsorted
- Deleted rows create gaps in file
- File must be routinely compacted to recover space.
Sorted File:
- Rows are sorted based on some attributes
- Successive rows are stored in same pages
- Access method can be a binary search
- An equality or range query on that attribute has cost log2(B) (B = number of pages) to
retrieve the page containing the first row
- Problem: Maintaining sorted order
- After the correct position has been determined, having to shift over to make
space to insert is very expensive.
- Hence sorted files are not used directly by DBMS but rather in the form of index
organized (clustered) files.
So far, we looked at a row store approach where all data values of a record are stored together.
For analytical workloads, a column oriented store is an alternative.
Column Stores - Pros and Cons
Advantages:
- Attributes that are not needed do not have to be read (less I/O, more efficient)
- Better compression possibility
- Better parallel processing (different columns can be stored on different disks)
Disadvantages:
- Retrieving full rows is more difficult (they must be reassembled from several columns)
- Little benefit for small tables
Indexing
The idea is a separate location mechanism from data storage. A database index allows a query
to efficiently retrieve data from a database. Basically a shortcut. Like a table of contents in a
book.
Multi attribute index: The search key in an index can be a combination of two or more columns,
not just a single column.
Syntax: CREATE INDEX name ON relation-name (attribute-list)
Index on primary key created automatically. Can use CREATE UNIQUE INDEX to indirectly
specify and enforce the condition that the search key is a candidate key. To drop an INDEX can
use DROP INDEX index-name. Indexes are used automatically by a database - no way to
explicitly use.
Downsides to using Indices:
- Additional I/O to access pages (except if index is small enough to fit in main memory)
- Index must be updated when table is modified, can become quite costly
Clustered Index:
- In a clustered index, both index entries and rows with actual data are ordered in the
same way.
- The particular index structure (e.g. hash or tree) dictates how the index entries are
organized in the storage structure
- For a clustered index, this then dictates how the data rows are organized
- There can be at most one clustered index on a table
- Index entries point to blocks (pages), not directly to individual rows
- CREATE TABLE statement generally creates a clustered index on a primary key.
- To have a clustered index on another attribute, PostgreSQL provides the command:
CLUSTER table_name USING index_name
Unclustered Index: ( like the table of contents one)
- Index entries and rows are not ordered in the same way
- There can be many secondary indexes on a table.
- Index created by CREATE INDEX is generally an unclustered, secondary index.
Clustered Index is good for range searches over a range of search key values
- Use index to locate the first index entry at start of range
- This locates the first row
- Subsequent rows are stored in successive locations if index is clustered
- Minimize page transfers and maximizes likelihood of cache hits
- Can be at most one clustered index on a table
Unclustered isn't ever as good as clustered, but may be necessary for attributes other than the
primary key.
Different types of indexes:
- Tree-based indexes (B+-tree): very flexible, support point queries, range queries and prefix searches
- Hash based indexes: fast for equality searches but nothing else
- Special indexes: such as bitmap indexes for OLAP or R tree for spatial databases
Covering Index: An index that contains all attributes required to answer a given SQL query (a sketch follows after this list)
- All attributes from the WHERE filter condition
- If it is a grouping query, also all attributes from GROUP BY & HAVING
- All attributes mentioned in the SELECT clause
- Typically a multi - attribute index
- Order of the attributes is important: Prefix of the search key must be the attributes from
the WHERE
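A sketch of a covering, multi-attribute index, using the hypothetical observations(city, obsdate, air_temp) table from the loading example above; the index prefix matches the WHERE attribute so the index can cover the whole query (all names are illustrative):

from sqlalchemy import create_engine, text

engine = create_engine('postgresql://user:password@localhost/data2001')  # placeholder credentials

# Query to support:
#   SELECT city, AVG(air_temp) FROM observations WHERE city = 'Sydney' GROUP BY city;
# Covering index: the WHERE attribute comes first, followed by the other attributes the query needs
with engine.begin() as conn:
    conn.execute(text(
        "CREATE INDEX IF NOT EXISTS obs_city_temp_idx ON observations (city, air_temp)"
    ))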
Different DBMS support different index structures
Choosing which indexes to create:
- Need to understand workload
- Which data does it need access to, what attributes need to be retrieved, which
attributes are involved in selection etc.
- Need to figure out which index to create, based on the relations that need to be indexed
and what fields need to be the search key.
- We also look at what kind of index it should be: clustered or unclustered, hash- or tree-based.
- How should the indexes be created, separate storage, own disk etc.
Choosing index possible approach:
- Consider the most important possible queries. Then consider the best plan using the
current indexes, and see if a better plan is possible with an additional index, if so create
it.
- Before creating an index, must also consider the impact on updates in the workload.
Index selection guidelines:
- Attributes in WHERE clause are candidates for index keys.
- Exact match condition suggests hash index, if platform provides this option
- Range query only supported by tree index types (b tree)
- Cluster especially useful for range queries
- Multi-attribute search keys should be considered when a WHERE clause contains
several conditions
- Order of attributes is important for range queries
- Such indexes can sometimes enable index only strategies for important queries.
- If it is a GROUP BY query with an equality search on, say, state, a covering
index can be used.
- Try to choose indexes that benefit as many queries as possible. Since only one index
can be clustered per relation, choose it based on important queries that would benefit
most from clustering.
- An index should support a query of the application that has a significant impact on
performance
- Choice based on frequency of invocation, execution time, etc.
Distributed Data Processing
A single server has limits, for real big data processing, we need to scale out to a cluster of
multiple servers.
Two main physical design techniques:
- Data partitioning
- Storing subsets of the original data set at different places
- Can be in different tables in schema on same server or at remote sites
- Goal is to query smaller data sets and to gain scalability by parallelism
- Subsets can be defined by
- Columns - Vertical Partitioning
- Rows - Horizontal Partitioning
- If each partition is stored on a different site it's called sharding.
- Data replication
- Storing copies (‘replicas’) of the same data at more than one place
- Goal is fail-safety and availability
What is partitioned?
- Horizontal partitioning: set of rows (known as shard)
- Vertical partitioning: set of columns
How to place data into partitions? (a small sketch follows after this list)
- Round robin
- Partitions are placed by cycling through all nodes in rounds
- Hash partitioning
- Target node is determined by a hash function on the tuple id or key
- Range partitioning
- Each node stores a partition defined by a range predicate
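A small pure-Python sketch of these three placement strategies, assuming four nodes numbered 0 to 3 and illustrative range boundaries:

NUM_NODES = 4

def round_robin(position):
    # cycle through all nodes in rounds, based on insertion order
    return position % NUM_NODES

def hash_partition(key):
    # target node is determined by a hash function on the tuple key
    return hash(key) % NUM_NODES

def range_partition(year):
    # each node stores a partition defined by a range predicate
    if year < 2000:
        return 0
    elif year < 2010:
        return 1
    elif year < 2020:
        return 2
    return 3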
Advantages of partitioning:
- Easier to manage than a large table
- Better availability: If one partition is down, others are unaffected if stored on a different
tablespace
- Helps with bulk loading
- Queries faster on smaller partitions, can be evaluated in parallel
Data sharding spreads data across multiple computers, while partitioning groups subsets of data
within a single database instance.
Data sharding (fragmenting): distributing data partitions over several sites in a distributed
database
- Assumes queries access one shard only
- Otherwise, counterproductive
Scraping Web Data
Scraping the web is a way of getting data. Done through web crawlers and HTML parsing.
Webpages are written in HTML which is a semi structured data format with some similarity to
XML.
Web Scraping general approach:
- Reconnaissance: Identify source and check structure and content
- Web Page Retrieval: Download one or multiple pages from source
- Typically in a script or program that auto generates new URLs based on website
structure and URL format.
- Data Extraction: Content parsing, raw data extraction
- Data Cleaning and transformation into required format
- Data storage, analysis, combining with other data sets
HTML is not always well formed, let alone annotated or semantically marked up.
Many HTML parsers are too strict for real-world usage, including Python's built-in parsers. They
would stop parsing incorrectly written web pages without giving us the chance to extract data.
There are several 3rd party tools that we can use to help us.
HTML Crawling: Requests library
HTML Parsing: html5lib
Data Extraction: BeautifulSoup library
Website Crawling: scrapy framework
Many tools are available for us to take advantage of in this domain: Command line (e.g. curl,
grep, awk), Google spreadsheets, web crawlers as a service, BeautifulSoup programming
library
Tips:
- Take note of any URL parameters, find any patterns with links when accessing data,
access tokens
- Web page structure is useful to note, use the page inspector to narrow down what you're
looking for
Many websites provide a robots.txt file, which web crawlers should check before they start
crawling a website.
Different rules in the robots.txt file:
- Specifies the permissions you have
- Specifies which directories you are allowed to visit
- Specifies which crawler programs (user agents) the rules apply to
- Specifies what frequency of requests you can make
Be a good net citizen: check, ask, don't overload and don't steal.
Web scraping is not illegal, what may be illegal is the way the information is being used.
Therefore, it is very important to read all terms and conditions and robots.txt.
Data cleaning is its own topic, and extremely important as data is rarely clean. Important to
clean before further processing.
Extracting data from HTML
Web retrieval in python (single pages):
- Two forms of requests: GET (optional with parameters) or POST (with parameters)
- Making simple web requests in python
- Python requests library:
URL general format = protocol://site/path_to_resource
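A minimal retrieval sketch with the requests library; the URL and parameters are placeholders:

import requests

# GET request with optional URL parameters
response = requests.get('https://example.com/path_to_resource',
                        params={'page': 1})
response.raise_for_status()     # fail early on HTTP errors
html = response.text            # the raw HTML as a string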
Website crawling in Python (multiple page scraping)
- Scrapy
- Extensive Python framework to implement a web ‘spider’ - a program that follows
multiple links along the web.
- You can extend the spider class with your own functionality, which extracts parts of the
visited pages while the spider follows further links.
- Selenium
- A programmable web browser with Python bindings, which allows you to send requests
as if a user had clicked on links or interacted with a page.
- Typically used for automated testing of websites
- Can also be used for crawling complex websites
General Structure of a web page:
- Head: Title, style sheets, scripts, meta data
- Body: Headings, text, lists, etc
How to select content in a webpage:
- Text patterns
- Simple but not really great for complex patterns as we rely on some parsing of
our own.
- DOM navigation
- Document object model
- CSS selectors
- Based on tag types, class specifications and element IDs
- Easy to specify, but depends on css classes and ids being used well
- XPath expressions
- Powerful language that allows navigation along document tree and selects all
nodes or even subtrees which match path expression
- Can contain filter predicates e.g. on values of XML/HTML attributes
BeautifulSoup provides many functions for navigating the parsed page and also supports CSS
selectors.
Example: finding all elements of a given type: elements = soup.find_all("h3")
We could even extract the data straight into a pandas DataFrame, as sketched below.
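A sketch of the whole pipeline (retrieve, parse with a lenient parser, extract, load into pandas); the URL and the CSS class are made up for illustration:

import pandas as pd
import requests
from bs4 import BeautifulSoup

page = requests.get('https://example.com/books')       # placeholder URL
soup = BeautifulSoup(page.content, 'html5lib')          # html5lib copes with messy HTML

# Find elements by tag type, or via a CSS selector
titles = soup.find_all('h3')
prices = soup.select('p.price')                         # assumes a 'price' CSS class

rows = [{'title': t.get_text(strip=True), 'price': p.get_text(strip=True)}
        for t, p in zip(titles, prices)]
df = pd.DataFrame(rows)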
Semi Structured Data and Web Services and
NoSQL
Web Services
Getting Data via service-APIs (Web Services)
- Many websites or web services provide programmable APIs which allow you to explicitly
request data for a program to process, instead of pages to view in your browser.
- Typical request formats: JSON, XML
Web Services:
- Client program sends HTTP request to Web-Service API
- GET: get a direct response, parameters encoded in the URL
- POST: web form request, parameters in request itself
- Web server application: answers with either static data from files, or dynamically
constructs content based on request
- Database server: persistent web state
- Response from web service: JSON or XML
Pandas can transform a JSON string into a dataframe.
Complex responses might need to select the appropriate part first using the json library.
Some APIs also allow additional parameters to be given (see the sketch below).
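A sketch of such a request, with a placeholder endpoint and parameter names:

import pandas as pd
import requests

response = requests.get('https://api.example.com/stations',   # placeholder endpoint
                        params={'state': 'NSW'})
payload = response.json()               # parse the JSON response into Python objects

# Complex responses: select the relevant part first, then flatten it into a DataFrame
records = payload.get('results', payload)
df = pd.json_normalize(records)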
HTML, XML and JSON are examples of so-called semi-structured data models (data with a
non-rigid structure).
Characteristics of semi structured data:
- Missing or additional attributes
- Multiple attributes
- Nesting: semi structured objects (documents) are hierarchical
- Different types in different objects
- Heterogeneous collections
- Self-describing, irregular data, no a-priori structure.
HTML is mainly for web page design, XML is the more structured ‘cousin’ for data exchange.
Some web services can be asked to send XML rather than HTML pages.
Also common in enterprise data exchange, or open data sets.
XML is a similar idea to HTML:
- Semi structured format
- Tag markup
- The main difference is that XML has user-defined tags, while HTML has predefined tags.
Logical Document Structure:
- XML refers to its objects as elements
- The top most element is called the root or document element
- Elements are bound by tags:
- Every opening tag must have matching closing tag
- Tag names are ‘case sensitive’
- Tags cannot overlap
- Tags of empty elements have special syntax.
- It follows a tree structure
- Sole data type for leaf elements: PCDATA (parsable character data)
Document Type Definition: (DTD)
- XML defines general syntactic properties with user defined tags.
- The grammar of an XML document is specified using a so-called document type
definition.
- A DTD is a sort of schema, establishing constraints on element structure and content, plus
definitions of entities.
XML Conformance: well formed vs. valid
- Well formed: document satisfies XML syntax constraints (matching tags)
- Valid: document is well formed and satisfies a predefined schema, as formalised in a
DTD.
How to query or filter XML:
- DOM Navigation
- XML documents represent a tree structure which can be navigated using XML’s
Document Object Model
- XPath
- XPath expressions allow querying single values, nodes or whole subtrees within
one XML document
- An XPath expression takes a document tree as input and returns a multi-set of
nodes of the tree.
- Expressions starting with / are absolute path expressions.
- An XPath expression is a sequence of location steps: locationStep1/locationStep2/...
- XQuery
- XQuery builds on XPath to specify a declarative query language over a set of
XML documents
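A short sketch of DOM/XPath access from Python using the third-party lxml library on a tiny inline document:

from lxml import etree

xml = b"""<catalog>
  <book year="2001"><title>Data Science</title></book>
  <book year="1999"><title>Databases</title></book>
</catalog>"""
root = etree.fromstring(xml)

# Absolute path expression: all title texts
titles = root.xpath('/catalog/book/title/text()')

# Location step with a filter predicate on an attribute value
recent = root.xpath('/catalog/book[@year > 2000]/title/text()')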
JSON Overview:
- Unordered sets of name-value pairs with nested structure.
- A text based, semi structured format for data interchange
- XML is stronger in the database area due to SQL/XML and XPath/XQuery
Relational World:
- Schema first, rich type system for attributes, integrity constraints
- First Normal Form: only atomic type attributes allowed.
- Many different tables, joins needed
Semi Structured World:
- Self describing data with flexible structure
- Nested data model with tree structure
- Optional attributes, grammar, schema and vocabulary.
Storing Semi-Structured Data
Some RDBMS, such as PostgreSQL, support JSON and XML.
A SQL/XML standard is widely available. It provides XML data type to store XML and integrates
the XML features.
PostgreSQL has two native JSON data types: JSON and JSONB.
NoSQL db can be used for JSON data.
We store XML natively as a tree structure. Supports navigation via efficient storage and
indexing.
To process values of XML, PostgreSQL offers the functions xpath and xpath_exists.
Support for storage, querying, and JSON export:
- JSON type stores exact copy of JSON data in attribute
- The JSONB type stores a binary, decomposed version.
- More overhead for inserts, but claimed to be significantly faster for querying.
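A small sketch of the JSONB operators, assuming a hypothetical sensors table with a JSONB column called properties and a placeholder connection string:

from sqlalchemy import create_engine, text

engine = create_engine('postgresql://user:password@localhost/data2001')  # placeholder credentials
with engine.connect() as conn:
    rows = conn.execute(text("""
        SELECT properties ->> 'name' AS name           -- extract a field as text
          FROM sensors
         WHERE properties @> '{"metric": "celsius"}'   -- JSONB containment test
    """)).fetchall()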
NoSQL
Traditional DBMS platforms were relational (SQL as query language, relational data model) and
powerful (lots of features for integrity, security, tuning), but expensive and hard to administer.
New systems were designed and described as NoSQL because they gave up features of
traditional platforms. They offered simpler data models, simpler queries and updates, and weaker
security guarantees. These systems are often open source and sometimes free.
Over time, the new platforms added features like joins, triggers and integrity while old ones
added support for more diverse data models. “Not Only SQL” was used for these systems.
Relational DBMS
- Set of fixed structured tuples; joins; SQL; transactions
NoSQL
- Document oriented databases
- Nested, hierarchical lists of key value pairs
- NoSQL column stores
- Pure key value stores
So far there is no standard data model or API for ‘NoSQL’
One vital property of many of the new systems is that data can be inserted without a schema
already being defined for it. This is great for receiving data from data sources that have not
been fully cleaned.
Text Data Processing
Text data usually doesn't have a predefined data model, is unstructured and is typically text
heavy, but may contain dates, numbers and facts as well. This results in ambiguities that make it
more difficult to understand than data in structured databases.
Structured data is:
- Data in fields, easily stored in databases, e.g. sensor data, financial data, click streams
etc.
Text is unstructured data, 80-90% of all potentially usable business information is unstructured.
Machine learning tasks:
- Supervised learning - predict a value where truth is available in the training data. E.g.
prediction, classification, and regression.
- Unsupervised learning - find patterns without ground truth in training data. E.g.
clustering, probability distribution estimation, finding association, dimension reduction.
Main difference: supervised learning is done using a ground truth, i.e. we have prior
knowledge of what the output values for our samples should be.
A feature vector is a vector containing multiple elements about an object. Represent numeric or
symbolic characteristics.
Converting text to feature vectors is possible. We represent a document as a multiset of words.
Keep frequency information, disregard grammar and word order.
Tokens are a basic meaningful unit of a sentence or a document.
Tokenisation process:
- Split a string (document) into pieces called tokens
- Possibly remove some characters, e.g. punctuation
- Remove “stop words” such as “a”, “the”, “and” which are considered irrelevant
Normalization is the process of converting a token into its base form.
- We map similar words to the same token.
- Stemming/lemmatization: Avoid grammatical and derivational sparseness.
- Lowercasing
Stemming: An elementary rule-based process for removing inflectional forms from a given
token.
Lemmatization: A systematic process of removing the inflectional form of a token and
transforming it into a lemma.
We give a binary indicator feature for each word in a document, and only take in each word
once. We then consider term frequency and give more weight to terms that are common in the
document. We might perform damping to reduce the impact of high counts.
The distribution of terms follows Zipf's law, essentially an inverse relationship between a term's frequency and its rank.
Inverse document frequency: Giving less weight to terms that are common across documents.
- IDF = log(|total docs|/|docs containing term|)
- TFIDF = TF *IDF
TF-IDF: term frequency inverse document frequency. Takes into account word frequency in a
doc but offsets it by the number of documents that contain the word.
Vector space model: Documents are represented as vectors in term space
- Terms are usually stems, document vector values can be weighted by e.g. frequency
- Queries represented the same as documents
All document vectors together: Document Term Matrix (Feature-Matrix). We can plot the vectors
on a coordinate plane. Documents that are close in direction and length are similar to one
another.
Scikit-learn library provides corresponding functionality via its CountVectorizer. CountVectorizer
converts a collection of text documents to a matrix of token counts.
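A small sketch with scikit-learn on two toy documents:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Bag of words: document-term matrix of raw token counts
count_vec = CountVectorizer(stop_words='english')
counts = count_vec.fit_transform(docs)
print(count_vec.get_feature_names_out())
print(counts.toarray())

# TF-IDF weighting: down-weights terms that are common across documents
tfidf_vec = TfidfVectorizer(stop_words='english')
weights = tfidf_vec.fit_transform(docs)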
Unstructured data can also be input for regression (predict a numeric value, not a categorical
label). E.g. predicting box office returns from movie reviews.
Curse of Dimensionality:
Theoretically, increasing the number of features should lead to better performance; in practice,
beyond a certain point it leads to worse performance. This is called the curse of dimensionality.
The number of training examples required increases exponentially with dimensionality.
Dimensionality reduction can improve the prediction accuracy.
Structured Prediction:
Problems where output is a structured object, rather than discrete or real values.
Knowledge base population (KBP):
Aim is to build structured knowledge bases from massive unstructured text corpora.
Subtasks:
- Entity linking: identify mentions of entities, link to KB or NIL
- Slot filling: Extract and populate facts for given entity
Geospatial Data
Spatial data is about objects and entities which have a location or a geometry. A special form is
geospatial data which refers to data or information that identifies the geographic location of
features and boundaries on earth.
Example applications of spatial data:
- Location aware services: food finder, web apis
- Geographic information systems
- Multimedia Databases: nearest neighbour queries etc
Spatial Database Management System (SDBMS):
- Handle large amount of spatial data stored in secondary storage
- Spatial semantics built into query language
- Specialized index structure to access spatial data
Geographic Information System (GIS):
- SDBMS Client
- Characterized by a rich set of geographic analysis functions
- SDBMS allows GIS to scale to large databases, which are now becoming the norm
- Information in a GIS is typically organized in layers
Types of spatial data:
- Point data: point in multidimensional space
- Region data: objects have spatial extent with location and boundary
Object model (Entity based model of spatial information):
- Objects: distinct identifiable things relevant to an application
- Attribute: A simple property of an object
- Operations: function maps object attributes to other objects
- Example: Roadmap
- Objects: roads, landmarks
- Attributes:
- spatial: polygon land boundary
- non spatial: name, type, speed limit etc.
- Operations: determine center line, determine length, etc.
Classifying spatial objects:
- Spatial objects are spatial attributes of general objects.
- 0-dimensional ( points), 1-dimensional (curves), 2-dimensional (surfaces)
Coordinate systems:
- Each instance of a spatial type has a spatial reference identifier (SRID)
- Basically identifies the coordinate system in use.
- Typically listed according to prevalent standards; currently 390 distinct systems
- Types:
- Cartesian coordinates: point positions from a defined origin along axes
- Geodetic or geocentric coordinates: angular coordinates ( longitude, latitude)
- Projected coordinates: planar cartesian coordinates that result from mapping
point on earth to a plane.
Projections are trying to fit something round, like the earth, on something flat. This introduces
distortion.
Classifying Operations:
- Set based: 2 dimensional spatial objects are sets of points.
- Topological operations: the boundary of the USA touches Canada
- Directional: Brisbane is north of Sydney
- Metric: Brisbane is 730 km from Sydney
Topology is the study of topological relationships. Topological relationships:
- Invariant under elastic deformation (without tear, merge)
Nine intersection model of topological relationships:
- Specifies many of the possible topological relationships.
Database Support for Geo-Spatial Data
Postgresql supports spatial data out of the box.
Geometric Types:
- Point, line, box, path, polygon, circle
- Geometric Functions
- Indexing of Geometric Types
For geographic data there is a PostGIS extension that supports:
- Geometry types: shapes on a plane
- Geography types: shapes on a sphere
- Spatial reference system, special predicates, R-Tree indexing.
PostGIS: Geometry vs Geography Type
- Geometry types: shapes on a plane
- Geography types: shapes on a sphere
- This difference affects all calculations of geometries/geographies.
- Because of the more complex calculations, PostGIS defines fewer operations on
geographies than on geometries.
Types of spatial queries:
- Spatial range queries - “find all cities within 100 km of Sydney”
- nearest neighbour queries - “find the 3 cities nearest to Sydney”
- spatial join queries - “find all cities nearest to an ocean”
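A sketch of a spatial range query with PostGIS, assuming a hypothetical cities table with a geography column geom and a row for Sydney (connection string is a placeholder):

from sqlalchemy import create_engine, text

engine = create_engine('postgresql://user:password@localhost/data2001')  # placeholder credentials
with engine.connect() as conn:
    nearby = conn.execute(text("""
        SELECT c.name
          FROM cities c, cities s
         WHERE s.name = 'Sydney'
           AND c.name <> 'Sydney'
           AND ST_DWithin(c.geom, s.geom, 100000)  -- within 100 km (metres on geography)
    """)).fetchall()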
Because spatial query processing is complex, typically a two step ‘filter refine’ strategy is used
for querying spatial data.
- Filter: find candidate objects whose minimum bounding box intersects the query region
- Refine: check which of those candidates really intersect the query region
For 1 dimensional point data: Classical B+ Tree
For n-dimensional spatial data: spatial indexes such as the R-Tree
The R tree:
- A tree structured index that remains balanced on inserts and deletes.
- Each key stored in a leaf entry is intuitively a box, or collection of intervals, with one
interval per dimension.
- Properties:
- Leaf entry = <n-dimensional box, rid> (the box is the tightest bounding box for the data
object)
- Non-leaf entry = <n-dim box, ptr to child node> (the box covers all boxes in the child
node)
- All leaves are at the same distance from the root.
- Nodes are kept at least 50% full
Geopandas
Provides two main GeoData types:
- GeoSeries and GeoDataFrame
GeoSeries:
- Main geometry building block on which geodataframe is built
- A geopandas geometry is an extension of a shapely geometry object
- 6 classes closely resembling the OGC data model.
- Single entity: Point, Line, Polygon
- Homogeneous Entity: Multi-Point, Multi-Line, Multi-Polygon
CRS - Coordinate Reference System
- GeoPandas uses a CRS to tell Python how coordinates relate to places on the earth.
- Can manually set a projection if we want. (portrays surface of earth on a plane)
A GeoDataFrame allows you to combine normal data features and values with spatial data coming
from a GeoSeries. The most important feature is that it always has one GeoSeries column that
holds a special status; this column is referred to as the 'geometry'. When a spatial method is
applied to a GeoDataFrame, it always acts on the 'geometry' column.
Attribute join: joins two GeoDataFrames using the merge() function
Spatial join: joins two GeoDataFrames based on the spatial relationship of their geometry objects
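A sketch of the GeoDataFrame basics, with made-up coordinates and a hypothetical polygon layer for the spatial join:

import geopandas as gpd
import pandas as pd

df = pd.DataFrame({'city': ['Sydney', 'Brisbane'],
                   'lon': [151.21, 153.03],
                   'lat': [-33.87, -27.47]})

# Build the special 'geometry' column from a GeoSeries of points
gdf = gpd.GeoDataFrame(df,
                       geometry=gpd.points_from_xy(df.lon, df.lat),
                       crs='EPSG:4326')        # geodetic longitude/latitude

# Re-project to a planar (projected) coordinate system when needed
projected = gdf.to_crs(epsg=3857)

# Spatial join: match each city to the polygon (region) that contains it
# regions = gpd.read_file('regions.geojson')   # hypothetical polygon layer
# joined = gpd.sjoin(gdf, regions, predicate='within')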
Data Exchange of Spatial Data Using XML and JSON
Web APIs are very important for modern/mobile applications.
We use GeoJSON to encode geographic data structures in JSON.
- It defines spatial data as features of standard JSON objects. Supports various geometry
types.
Use cases of all this information:
1. Ingesting data with spatial features
2. Using web APIs with support for GeoJSON to lookup further data
3. Use GeoJSON or KML to export or visualize spatial data
Time Series Data
Almost all data is qualified with time (periods or points).
Big data involves taking a slice in time across many different channels. But long data involves
looking at information on a much longer timescale.
Generally, there is limited support for temporal data management in DBMSs.
- Conventional (non temporal) DBs represent a static snapshot.
- Management of temporal aspects is implemented by the application.
Temporal Databases
A temporal database provides built-in support for the management of temporal data/time.
How to represent time in traditional databases?
- Basic building blocks: SQL data types DATE, TIMESTAMP
- Modelled as an ordered set of time instants/points.
- Can add attributes of these time data types to any table, interpretation is application
specific
Concepts in temporal databases:
- Temporal data types
- Instant - Something happening at an instant of time
- Interval - A length of time, a duration
- Period - An anchored duration of time
- Kinds of time
- User defined time - An uninterpreted time value
- Valid Time - When a fact was true in the modelled reality.
- Transaction Time - When a fact was stored in the database.
- Temporal statements
- Current - now
- Sequenced - At each instant of time
- Non Sequenced - Ignoring time
Temporal Data Types:
- SQL supports time instants and intervals (but no periods).
- Instant data types: DATE, TIMESTAMP, TIME
- Interval data types: Various specification options, e.g. INTERVAL YEAR TO MONTH
- Many DBMS only support time instants but no intervals.
- They must hence be simulated with two time instants (start and end)
Kinds of Data:
- User defined time. E.g. a birthday or publication time
- Valid time and transaction time
- A table can be associated with none, one, two or all three kinds of time
Semi Temporal Databases
- We want to track when a current fact is true. A “since when” type attribute.
- When this “since when” attribute is tracked the table becomes ‘semi-temporal’.
If we want to keep a history, we need a temporal database.
- We want to keep the history of when employees were assigned to what positions. Since
there's no time period data in sql, we can simulate it with two dates.
For the handling of NOW or UNTIL CHANGED data we use the max timestamp approach,
which simulates ‘the end of time’.
Valid Time (aka fact time): Records the time when a fact is true in the real world. Can move
forwards and backwards.
Transaction Time (aka database time): Records the history of database activity. Only moves
forwards. Therefore allows rollback (very useful for auditing).
Transaction time databases retain and provide access to prior states of a database. Transaction
time does not have to be reflected explicitly in the relational schema (as compared to valid time
periods).
Temporal Support in Current Database Systems
SQL has added period definitions as metadata to tables (and not as a new data type).
A period is a conceptual grouping of a physical start time and end time attribute/column. It's a
half closed interval [Estart, Eend).
There is a new data type DATERANGE to represent time periods (date intervals).
Indexes on range types in PostgreSQL are provided by a powerful extensible index structure
called the Generalised Search Tree (GiST). It is used on periods and on combinations of a period
and other attributes. GiST is also very useful for spatial data and improves range indexing for
many predicates.
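A sketch of simulating valid-time periods with a PostgreSQL range type and a GiST index; table, column and connection details are illustrative:

from sqlalchemy import create_engine, text

engine = create_engine('postgresql://user:password@localhost/data2001')  # placeholder credentials
with engine.begin() as conn:
    conn.execute(text("""
        CREATE TABLE IF NOT EXISTS assignment (
            employee TEXT,
            position TEXT,
            valid    DATERANGE      -- half-open period [start, end)
        )
    """))
    conn.execute(text(
        "CREATE INDEX IF NOT EXISTS assignment_valid_idx ON assignment USING GIST (valid)"
    ))
    # Sequenced-style question: which assignments overlap a given period?
    rows = conn.execute(text(
        "SELECT * FROM assignment WHERE valid && daterange('2017-01-01', '2017-12-31')"
    )).fetchall()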
Time Series Data: Series of observations (data points) made sequentially over time at discrete
time points, usually at equal intervals, e.g. sleep patterns, weather observations.
Time Series Analysis: Extract statistics or other characteristics of the time series; identify
patterns, data visualization. Methods used for this can be split into time-domain methods and
frequency-domain methods.
Time Series Forecasting: Find a model of the time series to predict future values based on
previously observed values.
Storage approaches for time series data:
- Point based representation: multiple rows with atomic data types
- Sequence based representation: single row with array of time point data
- Dedicated time series database: e.g. TimescaleDB
Point Based Representation in Flat Table
- Each row is time stamped with a time point/instant
- Most basic and simple data model
- Time stamps are atomic values and can be compared
- Good for time domain analysis, allows for indexing.
Working with Point Based Representation:
- Individual INSERT, UPDATE statements
- Need to be careful with granularity and timezone of data.
- Joins to dimension tables
Sequence based representation in Nested Table
- Single row with array of time point data
- Requires array data type
- Discrete time points can be represented as:
- Part of the array
- Implicitly by position
Image Data Processing
Image analysis is the extraction of meaningful information from images; mainly from digital
images by means of digital image processing techniques.
Images can be described as vector graphics or raster data.
Raster images:
- Matrix with fixed number of rows and columns
- Digital images consist of a fixed number of picture elements, called pixels.
- Each pixel represents the brightness of a given colour.
- Colour depth: Different number of channels.
- Raster images can be created in multiple ways:
- Digital photography / video
- Image sensors
- Scanners
Types of Images:
- True Colour or RGB Image: Each pixel has a particular colour; That colour is described
by the value of pixels in the RGB channel.
- Number of bits per pixel defines how many colors can be represented.
- Gray-Scale Image: Single channel, each pixel is a shade of gray.
- Binary Image: Each pixel is just black or white. Single channel can be 1 bpp, but works
with higher. Referred to as ‘mask’ in the image processing domain.
Many different types of image file formats:
- Lossless (e.g. TIFF, GIF, PNG) versus lossy compressed (JPEG)
Digital images have metadata:
- Especially digital photography
- Several standard formats, such as EXIF or XMP
Typically includes:
- When captured, how, which camera used, which settings, location, etc.
The metadata of one user is data for another.
Image metadata in form of key value pairs:
- Geo-location analysis of images
- Extraction of metadata such as author and description.
- If lucky, key words.
Otherwise, we need to look at the actual image data. Image data itself is a form of unstructured
data. (Unstructured data refers to information that does not have a predefined model)
Image Analysis Process:
1. Image acquisition
2. Preprocessing
3. Feature extraction
4. Classification / Image similarity / Object detection / Intensity based analysis
Aspects of image processing:
- Image Enhancement: Processing an image so that the result is more suitable for an
application. (e.g. sharpening or deblurring, improving contrast, removing noise)
- Image Restoration: This may be considered as reversing the damage done to an image
by a known cause. (e.g. removing of blur caused by linear motion)
- Image Segmentation: This involves subdividing an image into constituent parts, or
isolating certain aspects of an image. (e.g. finding lines, circles or particular shapes.)
Usually image data analysis involves the following steps:
- Preprocessing / Image enhancement
- Such as removal of imperfections or noise
- Extraction of region of interest using binary mask
- Analysis / Measurements
- Binary image, grayscale intensities
In image processing, filters are mainly used to suppress either the high frequencies in the
image, i.e. smoothing the image, or the low frequencies, i.e. enhancing or detecting edges in
the image.
Dynamic Range Compression:
Dynamic range of an image is the ratio of brightest to darkest intensity values
- Image will have high contrast if the dynamic range is high.
- Dynamic range of a scene usually exceeds the dynamic range of a viewing device.
Dynamic range compression can be used to compress and redistribute the dynamic range of
the scene in an image into the visual range.
Image Restoration: Point spread function (PSF) / Convolution
- PSF is described as the impulse response of the optical system. E.g. diffraction of light
which determines the microscope's resolution limit.
- The captured image is therefore a convolution of the true image with the PSF.
Image restoration: Deconvolution
- Deconvolution is an algorithm based process used to reverse the effects of convolution
on an image.
Morphological Image Processing:
- Broad set of operations that process images based on shapes. Goal: removing
imperfections in images.
- Morphological techniques probe an image with a small shape or template called a
structuring element.
- The structuring element is a small binary image, i.e. a small matrix of pixels each with a
value of zero or one.
Morphological Dilation:
- The value of the output pixel is the maximum value of all the pixels in the input pixel's
neighbourhood. In a binary image, if any of those pixels is set to the value 1, the output
pixel is set to 1.
- It's how we strengthen small signals which can help for correctly identifying connected
components.
Morphological Erosion:
- The value of the output pixel is the minimum value of all the pixels in the input pixel's
neighbourhood. In a binary image, if any of those pixels is set to 0, the output pixel is set to
0.
Connectivity defines which pixels are connected to other pixels.
4-connected: Pixels are neighbours to every pixel that touches one of their edges.
8-connected: Pixels are neighbours to every pixel that touches one of their edges or corners.
ROI Segmentation by Thresholding and Histogram
Thresholding creates binary images from grey-level ones by turning all pixels below some
threshold to zero and all pixels above that threshold to one. Basically the separation of light and
dark regions.
Global thresholding chooses threshold T that separates objects from background.
Thresholding issues:
- Many objects at different gray levels
- Variations in background gray level
- Noise in image
Global vs Adaptive / Local Thresholding
- Local threshold T(x, y) is calculated for each pixel, based on some local statistics such
as range, variance, or surface fitting parameters of the neighbourhood pixels within a
local block of size wxw.
Binary image uses:
- Object recognition
- Spatial location
- Size measurements
- Surveillance
Histogram of pixel intensity values:
- Histograms plot how many times each intensity value in the image occurs.
- Different images can have the same histogram.
- We can not reconstruct an image from a histogram.
We can detect bad exposure using histograms.
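A small sketch with Pillow and NumPy; the file name and threshold value are placeholders:

import numpy as np
from PIL import Image

img = np.asarray(Image.open('cells.png').convert('L'))   # grayscale intensities 0-255

# Histogram of pixel intensity values
hist, bin_edges = np.histogram(img, bins=256, range=(0, 256))

# Global thresholding: pixels above T become 1 (object), the rest 0 (background)
T = 128
mask = (img > T).astype(np.uint8)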
Feature Extraction
Image Feature: Simple pattern within image, colour information, metadata
Core idea of using features: transform visual information into vector space.
Image similarity search:
- Core idea: extract feature vectors from images and build an index of them.
- Search: convert query image into feature vector by some method, compare feature
vectors.
How to extract features from images?
1. Image descriptors (white box algorithms)
2. Neural networks (black box algorithms)
Many white box algorithms for feature extraction, typically based on image gradient: change of
intensity or color of image.
Big Data
Big Data for Businesses:
- Google Ads Searches
- Predictive marketing
- Fraud detection
Big Data Challenges:
- Data Privacy
- Some data sources allow tracking of individuals.
- Need to deal with privacy laws
- Data Security
- Can your users trust you to keep the data safe
- Data Discrimination
- Is it acceptable to discriminate against people based on data on their lives?
- Credit card scoring, health insurance?
- Check These:
- Are you working on a representative sample of users/consumers?
- Do your algorithms prioritize fairness? Aware of possible bias?
- Keep in mind, Validity (data quality), Veracity (data accuracy / trustworthiness)
Analysing Big Data:
Data is either too large, arrives too fast, or needs to be combined from diverse sources for processing.
There is a need for scalable platforms and processing abstractions.
Scale Up: To scale with increasing load, buy more powerful, larger hardware - from a single
workstation, to a dedicated DB server, to a large massively parallel database appliance.
Scale Out: For real big data processing, you need to scale out to a cluster of multiple servers.
Challenges:
- Scale Agnostic Data Management:
- Sharding for performance ( sharding = dividing into smaller parts)
- Replication for availability
- Scale Agnostic Data Processing:
- Parallel processing
- Ideally the system is never down and can handle failures.
MapReduce facilitates concurrent processing by splitting data into smaller chunks and then
processing them in parallel on different servers. In the end, the results are aggregated again.
Map Reduce Overview:
- Scans large volumes of data
- Map: Extract some interesting information
- Applies a given function f to all elements of a collection; returns a new collection
- Shuffle and sort intermediate results.
- Reduce: Aggregate intermediate results
- Applies a given function g to all elements of an input list; produces, starting from
a given initial value, a single (aggregate) output value.
A standard map reduce task is similar in its functionality to declarative aggregation queries in
SQL.
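A tiny pure-Python sketch of the map, shuffle and reduce phases for a word count, independent of any particular framework:

from collections import defaultdict

documents = ["big data is big", "data science"]

# Map: emit (key, value) pairs from every input element
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle and sort: group the intermediate values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate the values of each group into a single output value
result = {word: sum(counts) for word, counts in groups.items()}
# {'big': 2, 'data': 2, 'is': 1, 'science': 1}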
MapReduce Pros:
- Very flexible due to user defined functions.
- Great scalability.
- Easy parallelism
- Fault tolerance
Cons:
- Requires programming skills and functional thinking.
- Relatively low level; even filtering has to be coded manually.
- Batch processing oriented.
Distributed Data Analytics Frameworks:
- Apache Hadoop: open-source implementation of the original MapReduce from Google.
- Apache Spark: distributed cluster computing framework.
- Apache Flink: similar to Spark, but with emphasis on a built-in dataflow optimiser.
- Apache Hive: provides an SQL-like interface on top of Hadoop.
SQL Notes
Filtering
BETWEEN selects values within range.
Inclusive.
SELECT *
FROM observations
WHERE air_temp BETWEEN 20 AND 29.9;
Can filter with strings. Case sensitive.
SELECT *
FROM Observations
WHERE city = 'sydney';
Summarising Data
Aggregate functions allowed: COUNT(), SUM(), AVG(), MAX(), MIN()
SELECT MAX(wind_speed)
FROM Observations;
Dates can be worked with in the format
YYYY-MM-DD.
SELECT avg(air_temp)
FROM observations
WHERE date > '2016-12-31'
AND date < '2017-02-01'
AND city = 'Sydney';
DISTINCT returns only distinct values.
SELECT COUNT(DISTINCT city)
FROM Observations;
Transforming Data
ORDER BY orders results. ORDER BY date
sorts chronologically. Can add in ASC or
DESC keywords.
SELECT *
FROM Observations
ORDER BY air_temp;
LIMIT limits the output size.
SELECT *
FROM Observations
WHERE city = 'Perth'
ORDER BY air_temp DESC
LIMIT 10;
random() helps generate random numbers or
sort randomly
SELECT *
FROM Observations
ORDER BY random()
LIMIT 20;
AS lets us rename columns.
SELECT MAX(air_temp) AS "max air temp"
FROM Observations;
Adding constant columns is possible.
SELECT date AS "date",
'Celsius' AS "unit"
FROM Observations;
ROUND(x, 1) lets us round a value to a specific
number of decimal places.
SELECT date AS "date",
air_temp AS "original_temp",
ROUND(air_temp * 9 / 5 + 32, 1),
'Fahrenheit' AS "unit"
FROM Observations
WHERE city = 'Perth';
|| can be used to concatenate strings.
SELECT date AS "date",
air_temp || '°C' AS "temperature"
FROM Observations
WHERE city = 'Perth';
SELECT city || ', ' || state AS "cities"
FROM Cities;
TO_CHAR() can be used to convert date
value into a string of specified format.
SELECT TO_CHAR(date, 'DD Mon YYYY')
AS "date",
air_temp || '°C' AS "temperature"
FROM Observations
WHERE city = 'Perth';
CASE is basically an if-else statement.
SELECT obsdate AS "observation date",
air_temp AS "temperature",
'Celsius' AS "unit",
CASE
WHEN air_temp >= 30 THEN 'Hot'
WHEN air_temp > 10 AND air_temp < 30 THEN 'Cool'
ELSE 'Cold'
END AS "category"
FROM Observations
WHERE city = 'Melbourne';
Joins
Join conditions can allow access
to multiple tables.
SELECT station, site
FROM Stations, Organisations
WHERE Stations.stnowner = Organisations.code
AND Organisations.name = 'NSW';
SELECT count(obsvalue)
FROM Sensors S JOIN Measurements M USING
(sensor)
WHERE S.description = 'Water Temperature'
AND S.metric = 'celsius';
SELECT Stations.site, Organisations.name
FROM Stations
JOIN Organisations ON (stnowner = code);
NATURAL JOIN combines the
rows whose values agree in all
common attributes.
*Natural join shows the join attributes
only once, unlike an explicit join.
*If there are no common attributes, it
produces a cross join.
SELECT *
FROM Measurements NATURAL JOIN Sensors;
SELECT S.station, obsvalue, site, stnowner
FROM Measurements M NATURAL JOIN Sensors Se,
Stations S
WHERE S.site = 'Murray River at Swan Hill'
AND '2007-02-01'<= M.obsdate AND M.obsdate <
'2007-03-01'
ORDER BY obsdate, sensor ASC;
SELF JOIN can be used to
compare entries in the table.
Relating two rows from the same
table.
SELECT B.site, B.commence, B.stnowner
FROM Stations A, Stations B
WHERE A.site='Murray'
AND B.commence < A.commence;
^Listing all stations which commenced before Murray
1. When you list more than one table in the FROM clause of a query, make sure that every
table is mentioned with at least one JOIN operator or in at least one explicit join
condition.
2. When using the NATURAL JOIN operator, ensure that both tables indeed share common
attributes. Otherwise the query result will be the cross product of the tables which is
typically wrong.
3. When using the NATURAL JOIN operator, ensure that you indeed want to join on all
common attributes.
4. Consider using JOIN ... USING (...) with an explicit list of join attributes instead of the
NATURAL JOIN operator. This avoids the issues (2) and (3) above, while also
future-proofing your query against schema changes.
5. Do not hard-code join values in your queries to avoid a join with another table.
Remember that databases can change over time.
Incomplete Data
UPPER or LOWER can be used to account
for capitalisation inconsistencies.
SELECT *
FROM Sensors
WHERE UPPER(metric) = 'PH';
IN can be used to account for spelling variants.
SELECT *
FROM Stations
WHERE state IN ('VIC', 'Victoria');
LIKE is used to check for a string value
matching a given pattern. % is used as a
placeholder for any sequence of characters.
Case sensitive. *SIMILAR TO can also be
used (see below).
SELECT *
FROM Stations
WHERE site LIKE 'Murray River%';
_ is another placeholder, matching exactly
one character ('S__' matches 'S' followed
by exactly two characters).
SELECT *
FROM Organisations
WHERE code LIKE 'S__';
SIMILAR TO command can also be
used for pattern matching.
SELECT *
FROM Stations
WHERE site SIMILAR TO 'Murray (at|@) %';
COALESCE(expr, value_if_null) returns
the second argument if the first expression
evaluates to NULL. Both arguments should
be of the same type, so some casting may
be needed.
SELECT station, obsdate, level_water,
COALESCE(dis::text, '[UNKNOWN]') AS dis,
COALESCE(tem::text, '[UNKNOWN]') AS tem,
COALESCE(ec::text, '[UNKNOWN]') AS ec
FROM Measurements;
If dis, tem, or ec evaluates to NULL,
'[UNKNOWN]' is shown in its place.
All together:
SELECT station, obsdate,
CASE
WHEN temperature = '-' Then NULL
WHEN temperature = '--' Then NULL
WHEN temperature = '' Then NULL
WHEN temperature = 'n/a' Then NULL
WHEN temperature = ' ' Then NULL
WHEN temperature LIKE '<%' THEN NULL
WHEN temperature LIKE '>%' THEN NULL
ELSE ABS(CAST(temperature AS Float))
END AS "temperature"
FROM Measurements
WHERE obsdate >= '2009-04-01' AND obsdate
<= '2009-04-30'
ORDER BY station asc, obsdate asc;
Note that COUNT(*) counts all rows, including those containing NULL values. When you specify a
column to COUNT(), NULL values in that column are ignored.
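A small illustration (a sketch reusing the Measurements table from these notes, assuming its temperature column may contain NULL values):
-- COUNT(*) counts every row; COUNT(temperature) skips rows where temperature IS NULL
SELECT COUNT(*) AS all_rows,
       COUNT(temperature) AS rows_with_temperature
FROM Measurements;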
Datasets to Databases
INSERT enters a new row into a
specific table.
INSERT INTO tablename VALUES (...);
UPDATE command modifies data
in one or more rows.
UPDATE Country
SET name= 'Confoederatio Helvetica'
WHERE short_code = 'CH';
DELETE removes one or more
rows from the specified table.
DELETE FROM Languages
WHERE name = 'German';
CREATE TABLE command lets
you make a new table.
CREATE TABLE Continent (
id INTEGER,
name VARCHAR(50));
INSERT INTO Continent VALUES (1, 'Asia');
SELECT * FROM Continent;
PRIMARY KEY Declares an attribute (or several) as the unique
identifier for rows in the table
NOT NULL Forces every row in the table to have a value for this
attribute
CREATE TABLE Continent (
id INTEGER PRIMARY KEY,
name VARCHAR(50) NOT NULL);
DEFAULT Specify a default value for an attribute
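For example, a variant of the Continent table above with a default name (a sketch; the 'unknown' value is just an illustrative choice):
CREATE TABLE Continent (
    id   INTEGER PRIMARY KEY,
    name VARCHAR(50) DEFAULT 'unknown');
INSERT INTO Continent (id) VALUES (2);
-- the omitted name attribute is filled in with 'unknown'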
REFERENCES declares a foreign
key. A table can have multiple
foreign keys.
CREATE TABLE tablename (
    ...
    attribute type REFERENCES othertable (attribute),
    ...
);
DROP TABLE command deletes the
specified table, including all its data.
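For example (assuming the Continent table created above is no longer needed):
DROP TABLE Continent;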
A foreign key constraint specifies that values stored for an attribute (or set of attributes)
in one table must match values stored in a different table.
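A minimal sketch, reusing the Country and Continent tables from the examples above; the continent_id attribute is a hypothetical addition for illustration:
CREATE TABLE Country (
    short_code   VARCHAR(2) PRIMARY KEY,
    name         VARCHAR(50) NOT NULL,
    continent_id INTEGER REFERENCES Continent (id));
-- every continent_id stored in Country must match an existing id in Continent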
COPY FROM command to upload CSV
data
CREATE TABLE Measurements (
station VARCHAR(20) NOT NULL,
obsdate Date NOT NULL,
level_water VARCHAR(20),
discharge FLOAT,
temperature FLOAT,
ec25 FLOAT,
PRIMARY KEY (station, obsdate)
);
COPY Measurements FROM
'Measurements-eg1.csv' CSV HEADER;
If the CSV data uses an explicit placeholder
for missing values, we can tell COPY which
string to treat as NULL.
COPY Measurements FROM
'Measurements-eg3.csv' WITH (FORMAT
CSV, NULL '--', HEADER);
Sets
UNION is used to combine the output
of two separate SQL queries. Use
UNION ALL if you want to keep
duplicates.
SELECT obsdate FROM WeatherObservations
UNION
SELECT obsdate FROM WaterMeasurements;
INTERSECT is the set intersection
EXCEPT is the set difference
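A minimal sketch of both, reusing the tables from the UNION example above:
SELECT obsdate FROM WeatherObservations
INTERSECT
SELECT obsdate FROM WaterMeasurements;
-- dates that appear in both tables

SELECT obsdate FROM WeatherObservations
EXCEPT
SELECT obsdate FROM WaterMeasurements;
-- dates with a weather observation but no water measurement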
A VIEW is an SQL query we can save
and use again. Results are dynamically
retrieved every time the view is queried.
A view can be deleted with the DROP
VIEW command.
CREATE VIEW AprilTemperaturesNSW AS
SELECT temperature AS "water_temp"
FROM Measurements NATURAL JOIN Stations
WHERE temperature is not null
AND DATE_PART('Month', obsdate) = 4
AND Stations.state = 'NSW';
SELECT *
FROM AprilTemperaturesNSW;
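To remove the view again (a sketch of the drop command mentioned above):
DROP VIEW AprilTemperaturesNSW;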
Subqueries
A subquery is used to return data that will be used in the main query as a condition to further
restrict the data to be retrieved.
SELECT obsvalue AS "max_temp"
FROM Stations NATURAL JOIN Measurements
WHERE sensor='temp' AND obsvalue = ( SELECT MAX(obsvalue)
                                     FROM Measurements
                                     WHERE sensor='temp' );
^Can be used in the WHERE and FROM clauses
EXISTS is used to test for the
existence of any record in a
subquery. The outer row is returned if
the subquery yields one or more records.
SELECT sitename
FROM Stations S
WHERE EXISTS ( SELECT *
FROM Measurements M
WHERE M.stationid = S.stationid
AND M.obsvalue < 0 );
^lists all stations that have at least one
measurement with negative value, to find data
inconsistencies
NOT EXISTS can be used to make
sure that a consistency constraint
holds for all data entries: it checks
that no entry exists for which the
condition is violated.
SELECT S.name
FROM SweetsProducts S
WHERE NOT EXISTS ( SELECT *
FROM SweetsRatings R
WHERE R.name=S.name AND
R.rating <= 3 );
^lists products where all ratings are above 3
SELECT film_id, title, release_year
FROM Films F
WHERE NOT EXISTS (SELECT film_id FROM
Film_Actor FA
WHERE F.film_id = FA.film_id)
ORDER BY film_id;
^lists all films with nobody acting in them
EXTRACT is a command used to
retrieve a component (such as the
year) from a date attribute.
SELECT DISTINCT EXTRACT(year FROM
obsdate) AS year_of_measurement
FROM Measurements
ORDER BY year_of_measurement;
examples
SELECT stationid, sitename
FROM Stations
WHERE stationid IN ( SELECT M.stationid
FROM Measurements M
WHERE EXTRACT(year from M.obsdate) = 2010 );
SELECT film_id, title, actor_id, given_name, family_name
FROM Films JOIN Film_Actor using (film_id) JOIN Actors using (actor_id)
WHERE Films.film_id IN (SELECT film_id FROM Film_Actor FA, Actors A
WHERE A.given_name = 'PENELOPE' AND A.family_name = 'GUINESS'
AND FA.actor_id = A.actor_id);
A co-related subquery is executed for each tuple of the outer query. This means that its
search condition refers to at least one attribute of the outer query.
A non-co-related subquery is executed only once and then its result is used for all tuples
of the outer query.
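A minimal sketch contrasting the two forms, reusing the Stations and Measurements attributes from the EXISTS example above:
-- non-co-related: the inner query runs once and its result is reused for every station
SELECT sitename
FROM Stations
WHERE stationid IN ( SELECT stationid
                     FROM Measurements
                     WHERE obsvalue < 0 );

-- co-related: the inner query refers to S.stationid, so it is re-evaluated per outer row
SELECT sitename
FROM Stations S
WHERE EXISTS ( SELECT *
               FROM Measurements M
               WHERE M.stationid = S.stationid AND M.obsvalue < 0 );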
GROUP BY allows us to explore the data per a certain aspect of the data. Include all
non-aggregated attributes in the GROUP BY clause.
SELECT stationid, COUNT(*) AS "num"
FROM Measurements
GROUP BY stationid;
examples
SELECT release_year, COUNT(film_id) AS "number_of_films"
FROM Films
GROUP BY release_year
ORDER BY number_of_films desc, release_year;
SELECT category_id, name, COUNT(film_id)
FROM Category left outer join Film_Category using (category_id)
GROUP BY category_id, name
ORDER BY COUNT(film_id) DESC, name;
SELECT stationid, COUNT(*) AS "num_measurements"
FROM Measurements
WHERE sensor = 'level' AND EXTRACT(year from obsdate) = 2014
GROUP BY stationid
HAVING COUNT(*) >= 360;
HAVING allows us to specify what data groups we want to consider for the query result.
SELECT name, COUNT(stationid) AS "num"
FROM Organisations JOIN Stations ON (orga=code)
GROUP BY name
HAVING COUNT(*) >= 3;
The clauses of an SQL query appear in this order:
SELECT
FROM
WHERE
GROUP BY
HAVING
ORDER BY
string_agg() is a way to concatenate strings vertically, i.e. across the rows of a group:
string_agg(attribute, delimiter). attribute stands for the name of the column you are going to
concatenate, and delimiter specifies how you would like to separate the individual strings.
examples
SELECT 2000 AS year, string_agg(title, ', ') AS film_titles
FROM Films
WHERE release_year = 2000;
SELECT 2000 as year, string_agg(title, ', ' ORDER BY title) AS film_titles
FROM Films
WHERE release_year = 2000;
SELECT string_agg(DISTINCT nationality, ', ') AS actor_nationalities
FROM Films NATURAL JOIN Film_Actor NATURAL JOIN Actors
WHERE release_year = 2000;
array_agg() is an option to combine the values from one group into a single array
attribute.
SELECT release_year, array_agg(title) AS film_titles
FROM Films
GROUP BY release_year
ORDER BY release_year;
examples
SELECT C.name as "category", count(F.film_id) as "num_films",
string_agg(DISTINCT F.title, ';' ORDER BY F.title) AS "film_list"
FROM Films F NATURAL JOIN Film_Category JOIN Category C
using (category_id)
GROUP BY C.name
ORDER BY "num_films" DESC, C.name ASC;
