MAST30034 Applied Data Science Lecture 7
DATA CLEANING
Dr. Liam Hodgkinson
School of Mathematics and Statistics
The University of Melbourne
September 5, 2023
Pipeline
1. Landing Layer (1-1 with upstream)
2. Raw Layer (w/ basic transformations e.g.
column renaming, data type conversions)
3. Curated Layer (filtered/transformed for
analysis)
4. Development / Analysis Layer (analysis or
further development)
Data Cleaning
Cleaning can be an expensive process, so
upstream improvements that prevent dirty
data from entering the pipeline are ideal.
However... data cleaning is inevitable.
It is often a behind-the-scenes art that
consumes a lot of hidden labor in a project.
Data Quality Problems
▶ Single-source problems (issues
within a particular data source)
▶ Multi-source problems (issues
arising when combining data
sources)
Single-source problems
Schema level:
Some illegal values can be detected /
addressed with an effective schema (Tutorial 2);
a sketch follows the examples below.
▶ Number field: e.g. age should not be ‘N/A’
▶ Boolean field: restrict to yes/no
▶ Date field: incorrect date, e.g. DOB =
25/13/1991
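A minimal sketch of schema-level enforcement, assuming a PySpark session; the file path and column names are hypothetical placeholders, not part of the lecture:
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, BooleanType, DateType)

spark = SparkSession.builder.appName("schema-checks").getOrCreate()

# An explicit schema rejects illegal values at read time.
schema = StructType([
    StructField("passenger_id", StringType(), nullable=False),
    StructField("age", IntegerType()),        # 'N/A' cannot be read as a number
    StructField("is_member", BooleanType()),  # restricted to true/false
    StructField("dob", DateType()),           # 25/13/1991 fails to parse as a date
])

# PERMISSIVE turns illegal values into nulls; FAILFAST raises on the first one.
df = spark.read.csv("data/landing/passengers.csv", header=True,
                    schema=schema, mode="PERMISSIVE")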
Single-source problems
Instance level (automatic checks):
▶ range: assert values in permissible range
(min/max), e.g. # of passengers > 0
▶ references: check if reference is defined
(e.g. postcode or department code)
▶ dependencies: if two fields depend on
each other, check the dependency is valid
(e.g. age / DOB).
▶ duplicates: # of unique IDs = # rows
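These checks can be automated as assertions over the raw layer. A minimal pandas sketch, with hypothetical file paths and column names:
import pandas as pd

df = pd.read_parquet("data/raw/trips.parquet")
valid_postcodes = set(pd.read_csv("data/raw/postcodes.csv")["postcode"])

# range: values in the permissible min/max range
assert (df["passenger_count"] > 0).all(), "non-positive passenger counts"

# references: every postcode must be defined in the reference table
assert df["pickup_postcode"].isin(valid_postcodes).all(), "undefined postcode"

# dependencies: age must be consistent with date of birth (1-year tolerance)
implied_age = (pd.Timestamp("2023-09-05") - df["dob"]).dt.days // 365
assert (df["age"] - implied_age).abs().le(1).all(), "age inconsistent with DOB"

# duplicates: number of unique IDs should equal number of rows
assert df["trip_id"].nunique() == len(df), "duplicate trip IDs"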
Single-source problems
Instance level (automatic checks):
▶ different representations: e.g. “J. Smith”
vs. “John Smith”. Set to common
representation, look at another table if
need be.
▶ outliers: flag any obvious strange outliers
▶ missing values: look for entries with NaN,
inf, or 0
Lots of research into these last two scenarios...
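For the missing-value check, a quick sketch that counts NaN, inf, and exact-zero entries per numeric column (the path is a placeholder):
import numpy as np
import pandas as pd

df = pd.read_parquet("data/raw/trips.parquet")   # hypothetical path
numeric = df.select_dtypes("number")

# Suspicious placeholders per column: NaN, +/-inf, and exact zeros.
report = pd.DataFrame({
    "n_nan": numeric.isna().sum(),
    "n_inf": np.isinf(numeric).sum(),
    "n_zero": numeric.eq(0).sum(),
})
print(report)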
Outlier Detection
Manual approaches:
▶ check minimum and maximum are sane
(best practice)
▶ visualize with boxplot (not good for lots of
data)
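A minimal sketch of both manual checks on a hypothetical numeric column (the boxplot requires matplotlib):
import pandas as pd

df = pd.read_parquet("data/raw/trips.parquet")   # hypothetical path

# Sanity-check the extremes.
print(df["trip_distance"].agg(["min", "max"]))

# Boxplots get slow and unreadable on large data, so draw one on a sample.
df["trip_distance"].sample(frac=0.01, random_state=0).plot.box()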
Outlier Detection
Automatic approaches (be careful with
these...):
▶ compute Z-scores: for N records,
  ▶ |Z| > 3 unlikely if N ⩽ 100
  ▶ |Z| > √(2 log N) unlikely if N > 100
▶ IQR rule: for N records, flag if
  ▶ 1.5 × IQR away from the IQR if N ⩽ 100
  ▶ (√(log N) − 0.5) × IQR away from the IQR if N > 100
▶ use a data drift method for streaming data
(next week)
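A sketch of both rules on a pandas Series, using the size-dependent cut-offs above; outlier_flags and the fare_amount column are hypothetical names for illustration:
import numpy as np
import pandas as pd

def outlier_flags(x: pd.Series) -> pd.Series:
    """Flag likely outliers using the size-dependent cut-offs above."""
    n = x.count()                                   # number of non-missing records

    # Z-score rule
    z = (x - x.mean()) / x.std()
    z_cut = 3.0 if n <= 100 else np.sqrt(2 * np.log(n))

    # IQR rule: flag values more than k * IQR beyond the quartiles
    q1, q3 = x.quantile([0.25, 0.75])
    k = 1.5 if n <= 100 else np.sqrt(np.log(n)) - 0.5
    iqr_flag = (x < q1 - k * (q3 - q1)) | (x > q3 + k * (q3 - q1))

    return (z.abs() > z_cut) | iqr_flag

# e.g. df["fare_outlier"] = outlier_flags(df["fare_amount"])  # hypothetical column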
Missing Data
Fields are invalid / left empty.
Terminology
▶ Missing completely at random
(MCAR)
▶ Missing at random (MAR)
▶ Missing not at random (MNAR)
Options for Missing Data
1. Listwise deletion: Remove entries
with missing data
2. Imputation: replacing missing data
with substituted values
▶ using mean/median — fast!
▶ modelling for missing data with ML
A Strategy for Missing Data
▶ Are missing values more common for one
group than another?
  ▶ Yes → do you have a lot of data?
    ▶ Yes → listwise deletion
    ▶ No → imputation (ML)
  ▶ No → imputation (average)
Imputation methods
▶ Imputation by ML:
▶ Remove entries with missing values
▶ Fit a model (e.g. linear regression)
▶ Predict missing entries using the model
▶ Small datasets: sklearn.impute.KNNImputer
▶ Imputation (averaging):
▶ PySpark: pyspark.ml.feature.Imputer
▶ scikit-learn: sklearn.impute.SimpleImputer
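A minimal scikit-learn sketch of both options on a toy array (pyspark.ml.feature.Imputer plays the averaging role on Spark DataFrames):
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 5.0]])

# Averaging: each NaN becomes its column's mean (strategy="median" also works).
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# ML-style imputation for small datasets: each NaN is predicted from the
# k most similar complete rows.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

print(X_mean)
print(X_knn)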
Single-source problems
Instance level (manual checks):
▶ Misspellings: sort values — brings
misspelled values next to correct ones
▶ Missing values: default values — may
indicate the real value is missing, but not
always
▶ Duplicates: sort values — more than one
similar occurrence indicates a duplicate
Multi-source problems
Always use a consistent schema when
combining datasets
▶ Standardization: Ensure entries across all
datasets are in a consistent format first!
▶ Conflicts:
▶ if identical, then merge entries / purge
redundant entries
▶ otherwise, determine most valid entry by a
common rule, e.g. date of entry
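A minimal pandas sketch of merging and conflict resolution; the customer tables, columns, and dates are hypothetical and assumed to be already standardized to a common schema:
import pandas as pd

source_a = pd.DataFrame({"customer_id": [1, 2],
                         "city": ["Melbourne", "Sydney"],
                         "last_updated": pd.to_datetime(["2023-01-05", "2023-02-01"])})
source_b = pd.DataFrame({"customer_id": [1, 2],
                         "city": ["Melbourne", "Brisbane"],   # id 2 conflicts
                         "last_updated": pd.to_datetime(["2023-01-05", "2023-03-10"])})

combined = pd.concat([source_a, source_b], ignore_index=True)
combined = combined.drop_duplicates()        # identical entries: purge redundancies

# Conflicting entries: resolve by a common rule, here the most recent date of entry.
resolved = (combined.sort_values("last_updated")
                    .drop_duplicates(subset="customer_id", keep="last"))
print(resolved)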
Fuzzy matching
Entries may be sufficiently similar to be
duplicates, but not identical
Solution: track how close one string is to
another — if sufficiently close, consider them
duplicates
e.g. John Smith, 45 Remington Way
vs.
John Smth, 4 Remington Way
Levenshtein Distance
minimum number of single-character
insertions / deletions / substitutions required
to change one word into the other
1. send to lowercase
2. remove stopwords (e.g. the)
3. compute the Levenshtein distance
(pyspark.sql.functions.levenshtein)
4. compute the ratio: R = 1 − L / (len(s1) + len(s2))
5. if R > 0.9 (for example), consider equal
Levenshtein Distance
e.g. John Smith, 45 Remington Way
vs.
John Smth, 4 Remington Way
L = 2, R = 0.96
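In Spark the distance itself comes from pyspark.sql.functions.levenshtein applied to two columns. Below is a self-contained pure-Python sketch of the whole recipe; levenshtein and similarity are local helper functions written for illustration, and stopword removal is omitted since this example has none:
def levenshtein(s1: str, s2: str) -> int:
    """Minimum number of single-character insertions/deletions/substitutions."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (c1 != c2)))   # substitution
        prev = curr
    return prev[-1]

def similarity(s1: str, s2: str) -> float:
    s1, s2 = s1.lower(), s2.lower()          # step 1: send to lowercase
    L = levenshtein(s1, s2)                  # step 3
    return 1 - L / (len(s1) + len(s2))       # step 4

a = "John Smith, 45 Remington Way"
b = "John Smth, 4 Remington Way"
print(levenshtein(a.lower(), b.lower()), round(similarity(a, b), 2))   # 2 0.96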
General Advice
▶ Seek automation (via data transformation),
but be prepared to get your hands dirty on
smaller datasets.
▶ Early steps should focus on single-source,
later steps on multi-source.
▶ Always validate your workflow on a sample
of the source data (e.g. corrupt or remove
some entries to test)
▶ Be wary of cloud services for data
laundering (privacy concerns).