COMP5310
Date: 2023-06-05
The University of Sydney Page 1
COMP5310: Principles of
Data Science
W13: Review
Presented by Ali Anaissi
Questions and suggestions
– Please complete the survey
– Browse to https://student-surveys.sydney.edu.au/students/
– Log in if you aren’t already
– Complete survey for COMP5310
Assessment Package
Assessment
– 10%: Project stage 1
– 10%: Project stage 2A
– 15%: Project stage 2B
– 5%: Project stage 3
– 60%: Final exam
Final Exam
Objective
Assess understanding of unit material, the ability to frame data problems scientifically, and critical thinking about claims made based on data
Content
– Answer questions about lecture
material and readings
– Describe an approach to answering a
question with data
– Critique a claim made based on data
Format
– Written examination
– Monitored by ProctorU
– Non-programmable calculators permitted
– Open book: you may access your local files and handwritten notes; in your browser, you may only access the Canvas site during the exam.
– Better to summarise your own notes on one double-sided A4 page
SIT policy: you must score at least 40% on the exam and 50% overall to pass COMP5310
Exam Format
– The exam will be 23 questions – 2 hours
– MCQ (1 question)
– Numeric and descriptive questions (22 short- and long-answer questions)
– It will draw from lectures and exercises
– Studying
– Review slides and notebook solutions
– No Python
– Yes SQL
Check the Sample Quiz in Canvas:
https://canvas.sydney.edu.au/courses/47855/quizzes/213277
Check the instructions on how and where to upload your file.
Data Processing and Management

Water Database Schema
[Schema diagram: a star schema with Measurement as the fact table and Sensor, Station, Organisation as dimension tables]
– Measurement (Station, Sensor, Date, Value): fact table
– Sensor (Sensor, Metric, Description): dimension table
– Station (Station, SiteName, Lon, Lat, Commence, OrgCode): dimension table
– Organisation (Code, Organisation): dimension table
Four tables as shown above, including foreign-key relationships
SELECT Statement Options

SQL Statement                   Meaning
SELECT COUNT(*) FROM T          count how many tuples are stored in table T
SELECT * FROM T                 list the content of table T
SELECT * FROM T LIMIT n         only list n tuples from the table
SELECT * FROM T ORDER BY a      order the result by attribute a (in ascending order; add DESC for descending order)
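These SELECT options can be tried directly. The sketch below runs them with Python's sqlite3 module against a toy version of the water schema's Measurement table (the rows are made up for illustration):

```python
import sqlite3

# In-memory database with a toy Measurement table (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Measurement (Station TEXT, Sensor TEXT, Date TEXT, Value REAL)")
conn.executemany("INSERT INTO Measurement VALUES (?, ?, ?, ?)", [
    ("S1", "level", "2019-01-01", 2.4),
    ("S1", "level", "2019-01-02", 2.6),
    ("S2", "flow",  "2019-01-01", 13.0),
])

# COUNT(*): how many tuples are stored in the table
print(conn.execute("SELECT COUNT(*) FROM Measurement").fetchone()[0])

# LIMIT n: only list n tuples
print(conn.execute("SELECT * FROM Measurement LIMIT 2").fetchall())

# ORDER BY a DESC: order the result by an attribute, descending
print(conn.execute("SELECT Value FROM Measurement ORDER BY Value DESC").fetchall())
```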
SQL Aggregate Functions

SQL Aggregate Function                              Meaning
COUNT(attr) ; COUNT(*)                              number of non-NULL attr values ; or of all rows
MIN(attr)                                           minimum value of attr
MAX(attr)                                           maximum value of attr
AVG(attr)                                           average value of attr (arithmetic mean)
MODE() WITHIN GROUP (ORDER BY attr)                 mode of the attr values
PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY attr)   median of the attr values
…                                                   …
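The difference between COUNT(attr) and COUNT(*) is easiest to see on a toy table containing a NULL (made-up values; note that MODE() and PERCENTILE_DISC are PostgreSQL ordered-set aggregates and are not available in SQLite, so this sketch sticks to the portable functions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (attr REAL)")
conn.executemany("INSERT INTO T VALUES (?)", [(v,) for v in (1.0, 2.0, 2.0, None, 5.0)])

# COUNT(attr) skips NULLs; COUNT(*) counts every row
print(conn.execute("SELECT COUNT(attr), COUNT(*) FROM T").fetchone())

# MIN / MAX / AVG also ignore the NULL value
print(conn.execute("SELECT MIN(attr), MAX(attr), AVG(attr) FROM T").fetchone())
```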
Queries with GROUP BY and HAVING
– In SQL, we can “partition” a relation into groups according to the value(s) of one or more attributes:

SELECT [DISTINCT] target-list
FROM relation-list
WHERE qualification
GROUP BY grouping-list
HAVING group-qualification

– A group is a set of tuples that have the same value for all attributes in grouping-list.
– Note: attributes in the SELECT clause outside of aggregate functions must appear in the grouping-list.
– Intuitively, each answer tuple corresponds to a group, and these attributes must have a single value per group.
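The skeleton above can be exercised on the Measurement table. This sketch (made-up rows, in-memory SQLite) forms one group per Station and uses HAVING to keep only groups with at least two tuples:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Measurement (Station TEXT, Sensor TEXT, Date TEXT, Value REAL)")
conn.executemany("INSERT INTO Measurement VALUES (?, ?, ?, ?)", [
    ("S1", "level", "2019-01-01", 2.4),
    ("S1", "level", "2019-01-02", 2.6),
    ("S2", "flow",  "2019-01-01", 13.0),
])

# Station appears outside an aggregate, so it must be in the grouping-list.
q = """SELECT Station, COUNT(*) AS n, AVG(Value) AS avg_value
       FROM Measurement
       GROUP BY Station
       HAVING COUNT(*) >= 2"""
print(conn.execute(q).fetchall())
```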
Recommended studying
– Review and understand slides and exercises
– Querying and summarizing data with SQL
– SQL join operators
– Filtering groups with the HAVING clause
– DISTINCT
Statistical Inference
Research question
– Research question (Q):
– Asks whether the independent variable has an effect
– “If there is a change in the independent variable, will there also be a
change in the dependent variable?”
– Null hypothesis (H0):
– The assumption that there is no effect
– “There is no change in the dependent variable when the independent
variable changes.”
Hypothesis testing
– We use it to decide whether to reject a claim about a population, based on the evidence provided by a sample of data.
– A hypothesis test examines two opposing hypotheses about a population:
• The null hypothesis
• The alternative hypothesis
– A statistical test is often used to determine whether the mean of a
population significantly differs from a specific value or from the
mean of another population.
Testing reliability with p-values
– Most tests calculate a p-value measuring how extreme the observation is under the null hypothesis
– Compare it to the significance level threshold α
– α is the probability of rejecting H0 given that it is true
– Commonly used α values are 5% and 1%

P-value   Indicates                                      Reject H0?
≤ α       Strong evidence against the null hypothesis    Yes
> α       Weak evidence against the null hypothesis      No
= α       Marginal                                       NA
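As an illustration of comparing a p-value against α: the slides do not prescribe a particular test, so this sketch uses a two-sided permutation test for a difference in means on two made-up samples (under H0, the group labels are exchangeable):

```python
import random

def perm_test_pvalue(a, b, n_perm=10000, seed=0):
    """Two-sided permutation test for a difference in means.
    p-value: fraction of label shuffles at least as extreme as observed."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                      # reassign group labels at random
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            extreme += 1
    return extreme / n_perm

alpha = 0.05
p = perm_test_pvalue([5.1, 4.9, 5.3, 5.2], [6.0, 6.2, 5.9, 6.1])
print(p, "reject H0" if p <= alpha else "fail to reject H0")
```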
Recommended studying
– Review and understand slides and exercises
– Example scenarios
– Research questions
– Null/alternative hypotheses
– Statistical tests
Clustering
Data Clustering
– Goal: group objects into clusters of similar data
– Note: No exact answer possible
– Types of Clusterings
– Partitional clustering
A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
– Hierarchical clustering
A set of nested clusters organized as a hierarchical tree
How many clusters..?
Hierarchical Agglomerative Clustering
– Initial
– Each point in its own cluster
– Repeat
– Find closest pair of clusters
• Min-distance between
any two points (single linkage)
– Merge them into one cluster
– Recompute distances
between new cluster and
others
– Until
– Desired number of clusters remaining e.g. single cluster
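The loop above can be sketched in a few lines of Python (a naive version for small inputs: distances are recomputed from scratch after every merge rather than cached):

```python
def single_link_clustering(points, k):
    """Agglomerative clustering with single linkage (min distance between
    any two points of two clusters), merging until k clusters remain."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    clusters = [[p] for p in points]                 # initial: each point alone
    while len(clusters) > k:
        # find the closest pair of clusters under single linkage
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]                   # merge the closest pair
        del clusters[j]                              # distances recomputed next round
    return clusters

clusters = single_link_clustering([(0,), (1,), (9,), (10,)], 2)
print(clusters)
```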
Partitional clustering: K-Means Clustering
Method
– Given k, the k-means algorithm is implemented in four steps:
– Partition objects into k nonempty subsets
– Compute seed points as the centroids of the clusters of the current
partitioning (the centroid is the center, i.e., mean point, of the cluster)
– Assign each object to the cluster with the nearest seed point
– Go back to Step 2, stop when the assignment does not change
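The four steps can be sketched as follows (seeding with the first k points is a simplification for illustration; real implementations usually seed randomly or with k-means++):

```python
def kmeans(points, k, max_iter=100):
    """Plain k-means sketch: seed centroids with the first k points."""
    def sqdist(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))

    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # assign each object to the cluster with the nearest centroid
        new = [min(range(k), key=lambda c: sqdist(p, centroids[c])) for p in points]
        if new == assignment:            # stop when assignments no longer change
            break
        assignment = new
        # recompute each centroid as the mean point of its cluster
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return assignment, centroids

labels, cents = kmeans([(1, 1), (1.5, 2), (8, 8), (9, 9)], 2)
print(labels, cents)
```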
Using Silhouettes to choose k
High average silhouette
indicates points far away
from neighbouring clusters
http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
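A minimal silhouette computation (pure Python, Euclidean distance; a singleton cluster is given score 0 by convention) shows why well-separated clusters score close to 1:

```python
def mean_silhouette(points, labels):
    """Average silhouette: for each point, a = mean distance to the other
    points of its own cluster, b = mean distance to the nearest other
    cluster; the point's score is (b - a) / max(a, b)."""
    def dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

    scores = []
    for i, p in enumerate(points):
        same = [q for j, q in enumerate(points) if j != i and labels[j] == labels[i]]
        if not same:                     # singleton cluster: score 0 by convention
            scores.append(0.0)
            continue
        a = sum(dist(p, q) for q in same) / len(same)
        b = min(
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == lab)
            / labels.count(lab)
            for lab in set(labels) if lab != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two tight, far-apart clusters: the average silhouette is close to 1.
score = mean_silhouette([(0, 0), (0, 1), (10, 10), (10, 11)], [0, 0, 1, 1])
print(score)
```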
Exercise:
BOS NY DC MIA CHI SEA SF LA DEN
BOS 0 206 429 1504 963 2976 3095 2979 1949
NY 206 0 233 1308 802 2815 2934 2786 1771
DC 429 233 0 1075 671 2684 2799 2631 1616
MIA 1504 1308 1075 0 1329 3273 3053 2687 2037
CHI 963 802 671 1329 0 2013 2142 2054 996
SEA 2976 2815 2684 3273 2013 0 808 1131 1307
SF 3095 2934 2799 3053 2142 808 0 379 1235
LA 2979 2786 2631 2687 2054 1131 379 0 1059
DEN 1949 1771 1616 2037 996 1307 1235 1059 0
• The following pages trace a hierarchical
clustering of distances in miles between
U.S. cities.
• The method of clustering is single-link.
• Use Agglomerative hierarchical
clustering algorithm to cluster the
following data.
• Illustrate every step of the merging
process using the calculation table in
your report.
• Draw the final dendrogram
Association Rule
Association Rule Mining
– Predict the occurrence of an item based on other items in the transaction, e.g.:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
– Note that arrows indicate co-occurrence, not causality
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Market-basket transactions
TID: Transaction Identifier
Items: Transaction item set
Definition: Frequent Itemset
– Support count (σ) is the itemset frequency
– Support (s) is the normalised itemset frequency
– A frequent itemset has s ≥ min_support
Definition: Association Rule
– An association rule is an implication of the form X → Y, where X and Y are itemsets, e.g.:
{Milk, Diaper} → {Beer}
– Confidence (c) measures how often Y occurs in transactions with X
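Both support and confidence can be checked against the slides' five market-basket transactions:

```python
# The market-basket transactions from the slides.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    """sigma: number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset):
    """s: support count normalised by the number of transactions."""
    return support_count(itemset) / len(transactions)

def confidence(X, Y):
    """c: how often Y occurs in transactions that contain X."""
    return support_count(X | Y) / support_count(X)

print(support({"Milk", "Diaper", "Beer"}))       # 2 of 5 transactions
print(confidence({"Milk", "Diaper"}, {"Beer"}))  # 2 of the 3 transactions with X
```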
Mining Association Rules
– Association Analysis
– frequent itemsets
– association rules as implications between frequent itemsets: X → Y
– measures: frequency, support, confidence
– Mining Association Rules
– brute-force enumeration is computationally prohibitive, hence:
– FP-growth or Apriori algorithm for generating frequent itemsets
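The level-wise Apriori idea (generate candidates from the previous level, prune any candidate with an infrequent subset, then count support) can be sketched as follows, here run on the exercise's transaction records with min_sup = 2 (spelling normalised to coca-cola):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Frequent itemsets by level-wise Apriori: a (k+1)-itemset can only be
    frequent if every one of its k-subsets is frequent (pruning principle)."""
    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items if count(frozenset([i])) >= min_sup]
    frequent = {s: count(s) for s in level}
    k = 1
    while level:
        # candidate generation: unions of frequent k-itemsets of size k + 1
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # prune candidates with an infrequent k-subset, then count support
        level = [c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, k))
                 and count(c) >= min_sup]
        frequent.update((s, count(s)) for s in level)
        k += 1
    return frequent

tx = [{"apple", "banana", "coca-cola", "doughnut"},
      {"banana", "coca-cola"},
      {"banana", "doughnut"},
      {"apple", "coca-cola"},
      {"apple", "banana", "doughnut"},
      {"apple", "banana", "coca-cola"}]
freq = apriori(tx, 2)
print(len(freq))   # number of frequent itemsets found
```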
Exercise:
Transaction Records
Transaction ID Items
#1 apple, banana, coca-cola, doughnut
#2 banana, coca-cola
#3 banana, doughnut
#4 apple, coca-cola
#5 apple, banana, doughnut
#6 apple, banana, coca-cola
1. Build the FP-tree using a minimum support min_sup = 2. Show how the tree evolves for each transaction.
2. Use the FP-Growth algorithm to discover frequent itemsets from the FP-tree.
3. Using the same transaction records, apply the Apriori algorithm to this dataset and verify that it generates the same set of frequent itemsets with min_sup = 2.
4. Suppose that {apple, banana, doughnut} is a frequent itemset; derive all its association rules with min_confidence = 70%.
Recommended studying
– Review and understand slides and exercises
– Data Clustering with k-means
– Hierarchical Agglomerative Clustering
– Evaluating Clustering, how to choose k
– Association Rule Mining: FP-growth and Apriori
Information Gain
Information Gain (IG)
– IG calculates the effective change in entropy after making a decision based on the value of an attribute:

IG(Y|X) = H(Y) − H(Y|X)

Where:
• Y is a class label.
• X is an attribute.
• H(Y) is the entropy of Y.
• H(Y|X) is the conditional entropy of Y given X.
Information Gain (IG)
– Entropy: to measure the uncertainty associated with data:

H(Y) = − Σ_{i=1..k} p_i log2(p_i)

where p_i = P(Y = y_i) and k is the number of classes.
– Interpretation:
• Higher entropy => higher uncertainty
• Lower entropy => lower uncertainty
– Conditional entropy H(Y|X): the average conditional entropy of Y:

H(Y|X) = Σ_j P(X = x_j) · H(Y | X = x_j)
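The two formulas translate directly into code:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y) = -sum_i p_i * log2(p_i) over the class proportions p_i."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(xs, ys):
    """IG(Y|X) = H(Y) - H(Y|X); H(Y|X) weights H(Y | X = x) by P(X = x)."""
    n = len(ys)
    cond = 0.0
    for x in set(xs):
        subset = [y for xi, y in zip(xs, ys) if xi == x]
        cond += (len(subset) / n) * entropy(subset)
    return entropy(ys) - cond

# A 50/50 split of two classes has maximal uncertainty: entropy of 1 bit.
print(entropy(["Yes", "Yes", "No", "No"]))
# An attribute that separates the classes perfectly gains the full bit.
print(info_gain([1, 1, 2, 2], ["a", "a", "b", "b"]))
```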
Exercise
– What is the information gain of X relative to these training examples, given that Y is the label?
X Y
Math Yes
History No
CS Yes
Math No
Math No
CS Yes
History No
Math Yes
Decision trees
An Example
– Training data: interviewee data
– Four features:
• Level, Lang, Tweets, PhD
– Class label:
• Interviewed well
– I have new applicant A15
(Level, Lang, Tweets, PhD)
– Want to predict whether
Interviewed well is True or False
– Hard to guess for A15!
Class label
Applicant Level Lang Tweets PhD Interviewed well
A1 Senior Java No No False
A2 Senior Java No Yes False
A3 Mid Java No No True
A4 Junior Python No No True
A5 Junior R Yes No True
A6 Junior R Yes Yes False
A7 Mid R Yes Yes True
A8 Senior Python No No False
A9 Senior R Yes No True
A10 Junior Python Yes No True
A11 Senior Python Yes Yes True
A12 Mid Python No Yes True
A13 Mid Java Yes No True
A14 Junior Python No Yes False
A15 Senior R No No ?
Training examples: 9 True/ 5 False
Predict whether A15 belongs to True or False
– Divide-and-conquer
– Choose attributes to split the data into subsets
– Are they pure? (all True or all False)
– If yes: stop
– If no: repeat
– Which attribute to choose?
– Let’s try selecting the “Level” attribute first
Decision Tree
Level Lang Tweets PhD Interviewed well
Senior Java No No False
Senior Java No Yes False
Senior Python No No False
Senior R Yes No True
Senior Python Yes Yes True
Mid Java No No True
Mid R Yes Yes True
Mid Python No Yes True
Mid Java Yes No True
Junior Python No No True
Junior R Yes No True
Junior Python Yes No True
Junior R Yes Yes False
Junior Python No Yes False
[Decision tree figure: splitting the 9 True / 5 False training examples on Level]
– Senior: 3 False / 2 True, not pure; split further on Tweets? (No → 3 False, Yes → 2 True)
– Mid: 4 True, pure
– Junior: 3 True / 2 False, not pure; split further on PhD? (No → 3 True, Yes → 2 False)
Naïve Bayes Classifier
An Example
Classes:
C1: Interviewed well = ‘False’
C2: Interviewed well = ‘True’
Data to be classified:
X = (Level = Senior, Lang = Python, Tweets = Yes, PhD = No)
Level Lang Tweets PhD Interviewed well
Senior Java No No False
Senior Java No Yes False
Mid Java No No True
Junior Python No No True
Junior R Yes No True
Junior R Yes Yes False
Mid R Yes Yes True
Senior Python No No False
Senior R Yes No True
Junior Python Yes No True
Senior Python Yes Yes True
Mid Python No Yes True
Mid Java Yes No True
Junior Python No Yes False
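The classification of X can be sketched with maximum-likelihood estimates read straight off the table (no Laplace smoothing, matching the lecture example):

```python
# Training data from the slide (Level, Lang, Tweets, PhD, Interviewed well).
data = [
    ("Senior", "Java",   "No",  "No",  False), ("Senior", "Java",   "No",  "Yes", False),
    ("Mid",    "Java",   "No",  "No",  True),  ("Junior", "Python", "No",  "No",  True),
    ("Junior", "R",      "Yes", "No",  True),  ("Junior", "R",      "Yes", "Yes", False),
    ("Mid",    "R",      "Yes", "Yes", True),  ("Senior", "Python", "No",  "No",  False),
    ("Senior", "R",      "Yes", "No",  True),  ("Junior", "Python", "Yes", "No",  True),
    ("Senior", "Python", "Yes", "Yes", True),  ("Mid",    "Python", "No",  "Yes", True),
    ("Mid",    "Java",   "Yes", "No",  True),  ("Junior", "Python", "No",  "Yes", False),
]

def nb_score(x, cls):
    """Naive Bayes score P(C) * prod_i P(x_i | C), with plain
    maximum-likelihood estimates (no smoothing)."""
    rows = [r for r in data if r[-1] == cls]
    prior = len(rows) / len(data)
    likelihood = 1.0
    for i, v in enumerate(x):
        likelihood *= sum(1 for r in rows if r[i] == v) / len(rows)
    return prior * likelihood

x = ("Senior", "Python", "Yes", "No")
scores = {c: nb_score(x, c) for c in (True, False)}
print(scores, "->", max(scores, key=scores.get))
```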
Recommended studying
– Review and understand slides and exercises
– Single and multiple linear regression
– Decision trees
– Naïve Bayes classifier
– Information gain
– Evaluation and setup
Final Activities
– No exercises for week 13
– Housekeeping (before the exam):
– Review your marks and assignment results on Canvas
– Feedback:
– Submit the USS survey (https://student-surveys.sydney.edu.au/students/)
– Revision:
– Review lecture material and cross-check with topics highlighted in Week 13
– Review/attempt all exercises and check the example solutions
– If needed: arrange consultation sessions
– Alternatively: use Ed to ask questions
– All the best in your exam!