Faculty of Information Technology
FIT3152 Data analytics – 2023
Quiz and Practical Activity – Sample Questions
Your task • You will be given a set of multiple choice and longer questions to answer.
The questions will cover topics taught during Weeks 1 – 9.
Value and Structure
• This assignment is worth 25% of your total marks for the unit.
• It has 30 marks in total, comprising:
• 6 multiple choice questions of 1 Mark each,
• 3 free responses of 2 Marks each, and
• 3 grouped free responses of 6 Marks each.
Time • You will have 1 Hour during tutorial time to complete the test.
Due Date Your scheduled tutorial during Week 11
Submission • Via Moodle Quiz
Generative AI Use
• In this assessment, you must not use generative artificial intelligence (AI) to
generate any materials or content in relation to the assessment task.
Late Penalties
• This activity can only be deferred/re-scheduled on medical or other serious
grounds with relevant documentation.
Instructions • Answer the questions on the Moodle Quiz.
• The activity is closed book; lecture and tutorial notes and online references are not permitted.
• You may use any calculator (physical or digital).
• You must keep your camera on if you are in an online tutorial.
NOTE You will be asked to stop this activity early and submit what you have done if:
• You are found to be using any software other than that permitted.
• You are found to be accessing web sites or online resources other than the
Moodle Quiz.
• You are found to be communicating with any other student.
• You are found to be cheating in any way.
Multiple Choice (1 Mark)
The following points (P1 – P6) are to be clustered using hierarchical clustering with MIN (single linkage) applied to the distance matrix below. Which pair of points is merged first?
A. P2, P4
B. P3, P4
C. P1, P6
D. P1, P4
E. P4, P5
      P1   P2   P3   P4   P5   P6
P1   0.0  0.4  2.5  1.5  1.4  0.2
P2        0.0  0.4  3.9  1.7  0.6
P3             0.0  2.8  0.8  1.9
P4                  0.0  0.1  2.0
P5                       0.0  1.3
P6                            0.0
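Under MIN (single linkage), the first merge in any hierarchical clustering is simply the pair with the smallest off-diagonal distance. A minimal Python sketch of that check (the distances are copied from the matrix above; the quiz itself uses no code for this question):

```python
# Upper-triangular distances copied from the matrix above.
dist = {
    ("P1", "P2"): 0.4, ("P1", "P3"): 2.5, ("P1", "P4"): 1.5,
    ("P1", "P5"): 1.4, ("P1", "P6"): 0.2,
    ("P2", "P3"): 0.4, ("P2", "P4"): 3.9, ("P2", "P5"): 1.7, ("P2", "P6"): 0.6,
    ("P3", "P4"): 2.8, ("P3", "P5"): 0.8, ("P3", "P6"): 1.9,
    ("P4", "P5"): 0.1, ("P4", "P6"): 2.0,
    ("P5", "P6"): 1.3,
}

# The first merge is the globally closest pair.
first_merge = min(dist, key=dist.get)
print(first_merge)  # ('P4', 'P5'), at distance 0.1
```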
Multiple Choice (1 Mark)
The table below shows the predictions of a classification model for 10 customers: whether each customer actually bought a new product (did buy = 1, did not buy = 0), and the model's confidence that they would buy.
Customer Confidence-buy Did-buy
C01 0.8823 0
C02 0.5547 0
C03 0.6469 1
C04 0.1252 0
C05 0.7050 0
C06 0.7065 1
C07 0.1441 0
C08 0.7398 1
C09 0.7865 1
C10 0.4874 0
What is the lift value if you target the top 50% of customers that the classifier is most confident of?
A. 0.2
B. 0.5
C. 1.5
D. 2.0
E. 2.5
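Lift compares the response rate among the targeted customers with the overall response rate. A short sketch of the calculation for this question (the pairs are copied from the table above):

```python
# (confidence, did_buy) pairs copied from the table above.
customers = [
    (0.8823, 0), (0.5547, 0), (0.6469, 1), (0.1252, 0), (0.7050, 0),
    (0.7065, 1), (0.1441, 0), (0.7398, 1), (0.7865, 1), (0.4874, 0),
]

# Overall response rate: 4 buyers out of 10.
overall_rate = sum(buy for _, buy in customers) / len(customers)

# Target the top 50%: the 5 customers the classifier is most confident about.
top = sorted(customers, key=lambda c: c[0], reverse=True)[:5]
target_rate = sum(buy for _, buy in top) / len(top)  # 3 buyers out of 5

lift = target_rate / overall_rate
print(round(lift, 2))  # 1.5
```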
Multiple Choice (1 Mark)
The ROC chart for a classification problem is given below.
Give an estimate of classifier performance (AUC).
A. 0.1
B. 0.2
C. 0.5
D. 0.6
E. 0.8
Multiple Choice (1 Mark)
15 observations were sampled at random from the Iris data set. The dendrogram resulting from clustering them, based on their sepal and petal measurements, is below.
What is the smallest number of clusters that would put all observations of species Setosa (observation labels 1:50) in a cluster of their own?
A. 1
B. 2
C. 3
D. 5
E. 15
Multiple Choice (1 Mark)
Predict the output from the following commands:
> X <- c(1, 2)
> Y <- c(3, 4)
> X + Y
A. 4, 6
B. 3, 7
C. 10
D. 1234
E. 1, 2, 3, 4
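R adds vectors element-wise. A plain-Python illustration of the same behaviour (not part of the quiz itself):

```python
X = [1, 2]
Y = [3, 4]

# Element-wise addition, mirroring R's vectorised X + Y.
result = [x + y for x, y in zip(X, Y)]
print(result)  # [4, 6]
```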
Multiple Choice (1 Mark)
An artificial neural network (ANN) is to be used to classify whether or not to Buy a certain product
based on Popularity, Sales and Performance. An extract of the data is below.
How many input nodes does the ANN require for this problem?
A. 1
B. 2
C. 3
D. 4
E. 5
ID Popularity Sales Performance Buy
1 low 330000 0.87 Maybe
2 medium 40000 0.22 No
3 low 50000 NA Yes
4 high 30000 0 Yes
5 low 100000 0.1 No
6 medium NA 0.06 No
... ... ... ... ...
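One common convention (assumed here) is one input node per numeric attribute plus one node per category of each one-hot-encoded categorical attribute; the class label Buy is the output, not an input. A sketch of that count:

```python
# Attribute levels observed in the data extract above.
popularity_levels = {"low", "medium", "high"}  # categorical -> one node per level
numeric_attributes = ["Sales", "Performance"]  # numeric -> one node each

# Buy is the target, so it contributes no input nodes.
input_nodes = len(popularity_levels) + len(numeric_attributes)
print(input_nodes)  # 5 with full one-hot encoding (4 if one level is dropped)
```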
Free Response (2 Marks)
The table below shows the predictions of a classification model for 10 customers: whether each customer actually bought a new product (did buy = 1, did not buy = 0), and the model's confidence that they would buy.
Customer Confidence-buy Did-buy
C01 0.8823 0
C02 0.5547 0
C03 0.6469 1
C04 0.1252 0
C05 0.7050 0
C06 0.7065 1
C07 0.1441 0
C08 0.7398 1
C09 0.7865 1
C10 0.4874 0
If a confidence level of 50% or greater is required for a positive classification, what is the Accuracy
of the model?
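Accuracy at a 0.5 cut-off can be checked mechanically: predict "buy" whenever confidence ≥ 0.5, then count agreements with the actual outcome. A sketch (pairs copied from the table above):

```python
# (confidence, did_buy) pairs copied from the table above.
data = [
    (0.8823, 0), (0.5547, 0), (0.6469, 1), (0.1252, 0), (0.7050, 0),
    (0.7065, 1), (0.1441, 0), (0.7398, 1), (0.7865, 1), (0.4874, 0),
]

# A prediction is correct when (confidence >= 0.5) matches the actual outcome.
correct = sum((conf >= 0.5) == bool(actual) for conf, actual in data)
accuracy = correct / len(data)
print(accuracy)  # 0.7  (4 true positives + 3 true negatives out of 10)
```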
Free Response (2 Marks)
A k-Means clustering algorithm is fitted to the iris data, as shown below.
rm(list = ls())
data("iris")
ikfit = kmeans(iris[,1:2], 4, nstart = 10)
ikfit
table(actual = iris$Species, fitted = ikfit$cluster)
Based on the R code and output below, answer the following questions.
> ikfit
K-means clustering with 4 clusters of sizes 24, 53, 41, 32
Cluster means:
Sepal.Length Sepal.Width
1 4.766667 2.891667
2 5.924528 2.750943
3 6.880488 3.097561
4 5.187500 3.637500
Within cluster sum of squares by cluster:
[1] 4.451667 8.250566 10.634146 4.630000
(between_SS / total_SS = 78.6 %)
> table(actual = iris$Species, fitted = ikfit$cluster)
fitted
actual 1 2 3 4
setosa 18 0 0 32
versicolor 5 34 11 0
virginica 1 19 30 0
If this clustering were used to discriminate between the iris species, what would be the accuracy of the model? Explain your reasoning.
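One common approach (assumed here) is to label each cluster with its majority species and count the observations that land in a cluster labelled with their own species. A sketch using the counts from the confusion table above:

```python
# actual-species x fitted-cluster counts copied from the table above.
table = {
    "setosa":     [18,  0,  0, 32],
    "versicolor": [ 5, 34, 11,  0],
    "virginica":  [ 1, 19, 30,  0],
}

# For each cluster, the majority species' count is the number classified
# correctly under a majority-vote labelling of clusters.
n_clusters = 4
correct = sum(max(counts[c] for counts in table.values())
              for c in range(n_clusters))
total = sum(sum(counts) for counts in table.values())
print(correct, total, correct / total)  # 114 150 0.76
```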
Free Response (2 Marks)
Use the data below and Naïve Bayes classification to predict whether the following test instance
will be happy or not.
Test instance: (Age Range = young, Occupation = professor, Gender = F, Happy = ? )
ID Age Range Occupation Gender Happy
1 Young Tutor F Yes
2 Middle-aged Professor F No
3 Old Tutor M Yes
4 Middle-aged Professor M Yes
5 Old Tutor F Yes
6 Young Lecturer M No
7 Middle-aged Lecturer F No
8 Old Tutor F No
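The Naïve Bayes score for each class is the class prior multiplied by the per-attribute conditional probabilities. A sketch of the calculation (rows copied from the table above, with values lower-cased for consistency):

```python
# Training rows (Age Range, Occupation, Gender, Happy), copied from the table.
rows = [
    ("young", "tutor", "f", "yes"),        ("middle-aged", "professor", "f", "no"),
    ("old", "tutor", "m", "yes"),          ("middle-aged", "professor", "m", "yes"),
    ("old", "tutor", "f", "yes"),          ("young", "lecturer", "m", "no"),
    ("middle-aged", "lecturer", "f", "no"),("old", "tutor", "f", "no"),
]
test = ("young", "professor", "f")

def score(label):
    """Prior times the product of per-attribute conditionals for this class."""
    subset = [r for r in rows if r[3] == label]
    s = len(subset) / len(rows)  # class prior
    for i, value in enumerate(test):
        s *= sum(r[i] == value for r in subset) / len(subset)
    return s

print(score("yes"), score("no"))  # the larger score determines the prediction
```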
Free Response (6 Marks)
The DunHumby (DH) data frame records the Date a Customer shops at a store, the number of Days
since their last shopping visit, and amount Spent for 20 customers. The first 4 rows are shown below.
> head(DH)
customer_id visit_date visit_delta visit_spend
1 40 04-04-10 NA 44.8
2 40 06-04-10 2 69.7
3 40 19-04-10 13 44.6
4 40 01-05-10 12 30.4
The following R code is run (assume the ggplot2 package is loaded):
DHY = DH[as.Date(DH$visit_date,"%d-%m-%y") < as.Date("01-01-11","%d-%m-%y"),]
CustSpend = as.table(by(DHY$visit_spend, DHY$customer_id, sum))
CustSpend = sort(CustSpend, decreasing = TRUE)
CustSpend = head(CustSpend, 12)
CustSpend = as.data.frame(CustSpend)
colnames(CustSpend) = c("customer_id", "amtspent")
DHYZ = DHY[(DHY$customer_id %in% CustSpend$customer_id),]
write.csv(DHYZ, "DHYZ.csv", row.names = FALSE)
g = ggplot(data = DHYZ) + geom_histogram(mapping = aes(x = visit_spend)) +
facet_wrap(~ customer_id, nrow = 3)
Describe the data contained in the data frame “CustSpend.” [2 Marks]
Describe the data contained in the data frame “DHYZ.” [2 Marks]
Describe the contents of the graphic shown by plot “g.” [2 Marks]
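A pandas sketch of the same pipeline may help in reasoning about what each step produces. The frame below is hypothetical (illustrative values only, with the same columns as DH); the CSV write and the plot are omitted:

```python
import pandas as pd

# Hypothetical frame with the same columns as DH (illustrative values only).
DH = pd.DataFrame({
    "customer_id": [40, 40, 40, 41],
    "visit_date":  ["04-04-10", "06-04-10", "19-04-10", "02-04-10"],
    "visit_delta": [None, 2, 13, None],
    "visit_spend": [44.8, 69.7, 44.6, 30.4],
})

# Keep visits before 1 Jan 2011 (mirrors the as.Date filter in the R code).
DHY = DH[pd.to_datetime(DH["visit_date"], format="%d-%m-%y")
         < pd.Timestamp("2011-01-01")]

# Total spend per customer, sorted decreasing, top 12 spenders kept.
CustSpend = (DHY.groupby("customer_id")["visit_spend"].sum()
                .sort_values(ascending=False).head(12).reset_index())

# Visits belonging to those top spenders (the rows the histograms would show).
DHYZ = DHY[DHY["customer_id"].isin(CustSpend["customer_id"])]
print(CustSpend)
```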
Free Response (6 Marks)
A World Health study is examining how life expectancy varies between men and women in different
countries and at different times in history. The table below shows a sample of the data that has
been recorded. There are approximately 15,000 records in all.
Country Year of Birth Gender Age at Death
Australia 1818 M 9
Afghanistan 1944 F 40
USA 1846 F 12
India 1926 F 6
China 1860 F 32
India 1868 M 54
Australia 1900 F 37
China 1875 F 75
England 1807 M 15
France 1933 M 52
Egypt 1836 M 19
USA 1906 M 58
… … … …
Using one of the graphic types from the Visualization Zoo (see formulae and references for a list of
types), or another graph type of your choosing, suggest a suitable graphic to help the researcher
display as many variables as clearly as possible.
Explain your decision. Which graph elements correspond to the variables you want to display?
Free Response (6 Marks)
A researcher wants to predict the prevalence of crime in towns, using the following data.
Crm: Crime rate in the town
Ind: Proportion of the town zoned industrial
Pol: Air pollution in the town (ppm)
Rms: Number of main rooms in the house
Tax: Land tax paid ($)
Str: Student to teacher ratio in local schools
Zone: Socio-economic zone of house location
Val: Value of the house ($000)
> head(Cdata)
Crm Ind Pol Rms Tax Str Zone Val
1 0.00632 2.31 0.538 6 296 15.3 0 2400
2 0.02731 7.07 0.469 6 242 17.8 1 2160
3 0.02729 7.07 0.469 7 242 17.8 0 3470
4 0.03237 2.18 0.458 6 222 18.7 0 3340
Based on the R code and output below, answer the following questions.
> contrasts(Cdata$Zone) = contr.treatment(3)
> Crime = lm(Crm~.,data = Cdata); summary(Crime)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.162875 5.324457 -0.97 0.333
Ind -0.160716 0.078481 -2.05 0.041 *
Pol 4.791271 4.443372 1.08 0.281
Rms 0.051432 0.500037 0.10 0.918
Tax 0.025699 0.002902 8.86 <2e-16 ***
Str 0.041439 0.177346 0.23 0.815
Zone1 -1.843825 1.198360 -1.54 0.125
Zone2 3.244316 1.702931 1.91 0.057 .
Val -0.001216 0.000582 -2.09 0.037 *
---
> contrasts(Cdata$Zone)
  2 3
0 0 0
1 1 0
2 0 1
How does the proportion of the town zoned industrial affect crime rate? How reliable is the
evidence?
How does air pollution affect crime rates? How reliable is the evidence?
Why is Zone ‘0’ not defined in the regression output? How is it included in the model?
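Treatment contrasts encode a k-level factor as k-1 indicator columns, with the first level (here Zone '0') as the baseline absorbed into the intercept. A plain-Python illustration of that coding (not the R mechanism itself, just the same idea):

```python
# Zone has three levels: 0 (baseline), 1 and 2, as in the output above.
zones = [0, 1, 2, 1, 0]

# Treatment coding: one indicator column per non-baseline level.
coded = [(int(z == 1), int(z == 2)) for z in zones]
print(coded)  # [(0, 0), (1, 0), (0, 1), (1, 0), (0, 0)]
```

Baseline rows are all-zero, so Zone '0' needs no coefficient of its own: its effect is carried by the intercept, and Zone1/Zone2 are estimated relative to it.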
Free Response (6 Marks) – Extra Example!
The table below shows the survey results from 12 people, who were asked whether they would
accept a job offer based on the attributes: Salary, Distance, and Social. We want to build a
decision tree to assist with future decisions of whether a person would accept a Job or not.
ID Salary Distance Social Job
1 Medium Far Poor No
2 High Far Good Yes
3 Low Near Poor No
4 Medium Moderate Good Yes
5 High Far Poor Yes
6 Medium Far Good Yes
7 Medium Moderate Poor No
8 Medium Near Good Yes
9 High Moderate Poor Yes
10 Medium Near Poor Yes
11 Medium Moderate Poor Yes
12 Low Moderate Good No
What is the entropy of Job?
Without calculating information gain, which attribute would you choose to be the root of the
decision tree? Explain why.
What is the information gain of the attribute you chose for the previous question?
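The entropy and gain values can be checked with a short script; all counts below are taken from the table above (Job: 8 Yes / 4 No; Salary splits the data into High = 3 Yes/0 No, Medium = 5 Yes/2 No, Low = 0 Yes/2 No):

```python
from math import log2

def entropy(pos, neg):
    """Binary entropy of a group containing pos/neg examples."""
    total = pos + neg
    h = 0.0
    for n in (pos, neg):
        if n:  # 0 * log2(0) is taken as 0
            p = n / total
            h -= p * log2(p)
    return h

# Entropy of Job: 8 Yes, 4 No.
h_job = entropy(8, 4)

# Information gain of Salary: weighted entropy of its groups.
groups = [(3, 0), (5, 2), (0, 2)]  # High, Medium, Low as (Yes, No) counts
gain = h_job - sum((p + n) / 12 * entropy(p, n) for p, n in groups)
print(round(h_job, 3), round(gain, 3))  # 0.918 0.415
```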
Formulas and references
The Visualization Zoo – Graphic Types
Time-Series Data
• Index Charts
• Stacked Graphs
• Small Multiples
• Horizon Graphs
Statistical Distributions
• Stem-and-Leaf Plots
• Q-Q Plots
• SPLOM
• Parallel Coordinates
Maps
• Flow Maps
• Choropleth Maps
• Graduated Symbol Maps
• Cartograms
Hierarchies
• Node-Link diagrams
• Adjacency Diagrams
• Enclosure Diagrams
Networks
• Force-Directed Layouts
• Arc Diagrams
• Matrix Views
Entropy
If S is an arbitrary collection of examples with a binary class attribute, then:

Entropy(S) = -p1 log2(p1) - p2 log2(p2)
           = -(n1/n) log2(n1/n) - (n2/n) log2(n2/n)

where 1 and 2 are the two classes, p1 and p2 are the probabilities of being in Class 1 or Class 2 respectively, n1 and n2 are the number of examples in each class, and n is the total number of examples.

Note: log2(x) = log10(x) / log10(2) = log10(x) / 0.301
Information gain
The Gain(S, A) of an attribute A relative to a collection of examples S, where the values of A split S into groups S_v each having |S_v| elements, is:

Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
Accuracy = (TP + TN) / (TP + TN + FP + FN)

ROC: TPR = TP / (TP + FN) is plotted against FPR = FP / (FP + TN)
Naïve Bayes
For attributes A1, A2, …, An and class C, the classification probability is:

P(C | A1 ∩ A2 ∩ … ∩ An) = P(C) · P(A1 ∩ A2 ∩ … ∩ An | C) / P(A1 ∩ A2 ∩ … ∩ An)

For Bayesian classification, a new point is classified to class Ci if P(Ci) · P(A1 | Ci) · P(A2 | Ci) · … · P(An | Ci) is maximised.

Naïve Bayes assumes P(A ∩ B) = P(A) · P(B), etc. (independence of the attributes).