Python Assignment Help: DATA1002 | 学霸联盟

Date: 2021-11-18

DATA1002 Extra Exam preparation.
Adapted from questions from previous years’ exams
These questions are provided for students to do for practice, and to see some of the
variety of types of question, and the range of essential content for the unit. We have
indicated, for each question, the time a student should expect to have available for
reading and typing their answer, if they are working at the same speed as will be needed
in the final exam. In addition, we have provided online in Canvas: a multichoice quiz, and
four other questions which students are expected to attempt during their lab and tutorial
session in week 13.
Q1. [20 minutes; 2 minutes per sub-part]
1(a) Give an example of an original source of data (as distinguished from a secondary
source).
1(b) Explain what is meant by metadata, and give one example of metadata that could
be appropriate for a dataset of weather observations.
1(c) What is ASCII used for?
1(d) Give an example of how overflow can occur in an arithmetic operation.
1(e) Describe a data quality issue that can arise from use of default values in a dataset.
1(f) Describe one way to deal with the situation where some value is missing in a
dataset.
1(g) Who is responsible for making sure backups are taken and usable, for data held in
files on organizational systems (e.g. department-owned machines)?
1(h) Explain the difference between an access control policy, and an access control
mechanism.
1(i) Give an example of a situation where it is appropriate to have content-dependent
access control for some data.
1(j) Describe how a version control system can be used to help make a data science
analysis reproducible.

Q2 [10 minutes]
Consider the 8-bit data value 11111010.
2(a) [2 minutes] Write this value in hexadecimal notation.
2(b) [2 minutes] How many bytes are needed to store this value?
2(c) [3 minutes] What number has this bit pattern as its representation, in the unsigned
binary representation?
2(d) [3 minutes] What number has this bit pattern as its representation, in the (two’s
complement) signed binary representation?
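
One way to check answers to Q2 is to let Python do the conversions. The sketch below is not part of the original paper; the variable names are illustrative, and the signed value uses the usual two's complement rule of subtracting 2**8 when the leading bit is 1.

```python
bits = "11111010"                      # the 8-bit pattern from Q2

unsigned = int(bits, 2)                # read the pattern as unsigned binary
hex_form = format(unsigned, "02X")     # two hexadecimal digits, upper case

# Two's complement: when the leading bit is 1, the value is unsigned - 2**8.
if bits[0] == "1":
    signed = unsigned - (1 << len(bits))
else:
    signed = unsigned

print(hex_form, unsigned, signed)
```

Eight bits fit in exactly one byte, which is why a single pair of hexadecimal digits is enough to write the value.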

Q3 [20 minutes; 2 minutes per sub-part]
3(a) List the main types of activity that often occur within a data science lifecycle, where
the goal is understanding some situation.
3(b) Give an example of an ethical issue that can arise in doing a data science project.
3(c) Explain how the contents of a spreadsheet cell might be different from the value of
that cell.
3(d) Suppose that a spreadsheet cell G5 contains the formula =SUM($C1:E20) and
this formula is then copy/pasted into cell H6. What does H6 now contain?
3(e) Consider the fragment of a spreadsheet shown below. What is the result of the
formula =MATCH(A1,A2:G2,0)? Answer:
A B C D E F G H I
1 10
2 6 15 -2 10 3 8 7 0

3(f) Which operations can be used to compare values from a nominal (also called
categorical or discrete) attribute?
3(g) Which of the following tasks are considered as unsupervised learning? Classification,
clustering, regression.
3(h) Which of the following tasks is k-means used for? Classification, clustering,
regression.
3(i) Explain the use of a “test set” in machine learning.
3(j) Explain the difference between a conventional recommender, and a reciprocal
recommender.
Q4 [10 minutes; 2 minutes per sub-part]
4(a) What symbol occurs at the end of each control-flow construct of Python (such as if,
else, for)?
4(b) In Python, how is the body of a control structure shown?
4(c) Explain what is meant by saying that a Python list is mutable.
4(d) Describe the structure of the data in a Pandas dataframe.
4(e) What keyword is used to start the code of a Python function?
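
The syntax features asked about in Q4 can all be seen in one short fragment. This is an illustrative sketch only; the function name describe and the sample values are made up for the example.

```python
def describe(numbers):              # "def" starts a function; the header ends with ":"
    labels = []                     # the indented lines form the body
    for n in numbers:               # a "for" header also ends with ":"
        if n % 2 == 0:              # so do "if" and "else"
            labels.append("even")
        else:
            labels.append("odd")
    return labels


values = [1, 2, 3]
values.append(4)                    # a list is mutable: it is changed in place
print(describe(values))
```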

Q5 [15 minutes]
You have been given a text file products.csv containing lines of comma-separated
data about some household products. The first few lines look like this (note that the first
line is a header, and also note that the fields do not themselves contain any commas):
prodID,makerName,energyScore,category
37089,Artem,3,dishwasher
47115,Goldrod,2,fridge
51092,4Star,1,dryer
53490,FASTAr,3,fridge
5(a) [5 minutes] We would like to find the prodID and energyScore, for each product in
the category “dishwasher”. Write a Python program that accesses products.csv and
prints out the desired information. You do not need to deal with misformatted files or
other errors. You are allowed to use a library like Pandas, but this is not required.
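
One possible shape for an answer to 5(a), using only the standard library, is sketched below. So that the sketch runs on its own, it reads the sample rows from the question out of an in-memory string; with the real file one would open products.csv instead. The function name dishwasher_scores is an assumption made for this example.

```python
import csv
import io

# Sample rows copied from the question. With the real file, use
#     with open("products.csv", newline="") as f:
# in place of this in-memory stand-in.
sample = io.StringIO(
    "prodID,makerName,energyScore,category\n"
    "37089,Artem,3,dishwasher\n"
    "47115,Goldrod,2,fridge\n"
    "51092,4Star,1,dryer\n"
    "53490,FASTAr,3,fridge\n"
)


def dishwasher_scores(f):
    """Return (prodID, energyScore) pairs for rows in category 'dishwasher'."""
    reader = csv.DictReader(f)      # takes field names from the header line
    return [
        (row["prodID"], row["energyScore"])
        for row in reader
        if row["category"] == "dishwasher"
    ]


for prod_id, score in dishwasher_scores(sample):
    print(prod_id, score)
```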

5(b) [10 minutes] We would like to find, for each combination of a maker whose name
ends with “ar”, and a category, how many products there are (that have the given
category and are produced by the particular maker) and which have an energyScore of 3 or
greater. You should ignore case in maker names; so “FASTAr” and “Fastar” and “fastar”
are all considered the same maker. Write a Python program that accesses
products.csv and prints out the desired information. You do not need to deal with
misformatted files or other errors. You are allowed to use a library like Pandas, but this is
not required.
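
A sketch of one approach to 5(b) follows: lower-case each maker name, keep the rows that pass both tests, and count them per (maker, category) pair. The last three sample rows are hypothetical additions (not from the question), included only to show the case-insensitive grouping; the function name is also an assumption.

```python
import csv
import io
from collections import Counter

# First four data rows are from the question; the last three are hypothetical.
# With the real file, open products.csv instead of using this string.
sample = io.StringIO(
    "prodID,makerName,energyScore,category\n"
    "37089,Artem,3,dishwasher\n"
    "47115,Goldrod,2,fridge\n"
    "51092,4Star,1,dryer\n"
    "53490,FASTAr,3,fridge\n"
    "60001,Fastar,4,fridge\n"
    "60002,fastar,2,fridge\n"
    "60003,4STAR,5,dryer\n"
)


def count_by_maker_and_category(f):
    """Count products with energyScore >= 3 per (maker, category) pair."""
    counts = Counter()
    for row in csv.DictReader(f):
        maker = row["makerName"].lower()            # ignore case in maker names
        if maker.endswith("ar") and int(row["energyScore"]) >= 3:
            counts[(maker, row["category"])] += 1
    return counts


for (maker, category), n in sorted(count_by_maker_and_category(sample).items()):
    print(maker, category, n)
```

Note that energyScore is read as text by the csv module, so it is converted with int() before the comparison.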

Q6 [15 minutes]
Here is a data set with information about household products.
Product Code  Product Type  Capacity (cu cm)  Efficiency (%)  Quality
2113          Widget        13.7              70              Low
2120          Widget        13.2              80              Medium
2134          Widget        12.5              87              High
3352          Blodge        11.6              75              Low
3364          Blodge        14.2              80              Medium
3371          Blodge        15                84              High
4447          Wodget        23.1              98              High
4451          Wodget        21.7              93              Medium
4465          Wodget        19.2              89              Low

Fred Foolish has produced the following chart, displaying some (but not all) of the data
from this table as a stacked barchart. The goal of the visualisation is to provide insight
into the relationship between efficiency and quality, in particular, how this relationship
might vary between product types. Fred has chosen to encode the Efficiency (which is a
quantitative or numeric attribute) as the height (length) of each bar, the
nominal/categorical Product Type is encoded on the x-axis, and then the ordinal Quality
attribute is encoded by the pattern.

6(a) [4 minutes] Identify some aspects of Fred’s visualisation that do not make it easy for
a reader to gain insight into the relationship between efficiency and quality, in particular,
how this relationship might vary between product types.
6(b) [11 minutes] Propose a different visualisation of some of the data from the table
above, that will provide better insight into the relationship between efficiency and
quality, in particular, how this relationship might vary between product types. You
should structure your answer in three subparts: (i) describe which attributes will be
shown and how each of these attributes will be encoded, (ii) sketch how the visualisation
will look (you do not need to place the marks in exact positions for the given data), and
(iii) explain why your proposal conveys the important information more clearly than
using Fred’s encoding.
[Figure: Fred's stacked bar chart. x-axis: Product Type (Blodge, Widget, Wodget); y-axis: 0 to 300; bar patterns: High, Medium, Low]

Q7 [10 minutes]
In Stage 2 of the group project, you produced a chart that shows the relationship among
several attributes of the dataset. Write a description aimed at a student entering
university, about some knowledge that you learned in DATA1002 that was useful in
creating an effective chart. You should provide concrete details of how you used the
knowledge from the unit, in the steps which created the chart, as well as how the
knowledge could be useful in other situations the student might meet during their studies.
Q8. [30 minutes; 2 minutes per sub-part]
8(a) Explain the meaning of “provenance of a dataset”.
8(b) Write a formula that can be placed in a cell A5 of an Excel spreadsheet, so the value
of A5 is equal to the value in A3 when both the following hold: B3 has a positive value,
and C3 is greater than D3; the value of A5 should be 0 in other cases.
8(c) How many bytes are used to store a single character in ASCII encoding?
8(d) Suppose a data scientist needs to communicate results to business managers;
describe a goal that this target audience is interested in, to which the results should be
connected.
8(e) Explain what is meant when we say a logical structure for a dataset is
“denormalised”.
8(f) Sometimes, data scientists remove from their dataset any item where some value is
missing. Describe one way in which doing so can lead to poor results.
8(g) When is access control described as fine-grained?
8(h) Give a formula (involving adding or subtracting various powers of 2) for the numeric
value whose representation in 2-byte two’s complement signed binary is 0xF24A.
8(i) Define the “lie factor” (a term used by Tufte for charts).
8(j) Give an algorithm that can be used for regression, that is not linear regression.
8(k) What type of attribute is predicted in a classification task?
8(l) What is an algorithm used for a clustering task?
8(m) Explain the meaning of collaborative filtering.
8(n) When is a merge performed in a version control system?
8(o) What is meant when we say that a classifier has the “equalized odds” fairness
property.
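
For 8(h), a formula in powers of 2 can be checked mechanically. The sketch below (illustrative names, not from the paper) lists the positions of the 1 bits in 0xF24A and sums their weights, giving the top bit of the 16-bit pattern the negative weight it has in two's complement.

```python
pattern = 0xF24A
width = 16                       # 2 bytes

# Positions of the bits that are set, highest first.
set_bits = [i for i in range(width - 1, -1, -1) if (pattern >> i) & 1]

# In two's complement the top bit has weight -2**(width - 1);
# every other bit has its usual positive weight.
value = sum(-(2 ** i) if i == width - 1 else 2 ** i for i in set_bits)

# Cross-check using the built-in signed conversion.
check = int.from_bytes(pattern.to_bytes(2, "big"), "big", signed=True)

print(set_bits, value, check)
```

The list of set bit positions corresponds directly to the terms of the formula: one negative power for position 15 and one positive power for each remaining position.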

Q9 [15 minutes]
You have been given a text file products.csv containing lines of comma-separated
data about some household products. The first few lines look like this (note that the first
line is a header, and also note that the fields do not themselves contain any commas):
prodID,makerName,energyScore,category
37089,Artem,3,dishwasher
47115,Goldrod,2,fridge
51092,4Star,1,dryer
53490,FASTAr,3,fridge
9(a) [5 minutes] We would like to find the prodID and makerName, for each product in
the file whose energyScore is less than 4. Write a Python program that accesses
products.csv and prints out the desired information. You do not need to deal with
misformatted files or other errors. You are allowed to use a library like Pandas, but this is
not required.
9(b) [10 minutes] We would like to find, for each combination of a maker whose name
starts with “fa”, and an energyScore, how many products there are (that have the given
energyScore and are produced by the particular maker) and which have category
“dishwasher”. You should ignore case in maker names; so “FASTAr” and “Fastar” and
“fastar” are all considered the same maker. Write a Python program that accesses
products.csv and prints out the desired information. You do not need to deal with
misformatted files or other errors. You are allowed to use a library like Pandas, but this is
not required.
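
Since Pandas is allowed, 9(a) and 9(b) could also be approached with a filter and a group-by, as in the sketch below. The last three sample rows are hypothetical (not from the question) and are there only so that 9(b) has something to count; the helper column name maker is an assumption. With the real file, pd.read_csv("products.csv") would replace the in-memory string.

```python
import io

import pandas as pd

# First four data rows are from the question; the last three are hypothetical.
sample = io.StringIO(
    "prodID,makerName,energyScore,category\n"
    "37089,Artem,3,dishwasher\n"
    "47115,Goldrod,2,fridge\n"
    "51092,4Star,1,dryer\n"
    "53490,FASTAr,3,fridge\n"
    "60001,Fastar,4,dishwasher\n"
    "60002,fastar,4,dishwasher\n"
    "60003,FASTAr,2,dishwasher\n"
)
df = pd.read_csv(sample)

# 9(a): prodID and makerName for products with energyScore below 4.
low = df.loc[df["energyScore"] < 4, ["prodID", "makerName"]]
print(low.to_string(index=False))

# 9(b): ignore case by working with a lower-cased copy of the maker name.
df["maker"] = df["makerName"].str.lower()
selected = df[df["maker"].str.startswith("fa") & (df["category"] == "dishwasher")]
counts = selected.groupby(["maker", "energyScore"]).size()
print(counts)
```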

Q10 [15 minutes]
Here is a data set with information about universities in the state of Victoria.
University  Category  Fulltime Employment  Educational experience
Deakin      Regional  77.9                 82.2
LaTrobe     City      66.8                 76.1
Monash      Go8       77.3                 77
RMIT        City      73.8                 77
Melbourne   Go8       77.5                 73.9
Victoria    City      66.7                 74.6

Fred Foolish has produced the following chart, displaying the data from this table as a
linechart with two data series. The goal of the visualisation is to provide insight into the
relationship between fulltime employment and educational experience, in particular,
how this relationship might vary between university categories. Fred has chosen to
encode Fulltime Employment (which is a quantitative or numeric attribute) as the
position on the y-axis of the solid line, and Educational experience (another quantitative
or numeric attribute) is encoded as the position on the y-axis of the dashed line, the
nominal/categorical University is encoded on the x-axis, and the university Category
(another nominal/categorical attribute) is shown as an extra piece of text above the
University name on the x-axis.

[Figure: Fred's line chart. x-axis: University, with its Category shown above the name (Deakin/Regional, LaTrobe/City, Monash/Go8, RMIT/City, Melbourne/Go8, Victoria/City); y-axis: 0 to 90; series: Fulltime Employment, Educational experience]

10(a) [4 minutes] Identify some aspects of Fred’s visualisation that do not make it easy
for a reader to gain insight into the relationship between Fulltime Employment and
Educational experience, in particular, how this relationship might vary between
Categories.
10(b) [11 minutes] Propose a different visualisation of some of the data from the table
above, that will provide better insight into the relationship between Fulltime
Employment and Educational experience, in particular, how this relationship might vary
between Categories. You should structure your answer in three subparts in the spaces
provided below: (i) describe which attributes will be shown and how each of these
attributes will be encoded, (ii) sketch how the visualisation will look (you do not need to
place the marks in exact positions for the given data), and (iii) explain why your proposal
conveys the important information more clearly than using Fred’s encoding.

Q11 [10 minutes; 5 minutes per sub-part]
11(a) A data scientist needs to follow data management policies that are set by their
organization and/or their client. Give one example of a data management policy that
might apply to a data science project, and explain why this policy can be important. Also,
describe a mechanism that can be used to help the data scientist in following this policy.
11(b) One important piece of metadata about a dataset, is the description of the data
format in which the data is stored. Give one example of some information that could be
kept in two different formats and describe these formats. Also, explain one way in which
this metadata about the data format can be recorded.

Q12 [10 minutes]
In Stage 3 of the group project, you produced Python code to analyse a dataset, and to
produce a predictive model for some aspect of the dataset. Write a description aimed at
an employer you would like to work for, about the code you wrote, and how this
demonstrates skills that will be useful in other situations the employer might want you to
work on.

Q13 [20 minutes]
13(a) [8 minutes] Trace the execution of the following Python code (including diagrams
with the state of the notional machine after each line of code is executed), and also write
the output that will be printed when this is run.
def myminus(x, y):
    total = x - y
    print("In myminus, x =", x)
    print("In myminus, y =", y)
    print("In myminus, total =", total)
    return total

x=5
y=6
value = 17
total = 10
y = myminus(value + 1, total)
print("x =", x)
print("y =", y)
print("value =", value)
print("total =", total)

13(b) [6 minutes] Explain in English the purpose of the following Pandas code, and
explain how each of the operations is performed, to achieve this purpose. Also show
what is printed when this code is executed.
import pandas as pd
dict = {"Category":\
{"Deakin":"Regional","LaTrobe":"City","Monash":"Go8",\
"RMIT":"City", "Melbourne":"Go8","Victoria":"City"},\
"Fulltime Employment":\
{"Deakin":77.9,"LaTrobe":66.8,"Monash":77.3,"RMIT":73.8,\
"Melbourne":77.5,"Victoria":66.7},\
"Educational experience": \
{"Deakin":82.2,"LaTrobe":76.1,"Monash":77.0,"RMIT":77.0,\
"Melbourne":73.9,"Victoria":74.6}}
df = pd.DataFrame(dict)
df1 = df[df["Fulltime Employment"] > 70.0]
df2 = df1["Educational experience"]
df3 = df2.max()
print(df3)

13(c) [6 minutes] Write an explanation, for a potential student considering studying
DATA1002, of the concept of Excel’s pivot table, and why it is worth learning how to
create pivot tables.

Q14 [10 minutes]
It is a common part of a data science activity, to produce a predictive model for some
feature of the data. Describe what a predictive model is, and describe the steps one
does to produce and then deploy a predictive model. Provide examples taken from your
experiences in this unit of study (from labs, Grok tasks, Project Stage 3, etc) of each of
the steps.
Q15 [10 minutes]
Suppose that you are helping as a data scientist, in a project that is looking at data about
university quality (for example, the raw survey data that will eventually produce
summaries like those mentioned in Question 1 above). At the moment, the project is
keeping its data, and producing summaries and charts, using Excel spreadsheets. The
project leader (who is an expert in higher education) has heard people saying Python is a
better approach for doing data science tasks, compared to Excel. Explain to the project
leader in a short text three advantages and three disadvantages of changing to Python
from Excel in the work of the project.
Q15 [10 minutes]
When doing a data science activity, it is considered important to have metadata for the
data one works with. Explain what metadata is, give examples of metadata that you
found in relation to some dataset you looked at during this unit of study, and indicate
how this metadata could be useful in working with this data.
Q16 [10 minutes]
Consider a dataset in a text file unis.csv containing lines of comma-separated data
about some universities, and how their graduates report on outcomes from the
education. The first few lines look like this (note that the first line is a header, and also
note that the fields do not themselves contain any commas):
UniName,State,Employment(2018),Employment(2019)
CQU,QLD,79.1,79.6
Curtin,WA,72.4,71.4
Deakin,VIC,72.8,73.4
This data file can be described as having a “wide” logical data format, and a csv file
format. Explain the meaning of the terms “logical data format” and “file format”; also
describe a different logical data format which might be used instead, for the same
information (you should show how some of the values shown above, would appear in
this alternative format).
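
To make the wide versus long distinction in Q16 concrete, the sketch below reshapes the three sample rows into a "long" layout with one row per university and year. The column names Year and Employment in the long layout are assumptions chosen for this example, and the data is read from an in-memory string rather than from unis.csv.

```python
import io

import pandas as pd

# Sample rows copied from the question (wide format: one column per year).
wide_df = pd.read_csv(io.StringIO(
    "UniName,State,Employment(2018),Employment(2019)\n"
    "CQU,QLD,79.1,79.6\n"
    "Curtin,WA,72.4,71.4\n"
    "Deakin,VIC,72.8,73.4\n"
))

# melt() turns each year column into rows: one row per (university, year).
long_df = wide_df.melt(
    id_vars=["UniName", "State"],
    var_name="Measure",
    value_name="Employment",
)

# Pull the four-digit year out of headers such as "Employment(2018)".
long_df["Year"] = long_df["Measure"].str.extract(r"\((\d{4})\)", expand=False).astype(int)
long_df = (
    long_df.drop(columns="Measure")
    .sort_values(["UniName", "Year"])
    .reset_index(drop=True)
)
print(long_df)
```

In the long layout, adding a further year means adding rows rather than a new column, so the set of columns stays fixed.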
Q17 [10 minutes]
A list a1 = [3, 4, 6] and a tuple b1 = (3, 4, 6) are different, because a list is
mutable while a tuple is immutable. Explain what mutable means, and give code
examples to show how a1 and b1 behave differently in Python code, because of the
mutability of a1 and the immutability of b1.
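
A minimal sketch of the kind of code example Q17 asks for is given below; the name tuple_error is introduced only for this illustration.

```python
a1 = [3, 4, 6]
b1 = (3, 4, 6)

# A list can be changed in place after it is created.
a1[0] = 99
a1.append(7)
print(a1)

# The same item assignment on a tuple raises TypeError.
try:
    b1[0] = 99
except TypeError as err:
    tuple_error = str(err)
    print("tuple assignment failed:", tuple_error)

print(b1)   # the tuple is unchanged
```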

Q18 [10 minutes]
Below is some Pandas code, to manipulate a dataframe df where the rows are indexed
by integer row number, while the columns are indexed as “Id”, “Area”, “Height”,
“Weight”, “Colour”.
Give, in English, the instructions for a data analysis task for which this code would be a
valid way to do the calculations; also describe the index structure of result.

import pandas as pd
# code that imports the data and places it in the dataframe df
df['Vol'] = df['Area']*df['Height']
result = (df[df['Vol']>40])[['Id','Colour']]
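
To experiment with the code in Q18, it can be run on a small made-up dataframe. All values below are hypothetical; only the column names and the last two statements come from the question.

```python
import pandas as pd

# Hypothetical data with the column names given in the question.
df = pd.DataFrame({
    "Id": [101, 102, 103, 104],
    "Area": [10.0, 4.0, 8.0, 3.0],
    "Height": [5.0, 6.0, 5.5, 20.0],
    "Weight": [1.2, 0.8, 2.0, 1.5],
    "Colour": ["red", "blue", "green", "red"],
})

df['Vol'] = df['Area'] * df['Height']                 # new column, computed per row
result = (df[df['Vol'] > 40])[['Id', 'Colour']]       # filter rows, then pick columns

print(result)
print(list(result.index), list(result.columns))
```

The rows of result keep the integer row labels they had in df, so the labels may have gaps rather than being renumbered from 0.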
