R assignment help - FIT5145 Introduction to Data Science
Date: 2020-11-28
Instructions
You can answer each question by directly typing your answer in the corresponding space provided
to the question. Marks are indicated next to each question. This exam paper consists of 2 parts and
the total marks for the exam are 100.
Part 1 (42 marks in total)
Multiple Choice Questions: This section is worth 42 marks. Each question is worth 1 mark. Identify
the choice that best completes the statement or answers the question. There is only one best
answer to each question. Sometimes two answers may appear feasible, but you are to pick the
one you believe is the best.
Marking Scheme for Multiple Choice Questions:
• 1 mark for a correct answer
• 0 marks for the wrong answer, more than one answer, or no answer
QUESTION 1.1: Data management
Which of the following is not data management in practice:
A. locking the filing cabinet when you leave on Friday
B. interviewing a new patient to get their details
C. writing your name and date onto a backup CD you just burned
D. throwing an old file in the shredder
Answer:
QUESTION 1.2: Collecting data
A data management plan for collecting scientific data should include:
A. documenting the semantic meaning of all values being recorded
B. providing the units of measurement for all values being collected
C. consistent use of codes and special values such as “missing”
D. ALL of the other options
Answer:
QUESTION 1.3: Data management
A data management plan for an organisation often deals with issues relating to:
A. integration and data warehousing
B. replication and persistence
C. standardising the vocabulary used across the organisation
D. ALL of the other options
Answer:
QUESTION 1.4: Implicit data
Complete the following sentence: Implicit data is ...
A. data created during indexing and used for providing free-text search functionality.
B. highly inaccurate so should never be used.
C. data that is not explicitly stored but inferred with reasonable precision from available data.
D. highly accurate and allows us to avoid any ethical concerns associated with explicitly
collecting private data.
Answer:
QUESTION 1.5: Government agencies
General best practices for data management in government agencies do not need to include:
A. mandated software and preferred suppliers
B. sensible risk management practices
C. ethical leadership on their use of big data
D. clear and transparent privacy policies
Answer:
QUESTION 1.6: Privacy, confidentiality and security
Which of the following statements about privacy, confidentiality and security is TRUE?
A. The privacy of users online is never very important, only their confidentiality, since it has
legal implications.
B. Security is the act of protecting confidential information such that the privacy of users is
never violated.
C. Online activity of users is confidential, but never affects their offline lives, so privacy is
unaffected.
D. There is no difference between the three concepts of privacy, confidentiality and security.
They all refer to the same idea.
Answer:
QUESTION 1.7: Governance
Data governance does NOT involve dealing with which of the following:
A. taxidermy
B. privacy issues
C. legal compliance
D. archiving
Answer:
QUESTION 1.8: Privacy
Privacy and confidentiality:
A. are the same, confusingly
B. the second refers to what you can do regarding the privacy of others
C. the second is about information only
D. the second is the more precise legal term
Answer:
QUESTION 1.9: Scripting languages
Which of the following is not a scripting language?
A. R
B. RapidMiner
C. Matlab
D. Python
Answer:
QUESTION 1.10: Clinical trials
A clinical trial is primarily designed to:
A. apply the principle of intervention to test cause.
B. test correlation between treatments and outcomes.
C. stop scientists from cheating.
D. isolate the causes of outcomes.
Answer:
QUESTION 1.11: Significance testing
Complete this sentence. Significance testing can lead to inaccurate conclusions when:
A. scientists test for correlations between many different variables in an experiment, but only
report on the large values.
B. scientists use poor experimental methodology leading to inadequate repeatability.
C. scientists run many different experiments but only ever report results when their outcome
is positive.
D. ALL the other options are correct.
Answer:
QUESTION 1.12: Shell Command
How many lines will be output by the following shell command?
cat data.csv | awk -F',' 'rand()<1/10 {print $7}'
A. impossible to tell
B. exactly 10
C. exactly 7
D. approximately 10% of the lines in the original file
Answer:
QUESTION 1.13: Evaluating algorithms
Complete this sentence. When evaluating algorithms, training and test sets usually:
A. are carefully selected to stress algorithms.
B. are taken from the same set of data, but are non-overlapping.
C. are drawn from very different sources.
D. are the same data sets.
Answer:
QUESTION 1.14: Evaluating the results of learning
When evaluating and presenting the results of learning, it is not important to:
A. record the processing steps for background and reproducibility.
B. work with a domain expert to understand the proper costs and benefits of different outcomes and errors.
C. use the standard significance level of 0.01.
D. keep a separate data set for unbiased testing.
Answer:
QUESTION 1.15: Top data science tools
Why are many of the top data science tools open source?
A. Microsoft gifted the software.
B. Data scientists cannot train on costly commercial tools.
C. They were initially academic projects.
D. Most companies prefer to work with open source.
Answer:
QUESTION 1.16: Market segmentation
What is market segmentation?
A. To break up marketing data into parts, e.g. to more easily use MapReduce.
B. Another name for market basket analysis.
C. A form of clustering done to partition consumers into similar groups to allow bulk
targeting.
D. A technique that uses latent variables so that linear regression can be applied.
Answer:
QUESTION 1.17: Learning curve
In Machine Learning the term “learning curve” refers to:
A. the speed at which users become accustomed to new technology
B. a graph of the predictions of a polynomial regression model
C. a graph of performance of a predictive model versus the quantity of data used to train it
D. a graph of the difference between the loss and error functions for a predictive model
Answer:
QUESTION 1.18: Bias in a learning algorithm
Complete this sentence. A large bias in a learning algorithm means:
A. the algorithm works with simpler models so that it is not able to fit the data as well.
B. the algorithm works with more complex models which are able to be biased more away
from the data.
C. the algorithm encodes the programmers’ confirmation bias.
D. the algorithm is using regularisation to enforce a bias as a way of preventing overfitting.
Answer:
QUESTION 1.19: Data separation
Separating data so individual departments manage their own:
A. causes problems because of inconsistencies across departments
B. is cheaper
C. allows Hadoop-style processing to be done more easily
D. is the preferred solution to managing volume and variety in large organisations
Answer:
QUESTION 1.20: Spark
Spark was built into the Hadoop platform because:
A. it is implemented on top of the basic MapReduce mechanism of Hadoop
B. the same core programmer team did the initial development
C. to gain from the Hadoop brand-name
D. it is easier to build on top of the Hadoop infrastructure
Answer:
QUESTION 1.21: Database types
How does a graph database differ from a relational database?
A. they are the same
B. graph databases are better at storing and analysing data interaction patterns
C. graph databases are used for storing graphics
D. graph databases are used for money transfers
Answer:
QUESTION 1.22: Data processing
Which of these is NOT a common type of data processing approach?
A. streaming
B. interactive
C. bidirectional
D. batch
Answer:
QUESTION 1.23: Digital containers
A digital container format is designed specifically to give:
A. descriptive metadata by demarcating content from annotation
B. structural metadata by arranging embedded content
C. administrative metadata by incorporating metadata standards
D. markup language via text entries
Answer:
QUESTION 1.24: Volume
Volume in the big data definitions is:
A. relative, as it varies with the kind of data and current processing capacity, and the task
being performed
B. best measured in terabytes back in the year 2001
C. best measured in yottabytes
D. best measured in units relative to typical hard-drive capacity of the time
Answer:
QUESTION 1.25: Disks
Over the years, disk capacity has generally grown:
A. linearly
B. logarithmically
C. quadratically
D. exponentially
Answer:
QUESTION 1.26: DBMS
Which of the following statements about different types of databases is FALSE:
A. MongoDB stores data in JSON-like documents and is therefore not a NoSQL database
B. MySQL is an example of a Relational DBMS
C. HBase is modeled after Google’s “Bigtable”
D. Cassandra is an example of a “wide column store”
Answer:
QUESTION 1.27: MapReduce
Google no longer uses MapReduce because:
A. it is open source, and they did not develop it
B. they sold it
C. they realised it does not handle some more complex types of distributed processing well
D. NONE of the other options
Answer:
QUESTION 1.28: NoSQL
The growth of NoSQL databases occurred because:
A. NoSQL is a more powerful query language than SQL
B. NoSQL databases support standard web applications, while RDBMSs cannot
C. NoSQL databases are much less expensive than RDBMSs
D. NoSQL databases provide simplicity of use and scaling (at a cost of reduced functionality)
Answer:
QUESTION 1.29: Database issues
Distributed databases, in-memory databases and RDBMSs are specifically designed to address the
following issue:
A. the need for cheaper systems
B. the need for security
C. the need for scalable systems
D. the need to handle semi-structured data
Answer:
QUESTION 1.30: Data Scientist
Ideally, a Data Scientist should have strong:
A. domain expertise or be working with a colleague who has.
B. understanding of machine learning and statistics.
C. ability to prototype software and script tasks (e.g. in Python, R).
D. ALL of the others.
Answer:
QUESTION 1.31: Python and R
Which of the following statements about Python and R is TRUE?
A. R and Python can both be used for building predictive models.
B. Python doesn’t provide support for data frames.
C. Python is an extension of the R programming language.
D. R cannot be used to fit a linear regression.
Answer:
QUESTION 1.32: Data Wrangling
Which of the following is NOT a data wrangling activity?
A. Carry out A/B testing.
B. To fill in missing values in data.
C. Discretise the data into a set of values.
D. Remove record/row for missing values in data.
Answer:
QUESTION 1.33: R
The following R code:
myData <- read.table("myFile.csv",header=TRUE,sep=",")
plot(height~age,data=myData)
fit <- lm(height~age,data=myData)
abline(fit,col='red')
A. groups individuals by their height and age, and returns their fitness
B. computes an A/B line-test for an LM fit
C. plots a linear regression of height against age
D. does NONE of these options
Answer:
QUESTION 1.34: GapMinder
Motion Charts such as those found in GapMinder and Google Sheets:
A. can be used to fit a polynomial regression model.
B. allow us to visualise multiple dimensions of data at once.
C. are confusing to interpret.
D. are used to fit linear regressions.
Answer:
QUESTION 1.35: Dublin Core
Dublin Core, PMML, CRISP-DM and SNoMed-CT are all examples of:
A. metadata standards and domain-specific vocabularies
B. predictive models used in machine learning algorithms
C. NoSQL databases
D. tools for wrangling data into a format necessary for further processing
Answer:
QUESTION 1.36: Predictive Models
Which of the following statements regarding predictive models is FALSE:
A. A classification model can be seen as a way to divide up the feature space.
B. All features will be equally important for making good predictions.
C. When evaluating a predictive model, we should use test data that was not used for
training.
D. Generally, the more data used to train a model, the more accurate its predictions.
Answer:
QUESTION 1.37: Data science tools
Which of the following statements about data science tools is NOT true?
A. R is generally more scalable than Java or Python.
B. Defining an array in R requires using the concatenation function: c().
C. R was developed by statisticians.
D. Java can be used for building data science projects.
Answer:
QUESTION 1.38: Rationale
What type of model describes the rationale of how an organisation creates, delivers, and captures
value, in economic, social, cultural or other contexts?
A. an organisation model
B. none of the three other options
C. a data model
D. a business model
Answer:
QUESTION 1.39: Shell commands
Unix shell commands like “less” and “grep”:
A. are examples of technology that is too old to be useful to a modern data scientist
B. can be used to manipulate large data files easily
C. are used to fit regression tree models
D. are poorly documented
Answer:
QUESTION 1.40: Shell
What does this Unix Shell command do?
cut -f 5 data.txt | sort | uniq
A. Checks whether any of the boolean variables 'data.txt', 'sort' and 'uniq' are true
B. Cuts the file 'data.txt' into five pieces and sorts them
C. Returns all unique values in the fifth column of 'data.txt'
D. Imputes missing values in the file 'data.txt'
Answer:
QUESTION 1.41: Open data
Which of the following is true about “open data”?
A. Open data is both private and machine readable
B. Open data is always useful
C. Open data is machine-readable data that is publicly available
D. None of the above options
Answer:
Figure 1: Vacation Planning Influence Diagram
QUESTION 1.42: Influence Diagrams Relationships
Which of the numbered links in the Vacation Planning Influence Diagram, Figure 1, is NOT a valid
relationship within an Influence Diagram?
A. 2
B. 1
C. 3
D. 4
Answer:
Part 2 (58 marks in total)
Short Answer Questions: This section is worth 58 marks. Your answers should be written in clear,
simple English and should be complete enough in addressing the question. Extensive prose is not
required. Structured bullet points are acceptable.
Question 2.1 (2 marks)
DataWrangler and Python can both be used for data wrangling. Describe some characteristics of these
tools that could be used to choose between them.
Question 2.2 (2 marks)
Give an example where two very different data sets needed to be combined in order to make a data
science project work.
Question 2.3 (2 marks)
Jackie Chan, Steven Seagal, Chuck Norris and Jean-Claude Van Damme have come together to develop
a martial arts training game for the Sony PlayStation. The idea is to use motion-capture technology to
sense the movements of individuals who are practicing martial arts in front of the console and to
recommend to them ways to improve their style. The movie stars employ a data scientist to build the
required prediction system. In order to start building the system, what would be the first tasks the
data scientist would perform in order to obtain the right data for analysis?
Question 2.4 (2 marks)
Suppose you are hired as a Data Scientist by a leading Financial Institution in your country, which
specialises in providing home loans to customers. Your first assignment on the job is to undertake a
review of the current procedures used by the company and come up with innovative customer-focused
data products that could benefit the organisation. Give examples of some descriptive,
prescriptive, and predictive analytics that could be generated by your team that could be useful to the
organisation.
Question 2.5 (2 marks)
Give two examples of Data Science applications specifically in retail (selling mass produced goods to
consumers).
Question 2.6 (2 marks)
What is a scripting language, and how do scripting languages relate to rapid prototyping?
Question 2.7 (2 marks)
Consider a graph database, such as DBpedia. Give an example of a commercial or government
application that would use a graph database, and discuss why it is appropriate.
Question 2.8 (2 marks)
Hadoop (using Google’s MapReduce framework) is a system for distributed computation. Give an
example of a data analysis task that performs poorly with Hadoop and explain why.
Question 2.9 (2 marks)
Describe what bias and variance are.
Question 2.10 (2 marks)
How big should data be in order to be considered “Big Data”? i.e. what are two main features that
characterise big data versus “small data”?
Question 2.11 (2 marks)
What is the Predictive Model Markup Language and what is it used for?
Question 2.12 (2 marks)
You have been working on a data science project and each team member has come up with a different
model for classifying the same dataset. The team leader suggested that they should use multiple
models for this classification task. Do you agree that using all the different models for the classification
task is better than agreeing to a single model? Explain your answer.
Question 2.13 (2 marks)
What is “Linked Open Data”? Why is it called “linked” and why “open”? What sort of format can it be
in?
Question 2.14 (2 marks)
What is a clinical trial and why are clinical trials used? Give an example.
Question 2.15 (2 marks)
Explain how Koomey’s Law has affected data science.
Question 2.16 (2 marks)
They say “correlation does not imply causation”. Give an example of variables that are correlated but
not causal, and explain your example.
An Automobile Insurance Company
Imagine you are the manager of the fraud department of an automobile insurance company and you
hire three junior data scientists over summer. You ask them to analyse a database of insurance claims
which has the instances of fraud tagged. Using the data, the three data scientists build three different
models for predicting fraud.
Question 2.17 (2 marks)
With respect to the Automobile Insurance Company, how would you evaluate the models developed
by the three data scientists at the end of the summer?
Question 2.18 (2 marks)
With respect to the Automobile Insurance Company, the Chief Data Officer (CDO) discovers your plans
to provide insurance claims data to the three junior data scientists (who are not employees, but
students) over the summer. What concerns if any should the CDO have?
Question 2.19 (2 marks)
Explain what the following shell script is trying to achieve.
cat bushfire_tweets.csv.gz | gunzip | grep 'Melbourne' | wc -l
Question 2.20 (2 marks)
Regulatory compliance is a data management issue. Describe how compliance impacts Data Science
and the nature of the impact.
Question 2.21 (2 marks)
Note that the car industry underwent a digitisation process (adding a bus, adding digital sensors)
followed by a datafication process (managing, exporting and allowing access of data). Give another
(non-automotive) industry that has had similar developments in recent decades. Describe the kinds of
data. How do you expect this to change this industry?
Question 2.22 (2 marks)
Why are pipes and redirects in the Unix Shell useful for dealing with big data?
Question 2.23 (2 marks)
Give an example from the healthcare/medical domain where good data collection followed by analysis
has led (or could lead) to improved patient outcomes. Briefly explain (1) the data collected, (2) the
analysis done and (3) the improvement in outcomes.
Question 2.24 (2 marks)
Describe why medical data analytics has different data curation requirements to data analytics for
astrophysics.
Question 2.25 (2 marks)
In the context of a government agency, list two forms of metadata that might be associated with a
letter.
HINT: the content of the letter is not metadata.
Question 2.26 (2 marks)
Give an example of a business or organisation whose legal obligations with their data restrict the use
they could make of the data. Make sure to note why/how the legal obligations conflict with the
business objectives.
a) Describe the business/organisation in one sentence (0 marks, but informs b):
b) Describe the legal and the business objectives that are conflicting, and why/how:
Question 2.27 (2 marks)
Why is metadata important for data analysis?
Question 2.28 (2 marks)
Assuming that 'myTable' and 'myTable2' are two data frames, what would the following R code
output?
rbind(myTable, myTable2)
Question 2.29 (2 marks)
Consider the simple influence diagram in Figure 2 describing the situation of a driver whose car
doesn’t start.
What does the Car Start Influence Diagram, Figure 2, inform you about the node “dipstick”?
Figure 2: Car Start Influence Diagram
END OF EXAM