FIT5145 Introduction to Data Science
Instructions
You can answer each question by directly typing your answer in the corresponding space provided for the question. Marks are indicated next to each question. This exam paper consists of 2 parts and the total marks for the exam are 100.

Part 1 (42 marks in total)
Multiple Choice Questions: This section is worth 42 marks. Each question is worth 1 mark. Identify the choice that best completes the statement or answers the question. There is only one best answer to each question. Sometimes two answers may appear feasible, but you are to pick the one you believe is the best.

Marking Scheme for Multiple Choice Questions:
• 1 mark for a correct answer
• 0 marks for the wrong answer, more than one answer, or no answer

QUESTION 1.1: Data management
Which of the following is not data management in practice:
A. locking the filing cabinet when you leave on Friday
B. interviewing a new patient to get their details
C. writing your name and date onto a backup CD you just burned
D. throwing an old file in the shredder
Answer:

QUESTION 1.2: Collecting data
A data management plan for collecting scientific data should include:
A. documenting the semantic meaning of all values being recorded
B. providing the units of measurement for all values being collected
C. consistent use of codes and special values such as “missing”
D. ALL of the other options
Answer:

QUESTION 1.3: Data management
A data management plan for an organisation often deals with issues relating to:
A. integration and data warehousing
B. replication and persistence
C. standardising the vocabulary used across the organisation
D. ALL of the other options
Answer:

QUESTION 1.4: Implicit data
Complete the following sentence: Implicit data is ...
A. data created during indexing and used for providing free-text search functionality.
B. highly inaccurate so should never be used.
C. data that is not explicitly stored but inferred with reasonable precision from available data.
D.
highly accurate and allows us to avoid any ethical concerns associated with explicitly collecting private data.
Answer:

QUESTION 1.5: Government agencies
General best practices for government agencies with data management do not need to include:
A. mandated software and preferred suppliers
B. sensible risk management practices
C. ethical leadership on their use of big data
D. clear and transparent privacy policies
Answer:

QUESTION 1.6: Privacy, confidentiality and security
Which of the following statements about privacy, confidentiality and security is TRUE?
A. The privacy of users online is never very important, only their confidentiality, since it has legal implications.
B. Security is the act of protecting confidential information such that the privacy of users is never violated.
C. Online activity of users is confidential, but never affects their offline lives, so privacy is unaffected.
D. There is no difference between the three concepts of privacy, confidentiality and security. They all refer to the same idea.
Answer:

QUESTION 1.7: Governance
Data governance does NOT involve dealing with which of the following:
A. taxidermy
B. privacy issues
C. legal compliance
D. archiving
Answer:

QUESTION 1.8: Privacy
Privacy and confidentiality:
A. are the same, confusingly
B. the second refers to what you can do regarding the privacy of others
C. the second is about information only
D. the second is the more precise legal term
Answer:

QUESTION 1.9: Scripting languages
Which of the following is not a scripting language?
A. R
B. RapidMiner
C. Matlab
D. Python
Answer:

QUESTION 1.10: Clinical trials
A clinical trial is primarily designed to:
A. apply the principle of intervention to test cause.
B. test correlation between treatments and outcomes.
C. stop scientists from cheating.
D. isolate the causes of outcomes.
Answer:

QUESTION 1.11: Significance testing
Complete this sentence. Significance testing can lead to inaccurate conclusions when:
A.
scientists test for correlations between many different variables in an experiment, but only report on the large values.
B. scientists use poor experimental methodology leading to inadequate repeatability.
C. scientists run many different experiments but only ever report results when their outcome is positive.
D. ALL the other options are correct.
Answer:

QUESTION 1.12: Shell Command
How many lines will be output by the following shell command?
    cat data.csv | awk -F',' 'rand()<1/10 {print $7}'
A. impossible to tell
B. exactly 10
C. exactly 7
D. approximately 10% of the lines in the original file
Answer:

QUESTION 1.13: Evaluating algorithms
Complete this sentence. When evaluating algorithms, training and test sets usually:
A. are carefully selected to stress algorithms.
B. are taken from the same set of data, but are non-overlapping.
C. are drawn from very different sources.
D. are the same data sets.
Answer:

QUESTION 1.14: Evaluating the results of learning
When evaluating and presenting the results of learning, it is not important to:
A. record the processing steps for background and reproducibility.
B. work with a domain expert to understand the proper costs and benefits of different outcomes and errors.
C. use the standard significance level of 0.01.
D. keep a separate data set for unbiased testing.
Answer:

QUESTION 1.15: Top data science tools
Why are many of the top data science tools open source?
A. Microsoft gifted the software.
B. Data scientists cannot train on costly commercial tools.
C. They were initially academic projects.
D. Most companies prefer to work with open source.
Answer:

QUESTION 1.16: Market segmentation
What is market segmentation?
A. To break up marketing data into parts, e.g. to more easily use MapReduce.
B. Another name for market basket analysis.
C. A form of clustering done to partition consumers into similar groups to allow bulk targeting.
D. A technique that uses latent variables so that linear regression can be applied.
Answer:

QUESTION 1.17: Learning curve
In Machine Learning the term “learning curve” refers to:
A. the speed at which users become accustomed to new technology
B. a graph of the predictions of a polynomial regression model
C. a graph of performance of a predictive model versus the quantity of data used to train it
D. a graph of the difference between the loss and error functions for a predictive model
Answer:

QUESTION 1.18: Bias in a learning algorithm
Complete this sentence. A large bias in a learning algorithm means:
A. the algorithm works with simpler models so that it is not able to fit the data as well.
B. the algorithm works with more complex models which are able to be biased more away from the data.
C. the algorithm encodes the programmers’ confirmation bias.
D. the algorithm is using regularisation to enforce a bias as a way of preventing overfitting.
Answer:

QUESTION 1.19: Data separation
Separating data so individual departments manage their own:
A. causes problems because of inconsistencies across departments
B. is cheaper
C. allows Hadoop-style processing to be done more easily
D. is the preferred solution to managing volume and variety in large organisations
Answer:

QUESTION 1.20: Spark
Spark was built into the Hadoop platform because:
A. it is implemented on top of the basic MapReduce mechanism of Hadoop
B. the same core programmer team did the initial development
C. to gain from the Hadoop brand-name
D. it is easier to build on top of the Hadoop infrastructure
Answer:

QUESTION 1.21: Database types
How does a graph database differ from a relational database?
A. they are the same
B. graph databases are better at storing and analysing data interaction patterns
C. graph databases are used for storing graphics
D. graph databases are used for money transfers
Answer:

QUESTION 1.22: Data processing
Which of these is NOT a common type of data processing approach?
A. streaming
B. interactive
C. bidirectional
D.
batch
Answer:

QUESTION 1.23: Digital containers
A digital container format is designed specifically to give:
A. descriptive metadata by demarcating content from annotation
B. structural metadata by arranging embedded content
C. administrative metadata by incorporating metadata standards
D. markup language via text entries
Answer:

QUESTION 1.24: Volume
Volume in the big data definitions is:
A. relative, as it varies with the kind of data and current processing capacity, and the task being performed
B. best measured in terabytes back in the year 2001
C. best measured in yottabytes
D. best measured in units relative to typical hard-drive capacity of the time
Answer:

QUESTION 1.25: Disks
Over the years, disk capacity is generally growing:
A. linearly
B. logarithmically
C. quadratically
D. exponentially
Answer:

QUESTION 1.26: DBMS
Which of the following statements about different types of databases is FALSE:
A. MongoDB stores data in JSON-like documents and is therefore not a NoSQL database
B. MySQL is an example of a Relational DBMS
C. HBase is modeled after Google’s “Bigtable”
D. Cassandra is an example of a “wide column store”
Answer:

QUESTION 1.27: MapReduce
Google no longer uses MapReduce because:
A. it is open source, and they did not develop it
B. they sold it
C. they realised it does not handle some more complex types of distributed processing well
D. NONE of the other options
Answer:

QUESTION 1.28: NoSQL
The growth of NoSQL databases occurred because:
A. NoSQL is a more powerful query language than SQL
B. NoSQL databases support standard web applications, while RDBMSs cannot
C. NoSQL databases are much less expensive than RDBMSs
D. NoSQL databases provide simplicity of use and scaling (at a cost of reduced functionality)
Answer:

QUESTION 1.29: Database issues
Distributed databases, in-memory databases and RDBMSs are specifically designed to address the following issue:
A. the need for cheaper systems
B. the need for security
C.
the need for scalable systems
D. the need to handle semi-structured data
Answer:

QUESTION 1.30: Data Scientist
Ideally, a Data Scientist should have strong:
A. domain expertise or be working with a colleague who has.
B. understanding of machine learning and statistics.
C. ability to prototype software and script tasks (e.g. in Python, R).
D. ALL of the others.
Answer:

QUESTION 1.31: Python and R
Which of the following statements about Python and R is TRUE?
A. R and Python can both be used for building predictive models.
B. Python doesn’t provide support for data frames.
C. Python is an extension of the R programming language.
D. R cannot be used to fit a linear regression.
Answer:

QUESTION 1.32: Data Wrangling
Which of the following is NOT a data wrangling activity?
A. Carry out A/B testing.
B. To fill in missing values in data.
C. Discretise the data into a set of values.
D. Remove record/row for missing values in data.
Answer:

QUESTION 1.33: R
The following R code:
    myData <- read.table("myFile.csv",header=TRUE,sep=",")
    plot(height~age,data=myData)
    fit <- lm(height~age,data=myData)
    abline(fit,col='red')
A. groups individuals by their height and age, and returns their fitness
B. computes an A/B line-test for an LM fit
C. plots a linear regression of height against age
D. does NONE of these options
Answer:

QUESTION 1.34: GapMinder
Motion Charts such as those found in GapMinder and Google Sheets:
A. can be used to fit a polynomial regression model.
B. allow us to visualise multiple dimensions of data at once.
C. are confusing to interpret.
D. are used to fit linear regressions.
Answer:

QUESTION 1.35: Dublin Core
Dublin Core, PMML, CRISP-DM and SNoMed-CT are all examples of:
A. metadata standards and domain-specific vocabularies
B. predictive models used in machine learning algorithms
C. NoSQL databases
D.
tools for wrangling data into a format necessary for further processing
Answer:

QUESTION 1.36: Predictive Models
Which of the following statements regarding predictive models is FALSE:
A. A classification model can be seen as a way to divide up the feature space.
B. All features will be equally important for making good predictions.
C. When evaluating a predictive model, we should use test data that was not used for training.
D. Generally, the more data used to train a model, the more accurate its predictions.
Answer:

QUESTION 1.37: Data science tools
Which of the following statements about data science tools is NOT true?
A. R is generally more scalable than Java or Python.
B. Defining an array in R requires using the concatenation function: c().
C. R was developed by statisticians.
D. Java can be used for building data science projects.
Answer:

QUESTION 1.38: Rationale
What type of model describes the rationale of how an organisation creates, delivers, and captures value, in economic, social, cultural or other contexts?
A. an organisation model
B. none of the three other options
C. a data model
D. a business model
Answer:

QUESTION 1.39: Shell commands
Unix shell commands like “less” and “grep”:
A. are examples of technology that is too old to be useful to a modern data scientist
B. can be used to manipulate large data files easily
C. are used to fit regression tree models
D. are poorly documented
Answer:

QUESTION 1.40: Shell
What does this Unix Shell command do?
    cut -f 5 data.txt | sort | uniq
A. Checks whether any of the boolean variables 'data.txt', 'sort' and 'uniq' are true
B. Cuts the file 'data.txt' into five pieces and sorts them
C. Returns all unique values in the fifth column of 'data.txt'
D. Imputes missing values in the file 'data.txt'
Answer:

QUESTION 1.41: Open data
Which of the following is true about “open data”?
A. Open data is both private and machine readable
B. Open data is always useful
C.
Open data is machine-readable data that is publicly available
D. None of the above options
Answer:

Figure 1: Vacation Planning Influence Diagram

QUESTION 1.42: Influence Diagrams Relationships
Which of the numbered links in the Vacation Planning Influence Diagram, Figure 1, is NOT a valid relationship within an Influence Diagram?
A. 2
B. 1
C. 3
D. 4
Answer:

Part 2 (58 marks in total)
Short Answer Questions: This section is worth 58 marks. Your answers should be written in clear, simple English and should be complete enough in addressing the question. Extensive prose is not required. Structured bullet points are acceptable.

Question 2.1 (2 marks)
DataWrangler and Python can both be used for data wrangling. Describe some characteristics of these tools that could be used to choose between them.

Question 2.2 (2 marks)
Give an example where two very different data sets needed to be combined in order to make a data science project work.

Question 2.3 (2 marks)
Jackie Chan, Steven Seagal, Chuck Norris and Jean-Claude Van Damme have come together to develop a martial arts training game for the Sony PlayStation. The idea is to use motion-capture technology to sense the movements of individuals who are practicing martial arts in front of the console and to recommend to them ways to improve their style. The movie stars employ a data scientist to build the required prediction system. In order to start building the system, what would be the first tasks the data scientist would perform in order to obtain the right data for analysis?

Question 2.4 (2 marks)
Suppose you are hired as a Data Scientist by a leading Financial Institution in your country, which specialises in providing home loans to customers. Your first assignment on the job is to undertake a review of the current procedures used by the company and come up with innovative customer-focused data products that could benefit the organisation.
Give examples of some descriptive, prescriptive, and predictive analytics that could be generated by your team that could be useful to the organisation.

Question 2.5 (2 marks)
Give two examples of Data Science applications specifically in retail (selling mass produced goods to consumers).

Question 2.6 (2 marks)
What is a scripting language, and what is the relationship between scripting languages and rapid prototyping?

Question 2.7 (2 marks)
Consider a graph database, such as DBpedia. Give an example of a commercial or government application that would use a graph database, and discuss why it is appropriate.

Question 2.8 (2 marks)
Hadoop (using Google’s MapReduce framework) is a system for distributed computation. Give an example of a data analysis task that performs poorly with Hadoop and explain why.

Question 2.9 (2 marks)
Describe what bias and variance are.

Question 2.10 (2 marks)
How big should data be in order to be considered “Big Data”? i.e. what are two main features that characterise big data versus “small data”?

Question 2.11 (2 marks)
What is the Predictive Model Markup Language and what is it used for?

Question 2.12 (2 marks)
You have been working on a data science project and each team member has come up with a different model for classifying the same dataset. The team leader suggested that they should use multiple models for this classification task. Do you agree that using all the different models for the classification task is better than agreeing to a single model? Explain your answer.

Question 2.13 (2 marks)
What is “Linked Open Data”? Why is it called “linked” and why “open”? What sort of format can it be in?

Question 2.14 (2 marks)
What is a clinical trial and why are clinical trials used? Give an example.

Question 2.15 (2 marks)
Explain how Koomey’s Law has affected data science.

Question 2.16 (2 marks)
They say “correlation does not imply causation”.
Give an example of variables that are correlated but not causal, and explain your example.

An Automobile Insurance Company
Imagine you are the manager of the fraud department of an automobile insurance company and you hire three junior data scientists over summer. You ask them to analyse a database of insurance claims which has the instances of fraud tagged. Using the data, the three data scientists build three different models for predicting fraud.

Question 2.17 (2 marks)
With respect to the Automobile Insurance Company, how would you evaluate the models developed by the three data scientists at the end of the summer?

Question 2.18 (2 marks)
With respect to the Automobile Insurance Company, the Chief Data Officer (CDO) discovers your plans to provide insurance claims data to the three junior data scientists (who are not employees, but students) over the summer. What concerns, if any, should the CDO have?

Question 2.19 (2 marks)
Explain what the following shell script is trying to achieve.
    cat bushfire_tweets.csv.gz | gunzip | grep 'Melbourne' | wc -l

Question 2.20 (2 marks)
Regulatory compliance is a data management issue. Describe how compliance impacts Data Science and the nature of the impact.

Question 2.21 (2 marks)
Note that the car industry underwent a digitisation process (adding a bus, adding digital sensors) followed by a datafication process (managing, exporting and allowing access of data). Give another (non-automotive) industry that has had similar developments in recent decades. Describe the kinds of data. How do you expect this to change this industry?

Question 2.22 (2 marks)
Why are pipes and redirects in the Unix Shell useful for dealing with big data?

Question 2.23 (2 marks)
Give an example from the healthcare/medical domain where good data collection followed by analysis has led (or could lead) to improved patient outcomes.
Briefly explain (1) the data collected, (2) the analysis done and (3) the improvement in outcomes.

Question 2.24 (2 marks)
Describe why medical data analytics has different data curation requirements to data analytics for astrophysics.

Question 2.25 (2 marks)
In the context of a government agency, list two forms of metadata that might be associated with a letter. HINT: the content of the letter is not metadata.

Question 2.26 (2 marks)
Give an example of a business or organisation whose legal obligations with their data restrict the use they could make of the data. Make sure to note why/how the legal obligations conflict with the business objectives.
a) Describe the business/organisation in one sentence (0 marks, but informs b):
b) Describe the legal and the business objectives that are conflicting, and why/how:

Question 2.27 (2 marks)
Why is metadata important for data analysis?

Question 2.28 (2 marks)
Assuming that 'myTable' and 'myTable2' are two data frames, what would the following R code output?
    rbind(myTable, myTable2)

Question 2.29 (2 marks)
Consider the simple influence diagram in Figure 2 describing the situation of a driver whose car doesn’t start. What does the Car Start Influence Diagram, Figure 2, inform you about the node “dipstick”?

Figure 2: Car Start Influence Diagram

END OF EXAM

Blank page for notes or additional answers if needed.