STAT S263F Big Data Analytics and Applications
Tutorial 1: Introduction to Big Data
Most materials are from reference books. For use in lectures only. Not for duplication.

Learning Outcome:
- Understanding the role of R in Big Data Analytics
- Overview of RStudio
- Creating an R Program
- Data Manipulation in R

What is R and its role in Big Data Analytics
R is an open source software package for performing statistical analysis and gleaning key insights from data using techniques such as regression, clustering, classification, and text analysis. It is widely used by data scientists, statisticians, and others who need to carry out statistical analysis. R is licensed under the GNU General Public License (GPL). It was developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently maintained by the R Development Core Team. It can be considered a different implementation of S, which was developed by John Chambers at Bell Labs. There are some important differences, but a lot of code written in S runs unaltered under the R interpreter [1].

R provides a wide variety of statistical and machine learning techniques (linear and nonlinear modeling, classic statistical tests, time-series analysis, classification, clustering) as well as graphical techniques, and is highly extensible. R has various built-in and extended functions for statistical, machine learning, and visualization tasks such as:
- Data extraction
- Data cleaning
- Data transformation
- Statistical analysis
- Predictive modelling
- Data visualization

The strengths of R lie in its ability to analyse data using a rich library of packages. R can also connect to external data stores and engines, such as MySQL, SQLite, MongoDB, and Spark, for Big Data analysis.

Overview of RStudio Desktop
RStudio is an Integrated Development Environment (IDE) for R, a programming language for statistical computing and graphics.
1.1 Installation of RStudio Desktop at home
RStudio has been installed in our PC laboratories for lecture and tutorial use. However, you are strongly recommended to install RStudio on your own PC or laptop for self-learning. Follow the videos below to install R and RStudio on Microsoft Windows and macOS, respectively.

Figure 1: How to download R and install Rstudio on Windows 10 2021
https://www.youtube.com/watch?v=NZxSA80lF1I

Figure 2: How to install R & RStudio on Mac in 2021 - step-by-step walkthrough
https://www.youtube.com/watch?v=LanBozXJjOk

Features of R Programming Language
- Support for GPU and distributed computing
- Many statistical functions and libraries
- Graphics and data visualization
- Databases

Overview of RStudio

1. Console
The console window (in RStudio, the bottom left panel) is where you type commands and see their results. You can type commands directly into the console, but they will be forgotten when you close the session.

2. Script Editor
Alternatively, you can enter the commands in the script editor and save the script. This way, you have a complete record of what you did, and you can rerun the script later. You can also copy and paste commands from the script into the R console.

3. Environment
The Environment tab in the top right window lists the variables and functions present in the current R session. However, it does not include the functions and data in loaded packages (unless you select a package from the drop-down menu that says "Global Environment"). When you ask "what have I created so far?", the answer is in the Environment tab.
4. File Browser
The default tab in the lower right window is a basic file browser. You can open, delete, and rename files there. It is not as well developed as your operating system's file browser and is mostly there so that you don't have to switch applications to manage files.

5. Plots
Shows the plots generated by your R scripts, or by R commands typed in the console, after a plot routine is executed.

6. Packages
Lists all the packages you have installed. Click the "Install" button to install new packages and the "Update" button to update packages.

Installing and Importing R packages
First, identify the name of the package to be installed. Let's consider gplots as an example.

Part I: Getting the Package onto your Computer
1. Type install.packages("gplots") and then press the Enter/Return key.
2. If you have already downloaded a package from a server in the current R session, R will install the package automatically. If not, R will prompt you to choose a mirror. Choose one close to you, unless you want to watch a loading bar slowly inch its way to completion.

Part II: Loading the Package into R
1. Type library(gplots) and then press the Enter/Return key.

Working directories
There are two ways to find or change the working directory: (i) using the graphical user interface and (ii) typing R commands.

Part I. Using the graphical user interface
1. Go to the file browser.
2.
Create folder: Creates a new folder for your R code.
3. Delete file: Tick the box next to a file (e.g. "hello.R") and click "Delete" to delete the file.
4. Rename file: Tick the box next to a file (e.g. "hello.R") and click "Rename" to rename the file.
5. Path to displayed directory: Under the new folder button, there is a bar showing the current directory.
6. Change directory: Click the "…" button on the right-hand side of the path to the displayed directory, and a window will appear for you to choose a directory.

Part II. Using R commands
The following table lists some useful R commands for working directories.

Command                        Action
getwd()                        Find the current working directory (where inputs are found and outputs are sent).
setwd("your directory path")   Change the current working directory.

Getting started with R console
1. To begin, click the Console tab in the bottom left-hand corner of RStudio.
2. We shall begin with a simple variable assignment to get familiar with the R environment. At the console prompt, enter the following to assign the string 'apple' to the variable a:
   a <- 'apple'
3. Then enter the following to show the value stored in the variable a:
   a

Command            Action
ls()               List all variables in the environment.
rm(x)              Remove x from the environment.
rm(list = ls())    Remove all variables from the environment.
#                  Start a single-line comment.

Note: R commands are case-sensitive. Press Ctrl+L to clear the console screen.

Getting started with Script editor
1. To begin, click the + button to create a new R script. A dropdown will appear; click R Script.
2.
We shall repeat the previous console example in the Script Editor. Click the empty white space next to Line 1 and a cursor will appear. Enter the following to assign the string 'apple' to the variable a:
   a <- 'apple'
3. Then enter the following to show the value stored in the variable a:
   a
4. Click the save button to save the R script. A file browser will appear. Save the file to the current working directory in *.R format. Then click the Run button.
5. In the Environment tab (right-hand side of RStudio), the variable a and its value "apple" should appear.

R markdown
R Markdown is a file format for making dynamic documents with R. An R Markdown document is written in markdown (an easy-to-write plain text format) and contains chunks of embedded R code. Fig. 1.1 shows an R Markdown file generated from Example 1.1.

Fig. 1.1 R Markdown file of Example 1.1.

1. To begin, go to File -> New File -> R Markdown.
2. You can choose between exporting to an HTML, PDF, or MS Word file. Type the title and author of your R Markdown file, choose HTML, and then click OK.
3. Delete summary(cars) and plot(pressure) and replace them with your own R code. In R Markdown, R code is displayed in chunks that open with ```{r} and close with ```. Options can be added inside the braces. For example, echo=FALSE prevents the source code from being displayed (note the capitalization: R is case-sensitive).
4. Delete all the template descriptions and modify the text where necessary.
5.
Click the Knit button to export the R Markdown file to an HTML file.
6. The output HTML will look like this: [screenshot of the knitted HTML output]

Vector Operations
In R, vectors have no row or column orientation. When a vector is multiplied with a matrix using the operator %*%, R interprets it as a row or a column vector, whichever makes the matrix product conformable.

1. Creating Vectors

Command              Action                                     Output
c(2,4,6)             Join elements into a vector                2 4 6
2:6                  An integer sequence                        2 3 4 5 6
seq(2,3, by=0.5)     A sequence with a custom step (here 0.5)   2.0 2.5 3.0
rep(1:2, times=3)    Repeat a vector                            1 2 1 2 1 2
rep(1:2, each=3)     Repeat elements of a vector                1 1 1 2 2 2

2. Vector functions

Command                          Action
sort(x)                          Return x sorted in ascending order
sort(x, index.return=TRUE)       Return x sorted, together with the ordering index as $ix
sort(x, decreasing=TRUE)         Return x sorted in descending order
table(x)                         See counts of values
rev(x)                           Return x reversed
unique(x)                        See unique values
length(x)                        Return the number of elements in the vector

Random number generators
rnorm(4)                         Generate a vector of length 4 with random numbers drawn from the normal distribution with mean 0 and variance 1
rpois(10,1)                      Generate a vector of length 10 with random numbers drawn from a Poisson distribution with mean 1
sample(1:100,3,replace=TRUE)     Draw 3 integers uniformly at random from 1 to 100, with replacement

3. Selecting Vector Elements

Command        Action
By Position
x[4]           The fourth element
x[-4]          All but the fourth
x[2:4]         Elements two to four
x[-(2:4)]      All elements except two to four
x[c(1,5)]      Elements one and five
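The positional selections listed above can be tried on a small vector. The following is a minimal sketch and not part of the original handout: the vector values and the result names (fourth, all_but_fourth, and so on) are assumptions chosen for illustration. Note that a negative index drops elements; it does not count from the end of the vector.

```r
# Illustrative vector with known values so each result is easy to check
x <- c(10, 20, 30, 40, 50, 60)

fourth          <- x[4]        # the fourth element
all_but_fourth  <- x[-4]       # every element except the fourth
two_to_four     <- x[2:4]      # elements two to four
except_2_to_4   <- x[-(2:4)]   # every element except two to four
first_and_fifth <- x[c(1, 5)]  # elements one and five only

print(fourth)
print(all_but_fourth)
print(two_to_four)
print(except_2_to_4)
print(first_and_fifth)
```

The parentheses in x[-(2:4)] matter: -(2:4) negates the whole sequence 2, 3, 4, whereas -2:4 would build the sequence from -2 to 4, which mixes negative and positive indices and is rejected by R.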
By Value
x[x==10]              Elements which are equal to 10
x[x<0]                All elements less than zero
x[x %in% c(1,2,5)]    Elements in the set 1, 2, 5
Named Vectors
x["apple"]            Element with name "apple"

IMPORTANT!! Before you begin each chapter, always start a new project.
1. Go to File -> New Project.
2. Choose New Directory.
3. Choose New Project.
4. After typing your directory name and choosing the subdirectory, click Create Project.

Example 1.1 Vectors
In the following example, we shall practise vector operations in R. Take a look at the following program, shown with line numbers on the left:

1. # Example 1.1
2. x <- rnorm(10)
3. y2 <- x[5]
4. y3 <- x[-8]
5. y4 <- x[2:6]
6. y5 <- x[-(1:3)]
7. y6 <- x[c(1,3,5)]
8. y7 <- x[x<0]

Each line is a complete R statement. The first line, # Example 1.1, is a comment, which is used to document a program and make it readable and understandable. Comments are ignored by the R interpreter and are not executed. Currently, R supports only single-line comments.

At line 2, rnorm(10) generates a vector of length 10 containing random numbers drawn from the normal distribution with mean zero and variance 1, and x <- rnorm(10) assigns that vector to a new variable called x.

Lines 3-8 deal with data manipulation within the vector x. At line 3, y2 <- x[5] extracts the 5th element of x and stores it in a new variable called y2. At line 4, y3 <- x[-8] extracts all elements of x except the 8th and stores them in y3. At line 5, y4 <- x[2:6] extracts the 2nd to 6th elements and stores them in y4.
At line 6, y5 <- x[-(1:3)] collects all elements of x except the 1st to 3rd and stores them in y5. At line 7, y6 <- x[c(1,3,5)] collects the 1st, 3rd, and 5th elements and stores them in y6. Finally, at line 8, y7 <- x[x<0] extracts the elements of x that are smaller than zero and stores them in y7.

Tutorial Assignment 1.1 Vectors
1. Create a random vector of integers uniformly distributed from 1 to 20 and store it as vector x.
2. Extract the 3rd element of the vector x and store it as y2.
3. Extract all elements of x excluding the 5th element and store them as vector y3.
4. Extract the 6th to 10th elements of x and store them as vector y4.
5. Extract all elements of x excluding the 6th to 10th elements and store them as vector y5.
6. Extract the 2nd, 5th, and 11th elements of x and store them as vector y6.
7. Extract the elements of x which are larger than 11 and store them as vector y7.
8. Save the file as