Date: 2021-10-08

R Programming for Data Science
STSCI3040 - STSCI 5040
Fall 2021
Statistics courses usually use clean and well-behaved data, which leaves many students unprepared for the messiness and chaos of data in the real world. This course aims to prepare students for dealing with data using the R programming language. The introduction will give an overview of basic R syntax, foundational R programming concepts such as data types, vector arithmetic, and indexing, and importing data into R from different file formats. The data wrangling topics include how to tidy data using the tidyverse to better facilitate analysis, string processing with regular expressions, working with dates and times, web scraping, and text mining. Data visualization topics will cover visualization principles, the use of ggplot2 to create custom plots, and how to communicate data-driven findings.

Chapter 1: RStudio
RStudio is made up of four panes, as follows:
1. Console
2. Source
3. Environment/History
4. File/Packages/Help

Chapter 2: Basic RMarkdown
---
title: "Untitled"
author: "Jeremy Entner"
date: "8/25/2021"
output: word_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

```{r cars}
summary(cars)
```

## Including Plots

You can also embed plots, for example:

```{r pressure, echo=FALSE}
plot(pressure)
```

Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code that generated the plot.

Chapter 3: Numerical Data
Numerical Data comes in THREE forms:
1. Integer
2. Floating Point/Double
3. Complex
We won't worry about complex numbers.
An integer is any whole number, positive or negative. As an integer, it will be recognizable as a number followed by a capital L. For example, the integer three would appear as 3L.
A floating point/double number is any whole number or fractional number. The floating point/double version of the number three would simply appear as 3.
The real difference between these two types is determined by how the computer stores the numbers. For the most part, we can ignore the difference between integers and floating point numbers.
Numerical Vectors
Collections of numerical data can be held in an ordered numerical (atomic) vector, in the same way as logical data. An ordered numerical vector can be created using the concatenate function c(). Make sure to separate each numerical data value with a comma.
c(1, 4, -10, 3.5)
[1] 1.0 4.0 -10.0 3.5
Mathematical Operations
Many standard mathematical operations can be performed on numerical vectors. The most basic would be addition.
1 + 2.3
[1] 3.3
While it looks like we added two numbers, what really happened was that R added together two numerical vectors that contained a single value each, namely 1 and 2.3. If we had wanted to add two longer vectors together, we could have done so.
# Addition - ?Arithmetic
c(0, 2, 2.4) + c(1, -2, 0.6)
[1] 1 0 3
The plus sign + performs a vectorized addition. Corresponding pairs of values in each vector are added together. The first values, 0 and 1, are added to give the first value (1) in the result. The second values, 2 and -2, are added to give the second value (0) in the result. Finally, the last values, 2.4 and 0.6, are added together to give the final value (3) in the result. The basic mathematical operations that we will look at are vectorized.
# Subtraction - ?Arithmetic
c(0, 2, 2.4) - c(1, -2, 0.6)
[1] -1.0 4.0 1.8
# Multiplication - ?Arithmetic
c(0, 2, 2.4) * c(1, -2, 0.6)
[1] 0.00 -4.00 1.44
# Division - ?Arithmetic
c(0, 2, 2.4)/c(1, -2, 0.6)
[1] 0 -1 4
If parentheses are used, multiple operations can be combined. The order of operations can be described using PEMDAS.
Some more complex operations are also vectorized. However, for clarity, we will demonstrate them with single-valued vectors.
# Absolute Value - ?MathFun
abs(-7.1)
[1] 7.1
# Square Root
sqrt(4)
[1] 2
# Exponents - ?Arithmetic
3^2
[1] 9
# Natural Log & Exponential - ?log
log(4.8)
[1] 1.568616
exp(1.569)
[1] 4.801844
# Trig functions - ?Trig
cos(2)
[1] -0.4161468
# Quotient and Remainder - ?Arithmetic
10%/%3
[1] 3
10%%3
[1] 1
Recycling
For operations with two inputs, if one input has fewer elements than the other, R will recycle (repeat) the shorter vector until it is long enough to perform the vectorized operation.
c(0, 2, 3.4, 10) + c(1, 0)
[1] 1.0 2.0 4.4 10.0
In this first example, R repeated the vector with two elements. The second vector was treated like c(1, 0, 1, 0).
c(0, 2, 3.4, 10) + c(1, 0, 7)
Warning in c(0, 2, 3.4, 10) + c(1, 0, 7): longer object length is not a multiple of shorter object length
[1] 1.0 2.0 10.4 11.0
In this example, a warning is given. It indicates that the length of the longer vector is not a multiple of the length of the shorter vector. Three does not divide evenly into four. The vector of length three was partially recycled as c(1, 0, 7, 1).
Two Interesting Facts
1. The number of digits displayed can be controlled. However, this doesn't change the represented value.
2. The number pi is stored in R. It can be accessed by typing pi.
options(digits = 20) #?options
pi #?Constants
[1] 3.141592653589793116

Chapter 4: Assignment Operator
The assignment operator <- allows you to efficiently keep track of values used in computations and results produced by computations.
Consider adding the vectors (0, 2, 2.4) and (1, -2, 0.6).
c(0, 2, 2.4) + c(1, -2, 0.6)
[1] 1 0 3
The result is (1, 0, 3). Suppose you wanted to save this result or perform further computations using it. The <- operator will allow you to do this. In particular, it will allow this without the need to retype the vector. (1, 0, 3) is short, only three values. A result with 1000 values would not be very easy to retype. The assignment operator can be used to attach a variable name to the result.
output <- c(0, 2, 2.4) + c(1, -2, 0.6)
To read this line of code, start on the right of the <- sign. First, (0, 2, 2.4) + (1, -2, 0.6) is computed. Second, the assignment operator <- assigns the result to a variable called output. Notice two things about this:
1. The result is not displayed.
2. A new value called output appeared in the environment pane.
Once a variable name has been assigned to a value, the two become synonymous. When executed, the variable output produces the values stored in it.
output
[1] 1 0 3
Operations can be performed on it, and functions can be applied to it, as if the original vector of values was being used.
output + 1
[1] 2 1 4
sin(output)
[1] 0.841471 0.000000 0.141120
In fact, a variable can be used to update itself and add new values.
output
[1] 1 0 3
output <- c(output, 999) # a fourth element is added to output
output
[1] 1 0 3 999
The right-hand side of <- takes the original values associated with the output variable, appends the value 999 to them, and then associates the result with the variable output. Effectively, the old values are erased and replaced.
Several Variables and Multiple Operations
Below, we assign variable names and perform some operations using the variables. The results have variable names assigned to them. Then, an operation is performed on the constructed results.
x <- c(0, 2, 2.4) # Assign the variable name x to this vector
y <- c(1, -2, 0.6) # Assign the variable name y to this vector
output.add <- x + y # Add x and y together, and assign a name to the result.
output.difference <- x - y # Subtract y from x, and assign a name to the result.
output.add * output.difference # Multiply output.add and output.difference together
[1] -1.0 0.0 5.4
(These operations were selected for demonstration purposes only.)
Rules for Variable Names
There are some rules for variable names that must be followed:
1. The name must start with a letter.
2. The name can only contain letters, numbers, underscores, and periods.
There are also some general guidelines for variable names. They will make reading and updating code much easier.
1. If a value (a vector of values, or any object that you can construct) is going to be used multiple times, assign it a variable name. If you discover a mistake in that value, correct the mistake where the variable name is assigned. There is no need to search all of your code.
2. The variable name must be assigned before it is first used in your code. R reads code from top to bottom.
test.variable # This variable will not exist until the next line.
Error in eval(expr, envir, enclos): object 'test.variable' not found
test.variable <- 3
test.variable
[1] 3
3. Use a name that describes what the values represent. At a later time, the descriptive name makes it easier to understand the variable's use.
4. Pick a convention for writing names and use it consistently.
variablename VARIABLENAME #all lower/UPPER case, words not separated, hard to read
variable.name #all lower case, words separated by a period, easy to read
variable_name #all lower case, words separated by an underscore, easy to read
variableName #Camel case, words not separated, capitalize first letter of each word except possibly the first, easier to read
5. Select names for your variables that do not already have some meaning in R.
pi # This is a constant that is already defined in R.
[1] 3.141593
pi <- 10 # This assigns a new value to the name pi
pi
[1] 10
Other Assignment Operators
Other assignment operators exist.
1. The single equal sign =. This works like <-, but does not visually indicate what is occurring as well as <- does.
2. -> assigns left-hand values into a variable listed on the right. If the value is long, it is hard to find the variable name.
3. <<- and ->> are assignment operators that may be used when making functions.

Chapter 5: Character Data
Character Data appears in two forms. It will appear as text surrounded by:
1. "Double quotes" or,
2. 'Single quotes.'
Commonly, character data are referred to as string data, or strings. Strings can contain any character you can type on the keyboard. Strings can even contain no characters. This is indicated by a set of quotes without any characters or spaces in between them.
"" #This is an empty string
[1] ""
" " #This string contains a space.
[1] " "
"This is a string. It contains words."
[1] "This is a string. It contains words."
Character Vectors
Collections of character data can be held in an ordered character (atomic) vector, in the same way as logical and numerical data. An ordered character vector can be created using the concatenate function c(). Make sure to surround each string with quotes and separate each string with a comma.
c("String", "Elements", "Are", "Wrapped", "In", "Quotes.", 9.7, FALSE)
[1] "String" "Elements" "Are" "Wrapped" "In" "Quotes." "9.7"
[8] "FALSE"
Operating on Strings
The paste() function is used to join at least two strings together into a single string.
paste("String", "Elements", "Are", "Wrapped", "In", "Quotes")
[1] "String Elements Are Wrapped In Quotes"
In this case, a space is used to separate the combined elements. If the sep = argument is added, a different string can be used to separate the combined elements.
paste("String", "Elements", "Are", "Wrapped", "In", "Quotes", sep = "_")
[1] "String_Elements_Are_Wrapped_In_Quotes"
Printing Strings

When entered on its own, a string will automatically display itself. However, in some situations, you must force the display.
This can be done using several functions.
print("This is a string.")
[1] "This is a string."
noquote("This is a string without printed quotes.")
[1] This is a string without printed quotes.
The cat() function can be used to print a string without the leading index ([1]) and without quotes. It will also interpret
formatting symbols such as \t (tab) and \n (new line).
cat("This line has a \t tab in it. \n Now, we have skipped a line and included ê.")
This line has a tab in it.
Now, we have skipped a line and included ê.
Modifying Strings
Strings can be modified in several ways. The case of the letters in a string can be changed. (These functions may not give the
same results on two different computers, if the underlying locales are set differently.)
tolower("ChanGE cAsE 1234")
[1] "change case 1234"
toupper("ChanGE cAsE 1234")
[1] "CHANGE CASE 1234"
Character replacement can be performed with chartr(). The old argument gives the characters to be replaced. The new
argument indicates what they are translated into. The final argument is the string to be translated.
chartr(old = "EhC", new = "eHX", "ChanGE cAsE 1234")
[1] "XHanGe cAse 1234"
chartr(old = "abcX", new = "DEFx", "abcdefghijklmnopqrstuvwXyz")
[1] "DEFdefghijklmnopqrstuvwxyz"
Extracting Strings
It may occur that just a single string contains several pieces of information. You may want to separate these pieces or
extract the single piece that you need.
To extract part of a string, use the substring() command. It requires that you know the numerical position of the characters
that you want in the string (including spaces).
This code will extract the second, third, fourth, and fifth characters from the given string.
substring("abcdefghijklmnopqrstuvwxyz", first = 2, last = 5)
[1] "bcde"
In some cases, the portion that you want to extract might not have a known length. However, you may know that there will
only be a certain number of characters, say 3, after it. Use the nchar() function to determine the number of characters in
the string, and subtract 3 from this value.
string.with.equal.signs <- "=I Want This Part.==="
string.length <- nchar(string.with.equal.signs)
substring(text = string.with.equal.signs, first = 2, last = string.length - 3)
[1] "I Want This Part."
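The same nchar() arithmetic can be used to pull characters from the end of a string. The short sketch below is illustrative only; the file name is a made-up value, not part of the course data.

```r
# Hypothetical example string: the last three characters are a file extension.
file.name <- "report_final.csv"
n <- nchar(file.name)                       # total number of characters
# Keep the characters in positions n - 2, n - 1, and n.
extension <- substring(file.name, first = n - 2, last = n)
extension
```

Counting from the end this way avoids having to know how long the first part of the string is.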
substring() can also be used to replace part of a string with another string.
substring(text = string.with.equal.signs, first = 2, last = string.length - 3) <- "++++++"
string.with.equal.signs
[1] "=++++++ This Part.==="
If there is a regular pattern to how the information is given in the string, it can be split apart.
strsplit("Age , Height , Weight , Other", split = " , ")
[[1]]
[1] "Age" "Height" "Weight" "Other"
Two things we should notice about this:
1. The string was split using " , " instead of ",". Splitting on "," alone would leave spaces on each resulting string.
2. The result has two sets of indexes, [[1]] and [1]. This is because the result is not a vector, but a different object for
holding data called a list.
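Because the result is a list, the pieces have to be pulled out of it before they can be used as an ordinary character vector. A minimal sketch using only base R (the variable name pieces is chosen here for illustration):

```r
pieces <- strsplit("Age , Height , Weight , Other", split = " , ")
class(pieces)      # the result is a list, not a character vector
pieces[[1]]        # double brackets return the character vector in the first slot
pieces[[1]][2]     # then single brackets pick one string from that vector
unlist(pieces)     # alternatively, flatten the whole list into one vector
```

Double square brackets select what is stored inside a list element, and single square brackets then index the vector that comes back.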

Chapter 6: More Character Operations - stringr
The stringr package adds a number of functions for manipulating strings. Functions built into the stringr package are all
named beginning with "str_". Additionally, the arguments appear in a consistent order from function to function.
library(stringr)
Joining & Splitting Strings
The str_c() function joins multiple strings into one.
stringr::str_c("Number: ", c(1, 2, 3, 4, 5, 6))
[1] "Number: 1" "Number: 2" "Number: 3" "Number: 4" "Number: 5" "Number: 6"
str_c(c("a", "b", "c"), c(1, 2, 3, 4, 5, 6), sep = "_")
[1] "a_1" "b_2" "c_3" "a_4" "b_5" "c_6"
str_c(c("a", "b", "c"), c(1, 2, 3, 4, 5, 6), "X", sep = "_", collapse = "^")
[1] "a_1_X^b_2_X^c_3_X^a_4_X^b_5_X^c_6_X"
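It works in a vectorized fashion, joining elements that have a common index. sep is used to insert a string between the
elements. collapse is used to combine all resulting elements into a single string. (The second and third calls recycle a
length-three vector against a length-six vector. stringr 1.5.0 and later recycle only vectors of length one, so those
versions raise an error for these two calls.)
The str_flatten() function will collapse the separate elements of a vector into a single string.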
str_flatten(string = c("pear", "orange", "berry", "nut"))
[1] "pearorangeberrynut"
str_flatten(string = c("pear", "orange", "berry", "nut"), collapse = " - ")
[1] "pear - orange - berry - nut"
The collapse argument indicates the string to be inserted between each element.
The str_split() function breaks a string into several substrings based upon some given pattern.
str_split(string = "pear - orange - berry - nut", pattern = " - ", n = 3, simplify = FALSE)
[[1]]
[1] "pear" "orange" "berry - nut"
The str_dup() function is a vectorized function that duplicates and concatenates strings. Its second argument indicates
how many times each element of the given vector is duplicated.
str_dup(string = c("pear", "orange", "berry", "nut"), times = c(1, 2, 3, 4))
[1] "pear" "orangeorange" "berryberryberry" "nutnutnutnut"
The str_length() function will return the number of characters in each element of a given vector. This is similar in function
to nchar().
str_length(string = c("pear", "orange", "berry", "nut"))
[1] 4 6 5 3
Substrings - Extracting, Modifying & Padding
In the stringr package, the str_sub() function extracts substrings from a character vector. The substrings will start with
the character in the position indicated by start and end with the character in the position indicated by end. Omitting
either one is taken as an instruction to select from the first character, or to continue to the final character.

x <- c("abcdefg", "ABCDEFGH", "123456789")
str_sub(string = x, start = 2, end = 5)
[1] "bcde" "BCDE" "2345"
str_sub(string = x, end = -4)
[1] "abcd" "ABCDE" "123456"
str_sub(string = x, start = 2, end = -2) <- "A"
x
[1] "aAg" "AAH" "1A9"
A negative number for start or end indicates that the position should be found by counting backwards from the end
of a given string.
The str_pad() function can be used to add characters to a string until the string attains a minimum length. The example
below will add some characters ("Z") to both sides of these foods until they have a minimum length of 6 characters.
str_pad(string = c("pear", "orange", "berry", "nut"), width = 6, side = "both", pad = "Z")
[1] "ZpearZ" "orange" "berryZ" "ZnutZZ"
Characters can be added solely to the "left" or "right" sides of the given strings. The function is vectorized in the pad
argument, which lists a vector of single characters used to extend the given strings.
Whitespace - Trimming & Squishing
When strings are split into substrings, extraneous white space may be included in the resulting substrings.
The str_trim() function will remove extraneous white space on the "left" side, "right" side, or "both" sides of
the given string, depending upon the side argument.
x <- " <--Extra Spaces Here--> <--Here--> "
str_trim(string = x, side = "both")
[1] "<--Extra Spaces Here--> <--Here-->"
str_squish(string = x)
[1] "<--Extra Spaces Here--> <--Here-->"
The str_squish() function removes all extra white space on either end of, and inside of, a given string. In the examples,
pay close attention to the placement of the quotation marks.
Truncation
The str_trunc() function will truncate a given character string and paste an ellipsis onto the result. The width argument
determines the length of the overall result. The width argument minus the length of the ellipsis argument determines the
number of characters kept from the original character string. The ellipsis can be placed on the "right," "left," or "center."
x <- "This string has a lot of characters in it."
str_trunc(string = x, width = 11, side = "right", ellipsis = ".....")
[1] "This s....."
Sorting
The str_sort() function sorts elements of a character vector. The decreasing argument determines whether the sorting is
in decreasing or increasing order according to some system. locale defaults to sorting in English. Other locales can be set.
This will affect the ordering used. na_last indicates where NA values fall in the ordering. numeric controls whether digits
are treated as numbers or as strings.
str_sort(x = c("pear", "orange", NA, "berry", "nut"), decreasing = FALSE, na_last = TRUE,
locale = "en", numeric = FALSE)
[1] "berry" "nut" "orange" "pear" NA
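The effect of the numeric argument is easiest to see on strings that end in digits. The sketch below assumes the stringr package is installed; the file names are hypothetical values chosen for illustration.

```r
library(stringr)

files <- c("file10", "file2", "file1")   # hypothetical file names

# Digits compared character by character, so "file10" sorts before "file2".
as.strings <- str_sort(files, numeric = FALSE)
as.strings

# Runs of digits compared as numbers, so 2 comes before 10.
as.numbers <- str_sort(files, numeric = TRUE)
as.numbers
```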

Chapter 7: Logical Data
Logical Data takes one of TWO values:
1. TRUE (T)
2. FALSE (F)
Logical data can be used to encode the answer to a Yes/No question. Use TRUE for Yes, and FALSE for No. When using
logical data, it must be entered in uppercase letters. Otherwise, R will not recognize it.
Collections of logical data can be held in an ordered (atomic) vector. A logical vector can be created using the concatenate
function c(). Make sure to separate each logical data value with a comma.
c(TRUE, TRUE, FALSE, FALSE)
[1] TRUE TRUE FALSE FALSE
c(T, F, T, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F)
[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
In fact, a single logical value on its own is a vector. It just happens to be a vector with one element.
Logical Operators
Computations using logical data can be performed.
The simplest operation is called logical negation or NOT. It turns a TRUE value into FALSE, and a FALSE value into TRUE.
!TRUE
[1] FALSE
!FALSE
[1] TRUE
!c(TRUE, FALSE)
[1] FALSE TRUE
Other logical operations take two arguments (inputs). A logical AND determines if both inputs are TRUE. A logical OR
determines if at least one input is TRUE. The exclusive logical OR determines if exactly one of the inputs is TRUE. Each one
of these operations results in a logical value.
# logical AND &&
TRUE && TRUE
[1] TRUE
# logical OR ||
TRUE || TRUE
[1] TRUE
xor(TRUE, T)
[1] FALSE
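The operations indicated by && and || are meant for single logical values. Older versions of R looked only at the first element of each vector given to them; since R 4.3.0, giving either one a vector with more than one element is an error.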
These operations can be used in sequence (taking the result of one operation and plugging it into another). You just need to
enclose the first operation inside parentheses.
(TRUE && !TRUE) || TRUE
[1] TRUE

Vectorized Operations
If you want to perform these operations in a pairwise (element-wise) fashion, use the vectorized versions of
these operators.
# Vectorized AND
c(TRUE, TRUE, FALSE, FALSE) & c(T, F, T, F)
[1] TRUE FALSE FALSE FALSE
# Vectorized OR
c(TRUE, TRUE, FALSE, FALSE) | c(T, F, T, F)
[1] TRUE TRUE TRUE FALSE
# exclusive logical OR
xor(c(TRUE, TRUE, FALSE, FALSE), c(T, F, T, F))
[1] FALSE TRUE TRUE FALSE
As you can see, xor() will work with single element vectors or vectors with more elements.
Recycling
For operations with two inputs, if one input has fewer elements than the other, R will recycle (repeat) the shorter vector,
until it is long enough to perform the vectorized operation.
c(TRUE, TRUE, FALSE, FALSE) & c(T, F)
[1] TRUE FALSE FALSE FALSE
In the first line, R repeated the vector with two elements. The second vector was treated like c(T,F,T,F).
c(TRUE, TRUE, FALSE, FALSE) | c(T, F, T)
Warning in c(TRUE, TRUE, FALSE, FALSE) | c(T, F, T): longer object length is not
a multiple of shorter object length
[1] TRUE TRUE TRUE TRUE
In this example, a warning is given. It indicates that the longer vector is not a multiple of the shorter vector, in terms of
length. This is true. The longer vector had a length of 4, while the shorter had a length of three. In this case, the vector with
length three was only partially recycled. To perform the computations, the second vector was treated like c(T,F,T,T).
Any & All
Given a vector of logical data, you may wonder if any or all of the values are TRUE.
any(c(F, T, F, F))
[1] TRUE
any(c(F, F, F, F))
[1] FALSE
all(c(T, T, T, T))
[1] TRUE
all(c(T, T, T, F))
[1] FALSE
One Interesting Fact
In certain contexts, R will treat logical data values as numbers. Namely, TRUE will be treated as one (1) and FALSE will be
treated as zero (0).
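One common use of this fact is counting. The sketch below uses a made-up vector of Yes/No answers; sum() then counts the TRUE values and mean() gives the proportion of TRUE values.

```r
answers <- c(TRUE, FALSE, TRUE, TRUE, FALSE)   # hypothetical Yes/No responses

n.yes <- sum(answers)      # each TRUE counts as 1, each FALSE as 0
n.yes

prop.yes <- mean(answers)  # proportion of TRUE values
prop.yes

TRUE + TRUE                # arithmetic also coerces logicals to numbers
```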

Chapter 8: Comparisons
Given two data values (logical, numeric, character), it will be important to determine how they relate to each other. Are
these values the same? Is one greater than the other? Does one come before the other?
The comparison operators can be used to answer these questions. They take any two data values, and return a logical value
indicating the validity of the comparison.
As with many of the operators we have seen before, these operators are vectorized, and will recycle shorter vectors.
Ordering (<,<=,>=,>)
To order numeric values, we ask if one value is less than < or greater than > another value. If we want to include equality,
add on an equal sign. Consider the possible comparisons between the numbers 0 and 1.
# less than ?Comparison
0 < 1
[1] TRUE
1 < 0
[1] FALSE
# less than or equal to
0 <= 1
[1] TRUE
1 <= 0
[1] FALSE
# greater than ( or equal to)
c(0, 1) > 1
[1] FALSE FALSE
c(0, 1, 1, 0) >= c(1, 0)
[1] FALSE TRUE TRUE TRUE
In the last line, realize that the (1,0) vector on the right hand side is being recycled. The comparisons that are being
performed look (invisibly) like (0,1,1,0) >= (1,0,1,0).
These ordering operators are not limited to comparing numeric data only. They can be used on any of the data types.
Strings will use an alphabetic ordering according to the locale setup in your computer.
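A few non-numeric comparisons, as a small sketch. The strings are all lowercase, so the results should not depend on the locale in common setups; mixed-case comparisons can differ between locales.

```r
chr.result <- "apple" < "banana"   # "apple" comes before "banana" alphabetically
chr.result

lgl.result <- FALSE < TRUE         # FALSE is treated as 0 and TRUE as 1
lgl.result

vec.result <- c("a", "c") >= "b"   # vectorized; "b" is recycled for each element
vec.result
```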
Equality
Checking whether two values, or vectors, are the same can be tricky. The obvious choice for checking equality would be the
equal sign =. However, the equal sign on its own performs a different function. The appropriate operator is a double equal
sign ==.
7 == 14/2
[1] TRUE
c("a", "b") == c("a", "b", "a", "b")
[1] TRUE TRUE TRUE TRUE
c("a", "b") == c("a", "B", "a", "Horse")
[1] TRUE FALSE TRUE FALSE
The double equal sign will return a logical value, or vector of logical values. It is vectorized and will recycle a shorter vector
if needed. As we can see, it will compare character vectors. It will also compare logical vectors. In each case, it compares
corresponding elements.
The double equal sign does have some things to be aware of:
1. == looks for exact equality in the computer's representation of a value.
Because of how a computer stores numbers, two values that mathematically are the same, may not be the same when
stored. Consider the square root of 2.
sqrt(2) * sqrt(2) == 2L
[1] FALSE
sqrt(2) * sqrt(2)
[1] 2
Theoretically, this should be TRUE. A computer approximates the square root of 2. When this approximation is squared, a
small amount of error is included, so the result only approximates 2. As a remedy, we might ask if two values are almost equal or nearly
equal. This can be done using the all.equal() function in base R or the near() function included in the dplyr package. Both
of these functions have a small tolerance for some difference between two values.
all.equal(sqrt(2) * sqrt(2), 2)
[1] TRUE
dplyr::near(sqrt(2) * sqrt(2), 2)
[1] TRUE
Both functions can be used on vectors of values, but the outputs are very different. all.equal() will return TRUE if the
differences in all pairs are within the given tolerance. Otherwise, it returns a string giving the mean relative difference.
all.equal(c(1.000000004, 2.000000009), c(1.000004, 2.000000008))
[1] "Mean relative difference: 1.332333e-06"
dplyr::near() will return a logical vector containing the results for each pairing.
dplyr::near(c(1.000000004, 2.000000009), c(1.004, 2.000000008))
[1] FALSE TRUE
2. == does not compare the vectors as a whole.
c("a", "b") == c("a", "b", "a", "b")
[1] TRUE TRUE TRUE TRUE
If we allow for recycling of elements in the shorter vector, the elements of these two vectors match up. But, as a whole,
these vectors are not the same. As a means of checking if any two R objects (not just vectors) are the same, we can use the
identical() function.
identical(c("a", "b"), c("a", "b", "a", "b"))
[1] FALSE
This function will determine if two objects are exactly the same. It will check many things not visible to the eye.
identical(2, 2L) #These are different types of numerical data.
[1] FALSE
identical(2, "2") #These are different types of data
[1] FALSE
2 == "2" #These are different types of data, but one is coerced into the other type.
[1] TRUE
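Since all.equal() returns a string rather than FALSE when the values differ, its result should not be used directly as a logical value. A minimal sketch of the usual base R remedy, wrapping the call in isTRUE(), reusing the vectors from the example above:

```r
x <- c(1.000000004, 2.000000009)
y <- c(1.000004, 2.000000008)

result <- all.equal(x, y)
is.character(result)             # the "not equal" answer is a string, not FALSE

nearly.same <- isTRUE(all.equal(x, y))   # converts the answer to a single logical
nearly.same

root.check <- isTRUE(all.equal(sqrt(2) * sqrt(2), 2))
root.check
```

isTRUE() returns TRUE only when its input is exactly the single value TRUE, so any descriptive string becomes FALSE.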

Chapter 9: Vectors
We have encountered four types of (atomic) vectors, depending on the type of data they contain:
1. Logical
2. Numerical - Integer
3. Numerical - Floating Point/Double
4. Character
If more than one type of data is included in an (atomic) vector, R will coerce all the data into a single type. typeof() will
identify the type of vector.
typeof(c(1, 2, 3, 4, TRUE))
[1] "double"
typeof(c("A", "B", "Character", 6, FALSE))
[1] "character"
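The coercion follows a fixed direction: logical values become integers, integers become doubles, and anything becomes character when a string is present. A short sketch with made-up values:

```r
t1 <- typeof(c(TRUE, 2L))     # logical + integer   -> "integer"
t2 <- typeof(c(2L, 3.5))      # integer + double    -> "double"
t3 <- typeof(c(3.5, "a"))     # double + character  -> "character"
c(t1, t2, t3)

mixed <- c(TRUE, FALSE, 3)    # TRUE becomes 1 and FALSE becomes 0
mixed
```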
Vector Attributes
Besides the data explicitly contained in a vector, other pieces of information can be attached to the vector. Names can be
given to each data value. Names can be added immediately as the data is being entered.
exampleData <- c(data1 = 2.2, data2 = 9, data3 = -1)
exampleData
data1 data2 data3
2.2 9.0 -1.0
If the names need to be attached, or updated, this can be done using the names() function. It will require a character vector
with a name for each piece of data. The names should appear in the same order as the data they should be attached to.
names(exampleData) <- c("newName1", "newName2", "newName3")
exampleData
newName1 newName2 newName3
2.2 9.0 -1.0
Comments can be added to describe the contents of the vector using the comment() function. Comments should be written
as strings. The comment() function also displays the comments.
comment(exampleData) <- "This is a comment."
comment(exampleData)
[1] "This is a comment."
The attributes() function can be used to display a summary of these additional attributes.
attributes(exampleData)
$names
[1] "newName1" "newName2" "newName3"

$comment
[1] "This is a comment."
Numerical Sequences
A numerical vector containing a regular sequence of numbers is often needed. The most basic is a sequence of consecutive
integers. This is generated with a colon separating the first and last integer in the sequence.
sequentialIntegers <- 3:8
sequentialIntegers
[1] 3 4 5 6 7 8

For other regular sequences, you want to use the seq() function. It works by specifying three of these four arguments:
1. from / to - the starting / ending values; at least one of these is needed.
2. by - the size of the increment between values.
3. length.out - the number of values to be produced.
seq(from = 4, to = 6, length.out = 9)
seq(from = 4, to = 6.1, by = 0.25)
seq(from = 4, by = 0.25, length.out = 9)
seq(to = 6, by = 0.25, length.out = 9)
All of these produce the same result.
[1] 4.00 4.25 4.50 4.75 5.00 5.25 5.50 5.75 6.00
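The claim that the four calls agree can be checked directly. The sketch below stores each result under a hypothetical name and compares them with all.equal(), which tolerates tiny floating point differences.

```r
s1 <- seq(from = 4, to = 6, length.out = 9)
s2 <- seq(from = 4, to = 6.1, by = 0.25)   # stops at 6 because 6.25 would pass 6.1
s3 <- seq(from = 4, by = 0.25, length.out = 9)
s4 <- seq(to = 6, by = 0.25, length.out = 9)

# TRUE only if every sequence matches the first one within tolerance.
all.same <- isTRUE(all.equal(s1, s2)) &&
  isTRUE(all.equal(s1, s3)) &&
  isTRUE(all.equal(s1, s4))
all.same
```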
Replicating Vectors
Sometimes, values need to be repeated. The rep() function creates a new vector whose elements are replications of a given
vector.
givenVector <- c("a", 9)
rep(givenVector, times = 3)
[1] "a" "9" "a" "9" "a" "9"
rep(givenVector, each = 3, times = 2)
[1] "a" "a" "a" "9" "9" "9" "a" "a" "a" "9" "9" "9"
rep(givenVector, each = 3, times = 2, length.out = 10)
[1] "a" "a" "a" "9" "9" "9" "a" "a" "a" "9"
Combining Vectors
New vectors can be created by concatenating using variable names.
x <- 7:4 #Sequential Vector
y <- rep(9, times = 3) #Replicated Vector
newVector <- c(x, y, 10, x)
newVector
[1] 7 6 5 4 9 9 9 10 7 6 5 4
This code chunk creates a vector x containing 7 down to 4, and a vector y replicating the number 9 three times. Finally, the
vector newVector is created with elements from x, then from y, followed by 10, and finishing with the elements from x.
Placeholder Vectors
When computations are repeated with different values, the results will need to be stored. If you know how many different
sets of computations are to be performed, it can be useful to have a vector that can be filled in with the results.
resultsVector1 <- vector(mode = "logical", length = 5)
resultsVector2 <- rep(NA, length.out = 5) # NA stands for Not Available. It is used for missing values.
resultsVector1
resultsVector2
[1] FALSE FALSE FALSE FALSE FALSE
[1] NA NA NA NA NA
The mode in the vector() function can be replaced with "integer", "double", or "character".
Vector Length
The number of elements in a given vector can be determined with the length() function.
length(resultsVector1)
[1] 5
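A placeholder vector is typically filled in one position at a time. The sketch below is one possible pattern, not the course's own code: it uses a for loop and the base R function seq_along(), neither of which has been introduced yet, and the name squares is hypothetical.

```r
# Placeholder for five numeric results (initially all zeros).
squares <- vector(mode = "numeric", length = 5)

# seq_along(squares) produces the indexes 1, 2, ..., length(squares).
for (i in seq_along(squares)) {
  squares[i] <- i^2    # store the i-th result in the i-th position
}
squares
```

Creating the vector at its full length first means R does not have to grow it on every pass through the loop.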

Chapter 10: Filtering & Subsetting Vectors
Given any type of vector with any number of elements, at times you will want to work with only a smaller portion of the
elements in the given vector. This can be done with numerical values or with logical values.
Numerical Subsetting
To indicate which portion of a given vector you want to use, R needs to be given the indexes, or positions, of the desired
elements. The index is the numerical position as you count through the given vector.
givenVector <- c("a", "b", "c", "d", "e", "F", "G", "H", "I", "J", "k", "l", "m",
"n", "o", "P", "q", "r", "s", "T", "a", "b", "c", "a")
givenVector
[1] "a" "b" "c" "d" "e" "F" "G" "H" "I" "J" "k" "l" "m" "n" "o" "P" "q" "r" "s"
[20] "T" "a" "b" "c" "a"
To determine the index of a value, count which element it is. R helps with this. At the head of each line of output, an integer is
shown in square brackets. That integer indicates the index of the first element in that line. The capital T has an index of 20.
To select a single element from a vector, follow a straightforward syntax. Start with the name of the vector, followed by a
set of square brackets that contain the index of the desired element.
# Selecting the 20th element
givenVector[20]
[1] "T"
A subset does not need to be restricted to a single element. A numerical vector containing multiple values can be used.
# Selecting the 1st, 2nd, 20th, 23rd, and 24th elements
givenVector[c(1, 2, 20, 23, 24)]

selectThese <- c(1, 2, 20, 23, 24)
givenVector[selectThese]
[1] "a" "b" "T" "c" "a"
It may be that you want to select all elements from a given vector except for a given set. Use the same procedure as before,
but use negative indexes.
# Selecting everything except the 1st, 2nd, 20th, 23rd, and 24th elements
givenVector[-c(1, 2, 20, 23, 24)]

givenVector[-selectThese]
Either of these will produce:
[1] "c" "d" "e" "F" "G" "H" "I" "J" "k" "l" "m" "n" "o" "P" "q" "r" "s" "a" "b"
The which() function can be used to determine the indexes of all elements that satisfy a certain condition.
# Determine the index where 'a' elements are located.
which(givenVector == "a")
[1] 1 21 24
LOCATEa <- which(givenVector == "a")
givenVector[LOCATEa]
[1] "a" "a" "a"
Subsetting by Name
When the elements of a vector have been given names, a character vector of names placed inside the square brackets selects the matching values.
exampleData <- c(data1 = 2.2, data2 = 9, data3 = -1)
exampleData[c("data2", "data3")]
data2 data3
9 -1
Reordering by filtering
The values in the index vector do not have to be arranged in increasing order. They can be used to reorder all, or a
subset, of a vector.
givenVector <- c("z", "y", "x", "a", "b", "c")

givenVector[c(3, 2, 1)] #List the third, then second, then first value
[1] "x" "y" "z"
The order() function returns the indexes of all elements, arranged so that the elements can be put in order.
order(givenVector)
[1] 4 5 6 3 2 1
alphabeticalOrder <- order(givenVector)
givenVector[alphabeticalOrder]
[1] "a" "b" "c" "x" "y" "z"
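The following sketch extends the idea in two directions: sort() returns the ordered values directly, and the decreasing argument of order() reverses the ordering. The variable name reverseOrder is only illustrative.

```r
givenVector <- c("z", "y", "x", "a", "b", "c")

# sort() returns the ordered values directly.
sort(givenVector)
# [1] "a" "b" "c" "x" "y" "z"

# order() with decreasing = TRUE gives the indexes for reverse order.
reverseOrder <- order(givenVector, decreasing = TRUE)
reverseOrder
# [1] 1 2 3 6 5 4
givenVector[reverseOrder]
# [1] "z" "y" "x" "c" "b" "a"
```

The index vector from order() is most useful when the same ordering has to be applied to a second, related vector.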
Logical Subsetting
When using logic to subset/filter a vector, you must provide a TRUE or FALSE value for each element in the given vector.
givenVector <- 1:6
givenVector
[1] 1 2 3 4 5 6
selectThese <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
givenVector[selectThese]
[1] 1 3 5
Using this logical vector, only the odd numbers were selected from the numbers 1 through 6. The same result can be produced if we set
up a comparison that will produce the same vector of TRUE and FALSE values. This comparison will determine if the
remainder is 1 when each given value is divided by two.
givenVector%%2 == 1
[1] TRUE FALSE TRUE FALSE TRUE FALSE
selectOdd <- givenVector%%2 == 1
givenVector[selectOdd]
[1] 1 3 5
A more readable version of this can be seen if you use the subset() function.
# Argument #1 is the vector to subset. Argument #2 is the subset criteria.
subset(givenVector, givenVector%%2 == 1)
[1] 1 3 5
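Comparisons can also be combined before they are used as a filter. The sketch below assumes the element-wise operators & (and) and | (or), which are not otherwise discussed in this chapter. The names oddAndLarge and extremes are illustrative.

```r
givenVector <- 1:10

# & keeps elements where both comparisons are TRUE.
oddAndLarge <- givenVector[givenVector %% 2 == 1 & givenVector > 4]
oddAndLarge
# [1] 5 7 9

# | keeps elements where at least one comparison is TRUE.
extremes <- givenVector[givenVector < 3 | givenVector > 8]
extremes
# [1]  1  2  9 10
```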
Replacing elements by filtering
When some elements in a vector need updating, filtering can be used to identify the elements, and then new values can be
assigned.
dat <- c(1, 2, 3, NA, NA, 6, NA) # NA represents missing values in positions 4,5, and 7.
dat
[1] 1 2 3 NA NA 6 NA
dat[c(4, 5, 7)] <- c(104, 105, 107) # Replace the NA values
dat
[1] 1 2 3 104 105 6 107
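The positions to replace do not have to be typed in by hand. As a sketch, a comparison can locate the elements and the assignment then updates only those. The vector scores and its values are made up for illustration.

```r
scores <- c(55, 102, 98, -3, 70)

# Cap any value above 100 at 100.
scores[scores > 100] <- 100

# Raise any negative value to 0.
scores[scores < 0] <- 0

scores
# [1]  55 100  98   0  70
```

A single replacement value is recycled across every selected position, so the number of matching elements does not need to be known in advance.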

Chapter 11: Matrices & Arrays
Matrices and Arrays are more general versions of vectors. If you think of a vector as a set of numbers written on a line, then
think of a matrix as a set written out on a grid. If you start stacking matrices, then a three dimensional array is produced.
However, from R's point of view, a matrix or an array is a vector with a dimension attribute. The dimension information
indicates how the values are indexed.
Matrices
To construct a matrix, the matrix() function is used. Its first argument is the data to be included, in the form of a vector.
The next two arguments, nrow and ncol, indicate the number of rows and number of columns included. Only one of these is
required; the other can be found from the length of the data vector. The last argument, byrow, indicates how the
matrix is filled in.
dat <- 1:24
mat.byrow.FALSE <- matrix(data = dat, nrow = 4, ncol = 6, byrow = FALSE)
mat.byrow.FALSE
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 5 9 13 17 21
[2,] 2 6 10 14 18 22
[3,] 3 7 11 15 19 23
[4,] 4 8 12 16 20 24
In this example, byrow is set to FALSE. Therefore, the columns are filled first from top to bottom, moving from left to right.
Looking at the margins of the matrix, the square bracketed values indicate the two part index for each value. An individual
value is indicated by its row index, followed by a column index. The values should be separated by a comma. Vectors of
multiple values can be used to filter a matrix.
# Values found in rows 1 and 4 AND columns 2, 3, 4, and 5.
mat.byrow.FALSE[c(1, 4), 2:5]
[,1] [,2] [,3] [,4]
[1,] 5 9 13 17
[2,] 8 12 16 20
# No values from rows 2 or 3 are included. Neither are values from columns 1 or 6.
If all values in a subset of rows are desired, the column indices don’t need to be specified. However, the comma cannot be
omitted. If a subset of columns is desired, the row indices can be omitted.
# Rows 1 and 4, all columns
mat.byrow.FALSE[c(1, 4), ]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 5 9 13 17 21
[2,] 4 8 12 16 20 24
# Columns 2 and 3, all rows
mat.byrow.FALSE[, c(2, 3)]
[,1] [,2]
[1,] 5 9
[2,] 6 10
[3,] 7 11
[4,] 8 12

If the byrow argument is set to TRUE, the rows are filled from left to right, starting with the top row and moving down. A
final argument, dimnames, allows names to be attached to the rows and/or columns. The names need to be supplied as a list.
Lists have not been described yet; they are covered in Chapter 15.
rowlabel <- c("row1", "row2", "row3", "row4")
collabel <- c("column1", "column2", "column3", "column4", "column5", "column6")
mat.labels <- list(row = rowlabel, col = collabel)
dat <- 1:24
mat.byrow.TRUE <- matrix(data = dat, nrow = 4, ncol = 6, byrow = TRUE, dimnames = mat.labels)
mat.byrow.TRUE
col
row column1 column2 column3 column4 column5 column6
row1 1 2 3 4 5 6
row2 7 8 9 10 11 12
row3 13 14 15 16 17 18
row4 19 20 21 22 23 24
To use the row or column names to filter elements from a matrix, use two character vectors whose elements list the names
of the desired rows and columns.
mat.byrow.TRUE[c("row1", "row4"), c("column2", "column3", "column4", "column5")]
col
row column2 column3 column4 column5
row1 2 3 4 5
row4 20 21 22 23
As was said earlier, a matrix is a vector with an additional dimension attribute. The dimension of a matrix is a vector with
the number of rows followed by the number of columns. Both matrices that were constructed have 4 rows and 6 columns.
dim(mat.byrow.FALSE)
[1] 4 6
dim(mat.byrow.TRUE)
[1] 4 6
Looking at the structure of either of these, you will see the data listed as a vector, along with index ranges running from 1
to 4 for the rows and 1 to 6 for the columns, which explicitly indicate the dimension.
str(mat.byrow.TRUE)
int [1:4, 1:6] 1 7 13 19 2 8 14 20 3 9 ...
- attr(*, "dimnames")=List of 2
..$ row: chr [1:4] "row1" "row2" "row3" "row4"
..$ col: chr [1:6] "column1" "column2" "column3" "column4" ...
As with a vector, a matrix can only hold one type of data: logical, numeric, or character. From the method of construction,
this should be clear. A matrix is built from a vector, which can only hold one type of data.
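The claim that a matrix is a vector with a dimension attribute can be checked directly. In this sketch, a dimension is attached to a plain vector with dim() and the result is compared to a matrix built with matrix(). The names x and m are illustrative.

```r
x <- 1:24

# Attaching a dimension attribute turns the vector into a 4 x 6 matrix.
dim(x) <- c(4, 6)

# ncol = 6 is inferred from the length of the data.
m <- matrix(data = 1:24, nrow = 4)

is.matrix(x)
# [1] TRUE
all(x == m)
# [1] TRUE
```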

Arrays
A matrix is a two dimensional example of an array. Arrays can have as many dimensions as you like. However, beyond three
dimensions, it becomes difficult to picture their structure. A three dimensional array can be pictured as at
least two matrices stacked on top of each other. With this in mind, each element would be described by its row and
column in one of these matrices. A third value would be needed to describe which matrix in the stack.
Creating an array is similar to creating a matrix. A vector of data values is needed. This is passed to the array() function. A
second argument, dim, indicates the three dimensions. Each matrix is filled column by column, from the first row to the last, starting with
the first column and moving right. Once the first matrix is filled, the second matrix in the stack is filled in the same way. Care
must be taken to make sure the original data is in the desired order.
dat <- c(1:20, 101:120, 1001:1020)
arr <- array(data = dat, dim = c(4, 5, 3))
arr
, , 1

[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20

, , 2

[,1] [,2] [,3] [,4] [,5]
[1,] 101 105 109 113 117
[2,] 102 106 110 114 118
[3,] 103 107 111 115 119
[4,] 104 108 112 116 120

, , 3

[,1] [,2] [,3] [,4] [,5]
[1,] 1001 1005 1009 1013 1017
[2,] 1002 1006 1010 1014 1018
[3,] 1003 1007 1011 1015 1019
[4,] 1004 1008 1012 1016 1020
When displayed, each ‘stack’ is indicated by ‘, , stack number’. With enough data, this can become very unmanageable to
look through.
Filtering elements from an array works as it does for a matrix. In three dimensions, you list the row, the column, and then the
stack. Leaving out a set of indices indicates that all indices are desired. Depending upon the selection, R may return a
vector, a matrix, or an array, whichever form of data is simplest.
arr[1:2, 2:4, c(1, 3)] # An Array
, , 1

[,1] [,2] [,3]
[1,] 5 9 13
[2,] 6 10 14

, , 2

[,1] [,2] [,3]
[1,] 1005 1009 1013
[2,] 1006 1010 1014
arr[, , 3] # A matrix
[,1] [,2] [,3] [,4] [,5]
[1,] 1001 1005 1009 1013 1017
[2,] 1002 1006 1010 1014 1018
[3,] 1003 1007 1011 1015 1019
[4,] 1004 1008 1012 1016 1020
arr[1:3, 4, 2] # A vector
[1] 113 114 115
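If the simplification is not wanted, the drop argument of the square brackets can switch it off. This argument is not discussed in the notes, so the sketch below is an addition rather than part of the original material.

```r
arr <- array(data = c(1:20, 101:120, 1001:1020), dim = c(4, 5, 3))

# By default the result is simplified to a plain vector with no dimension.
dim(arr[1:3, 4, 2])
# NULL

# With drop = FALSE the result remains a three dimensional array.
dim(arr[1:3, 4, 2, drop = FALSE])
# [1] 3 1 1
```

Keeping the dimensions can matter when later code expects a matrix or an array regardless of how many rows or columns were selected.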
The str() and dim() functions play the same role they did for matrices. Names can be added, if desired. However, they
could become very difficult to work with.
str(arr)
int [1:4, 1:5, 1:3] 1 2 3 4 5 6 7 8 9 10 ...
dim(arr)
[1] 4 5 3
As with a vector and a matrix, an array can only hold one type of data: logical, numeric, or character.
An example of a four dimensional array is given below. Think of it as at least two three dimensional arrays stacked.
dat1 <- c(1:20, 101:120, 1001:1020)
dat2 <- c(21:40, 221:240, 2021:2040)
arr.4d <- array(data = c(dat1, dat2), dim = c(4, 5, 3, 2))
arr.4d
, , 1, 1

[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20

, , 2, 1

[,1] [,2] [,3] [,4] [,5]
[1,] 101 105 109 113 117
[2,] 102 106 110 114 118
[3,] 103 107 111 115 119
[4,] 104 108 112 116 120

, , 3, 1

[,1] [,2] [,3] [,4] [,5]
[1,] 1001 1005 1009 1013 1017
[2,] 1002 1006 1010 1014 1018
[3,] 1003 1007 1011 1015 1019
[4,] 1004 1008 1012 1016 1020

, , 1, 2

[,1] [,2] [,3] [,4] [,5]
[1,] 21 25 29 33 37
[2,] 22 26 30 34 38
[3,] 23 27 31 35 39
[4,] 24 28 32 36 40

, , 2, 2

[,1] [,2] [,3] [,4] [,5]
[1,] 221 225 229 233 237
[2,] 222 226 230 234 238
[3,] 223 227 231 235 239
[4,] 224 228 232 236 240

, , 3, 2

[,1] [,2] [,3] [,4] [,5]
[1,] 2021 2025 2029 2033 2037
[2,] 2022 2026 2030 2034 2038
[3,] 2023 2027 2031 2035 2039
[4,] 2024 2028 2032 2036 2040

Chapter 12: Constant Values and Vectors
Several useful values and vectors are built into the base package. These relieve you from needing to type them in
manually or load them from an outside source.
Pi
The numerical constant π can be obtained by using the pi variable.
pi
[1] 3.141593
Letters
The 26 letters of the Roman alphabet are stored, in lower case and upper case form, in the two vectors letters and LETTERS.
letters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
LETTERS
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z"
Months
The months of the year are also accessible.
month.abb
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
month.name
[1] "January" "February" "March" "April" "May" "June"
[7] "July" "August" "September" "October" "November" "December"
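Because these constants are ordinary vectors, they can be indexed like any other vector. The short sketch below combines them with the subsetting tools from Chapter 10. It uses the base function match(), which is not covered in the notes, and the names position and circleArea are illustrative.

```r
# Select the winter months by index.
month.name[c(12, 1, 2)]
# [1] "December" "January"  "February"

# Find the position of an abbreviation, then look up the full name.
position <- match("Sep", month.abb)
position
# [1] 9
month.name[position]
# [1] "September"

# Area of a circle with radius 2, using the built-in constant pi.
circleArea <- pi * 2^2
```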
Sometimes, it is useful to have a larger collection of strings to experiment on. By loading the stringr package, you are given
access to several interesting character vectors.
Fruit, Words, and Sentences
fruit is a character vector containing 80 names of fruit.
library(stringr)
fruit[1:10]
[1] "apple" "apricot" "avocado" "banana" "bell pepper"
[6] "bilberry" "blackberry" "blackcurrant" "blood orange" "blueberry"
words is a character vector containing 980 words.
words[1:10]
[1] "a" "able" "about" "absolute" "accept" "account"
[7] "achieve" "across" "act" "active"

Sentences
sentences is a character vector containing 720 sentences.
sentences[1:10]
[1] "The birch canoe slid on the smooth planks."
[2] "Glue the sheet to the dark blue background."
[3] "It's easy to tell the depth of a well."
[4] "These days a chicken leg is a rare dish."
[5] "Rice is often served in round bowls."
[6] "The juice of lemons makes fine punch."
[7] "The box was thrown beside the parked truck."
[8] "The hogs were fed chopped corn and garbage."
[9] "Four hours of steady work faced us."
[10] "Large size in stockings is hard to sell."
A Lorem Ipsum Generator
stri_rand_lipsum() is a Lorem Ipsum generator included in the stringi package. It will produce strings of dummy
text. Its main purpose is to produce text that is useful for considering page layouts, without being distracted by the actual
content.
library(stringi)
stri_rand_lipsum(n_paragraphs = 3, start_lipsum = TRUE)
[1] "Lorem ipsum dolor sit amet, odio ipsum arcu posuere nec sit tristique. Proin at ac maecenas platea libero, sapien at mi.
Sed orci a enim. Tristique, ac, mus per magna mauris nec nascetur et porta nec faucibus. Id at, nascetur sodales id vitae vel
dapibus. Metus mi, magna nunc eu nulla blandit. Faucibus tristique sapien curae sed hendrerit eu. Maximus, integer ac per
accumsan odio ex id nulla ut vel facilisis. Eros, quisque netus congue aenean ut senectus."
[2] "Mauris blandit in sed ipsum, at leo tincidunt. Primis, luctus facilisi nisl, facilisi taciti curae. Gravida per, amet
purus sem at tempus sociosqu non litora felis inceptos aptent facilisis. Ac donec tellus turpis congue. Risus, a eu in magnis
ultricies nulla nibh. Metus vitae diam amet, curae pulvinar dolor ipsum sit. Erat aliquam eget urna congue magna ut, purus
phasellus dictum euismod. Vitae ante cras vulputate tellus eu. Velit sodales nullam mauris nunc tortor curae in feugiat sit sit
elementum, sed."
[3] "Felis rutrum, odio blandit aenean ultricies, diam! Proin euismod mus sed molestie ac at, pellentesque duis nam nisi. Orci
lectus hendrerit ligula eros accumsan. Tempor sociosqu per fringilla nisl duis vitae vestibulum fusce vel etiam ut porttitor,
sed ad cubilia. Scelerisque donec sed amet placerat, aptent libero non sed. Eros in pretium nec velit, nisi. Commodo quam
commodo dictum aliquam fusce at curabitur habitant. Sed ut nunc sed in massa, rhoncus integer blandit. Dapibus id a elit in. In
bibendum quis curabitur sit lacus ante. Lorem purus condimentum phasellus arcu vehicula mi lobortis commodo dui dui."
The first argument determines the number of paragraphs to generate. The second indicates whether the text should start
with ‘Lorem ipsum dolor sit amet.’
Colors
While not itself a vector, the colors() function returns a vector containing all the named colors in R. When producing graphics,
these names can be used to set the color of a graphical object. There are 657 available to choose from.
clrs <- colors()
clrs[1:20]
[1] "white" "aliceblue" "antiquewhite" "antiquewhite1"
[5] "antiquewhite2" "antiquewhite3" "antiquewhite4" "aquamarine"
[9] "aquamarine1" "aquamarine2" "aquamarine3" "aquamarine4"
[13] "azure" "azure1" "azure2" "azure3"
[17] "azure4" "beige" "bisque" "bisque1"

Chapter 13: Infinity and Undefined Values
Some computations result in outcomes that are too large to store or are undefined mathematically.
Infinity
In some situations, the result of a computation is larger (in absolute value) than can be stored. R will store this value as ±Inf,
either a positive or negative infinity, depending on the case.
2^9999
[1] Inf
-10^1000
[1] -Inf
While dividing a non-zero value by zero is undefined in some contexts, it is often useful to assign an infinite value to this
result. Depending on the sign of the numerator, R will indicate this with either a positive or negative infinity.
1/0
[1] Inf
-pi/0
[1] -Inf
As far as R is concerned, these infinities are numerical values. Standard arithmetic operations can be performed with them,
and a consistent result will be given.
typeof(Inf)
[1] "double"
Inf + 1
[1] Inf
10/Inf
[1] 0
Testing for an infinity can be done a few ways. The == operator will detect a specific infinity, either positive or negative.
is.infinite() will detect either type.
x <- c(1, -1)/0
x == Inf
[1] TRUE FALSE
x == -Inf
[1] FALSE TRUE
is.infinite(x)
[1] TRUE TRUE
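To see where the boundary lies, the sketch below uses .Machine$double.xmax, the largest finite double on the current machine, and is.finite(), the counterpart of is.infinite(). Neither appears in the notes, and the name largest is illustrative.

```r
# The largest finite double that R can store on this machine.
largest <- .Machine$double.xmax
is.finite(largest)
# [1] TRUE

# Doubling it overflows to positive infinity.
largest * 2
# [1] Inf

# is.finite() is FALSE for both infinities.
is.finite(c(1, Inf, -Inf))
# [1]  TRUE FALSE FALSE
```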

Not a Number
Some operations are undefined and should not produce a meaningful numeric result. This can occur when you are
subtracting infinite values, or multiplying/dividing infinite values and zero. When the result should not be a meaningful
numeric result, R indicates that the result is NaN, ‘Not a Number.’
Inf - Inf
[1] NaN
Inf * 0
[1] NaN
0/0
[1] NaN
Inf/Inf
[1] NaN
While NaN is not a meaningful numerical value, R will still allow arithmetic operations to be performed on it. The result
will also be NaN.
typeof(NaN)
[1] "double"
NaN + 1
[1] NaN
However, identifying a NaN value cannot be done using the usual == double equal sign. Instead, the is.nan() function
must be used.
NaN == NaN
[1] NA
is.nan(NaN)
[1] TRUE
If you want to identify both NaN values and missing values (NA), is.na() can be used.
NaNNA <- c(NaN, NA, 2)
is.na(NaNNA)
[1] TRUE TRUE FALSE
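The three testing functions answer slightly different questions. As a side-by-side sketch on one made-up vector, values:

```r
values <- c(1, NaN, NA, Inf)

is.nan(values)     # TRUE only for NaN
# [1] FALSE  TRUE FALSE FALSE
is.na(values)      # TRUE for both NaN and NA
# [1] FALSE  TRUE  TRUE FALSE
is.finite(values)  # TRUE only for ordinary numbers
# [1]  TRUE FALSE FALSE FALSE
```

Here is.finite() is a base function not otherwise covered in this chapter. It is a convenient single check when only ordinary numbers should be kept.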

Chapter 14: Missing Values & Vectors
When a data value is missing for some reason, it still needs to be recorded that a value should be listed there. In R, NA can
be used to indicate a missing data value. It stands for ‘Not Available.’
NA
[1] NA
typeof(NA)
[1] "logical"
As you can see, NA on its own is actually a third type of logical value, the other two being TRUE and FALSE. However,
when NA is used in a vector containing some other data type, it will be coerced/changed into the appropriate data type, be it
numeric or character data.
givenVector <- c(1, 2, 3, NA, NA)
typeof(givenVector[4])
[1] "double"
Operations with Missing Values
Performing operations with NA values can be tricky. The NA value will infect the operations, and generally lead to a result
of NA. This is something to be aware of.
givenVector + 1
[1] 2 3 4 NA NA
0 * NA
[1] NA
Many functions have a logical argument na.rm. If this argument is set to TRUE, missing values will be ignored.
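As a short sketch of this argument in use, mean() and sum() both accept na.rm:

```r
givenVector <- c(1, 2, 3, NA, NA)

mean(givenVector)                # the NA values make the result NA
# [1] NA
mean(givenVector, na.rm = TRUE)  # mean of 1, 2, and 3
# [1] 2
sum(givenVector, na.rm = TRUE)
# [1] 6
```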
Detecting Missing Values
The == double equal sign operator can’t be used to detect NA values. It will return a value of NA.
NA == NA
[1] NA
As a workaround, two functions are provided that can logically check for NA values. is.na() is a vectorized function that
will check each element of a vector to see if it is NA and return TRUE/FALSE for each. anyNA() will return a single logical
value. If at least one value is NA, TRUE will be returned.
is.na(NA)
[1] TRUE
is.na(givenVector)
[1] FALSE FALSE FALSE TRUE TRUE
anyNA(givenVector)
[1] TRUE

Replacing Missing Values
Certain situations require replacing NA values with another value. If all NA values will be replaced with the same value, the
replacement is done with a short line of code.
# Replace the NA values in givenVector with -100.
givenVector
[1] 1 2 3 NA NA
givenVector[is.na(givenVector)] <- -100
givenVector
[1] 1 2 3 -100 -100
The first line of code displays givenVector. The last two elements are NA. The second line of code does the replacement. As
it is written, it can be tricky to read. is.na(givenVector) produces a vector of TRUE/FALSE values. Placing that vector
inside the square brackets filters the values in givenVector corresponding to TRUE. Finally, the assignment operator
assigns -100 to those two values.
NULL
NA is used to indicate that a value within a vector is absent. The NULL value is used to indicate the absence of a vector.
Sometimes, NULL will be the result of an undefined computation. Sometimes, NULL will be used to erase some
information.
length(NULL)
[1] 0
typeof(NULL)
[1] "NULL"
Detecting a NULL value is done with the is.null() function.
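A brief sketch of the check, together with one consequence of NULL being the absence of a vector: it contributes nothing when vectors are combined. The name combined is illustrative.

```r
is.null(NULL)
# [1] TRUE
is.null(NA)     # NA is a value, not the absence of one
# [1] FALSE

# NULL contributes nothing when vectors are combined.
combined <- c(1, NULL, 2)
length(combined)
# [1] 2
```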

Chapter 15: Lists
Lists provide another way to store data. Lists differ from vectors in that lists can hold many different types of data.
Making a List
As a basic example, three vectors are created, each of a different type and different length.
numbers <- 1:30
characters <- letters[1:16]
logicals <- rep(c(T, F), times = 2)
To create a list containing these three vectors, use the list() function and type in the names of the created vectors. As R
creates the list, it actually coerces/changes these vectors into sublists.
newList <- list(numbers, characters, logicals)
newList
[[1]]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30

[[2]]
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p"

[[3]]
[1] TRUE FALSE TRUE FALSE
As you examine the printed list, you will see two sets of indexes. The double square brackets [[]] indicate which sublist is being
displayed. The single brackets [] are used to index the vectors themselves.
List Attributes
Besides the data explicitly contained in a list, other pieces of information can be attached to it. Names can be given to each
sublist. This can be done in two different ways.
Names can be added using the names() function. It requires a character vector with a name for each sublist. The names
must appear in the same order as the sublists.
names(newList) <- c("LN", "LC", "LL")
Alternatively, the names can be added as the list is being created. Notice that, with a named list, the [[]] are replaced by a dollar
sign and a name.
namedList <- list(LN = numbers, LC = characters, LL = logicals)
namedList
$LN
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30

$LC
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p"

$LL
[1] TRUE FALSE TRUE FALSE
Comments can be added to describe the contents of the list using the comment() function. Comments should be written as
strings. The comment() function displays the comments also.
comment(namedList) <- "This is a comment on a list."
comment(namedList)
[1] "This is a comment on a list."
The attributes() function can be used to display a summary of these additional attributes.
attributes(namedList)
$names
[1] "LN" "LC" "LL"

$comment
[1] "This is a comment on a list."
List Structure
When trying to access the information in a list, it is useful to look at its underlying structure using the str() function. The
str() function will list the names of each component as well as a simplified summary of the information contained in that
component.
str(newList)
List of 3
$ LN: int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
$ LC: chr [1:16] "a" "b" "c" "d" ...
$ LL: logi [1:4] TRUE FALSE TRUE FALSE
str(namedList)
List of 3
$ LN: int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
$ LC: chr [1:16] "a" "b" "c" "d" ...
$ LL: logi [1:4] TRUE FALSE TRUE FALSE
- attr(*, "comment")= chr "This is a comment on a list."
The str() function is very useful when extracting information from lists. It can be very easy to extract information in a form
that is different from what you desired.
List within Lists
As was said before, the components of a list can be another list. Creating one is done in the same way we have already
created a list.
listOfLists <- list(newList, "New Data")
listOfLists
[[1]]
[[1]]$LN
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30

[[1]]$LC
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p"

[[1]]$LL
[1] TRUE FALSE TRUE FALSE


[[2]]
[1] "New Data"
str(listOfLists)
List of 2
$ :List of 3
..$ LN: int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
..$ LC: chr [1:16] "a" "b" "c" "d" ...
..$ LL: logi [1:4] TRUE FALSE TRUE FALSE
$ : chr "New Data"
Notice that there are two levels to this list. It is not that easy to see when displaying the list. However, looking at the
str() of the list, we can see that listOfLists is a list made up of two components. The first component, listed after the first
dollar sign, is a list with three components. The second component, listed after the second dollar sign, is a single piece of
character data.
Vectors & Lists
We have been rather loose with the term vector. In R, vectors come in two flavors: atomic vectors and lists. Atomic vectors
are those vectors that contain one type of data. Lists are vectors that can contain many types. Since we have a special word, lists,
for the second type, it won’t cause a problem to refer to atomic vectors simply as ‘vectors.’

Chapter 16: Modifying Lists
Data From Lists
When extracting components from a list, the result can be preserved as a list, or can possibly be simplified into a
‘simpler’ data type. The choice of which method to use will depend on how the result will be used.
Preservation
To preserve the data as a list, use the name of the list followed by a single set of square brackets. Inside the square
brackets, include a vector with the index or the names of the desired components.
newList[1]
newList["LN"]
Either of these will produce:
$LN
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30
typeof(newList[1])
[1] "list"
While this result looks like a numerical vector, it is a list. Operations and functions that perform perfectly well on vectors,
might not work on lists.
namedList[1] + 1
Error in namedList[1] + 1: non-numeric argument to binary operator
If desired, multiple components can be filtered out. This is done using a vector of indexes or a vector of names.
newList[c(1, 3)]
newList[c("LN", "LL")]
$LN
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30

$LL
[1] TRUE FALSE TRUE FALSE
Simplification
Another way to extract the information from a list is to extract and attempt to simplify it. When R simplifies a list, it
attempts to coerce the list into a ‘simpler’ form, from R's point of view. Simplification is attempted either by using double
square brackets or a dollar sign followed by the component name.
newList[[3]]
[1] TRUE FALSE TRUE FALSE
newList$LL
[1] TRUE FALSE TRUE FALSE
typeof(newList[[3]])
[1] "logical"
In either case, for this example, the result is just a logical vector.
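The practical difference between preservation and simplification shows up as soon as arithmetic is attempted. The following sketch uses a small made-up list, exampleList, so that it can be run on its own.

```r
exampleList <- list(LN = 1:3, LL = c(TRUE, FALSE))

typeof(exampleList["LN"])    # single brackets keep a list
# [1] "list"
typeof(exampleList[["LN"]])  # double brackets return the component itself
# [1] "integer"

# Arithmetic works on the simplified result.
exampleList[["LN"]] + 1
# [1] 2 3 4
exampleList$LN * 2
# [1] 2 4 6
```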

Modifying List Components
Modifying list components can be done by assigning a new set of values to a particular component. This can be done either
with names or numerical index.
newList[2] <- TRUE
newList$LC <- TRUE
newList[[c(1, 1)]] <- 2
str(newList)
List of 3
$ LN: num [1:30] 2 2 3 4 5 6 7 8 9 10 ...
$ LC: logi TRUE
$ LL: logi [1:4] TRUE FALSE TRUE FALSE
Adding components
Adding components can be done three ways. Two use the assignment operator. In either of these cases, you assign your
new data to a component that doesn’t already exist. Either refer to the next component index, or assign a new name using
the dollar sign.
newDataIndex <- length(newList)
newList[newDataIndex + 1] <- "New Data"
newList$ExtraData <- "Newer Data"
str(newList)
List of 5
$ LN : num [1:30] 2 2 3 4 5 6 7 8 9 10 ...
$ LC : logi TRUE
$ LL : logi [1:4] TRUE FALSE TRUE FALSE
$ : chr "New Data"
$ ExtraData: chr "Newer Data"
When referring to the next component index, it is important to check the length of the list. Otherwise, several empty (NULL) list
components can be added.
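A minimal sketch of what goes wrong when the index skips ahead, using a made-up two-component list:

```r
exampleList <- list("a", "b")

# Skipping ahead to index 5 leaves components 3 and 4 empty.
exampleList[5] <- "New Data"
length(exampleList)
# [1] 5
is.null(exampleList[[3]])
# [1] TRUE
```

Computing the index as length(exampleList) + 1 avoids the gap.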
The third method for adding a component to a list uses the append() function. It requires the list you have, and the
components you want to add to it. This method requires that you assign your result to a variable, otherwise the change will
not take place. The previous methods update the existing list.
newList <- append(newList, "Even More New Data")
str(newList)
List of 6
$ LN : num [1:30] 2 2 3 4 5 6 7 8 9 10 ...
$ LC : logi TRUE
$ LL : logi [1:4] TRUE FALSE TRUE FALSE
$ : chr "New Data"
$ ExtraData: chr "Newer Data"
$ : chr "Even More New Data"
Removing Components
To remove a component, assign the NULL value to that component.
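As a sketch on a small made-up list, removal works with either the dollar sign or a numerical index in double square brackets:

```r
exampleList <- list(LN = 1:3, LC = c("a", "b"), LL = TRUE)

# Assigning NULL removes the component entirely.
exampleList$LC <- NULL
names(exampleList)
# [1] "LN" "LL"

# A numerical index works the same way.
exampleList[[1]] <- NULL
names(exampleList)
# [1] "LL"
```

After each removal the remaining components shift down, so numerical indexes should be rechecked before removing another component.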

Chapter 17: DataFrames
When performing data analysis, most likely you will be using a dataframe to hold your data. A dataframe is a list where
each component has the same length. To visualize a dataframe, think of a matrix whose columns can hold different types of
data. The columns correspond to the components of a list. Generally, each column represents a single variable.
Creating a dataframe
Dataframes can be created many ways. Importing data is one way. But, the simplest way to make a dataframe is from a
collection of vectors.
numbers <- 1:26
characters <- letters[1:26]
logicals <- rep(c(T, F), times = 13)
Using the data.frame() function, list the vectors that you want to include, separated by commas. Names can be included, if
desired.
dat <- data.frame(numbers, charName = characters, 101:126, `2 name` = logicals)
# `2 name` is an incorrectly formatted name. The back-ticks 'allow' it.
head(dat, n = 4) # n defaults to 6. Change it to the number of lines you want to see.
numbers charName X101.126 X2.name
1 1 a 101 TRUE
2 2 b 102 FALSE
3 3 c 103 TRUE
4 4 d 104 FALSE
Above, you can see the structure of a dataframe. You should notice that the numbers and characters vectors
each have 26 elements. The dataframe as displayed only shows the first four rows. The head() function was used to display
just the beginning of the dataframe. Entering dat alone, without using head(), would print all 26 lines. In this case, that is
not much to look at, but consider a dataframe with 10,000 rows. The output can be overwhelming. There is also a tail()
function. It displays the end of a dataframe. The dim() function will display the number of rows and columns.
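The following sketch shows tail() and dim() on a small stand-in dataframe. The name exampleData and its two columns are made up so that the block runs on its own.

```r
exampleData <- data.frame(numbers = 1:26, charName = letters)

# The last two rows of the dataframe.
lastRows <- tail(exampleData, n = 2)
lastRows$charName
# [1] "y" "z"

# Number of rows, then number of columns.
dim(exampleData)
# [1] 26  2
```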
If we continue to look at the structure of the dataframe, we will notice some behavior of data.frame().
str(dat)
'data.frame': 26 obs. of 4 variables:
$ numbers : int 1 2 3 4 5 6 7 8 9 10 ...
$ charName: chr "a" "b" "c" "d" ...
$ X101.126: int 101 102 103 104 105 106 107 108 109 110 ...
$ X2.name : logi TRUE FALSE TRUE FALSE TRUE FALSE ...
First, data.frame() assigns a name to each component that is added. (Remember, these components are each columns.)
The name for each column will come from either
1. the named variable ( numbers )
2. a name given explicitly when listing the variable ( charName )
3. some convention R uses for making up a name, if the first two types of names are not available.
Secondly, notice that data.frame() changed the name of a component. An X was prepended to `2 name` because it starts
with a number, and the space was replaced with a period. Basically, improperly formatted names are ‘fixed.’ Improperly formatted
names are common in imported data sets.
Finally, if we check the attributes of a dataframe, we can see that besides names, it has two other attributes namely class
and row.names.
attributes(dat)
$names
[1] "numbers" "charName" "X101.126" "X2.name"

$class
[1] "data.frame"

$row.names
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26
Subsetting Dataframes
Elements of a dataframe are identified by both the row and column they are in. Using numerical vectors to select a single
element, identify the row index and the column index. The “d” is in the fourth row and second column of the dataframe dat.
To select “d,” type the name of the dataframe with a set of square brackets. Inside the brackets, enter the row index, a
comma, and the column index.
dat[4, 2]
[1] "d"
To replace “d” with a different value, use the assignment operator. To expand this selection to more values, replace the
single row and column indexes with vectors of multiple indexes.
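Both operations can be sketched on a small stand-in dataframe. The names exampleData and firstRows are illustrative, and stringsAsFactors = FALSE is set explicitly so the character column behaves the same way on older versions of R.

```r
exampleData <- data.frame(numbers = 1:5, charName = letters[1:5],
                          stringsAsFactors = FALSE)

# Replace the value in row 4, column 2.
exampleData[4, 2] <- "D"
exampleData[4, 2]
# [1] "D"

# Select rows 1 and 2 from columns 1 and 2. The result is a dataframe.
firstRows <- exampleData[1:2, c(1, 2)]
dim(firstRows)
# [1] 2 2
```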
Subsetting Entire Columns
If you want to restrict yourself to specific columns and take all rows, at least two methods exist.
1. In the square brackets, enter a vector of column indexes. (This will return a dataframe.)
2. Replace the square brackets with a single dollar sign, followed by the name of the column. (This will return a vector.)
dat[c(1, 3)] # A comma can be entered before the vector: dat[, c(1, 3)]
dat$numbers
Subsetting Entire Rows
Subsetting specific rows can be done in several different ways. After the dataframe name, and surrounded by square brackets:
1. enter a numeric vector of desired rows, followed by a comma. (A vector of negative values will remove rows.)
2. enter a logical vector followed by a comma. Rows corresponding to TRUE values will be kept.
3. enter a comparison followed by a comma. The comparison is usually based on values in a specific column.
dat[c(1, 3), ]
dat[c(T, F, T, rep(F, times = 10)), ]
dat[dat$charName == "a" | dat$charName == "c", ]
Each of these filters out the same two rows.
numbers charName X101.126 X2.name
1 1 a 101 TRUE
3 3 c 103 TRUE
As with vectors, the subset() function can also be used. The first argument is the dataframe to subset, the second is the
condition used to select the subset.
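A short sketch of subset() applied to a dataframe follows. The small dataframe is a stand-in for dat, built here only so the example runs on its own.

```r
# Stand-in for the first two columns of dat (assumed here for illustration).
dat <- data.frame(numbers = 1:26, charName = letters[1:26],
                  stringsAsFactors = FALSE)

# Column names can be used directly inside the condition.
picked <- subset(dat, charName == "a" | charName == "c")
picked
```

Inside subset() the column names are looked up in the dataframe first, so dat$ does not need to be repeated.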
Missing Values
Determining which rows of a dataframe have missing values can be important. complete.cases() will return a logical
vector where FALSE will indicate rows with missing values. This provides a fast way to identify, and possibly remove, rows
with missing values. However, the columns of interest could be complete. In that case, removing rows would be a bad idea.
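The following sketch shows both situations on a made-up dataframe (survey and its columns are hypothetical names). Checking every column removes two rows, while checking only the columns of interest removes one.

```r
# A made-up dataframe with missing values in two different columns.
survey <- data.frame(
  id     = 1:4,
  height = c(1.70, NA, 1.82, 1.65),
  notes  = c("ok", "ok", NA, "ok"),
  stringsAsFactors = FALSE
)

# TRUE marks rows with no missing values in any column.
all.complete <- complete.cases(survey)
survey[all.complete, ]

# Restrict the check to the columns of interest before removing rows.
height.complete <- complete.cases(survey[, c("id", "height")])
survey[height.complete, ]
```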
Extending a dataframe
New columns can be added to a dataframe in several ways:
1. Use data.frame() listing the original dataframe as well as any new columns. Assign this result to a new/existing
variable.
2. Using the dollar sign, add a new name to the dataframe and assign the new data to it.
3. Use cbind() to bind a column to the existing data frame. Assign this result to a new/existing variable.
dat <- data.frame(dat, newNumbers1 = 1:26)
dat$newNumbers2 <- 11:36
dat <- cbind(dat, newNumbers3 = 21:46)
head(dat, n = 1)
numbers charName X101.126 X2.name newNumbers1 newNumbers2 newNumbers3
1 1 a 101 TRUE 1 11 21

Chapter 18: Tibbles
A tibble is a modified version of a dataframe1.
Creating a tibble
As with dataframes, tibbles can be created in different ways. They can be the result of importing data, or of
using the tibble() function in a manner analogous to using data.frame() to bind vectors together. For comparison, a
tibble and a dataframe are constructed from the same set of vectors. As with a dataframe, list the vectors, possibly with
names, separated by commas2.
numbers <- 1:26
characters <- letters[1:26]
logicals <- rep(c(T, F), times = 13)
# Building a dataframe
dat <- data.frame(numbers, charName = characters, 101:126, `2 name` = logicals)
# Building a tibble. The tibble() function comes from the tibble package.
library(tibble)
tib <- tibble(numbers, charName = characters, 101:126, `2 name` = logicals)
A tibble can also be built by directly typing in the values row by row. This is done with the tribble() function. To do so, list
the column names separated by commas in the first row of entered data. Immediately precede each name with a tilde. On
subsequent lines, enter the data row by row, separating values with a comma.
tribble.example <- tribble(
  ~var.1, ~var.2,
  1, "q",
  2, "w",
  30, "e"
)
tribble.example
# A tibble: 3 × 2
var.1 var.2
<dbl> <chr>
1 1 q
2 2 w
3 30 e
A tibble can also be made directly from a dataframe, vector, or matrix. The as_tibble() function will attempt to coerce these
into a tibble. Going the other direction, the as.data.frame() function can be used to coerce a tibble into a data.frame3.
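A minimal round-trip sketch, assuming the tibble package is installed. The small dataframe is made up for the example.

```r
library(tibble)

# A small made-up dataframe.
dat <- data.frame(numbers = 1:3, charName = c("a", "b", "c"),
                  stringsAsFactors = FALSE)

tib  <- as_tibble(dat)        # dataframe -> tibble
back <- as.data.frame(tib)    # tibble -> dataframe

class(tib)
class(back)
```

The values are unchanged by the round trip. Only the class of the object differs.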
Differences between tibble and dataframe
For comparison, the tibble tib and dataframe dat are displayed below:
tib
# A tibble: 26 × 4
numbers charName `101:126` `2 name`
<int> <chr> <int> <lgl>
1 1 a 101 TRUE
2 2 b 102 FALSE
3 3 c 103 TRUE
4 4 d 104 FALSE
5 5 e 105 TRUE
6 6 f 106 FALSE
7 7 g 107 TRUE
8 8 h 108 FALSE
9 9 i 109 TRUE
10 10 j 110 FALSE
# … with 16 more rows
# `2 name` is an incorrectly formatted name. The back-ticks 'allow' it.
head(dat, n = 4)
numbers charName X101.126 X2.name
1 1 a 101 TRUE
2 2 b 102 FALSE
3 3 c 103 TRUE
4 4 d 104 FALSE


1 Most of this information is drawn from vignette("tibble").
2 For tibbles, tibble() will only recycle vectors of length one.
3 Older functions may not recognize tibbles
Things to notice about these two:
1. The tibble did not change the name of any of the columns. Spaces were not turned into periods. No X was added
to the start of the improperly named columns.
2. No extra command was needed to restrict the values displayed by the tibble. By default, a tibble will only display 10
rows, and as many columns as will fit nicely on the screen, or page.
3. At the head of each column, the tibble displays the data type.
Something not visible in the printout: the type of data entered into tibble() does not change. Older versions of
data.frame() would change some data types4.
Subsetting a tibble
Selecting an element, or set of elements, from a tibble follows the same process as selecting from a dataframe with a few
tweaks.
1. If a column has an incorrectly formatted name, the name needs to be surrounded by back ticks to use it.
2. When using a $ followed by a name, a data frame does not require the entire name to be included. It will look for a
partial match. A tibble requires the entire name.
dat$num
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26
tib$num
Warning: Unknown or uninitialised column: `num`.
NULL
An important difference between subsetting a tibble and a dataframe is that a subset of a tibble taken with single square
brackets will always be a tibble, whereas a subset of a dataframe may be a dataframe or it may be a vector.
tribble(~colA, ~colB, "a", 1, "b", 2, "c", 3)
# A tibble: 3 × 2
colA colB
<chr> <dbl>
1 a 1
2 b 2
3 c 3
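The difference in what a subset returns can be sketched as follows, assuming the tibble package is installed. The small objects are made up for the example.

```r
library(tibble)

dat <- data.frame(numbers = 1:3, charName = c("a", "b", "c"),
                  stringsAsFactors = FALSE)
tib <- as_tibble(dat)

df.col <- dat[, 1]      # a single column of a dataframe drops to a vector
tb.col <- tib[, 1]      # the same selection on a tibble stays a tibble
tb.vec <- tib$numbers   # use $ (or [[ ]]) to pull a vector out of a tibble

class(df.col)
class(tb.col)
```

For a dataframe, dat[, 1, drop = FALSE] keeps the dataframe structure when that is what is wanted.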
An added behavior
When creating a tibble, one column can refer to another column, and perform operations on it to construct another column.
x <- 11:13 # OUTSIDE of tibble()
z <- 101:103 # OUTSIDE of tibble()
tib <- tibble(x = 1:3, y = x^2, z)
tib
# A tibble: 3 × 3
x y z
<int> <dbl> <int>
1 1 1 101
2 2 4 102
3 3 9 103
Look at the two vectors constructed before the tibble. The tibble() function ignores the vector x outside of the tibble()
function in favor of the one created inside. However, there is no vector z created inside the tibble() function. So, the vector
z outside the tibble() function is used.
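For contrast, here is a small sketch of the same call made with data.frame(), assuming the tibble package is installed. data.frame() evaluates each argument outside the dataframe being built, so y is computed from the outer x.

```r
library(tibble)

x <- 11:13   # defined OUTSIDE of both calls

# data.frame(): y is computed from the outer x (11, 12, 13).
df <- data.frame(x = 1:3, y = x^2)

# tibble(): y is computed from the column x created just before it (1, 2, 3).
tb <- tibble(x = 1:3, y = x^2)

df$y
tb$y
```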


4 The latest version of data.frame() does not do this. The stringsAsFactors argument defaults to FALSE.
Chapter 25: ggplot2
ggplot2 is a package for building graphics. The gg in the name stands for grammar of graphics.
library(ggplot2)
Each graphic made using ggplot2 will require, at a minimum, four components to be identified.
1. At least one data set
2. The ggplot() function to initialize/start up a plot.
3. A set of aesthetic mappings aes() that will map variables to aesthetic (visual) properties.
4. A set of geometric objects geoms that will each create a layer in the actual graphic. (Wickham 2016)
Other components can be added to further customize the graphic, but these four will be the current focus.
Data
To illustrate the basic parts of a graphic generated by ggplot(), a contrived (made-up) data set, called linear.data, will be
used. Seeing the data constructed makes it easier to see the function of the various plotting options.
var.1 <- rep(1:20, each = 3)
var.2 <- 2 * var.1 + rnorm(60)
var.2[c(19, 36)] <- var.2[c(19, 36)] - 5
group.var.1 <- rep(c("A", "B", "C"), times = 20)
group.var.2 <- rep(c("V", "W", "X", "Y", "Z"), each = 12)
linear.data <- data.frame(var.1, var.2, group.var.1, group.var.2)
head(linear.data)
var.1 var.2 group.var.1 group.var.2
1 1 1.038067 A V
2 1 1.707474 B V
3 1 2.258788 C V
4 2 2.847868 A V
5 2 4.195783 B V
6 2 4.030124 C V
The ggplot() function is designed to use a tidy data set. This is a data set where each row represents a single observation
and each column a single variable. By design, linear.data is a tidy data set.

linear.data consists of four variables: two numeric, and two categorical. It exhibits a very strong linear trend with the
nineteenth and thirty-sixth observations set relatively far outside the overall linear pattern.

ggplot()
On its own, the ggplot() function sets up a blank canvas. The grey box is that canvas5.
ggplot() #The basic ggplot window

It should make sense that nothing of interest appeared. In order for a function to deliver any type of information about a
data set, it needs to know about the data set. In this next graph, the data argument in ggplot() will be set to linear.data.
Additionally, the graphic will be assigned to a variable6.
basic.plot <- ggplot(data = linear.data) #The basic ggplot window
basic.plot

This returns a grey box, also. Two questions should come to mind:
1. Why did the graph not change?
2. What did ggplot() do with the data that was passed to it?
The answer to the first is that nothing in the ggplot(data = linear.data) command indicates how the data should be
represented. Without that information, nothing new can be shown.
To answer the second, the data is stored as a default. The basic structure for creating a graphic is to take the grey box and
add geometric components to it7,
ggplot(…) + geom_*().
These geometric functions, geoms, need a data set to produce geometric objects. When they are not supplied with one
directly, they inherit the data set supplied to the ggplot() command.
In fact, ggplot() has a second argument, mapping, that can also be inherited by subsequent geom functions. Again, this
only occurs if a mapping is not directly supplied as an argument to a subsequent geom.
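The inheritance described here can be checked with a small sketch. The data sets toy and other are made up for the example, and layer_data() from ggplot2 is used to look at the data each layer ends up drawing.

```r
library(ggplot2)

toy   <- data.frame(u = 1:5, v = c(2, 4, 6, 8, 10))
other <- data.frame(u = 3, v = 0)

p <- ggplot(data = toy, mapping = aes(x = u, y = v)) +
  geom_point() +                             # inherits data and mapping
  geom_point(data = other, color = "red")    # own data, inherited mapping

inherited  <- layer_data(p, 1)   # built data for the first layer
overridden <- layer_data(p, 2)   # built data for the second layer
nrow(inherited)
nrow(overridden)
```

The first layer draws all five rows of toy. The second layer draws only the single row of other, but still uses the inherited x and y mapping.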


5 This is the default. It can be changed using themes
6 This makes it easy to execute the graphic at any time, or to add to it.
7 The plus sign + plays a special role when constructing graphics using ggplot2. Geoms, and other functions, are added to the graphic one at a time using a plus sign.
Aesthetic Mappings
The mapping argument is set to take the result of the aesthetic function, aes(). The aes() function produces a mapping
between data values and visual properties of a graph. The first argument of aes() is x. This argument identifies which
variable will be mapped to the horizontal axis.8 The second argument, y, identifies which variable is mapped to the vertical
axis. Depending upon the type of graph being built, the y argument does not always need to be explicitly provided.
1. Histograms do not require y to be stated. The vertical axis of a histogram is a count based on the horizontal data
values.
2. Scatterplots do require y to be assigned. A point can’t be plotted without all its components.
If a default mapping is desired between the variable values and a visual property of a graph then additional arguments can
be supplied to aes(). If a default mapping is not desired for a particular visual property, a mapping can be made outside of
ggplot() and inside the geom where it is desired.
# color is assigned to a variable inside aes(). This will map the values of group.var.1 to a set of colors.
aes.plot.1 <- ggplot(data = linear.data, mapping = aes( x = var.1, y = var.2, color = group.var.1))

# No color aesthetic mapping is done at this stage. Color will either be mapped or set once the geometric objects are added on.
aes.plot.2 <- ggplot(data = linear.data, mapping = aes( x = var.1, y = var.2))

# ( aes.plot.1 | aes.plot.2 )

These two graphs now have a coordinate system. This is something that can be assigned once the horizontal and vertical
variables have been assigned. There is still no indication of the data because there is nothing that assigns a geometric
object to the data.
A color aesthetic mapping9 was added to the left-hand graph. This has no visual effect for the same reason. Nothing indicates
what the geometric objects are that will appear in the graphic.
However, looking at aes.plot.1, the color of the objects will have some relationship to the values of group.var.1. This is
because a default mapping is being produced by aes(). In aes.plot.2, at this point, the color of any geometric object will
have no relationship to any of the given variables. The default color for each geometric object will be used.

The graph on the left is one example of a graph that could be made using aes.plot.1 to start a plot and set some default
aesthetics. The key thing to notice is that the color is not always the same. In fact the colors correspond to values of
group.var.1. The legend indicates the mapping. For the graph on the right, aes.plot.2, black, the default, was used.
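The distinction between mapping a color inside aes() and setting a color outside of it can be sketched as follows. The data set toy is made up for the example, and layer_data() is used to inspect the colors that were actually assigned.

```r
library(ggplot2)

toy <- data.frame(u = 1:6, v = c(2, 4, 6, 8, 10, 12),
                  g = rep(c("A", "B", "C"), times = 2))

# Color is MAPPED: each value of g receives its own color and a legend.
mapped <- ggplot(toy, aes(x = u, y = v)) + geom_point(aes(color = g))

# Color is SET: every point is red and no legend is produced.
fixed <- ggplot(toy, aes(x = u, y = v)) + geom_point(color = "red")

mapped.colors <- unique(layer_data(mapped)$colour)
fixed.colors  <- unique(layer_data(fixed)$colour)
mapped.colors
fixed.colors
```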

8 It is assumed that a cartesian coordinate system is being used. This is the default coordinate system. When a different coordinate system is used, an appropriate mapping of the x and
y arguments in the aes() function will be used.
9 Other aesthetics exist. Some are alpha(transparency), fill, group, shape, size, and stroke.
geoms
A geom function will add a geometric object, or set of geometric objects, to a plot. Each geom creates a layer that can be
applied to the basic ggplot() grid. If no data or aesthetic arguments10 are specified in the geom function, it will inherit the
defaults set in the ggplot() function. Otherwise, a geom will override the defaults with those that are specified
within it. Three things should be noticed.
1. Default data and aesthetic mappings do not need to be specified in ggplot(). The appropriate arguments can be set
within each geom.
# No default data or aesthetic mappings are set in ggplot()
example.plot.1 <- ggplot( ) +
geom_point(data = linear.data , mapping = aes(x = var.1, y = var.2, color = group.var.1), size=3 )

# No data or aesthetic mappings are given in geom_point()
example.plot.2 <- ggplot( data = linear.data, mapping = aes(x = var.1, y = var.2, color = group.var.1) ) +
geom_point( size=3 )

# (example.plot.1 | example.plot.2)

2. It is not an all or nothing situation. In a specific geom some arguments can be overridden and others can be left to the
default values.
3. The data set used in a geom can vary from the default. A graphic can be made from multiple data sets.
# outlying.data contains the coordinates of the two outlying data points.
outlying.data
var.1 var.2 group.var.1
19 7 10.22431 A
36 12 19.70552 C
example.plot.3 <- example.plot.2 +
geom_point( data = outlying.data, size = 4, color = "red", shape = 6)

example.plot.4 <- example.plot.2 +
geom_point( data = outlying.data, size = 4, shape = 6)

# (example.plot.3 | example.plot.4)

Notice that in both plots, the outlying.data points are plotted with a triangle shape. The left-hand graph
explicitly set the color to red. The default aesthetic color mapping was overridden. No color was specified in the right-hand
plot. The default aesthetic color mapping was used.
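The construction of outlying.data is not shown. One way it could be built, offered only as an assumption, is to pull rows 19 and 36 out of linear.data. The sketch below rebuilds the needed columns of linear.data with an arbitrary seed, so the var.2 values will differ from those printed above.

```r
library(ggplot2)

set.seed(3040)   # arbitrary seed so the sketch is reproducible
var.1 <- rep(1:20, each = 3)
var.2 <- 2 * var.1 + rnorm(60)
var.2[c(19, 36)] <- var.2[c(19, 36)] - 5
group.var.1 <- rep(c("A", "B", "C"), times = 20)
linear.data <- data.frame(var.1, var.2, group.var.1)

# Assumed construction: keep only the two shifted observations.
outlying.data <- linear.data[c(19, 36), ]

# Second layer: its own data set and a set color, inherited x and y mapping.
p <- ggplot(linear.data, aes(x = var.1, y = var.2, color = group.var.1)) +
  geom_point(size = 3) +
  geom_point(data = outlying.data, size = 4, color = "red", shape = 6)

outlying.data
```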

10 Not all aesthetics are applicable to all geoms. Check the geom's help documentation.
Coordinate Systems
A Cartesian coordinate system is the default for most geoms. This can be overridden by adding a coord function to a plot.
Returning to a simple scatterplot of linear.data on a Cartesian coordinate system we have the following graphs. The
second plot is constructed by using the coord_cartesian() function. There should not be, and there is no, difference
between the two.
example.plot.1 <- basic.plot +
geom_point( mapping = aes(x = var.1, y = var.2), size=3)

example.plot.cartesian <- example.plot.1 +
coord_cartesian()

# ( example.plot.1 | example.plot.cartesian )

The coord_fixed() function fixes the displayed ratio of units on each axis. Setting its ratio argument to r indicates that the
displayed length of one unit on the vertical axis is r times the displayed length of one unit on the horizontal. In these
examples the ratio is set to 2 and 1/2. Line segments of length five are plotted on each. The ratio of the displayed lengths can
more easily be seen.
example.plot.fixed.times.two <- example.plot.1 +
coord_fixed(ratio = 2)
example.plot.fixed.times.half <- example.plot.1 +
coord_fixed(ratio = 1/2)
# ( example.plot.fixed.times.two | example.plot.fixed.times.half )

The narrowness of the first graph is a direct result of the fixed ratio. Looking at the range of each variable, the vertical runs
from 0 to 40 approximately, and the horizontal from 0 to 20. Because of the fixed ratio, a displayed vertical distance of 20 is
twice the displayed horizontal distance of 20. Since the vertical range is also twice the horizontal range, the graph
needs to be four times taller than it is wide.
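The arithmetic behind this can be written out as a short sketch. The helper function aspect() is hypothetical, and the small expansion that is added to each axis by default is ignored.

```r
# Height-to-width ratio of the plotting panel for a fixed coordinate ratio.
# ratio   : displayed length of one vertical unit per horizontal unit
# x.range : width of the horizontal data range
# y.range : width of the vertical data range
aspect <- function(ratio, x.range, y.range) {
  ratio * y.range / x.range
}

aspect(ratio = 2,   x.range = 20, y.range = 40)   # 4: four times taller than wide
aspect(ratio = 1/2, x.range = 20, y.range = 40)   # 1: a square panel
```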
The coord_flip() function flips the x and y axes. Nothing exciting about it.

The coord_trans() function can be used to transform the values on each axis. This is done by supplying a
transformation to either, or both, of its x or y arguments. Notice that this changes the pattern of the plotted points. This is
because on these scales the distance from zero is proportional to a selected function of a data value.
library(scales)

example.plot.sqrt.x <- example.plot.1 +
coord_trans(x = sqrt_trans() )

# Note: despite its name, this plot uses a reciprocal x scale and a natural-log y scale.
example.plot.sqrt.x.log10.y <- example.plot.1 +
coord_trans(x = "reciprocal", y = "log" )

# ( example.plot.1 | example.plot.sqrt.x | example.plot.sqrt.x.log10.y )

Several different transformation functions can be used. These can be found by searching for "_trans" in the help menu.
Taking the portion of their name that appears before "_trans" as a string will allow their direct use11. However, some will
require additional arguments. For example, boxcox_trans() requires p, a transformation exponent. In this case, the transformation
needs to be explicitly written into coord_trans() as boxcox_trans(p = 2)12.
library(scales)
example.polar <- example.plot.1 +
coord_polar()
example.polar.with.arguments <- example.plot.1 +
coord_polar( theta = "x", start = 0, direction = -1 )
# ( example.plot.1 | example.polar | example.polar.with.arguments )

The coord_polar() function sets up polar coordinates. The theta argument indicates which position variable will be
mapped to an angle. The remaining position variable will be mapped to a radius. The start argument indicates, in radians,
where all angles are measured from. The default zero angle is 12:00. The direction argument indicates a clockwise (1) or
counter-clockwise (-1) measurement of angles.


11 In the example, “reciprocal” was used in place of reciprocal_trans().
12 When accessing a "_trans" function directly, the scales library needs to be loaded.
Faceting
The faceting functions provide a quick way to create multiple plots each of which focuses on a subset of the data set.
The first graph on the left displays linear.data with the colors determined by group.var.1.
The facet_wrap() function creates the remaining three graphs. Each graph focuses on data that takes on a single value
from group.var.1. Since each graph contains only one type of data, from group.var.1, and the color is determined by the
type, each graph contains only one color of point.
example.facet.1 <- example.plot.2 +
facet_wrap( ~ group.var.1)

Using the variable group.var.2, the data set is split into 5 subsets. Each subset contains various values from group.var.1.
Therefore, each graph contains multiple colors.
example.facet.2 <- example.plot.2 +
facet_wrap( facets = ~group.var.2)

facet_wrap() creates graphs in a row, and then wraps down to subsequent rows, if needed.
Pay special attention to the notation for identifying the faceting variable. There is a tilde ~ followed by the name.

The facet_grid() function is used to form a matrix of plots. Each plot will focus on a subset built from pairing values from
two variables. One variable's values will define the rows and the other's will define the columns of the matrix plot. The
notation is
row variable ∼ column variable.
example.facet.3 <- example.plot.2 +
facet_grid( group.var.2 ~ group.var.1)

As can be seen, care should be used when faceting. If a variable takes too many values, it can lead to an excessively large
number of graphs. Faceting with continuous data could lead to plots with at most a single point each. Creating a variable, or
set of variables, that bins the data would be advisable.
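One way to bin a continuous variable before faceting is the base R cut() function. The sketch below uses a made-up data set (toy, with a continuous variable w), and the break points are arbitrary choices for illustration.

```r
library(ggplot2)

set.seed(25)   # arbitrary seed so the sketch is reproducible
toy <- data.frame(x = runif(90), y = runif(90),
                  w = runif(90, min = 0, max = 30))

# Bin the continuous variable w into three intervals.
toy$w.bin <- cut(toy$w, breaks = c(0, 10, 20, 30), include.lowest = TRUE)

# Facet on the binned variable: three panels instead of one panel per value of w.
binned.facets <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ w.bin)

table(toy$w.bin)
```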

Scales
Scale functions are used to control the mapping of data values to aesthetics. If a scale function is not specified, a default will
be used. However, using a scale function, or more than one, overrides the defaults. As a first example, a data set is needed.
This contains two numerical vectors and one character vector.
x <- c(0, 1, 4, 9, 16)
y <- c(0, 15, 20, 30, 200)
z <- c(letters[1:4], "a")
dat <- data.frame(x, y, z)
Three scatterplots will be created. The first does not explicitly use a scale function. The default mapping will be used. The
remaining two will use a scale function to affect the color of the plotted points.
p1 <- ggplot(data = dat, mapping = aes(x = x, y = y)) +
geom_point(mapping = aes( color = z, shape = z), size =5)
p2 <- p1 + scale_color_manual( values = c("red","blue", "black", "green"))
p3 <- p1 + scale_color_manual( values = c("black","black", "black", "red"))
# (p1|p2|p3)

In the first graph, ggplot2 used a default mapping to assign colors: “a” - orange-ish, “b” - green-ish, “c” - bluegreen-ish, and “d” -
purple. Making use of a scale function, namely scale_color_manual(), a different color assignment/mapping was used: “a” -
red, “b” - blue, “c” - black, “d” - green. As for the third graph, it makes the mistake of assigning the same color to three
different characters, making it impossible to know the actual z-value of the black points. The point being, a different
mapping can be used, but that doesn’t mean it will make the graphic more informative.
Besides modifying a legend, scale functions can change the scale of the axes. For reference, the default graph is shown again.
p4 <- p1 + scale_x_sqrt()
p5 <- p1 + coord_trans(x = "sqrt")
# (p1|p4|p5)

Looking at the horizontal spacing for the graph on the left, it increases from point to point. For that graph, each value was
essentially mapped, or plotted, to itself. When scale_x_sqrt() was applied, the horizontal spacing
of the points became very even. This is because each point's x-coordinate was mapped to its square root. However, the labeling
retained the original value. This can also be accomplished using the coord_trans() function.
Many scale functions exist, such as the scale_x_sqrt() and scale_color_manual() functions used above. The naming convention for them follows a
pattern: scale_(aesthetic parameter)_(type of scale). Care should be taken; some are meant for continuous data, some
for discrete.
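The even spacing can be verified numerically. This sketch rebuilds the x and y columns of the example data and uses layer_data() to look at the positions that scale_x_sqrt() actually plots. The names p.default and p.sqrt are introduced here for illustration.

```r
library(ggplot2)

dat <- data.frame(x = c(0, 1, 4, 9, 16), y = c(0, 15, 20, 30, 200))

p.default <- ggplot(dat, aes(x = x, y = y)) + geom_point()
p.sqrt    <- p.default + scale_x_sqrt()

# Positions used for plotting after the scale transformation.
plotted.x <- layer_data(p.sqrt)$x
plotted.x         # square roots of the original x values
diff(plotted.x)   # equal gaps between neighboring points
```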

Labelling
In addition to controlling the aesthetic mapping, scale functions can control the labels on the graphs. These are the labels
that appear either on the axes or in the legends. To change labels for a particular aesthetic, select the scale function with
the appropriate aesthetic parameter.
In the center graph, the labels on the horizontal axis are to be modified. A scale function beginning with scale_x_ should
be used. If the vertical axis was to be changed, use a scale_y_ function. As for the legend, color and shape aesthetics have
been used. That means a version of the scale_color_ and scale_shape_ functions should be used.
p6 <- p1 + scale_x_sqrt(name = "X - Axis", breaks = c(0, 1, 4, 9, 16), labels = c(0, 1, 4, 9, 0)) +
scale_color_discrete( name = "ZED", breaks = c("a","b"),labels=c( a = "Apple", b = "BLUE"))
p7 <- p1 +
scale_color_discrete(name = "ZED", breaks = c("a","b"),labels=c( a = "Apple", b = "BLUE")) +
scale_shape_discrete(name = "ZED", breaks = c("a","b"),labels=c( a = "Apple", b = "BLUE"))
# (p1|p6|p7)

Three arguments will cover the main aspects of labeling: name, breaks and labels.
name will set the name of the axis or legend. breaks sets the values that appear on the axis or legend. labels is a character
vector that lists the labels for the break values.
A shorter version for creating labels makes use of the xlab() and ylab() functions. These only set the name of an axis. They do
not set the scale, or the break labels on the axis.
p8 <- p1 + scale_x_sqrt() + xlab(label = "X - Axis") + ylab(label = NULL)

Using the labs() function, the titles, captions, and alt-text for a graphic can be set.
p9 <- p1 + scale_x_sqrt() +
labs( title = "TITLE", subtitle = "A subtitle is below the title", caption = "This is the caption.", alt = "alt.text")


Chapter 26: ggplot2: Basic Geometric Objects
ggplot2 has several geoms for constructing layers that consist of basic geometric objects. First, a few geoms that can build
lines will be examined, then a single data set will be used to construct other geometric objects.
Lines
Constructing a line with an arbitrary intercept and slope is done with the geom_abline() function. Its arguments are the
intercept and slope. While geom_abline() can be used to build a horizontal line, geom_hline() does the same thing. Only a
y-intercept needs to be provided. A vertical line is made with geom_vline(). These three functions all build lines. As such,
the line will extend to the edges of the graph. To construct a line segment instead, use geom_segment(). The starting (x, y)
and ending (xend, yend) points must be given.
ggplot() +
geom_abline( mapping = aes( intercept = 0, slope = 1)) +
geom_hline( mapping = aes( yintercept = c(.1, .25, .75) ), color = "red" ) +
geom_vline( mapping = aes( xintercept = c(.05, .7) ), color = "blue", linetype = 2) +
geom_segment( mapping = aes( x = .1, y = .4, xend = .6, yend = .7 ), color = "darkgreen", linetype = 5, size = 4)

Points
The example data set being used consists of two numeric variables and a single character variable. The data was built to
resemble two line segments with different slopes. For all except the last point, the x.data values are increasing. The last
x.data value repeats the first.
x.data <- c(1:10, 10, 10, 11:20, 1)
y.data <- c(1:10, 10, 10, 2 * (11:20), 40)
z.data <- c(rep("A", 10), "B", "B", rep("C", 11))
dat <- data.frame(x.data, y.data, z.data)
Using geom_point() a scatter plot can be built. The character data was mapped to a color aesthetic. As indicated, the red
points and the majority of the blue each follow a linear pattern, with the slope of the blue being greater than that of the red.
ggplot(data = dat) +
geom_point( mapping = aes(x = x.data, y = y.data, color = z.data), size = 2)
point1 <- ggplot(data = dat) +
geom_point( mapping = aes(x = x.data, y = y.data, color = z.data), size = 2)


rect1 <-
ggplot(data = dat) +
geom_rect( mapping = aes( xmin = x.data, xmax = x.data + .5, ymin = y.data, ymax = y.data + 5, fill = z.data), color =
"black")



point1+ plot_spacer()+rect1
The geom_rect() function can create a scatterplot of rectangles, one at each data point. The length of the horizontal edge is controlled
by the xmin and xmax aesthetics. For the vertical, use ymin and ymax.
ggplot(data = dat) +
geom_rect( mapping = aes( xmin = x.data, xmax = x.data + .5, ymin = y.data, ymax = y.data + 5, fill = z.data), color =
"black")

Paths
Using geom_path() will connect the data points, in the order that they are provided, with straight lines. As can be seen, for
this set of data, the indicated path would not indicate a functional relationship.
ggplot(data = dat) +
geom_path( mapping = aes(x = x.data, y = y.data), size = 2, arrow = arrow(), color = "green")

Similar to geom_path(), geom_line() will connect the data values. However, geom_line() connects the points in
increasing order of the variable mapped to the x aesthetic. The geom_line() function only connects points with straight lines. If the
connecting lines are small enough in length, the illusion of a smooth, non-straight curve is achieved. The second
graph is of the cosine function. It was produced with geom_line() using 200 short line segments.
curve.1 <- ggplot(data = dat) +
geom_line( mapping = aes(x = x.data, y = y.data), size = 2, arrow = arrow(), color = "green")

finely.spaced.x.data <- seq(from = -pi, to = 2.7*pi, length.out = 201)
finely.spaced.cosine.x <- cos( finely.spaced.x.data )
cos.dat <- data.frame( finely.spaced.x.data , finely.spaced.cosine.x )

curve.2 <- ggplot(data = cos.dat) +
geom_line( mapping = aes(x = finely.spaced.x.data , y = finely.spaced.cosine.x), size = 2, arrow = arrow(), color = "blue")

( curve.1 | curve.2 )
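The difference between the two connection orders can be seen without drawing anything. This base R sketch lists the order in which the rows of the earlier example data would be visited. The names path.order and line.order are introduced here for illustration.

```r
# The x values from the example data: the last value repeats the first.
x.data <- c(1:10, 10, 10, 11:20, 1)

# geom_path() visits the rows in the order they are given.
path.order <- seq_along(x.data)

# geom_line() visits the rows in increasing order of x.
line.order <- order(x.data)

head(path.order, n = 3)   # rows 1, 2, 3
head(line.order, n = 3)   # rows 1, 23, 2: the last row moves next to the first
```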

Polygons
The geom_polygon() function connects the data points in the order that they are given, in a manner similar to
geom_path(). In the end, geom_polygon() connects the last data point back to the first, creating a closed shape. The outer
path of a polygon can be colored using the color aesthetic, similar to geom_path(), but the inner portion can be filled in
using the fill aesthetic.
ggplot(data = dat ) +
geom_polygon( mapping = aes(x = x.data, y = y.data), color = "red", fill = "black", size = 5, alpha = .25)


Chapter 27: ggplot2: One Variable
Several different plots can be made using a single variable of data. To make a 2-dimensional plot a second variable is
needed. For the single variable geoms, a statistical function provides the values for the second variable needed. Counts,
densities, and quantiles are computed from the single given variable.
Discrete Data
When looking at a single discrete variable, a bar chart can be an appropriate graphic. This example has a random selection
of flavors stored in a dataframe called favorites.
flavor
1 van
2 van
3 van
4 choc
5 van
6 choc
Additionally, the data was summarized.
tabled.favorites <- data.frame(d = table(favorites$flavor))
tabled.favorites
d.Var1 d.Freq
1 choc 94
2 straw 16
3 van 90
The geom_bar() function can create a bar chart from either type of data. By default, it expects data that still needs to be
tallied. But, adding the stat = "identity" argument indicates the data is already tallied and can be used as is.
bar1 <- ggplot(data = favorites, mapping = aes( x = flavor, fill = flavor)) +
geom_bar()

bar2 <- ggplot(data = tabled.favorites, mapping = aes( y = d.Freq, x=d.Var1, fill = d.Var1)) +
geom_bar(stat = "identity", position ="dodge" )

(bar1 | bar2)
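A small self-contained sketch of the two cases, using a made-up set of six flavors in place of favorites. It also shows geom_col(), which ggplot2 provides for data that is already tallied. layer_data() is used to confirm that the bar heights agree.

```r
library(ggplot2)

# Untallied data: one row per observation.
raw <- data.frame(flavor = c("van", "van", "choc", "straw", "van", "choc"),
                  stringsAsFactors = FALSE)

# Tallied data: one row per flavor, with a Freq column of counts.
tallied <- as.data.frame(table(flavor = raw$flavor))

bar.raw     <- ggplot(raw, aes(x = flavor)) + geom_bar()
bar.tallied <- ggplot(tallied, aes(x = flavor, y = Freq)) +
  geom_bar(stat = "identity")
bar.col     <- ggplot(tallied, aes(x = flavor, y = Freq)) + geom_col()

layer_data(bar.raw)$count    # heights computed by counting rows
layer_data(bar.tallied)$y    # heights taken directly from Freq
```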

Continuous Data
Several different types of graphics can be made from a single continuous variable. The simulated data here comes from
three different probability distributions. All three types are stacked in a single column of the dataframe dat. A secondary
variable indicates the source distribution for each.
sample.size <- 1000
x.data <- rnorm(n = sample.size, mean = 0, sd = 1)
y.data <- rchisq(n = sample.size, df = 1)
z.data <- runif(n = sample.size, min = 4, max = 6)
data.values <- c(x.data, y.data, z.data)
source.distribution <- rep(c("Normal","Chi-Square","Uniform"), each = sample.size)
dat <- data.frame( data.values, source.distribution)
The geom_area() function will create a frequency polygon, and fill in the enclosed regions with a color. If the fill option is
used as part of the aesthetic mapping, the frequency of each type of data is graphed in the vertical direction. It should be
noticed that the overall shape of the first two graphs is exactly the same. This is because the accumulated frequencies in
each group are stacked on top of each other. The position argument controls this behavior. Its default setting is "stack".
To compare the individual distributions in the same graphic, the position argument needs to change. Setting it to "identity"
and reducing the opacity (alpha) allows the individual distributions to be seen. The binwidth argument controls the
size of the data groupings.
ggplot(data = dat) +
geom_area( mapping = aes(x = data.values), stat= "bin", binwidth = .5, color = "black", fill = "yellow")
ggplot(data = dat) +
geom_area( mapping = aes(x = data.values, fill = source.distribution), stat= "bin", binwidth = .5, alpha = .5, position =
"stack", show.legend = FALSE, color = "black")

ggplot(data = dat) +
geom_area( mapping = aes(x = data.values, fill = source.distribution), stat= "bin", binwidth = .1, alpha = .5, position =
"identity", show.legend = FALSE)

The geom_histogram() function will create a histogram. If the color argument is set inside the aesthetic mapping, the
three distributions will be separated into their own histograms. Otherwise, the data will be taken as a single set. The
binwidth argument controls the groupings. alpha controls transparency. The geom_freqpoly() function creates a
frequency polygon. It has many of the same arguments as geom_histogram().
ggplot(data = dat) +
geom_histogram( aes(x = data.values, color = source.distribution ), alpha = .05 , binwidth = .5, position = "identity")
ggplot(data = dat) +
geom_freqpoly( aes(x = data.values, color = source.distribution ), alpha = .95 , binwidth = .5)

Quantile-Quantile plots are created using the geom_qq() function. geom_qq_line() will add a reference line for a selected
distribution. The default is a normal distribution. The distribution and dparams arguments can be used to look at other
distributions. The distribution should be set to a quantile function in the stats package.
ggplot(data = dat, aes(sample = data.values, color = source.distribution )) +
geom_qq( alpha = .2 , distribution = stats::qnorm, dparams = list( mean= 0, sd = 1 )) +
geom_qq_line( distribution = stats::qnorm, dparams = list( mean= 0, sd = 1 ) )
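As a sketch of looking at another distribution, the lines below compare a simulated uniform sample with the quantiles of a Uniform(4, 6) distribution. The data set unif.dat and the seed are made up for the example.

```r
library(ggplot2)

set.seed(5040)   # arbitrary seed so the sketch is reproducible
unif.dat <- data.frame(data.values = runif(200, min = 4, max = 6))

# Compare the sample with Uniform(4, 6) quantiles instead of normal quantiles.
qq.unif <- ggplot(unif.dat, aes(sample = data.values)) +
  geom_qq(distribution = stats::qunif, dparams = list(min = 4, max = 6)) +
  geom_qq_line(distribution = stats::qunif, dparams = list(min = 4, max = 6))

# The theoretical quantiles are plotted on the horizontal axis.
theoretical <- layer_data(qq.unif, 1)$x
range(theoretical)
```

When the sample really comes from the chosen distribution, the points should fall close to the reference line.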


Chapter 28: ggplot2: Two Variables
When looking at two or more numerical variables, several options are available for augmenting a scatterplot. The simulated
dataset contains four variables. Two are meant to illustrate a strong linear relationship. The third, dist.from.origin, is a
measure of a point's distance from the origin. The final variable is used to break the data points into groups.
x.data <- c(1:10, 3:7, 4:6)
y.data <- 2 * x.data
dist.from.origin <- sqrt(x.data^2 + y.data^2)
dat <- data.frame(x.data, y.data, dist.from.origin, AB = rep(c("A", "B"), times = 9))
head(dat, n = 5)
  x.data y.data dist.from.origin AB
1      1      2         2.236068  A
2      2      4         4.472136  B
3      3      6         6.708204  A
4      4      8         8.944272  B
5      5     10        11.180340  A
The first modification of a scatterplot comes from using the geom_jitter() function. The function adds a small random
shift to each data point. This can be handy when there are multiple observations of the same point. geom_jitter() can make
the multiplicity of points visible. The width and height arguments put limits on the range of the jitter.
In the simulated data, the points with x-coordinates four, five, and six are repeated three times. In the graph on the left,
there is no indication of this. The graph on the right uses geom_jitter() to make this repetition visible.
ggplot(data = dat, mapping = aes(x = x.data, y = y.data) ) +
geom_point( mapping = aes( color = dist.from.origin, shape = AB ), size = 4 )

ggplot(data = dat, mapping = aes(x = x.data, y = y.data) ) +
geom_jitter(color = "red", width = 0, height = 1, size = 4, alpha = .5)
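The overlap can also be checked numerically before plotting. The sketch below counts the repeated x-coordinates and then jitters in both directions; the object names (jit.dat, x.counts, p.jit), the jitter amounts of 0.2, and the seed are assumptions chosen for illustration.

```r
library(ggplot2)

# Rebuild the simulated coordinates so the example is self-contained.
jit.dat <- data.frame(x.data = c(1:10, 3:7, 4:6))
jit.dat$y.data <- 2 * jit.dat$x.data

# Count how often each x-coordinate occurs; counts above 1 are overplotted.
x.counts <- table(jit.dat$x.data)
x.counts

# position_jitter() with a seed makes the random shifts reproducible.
p.jit <- ggplot(data = jit.dat, mapping = aes(x = x.data, y = y.data)) +
  geom_point(position = position_jitter(width = 0.2, height = 0.2, seed = 42),
             color = "red", size = 4, alpha = .5)
p.jit
```

Fixing the seed is a design choice: without it, the points land in slightly different places each time the plot is drawn.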

As constructed, the simulated data set would not illustrate the rest of these scatterplot modifications very well, so a small
amount of randomness will be added to y.data.
dat$y.data <- y.data + rnorm(18, mean = 0, sd = 0.5)
The geom_rug() function adds a set of ticks to the margins of the scatterplot. These are meant to give an indication of
the marginal distribution of each variable. The sides argument controls where the ticks appear. It is a string containing any
of the letters "l", "r", "t", and "b", for the left, right, top, and bottom of the graph.
ggplot(data = dat, mapping = aes(x = x.data, y = y.data) ) +
geom_point( mapping = aes( color = dist.from.origin, shape = AB ), size = 2 ) +
geom_rug( sides="bltr" )
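For comparison, a minimal sketch that draws the rug on the bottom and left margins only. The data frame rug.dat and the seed are illustrative assumptions that mimic the simulated data above.

```r
library(ggplot2)

# Hypothetical stand-in for the simulated data, with noise added to y.
set.seed(2021)
rug.dat <- data.frame(x = c(1:10, 3:7, 4:6))
rug.dat$y <- 2 * rug.dat$x + rnorm(18, mean = 0, sd = 0.5)

# sides = "bl" places ticks along the bottom (x values) and left (y values).
p.rug <- ggplot(data = rug.dat, mapping = aes(x = x, y = y)) +
  geom_point(size = 2) +
  geom_rug(sides = "bl")
p.rug
```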

The geom_smooth() function adds a smooth fitted curve. Several methods for producing the smoothed curve are
available: "lm", "glm", "gam", and "loess". In this example, an eighth-degree polynomial was fit. Additionally, confidence intervals
can be displayed using the se argument. The confidence level is set using the level argument.
ggplot(data = dat, mapping = aes(x = x.data, y = y.data) ) +
geom_point( mapping = aes( color = dist.from.origin, shape = AB ), size = 2 ) +
geom_smooth( method = "lm", formula = y ~ poly(x, 8), se = TRUE, color = "red" , level = .99)
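With only ten distinct x values, a high-degree polynomial may follow the noise quite closely. As a hedged comparison, the sketch below overlays a straight-line fit and a loess curve on the same kind of data. The data frame smooth.dat, the seed, and the colors are assumptions for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for the simulated data, with noise added to y.
set.seed(3040)
smooth.dat <- data.frame(x = c(1:10, 3:7, 4:6))
smooth.dat$y <- 2 * smooth.dat$x + rnorm(18, mean = 0, sd = 0.5)

p.smooth <- ggplot(data = smooth.dat, mapping = aes(x = x, y = y)) +
  geom_point(size = 2) +
  # Straight-line fit with a 95% confidence band.
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, level = .95, color = "red") +
  # Local regression (loess) without a confidence band.
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, color = "blue")
p.smooth
```

Setting se = FALSE on one layer keeps the two curves easy to tell apart.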

The geom_ribbon() function adds an interval around each point. The ends of adjacent intervals are then linearly
connected to create the ribbon. The width of the ribbon is determined by the ymin and ymax arguments. The xmin and
xmax arguments can be used to control the ribbon in the horizontal direction. In this example, the width depends on the
distance from the origin. Larger distances translate into a wider ribbon.
ggplot(data = dat, mapping = aes(x = x.data, y = y.data) ) +
geom_point( mapping = aes( color = dist.from.origin, shape = AB ), size = 2 ) +
geom_ribbon(mapping = aes( ymin = y.data-dist.from.origin/2, ymax = y.data+ dist.from.origin/2 ), alpha =.5)
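A simpler case may help show how ymin and ymax define the ribbon. The sketch below draws a ribbon of constant width around the line y = 2x; the data frame rib.dat and the half-width of 1 are assumptions for illustration.

```r
library(ggplot2)

# Hypothetical data: one point per x value on the line y = 2x.
rib.dat <- data.frame(x = 1:10)
rib.dat$y <- 2 * rib.dat$x

# The ribbon spans one unit below and one unit above each point.
p.rib <- ggplot(data = rib.dat, mapping = aes(x = x, y = y)) +
  geom_ribbon(mapping = aes(ymin = y - 1, ymax = y + 1), alpha = .5) +
  geom_point(size = 2)
p.rib
```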

Individual points can be labeled using either the geom_text() or the geom_label() function. There are two arguments to
nudge the label's position: nudge_x and nudge_y. geom_label() draws a rectangle around the label to make it easier to read.
These functions do not need to label every data point. If a layer is given a different dataset, it can label a different set of
data points, or only a select few.
ggplot(data = dat, mapping = aes(x = x.data, y = y.data) ) +
geom_point( mapping = aes( color = dist.from.origin, shape = AB ), size = 2 ) +
geom_label(mapping = aes( label = AB), nudge_x = .25, nudge_y = .25 )

ggplot(data = dat, mapping = aes(x = x.data, y = y.data) ) +
geom_point( mapping = aes( color = dist.from.origin, shape = AB ), size = 2 ) +
geom_text(mapping = aes( label = AB), color = "red", nudge_x = c(.5,0), nudge_y = c(.5,.5), size = 4 )
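To label only a select few points, a layer can be given its own data argument. The sketch below labels just the points farther than 15 units from the origin; the cutoff of 15 and the names lab.dat and far.points are assumptions for illustration.

```r
library(ggplot2)

# Rebuild the simulated data so the example is self-contained.
lab.dat <- data.frame(x.data = c(1:10, 3:7, 4:6))
lab.dat$y.data <- 2 * lab.dat$x.data
lab.dat$dist.from.origin <- sqrt(lab.dat$x.data^2 + lab.dat$y.data^2)

# Keep only the points that should receive a label.
far.points <- subset(lab.dat, dist.from.origin > 15)

# The label layer uses far.points, while the point layer uses all of lab.dat.
p.lab <- ggplot(data = lab.dat, mapping = aes(x = x.data, y = y.data)) +
  geom_point(size = 2) +
  geom_label(data = far.points,
             mapping = aes(label = round(dist.from.origin, 1)),
             nudge_y = 1)
p.lab
```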

