DSME6040G: Business Analytics
and Digital Innovation
Lecture 2
Data Structure
Review: Vectors
• A vector stores an ordered set of values called elements.
• A vector can contain any number of elements, but all of the
elements must be of the same type
– E.g., a vector cannot contain both numbers and text
– To determine the type of vector v, use the command typeof(v)
– Commonly used types:
• integer (whole numbers, e.g., 1L, 2L, 3L; a bare 1 is stored as a double)
• double (numbers with decimals, e.g., 1.1, 2.2)
• character (text data)
• logical (TRUE or FALSE values)
– Special values:
• NULL (indicating the absence of any value)
• NA (indicating a missing value)
• The vector provides the foundation for many other R data
structures.
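As a quick check of the types listed above, the following sketch calls typeof() on a few small vectors. The variable name mixed is only illustrative and not part of the lecture.

```r
# typeof() reports the storage type of a vector
typeof(c(1L, 2L, 3L))   # "integer" (the L suffix requests integers)
typeof(c(1, 2, 3))      # "double"  (bare numbers are doubles)
typeof("text")          # "character"
typeof(TRUE)            # "logical"
typeof(NULL)            # "NULL"
typeof(NA)              # "logical" (a plain NA is a logical missing value)

# Mixing numbers and text does not fail; R coerces everything to character
mixed <- c(1, "a")
typeof(mixed)           # "character"
```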
Outline
• R Data Structure
– … (Vectors)
– Factors
– Lists
– Data Frames
• Saving, Loading, and Removing
• R Programming
Factors and Levels
• A factor is a special type of vector, normally used
to hold categorical or ordinal variables.
• Why not use character vectors?
– Category labels are stored only once:
• Character: store Pen, Pencil, Pen
• Factor: store the integer codes 1, 2, 1 plus the labels Pen and Pencil once each (hence reducing memory use)
– Many algorithms treat nominal and numeric data
differently
• Coding as factors is often needed to inform an R function to
treat categorical data appropriately
• The levels attribute comprises the set of possible
categories the factor can take (e.g., Pen or Pencil)
Factors and Levels
• Factors and levels
– X <- c("AB", "A", "B", "A")
– (Xf <- factor(X))
– Levels are ordered alphabetically by default
– str(Xf) # Internally, factors are represented by integers 1, 2, 3, …
– levels(Xf)
– table(Xf) # table(X) also works
• One can anticipate future levels, but one cannot
sneak in an “illegal” level
– (Xf <- factor(X, levels = c("A", "B", "AB", "O")))
– Xf[2] <- "O" # works
– Xf[2] <- "C" # warning: invalid level, so Xf[2] becomes NA
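The factor behaviour described above can be sketched as follows. The names Xf2 and codes are illustrative, and suppressWarnings() is used only to keep the output quiet when the invalid level is assigned.

```r
X <- c("AB", "A", "B", "A")

# Default: levels are the sorted unique values
Xf <- factor(X)
levels(Xf)               # "A"  "AB" "B"
codes <- as.integer(Xf)  # internal integer codes: 2 1 3 1
table(Xf)                # counts per level

# Declaring levels up front allows a level ("O") that is not yet in the data
Xf2 <- factor(X, levels = c("A", "B", "AB", "O"))
Xf2[2] <- "O"            # works: "O" is a declared level

# Assigning an undeclared level produces a warning and stores NA
suppressWarnings(Xf2[2] <- "C")
is.na(Xf2[2])            # TRUE
```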
Ordinal Data (Ordered Factors)
• Indicate the presence of ordinal data by
setting the ordered parameter to TRUE
• Compare:
– (quality1 <- factor(c("Average", "Good", "Good", "Bad"),
levels = c("Bad", "Average", "Good")))
– (quality2 <- factor(c("Average", "Good", "Good", "Bad"),
levels = c("Bad", "Average", "Good"),
ordered = TRUE))
• Logical tests work for ordered factors
– quality1 > "Average" # warning: '>' is not meaningful for unordered factors; returns NA
– quality2 > "Average" # works
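A minimal sketch of the comparison above, using the same quality ratings. The names ratings, above_avg, and unordered_result are illustrative.

```r
ratings <- c("Average", "Good", "Good", "Bad")

quality1 <- factor(ratings, levels = c("Bad", "Average", "Good"))
quality2 <- factor(ratings, levels = c("Bad", "Average", "Good"),
                   ordered = TRUE)

# Ordered factor: the comparison follows the level order Bad < Average < Good
above_avg <- quality2 > "Average"
above_avg                      # FALSE TRUE TRUE FALSE

# Unordered factor: R warns and returns NA for every element
unordered_result <- suppressWarnings(quality1 > "Average")
all(is.na(unordered_result))   # TRUE

# Order-based summaries also work on ordered factors
as.character(max(quality2))    # "Good"
```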
Differences Between Nominal and
Numerical Data
• Try:
– xNum <- c(1, 2, 3, 1)
– yNum <- as.factor(xNum)
– xNum[2]-xNum[1]
– yNum[2]-yNum[1]
– summary(xNum)
– summary(yNum)
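For reference, here is a sketch of what the exercise above shows. The vector prices is a hypothetical addition, used to show the difference between a factor's integer codes and its original values.

```r
xNum <- c(1, 2, 3, 1)
yNum <- as.factor(xNum)

xNum[2] - xNum[1]          # 1: arithmetic works on numbers

# Arithmetic on a factor gives a warning and NA
fac_diff <- suppressWarnings(yNum[2] - yNum[1])
is.na(fac_diff)            # TRUE

summary(xNum)              # min, quartiles, mean, max
summary(yNum)              # counts per level: two 1s, one 2, one 3

# Hypothetical example: codes are not the original values
prices <- factor(c(10, 20, 10))
as.integer(prices)                  # 1 2 1 (internal codes)
as.numeric(as.character(prices))    # 10 20 10 (original values)
```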
Lists
• Like a vector, a list is used for storing an ordered
set of elements.
• Unlike a vector, a list allows elements of
different types.
• Due to this flexibility, lists are often used to store
various types of input and output and sets of
configuration parameters
– For example, it can be used to record a customer’s
information: name, id, income, membership, ...
Lists
• Creating a list
– (course1 <- list(dept = "DSME",
                   course_id = "6040G",
                   semester = factor(3, levels = c(1, 2, 3)),
                   c_type = factor("elective",
                                   levels = c("elective", "required"),
                                   ordered = TRUE)))
• Each component is almost always given a name
(called a tag)
– The names are not technically required
– Useful for indexing (by name rather than by numbered position)
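The sketch below builds the same list and inspects its tags. The list untagged is a hypothetical counterexample showing that names are optional.

```r
course1 <- list(dept = "DSME",
                course_id = "6040G",
                semester = factor(3, levels = c(1, 2, 3)),
                c_type = factor("elective",
                                levels = c("elective", "required"),
                                ordered = TRUE))

names(course1)   # "dept" "course_id" "semester" "c_type"
str(course1)     # compact view of each component and its type

# Names are optional: this list can only be indexed by position
untagged <- list("DSME", "6040G")
names(untagged)  # NULL
```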
Indexing
• Single-bracket: returns another list object
– course1[3]
– course1["semester"]
– class(course1[3])
• Double-bracket: returns a single list item in its native
data type
– course1[[3]]
– course1[["semester"]]
– class(course1[[3]])
• Or append a $ and the component’s name
– course1$semester
– class(course1$semester)
• It is generally considered clearer and less
error-prone to use names instead of numeric indices.
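To make the difference between the three indexing forms concrete, this sketch uses a shortened version of course1 with only three components.

```r
course1 <- list(dept = "DSME",
                course_id = "6040G",
                semester = factor(3, levels = c(1, 2, 3)))

# Single bracket: the result is still a list (of length 1)
class(course1[3])             # "list"
class(course1["semester"])    # "list"

# Double bracket and $: the result is the component itself
class(course1[[3]])           # "factor"
class(course1[["semester"]])  # "factor"
class(course1$semester)       # "factor"

# Position and name reach the same component
identical(course1[[3]], course1$semester)   # TRUE
```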
General list operations
• Add elements: new components can be
added after a list is created
– course1$Exam <- "two" # or course1[["Exam"]] <- "two"
– course1[6:7] <- c(TRUE, FALSE)
• Delete elements (different from a vector):
– course1[5:7] <- NULL
• Concatenate:
– c(list1, list2)
• List as a component of a list:
– list(list1, list2)
• To flatten a list into a vector of its values, use unlist()
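The operations above can be sketched on a small list. Here course1 starts with only two components, so the positions differ from the slide, and list1 and list2 are hypothetical examples.

```r
course1 <- list(dept = "DSME", course_id = "6040G")

# Add components by name or by position
course1$Exam <- "two"
course1[4:5] <- c(TRUE, FALSE)
n_after_add <- length(course1)       # 5

# Delete components by assigning NULL
course1[3:5] <- NULL
n_after_delete <- length(course1)    # 2

list1 <- list(a = 1, b = "x")
list2 <- list(c = TRUE)

# c() concatenates into one flat list; list() nests the two lists
length(c(list1, list2))      # 3
length(list(list1, list2))   # 2

# unlist() flattens to a vector; mixed types are coerced (here to character)
flat <- unlist(list1)
flat                         # named character vector: a = "1", b = "x"
```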
lapply() and sapply()
• List apply: lapply()
– lapply(aList, function) # aList can be a vector, which is coerced to a list
– It calls the specified function on each component/element
of a list and returns another list
• Simplified apply: sapply()
– Works like lapply(), but simplifies the returned list to a vector
or a matrix when possible
– Use sapply() when a vector or matrix, rather than a list, is the
desired form of the output
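A small sketch of the difference between the two functions. The list scores is hypothetical data chosen for illustration.

```r
scores <- list(quiz = c(8, 9, 10), exam = c(70, 85))

# lapply() always returns a list, one component per input component
as_list <- lapply(scores, mean)
class(as_list)            # "list"

# sapply() simplifies to a named vector when each result has length 1
as_vector <- sapply(scores, mean)
as_vector                 # quiz = 9, exam = 77.5

# ... and to a matrix when each result has the same length > 1
as_matrix <- sapply(scores, range)
dim(as_matrix)            # 2 2 (rows: min and max; columns: quiz, exam)
```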
Spreadsheet Format
Data Frame
• A data frame is a list of column vectors/factors of equal length.
• It is the most commonly used data structure and resembles a
spreadsheet table (with both rows and columns)
– Columns: variables (or features or attributes)
– Rows: observations (or examples)
• Create a data frame
– a.data.frame <- data.frame(vector1, vector2,
stringsAsFactors=TRUE)
– If one does not specify stringsAsFactors=TRUE, R will keep
the character vector (hence not a factor)
• The default was TRUE previously but has been changed to FALSE for R 4.0.0+.
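• If vector1 and vector2 do not have the same length, the shorter one is
recycled only when the longer length is an exact multiple of the shorter
length; otherwise data.frame() stops with an error.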
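The following sketch shows the effect of stringsAsFactors and of mismatched lengths. The vectors item and price are hypothetical data.

```r
item  <- c("Pen", "Pencil", "Pen")
price <- c(1.5, 0.5, 2.0)

# Character column kept as character (the default since R 4.0.0)
df_chr <- data.frame(item, price, stringsAsFactors = FALSE)
class(df_chr$item)    # "character"

# Character column converted to a factor
df_fac <- data.frame(item, price, stringsAsFactors = TRUE)
class(df_fac$item)    # "factor"

# Lengths 4 and 2: the shorter vector is recycled
recycled <- data.frame(a = 1:4, b = c("x", "y"))
nrow(recycled)        # 4

# Lengths 3 and 2: not a multiple, so data.frame() raises an error
bad <- tryCatch(data.frame(a = 1:3, b = c("x", "y")),
                error = function(e) "error")
bad                   # "error"
```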
Indexing and Extracting Sub-Data Frames
• Note that a data frame is simply a list of vectors
• Indexing and extracting a sub-data frame:
– As a list (a data frame is still a list; each form returns one column):
• a.df$item1
• a.df[["item2"]]
• a.df[[3]]
– As a two-dimensional data structure:
• a.df[ , 1]
• a.df[2:5, ]
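• Subsetting and handling NA
– subset(aDataFrame, variable1 >= a) # Inside the condition, write
variable1 >= a rather than aDataFrame$variable1 >= a; rows where the
condition is NA are dropped
– complete.cases(aDataFrame) # returns a logical vector: TRUE for rows without any NA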
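The indexing forms and NA helpers above can be tried on a small data frame. The data in a.df are hypothetical; only the column names follow the slide.

```r
a.df <- data.frame(item1 = c("Pen", "Pencil", "Eraser"),
                   item2 = c(1.5, NA, 0.8),
                   item3 = c(10, 20, NA))

# List-style indexing returns a single column as a vector
a.df$item1
a.df[["item2"]]
a.df[[3]]

# Matrix-style indexing: [rows, columns]
a.df[, 1]        # first column
a.df[2:3, ]      # rows 2 and 3, all columns

# subset() looks up item2 inside a.df and drops rows where the test is NA
cheap_or_not <- subset(a.df, item2 >= 1)
nrow(cheap_or_not)            # 1 (only the Pen row)

# complete.cases() flags rows that contain no NA
ok <- complete.cases(a.df)
ok                            # TRUE FALSE FALSE
clean <- a.df[ok, ]
```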
More Operations
• Adding a column:
– a.df$newvar <- aNewVector # a vector with one value per row
– a.df[["newvar"]] <- aNewVector
• Deleting a column:
– Assign NULL to a column to delete it, e.g., a.df$newvar <- NULL
• Generate a new data frame from existing ones
– rbind(): adding observations
– cbind(): adding variables
• Merging data frames
– merge(x,y)
– merge(x, y, by.x = "x.item", by.y = "y.item")
– If a key value appears more than once in either data frame, every
matching combination of rows appears in the result, which can inflate
the row count unexpectedly.
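A minimal sketch of these operations, assuming two hypothetical data frames x and y keyed by an item name.

```r
x <- data.frame(x.item = c("Pen", "Pencil", "Eraser"),
                price  = c(1.5, 0.5, 0.8))
y <- data.frame(y.item = c("Pen", "Pen", "Pencil"),
                store  = c("A", "B", "A"))

# Add a column, then delete it again
x$stock <- c(10, 20, 30)
x$stock <- NULL
ncol(x)                        # 2

# rbind() adds observations; cbind() adds variables
more_rows <- rbind(x, data.frame(x.item = "Ruler", price = 1.2))
more_cols <- cbind(x, stock = c(10, 20, 30))

# merge() keeps only matching keys by default; "Pen" matches two rows
# of y, so it appears twice, and "Eraser" has no match and is dropped
merged <- merge(x, y, by.x = "x.item", by.y = "y.item")
nrow(merged)                   # 3
```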
apply(), lapply() and sapply()
• apply(): applies a function over the rows or columns of a
data frame; because it first converts the data frame to a
matrix, use it only if the columns are all of the same type.
• List apply: lapply()
– lapply(aList, function) # aList can be a vector, which is coerced to a list
– It calls the specified function on each component of a list and
returns another list; for a data frame, each component is a column
• Simplified apply: sapply()
– Works like lapply(), but simplifies the returned list to a vector
or a matrix when possible
– Use sapply() when a vector or matrix, rather than a list, is the
desired form of the output
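Applied to a data frame, the three functions behave as sketched below. The data frame scores is hypothetical and contains only numeric columns so that apply() is safe to use.

```r
scores <- data.frame(quiz = c(8, 9, 10), exam = c(70, 85, 90))

# apply(): second argument 1 means "by row", 2 means "by column"
row_totals <- apply(scores, 1, sum)
unname(row_totals)             # 78 94 100
col_means <- apply(scores, 2, mean)

# lapply(): one list component per column
col_ranges <- lapply(scores, range)
col_ranges$quiz                # 8 10

# sapply(): the same idea, simplified to a named vector
sapply(scores, max)            # quiz = 10, exam = 90
```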
Summary
• Essentially, we learned the following
– Vectors
• A factor is a special vector
– Lists
• A data frame is a special list
• The difference between vectors and lists is
– All elements in a vector must be of the same type, whereas a list can mix types
– The types are:
• integer, double, character, and logical, plus the special values NULL and NA
Saving, Loading, and Removing
R Data Structure
• Save:
– save(student_name, class_final, file =
"mydata.RData")
– The file is saved to the working directory (see getwd())
• Load:
– load("mydata.RData")
• Remove:
– rm(student_name)
• You can use save.image() to write the entire
session to a file simply called .RData.
– By default, R loads this file the next time it starts in the same working directory
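The save, remove, and load cycle can be sketched as follows. The object values are hypothetical, and the file is written to a temporary directory (via tempdir()) rather than the working directory so that the example leaves no files behind.

```r
student_name <- c("Ann", "Bob")
class_final  <- c(88, 92)

# Save both objects into one .RData file
path <- file.path(tempdir(), "mydata.RData")
save(student_name, class_final, file = path)

# Remove them from the session
rm(student_name, class_final)
gone <- !exists("student_name")
gone                          # TRUE

# load() restores the objects under their original names
restored <- load(path)
restored                      # "student_name" "class_final"
student_name                  # "Ann" "Bob"
```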
Read Tables (and especially CSV files)
• Use read.table() or read.csv() to read data into R
– Data is stored in a data frame
– header=TRUE tells R that the first line contains the variable names
– na.strings tells R that a particular character or set of characters
should be treated as a missing element.
– If one does not specify stringsAsFactors=TRUE, R will keep the
character vector (hence not a factor)
• The default was TRUE previously but has been changed to FALSE for R 4.0.0+.
• Example:
– read data from the Internet
• theURL <- "http://www.jaredlander.com/data/Tomato%20First.csv"
• tomato <- read.table(file=theURL, header=TRUE, na.strings="?", sep=",")
• tomato2 <- read.csv(file=theURL, header=TRUE, na.strings="?")
• head(tomato)
• head(tomato2)
• identical(tomato, tomato2)
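The same comparison can be tried without an Internet connection by reading from an in-memory string. The CSV content in csv_text is hypothetical and stands in for the tomato file; the text argument replaces file.

```r
csv_text <- "item,price,rating
Pen,1.5,4
Pencil,?,5
Eraser,0.8,?"

# read.table() needs the separator spelled out; read.csv() assumes a comma
d1 <- read.table(text = csv_text, header = TRUE, sep = ",", na.strings = "?")
d2 <- read.csv(text = csv_text, header = TRUE, na.strings = "?")

head(d1)
is.na(d1$price)       # FALSE TRUE FALSE: the "?" became NA
sum(is.na(d2))        # 2 missing values in total

# Both calls should give the same data frame for this simple input
all.equal(d1, d2)
```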
R Programming
• Blocks are delineated by braces, though
braces are optional if the block consists of just
a single statement. Statements are separated
by newline characters or optionally by
semicolons
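• Loops:
– for (n in x) { statements }
• x is a vector (or a list); n takes the value of each element in turn
– while (condition) { statements }
• Set the initial condition before the loop and update it inside the
loop; otherwise the loop never ends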
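Both loop forms can be sketched as follows. The variable names total, count, and i are illustrative.

```r
# for loop: n takes each element of x in turn
x <- c(2, 4, 6)
total <- 0
for (n in x) {
  total <- total + n
}
total    # 12

# while loop: initialise i before the loop and update it inside
count <- 0
i <- 1
while (i <= 3) {
  count <- count + i
  i <- i + 1      # without this update the loop would never end
}
count    # 6
```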
if-else and ifelse
• if-else statement
if (condition) {
  Do_sth
} else {
  Do_sth_else
}
• When working with vectors, ifelse(condition,
outcome1, outcome2) applies the test to every element at once and
is usually faster than looping with if-else
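The contrast between the two forms can be sketched as follows. The vector income and the threshold of 50 are hypothetical.

```r
income <- c(20, 55, 80)

# if-else: the condition must be a single TRUE or FALSE
if (mean(income) > 50) {
  overall <- "high"
} else {
  overall <- "low"
}
overall     # "high" (the mean is about 51.7)

# ifelse(): the condition is a vector, and so is the result
per_person <- ifelse(income > 50, "high", "low")
per_person  # "low" "high" "high"
```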
Functions
• Write your own functions
– halfmedian <- function(x) {median(x)/2}
– halfmedian <- function(x) {
    y <- median(x)
    z <- y/2
    return(z)
  }
• Anonymous functions (lambda expressions)
– An anonymous function can be passed directly as an argument and does
not need to be declared separately as a named
function
• sapply(aDataFrame, function(x) {median(x)/2})
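Both styles can be run on a small example. The data in aDataFrame are hypothetical.

```r
# Named function
halfmedian <- function(x) {
  median(x) / 2
}
halfmedian(c(2, 4, 10))     # 2 (the median is 4)

aDataFrame <- data.frame(a = c(2, 4, 10), b = c(1, 3, 5))

# Passing the named function to sapply()
by_name <- sapply(aDataFrame, halfmedian)

# Passing an anonymous function with the same body
by_lambda <- sapply(aDataFrame, function(x) { median(x) / 2 })

by_lambda                   # a = 2, b = 1.5
identical(by_name, by_lambda)
```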