Data Types in R
“Truth is ever to be found in simplicity, and not in the multiplicity and confusion of things.”—Isaac Newton
I approached R in the same way I would any language. I immediately delve into for-loops, conditional statements, user-defined functions, classes, and so on. I didn’t pay much attention to data types at first - assuming they’re not much different than what I’ve seen already. I found myself using dataframes and matricies often with low confidence and a lingering confusion. I needed to know how these R data structures were related. I finally created these notes for myself to get a grip on the topic. Hopefully you find value in them as well.
The data structures we will cover:
For each data type, we will review the basics of:
- Creation
- Adding Element
- Deleting Elements
- Indexing
- Filtering
- and More
Vector
Introduction
All elements in an R vector must have the same mode: integer, numeric, character, logical, complex, etc.
Creation
x <- c(88, 12, 23, 74)
x
## [1] 88 12 23 74
Adding Element
Adding -44 to vector x
:
x <- c(x,-44)
x
## [1] 88 12 23 74 -44
or:
x[5] <- -44
x
## [1] 88 12 23 74 -44
Remove Element
Remove 23 from x
:
x <- x[-3]
x
## [1] 88 12 74 -44
It’s possible to remove several items at once:
x <- x[-3:-5]
x
## [1] 88 12
Indexing
x <- rep(1,10)
x[4] <- 3
x
## [1] 1 1 1 3 1 1 1 1 1 1
x[4]
## [1] 3
Filtering
x[6] <- 5
x[9] <- 2
x[x > 2]
## [1] 3 5
Combining Vectors
Find the length of a vector with length(x)
:
When adding two vectors, the lengths of the vectors must be the same or one must be a multiple length of the other. When a vector isn’t long enough to add to another vectors, it will keep repeating itself however many times it needs in order for the lengths to match.
y <- x + x; y
## [1] 2 2 2 6 2 10 2 2 4 2
z <- x + c(1,2,3,4,5); z
## [1] 2 3 4 7 6 6 3 4 6 6
error <- x + c(1,2,3,4); error
## Warning in x + c(1, 2, 3, 4): longer object length is not a multiple of
## shorter object length
## [1] 2 3 4 7 2 7 4 5 3 3
Matrix
Introduction
A matrix is essentially a vector with two attributes. All the columns in a matrix must have the same mode: integer, numeric, character, logical, complex, etc. in the same way it does for a vector. Matricies are special cases of a more general R type of object: arrays - which we will read about next. Arrays can be multidimensional.
Creation
One way to create a matrix:
y <- matrix(c(1,2,3,4), nrow = 2, ncol = 2)
or simply:
y <- matrix(c(1,2,3,4), nrow = 2)
y
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
Using the byrow
argument (default = FALSE):
m <- matrix(c(1,2,3,4,5,6), nrow = 2, byrow = T)
m
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
Adding and Removing Rows and Columns
Rows and columns may be added and deleting from a matrix with operations analogous to the vector operations of adding and deleting. These functions are rbind
and cbind
.
Adding a column:
ones_column <- matrix(rep(1,2)); ones_column; m
## [,1]
## [1,] 1
## [2,] 1
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
cbind(m, ones_column)
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 1
## [2,] 4 5 6 1
Adding a row: (don’t forgot to adjust the row number: nrow = 1
)
ones_row <- matrix(rep(1,3), nrow = 1); ones_row; m
## [,1] [,2] [,3]
## [1,] 1 1 1
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
rbind(ones_row, m)
## [,1] [,2] [,3]
## [1,] 1 1 1
## [2,] 1 2 3
## [3,] 4 5 6
Rows may be added by creating matricies and copying:
new_matrix <- matrix(nrow = 3, ncol = 3)
addded_row <- matrix(c(7,8,9), nrow = 1)
new_matrix[1:2,1:3] <- m
new_matrix[3,1:3] <- addded_row
new_matrix
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
You can use rbind
and cbind
to reassign values. This is a form of deleting data.
m <- matrix(1:6, nrow = 3); m
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
m <- m[c(1,3),]; m
## [,1] [,2]
## [1,] 1 4
## [2,] 3 6
Indexing
To retrieve information from a matrix:
m[,2]
## [1] 4 6
m[2,]
## [1] 3 6
m[2,2]
## [1] 6
Values may be changed in a matrix as well:
m[2,2] <- 66; m
## [,1] [,2]
## [1,] 1 4
## [2,] 3 66
Filtering
x <- matrix(c(1,2,3,2,3,4), nrow = 3, byrow = F); x
## [,1] [,2]
## [1,] 1 2
## [2,] 2 3
## [3,] 3 4
x[x[,2] >= 3]
## [1] 2 3 3 4
j <- x[,2] >= 3
x[j,]
## [,1] [,2]
## [1,] 2 3
## [2,] 3 4
Matrix Math
y
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
Mathematical Matrix Multiplication
y %*% y
## [,1] [,2]
## [1,] 7 15
## [2,] 10 22
Mathematical Muliplication of Matrix by Scalar
3*y
## [,1] [,2]
## [1,] 3 9
## [2,] 6 12
Mathematical Matrix Addition
y + y
## [,1] [,2]
## [1,] 2 6
## [2,] 4 8
Array
Introduction
The mechanics of an array is very similar to that of a matrix in R. Unlike a matrix, an array can represent data in higher than two dimensions. We may build a three-dimensional array by conbining two matricies, we can build four-dimensional arrays by combining two or more three-dimensional arrays, and so on.
List
Introduction
List are unique in that not all elements have to be of the same mode. List structures can combine different types. An R list is similar to a Python dictionary or C struct. List form the foundation for data frames, object oriented programming (R classes), and more.
Creation
If we wanted to create an employee database, we could start with:
j <- list(name = "Eric", salary = 45000, union = T)
j
## $name
## [1] "Eric"
##
## $salary
## [1] 45000
##
## $union
## [1] TRUE
The component names are called tags.
Adding Element
New components can be added after a list is created:
z <- list(a = "abc", b = 12)
z
## $a
## [1] "abc"
##
## $b
## [1] 12
z$c <- "sailing" # add a c component
z
## $a
## [1] "abc"
##
## $b
## [1] 12
##
## $c
## [1] "sailing"
Adding component can also be done via a vector index:
z[[4]] <- 28
z[5:7] <- c(F,T,T)
z
## $a
## [1] "abc"
##
## $b
## [1] 12
##
## $c
## [1] "sailing"
##
## [[4]]
## [1] 28
##
## [[5]]
## [1] FALSE
##
## [[6]]
## [1] TRUE
##
## [[7]]
## [1] TRUE
You can also concatenate lists:
cat <- c(list("Joe", 55000, T), list(5)); cat
## [[1]]
## [1] "Joe"
##
## [[2]]
## [1] 55000
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] 5
Remove Element
You can delete a list component by setting it equal to NULL
:
z$b <- NULL
z
## $a
## [1] "abc"
##
## $c
## [1] "sailing"
##
## [[3]]
## [1] 28
##
## [[4]]
## [1] FALSE
##
## [[5]]
## [1] TRUE
##
## [[6]]
## [1] TRUE
Indexing
You can access a list component in several different ways:
j$salary
## [1] 45000
j[["salary"]]
## [1] 45000
j[[2]]
## [1] 45000
What’s the deal with the single and double brackets?
If single brackets are used, the result is another list - a sublist of the original.
j1 <- j[1:2]; j1
## $name
## [1] "Eric"
##
## $salary
## [1] 45000
If double brackets are used, it is for referring to a single component and is return in the type of the component.
j[[2]]
## [1] 45000
The following returns an error since it’s trying to return several components using a function that is meant to return one:
# j[[1:2]]
Filtering
Accessing list components:
names(j)
## [1] "name" "salary" "union"
We can also get the specific values instead:
ulj <- unlist(j); ulj
## name salary union
## "Eric" "45000" "TRUE"
Each values above has a name. This name may be removed with the following function:
names(ulj) <- NULL
ulj
## [1] "Eric" "45000" "TRUE"
Using lapply()
and sapply()
functions
This applies a specific function on each of the compoenents of a list and returns another list:
lapply(list(1:3,25:29), median)
## [[1]]
## [1] 2
##
## [[2]]
## [1] 27
sapply()
returns a vector-valued answer:
sapply(list(1:3,25:29), median)
## [1] 2 27
Recursive Lists
You can have lists within lists:
b <- list(u = 5, v = 12)
c <- list(w = 13)
a <- list(b, c)
a
## [[1]]
## [[1]]$u
## [1] 5
##
## [[1]]$v
## [1] 12
##
##
## [[2]]
## [[2]]$w
## [1] 13
TIP: The concatenate function c() has an optional argument recursive
, which controls whether flattening occurs when recursive lists are combined.
Dataframe
Introduction
Data frames are similar to a two dimensional matrix in that it contains rows and columns structure. However, data frame are heterogeneous; columns can be different modes. Technically, a data frame is a list whose components are equal-lengthed vectors as the columns of the data frame. Data frame are commonly used when doing data manipulation and other data analysis techniques in R.
Creation
Creating a data frame from scratch:
scientists <- c("Einstein", "Newton")
born <- c(1879, 1642)
d <- data.frame(scientists, born, stringsAsFactors = FALSE)
d
## scientists born
## 1 Einstein 1879
## 2 Newton 1642
If the named argument stringsAsFactors
is not specified, then by default, stringsAsFactors
will be TRUE.
Data frames can also be created from external files (.csv, .mtp, .xls, .spss, .txt) using:
mydata = read.csv("mydata.csv", header = TRUE)
mydata = read.mtp("mydata.mtp") # read from .mtp file
mydata = read.xls("mydata.xls") # read from first sheet
mydata = read.spss("myfile", to.data.frame=TRUE)
mydata = read.table("mydata.txt")
and many more options.
Adding Element
The rbind()
and cbind()
matrix functions also work in data frames to add new rows or columns of the same length.
Adding a new row:
d1
## kids ages
## 1 jack 12
## 2 Jill 10
rbind(d1, list("laura", 19))
## kids ages
## 1 jack 12
## 2 Jill 10
## 3 laura 19
Adding a column
Remove Element
Data deletion in a data frame is similar to that of a vector.
d2
## kids ages
## 1 jack 12
## 2 Jill 10
## 3 laura 19
d2 <- d2[-2,]
d2
## kids ages
## 1 jack 12
## 3 laura 19
Indexing
d[[1]]
## [1] "Einstein" "Newton"
d$scientists
## [1] "Einstein" "Newton"
We may also access elements in a matrix-like way we well:
d[,1]
## [1] "Einstein" "Newton"
It can be helpful to know the structure of the data frame and is easy to achieve:
str(d)
## 'data.frame': 2 obs. of 2 variables:
## $ scientists: chr "Einstein" "Newton"
## $ born : num 1879 1642
Filtering
Let’s take a look at how to filter data in a data frame:
cars <- cars[c("mpg", "hp", "wt","cyl")]
head(cars)
## mpg hp wt cyl
## Mazda RX4 21.0 110 2.620 6
## Mazda RX4 Wag 21.0 110 2.875 6
## Datsun 710 22.8 93 2.320 4
## Hornet 4 Drive 21.4 110 3.215 6
## Hornet Sportabout 18.7 175 3.440 8
## Valiant 18.1 105 3.460 6
cars[cars$cyl == 8,]
## mpg hp wt cyl
## Hornet Sportabout 18.7 175 3.440 8
## Duster 360 14.3 245 3.570 8
## Merc 450SE 16.4 180 4.070 8
## Merc 450SL 17.3 180 3.730 8
## Merc 450SLC 15.2 180 3.780 8
## Cadillac Fleetwood 10.4 205 5.250 8
## Lincoln Continental 10.4 215 5.424 8
## Chrysler Imperial 14.7 230 5.345 8
## Dodge Challenger 15.5 150 3.520 8
## AMC Javelin 15.2 150 3.435 8
## Camaro Z28 13.3 245 3.840 8
## Pontiac Firebird 19.2 175 3.845 8
## Ford Pantera L 15.8 264 3.170 8
## Maserati Bora 15.0 335 3.570 8
cars[,c("mpg", "hp")][cars$wt <= 4,]
## mpg hp
## Mazda RX4 21.0 110
## Mazda RX4 Wag 21.0 110
## Datsun 710 22.8 93
## Hornet 4 Drive 21.4 110
## Hornet Sportabout 18.7 175
## Valiant 18.1 105
## Duster 360 14.3 245
## Merc 240D 24.4 62
## Merc 230 22.8 95
## Merc 280 19.2 123
## Merc 280C 17.8 123
## Merc 450SL 17.3 180
## Merc 450SLC 15.2 180
## Fiat 128 32.4 66
## Honda Civic 30.4 52
## Toyota Corolla 33.9 65
## Toyota Corona 21.5 97
## Dodge Challenger 15.5 150
## AMC Javelin 15.2 150
## Camaro Z28 13.3 245
## Pontiac Firebird 19.2 175
## Fiat X1-9 27.3 66
## Porsche 914-2 26.0 91
## Lotus Europa 30.4 113
## Ford Pantera L 15.8 264
## Ferrari Dino 19.7 175
## Maserati Bora 15.0 335
## Volvo 142E 21.4 109
Factor
Introduction
The motivation for factors comes from the concept of categorical data in statistics. An R factor
may be viewed as a vector with more information added. The extra information consists of a record of the distinct values on that vector, called levels.
Creation
x <- c(5, 12, 13, 12)
xf <- factor(x)
xf
## [1] 5 12 13 12
## Levels: 5 12 13
The distinct values in xf: 5, 12, and 13 are the levels
str(xf)
## Factor w/ 3 levels "5","12","13": 1 2 3 2
unclass(xf)
## [1] 1 2 3 2
## attr(,"levels")
## [1] "5" "12" "13"
length(xf)
## [1] 4
Adding Element
Future new levels can be anticipated as well:
x <- c(5, 12, 13, 12)
xff <- factor(x, levels = c(5, 12, 13, 88))
xff
## [1] 5 12 13 12
## Levels: 5 12 13 88
xff[2] <- 88
xff
## [1] 5 88 13 12
## Levels: 5 12 13 88
Although you cannot add a value that doesn’t have a level associated with it:
xff[2] <- 28
## invalid factor level, NA generated
### <span style="color:#E74C3C">Remove Element</span>
### <span style="color:#E74C3C">Indexing</span>
### <span style="color:#E74C3C">Filtering</span>
### <span style="color:#E74C3C">Math</span>