Data Types in R

21 May 2020

“Truth is ever to be found in simplicity, and not in the multiplicity and confusion of things.”—Isaac Newton

I approached R the same way I would any language: I immediately delved into for-loops, conditional statements, user-defined functions, classes, and so on. I didn’t pay much attention to data types at first, assuming they weren’t much different from what I’d seen already. I found myself using data frames and matrices often, but with low confidence and a lingering confusion. I needed to know how these R data structures were related, so I created these notes to get a grip on the topic. Hopefully you find value in them as well.

The data structures we will cover:

Vectors
Matrices
Arrays
Lists
Data Frames
Factors
Tables

For each data type, we will review the basics of:

Creation
Adding Element
Deleting Elements
Indexing
Filtering
and More

Vector

Introduction

All elements in an R vector must have the same mode (type): integer, numeric, character, logical, complex, etc. If values of different types are combined, R coerces them to a common type.
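As a small sketch of what "same mode" means in practice, the snippet below combines values of different types with c() and checks the resulting class. The variable names (nums, mixed, flags) are made up for this illustration.

```r
# Mixing types in c() coerces everything to one common type.
nums  <- c(1, 2.5, 3)        # all numeric (double)
mixed <- c(1, "a", TRUE)     # numbers and logicals become character
flags <- c(TRUE, FALSE, 2)   # logicals become numeric

class(nums)    # "numeric"
class(mixed)   # "character"
class(flags)   # "numeric"
mixed          # "1" "a" "TRUE"
```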

Creation

x <- c(88, 12, 23, 74)
x

    ## [1] 88 12 23 74

Adding Element

Adding -44 to vector x:

x <- c(x,-44)
x

    ## [1]  88  12  23  74 -44

or:

x[5] <- -44
x

    ## [1]  88  12  23  74 -44

Remove Element

Remove 23 from x:

x <- x[-3]
x

    ## [1]  88  12  74 -44

It’s possible to remove several items at once (negative indices past the end of the vector, such as -5 here, are simply ignored):

x <- x[-3:-5]
x

    ## [1] 88 12

Indexing

x <- rep(1,10)
x[4] <- 3
x

    ##  [1] 1 1 1 3 1 1 1 1 1 1

x[4]

    ## [1] 3

Filtering

x[6] <- 5
x[9] <- 2
x[x > 2]

    ## [1] 3 5

Combining Vectors

Find the length of a vector with length(x).

When two vectors of different lengths are added, R recycles the shorter one: it repeats its elements as many times as needed to match the length of the longer vector. If the longer length is not a multiple of the shorter length, R still returns a result but issues a warning.

y <- x + x; y

    ##  [1]  2  2  2  6  2 10  2  2  4  2

z <- x + c(1,2,3,4,5); z

    ##  [1] 2 3 4 7 6 6 3 4 6 6

error <- x + c(1,2,3,4); error

    ## Warning in x + c(1, 2, 3, 4): longer object length is not a multiple of
    ## shorter object length

    ##  [1] 2 3 4 7 2 7 4 5 3 3

Matrix

Introduction

A matrix is essentially a vector with two extra attributes: the number of rows and the number of columns. All elements of a matrix must have the same mode (integer, numeric, character, logical, complex, etc.), just as in a vector. Matrices are a special case of a more general R object type, the array, which we will cover next. Arrays can have more than two dimensions.
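To make the "vector with extra attributes" idea concrete, here is a minimal sketch that turns a plain vector into a matrix by setting its dim attribute, which is where R stores the row and column counts. The name v is chosen only for this example.

```r
# A matrix is a vector plus a dim attribute.
v <- 1:6
dim(v) <- c(2, 3)     # give the vector 2 rows and 3 columns

is.matrix(v)          # TRUE
attributes(v)         # $dim: 2 3
v[4]                  # 4: plain vector indexing still works (column-major order)
v[2, 2]               # 4: the same element addressed by row and column
```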

Creation

One way to create a matrix:

y <- matrix(c(1,2,3,4), nrow = 2, ncol = 2)

or simply:

y <- matrix(c(1,2,3,4), nrow = 2)
y

    ##      [,1] [,2]
    ## [1,]    1    3
    ## [2,]    2    4

Using the byrow argument (default = FALSE):

m <- matrix(c(1,2,3,4,5,6), nrow = 2, byrow = TRUE)
m

    ##      [,1] [,2] [,3]
    ## [1,]    1    2    3
    ## [2,]    4    5    6

Adding and Removing Rows and Columns

Rows and columns may be added to and deleted from a matrix with operations analogous to those for vectors. Rows and columns are added with the rbind and cbind functions; they are deleted by indexing.

Adding a column:

ones_column <- matrix(rep(1,2)); ones_column; m

    ##      [,1]
    ## [1,]    1
    ## [2,]    1

    ##      [,1] [,2] [,3]
    ## [1,]    1    2    3
    ## [2,]    4    5    6

cbind(m, ones_column)

    ##      [,1] [,2] [,3] [,4]
    ## [1,]    1    2    3    1
    ## [2,]    4    5    6    1

Adding a row (don’t forget to set the number of rows: nrow = 1):

ones_row <- matrix(rep(1,3), nrow = 1); ones_row; m

    ##      [,1] [,2] [,3]
    ## [1,]    1    1    1

    ##      [,1] [,2] [,3]
    ## [1,]    1    2    3
    ## [2,]    4    5    6

rbind(ones_row, m)

    ##      [,1] [,2] [,3]
    ## [1,]    1    1    1
    ## [2,]    1    2    3
    ## [3,]    4    5    6

Rows may also be added by creating a larger matrix and copying values into it:

new_matrix <- matrix(nrow = 3, ncol = 3)

added_row <- matrix(c(7,8,9), nrow = 1)

new_matrix[1:2,1:3] <- m
new_matrix[3,1:3] <- added_row
new_matrix

    ##      [,1] [,2] [,3]
    ## [1,]    1    2    3
    ## [2,]    4    5    6
    ## [3,]    7    8    9

Rows and columns can be deleted by reassigning the matrix to a subset of itself. Here the second row is dropped:

m <- matrix(1:6, nrow = 3); m

    ##      [,1] [,2]
    ## [1,]    1    4
    ## [2,]    2    5
    ## [3,]    3    6

m <- m[c(1,3),]; m

    ##      [,1] [,2]
    ## [1,]    1    4
    ## [2,]    3    6

Indexing

To retrieve information from a matrix:

m[,2]

    ## [1] 4 6

m[2,]

    ## [1] 3 6

m[2,2]

    ## [1] 6

Values may be changed in a matrix as well:

m[2,2] <- 66; m

    ##      [,1] [,2]
    ## [1,]    1    4
    ## [2,]    3   66

Filtering

x <- matrix(c(1,2,3,2,3,4), nrow = 3, byrow = FALSE); x

    ##      [,1] [,2]
    ## [1,]    1    2
    ## [2,]    2    3
    ## [3,]    3    4

x[x[,2] >= 3]

    ## [1] 2 3 3 4

j <- x[,2] >= 3
x[j,]

    ##      [,1] [,2]
    ## [1,]    2    3
    ## [2,]    3    4
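The two results above differ only by the comma. The sketch below repeats the comparison under new names (flt and keep are chosen for this example) and adds drop = FALSE, which keeps a single selected row as a matrix instead of collapsing it to a vector.

```r
flt  <- matrix(c(1, 2, 3, 2, 3, 4), nrow = 3)
keep <- flt[, 2] >= 3        # FALSE TRUE TRUE, one value per row

flt[keep]                    # no comma: matrix treated as a vector, keep is recycled
flt[keep, ]                  # comma: keep selects whole rows
flt[flt[, 2] >= 4, , drop = FALSE]  # a single row, still a 1 x 2 matrix
```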

Matrix Math

The examples below use the matrix y created earlier:

y

    ##      [,1] [,2]
    ## [1,]    1    3
    ## [2,]    2    4

Mathematical Matrix Multiplication

y %*% y

    ##      [,1] [,2]
    ## [1,]    7   15
    ## [2,]   10   22

Mathematical Multiplication of Matrix by Scalar

3*y

    ##      [,1] [,2]
    ## [1,]    3    9
    ## [2,]    6   12

Mathematical Matrix Addition

y + y

    ##      [,1] [,2]
    ## [1,]    2    6
    ## [2,]    4    8

Array

Introduction

The mechanics of an array are very similar to those of a matrix in R. Unlike a matrix, an array can represent data in more than two dimensions. We can build a three-dimensional array by combining two matrices, a four-dimensional array by combining two or more three-dimensional arrays, and so on.
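As a minimal sketch of the idea, the snippet below stacks two 2x2 matrices into a 2x2x2 array with array(). The names first, second, and arr are chosen only for this example.

```r
# Two 2x2 matrices stacked into a 2x2x2 array.
first  <- matrix(1:4, nrow = 2)
second <- matrix(5:8, nrow = 2)

arr <- array(c(first, second), dim = c(2, 2, 2))
dim(arr)        # 2 2 2
arr[, , 2]      # the second layer, equal to `second`
arr[1, 2, 2]    # row 1, column 2 of the second layer: 7
```

Indexing works as it does for a matrix, with one extra position for the third dimension.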

List

Introduction

Lists are unique in that their elements do not all have to be of the same mode; a list can combine different types. An R list is similar to a Python dictionary or a C struct. Lists form the foundation for data frames, object-oriented programming (R classes), and more.

Creation

If we wanted to create an employee database, we could start with:

j <- list(name = "Eric", salary = 45000, union = TRUE)
j

    ## $name
    ## [1] "Eric"
    ## 
    ## $salary
    ## [1] 45000
    ## 
    ## $union
    ## [1] TRUE

The component names are called tags.

Adding Element

New components can be added after a list is created:

z <- list(a = "abc", b = 12)
z

    ## $a
    ## [1] "abc"
    ## 
    ## $b
    ## [1] 12

z$c <- "sailing" # add a c component
z

    ## $a
    ## [1] "abc"
    ## 
    ## $b
    ## [1] 12
    ## 
    ## $c
    ## [1] "sailing"

Components can also be added via a vector index:

z[[4]] <- 28
z[5:7] <- c(FALSE, TRUE, TRUE)
z

    ## $a
    ## [1] "abc"
    ## 
    ## $b
    ## [1] 12
    ## 
    ## $c
    ## [1] "sailing"
    ## 
    ## [[4]]
    ## [1] 28
    ## 
    ## [[5]]
    ## [1] FALSE
    ## 
    ## [[6]]
    ## [1] TRUE
    ## 
    ## [[7]]
    ## [1] TRUE

You can also concatenate lists:

combined <- c(list("Joe", 55000, TRUE), list(5)); combined  # avoid naming a variable cat, which shadows base::cat

    ## [[1]]
    ## [1] "Joe"
    ## 
    ## [[2]]
    ## [1] 55000
    ## 
    ## [[3]]
    ## [1] TRUE
    ## 
    ## [[4]]
    ## [1] 5

Remove Element

You can delete a list component by setting it equal to NULL:

z$b <- NULL
z

    ## $a
    ## [1] "abc"
    ## 
    ## $c
    ## [1] "sailing"
    ## 
    ## [[3]]
    ## [1] 28
    ## 
    ## [[4]]
    ## [1] FALSE
    ## 
    ## [[5]]
    ## [1] TRUE
    ## 
    ## [[6]]
    ## [1] TRUE

Indexing

You can access a list component in several different ways:

j$salary

    ## [1] 45000

j[["salary"]]

    ## [1] 45000

j[[2]]

    ## [1] 45000

What’s the deal with the single and double brackets?

If single brackets are used, the result is another list - a sublist of the original.

j1 <- j[1:2]; j1

    ## $name
    ## [1] "Eric"
    ## 
    ## $salary
    ## [1] 45000

If double brackets are used, the result is a single component, returned with the type of that component.

j[[2]]

    ## [1] 45000

Double brackets cannot return several components at once. Given a vector of indices, they index recursively instead, so j[[1:2]] means j[[1]][[2]]. That raises an error here, because j[[1]] has only one element:

# j[[1:2]]
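One detail worth knowing: when double brackets receive a vector of indices, R walks into nested lists one level per index. The sketch below uses a made-up nested list (nested is a name chosen for this example) to contrast that with single brackets.

```r
nested <- list(person = list(name = "Eric", salary = 45000), union = TRUE)

nested[[c(1, 2)]]     # same as nested[[1]][[2]]: 45000
nested[1:2]           # single brackets: a sublist holding both components
length(nested[1:2])   # 2
```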

Filtering

Getting the component names (tags) of a list:

names(j)

    ## [1] "name"   "salary" "union"

We can also get the values themselves. Because the components have different modes, unlist() coerces them all to character:

ulj <- unlist(j); ulj

    ##    name  salary   union 
    ##  "Eric" "45000"  "TRUE"

Each value above has a name. The names can be removed by assigning NULL to names():

names(ulj) <- NULL
ulj

    ## [1] "Eric"  "45000" "TRUE"

Using lapply() and sapply() functions

lapply() applies a given function to each of the components of a list and returns another list:

lapply(list(1:3,25:29), median)

    ## [[1]]
    ## [1] 2
    ## 
    ## [[2]]
    ## [1] 27

sapply() does the same but simplifies the result to a vector (or matrix) when possible:

sapply(list(1:3,25:29), median)

    ## [1]  2 27

Recursive Lists

You can have lists within lists:

b <- list(u = 5, v = 12)
c <- list(w = 13)
a <- list(b, c)
a

    ## [[1]]
    ## [[1]]$u
    ## [1] 5
    ## 
    ## [[1]]$v
    ## [1] 12
    ## 
    ## 
    ## [[2]]
    ## [[2]]$w
    ## [1] 13

TIP: The concatenate function c() has an optional argument recursive, which controls whether flattening occurs when recursive lists are combined.
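A short sketch of the recursive argument in action; the list names (inner, top, kept, flat) are invented for this example. With the default recursive = FALSE the nested list survives as one component, while recursive = TRUE flattens everything into a single named vector.

```r
inner <- list(u = 5, v = 12)
top   <- list(w = 13, nested = inner)

kept <- c(top, list(k = 1))                    # default: nested list is preserved
flat <- c(top, list(k = 1), recursive = TRUE)  # flattened into a named vector

length(kept)   # 3 components: w, nested, k
is.list(flat)  # FALSE: flat is an atomic numeric vector
names(flat)    # "w" "nested.u" "nested.v" "k"
```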

Dataframe

Introduction

A data frame is similar to a two-dimensional matrix in that it has a row-and-column structure. However, data frames are heterogeneous: each column can have a different mode. Technically, a data frame is a list whose components are equal-length vectors, one per column. Data frames are commonly used for data manipulation and other data analysis tasks in R.
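The "list of equal-length vectors" description can be checked directly. The sketch below builds a small data frame (people, with illustrative values) and inspects it with list functions.

```r
people <- data.frame(name = c("Ada", "Alan"),
                     born = c(1815, 1912),
                     stringsAsFactors = FALSE)

is.list(people)        # TRUE: a data frame is a list of columns
length(people)         # 2: one component per column
sapply(people, class)  # name: "character", born: "numeric"
people[["born"]]       # list-style access returns the column vector
```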

Creation

Creating a data frame from scratch:

scientists <- c("Einstein", "Newton")
born <- c(1879, 1642)

d <- data.frame(scientists, born, stringsAsFactors = FALSE)
d

    ##   scientists born
    ## 1   Einstein 1879
    ## 2     Newton 1642

If the named argument stringsAsFactors is not specified, its default depends on the R version: TRUE in versions before R 4.0.0, and FALSE from R 4.0.0 onward.

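Data frames can also be created from external files (.csv, .mtp, .xls, .spss, .txt). read.csv() and read.table() ship with base R; the other readers come from add-on packages (read.mtp() and read.spss() from foreign, read.xls() from gdata):

mydata <- read.csv("mydata.csv", header = TRUE)

mydata <- foreign::read.mtp("mydata.mtp")  # read from .mtp file

mydata <- gdata::read.xls("mydata.xls")  # read from first sheet

mydata <- foreign::read.spss("myfile", to.data.frame = TRUE)

mydata <- read.table("mydata.txt")

and many more options.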

Adding Element

The rbind() and cbind() functions used for matrices also work on data frames, adding rows or columns of compatible size.

Adding a new row:

d1 <- data.frame(kids = c("jack", "Jill"), ages = c(12, 10), stringsAsFactors = FALSE)
d1

    ##   kids ages
    ## 1 jack   12
    ## 2 Jill   10

rbind(d1, list("laura", 19))

    ##    kids ages
    ## 1  jack   12
    ## 2  Jill   10
    ## 3 laura   19

Adding a column
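A minimal sketch of two common ways to add a column, using cbind() and $ assignment. The data frame kids_df mirrors the kids/ages example; the grade and tall columns and their values are made up for illustration.

```r
kids_df <- data.frame(kids = c("jack", "Jill"), ages = c(12, 10),
                      stringsAsFactors = FALSE)

# cbind() returns a new data frame with the extra column appended.
with_grade <- cbind(kids_df, grade = c(7, 5))

# Assigning to a new name with $ adds the column in place.
kids_df$tall <- c(TRUE, FALSE)

names(with_grade)   # "kids" "ages" "grade"
names(kids_df)      # "kids" "ages" "tall"
```

In both cases the new column must have one value per existing row (or a length that recycles evenly).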

Remove Element

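Deleting data from a data frame works like deleting from a vector: use a negative index, here applied to rows.

d2 <- data.frame(kids = c("jack", "Jill", "laura"), ages = c(12, 10, 19), stringsAsFactors = FALSE)
d2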

    ##    kids ages
    ## 1  jack   12
    ## 2  Jill   10
    ## 3 laura   19

d2 <- d2[-2,]
d2

    ##    kids ages
    ## 1  jack   12
    ## 3 laura   19

Indexing

d[[1]]

    ## [1] "Einstein" "Newton"

d$scientists

    ## [1] "Einstein" "Newton"

We may also access elements in a matrix-like way as well:

d[,1]

    ## [1] "Einstein" "Newton"

It is often helpful to inspect the structure of a data frame, which str() makes easy:

str(d)

    ## 'data.frame':    2 obs. of  2 variables:
    ##  $ scientists: chr  "Einstein" "Newton"
    ##  $ born      : num  1879 1642

Filtering

Let’s take a look at how to filter data in a data frame:

cars <- mtcars[c("mpg", "hp", "wt", "cyl")]  # the built-in mtcars dataset holds these columns
head(cars)

    ##                    mpg  hp    wt cyl
    ## Mazda RX4         21.0 110 2.620   6
    ## Mazda RX4 Wag     21.0 110 2.875   6
    ## Datsun 710        22.8  93 2.320   4
    ## Hornet 4 Drive    21.4 110 3.215   6
    ## Hornet Sportabout 18.7 175 3.440   8
    ## Valiant           18.1 105 3.460   6

cars[cars$cyl == 8,]

    ##                      mpg  hp    wt cyl
    ## Hornet Sportabout   18.7 175 3.440   8
    ## Duster 360          14.3 245 3.570   8
    ## Merc 450SE          16.4 180 4.070   8
    ## Merc 450SL          17.3 180 3.730   8
    ## Merc 450SLC         15.2 180 3.780   8
    ## Cadillac Fleetwood  10.4 205 5.250   8
    ## Lincoln Continental 10.4 215 5.424   8
    ## Chrysler Imperial   14.7 230 5.345   8
    ## Dodge Challenger    15.5 150 3.520   8
    ## AMC Javelin         15.2 150 3.435   8
    ## Camaro Z28          13.3 245 3.840   8
    ## Pontiac Firebird    19.2 175 3.845   8
    ## Ford Pantera L      15.8 264 3.170   8
    ## Maserati Bora       15.0 335 3.570   8

cars[cars$wt <= 4, c("mpg", "hp")]

    ##                    mpg  hp
    ## Mazda RX4         21.0 110
    ## Mazda RX4 Wag     21.0 110
    ## Datsun 710        22.8  93
    ## Hornet 4 Drive    21.4 110
    ## Hornet Sportabout 18.7 175
    ## Valiant           18.1 105
    ## Duster 360        14.3 245
    ## Merc 240D         24.4  62
    ## Merc 230          22.8  95
    ## Merc 280          19.2 123
    ## Merc 280C         17.8 123
    ## Merc 450SL        17.3 180
    ## Merc 450SLC       15.2 180
    ## Fiat 128          32.4  66
    ## Honda Civic       30.4  52
    ## Toyota Corolla    33.9  65
    ## Toyota Corona     21.5  97
    ## Dodge Challenger  15.5 150
    ## AMC Javelin       15.2 150
    ## Camaro Z28        13.3 245
    ## Pontiac Firebird  19.2 175
    ## Fiat X1-9         27.3  66
    ## Porsche 914-2     26.0  91
    ## Lotus Europa      30.4 113
    ## Ford Pantera L    15.8 264
    ## Ferrari Dino      19.7 175
    ## Maserati Bora     15.0 335
    ## Volvo 142E        21.4 109

Factor

Introduction

The motivation for factors comes from the concept of categorical data in statistics. An R factor may be viewed as a vector with more information added. The extra information consists of a record of the distinct values in that vector, called levels.

Creation

x <- c(5, 12, 13, 12)
xf <- factor(x)
xf

    ## [1] 5  12 13 12
    ## Levels: 5 12 13

The distinct values in xf (5, 12, and 13) are the levels. Internally, the factor stores integer codes that point to those levels:

str(xf)

    ##  Factor w/ 3 levels "5","12","13": 1 2 3 2

unclass(xf)

    ## [1] 1 2 3 2
    ## attr(,"levels")
    ## [1] "5"  "12" "13"

length(xf)

    ## [1] 4
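Because a factor stores integer codes rather than the original values, converting it back to numbers needs some care. The sketch below, with names (vals, f) chosen for this example, contrasts the codes with the values they stand for.

```r
vals <- c(5, 12, 13, 12)
f    <- factor(vals)

levels(f)                    # "5" "12" "13"
as.integer(f)                # 1 2 3 2: the internal codes, not the values
as.numeric(as.character(f))  # 5 12 13 12: the original values
table(f)                     # counts per level
```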

Adding Element

Future new levels can be anticipated as well:

x <- c(5, 12, 13, 12)
xff <- factor(x, levels = c(5, 12, 13, 88))
xff

    ## [1] 5  12 13 12
    ## Levels: 5 12 13 88

xff[2] <- 88
xff

    ## [1] 5  88 13 12
    ## Levels: 5 12 13 88

However, you cannot assign a value that is not among the factor’s levels. R issues a warning and stores NA instead:

xff[2] <- 28

    ## invalid factor level, NA generated

Remove Element

Indexing

Filtering

Math
