Data structures

October 2019

Objectives

You will learn to:

(re)view some base R
recognise the different data types: numeric, logical, factor …
understand what a list, a vector, a data.frame … are

no tidyverse, but remember it is built on base

Reminder: arithmetic operations

arithmetic operators

+: addition
-: subtraction
*: multiplication
/: division
^ or **: exponentiation
%%: modulo (remainder after division)
%/%: integer division

Remember

R will:

first perform exponentiation
then multiplications and/or divisions
and finally additions and/or subtractions.

If you need to change the priority during the evaluation, use parentheses – i.e. ( and ) – to group calculations.
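A minimal sketch of these precedence rules; the numbers are arbitrary illustration values:

```r
# Exponentiation first, then multiplication, then addition
a <- 2 + 3 * 4^2      # 2 + (3 * 16) = 50
# Parentheses force the addition to happen first
b <- (2 + 3) * 4^2    # 5 * 16 = 80
# Exponentiation also binds tighter than the unary minus
d <- -2^2             # -(2^2) = -4
e <- (-2)^2           # 4
print(c(a, b, d, e))
```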

9 / 2 # floating division

[1] 4.5

9 %/% 2 # integer division

[1] 4

9 %% 2 # remainder

[1] 1

(1:10 %/% 3) * 3 # int div

 [1] 0 0 3 3 3 6 6 6 9 9

1:10 %% 3 # remainder

 [1] 1 2 0 1 2 0 1 2 0 1

(1:10 %% 3) + (1:10 %/% 3) * 3 # sum up

 [1]  1  2  3  4  5  6  7  8  9 10
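The last line relies on the identity x == (x %/% y) * y + x %% y. A small sketch, with arbitrary values, suggests it also holds for negative numbers, where %/% rounds towards minus infinity:

```r
x <- c(-7, 7)
y <- 2
quotient  <- x %/% y   # floor division: -4 and 3
remainder <- x %% y    # the remainder takes the sign of the divisor: 1 and 1
rebuilt <- quotient * y + remainder
print(rebuilt)         # same values as x
```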

Data types and structures

R base

Necessary R base

R base

We could drop base R, but the tidyverse is built on top of it, so some base functions still need to be known.

Advice from David Robinson

Teach tidyverse to beginners
- base: summary functions
- vectorized operations
- stat modeling
- matrices
Don’t teach the hard way

I teach them X just to show them how much easier Y is

teaching programming is hard, don’t make it harder

When you start writing a loop then turn it into dplyr #rstats pic.twitter.com/M0gXUXuYCP
— David Robinson (@drob) 22 Feb 2016

4 main types

Type	Example
numeric	integer (2), double (2.34)
character (strings)	“tidyverse !”
boolean	TRUE / FALSE
complex	2+0i

Special case

NA   # not available, missing data
NA_real_
NA_integer_
NA_character_
NA_complex_
NULL # empty
-Inf/Inf # infinite values
NaN # Not a Number
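A short sketch of how these special values differ in practice (the variable names are only for illustration):

```r
# NA is a placeholder for a missing value; it still occupies a slot
length(NA)          # 1
typeof(NA)          # "logical", the default flavour
typeof(NA_real_)    # "double"
# NULL is the absence of any object and has length 0
length(NULL)        # 0
is.null(NULL)       # TRUE
# NaN comes from undefined numeric operations, Inf from division by zero
nan_value <- 0 / 0
inf_value <- 1 / 0
is.nan(nan_value)   # TRUE
is.na(nan_value)    # TRUE: NaN also counts as missing
is.na(inf_value)    # FALSE: Inf is a regular, if extreme, value
```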

Specificities

missing versus infinite

median(c(NA_real_, 2.45, 45.67))

[1] NA

median(c(Inf, 2.45, 45.67))

[1] 45.67

is.numeric(c(NA_real_, 2.45, 45.67))

[1] TRUE

is.numeric(c(Inf, 2.45, 45.67))

[1] TRUE

is.infinite(c(NA_real_, 2.45, 45.67))

[1] FALSE FALSE FALSE

is.infinite(c(Inf, 2.45, 45.67))

[1]  TRUE FALSE FALSE
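When missing values should simply be ignored, most summary functions accept na.rm = TRUE. A minimal sketch reusing the vector above:

```r
x <- c(NA_real_, 2.45, 45.67)
# By default a single NA propagates to the result
median(x)
# na.rm = TRUE drops missing values before computing
med <- median(x, na.rm = TRUE)   # median of 2.45 and 45.67
med
# is.na() locates the missing entries; is.finite() rejects NA, NaN and Inf
is.na(x)
is.finite(c(Inf, NA, 1))
```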

boolean operations

TRUE is 1
FALSE is 0

TRUE + TRUE

[1] 2

# 1 + 1 + 0
TRUE + TRUE + FALSE

[1] 2

45 * FALSE

[1] 0
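Because TRUE counts as 1, sum() and mean() can count and summarise the results of a comparison. A sketch with made-up values:

```r
heights <- c(58, 64, 71, 72)   # arbitrary example values
tall <- heights > 65           # FALSE FALSE TRUE TRUE
n_tall <- sum(tall)            # number of TRUE values: 2
prop_tall <- mean(tall)        # proportion of TRUE values: 0.5
print(c(n_tall, prop_tall))
```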

Structures

Vectors

c() is the function used to concatenate (combine) values

4
c(43, 5.6, 2.90)

[1] 4
[1] 43.0  5.6  2.9

Factors

converts strings to factors; the levels act as the dictionary

factor(c("AA", "BB", "AA", "CC"))

[1] AA BB AA CC
Levels: AA BB CC

Lists

very important as it can contain anything

list(f = factor(c("AA", "AA")),
     v = c(43, 5.6, 2.90),
     s = 4)

$f
[1] AA AA
Levels: AA

$v
[1] 43.0  5.6  2.9

$s
[1] 4

Matrix (2D), Arrays ($\geq$ 3D)

won’t dig into those

matrix(1:4, nrow = 2)

     [,1] [,2]
[1,]    1    3
[2,]    2    4

Data frames are special lists

`data.frame`

same as a list, but all elements must have the same length

Example

data.frame(
  f = factor(c("AA", "AA", "BB")),
  v = c(43, 5.6, 2.90),
  s = rep(4, 3))

   f    v s
1 AA 43.0 4
2 AA  5.6 4
3 BB  2.9 4

Example, missing one element in `v`

data.frame(
  f = factor(c("AA", "AA", "BB")),
  v = c(43, 5.6),
  s = rep(4, 3))

Error in data.frame(f = factor(c("AA", "AA", "BB")), v = c(43, 5.6), s = rep(4, : arguments imply differing number of rows: 3, 2

Data types 2

evaluate

# evaluate
typeof(2)

[1] "double"

class(2)

[1] "numeric"

mode(2)

[1] "numeric"

# check
is.integer(2.34)

[1] FALSE

# check with an actual integer
is.integer(2L)

[1] TRUE

# check a string
is.character("2.34")

[1] TRUE
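typeof() reports how values are stored, while class() reports how R treats the object. A small sketch where the two differ:

```r
f <- factor(c("AA", "BB", "AA"))
typeof(f)    # "integer": factors are stored as integer codes
class(f)     # "factor": but R treats them as qualitative data
typeof(2L)   # "integer": the L suffix creates an integer
typeof(2)    # "double": plain numbers are doubles
mode(2L)     # "numeric": mode() groups integer and double together
```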

convert (coerce)

as.integer(2.34)

[1] 2

as.character(2.34)

[1] "2.34"

as.numeric("2.34")

[1] 2.34
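Coercion can lose information or fail. A minimal sketch of a few cases worth knowing (the string "two" is just an example of unparseable input):

```r
as.integer(2.99)      # 2: the decimal part is dropped, not rounded
round(2.99)           # 3: use round() when rounding is intended
as.integer(TRUE)      # 1
as.logical("TRUE")    # TRUE
# A string that is not a number becomes NA, normally with a warning
bad <- suppressWarnings(as.numeric("two"))
is.na(bad)            # TRUE
```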

Vectors

Vectors are the simplest type of object in R.

print(5)

[1] 5

[1] means we made a numeric vector of length 1. Now look at what the : operator does:

1:30

 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[24] 24 25 26 27 28 29 30

How many elements are in the thing we made here? What does the [24] signify?

Vectors

concatenate

Think of vectors as collections of simple things (like numbers) that are ordered. We can create vectors from other vectors using the c function:

c(2, TRUE, "a string")

[1] "2"        "TRUE"     "a string"

We can use the assignment operator <- to associate a name to our vectors in order to reuse them:

my_vec <- c(3, 4, 1:3)
my_vec

[1] 3 4 1 2 3

Tip

RStudio has the built-in shortcut Alt+- for <-

Advice

Even though = also works for assignment, don’t use it; prefer <-.

Vectors

(cont.)

The following will build a character vector. We know this because the elements are all in “quotes”.

char_vec <- c("dog", "cat", "ape")

Now use the c function to combine a length-one vector holding the number 4 with char_vec. What happens?

c(4, char_vec)

[1] "4"   "dog" "cat" "ape"

Notice that the 4 is quoted. R turned it into a character vector and then combined it with char_vec.

Remember

All elements in an atomic vector must be of the same type. Otherwise, they are silently coerced.
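The coercion follows a fixed order: logical, then integer, then double, then character. A short sketch:

```r
typeof(c(TRUE, 2L))    # "integer": logical is promoted to integer
typeof(c(2L, 2.5))     # "double": integer is promoted to double
typeof(c(2.5, "a"))    # "character": everything can become a string
mixed <- c(TRUE, 2L, 2.5, "a")
mixed                  # all four elements are now strings
```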

Vectors

hierarchy

source: H. Wickham - R for data science, licence CC

is.vector(char_vec)

[1] TRUE

is.vector(list(a = 1))

[1] TRUE

is.data.frame(list(a = 1))

[1] FALSE

Vectors

built-in

R has a few built in vectors. One of these is LETTERS. What does it contain?

LETTERS

 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
[18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

How do we extract the first element from this (the letter A)? Here is how to do it:

LETTERS[1]

[1] "A"

Use the square brackets [] to subset vectors

Vectors

subset

Important

Unlike Python or Perl, R vectors use 1-based indexing!

How to extract > 1 element

select elements from position 3 to 10:

LETTERS[3:10]

[1] "C" "D" "E" "F" "G" "H" "I" "J"

Remember what the `:` operator does?

Take a look:

3:10

[1]  3  4  5  6  7  8  9 10

Can you see how LETTERS[3:10] works now?

Exercise

find a way to output

[1] "B" "C" "D" "E"

find a way to output

[1] "B" "C" "D" "E" "G"

find a way to output the first 5 letters + the second to last one

[1] "A" "B" "C" "D" "E" "Y"

Tip

the length of a vector is provided by length()

find a way to output all letters except the first one

 [1] "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R"
[18] "S" "T" "U" "V" "W" "X" "Y" "Z"

Tip

subsetting can use negative indexes

Solution

indexes from 2 to 5
```
LETTERS[2:5]
```
```
[1] "B" "C" "D" "E"
```

indexes from 2 to 5 + 7

LETTERS[c(2:5, 7)]

[1] "B" "C" "D" "E" "G"

indexes from 1 to 5 + second to last one

LETTERS[c(1:5, length(LETTERS) - 1)]

[1] "A" "B" "C" "D" "E" "Y"

indexes except 1

LETTERS[-1]

 [1] "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R"
[18] "S" "T" "U" "V" "W" "X" "Y" "Z"

Named vectors

names

Like a dict in Python or an associative array in Perl, character strings can be used as indexes

char_vec[1]

[1] "dog"

names(char_vec) <- c("first", "second", "third")
char_vec["first"]

first 
"dog"

char_vec[c("first", "third")]

first third 
"dog" "ape"

char_vec

 first second  third 
 "dog"  "cat"  "ape"

Note

the [1] is no longer displayed
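One common use of named vectors is as a lookup table. A sketch, assuming some hypothetical abbreviations:

```r
# Hypothetical abbreviations mapped to full names
lookup <- c(cyt = "cytoplasm", nuc = "nucleus", ext = "extracellular")
codes <- c("nuc", "cyt", "nuc")
full <- lookup[codes]   # a name can be requested several times
full
unname(full)            # drop the names to keep only the values
```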

Exercise

create a named vector

assign the LETTERS vector a new name vec
assign the letters vector as names for vec
subset vec for the name "m"; we don’t need the index

solution

vec <- LETTERS
names(vec) <- letters
vec["m"]

  m 
"M"

Vectorized operation

my_vec <- 10:18
my_vec + 2

[1] 12 13 14 15 16 17 18 19 20

R recycles vectors that are too short, without any warning as long as the longer length is a multiple of the shorter one:

1:10 + c(1, 2)

 [1]  2  4  4  6  6  8  8 10 10 12

my_vec * c(1:3)

What is the output?

Vectorized operation

(cont.)

Have a look at the following operation

c(1:3) + c(1:2) * c(1:4)

Warning in c(1:3) + c(1:2) * c(1:4): longer object length is not a multiple
of shorter object length

[1] 2 6 6 9

Details

The steps R performs behind the scenes are:

multiplication first: the 2nd vector is recycled to reach length 4

c(1, 2, 3) + (c(1, 2, 1, 2) * c(1, 2, 3, 4))

Warning in c(1, 2, 3) + (c(1, 2, 1, 2) * c(1, 2, 3, 4)): longer object
length is not a multiple of shorter object length

[1] 2 6 6 9

then addition: the first vector is extended with its 1st element to reach length 4
```
c(1, 2, 3, 1) + c(1, 4, 3, 8)
```
```
[1] 2 6 6 9
```
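The same recycling can be written out explicitly with rep_len(), which repeats a vector up to a requested length. A sketch of the two steps above:

```r
short <- 1:2
long <- 1:4
# rep_len() repeats a vector until it reaches the requested length
recycled <- rep_len(short, length(long))   # 1 2 1 2
product <- recycled * long                 # 1 4 3 8
# 1:3 does not fit evenly into length 4, which is why R warns
total <- rep_len(1:3, length(product)) + product
total
```

Writing the recycling by hand gives the same values without the warning, since every vector already has length 4.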

Vectors

tricky filling

x <- numeric(10)
x[20] <- 1
head(x, 20)

 [1]  0  0  0  0  0  0  0  0  0  0 NA NA NA NA NA NA NA NA NA  1

source: Kevin Ushey

Warning!

Unlike Python, which raises an index out of range error, R silently expands the vector and fills the gap with missing values.
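If silent growth is not wanted, the index can be checked before assigning. A minimal sketch; set_checked() is a hypothetical helper, not a base R function:

```r
# set_checked() is a hypothetical helper, not part of base R
set_checked <- function(x, i, value) {
  # Refuse to write before the start or past the current end of the vector
  stopifnot(i >= 1, i <= length(x))
  x[i] <- value
  x
}

x <- numeric(10)
x <- set_checked(x, 5, 1)                          # fine: index 5 exists
res <- try(set_checked(x, 20, 1), silent = TRUE)   # error instead of silent growth
length(x)                                          # still 10
```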

Factors

Vectors with qualitative data

my_f <- factor(c("cytoplasm", "nucleus", "extracellular", "nucleus", "nucleus"))
my_f

[1] cytoplasm     nucleus       extracellular nucleus       nucleus      
Levels: cytoplasm extracellular nucleus

Representation

Actually, data are represented with numbers

str(my_f)

 Factor w/ 3 levels "cytoplasm","extracellular",..: 1 3 2 3 3

Dictionary

The ids are called levels. By default they are sorted alphabetically.

levels(my_f)

[1] "cytoplasm"     "extracellular" "nucleus"

To reorder those levels, the safest way is to use the forcats package
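For reference, a base R sketch of reordering levels by rebuilding the factor with an explicit levels argument. Assigning directly to levels() would rename the values instead of reordering them, which is one reason care is needed:

```r
my_f <- factor(c("cytoplasm", "nucleus", "extracellular", "nucleus", "nucleus"))
# Rebuild the factor with an explicit level order
my_f2 <- factor(my_f, levels = c("nucleus", "cytoplasm", "extracellular"))
levels(my_f2)
as.integer(my_f2)     # the integer codes follow the new order
as.character(my_f2)   # the values themselves are unchanged
```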

Matrix

A matrix is a 2D array

M <- matrix(1:6, ncol = 2, nrow = 3)
M

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

M <- matrix(1:6, ncol = 2, nrow = 3, byrow = TRUE)
M

     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
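Matrices are subset with two coordinates, row first and column second. A short sketch on the matrix filled by row:

```r
M <- matrix(1:6, ncol = 2, nrow = 3, byrow = TRUE)
dim(M)     # 3 rows, 2 columns
M[2, 1]    # row 2, column 1
M[, 2]     # whole second column, returned as a vector
t(M)       # transpose: 2 rows, 3 columns
```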

Array

Similar to a matrix but with dimensions $\geq$ 3D

A <- array(1:24, dim = c(2, 4, 3))
A

, , 1

     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8

, , 2

     [,1] [,2] [,3] [,4]
[1,]    9   11   13   15
[2,]   10   12   14   16

, , 3

     [,1] [,2] [,3] [,4]
[1,]   17   19   21   23
[2,]   18   20   22   24
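Arrays take one coordinate per dimension. A sketch on the array above:

```r
A <- array(1:24, dim = c(2, 4, 3))
dim(A)        # 2 rows, 4 columns, 3 slices
A[2, 3, 1]    # row 2, column 3 of the first slice
A[1, 4, 3]    # row 1, column 4 of the third slice
A[, , 2]      # the whole second slice is a 2 x 4 matrix
```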

Lists

Also called recursive vectors. The most permissive type: a list can contain anything and be nested!

squares are atomic
rounded are lists

source: H. Wickham - R for data science, licence CC

Lists

Pepper analogy

Indexing lists in #rstats. Inspired by the Residence Inn pic.twitter.com/YQ6axb2w7t
— Hadley Wickham (@hadleywickham) 14 September 2015

Example

l <- list(name = "Farina",
          firstname = "Geoff",
          year = 1995)

l["firstname"]

$firstname
[1] "Geoff"

l[["firstname"]]

[1] "Geoff"

Question

How to subset a single pepper seed?
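One possible answer, sketched on a hypothetical nested list (the names shaker, packet1 and seeds are made up for the analogy): single brackets keep the container, double brackets open it.

```r
# Hypothetical nested list: a shaker holding packets, each holding seeds
shaker <- list(
  packet1 = list(seeds = c("s1", "s2", "s3")),
  packet2 = list(seeds = c("s4", "s5"))
)
names(shaker["packet1"])         # "packet1": still a shaker, holding one packet
names(shaker[["packet1"]])       # "seeds": the packet itself
shaker[["packet1"]][["seeds"]]   # the character vector inside the packet
seed <- shaker[["packet1"]][["seeds"]][2]   # a single seed
seed
```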

Data frames

It’s the most important type to remember. The whole tidyverse focuses on them.

Actually on a tweaked data.frame: the tibble

definition

data.frames are lists where all columns (i.e. vectors) have the same length
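This definition can be checked on the built-in women data set; a quick sketch:

```r
is.list(women)    # TRUE: a data frame is built on a list
length(women)     # 2: one list element per column
lengths(women)    # every column holds 15 values
nrow(women)       # 15
```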

built-in example

women

   height weight
1      58    115
2      59    117
3      60    120
4      61    123
5      62    126
6      63    129
7      64    132
8      65    135
9      66    139
10     67    142
11     68    146
12     69    150
13     70    154
14     71    159
15     72    164

Data frames

subset

We can extract a vector (column) from a data frame in a few different ways:

Using the double `[[]]`

women[["height"]]

 [1] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72

Or its alias: the `$` operator

women$height

 [1] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72

Remember the pepper analogy introduced by Hadley?

What would be the output of women["height"]?

Data frame as a table

A data frame can be considered as a table, and a specific cell can be extracted by its row and column:

first 5 rows

head(women, 5)

  height weight
1     58    115
2     59    117
3     60    120
4     61    123
5     62    126

only one cell with `[]`

first coordinate = row
second coordinate = col

women[4, 2]

[1] 123
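Columns can also be addressed by name, and leaving a coordinate empty selects everything along that dimension. A sketch:

```r
women[4, "weight"]             # same cell as women[4, 2], addressed by column name
women[1:3, ]                   # empty column slot: first three rows, all columns
heights <- women[, "height"]   # empty row slot: all rows of one column, as a vector
head(heights, 3)
```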

Logical operators

In addition to the arithmetic operators

Perform comparisons

== equal
!= different
< smaller
<= smaller or equal
> greater
>= greater or equal
! is not
&, && and
|, || or
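Comparisons are vectorized too, and their logical result can be used to subset. A sketch on the women data set (the thresholds are arbitrary):

```r
h <- women$height
tall <- h > 70             # one TRUE/FALSE per element
women[tall, ]              # keep only the rows where tall is TRUE
# & and | work element by element
mid <- h >= 60 & h <= 62
sum(mid)                   # 3 heights fall in that range
# && and || expect single values, as used in if () conditions
if (length(h) > 0 && h[1] < 60) message("first height is below 60")
```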

Iterations

rationale

computers are fast at repeating things
let computers do the job, focus on the action

vec <- c(1, 5, 7) # Example add 1 to each element of vec

for loop

res <- vector("numeric", length = length(vec))
for (i in seq_along(vec)) {
  res[i] <- vec[i] + 1
}
res

[1] 2 6 8

`purrr`

library(purrr)
map_dbl(vec, ~ .x + 1)

[1] 2 6 8

vectorization: an amazing feature of R

vec + 1

[1] 2 6 8
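For comparison, a base R sketch of the same iteration with vapply(), which needs no extra package and checks the type of each result:

```r
vec <- c(1, 5, 7)
# vapply() applies a function to each element; numeric(1) declares
# that every call must return a single number
res <- vapply(vec, function(x) x + 1, numeric(1))
res
identical(res, vec + 1)   # same result as the vectorized form
```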

Exporting data

text files

library(readr)
write_tsv(mtcars, here::here("results", "mtcars_file.tsv"))

file results/mtcars_file.tsv
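A base R sketch of the same kind of text export, followed by reading the file back to check it. It assumes a temporary file rather than the project's results folder:

```r
# tempfile() gives a throwaway path, so nothing is written to the project
path <- tempfile(fileext = ".tsv")
write.table(mtcars, path, sep = "\t", quote = FALSE, row.names = FALSE)
back <- read.delim(path)   # reads tab-separated text with a header line
dim(back)                  # 32 rows, 11 columns
file.remove(path)
```

Row names are dropped here on purpose, since write_tsv() does not write them either.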

write binary objects on disk

write_rds(mtcars, here::here("results", "mtcars_object.rds"))

file results/mtcars_object.rds

Exporting complex objects

write binary objects on disk

mpg_wt <- lm(mpg ~ wt, data = mtcars)
mpg_wt

Call:
lm(formula = mpg ~ wt, data = mtcars)

Coefficients:
(Intercept)           wt  
     37.285       -5.344

typeof(mpg_wt)

[1] "list"

write_rds(mpg_wt, "data/mpg_wt.rds")

read binary objects from disk

mt_lm <- read_rds("data/mpg_wt.rds")
mt_lm

Call:
lm(formula = mpg ~ wt, data = mtcars)

Coefficients:
(Intercept)           wt  
     37.285       -5.344

identical(mpg_wt, mt_lm)

[1] TRUE
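The same round trip can be sketched with base R's saveRDS() and readRDS(), which readr's write_rds() and read_rds() wrap. A temporary file is assumed here instead of the data folder:

```r
mpg_wt <- lm(mpg ~ wt, data = mtcars)
# tempfile() gives a throwaway path, so nothing is written to the project
path <- tempfile(fileext = ".rds")
saveRDS(mpg_wt, path)
mt_lm <- readRDS(path)
all.equal(coef(mpg_wt), coef(mt_lm))   # the coefficients survive the round trip
file.remove(path)
```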

Wrap up

You learned about:

data types
- initialize
- coerce
data structures
- inter-connections
- sub-setting
- vectorization
export data
- text files
- binary files

Next step is to learn programming!
