Tidy data

September 2019

Tidyverse

workflow

Pipeline

Tidyverse, packages in processes

https://www.tidyverse.org/

Tidyverse components

core / extended

Core

ggplot2, for data visualization
dplyr, for data manipulation
tidyr, for data tidying
readr, for data import
purrr, for functional programming
tibble, for tibbles, a modern re-imagining of data frames

source: http://tidyverse.tidyverse.org/. H.Wickham

Extended

Working with specific types of vectors:
- hms, for times
- stringr, for strings
- lubridate, for date/times
- forcats, for factors
Importing other types of data:
- feather, for sharing data
- haven, for SPSS, SAS and Stata files
- httr, for web apis
- jsonlite for JSON
- readxl, for .xls and .xlsx files
- rvest, for web scraping
- xml2, for XML files
Modelling
- modelr, for modelling within a pipeline
- broom, for models -> tidy data

Tidyverse criticism

dialect

@ucfagls yeah. I think the tidyverse is a dialect. But its accent isn’t so thick
— Hadley Wickham (@hadleywickham) 12 janvier 2017

Tidyverse criticism

controversy

In StackOverflow’s comment

See the popularity of the data.table versus dplyr question.

Easily summarized

data.table is faster, for less than 10 m rows, negligible.

Tidyverse criticism

jobs

Realized today: #tidyverse R and base #rstats have little in common. Beware when looking for job which requires knowledge of R.
— Yeedle N. (@Yeedle) 2 mars 2017

Personal complains

still young so change quickly and drastically. Backward compatibility is not always maintained.
tibbles are nice but a lot of non-tidyverse packages require matrices. rownames still an issue.

Anyway, learning the tidyverse does not prevent to learn R base, it helps to get things done early in the process

Tidyverse

trends

source: rdocumentation (2017/04/18)

Tidyverse

trends

source: rdocumentation (2017/04/18)

The dream team

Pipes with magrittr

developed by Stefan Milton Bache

compare the approaches

classic parenthesis syntax
magrittr pipeline

R base

set.seed(12)
round(mean(rnorm(5)), 2)

[1] -0.76

magrittr

set.seed(12)
rnorm(5) %>%
  mean() %>%
  round(2)

[1] -0.76

Of note, magrittr needs to loaded with either:

library(magrittr)
library(dplyr)
library(tidyverse)

Règles de la tuyauterie

Bob Rudis (hrbrmstr)

source from studio::conf17

tidy data

Learning objectives

Definitions

Principles of tidy data to structure data
Find errors in existing data sets
Structure data
Reshaping data with tidyr

Comments

Cleaning data also requires dplyr
tidyr and dplyr are intertwined
Focus on “tidy data”
Introduction of tidyr ways

Basic: everything in the data rectangle

with header

Jenny Bryan in Jeff Leek blog post

Rationale

All happy families are alike; each unhappy family is unhappy in its own way. Leo Tolstoy, Anna Karenina

Semantics

Definitions

Variable: A quantity, quality, or property that you can measure.
Observation: A set of values that display the relationship between variables. To be an observation, values need to be measured under similar conditions, usually measured on the same observational unit at the same time.
Value: The state of a variable that you observe when you measure it.

source: Garret Grolemund and vignette("tidy-data")

Definition

Tidy data

Each variable is in its own column
Each observation is in its own row
Each value is in its own cell

Bad data exercise

online

The following table lists missense variants in a gene in a group of patients
What’s wrong with this Excel sheet?
Which problems are tidy issues

Tidy errors

Error	Tidy violation	Comment
Patient names	No	Data protection violation
Identical column names	Yes	Variable error
Inconsistent variables names	No	Bad practice
Non-English columns names	No	Bad practice
Color coding	No	The horror, the horror
Inconsistent dates	No	Use ISO8601
Multiple columns for one item	Yes	One observation per line
Redundant information	Yes	Each variable is in its own column
Repeated rows	Yes	Each observation is in its own row
Uncoded syndromes	Yes/No	Each value in its own cell
Unnecessary information	No	like birthdate, comments: bad practice
Name of the table	No	You’ll see this often

Data cleaning exercise

offline

Clean the “bad table”

Bring data into shape such that it conforms to tidy data requirements
Pay attention to details of format, less to actual data
Do not use R for doing the manipulations

10 min, discuss

Common tidy data violations

Problems

Column headers are values, not variable names (gather)
Multiple variables stored in one column (separate)
Variables are stored in both rows and columns (gather-spread)
Repeated observations (nest or table)
Multiple types in one table (dplyr data transformation)
One type in multiple tables (dplyr, combine into single table)

introduction: cheat sheets

Convert Long / wide format

The wide format is generally untidy but found in the majority of datasets

Demo with mtcars

count

count(mtcars, am, cyl)

# A tibble: 6 x 3
     am   cyl     n
  <dbl> <dbl> <int>
1     0     4     3
2     0     6     4
3     0     8    12
4     1     4     8
5     1     6     3
6     1     8     2

pivot wider

count(mtcars, am, cyl) %>%
  pivot_wider(names_from = cyl, 
              values_from = n) -> wcars
wcars

# A tibble: 2 x 4
     am   `4`   `6`   `8`
  <dbl> <int> <int> <int>
1     0     3     4    12
2     1     8     3     2

pivot longer

wcars %>% 
  pivot_longer(cols = -am,
               names_to = "cyl",
               values_to = "n",
  # optional prototype to get integers (not chr)
               names_ptypes = list(cyl = integer()))

# A tibble: 6 x 3
     am   cyl     n
  <dbl> <int> <int>
1     0     4     3
2     0     6     4
3     0     8    12
4     1     4     8
5     1     6     3
6     1     8     2

Animation

by Garrick Aden-Buie

image from Garrick Aden-Buie, github repo

Demo with the fish dataset by Myfanwy Johnson

pivot_wider

head(fish_encounters, 3)

# A tibble: 3 x 3
  fish  station  seen
  <fct> <fct>   <int>
1 4842  Release     1
2 4842  I80_1       1
3 4842  Lisbon      1

dim(fish_encounters)

[1] 114   3

library(tidyverse)
fish_wide <- fish_encounters %>%
  pivot_wider(
    names_from = station,
    values_from = seen
  )
fish_wide

# A tibble: 19 x 12
   fish  Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE
   <fct>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int> <int>
 1 4842        1     1      1     1       1     1     1     1     1     1
 2 4843        1     1      1     1       1     1     1     1     1     1
 3 4844        1     1      1     1       1     1     1     1     1     1
 4 4845        1     1      1     1       1    NA    NA    NA    NA    NA
 5 4847        1     1      1    NA      NA    NA    NA    NA    NA    NA
 6 4848        1     1      1     1      NA    NA    NA    NA    NA    NA
 7 4849        1     1     NA    NA      NA    NA    NA    NA    NA    NA
 8 4850        1     1     NA     1       1     1     1    NA    NA    NA
 9 4851        1     1     NA    NA      NA    NA    NA    NA    NA    NA
10 4854        1     1     NA    NA      NA    NA    NA    NA    NA    NA
11 4855        1     1      1     1       1    NA    NA    NA    NA    NA
12 4857        1     1      1     1       1     1     1     1     1    NA
13 4858        1     1      1     1       1     1     1     1     1     1
14 4859        1     1      1     1       1    NA    NA    NA    NA    NA
15 4861        1     1      1     1       1     1     1     1     1     1
16 4862        1     1      1     1       1     1     1     1     1    NA
17 4863        1     1     NA    NA      NA    NA    NA    NA    NA    NA
18 4864        1     1     NA    NA      NA    NA    NA    NA    NA    NA
19 4865        1     1      1    NA      NA    NA    NA    NA    NA    NA
# … with 1 more variable: MAW <int>

Demo with the fish dataset

pivot_longer

fish_wide %>% 
  pivot_longer(
    cols = -fish,
    names_to = "station",
    values_to = "seen"
  )

# A tibble: 209 x 3
   fish  station  seen
   <fct> <chr>   <int>
 1 4842  Release     1
 2 4842  I80_1       1
 3 4842  Lisbon      1
 4 4842  Rstr        1
 5 4842  Base_TD     1
 6 4842  BCE         1
 7 4842  BCW         1
 8 4842  BCE2        1
 9 4842  BCW2        1
10 4842  MAE         1
# … with 199 more rows

Warning

Note that we get more rows than in the original dataset, as missing combination are now NA

Why tidy is useful?

iris %>%
  pivot_longer(cols = -Species,
               names_to = "flower", 
               values_to = "measure") %>%
  ggplot() +
  geom_boxplot(aes(x = Species, y = measure, fill = flower))

separate / unite

demo_tibble <- tibble(year  = c(2015, 2014, 2014),
                      month = c(11L, 2L, 4L),    # create a vector of integers
                      day   = c(23, 1, 30),      # default is double
                      value = c("high", "low", "low"))
demo_tibble

# A tibble: 3 x 4
   year month   day value
  <dbl> <int> <dbl> <chr>
1  2015    11    23 high 
2  2014     2     1 low  
3  2014     4    30 low

unite

demo_tibble %>%
  unite(date, c(year, month, day),
        sep = "-") -> demo_tibble_unite
demo_tibble_unite

# A tibble: 3 x 2
  date       value
  <chr>      <chr>
1 2015-11-23 high 
2 2014-2-1   low  
3 2014-4-30  low

separate

use quotes since we are not refering to objects

demo_tibble_unite %>%
  separate(date, c("year", "month", "day"))

# A tibble: 3 x 4
  year  month day   value
  <chr> <chr> <chr> <chr>
1 2015  11    23    high 
2 2014  2     1     low  
3 2014  4     30    low

Basic data cleaning

Separate rows

from ugly tables

Multiple values per cell

patient_df <- tibble(
    subject_id = 1001:1003, 
    visit_id = c("1,2,3", "1,2", "1"),
    measured = c("9,0, 11", "11, 3" , "12")  )
patient_df

# A tibble: 3 x 3
  subject_id visit_id measured
       <int> <chr>    <chr>   
1       1001 1,2,3    9,0, 11 
2       1002 1,2      11, 3   
3       1003 1        12

Note the incoherent white space

Combinations of variables

separate_rows(patient_df,
              visit_id, measured,
              convert = TRUE) # chr -> int

# A tibble: 6 x 3
  subject_id visit_id measured
       <int>    <int>    <int>
1       1001        1        9
2       1001        2        0
3       1001        3       11
4       1002        1       11
5       1002        2        3
6       1003        1       12

Comment

To split single variables use separate

`separate()` and `unite()`

exercice

create valid dates format YYYY-MM-DD

dummy <- data_frame(year = c(2015, 2014, 2014),
                    month = c(11, 2, 4),
                    day = c(23, 1, 30),
                    value = c("high", "low", "low"))

Warning: `data_frame()` is deprecated, use `tibble()`.
This warning is displayed once per session.

dummy

# A tibble: 3 x 4
   year month   day value
  <dbl> <dbl> <dbl> <chr>
1  2015    11    23 high 
2  2014     2     1 low  
3  2014     4    30 low

solution `unite()`

dummy_unite <- unite(dummy, date,
                     year, month, day,
                     sep = "-")
dummy_unite

# A tibble: 3 x 2
  date       value
  <chr>      <chr>
1 2015-11-23 high 
2 2014-2-1   low  
3 2014-4-30  low

`separate()` and `unite()`

exercice

explod YYYY-MM-DD by the ‘-’

solution `separate()`

Use quotes since we are not refering to objects
Default split on non-alphanumeric characters

separate(dummy_unite, 
         date, c("year", "month", "day"))

# A tibble: 3 x 4
  year  month day   value
  <chr> <chr> <chr> <chr>
1 2015  11    23    high 
2 2014  2     1     low  
3 2014  4     30    low

fill all combinations 1/3

kelpdf <- data.frame(
  Year = c(1999, 2000, 2004, 1999, 2004),
  Taxon = c("Saccharina", "Saccharina", "Saccharina", "Agarum", "Agarum"),
  Abundance = c(4, 5, 2, 1, 8)
)
kelpdf

  Year      Taxon Abundance
1 1999 Saccharina         4
2 2000 Saccharina         5
3 2004 Saccharina         2
4 1999     Agarum         1
5 2004     Agarum         8

what is missing?

how to fill it?

solution: `complete()`

complete(kelpdf,
         Year, Taxon)

# A tibble: 6 x 3
   Year Taxon      Abundance
  <dbl> <fct>          <dbl>
1  1999 Agarum             1
2  1999 Saccharina         4
3  2000 Agarum            NA
4  2000 Saccharina         5
5  2004 Agarum             8
6  2004 Saccharina         2

example from imachorda.com

fill all combinations 2/3

from tidyr

Actually, Agarum was recorded in 2000, but absent

how to fill out this info with 0s?

solution: `complete()`, option `fill`

complete(kelpdf,
         Year, Taxon,
         fill = list(Abundance = 0))

# A tibble: 6 x 3
   Year Taxon      Abundance
  <dbl> <fct>          <dbl>
1  1999 Agarum             1
2  1999 Saccharina         4
3  2000 Agarum             0
4  2000 Saccharina         5
5  2004 Agarum             8
6  2004 Saccharina         2

example from imachorda.com

fill all combinations 3/3

from tidyr

Wait, what happen between 2000 and 2004?

how to fill out this info with 0s?

solution: `complete()`, option `fill` and helper `full_seq()`

complete(kelpdf,
         # helper tidyr::full_seq
         Year = full_seq(Year, period = 1),
         Taxon,
         fill = list(Abundance = 0))

# A tibble: 12 x 3
    Year Taxon      Abundance
   <dbl> <fct>          <dbl>
 1  1999 Agarum             1
 2  1999 Saccharina         4
 3  2000 Agarum             0
 4  2000 Saccharina         5
 5  2001 Agarum             0
 6  2001 Saccharina         0
 7  2002 Agarum             0
 8  2002 Saccharina         0
 9  2003 Agarum             0
10  2003 Saccharina         0
11  2004 Agarum             8
12  2004 Saccharina         2

example from imachorda.com

nesting completion

combinations of variables

Do we have all visits per patient?

separate_rows(patient_df,
              visit_id, measured,
              convert = TRUE)

# A tibble: 6 x 3
  subject_id visit_id measured
       <int>    <int>    <int>
1       1001        1        9
2       1001        2        0
3       1001        3       11
4       1002        1       11
5       1002        2        3
6       1003        1       12

solution: `complete()`, with helper `nesting()`

patient_complete <- separate_rows(patient_df,
                                  visit_id, measured,
                                  convert = TRUE) %>% 
  complete(subject_id, nesting(visit_id))
patient_complete

# A tibble: 9 x 3
  subject_id visit_id measured
       <int>    <int>    <int>
1       1001        1        9
2       1001        2        0
3       1001        3       11
4       1002        1       11
5       1002        2        3
6       1002        3       NA
7       1003        1       12
8       1003        2       NA
9       1003        3       NA

How to keep hierachical data in a rectangle?

nesting tables

solution: `nest()`

patient_nested <- patient_complete %>% 
  nest(visit_id, measured)

Warning: All elements of `...` must be named.
Did you want `data = c(visit_id, measured)`?

patient_nested

# A tibble: 3 x 2
  subject_id           data
       <int> <list<df[,2]>>
1       1001        [3 × 2]
2       1002        [3 × 2]
3       1003        [3 × 2]

Advantages

common data structures are hierarchical, e.g. patient-centric with repeat observations
nesting allows to store collapsed tibbles and simplifies data management
unnesting unfold the data

`unnest()`

unnest(patient_nested)

Warning: `cols` is now required.
Please use `cols = c(data)`

# A tibble: 9 x 3
  subject_id visit_id measured
       <int>    <int>    <int>
1       1001        1        9
2       1001        2        0
3       1001        3       11
4       1002        1       11
5       1002        2        3
6       1002        3       NA
7       1003        1       12
8       1003        2       NA
9       1003        3       NA

better described as grouped data

`group_by` and `nest()`

patient_complete %>%
  group_by(subject_id) %>%
  nest(.key = "visit")

Warning: `.key` is deprecated

# A tibble: 3 x 2
# Groups:   subject_id [3]
  subject_id          visit
       <int> <list<df[,2]>>
1       1001        [3 × 2]
2       1002        [3 × 2]
3       1003        [3 × 2]

nested column (list-column) is named data by default

Wrap up

Covered here

tidyverse, introduction
tidy data
- http://tidyr.tidyverse.org/
- vignette(“tidy-data”)
tidyr
- long / wide with pivot functions
- data cleaning
- nesting
- present helpers: complete, separate_rows and nest()
switch from base to tidyverse, Rajesh Korde, blog’ post

Acknowledgments

Hadley Wickham
Roland Krause
Jeremy Stanley

Tidyverse

workflow

Pipeline

Tidyverse, packages in processes

https://www.tidyverse.org/

Tidyverse components

core / extended

Core

Extended

Tidyverse criticism

dialect

Tidyverse criticism

controversy

Easily summarized

Tidyverse criticism

jobs

Personal complains

Tidyverse

trends

Tidyverse

trends

The dream team

Pipes with magrittr

developed by Stefan Milton Bache

compare the approaches

R base

magrittr

Règles de la tuyauterie

Bob Rudis (hrbrmstr)

tidy data

Learning objectives

Definitions

Comments

Basic: everything in the data rectangle

with header

Rationale

Semantics

Definitions

Definition

Tidy data

Bad data exercise

online

Tidy errors

Data cleaning exercise

offline

Clean the “bad table”

Common tidy data violations

Problems

introduction: cheat sheets

Convert Long / wide format

Demo with mtcars

count

pivot wider

pivot longer

Animation

by Garrick Aden-Buie

Demo with the fish dataset by Myfanwy Johnson

pivot_wider

Demo with the fish dataset

pivot_longer

Warning

Why tidy is useful?

separate / unite

unite

separate

Basic data cleaning

Separate rows

from ugly tables

Multiple values per cell

Combinations of variables

Comment

separate() and unite()

exercice

create valid dates format YYYY-MM-DD

solution unite()

separate() and unite()

exercice

explod YYYY-MM-DD by the ‘-’

solution separate()

fill all combinations 1/3

`separate()` and `unite()`

solution `unite()`

`separate()` and `unite()`

solution `separate()`

solution: `complete()`

solution: `complete()`, option `fill`

solution: `complete()`, option `fill` and helper `full_seq()`

solution: `complete()`, with helper `nesting()`

solution: `nest()`

`unnest()`

`group_by` and `nest()`