September 2019

Tidyverse

workflow

Pipeline

Tidyverse, packages in processes

https://www.tidyverse.org/

Tidyverse components

core / extended

Core

  • ggplot2, for data visualization
  • dplyr, for data manipulation
  • tidyr, for data tidying
  • readr, for data import
  • purrr, for functional programming
  • tibble, for tibbles, a modern re-imagining of data frames

source: http://tidyverse.tidyverse.org/, H. Wickham

Extended

  • Working with specific types of vectors:
    • hms, for times
    • stringr, for strings
    • lubridate, for date/times
    • forcats, for factors
  • Importing other types of data:
    • feather, for sharing data
    • haven, for SPSS, SAS and Stata files
    • httr, for web APIs
    • jsonlite, for JSON
    • readxl, for .xls and .xlsx files
    • rvest, for web scraping
    • xml2, for XML files
  • Modelling
    • modelr, for modelling within a pipeline
    • broom, for models -> tidy data

Tidyverse criticism

  • dialect
  • controversy
  • jobs

Personal complaints

  • still young, so it changes quickly and drastically. Backward compatibility is not always maintained.
  • tibbles are nice, but a lot of non-tidyverse packages require matrices. Row names are still an issue.

Anyway, learning the tidyverse does not prevent you from learning base R; it helps to get things done early in the process

Tidyverse

trends

The dream team

Pipes with magrittr

developed by Stefan Milton Bache

compare the approaches

  • classic parenthesis syntax
  • magrittr pipeline

R base

set.seed(12)
round(mean(rnorm(5)), 2)
[1] -0.76

magrittr

set.seed(12)
rnorm(5) %>%
  mean() %>%
  round(2)
[1] -0.76

Of note, magrittr needs to be loaded with one of:

library(magrittr)
library(dplyr)
library(tidyverse)
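
What the pipe does can be checked directly: the left-hand side becomes the first argument of the call on the right. The sketch below assumes only that magrittr is installed; the vector `x` and the variable `digits` are made up for illustration. It also shows the `.` placeholder, which sends the left-hand side to another argument position.

```r
library(magrittr)

x <- c(1.234, 5.678, 9.1011)  # hypothetical values

# x %>% f(y) is evaluated as f(x, y)
piped  <- x %>% mean() %>% round(2)
nested <- round(mean(x), 2)
identical(piped, nested)

# the dot placeholder puts the left-hand side where the dot is,
# here as the second argument of round()
digits <- 2
rounded <- digits %>% round(mean(x), .)
identical(rounded, nested)
```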

Plumbing rules

Bob Rudis (hrbrmstr)

tidy data

Learning objectives

Definitions

  • Principles of tidy data to structure data
  • Find errors in existing data sets
  • Structure data
  • Reshaping data with tidyr

Comments

  • Cleaning data also requires dplyr
  • tidyr and dplyr are intertwined
  • Focus on “tidy data”
  • Introduction of tidyr ways

Basic: everything in the data rectangle

with header

Rationale

All happy families are alike; each unhappy family is unhappy in its own way. Leo Tolstoy, Anna Karenina

Semantics

Definitions

  • Variable: A quantity, quality, or property that you can measure.
  • Observation: A set of values that display the relationship between variables. To be an observation, values need to be measured under similar conditions, usually measured on the same observational unit at the same time.
  • Value: The state of a variable that you observe when you measure it.

source: Garrett Grolemund and vignette("tidy-data")

Definition

Tidy data

  1. Each variable is in its own column
  2. Each observation is in its own row
  3. Each value is in its own cell
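
The three rules can be illustrated on a toy table. This is only a sketch: the tibble `untidy` and its column names are invented for the example. The years are values of a variable but sit in the column headers, so each row holds several observations; reshaping gives one row per country and year.

```r
library(tibble)
library(tidyr)

# hypothetical untidy table: the years are values, yet they are column headers
untidy <- tibble(country = c("A", "B"),
                 `2018`  = c(10, 20),
                 `2019`  = c(11, 21))

# one column per variable, one row per observation, one value per cell
tidy <- pivot_longer(untidy,
                     cols = -country,
                     names_to = "year",
                     values_to = "cases")
tidy
```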

Bad data exercise

online

  • The following table lists missense variants in a gene in a group of patients
  • What’s wrong with this Excel sheet?
  • Which problems are tidy issues?

Tidy errors

Error                          Tidy violation  Comment
Patient names                  No              Data protection violation
Identical column names         Yes             Variable error
Inconsistent variable names    No              Bad practice
Non-English column names       No              Bad practice
Color coding                   No              The horror, the horror
Inconsistent dates             No              Use ISO 8601
Multiple columns for one item  Yes             One observation per line
Redundant information          Yes             Each variable is in its own column
Repeated rows                  Yes             Each observation is in its own row
Uncoded syndromes              Yes/No          Each value is in its own cell
Unnecessary information        No              E.g. birthdate, comments: bad practice
Name of the table              No              You’ll see this often

Data cleaning exercise

offline

Clean the “bad table”

  • Bring data into shape such that it conforms to tidy data requirements
  • Pay attention to details of format, less to actual data
  • Do not use R for doing the manipulations

10 min, discuss

Common tidy data violations

Problems

  • Column headers are values, not variable names (pivot_longer, formerly gather)
  • Multiple variables stored in one column (separate)
  • Variables are stored in both rows and columns (pivot_longer then pivot_wider, formerly gather-spread)
  • Repeated observations (nest or table)
  • Multiple types in one table (dplyr data transformation)
  • One type in multiple tables (dplyr, combine into single table)
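
As an illustration of the second violation in the list above (multiple variables stored in one column), here is a minimal sketch. The tibble `packed` and its `sex_age` column are hypothetical; the explicit `sep = "_"` is chosen so that the hyphen inside the age group is not treated as a separator.

```r
library(tibble)
library(tidyr)

# hypothetical data: sex and age group are packed into a single column
packed <- tibble(id = 1:3,
                 sex_age = c("f_0-14", "m_15-24", "f_25-34"))

# split on the underscore only, keeping "0-14" etc. intact
unpacked <- separate(packed, sex_age,
                     into = c("sex", "age_group"),
                     sep = "_")
unpacked
```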

introduction: cheat sheets

Convert Long / wide format

The wide format is generally untidy but found in the majority of datasets

Demo with mtcars

count

count(mtcars, am, cyl)
# A tibble: 6 x 3
     am   cyl     n
  <dbl> <dbl> <int>
1     0     4     3
2     0     6     4
3     0     8    12
4     1     4     8
5     1     6     3
6     1     8     2

pivot wider

count(mtcars, am, cyl) %>%
  pivot_wider(names_from = cyl, 
              values_from = n) -> wcars
wcars
# A tibble: 2 x 4
     am   `4`   `6`   `8`
  <dbl> <int> <int> <int>
1     0     3     4    12
2     1     8     3     2
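
The wide table above has column names such as `4` that must be quoted with backticks. One possible way around this, sketched here under the assumption that a prefix like "cyl_" is acceptable (the prefix and the name `wcars_named` are arbitrary choices), is the names_prefix argument of pivot_wider():

```r
library(dplyr)
library(tidyr)

wcars_named <- count(mtcars, am, cyl) %>%
  pivot_wider(names_from = cyl,
              values_from = n,
              names_prefix = "cyl_")  # syntactic names, no backticks needed
names(wcars_named)
```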

pivot longer

wcars %>% 
  pivot_longer(cols = -am,
               names_to = "cyl",
               values_to = "n",
  # optional transformation to get integers (not chr)
               names_transform = list(cyl = as.integer))
# A tibble: 6 x 3
     am   cyl     n
  <dbl> <int> <int>
1     0     4     3
2     0     6     4
3     0     8    12
4     1     4     8
5     1     6     3
6     1     8     2

Animation

by Garrick Aden-Buie

Demo with the fish dataset by Myfanwy Johnson

pivot_wider

library(tidyverse)
head(fish_encounters, 3)
# A tibble: 3 x 3
  fish  station  seen
  <fct> <fct>   <int>
1 4842  Release     1
2 4842  I80_1       1
3 4842  Lisbon      1
dim(fish_encounters)
[1] 114   3
fish_wide <- fish_encounters %>%
  pivot_wider(
    names_from = station,
    values_from = seen
  )
fish_wide
# A tibble: 19 x 12
   fish  Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE
   <fct>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int> <int>
 1 4842        1     1      1     1       1     1     1     1     1     1
 2 4843        1     1      1     1       1     1     1     1     1     1
 3 4844        1     1      1     1       1     1     1     1     1     1
 4 4845        1     1      1     1       1    NA    NA    NA    NA    NA
 5 4847        1     1      1    NA      NA    NA    NA    NA    NA    NA
 6 4848        1     1      1     1      NA    NA    NA    NA    NA    NA
 7 4849        1     1     NA    NA      NA    NA    NA    NA    NA    NA
 8 4850        1     1     NA     1       1     1     1    NA    NA    NA
 9 4851        1     1     NA    NA      NA    NA    NA    NA    NA    NA
10 4854        1     1     NA    NA      NA    NA    NA    NA    NA    NA
11 4855        1     1      1     1       1    NA    NA    NA    NA    NA
12 4857        1     1      1     1       1     1     1     1     1    NA
13 4858        1     1      1     1       1     1     1     1     1     1
14 4859        1     1      1     1       1    NA    NA    NA    NA    NA
15 4861        1     1      1     1       1     1     1     1     1     1
16 4862        1     1      1     1       1     1     1     1     1    NA
17 4863        1     1     NA    NA      NA    NA    NA    NA    NA    NA
18 4864        1     1     NA    NA      NA    NA    NA    NA    NA    NA
19 4865        1     1      1    NA      NA    NA    NA    NA    NA    NA
# … with 1 more variable: MAW <int>

Demo with the fish dataset

pivot_longer

fish_wide %>% 
  pivot_longer(
    cols = -fish,
    names_to = "station",
    values_to = "seen"
  )
# A tibble: 209 x 3
   fish  station  seen
   <fct> <chr>   <int>
 1 4842  Release     1
 2 4842  I80_1       1
 3 4842  Lisbon      1
 4 4842  Rstr        1
 5 4842  Base_TD     1
 6 4842  BCE         1
 7 4842  BCW         1
 8 4842  BCE2        1
 9 4842  BCW2        1
10 4842  MAE         1
# … with 199 more rows

Warning

Note that we get more rows than in the original dataset (209 instead of 114), as missing combinations are now explicit NA
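
If the explicit NA rows are not wanted, pivot_longer() can drop them again. A minimal sketch using the values_drop_na argument, which should bring the row count back to that of fish_encounters (the object name `fish_long` is just for this example):

```r
library(dplyr)
library(tidyr)

fish_wide <- fish_encounters %>%
  pivot_wider(names_from = station, values_from = seen)

fish_long <- fish_wide %>%
  pivot_longer(cols = -fish,
               names_to = "station",
               values_to = "seen",
               values_drop_na = TRUE)  # drop fish/station pairs never observed

nrow(fish_long) == nrow(fish_encounters)
```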

Why is tidy useful?

iris %>%
  pivot_longer(cols = -Species,
               names_to = "flower", 
               values_to = "measure") %>%
  ggplot() +
  geom_boxplot(aes(x = Species, y = measure, fill = flower))
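
The same long format also helps outside of plotting. As a sketch (the column name `mean_measure` is an arbitrary choice), one grouped summary covers all four measurements at once instead of one call per column:

```r
library(dplyr)
library(tidyr)

iris_summary <- iris %>%
  pivot_longer(cols = -Species,
               names_to = "flower",
               values_to = "measure") %>%
  group_by(Species, flower) %>%
  summarise(mean_measure = mean(measure), .groups = "drop")
iris_summary
```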

separate / unite

demo_tibble <- tibble(year  = c(2015, 2014, 2014),
                      month = c(11L, 2L, 4L),    # create a vector of integers
                      day   = c(23, 1, 30),      # default is double
                      value = c("high", "low", "low"))
demo_tibble
# A tibble: 3 x 4
   year month   day value
  <dbl> <int> <dbl> <chr>
1  2015    11    23 high 
2  2014     2     1 low  
3  2014     4    30 low  

unite

demo_tibble %>%
  unite(date, c(year, month, day),
        sep = "-") -> demo_tibble_unite
demo_tibble_unite
# A tibble: 3 x 2
  date       value
  <chr>      <chr>
1 2015-11-23 high 
2 2014-2-1   low  
3 2014-4-30  low  

separate

use quotes since we are not referring to objects

demo_tibble_unite %>%
  separate(date, c("year", "month", "day"))
# A tibble: 3 x 4
  year  month day   value
  <chr> <chr> <chr> <chr>
1 2015  11    23    high 
2 2014  2     1     low  
3 2014  4     30    low  
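
In the output above all three new columns are character. A possible variant, shown here on a small stand-alone tibble (`dates` is a made-up name), sets the separator explicitly and asks separate() to convert the pieces:

```r
library(tibble)
library(tidyr)

dates <- tibble(date  = c("2015-11-23", "2014-2-1", "2014-4-30"),
                value = c("high", "low", "low"))

parts <- separate(dates, date,
                  into = c("year", "month", "day"),
                  sep = "-",        # explicit separator instead of the default regex
                  convert = TRUE)   # chr -> int
parts
```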

Basic data cleaning

Separate rows

from ugly tables

Multiple values per cell

patient_df <- tibble(
    subject_id = 1001:1003, 
    visit_id = c("1,2,3", "1,2", "1"),
    measured = c("9,0, 11", "11, 3" , "12")  )
patient_df
# A tibble: 3 x 3
  subject_id visit_id measured
       <int> <chr>    <chr>   
1       1001 1,2,3    9,0, 11 
2       1002 1,2      11, 3   
3       1003 1        12      

Note the inconsistent white space

Combinations of variables

separate_rows(patient_df,
              visit_id, measured,
              convert = TRUE) # chr -> int
# A tibble: 6 x 3
  subject_id visit_id measured
       <int>    <int>    <int>
1       1001        1        9
2       1001        2        0
3       1001        3       11
4       1002        1       11
5       1002        2        3
6       1003        1       12

Comment

To split single variables use separate
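
The default separator of separate_rows() copes with the stray spaces, but the separator can also be spelled out. A sketch assuming the cells use a comma followed by optional white space (`patient_long` is a name chosen for this example):

```r
library(tibble)
library(tidyr)

patient_df <- tibble(
  subject_id = 1001:1003,
  visit_id = c("1,2,3", "1,2", "1"),
  measured = c("9,0, 11", "11, 3", "12"))

# split on a comma followed by any amount of white space
patient_long <- separate_rows(patient_df,
                              visit_id, measured,
                              sep = ",\\s*",
                              convert = TRUE)
patient_long
```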

separate() and unite()

exercise

create valid dates in the format YYYY-MM-DD

dummy <- tibble(year = c(2015, 2014, 2014),
                month = c(11, 2, 4),
                day = c(23, 1, 30),
                value = c("high", "low", "low"))
dummy
# A tibble: 3 x 4
   year month   day value
  <dbl> <dbl> <dbl> <chr>
1  2015    11    23 high 
2  2014     2     1 low  
3  2014     4    30 low  

solution unite()

dummy_unite <- unite(dummy, date,
                     year, month, day,
                     sep = "-")
dummy_unite
# A tibble: 3 x 2
  date       value
  <chr>      <chr>
1 2015-11-23 high 
2 2014-2-1   low  
3 2014-4-30  low  
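
Strictly speaking, "2014-2-1" is not yet in YYYY-MM-DD form because month and day are not zero-padded. One possible follow-up, which is not part of the unite() solution itself, pads the parts with sprintf() and converts the result to a Date (`dummy_iso` is a name invented here):

```r
library(tibble)
library(dplyr)

dummy <- tibble(year = c(2015, 2014, 2014),
                month = c(11, 2, 4),
                day = c(23, 1, 30),
                value = c("high", "low", "low"))

dummy_iso <- dummy %>%
  mutate(date = sprintf("%04d-%02d-%02d",          # zero-pad month and day
                        as.integer(year),
                        as.integer(month),
                        as.integer(day)),
         date = as.Date(date)) %>%                 # chr -> Date
  select(date, value)
dummy_iso
```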

separate() and unite()

exercise

explode YYYY-MM-DD by the ‘-’

solution separate()

  • Use quotes since we are not referring to objects
  • Default split on non-alphanumeric characters
separate(dummy_unite, 
         date, c("year", "month", "day"))
# A tibble: 3 x 4
  year  month day   value
  <chr> <chr> <chr> <chr>
1 2015  11    23    high 
2 2014  2     1     low  
3 2014  4     30    low  

fill all combinations 1/3

kelpdf <- data.frame(
  Year = c(1999, 2000, 2004, 1999, 2004),
  Taxon = c("Saccharina", "Saccharina", "Saccharina", "Agarum", "Agarum"),
  Abundance = c(4, 5, 2, 1, 8)
)
kelpdf
  Year      Taxon Abundance
1 1999 Saccharina         4
2 2000 Saccharina         5
3 2004 Saccharina         2
4 1999     Agarum         1
5 2004     Agarum         8

what is missing?

how to fill it?

solution: complete()

complete(kelpdf,
         Year, Taxon)
# A tibble: 6 x 3
   Year Taxon      Abundance
  <dbl> <fct>          <dbl>
1  1999 Agarum             1
2  1999 Saccharina         4
3  2000 Agarum            NA
4  2000 Saccharina         5
5  2004 Agarum             8
6  2004 Saccharina         2

example from imachorda.com
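
To answer "what is missing?" without filling anything in, one option is to list the absent combinations. A sketch combining tidyr::expand() with dplyr::anti_join() (`missing_combos` is a made-up name):

```r
library(dplyr)
library(tidyr)

kelpdf <- data.frame(
  Year = c(1999, 2000, 2004, 1999, 2004),
  Taxon = c("Saccharina", "Saccharina", "Saccharina", "Agarum", "Agarum"),
  Abundance = c(4, 5, 2, 1, 8)
)

# all Year x Taxon pairs, minus the pairs that are present in the data
missing_combos <- kelpdf %>%
  expand(Year, Taxon) %>%
  anti_join(kelpdf, by = c("Year", "Taxon"))
missing_combos
```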

fill all combinations 2/3

from tidyr

Actually, Agarum was recorded in 2000, but absent

how to fill out this info with 0s?

solution: complete(), option fill

complete(kelpdf,
         Year, Taxon,
         fill = list(Abundance = 0))
# A tibble: 6 x 3
   Year Taxon      Abundance
  <dbl> <fct>          <dbl>
1  1999 Agarum             1
2  1999 Saccharina         4
3  2000 Agarum             0
4  2000 Saccharina         5
5  2004 Agarum             8
6  2004 Saccharina         2

example from imachorda.com

fill all combinations 3/3

from tidyr

Wait, what happened between 2000 and 2004?

how to fill in the missing years with 0s?

solution: complete(), option fill and helper full_seq()

complete(kelpdf,
         # helper tidyr::full_seq
         Year = full_seq(Year, period = 1),
         Taxon,
         fill = list(Abundance = 0))
# A tibble: 12 x 3
    Year Taxon      Abundance
   <dbl> <fct>          <dbl>
 1  1999 Agarum             1
 2  1999 Saccharina         4
 3  2000 Agarum             0
 4  2000 Saccharina         5
 5  2001 Agarum             0
 6  2001 Saccharina         0
 7  2002 Agarum             0
 8  2002 Saccharina         0
 9  2003 Agarum             0
10  2003 Saccharina         0
11  2004 Agarum             8
12  2004 Saccharina         2

example from imachorda.com
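
The helper can be tried on its own to see what it feeds into complete(). A small sketch using the years from kelpdf:

```r
library(tidyr)

years <- c(1999, 2000, 2004, 1999, 2004)

# every value from the minimum to the maximum, in steps of `period`
all_years <- full_seq(years, period = 1)
all_years
```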

nesting completion

combinations of variables

Do we have all visits per patient?

separate_rows(patient_df,
              visit_id, measured,
              convert = TRUE)
# A tibble: 6 x 3
  subject_id visit_id measured
       <int>    <int>    <int>
1       1001        1        9
2       1001        2        0
3       1001        3       11
4       1002        1       11
5       1002        2        3
6       1003        1       12

solution: complete(), with helper nesting()

patient_complete <- separate_rows(patient_df,
                                  visit_id, measured,
                                  convert = TRUE) %>% 
  complete(subject_id, nesting(visit_id))
patient_complete
# A tibble: 9 x 3
  subject_id visit_id measured
       <int>    <int>    <int>
1       1001        1        9
2       1001        2        0
3       1001        3       11
4       1002        1       11
5       1002        2        3
6       1002        3       NA
7       1003        1       12
8       1003        2       NA
9       1003        3       NA
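
The difference between listing variables separately and wrapping them together in nesting() is easiest to see with expand(). A sketch on a reduced copy of the patient data (`visits`, `crossed` and `observed` are names made up for this example): separate variables are crossed, while nesting() keeps only the combinations present in the data.

```r
library(tibble)
library(tidyr)

visits <- tibble(subject_id = c(1001L, 1001L, 1001L, 1002L, 1002L, 1003L),
                 visit_id   = c(1L, 2L, 3L, 1L, 2L, 1L))

crossed  <- expand(visits, subject_id, visit_id)           # every possible pair
observed <- expand(visits, nesting(subject_id, visit_id))  # only pairs found in the data

nrow(crossed)
nrow(observed)
```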

How to keep hierarchical data in a rectangle?

nesting tables

solution: nest()

patient_nested <- patient_complete %>% 
  nest(data = c(visit_id, measured))
patient_nested
# A tibble: 3 x 2
  subject_id           data
       <int> <list<df[,2]>>
1       1001        [3 × 2]
2       1002        [3 × 2]
3       1003        [3 × 2]

Advantages

  • common data structures are hierarchical, e.g. patient-centric with repeated observations
  • nesting stores collapsed tibbles in a list-column and simplifies data management
  • unnesting unfolds the data
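
One way to work with the collapsed tibbles, sketched here with purrr (the column `n_measured` is invented for the example, and patient_complete is rebuilt by hand so the block runs on its own), is to map a function over the list-column and keep one summary value per patient:

```r
library(tibble)
library(dplyr)
library(tidyr)
library(purrr)

patient_complete <- tibble(
  subject_id = rep(1001:1003, each = 3),
  visit_id   = rep(1:3, times = 3),
  measured   = c(9L, 0L, 11L, 11L, 3L, NA, 12L, NA, NA))

patient_summary <- patient_complete %>%
  nest(data = c(visit_id, measured)) %>%
  # count the non-missing measurements inside each nested tibble
  mutate(n_measured = map_int(data, ~ sum(!is.na(.x$measured))))
patient_summary
```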

unnest()

unnest(patient_nested, cols = c(data))
# A tibble: 9 x 3
  subject_id visit_id measured
       <int>    <int>    <int>
1       1001        1        9
2       1001        2        0
3       1001        3       11
4       1002        1       11
5       1002        2        3
6       1002        3       NA
7       1003        1       12
8       1003        2       NA
9       1003        3       NA

better described as grouped data

group_by and nest()

patient_complete %>%
  group_by(subject_id) %>%
  nest(visit = c(visit_id, measured))
# A tibble: 3 x 2
# Groups:   subject_id [3]
  subject_id          visit
       <int> <list<df[,2]>>
1       1001        [3 × 2]
2       1002        [3 × 2]
3       1003        [3 × 2]

Wrap up