TD - Data transformation

playing with `dplyr`

getting familiar with transforming data with dplyr

Starwars

load the tidyverse package and explore the starwars dataset.
count how many individuals originate from each homeworld
sort the previous output to highlight the #1 location
in the previous top, the number 3 is unknown with 10 characters, who are they?
count and sort the number of characters per homeworld and species
from the previous output, add a step to filter out the
- count of 1
- missing species
- missing homeworld
now convert the long format to the wide one: i. e first column homeworld, then 7 columns of species
you see that many missing data are arising since we look at all combinations. Replace all NA values by 0.

Tip

You can find at least 3 solutions:

specifying an option in the pivot_wider function
using a tidyr helper on the long format data
using a dplyr function once in wide format

You may try all of them and see which one is the easiest for you.

compute the mean, median and standard deviation for height and mass per species. Add a column n to know from how many individuals the values were computed. Filter out n values smaller than 2.

Tip

some values are missing, filter them out from both height and mass columns

continuing the workflow, compute
- the difference between mean and median for height.
- the difference between mean and median for mass.
- the standard error of the mean (sd / sqrt(n)) for height
- the standard error of the mean for mass
- select only those four new columns + species.
- save the flow as diff_med_mean
finally, filter out the rows where the absolute difference between the mean and median is greater than the sem in any of height or mass.

chromosome 21 using `biomaRt`

If you are missing biomaRt, install it with:

# install if missing # install.packages("BiocManager") BiocManager::install("biomaRt")

Get the data for chromosome 21 from `biomaRt`.

library(biomaRt)
gene_mart <- useMart(biomart = "ENSEMBL_MART_ENSEMBL",
                     host = "www.ensembl.org")
gene_set <- useDataset(gene_mart , dataset = "hsapiens_gene_ensembl")

gene_by_exon <- as_tibble(getBM(
  mart = gene_set,
  attributes = c(
    "ensembl_gene_id",
    "ensembl_transcript_id",
    "ensembl_exon_id",
    "chromosome_name",
    "start_position",
    "end_position",
    "hgnc_symbol",
    "hgnc_id",
    "strand",
    "gene_biotype",
    "phenotype_description"
  ), 
  filter = "chromosome_name",
  value = "21"
))

#write_csv(gene_by_exon, here::here("data", "gene_by_exon.csv"))

Extract the processed pseudogenes from the `genes_by_exon` data set.

Convert the genes_by_exon data set to a tibble.
Use glimpse() to find tin which column this info is stored
Use distinct() on this column to identify how pseudogenes are coded.
Store the results in a tibble called pseudogenes.

Tip

in the package stringr, see the function str_detect() that look for the presence of a substring and return a logical vector. In combination with filter() you should be able to extract all “pseudogene” genes

Count the number of pseudogenes in the set (without referring to `table()`)

Sort the genes by their length.

Calculate the average genomic length per `gene_biotype`.

Calculate the total number of transcripts per gene and their average length by `gene_biotype`.

gene_by_exon %>%
  filter(gene_biotype == "bidirectional_promoter_lncRNA") %>%
  arrange(ensembl_exon_id)

## # A tibble: 12 x 11
##    ensembl_gene_id ensembl_transcr… ensembl_exon_id chromosome_name
##    <chr>           <chr>            <chr>                     <dbl>
##  1 ENSG00000223768 ENST00000647108  ENSE00001542583              21
##  2 ENSG00000223768 ENST00000454115  ENSE00001542586              21
##  3 ENSG00000223768 ENST00000647108  ENSE00001655745              21
##  4 ENSG00000223768 ENST00000454115  ENSE00001668643              21
##  5 ENSG00000223768 ENST00000647108  ENSE00001697127              21
##  6 ENSG00000223768 ENST00000400362  ENSE00001714446              21
##  7 ENSG00000223768 ENST00000454115  ENSE00001714446              21
##  8 ENSG00000223768 ENST00000647108  ENSE00001714446              21
##  9 ENSG00000223768 ENST00000400362  ENSE00001747474              21
## 10 ENSG00000223768 ENST00000400362  ENSE00003821463              21
## 11 ENSG00000223768 ENST00000433465  ENSE00003823847              21
## 12 ENSG00000223768 ENST00000433465  ENSE00003829032              21
## # … with 7 more variables: start_position <dbl>, end_position <dbl>,
## #   hgnc_symbol <chr>, hgnc_id <chr>, strand <dbl>, gene_biotype <chr>,
## #   phenotype_description <chr>

What is the most frequent single word in the phenotype description on chromosome 21? Split the column using `separate_rows()`, and `count` in a single pipe workflow

Tip

you may convert all strings to lower case to be case insensitive. See tolower()

playing with `tidyr`

Convert the chol_by_visit tibble to wide format such that the values in `chol` are mapped as values in columns `visit`. Note that for `1L` L only specifies integers.

chol_by_visit <- tribble(
  ~sampleid, ~visit, ~chol,
  "S1", 1L, 120.0,
  "S1", 2L, 178,
  "S2", 1L, 180,
  "S2", 2L, 221,
  "S2", 3L, 240,
  "S3", 1L, 122,
  "S3", 2L, 160,
  "S3", 3L, 154
)

The result should look like this:

sampleid	1	2	3
S1	120	178	NA

Why do we observe a missing value for S1?

Now we add the table of variants per sampleid.

variants <- tribble(
  ~sampleid, ~var1, ~var2, ~var3,
  "S1", "A3T", "T5G", "T6G",
  "S2", "A3G", "T5G", NA,
  "S3", "A3T", "T6C", "G10C",
  "S4", "A3T", "T6C", "G10C"
)

Is this data tidy? Is not, clean it up such that all variants appear as a column labeled by their position.

Warning

biomaRt also has a function called select(). If it was loaded after dplyr then use dplyr::select() syntax to specify the appropriate one

Tip

the var1, 2 or 3 are build the same way:

a character for reference allele
a number for the position
a character for mutated allele (_i. e = variant)

you can once in the long format, split each of the 3 informations into its own column using separate(x, y, sep = 1:3) where x is the column of mutations (3 characters merged) and y a character vector containing the 3 column names of your choice. Like c("ref", "pos", "derived").

Then you can go back to the wide format and use both the ref allele and the position as colum names.

Same question but using not separate but regular expressions and stringr

Select relevant variants

We obtained a third dataset, the signifiance of variants, where damaging are indicated with D.

variant_significance <- tribble(
  ~variant, ~significance,
  "A3T", "U",
  "A3G", "D",
  "T5G", "B",
  "T6G", "D",
  "T6C", "B",
  "G10C", "U"
)

TD - Data transformation

A. Ginolhac, R. Krause, E. Koncina

2019-10-10

playing with `dplyr`

Starwars

Tip

Tip

chromosome 21 using `biomaRt`

Get the data for chromosome 21 from `biomaRt`.

Extract the processed pseudogenes from the `genes_by_exon` data set.

Tip

Count the number of pseudogenes in the set (without referring to `table()`)

Sort the genes by their length.

Calculate the average genomic length per `gene_biotype`.

Calculate the total number of transcripts per gene and their average length by `gene_biotype`.

What is the most frequent single word in the phenotype description on chromosome 21? Split the column using `separate_rows()`, and `count` in a single pipe workflow

Tip

playing with `tidyr`

Convert the chol_by_visit tibble to wide format such that the values in `chol` are mapped as values in columns `visit`. Note that for `1L` L only specifies integers.

Why do we observe a missing value for S1?

Now we add the table of variants per sampleid.

Is this data tidy? Is not, clean it up such that all variants appear as a column labeled by their position.

Warning

Tip

Same question but using not separate but regular expressions and stringr

Select relevant variants

We obtained a third dataset, the signifiance of variants, where damaging are indicated with D.

Select the subjects via table variants that carry variants with damaging variants

Try using semi-join to achieve the same result.

TD - Data transformation

A. Ginolhac, R. Krause, E. Koncina

2019-10-10

playing with dplyr

Starwars

Tip

Tip

chromosome 21 using biomaRt

Get the data for chromosome 21 from biomaRt.

Extract the processed pseudogenes from the genes_by_exon data set.

Tip

Count the number of pseudogenes in the set (without referring to table())

Sort the genes by their length.

Calculate the average genomic length per gene_biotype.

Calculate the total number of transcripts per gene and their average length by gene_biotype.

What is the most frequent single word in the phenotype description on chromosome 21? Split the column using separate_rows(), and count in a single pipe workflow

Tip

playing with tidyr

Convert the chol_by_visit tibble to wide format such that the values in chol are mapped as values in columns visit. Note that for 1L L only specifies integers.

Why do we observe a missing value for S1?

Now we add the table of variants per sampleid.

Is this data tidy? Is not, clean it up such that all variants appear as a column labeled by their position.

Warning

Tip

Same question but using not separate but regular expressions and stringr

Select relevant variants

We obtained a third dataset, the signifiance of variants, where damaging are indicated with D.

Select the subjects via table variants that carry variants with damaging variants

Try using semi-join to achieve the same result.

playing with `dplyr`

chromosome 21 using `biomaRt`

Get the data for chromosome 21 from `biomaRt`.

Extract the processed pseudogenes from the `genes_by_exon` data set.

Count the number of pseudogenes in the set (without referring to `table()`)

Calculate the average genomic length per `gene_biotype`.

Calculate the total number of transcripts per gene and their average length by `gene_biotype`.

What is the most frequent single word in the phenotype description on chromosome 21? Split the column using `separate_rows()`, and `count` in a single pipe workflow

playing with `tidyr`

Convert the chol_by_visit tibble to wide format such that the values in `chol` are mapped as values in columns `visit`. Note that for `1L` L only specifies integers.