dplyrgetting familiar with transforming data with dplyr
load the tidyverse package and explore the starwars dataset.
count how many individuals originate from each homeworld
sort the previous output to highlight the #1 location
in the previous top, the number 3 is unknown with 10 characters, who are they?
count and sort the number of characters per homeworld and species
specieshomeworldnow convert the long format to the wide one: i. e first column homeworld, then 7 columns of species
you see that many missing data are arising since we look at all combinations. Replace all NA values by 0.
You can find at least 3 solutions:
pivot_wider functiontidyr helper on the long format datadplyr function once in wide formatheight and mass per species. Add a column n to know from how many individuals the values were computed. Filter out n values smaller than 2.height.mass.sd / sqrt(n)) for heightmassdiff_med_meanheight or mass.biomaRtIf you are missing biomaRt, install it with:
# install if missing # install.packages("BiocManager") BiocManager::install("biomaRt")
biomaRt.library(biomaRt)
gene_mart <- useMart(biomart = "ENSEMBL_MART_ENSEMBL",
host = "www.ensembl.org")
gene_set <- useDataset(gene_mart , dataset = "hsapiens_gene_ensembl")
gene_by_exon <- as_tibble(getBM(
mart = gene_set,
attributes = c(
"ensembl_gene_id",
"ensembl_transcript_id",
"ensembl_exon_id",
"chromosome_name",
"start_position",
"end_position",
"hgnc_symbol",
"hgnc_id",
"strand",
"gene_biotype",
"phenotype_description"
),
filter = "chromosome_name",
value = "21"
))
#write_csv(gene_by_exon, here::here("data", "gene_by_exon.csv"))
genes_by_exon data set.genes_by_exon data set to a tibble.glimpse() to find tin which column this info is storeddistinct() on this column to identify how pseudogenes are coded.pseudogenes.stringr, see the function str_detect() that look for the presence of a substring and return a logical vector. In combination with filter() you should be able to extract all “pseudogene” genes
table())gene_biotype.gene_biotype.gene_by_exon %>%
filter(gene_biotype == "bidirectional_promoter_lncRNA") %>%
arrange(ensembl_exon_id)
## # A tibble: 12 x 11
## ensembl_gene_id ensembl_transcr… ensembl_exon_id chromosome_name
## <chr> <chr> <chr> <dbl>
## 1 ENSG00000223768 ENST00000647108 ENSE00001542583 21
## 2 ENSG00000223768 ENST00000454115 ENSE00001542586 21
## 3 ENSG00000223768 ENST00000647108 ENSE00001655745 21
## 4 ENSG00000223768 ENST00000454115 ENSE00001668643 21
## 5 ENSG00000223768 ENST00000647108 ENSE00001697127 21
## 6 ENSG00000223768 ENST00000400362 ENSE00001714446 21
## 7 ENSG00000223768 ENST00000454115 ENSE00001714446 21
## 8 ENSG00000223768 ENST00000647108 ENSE00001714446 21
## 9 ENSG00000223768 ENST00000400362 ENSE00001747474 21
## 10 ENSG00000223768 ENST00000400362 ENSE00003821463 21
## 11 ENSG00000223768 ENST00000433465 ENSE00003823847 21
## 12 ENSG00000223768 ENST00000433465 ENSE00003829032 21
## # … with 7 more variables: start_position <dbl>, end_position <dbl>,
## # hgnc_symbol <chr>, hgnc_id <chr>, strand <dbl>, gene_biotype <chr>,
## # phenotype_description <chr>
separate_rows(), and count in a single pipe workflowtolower()
tidyrchol are mapped as values in columns visit. Note that for 1L L only specifies integers.chol_by_visit <- tribble(
~sampleid, ~visit, ~chol,
"S1", 1L, 120.0,
"S1", 2L, 178,
"S2", 1L, 180,
"S2", 2L, 221,
"S2", 3L, 240,
"S3", 1L, 122,
"S3", 2L, 160,
"S3", 3L, 154
)
The result should look like this:
| sampleid | 1 | 2 | 3 |
|---|---|---|---|
| S1 | 120 | 178 | NA |
variants <- tribble(
~sampleid, ~var1, ~var2, ~var3,
"S1", "A3T", "T5G", "T6G",
"S2", "A3G", "T5G", NA,
"S3", "A3T", "T6C", "G10C",
"S4", "A3T", "T6C", "G10C"
)
biomaRt also has a function called select(). If it was loaded after dplyr then use dplyr::select() syntax to specify the appropriate one
the var1, 2 or 3 are build the same way:
you can once in the long format, split each of the 3 informations into its own column using separate(x, y, sep = 1:3) where x is the column of mutations (3 characters merged) and y a character vector containing the 3 column names of your choice. Like c("ref", "pos", "derived").
variant_significance <- tribble(
~variant, ~significance,
"A3T", "U",
"A3G", "D",
"T5G", "B",
"T6G", "D",
"T6C", "B",
"G10C", "U"
)