dplyr
Getting familiar with transforming data with dplyr
Load the tidyverse package and explore the starwars dataset.
Count how many individuals originate from each homeworld.
Sort the previous output to highlight the #1 location.
In the previous ranking, number 3 is an unknown homeworld with 10 characters. Who are they?
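One possible way to approach the three steps above, as a minimal sketch. It assumes the starwars data shipped with dplyr; the object names homeworld_counts and unknown_homeworld are only illustrative.

```r
library(dplyr)

# Count individuals per homeworld; sort = TRUE orders rows by decreasing n
homeworld_counts <- starwars %>%
  count(homeworld, sort = TRUE)
homeworld_counts

# Characters whose homeworld is missing (NA)
unknown_homeworld <- starwars %>%
  filter(is.na(homeworld)) %>%
  select(name, species, homeworld)
unknown_homeworld
```

count() keeps NA as its own group, which is why the unknown homeworld shows up in the ranking at all.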
Count and sort the number of characters per homeworld and species.
species
homeworld
Now convert the long format to the wide one, i.e. a first column homeworld, then 7 columns of species.
Many missing values appear because we look at all combinations. Replace all NA values with 0.
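A minimal sketch of the long-to-wide conversion with zeros instead of NA. It works on the full starwars table, so it produces one column per species present in the data rather than the 7 columns mentioned above; the names long_counts and wide_counts are illustrative, and values_fill = 0 assumes a reasonably recent tidyr.

```r
library(dplyr)
library(tidyr)

# Long format: one row per homeworld/species combination
long_counts <- starwars %>%
  count(homeworld, species, sort = TRUE)

# Wide format: one column per species; values_fill writes 0 where a
# homeworld/species combination does not occur in the data
wide_counts <- long_counts %>%
  pivot_wider(names_from = species,
              values_from = n,
              values_fill = 0)
wide_counts
```

This is only one route to the same table. Filling after reshaping, or completing the combinations while still in long format, are equally valid.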
You can find at least 3 solutions: with the pivot_wider function, with a tidyr helper on the long format data, or with a dplyr function once in wide format.
Summarise height and mass per species. Add a column n to know how many individuals the values were computed from. Filter out n values smaller than 2.
height.
mass.
(sd / sqrt(n)) for height and mass.
diff_med_mean for height or mass.
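The per-species summary can be sketched as below. Several choices here are assumptions, since the exercise text is incomplete: the statistics (mean, sd, median), the column names, and the sign convention of diff_med_mean (median minus mean) are illustrative only. Only height is shown; mass would follow the same pattern.

```r
library(dplyr)

species_stats <- starwars %>%
  group_by(species) %>%
  summarise(
    n = n(),                                   # individuals per species
    height_mean = mean(height, na.rm = TRUE),
    height_sd = sd(height, na.rm = TRUE),
    height_median = median(height, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(n >= 2) %>%                           # drop species with fewer than 2 individuals
  mutate(
    # standard error of the mean; n counts all rows, including those
    # with a missing height, so this is an approximation
    height_se = height_sd / sqrt(n),
    # assumed convention: median minus mean
    diff_med_mean = height_median - height_mean
  )
species_stats
```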
biomaRt
If you are missing biomaRt, install it with:
# install if missing
# install.packages("BiocManager")
BiocManager::install("biomaRt")
biomaRt.
library(biomaRt)
# Connect to the Ensembl genes mart (recent biomaRt versions expect an https host)
gene_mart <- useMart(biomart = "ENSEMBL_MART_ENSEMBL",
                     host = "https://www.ensembl.org")
gene_set <- useDataset(dataset = "hsapiens_gene_ensembl", mart = gene_mart)
# Retrieve exon-level annotation for chromosome 21
gene_by_exon <- as_tibble(getBM(
  mart = gene_set,
  attributes = c(
    "ensembl_gene_id",
    "ensembl_transcript_id",
    "ensembl_exon_id",
    "chromosome_name",
    "start_position",
    "end_position",
    "hgnc_symbol",
    "hgnc_id",
    "strand",
    "gene_biotype",
    "phenotype_description"
  ),
  filters = "chromosome_name",
  values = "21"
))
#write_csv(gene_by_exon, here::here("data", "gene_by_exon.csv"))
gene_by_exon data set.
Convert the gene_by_exon data set to a tibble.
Use glimpse() to find in which column this info is stored.
Use distinct() on this column to identify how pseudogenes are coded.
pseudogenes.
With stringr, see the function str_detect(), which looks for the presence of a substring and returns a logical vector. In combination with filter() you should be able to extract all “pseudogene” genes.
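The str_detect() plus filter() combination can be sketched on a small stand-in table so that it runs without querying Ensembl. The tibble toy_genes, its identifiers, and its biotype values are made up for illustration; with the real data you would apply the same filter to whichever column you identified.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the real table
toy_genes <- tibble(
  ensembl_gene_id = c("G1", "G2", "G3", "G4"),
  gene_biotype = c("protein_coding", "processed_pseudogene",
                   "lncRNA", "unprocessed_pseudogene")
)

# str_detect() returns one logical value per element
str_detect(toy_genes$gene_biotype, "pseudogene")

# filter() keeps the rows where that logical is TRUE
pseudo <- toy_genes %>%
  filter(str_detect(gene_biotype, "pseudogene"))
pseudo
```

Matching on a substring rather than testing equality is what lets a single filter catch every biotype that contains "pseudogene".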
table()
gene_biotype.
gene_biotype.
gene_by_exon %>%
  filter(gene_biotype == "bidirectional_promoter_lncRNA") %>%
  arrange(ensembl_exon_id)
## # A tibble: 12 x 11
##    ensembl_gene_id ensembl_transcr… ensembl_exon_id chromosome_name
##    <chr>           <chr>            <chr>                     <dbl>
##  1 ENSG00000223768 ENST00000647108  ENSE00001542583              21
##  2 ENSG00000223768 ENST00000454115  ENSE00001542586              21
##  3 ENSG00000223768 ENST00000647108  ENSE00001655745              21
##  4 ENSG00000223768 ENST00000454115  ENSE00001668643              21
##  5 ENSG00000223768 ENST00000647108  ENSE00001697127              21
##  6 ENSG00000223768 ENST00000400362  ENSE00001714446              21
##  7 ENSG00000223768 ENST00000454115  ENSE00001714446              21
##  8 ENSG00000223768 ENST00000647108  ENSE00001714446              21
##  9 ENSG00000223768 ENST00000400362  ENSE00001747474              21
## 10 ENSG00000223768 ENST00000400362  ENSE00003821463              21
## 11 ENSG00000223768 ENST00000433465  ENSE00003823847              21
## 12 ENSG00000223768 ENST00000433465  ENSE00003829032              21
## # … with 7 more variables: start_position <dbl>, end_position <dbl>,
## #   hgnc_symbol <chr>, hgnc_id <chr>, strand <dbl>, gene_biotype <chr>,
## #   phenotype_description <chr>
separate_rows(), and count in a single pipe workflow.
tolower()
tidyr
Reshape the table below so that the values of chol are mapped as values in columns named after visit. Note that in 1L, the L suffix only specifies an integer.
chol_by_visit <- tribble(
  ~sampleid, ~visit, ~chol,
  "S1", 1L, 120.0,
  "S1", 2L, 178,
  "S2", 1L, 180,
  "S2", 2L, 221,
  "S2", 3L, 240,
  "S3", 1L, 122,
  "S3", 2L, 160,
  "S3", 3L, 154
)
The result should look like this:
sampleid | 1 | 2 | 3 |
---|---|---|---|
S1 | 120 | 178 | NA |
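A minimal sketch of one way to obtain a table of this shape with pivot_wider(). The tribble is repeated so the snippet runs on its own; chol_wide is an illustrative name.

```r
library(tibble)
library(tidyr)

chol_by_visit <- tribble(
  ~sampleid, ~visit, ~chol,
  "S1", 1L, 120.0,
  "S1", 2L, 178,
  "S2", 1L, 180,
  "S2", 2L, 221,
  "S2", 3L, 240,
  "S3", 1L, 122,
  "S3", 2L, 160,
  "S3", 3L, 154
)

# Column names come from visit, cell values from chol;
# S1 has no third visit, so that cell is left as NA
chol_wide <- chol_by_visit %>%
  pivot_wider(names_from = visit, values_from = chol)
chol_wide
```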
variants <- tribble(
  ~sampleid, ~var1, ~var2, ~var3,
  "S1", "A3T", "T5G", "T6G",
  "S2", "A3G", "T5G", NA,
  "S3", "A3T", "T6C", "G10C",
  "S4", "A3T", "T6C", "G10C"
)
biomaRt also has a function called select(). If it was loaded after dplyr, use the dplyr::select() syntax to specify the appropriate one.
var1, var2 and var3 are all built the same way.
Once in the long format, you can split each of the 3 pieces of information into its own column using separate(x, y, sep = 1:3), where x is the column of mutations (3 characters merged) and y is a character vector containing the 3 column names of your choice, such as c("ref", "pos", "derived").
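The hint above can be sketched as follows. One deliberate deviation, stated as an assumption: because G10C has a two-digit position, this sketch splits after the first character and before the last one with sep = c(1, -1) instead of the fixed positions 1:3. The names variants_long, variant_slot and variant are illustrative.

```r
library(tibble)
library(tidyr)

variants <- tribble(
  ~sampleid, ~var1, ~var2, ~var3,
  "S1", "A3T", "T5G", "T6G",
  "S2", "A3G", "T5G", NA,
  "S3", "A3T", "T6C", "G10C",
  "S4", "A3T", "T6C", "G10C"
)

variants_long <- variants %>%
  # Long format: one row per sample and variant, dropping the missing one
  pivot_longer(cols = -sampleid,
               names_to = "variant_slot",
               values_to = "variant",
               values_drop_na = TRUE) %>%
  # Split into first character, middle part, last character;
  # remove = FALSE keeps the original variant column
  separate(variant, into = c("ref", "pos", "derived"),
           sep = c(1, -1), remove = FALSE)
variants_long
```

Here pos stays a character column; adding convert = TRUE to separate() would be one way to turn it into a number.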
variant_significance <- tribble(
  ~variant, ~significance,
  "A3T", "U",
  "A3G", "D",
  "T5G", "B",
  "T6G", "D",
  "T6C", "B",
  "G10C", "U"
)