TD datasaurus

This guided practical demonstrates that the tidyverse makes it easy to compute summary statistics and visualize datasets. The datasets are already compiled into a tidy tibble; cleaning steps will come in future practicals.

`datasauRus` package

check that the datasauRus package is installed

library(datasauRus)

should return nothing. If the error there is no package called ‘datasauRus’ appears, the package needs to be installed. Use this:

install.packages("datasauRus")
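The two steps above can also be combined. This is a minimal sketch, assuming the machine can reach CRAN: it installs the package only when it is missing, then loads it.

```r
# Install datasauRus only when it is not already available
if (!requireNamespace("datasauRus", quietly = TRUE)) {
  install.packages("datasauRus")
}

# Load the package; this should print nothing
library(datasauRus)
```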

Explore the dataset

Since we are dealing with a tibble, we can just type

datasaurus_dozen

only the first 10 rows are displayed.

dataset	x	y
dino	55.3846	97.1795
dino	51.5385	96.0256
dino	46.1538	94.4872
dino	42.8205	91.4103
dino	40.7692	88.3333
dino	38.7179	84.8718
dino	35.6410	79.8718
dino	33.0769	77.5641
dino	28.9744	74.4872
dino	26.1538	71.4103

what are the dimensions of this dataset? Rows and columns?

base version, using either dim(), or ncol() and nrow()
tidyverse version
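One possible answer for both versions is sketched below. Using glimpse() for the tidyverse side is an assumption; simply printing the tibble also reports its dimensions in the header.

```r
library(datasauRus)
library(dplyr)

# Base R: dim() returns rows then columns
dim(datasaurus_dozen)

# Base R: the same information, one number at a time
nrow(datasaurus_dozen)
ncol(datasaurus_dozen)

# tidyverse: glimpse() prints the row and column counts,
# followed by one line per column
glimpse(datasaurus_dozen)
```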

assign `datasaurus_dozen` to the `ds_dozen` object. This populates the Global Environment

in RStudio, those dimensions are now also reported within the interface. Where?

How many datasets are present?

base version

Tip

you want to count the number of unique elements in the column dataset. The function length() returns the length of a vector, such as the vector of unique elements returned by unique()
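Following the tip, a possible base R solution is sketched here. It extracts the column with `$`, which is one option among several.

```r
library(datasauRus)

ds_dozen <- datasaurus_dozen

# unique() keeps one copy of each dataset name,
# length() counts how many names remain
n_datasets <- length(unique(ds_dozen$dataset))
n_datasets
```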

tidyverse version

summarise(ds_dozen, n = n_distinct(dataset))

## # A tibble: 1 x 1
##       n
##   <int>
## 1    13

even better, compute and display the number of rows per dataset

Tip

the function count() in dplyr performs a group_by() on the specified column followed by summarise(n = n()), which returns the number of observations per group.
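A sketch of the shortcut and of the longer form it replaces. The object name `rows_per_dataset` is only illustrative.

```r
library(datasauRus)
library(dplyr)

ds_dozen <- datasaurus_dozen

# Shortcut: one row per dataset, with the row count in column n
rows_per_dataset <- count(ds_dozen, dataset)
rows_per_dataset

# Long form doing the same thing
ds_dozen %>%
  group_by(dataset) %>%
  summarise(n = n())
```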

Check summary statistics per dataset

compute the mean of the `x` & `y` columns. For this, you need to `group_by()` the appropriate column and then `summarise()`

Tip

in summarise() you can define as many new columns as you wish. No need to call it for every single variable.
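A possible solution, defining both new columns in a single summarise() call. The column names `mean_x` and `mean_y` are arbitrary choices.

```r
library(datasauRus)
library(dplyr)

ds_dozen <- datasaurus_dozen

# One row per dataset, two summary columns computed in one call
means <- ds_dozen %>%
  group_by(dataset) %>%
  summarise(mean_x = mean(x),
            mean_y = mean(y))
means
```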

compute the standard deviation of the `x` & `y` columns in the same way

then do it all in one go using `summarise_if()`, so that the `dataset` column is excluded and the statistics are computed for the other columns
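A sketch of the all-in-one version. Choosing `is.numeric` as the predicate and passing a named list of functions are assumptions about the intended solution; the grouping column is left out of the computation automatically.

```r
library(datasauRus)
library(dplyr)

ds_dozen <- datasaurus_dozen

# Apply mean() and sd() to every numeric column, per dataset.
# The named list produces the columns x_mean, y_mean, x_sd, y_sd.
stats <- ds_dozen %>%
  group_by(dataset) %>%
  summarise_if(is.numeric, list(mean = mean, sd = sd))
stats
```

In recent dplyr versions, `summarise_if()` is superseded by `across()`, but it still works as shown.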

what can you conclude?

Plot the datasauRus

plot `ds_dozen` with `ggplot()` such that the aesthetics are `aes(x = x, y = y)`, with the geometry geom_point()

Tip

the ggplot() and geom_point() functions must be linked with a + sign
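A minimal sketch of this first plot. Storing the plot in `p` is optional and only done here so that it can be reused.

```r
library(datasauRus)
library(ggplot2)

ds_dozen <- datasaurus_dozen

# ggplot() sets up data and aesthetics, + adds the point layer
p <- ggplot(ds_dozen, aes(x = x, y = y)) +
  geom_point()
p
```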

reuse the above command, now colouring the points by the `dataset` column
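One way to do this is to map `dataset` to the colour aesthetic, as sketched here.

```r
library(datasauRus)
library(ggplot2)

ds_dozen <- datasaurus_dozen

# Same plot, with one colour per dataset; a legend is added automatically
p_colour <- ggplot(ds_dozen, aes(x = x, y = y, colour = dataset)) +
  geom_point()
p_colour
```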

too many datasets are displayed, how can we plot only one at a time?

adjust the filtering step to plot two datasets

Tip

R provides the infix operator %in% to test whether each element of the left operand has a match in the right one (most probably a vector)
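A sketch of both filtering steps. The choice of the `dino` and `star` datasets is arbitrary; any of the 13 names would do.

```r
library(datasauRus)
library(dplyr)
library(ggplot2)

ds_dozen <- datasaurus_dozen

# One dataset: keep the rows whose dataset column equals "dino"
one <- filter(ds_dozen, dataset == "dino")
ggplot(one, aes(x = x, y = y)) +
  geom_point()

# Two datasets: %in% tests each value against a vector of names
two <- filter(ds_dozen, dataset %in% c("dino", "star"))
ggplot(two, aes(x = x, y = y, colour = dataset)) +
  geom_point()
```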

expand now by getting one `dataset` per facet

remove the filtering step to facet all datasets

tweak the theme: use `theme_void()` and remove the legend
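A sketch of the final static plot, with one panel per dataset, an empty theme and no legend. Using facet_wrap() with three columns is an assumption about the layout.

```r
library(datasauRus)
library(ggplot2)

ds_dozen <- datasaurus_dozen

p_facets <- ggplot(ds_dozen, aes(x = x, y = y, colour = dataset)) +
  geom_point() +
  # one panel per dataset
  facet_wrap(~ dataset, ncol = 3) +
  # theme_void() removes axes, grid and background
  theme_void() +
  # the legend is removed after theme_void() so the setting is not overwritten
  theme(legend.position = "none")
p_facets
```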

are the datasets actually that similar?

Tip

the R package gifski can be installed on your machine; it makes GIF creation faster.

install `gganimate`; its dependencies will be installed automatically.

pass the `dataset` variable to the `transition_states()` layer
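A possible animation is sketched below, assuming gganimate is installed. The values of `transition_length` and `state_length` and the title showing the current dataset are illustrative choices, not prescribed settings.

```r
library(datasauRus)
library(ggplot2)
library(gganimate)

ds_dozen <- datasaurus_dozen

anim <- ggplot(ds_dozen, aes(x = x, y = y)) +
  geom_point() +
  # each dataset is one state; points are tweened between states
  transition_states(dataset, transition_length = 5, state_length = 2) +
  # {closest_state} is replaced by the name of the dataset being shown
  labs(title = "{closest_state}") +
  theme_void()

# Rendering can take a while; uncomment to build the GIF
# animate(anim)
```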

visualize how small the differences in means are for both coordinates

you need to zoom in tremendously to see almost anything. Accumulate all states to see the motion better.
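One way to read this step is sketched here: animate the per-dataset means and keep the previous states on screen with gganimate's shadow_mark(). Both the use of shadow_mark() and the point size are assumptions about the intended solution.

```r
library(datasauRus)
library(dplyr)
library(ggplot2)
library(gganimate)

ds_dozen <- datasaurus_dozen

# One point per dataset, located at its mean x and mean y
ds_means <- ds_dozen %>%
  group_by(dataset) %>%
  summarise(x = mean(x), y = mean(y))

anim_means <- ggplot(ds_means, aes(x = x, y = y, colour = dataset)) +
  geom_point(size = 5) +
  transition_states(dataset, transition_length = 5, state_length = 2) +
  # keep the points of past states visible so they accumulate
  shadow_mark()

# Axes are scaled to the tiny range of the means, hence the strong zoom
# animate(anim_means)
```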

Conclusion

never trust summary statistics alone; always visualize your data | Alberto Cairo

Authors

Alberto Cairo (creator)
Justin Matejka
George Fitzmaurice
Lucy McGowan

from this post

TD datasaurus

Aurelien Ginolhac

2019-09-17
