Book reference: Michael J. Crawley. 2005. Statistics: An Introduction using R.

Graphically comparing multiple distributions

A single value (median, mean, p-value, etc.) does not mean much without confidence intervals or ranges that allow the comparison of two or more distributions.
Here, we are going to:

  1. produce two different graphical representations to compare 5 distributions (boxplots and barplots with error bars)
  2. perform a statistical analysis to test whether these 5 distributions are different

The clipping dataset

In this dataset, 5 different clipping procedures on plant samples were tested for their effect on biomass production:

  • One control (unclipped)
  • Two intensities of shoot clipping: n25 and n50
  • Two intensities of root clipping: r5 and r10

We would like to know which procedure leads to the highest biomass production.

  • Download the tab-separated file competition.txt and import it into R as a tibble named clipping.

  • Describe the data
    • Are the two variables discrete or continuous?
    • For the discrete one, list its levels.
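A minimal import sketch, assuming competition.txt sits in the working directory and that its two columns are named clipping (the treatment) and biomass (the response); adjust the names if the file differs:

library(tidyverse)

# read the tab-separated file into a tibble
clipping <- read_tsv("competition.txt")

glimpse(clipping)          # column types: which variable is discrete, which is continuous?
count(clipping, clipping)  # levels of the discrete variable, with the number of rows per level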

Boxplot representation

Use ggplot2 to draw a boxplot representation of the 5 levels.

  • Do you observe any outliers?
  • What do the horizontal black lines represent?
  • Looking at the medians, would you expect a statistical difference between the control and the clipping treatments?
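A possible sketch, under the same column-name assumptions as above:

# biomass distribution for each clipping level
ggplot(clipping, aes(x = clipping, y = biomass)) +
  geom_boxplot()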

Barplots with error bars representation

A popular alternative to the boxplot is the barplot with error bars. Several kinds of error bars can be represented.
Here, we decide to use the standard error of the mean based on the pooled error variance from the ANOVA. ggplot() cannot calculate these errors by itself, so we first need to generate the relevant error values. For this, we can use the output of R's ANOVA function, aov().

Calling the aov() ANOVA function

Tip

  • From the help (?aov) we can read that the aov() function requires at least a formula argument.
  • In R, a formula is characterized by the presence of the tilde character (~) and is constructed as response ~ factor, meaning that the variable response depends on the variable factor.
  • Construct the formula describing how the clipping procedure would affect the biomass production.
  • Now call the ANOVA using the aov() function and the formula.
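A minimal sketch, assuming the response column is named biomass and the factor column clipping; the fit is stored right away in a variable named a, as required in the next step:

# biomass is modelled as depending on the clipping treatment
a <- aov(biomass ~ clipping, data = clipping)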

Error bars with standard error of the means

Before interpreting the possible significance of this test, let us calculate the pooled error variance \(s^{2}\) (also called the residual mean square) of the ANOVA.

\[ MSE = s^{2} = \frac{SSE}{DFE} \]

where SSE is the sum of squares of the residuals and DFE is the degrees of freedom for the error.

Store the ANOVA result in a variable called a

  • Determine the type of the object a

  • From now on, we would like to stick to a tidyverse approach. Which package would you use to handle this kind of object?

  • Identify the function providing the value of \(s^{2}\) and extract it. Save the value with the name mse
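One possible route (among several offered by broom): tidy() turns the ANOVA table into a tibble, and the Residuals row carries the residual mean square:

class(a)            # c("aov", "lm"): a base-R model object

library(broom)      # tidyverse-friendly handling of model objects

mse <- tidy(a) %>%                    # the ANOVA table as a tibble
  filter(term == "Residuals") %>%
  pull(meansq)                        # s^2, the residual mean square
mse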

Calculate \(s^{2}\) yourself using a dplyr pipeline

To better appreciate how the \(s^{2}\) value is obtained, you can calculate it yourself starting from the clipping data frame.

  • First, generate a tibble containing a single row and 2 columns: the first containing the data and the second the result of the ANOVA.

Tip

Use the Jenny Bryan approach, with nest(). No need to group as we have only one predictor.
  • Add two additional columns that will contain the residuals and the degrees of freedom of the residuals.

Tip

Use broom to obtain the residuals and the degrees of freedom for the error. To identify the suitable functions, use the object a you created before to test the different broom functions and select the appropriate ones.
  • Extract the residuals and the degrees of freedom as two new columns and discard the previous ones.

Tip

Have a look at the output of the broom functions you selected before (using the object a) to identify the names of your columns of interest. Use these names to extract the values of interest.
  • Calculate the sum of squares of the residuals and \(s^2\).
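A condensed sketch of one possible pipeline (tidyr >= 1.0 nest() syntax); broom::augment() supplies the per-observation residuals, and the error degrees of freedom are taken here from the base helper df.residual() (a broom summary column would serve the same purpose):

clipping %>%
  nest(data = everything()) %>%                          # one row, the whole data set in one cell
  mutate(fit = map(data, ~ aov(biomass ~ clipping, data = .x)),
         res = map(fit, augment),                        # per-observation values, including .resid
         dfe = map_dbl(fit, df.residual)) %>%            # degrees of freedom for the error
  transmute(sse = map_dbl(res, ~ sum(.x$.resid^2)),      # sum of squares of the residuals
            dfe) %>%
  mutate(mse = sse / dfe)                                # should match the value extracted above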

Computing error bars

Now create a tibble containing, for each clipping group: the mean, the number of measurements (\(n\)), and the standard error of the mean defined as \(se = \sqrt{s^{2}/n}\). \(s^2\) was computed in the first part and saved under the name mse.

Warning

The MSE is the pooled error variance, based on the residuals of all levels. Each group exhibits the same standard error of the mean if the study is balanced (i.e. each level has the same n).
## # A tibble: 1 x 4
##   clipping mean  n     se   
##   <chr>    <chr> <chr> <chr>
## 1 ?        ?     ?     ?
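A sketch of this summary, assuming the mse value extracted earlier and columns named clipping and biomass; the result is kept as clipping_summary (a name chosen here) for reuse in the plots below:

clipping_summary <- clipping %>%
  group_by(clipping) %>%
  summarise(mean = mean(biomass),
            n    = n(),
            se   = sqrt(mse / n))   # pooled s^2, hence an identical se for a balanced design
clipping_summary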
  • Draw a barchart and add a layer with error bars (geom_errorbar()) using the calculated standard error of the mean (a sketch follows this list).

Tip

By default error bars are as large as the bars, you might want to make them smaller using the width = 0.5 option.
  • Which means appear different from the others?
  • In your opinion, what should be the minimal distance between two means for them to be considered significantly different? Express the distance as a multiple of the standard error.
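A possible sketch, reusing the clipping_summary tibble defined above:

# barchart of the group means with +/- one standard error of the mean
ggplot(clipping_summary, aes(x = clipping, y = mean)) +
  geom_col() +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width = 0.5)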

Error bars with 95% confidence intervals

Alternatively we can represent the 95% confidence intervals instead of standard errors of the means to draw the error bars.

  • From the t distribution, by which value should we multiply the standard error to obtain the 95% confidence interval? Think about the degrees of freedom and the quantile to use.
  • Change your previous data frame accordingly and re-draw the barplots.
  • What can you conclude?
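One way to sketch it, reusing clipping_summary and the fit a; the multiplier is the 0.975 quantile of the t distribution with the error degrees of freedom:

dfe <- df.residual(a)                       # degrees of freedom for the error

clipping_ci <- clipping_summary %>%
  mutate(ci95 = qt(0.975, dfe) * se)        # half-width of the 95% confidence interval

ggplot(clipping_ci, aes(x = clipping, y = mean)) +
  geom_col() +
  geom_errorbar(aes(ymin = mean - ci95, ymax = mean + ci95), width = 0.5)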

Error bars with the Least Significant Difference

A compromise between the confidence intervals and the standard error is the use of the Least Significant Difference (LSD) measure defined as:

\[LSD = qt(0.975, df_{\text{difference}}) \times \text{standard error of the difference}\]

where the degrees of freedom for the difference is the sum of the degrees of freedom of the two groups being compared, i.e. the sum of both counts minus 2.

  • Re-draw the barplots using the LSD measure for the error bars.
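A sketch under the same assumptions: with equal group sizes the standard error of a difference between two means is \(\sqrt{2 s^{2} / n}\), its degrees of freedom are \(n_1 + n_2 - 2 = 2(n - 1)\), and one common convention is to draw bars of \(\pm LSD/2\) so that non-overlapping bars flag a significant difference:

clipping_lsd <- clipping_summary %>%
  mutate(lsd = qt(0.975, 2 * (n - 1)) * sqrt(2 * mse / n))    # Least Significant Difference

ggplot(clipping_lsd, aes(x = clipping, y = mean)) +
  geom_col() +
  geom_errorbar(aes(ymin = mean - lsd / 2, ymax = mean + lsd / 2), width = 0.5)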

Summary of the ANOVA

Checking ANOVA assumptions

Before interpreting the significance of the ANOVA, let us check if the prerequisites are met.

  • What should we check?

Tip

The Bartlett test (bartlett.test()) is useful for one of the ANOVA’s assumptions.
  • Nest the clipping dataset to a single cell and perform the Bartlett test.
    Get the result of the Bartlett test as a tibble.

  • To test the normality, use purrr to:
    • perform the ANOVA
    • extract the residuals
    • perform the Shapiro-Wilk test on the residuals
    • format the result as a (nested) data frame

Tip

Once you have the results of the Shapiro-Wilk test and the Bartlett test as nested tibbles, select both columns and use gather() to get a nice output.
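A compact sketch of both checks; it nests the data, tidies the test results with broom, and keeps only the p-values (a fuller version would keep the whole tidied output):

library(broom)

clipping %>%
  nest(data = everything()) %>%
  mutate(
    # homogeneity of variances across the clipping levels
    bartlett = map(data, ~ tidy(bartlett.test(biomass ~ clipping, data = .x))),
    # normality of the ANOVA residuals
    shapiro  = map(data, ~ tidy(shapiro.test(residuals(aov(biomass ~ clipping, data = .x)))))
  ) %>%
  transmute(bartlett = map_dbl(bartlett, "p.value"),
            shapiro  = map_dbl(shapiro, "p.value")) %>%
  gather(key = "test", value = "p.value")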

Is it possible to perform the ANOVA?

Interpreting the ANOVA

Using the summary() function, check the p-value of the ANOVA.
What is your conclusion?
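With the fit stored earlier in a:

summary(a)    # the ANOVA table; the Pr(>F) column holds the p-value for the clipping effect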

Post-hoc analysis

The ANOVA is able to tell us whether there is a difference between the biomass means obtained after the different clipping methods, but it does not tell us which method might be responsible for the difference.

Tip

Check the TukeyHSD() function, which performs such a post-hoc test for ANOVA analyses.
  • Why are these post-hoc tests important?

  • perform the post-hoc test

  • Represent the conclusion of this test graphically: use the tidy() function (from the broom package) to generate a suitable data frame and plot the ranges with geom_pointrange() (see the sketch after this list).

  • What is your final conclusion in particular with regard to the previously produced plots?
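A sketch of the post-hoc step, assuming the aov fit stored in a; note that the column holding the pairwise comparison is named contrast in recent broom releases (comparison in older ones):

tukey <- TukeyHSD(a)                # all pairwise comparisons, adjusted for multiple testing

tidy(tukey) %>%
  ggplot(aes(x = contrast, y = estimate, ymin = conf.low, ymax = conf.high)) +
  geom_pointrange() +
  geom_hline(yintercept = 0, linetype = "dashed") +   # intervals crossing zero: no significant difference
  coord_flip()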