book reference: Michael J. Crawley. 2005. Statistics: An introduction using R
A value, median, mean, p-value etc. does not mean much without confidence intervals or ranges permitting the comparison of two or more distributions.
Here, we are going to:
In this dataset, 5 different clipping procedures on plant samples were tested for their effect on biomass production:
n25
and n50
.r5
and r10
.We would like to know which procedure leads to the highest biomass production.
Download the the tabulation-separated file competition.txt
and import it into R as a tibble named clipping
.
Use ggplot2
to draw a boxplot representation of the 5 levels.
A popular alternative to a boxplot is the barplot with error bars representation. It is possible to represent different kind of error-bars.
Here, we decide to use the standard error of the mean based on the pooled error variance from the ANOVA. ggplot()
is not able to calculate these errors by itself and we will first need to generate the relevant error values. For this, we can use the output of the ANOVA function in R (aov()
).
aov()
ANOVA function?aov
) we can read that the aov()
function requires at least a formula argument.~
) and is constructed as follows: response ~ factor
which would mean that the variable response
depends on the variable factor
.aov()
function and the formula.Before interpreting the eventual significance of this test, let us calculate the pooled error variance \(s^{2}\) (also called the residual mean square) of the ANOVA.
\[ MSE = s^{2} = \frac{SSE}{DFE} \]
where SSE is the sum of squares of the residuals and DFE is the degree of freedom for the error.
Store the ANOVA result in a variable called a
Determine the type of the object a
From now on, we would like to stick to a tidyverse approach. Which package would you use to handle these kind of objects?
Identify the function providing the value of \(s^{2}\) and extract it. Save the value with the name mse
To better appreciate how the \(s^{2}\) value is obtained, you can calculate it yourself starting from the clipping data frame.
nest()
. No need to group as we have only one pedictor.
broom
to obtain the residuals and the degree of freedom for the error. To identify the suitable functions, use the object a
you created before to test the different broom
functions and select the appropriate ones.
broom
functions you selected before (using the object a
) to identify the names of your columns of interest. Use these names to extract the values of interest.
Now create a tibble
containing for each group of clipping method: the mean, number of measures (\(n\)) and the standard error of the mean defined as \(se = \sqrt{s^{2}/n}\). \(s^2\) was computed in the first part, and saved with the name mse
## # A tibble: 1 x 4
## clipping mean n se
## <chr> <chr> <chr> <chr>
## 1 ? ? ? ?
geom_errorbar()
) using the calculated standard error of the mean.width = 0.5
option.
x
fold the standard error).Alternatively we can represent the 95% confidence intervals instead of standard errors of the means to draw the error bars.
A compromise between the confidence intervals and the standard error is the use of the Least Significant Difference (LSD) measure defined as:
\[LSD = qt(0.975, clipping) \times \text{std error of the difference}\]
Where the degree of freedom for the difference is the sum of both clipping
thus the sum of both counts - 2.
Before interpreting the significance of the ANOVA, let us check if the prerequisites are met.
bartlett.test()
) is useful for one of the ANOVA’s assumption
Nest the clipping dataset to a single cell and perform the Bartlett test.
Get the result of the Bartlett test as a tibble.
purrr
to:
gather()
to get a nice output.
Is it possible to perform the ANOVA?
Using the summary()
function, check the p-value of the ANOVA.
What is your conclusion?
The ANOVA is able to tell us if there is a difference between the biomass means obtained after different clipping methods but it does not tell us which method might be responsible for the difference.
TukeyHSD()
function which perform such a post-hoc test for ANOVA analyses.
Why are these post-hoc tests important?
perform the post-hoc test
Represent the conclusion of this test graphically: Use the tidy()
function (from the broom
library) to generate a suitable data.frame and plot the ranges with geom_pointrange()
.
What is your final conclusion in particular with regard to the previously produced plots?