Analysis of Variance

December 2019

Learning objectives

You will learn to:

perform an analysis of variance
understand that it is only a linear model
assess the output
perform post-hoc tests

Reading

Statistics. An introduction using R by Michael J. Crawley
Biostatistic Design and Analysis Using R by Murray Logan (free pdf)

t-tests seen as linear regression

garden A & B

t-tests seen as linear regression

garden A & C

t-test = linear models

does the difference in intercept ≠ 0

# no Welch correction
t.test(ozone ~ garden, 
       data = gardenAC,
       var.equal = TRUE)

    Two Sample t-test

data:  ozone by garden
t = -1.6036, df = 18, p-value = 0.1262
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.6203102  0.6203102
sample estimates:
mean in group gardenA mean in group gardenC 
                    3                     5

note that \(-1.6036^2 = 2.571\)

lm(ozone ~ garden,
   data = gardenAC) %>%
  tidy()

# A tibble: 2 x 5
  term          estimate std.error statistic p.value
  <chr>            <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)       3        0.882      3.40 0.00318
2 gardengardenC     2.00     1.25       1.60 0.126

# A tibble: 1 x 7
  r.squared adj.r.squared sigma statistic p.value    df df.residual
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>       <int>
1     0.125        0.0764  2.79      2.57   0.126     2          18

F = t square for simple linear regression

tibble(q = seq(0, 3, length.out = 100),
       t = (pt(q, df = 18) * 2) - 1,
       f = pf((q)^2, df1 = 1,
              df2 = 18)) %>%
  pivot_longer(names_to = "distribution", 
               values_to = "pdf", 
               cols = -q) %>%
  ggplot(aes(q)) +
  geom_line(aes(y = pdf,
                colour = distribution),
            size = 2, alpha = 0.75) +
  theme(legend.position = c(0.8, 0.2))

F & t square in a shiny app

For categorical data, R create a dummy variable based on factors

the `(Intercept)` term is the mean of the first level

gardenAC %>% 
  mutate(dummy = if_else(garden == "gardenA", 0, 1)) %>%
  lm(ozone ~ dummy, data = .) %>% tidy()

# A tibble: 2 x 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)     3        0.882      3.40 0.00318
2 dummy           2.00     1.25       1.60 0.126

original dummy coding

tidy(lm(ozone ~ garden,
        data = gardenAC))

# A tibble: 2 x 5
  term          estimate std.error statistic p.value
  <chr>            <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)       3        0.882      3.40 0.00318
2 gardengardenC     2.00     1.25       1.60 0.126

Using factors

`forcats`

see if we inverse the dummy coding

gardenAC %>% 
  mutate(dummy = if_else(garden == "gardenA", 1, 0)) %>%
  lm(ozone ~ dummy, data = .) %>% tidy()

# A tibble: 2 x 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)        5     0.882      5.67 0.0000223
2 dummy             -2     1.25      -1.60 0.126

using `fct_relevel()`

gardenAC %>% 
  mutate(garden = fct_relevel(garden, "gardenC")) %>%
  lm(ozone ~ garden, data = .) %>%
  tidy()

# A tibble: 2 x 5
  term          estimate std.error statistic   p.value
  <chr>            <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)          5     0.882      5.67 0.0000223
2 gardengardenA       -2     1.25      -1.60 0.126

ANOVA

ANalysis Of VAriance

ANOVA

response

is still continous

but predictors

are categorical

Goal: compare 2 or more means

one-way: one categorical variable
two-ways: 2 categorical variables
three-ways: 3 categorical variables
interactions between predictors possible. Actually, they are just linear models

ANOVA

assumptions

residuals normally distributed

although robust as long as sample sizes and variances are equal

equal variance

var should stay in a 3:1 ratio
Bartlett test (bartlett.test())
transform otherwise

independence

between levels

non-parametric alternative: Kruskal-Wallis test (kruskal.test())

one-way ANOVA with 2 levels = linear model

using `lm()`

lm(ozone ~ garden, data = gardenAC) %>%
  tidy()

# A tibble: 2 x 5
  term          estimate std.error statistic p.value
  <chr>            <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)       3        0.882      3.40 0.00318
2 gardengardenC     2.00     1.25       1.60 0.126

R function for computing an ANOVA is `aov()`

aov(ozone ~ garden, data = gardenAC) %>%
  tidy()

# A tibble: 2 x 6
  term         df sumsq meansq statistic p.value
  <chr>     <dbl> <dbl>  <dbl>     <dbl>   <dbl>
1 garden        1  20.0  20.0       2.57   0.126
2 Residuals    18 140.    7.78     NA     NA

ANOVA with a factor with 2 levels

\[F = \frac{variance\; between}{variance\; within}\]

Variance within

SSY

Overall variation, total sum of squares SSY

Variance within

SSY

Overall variation, total sum of squares SSY

Variance between

SSE

Variance between, the error sum of squares SSE

Variance between

SSE

Variance between, the error sum of squares SSE

Analysis of variance works as

when means are significantly different, then the sum of squares computed from the individual treatment means will be smaller than the sum of squares computed from the overall mean M. Crawley

Total variance = explained variance (main effect) + unexplained variance

\[SSY = SSA + SSE\]

\(SSA\) is unknown, and need to be computed by \(SSY - SSE\)

ANOVA detailed

total and unexplained variation

SSY, total variance

SSY <- gardenAB %>%
  ungroup() %>%
  summarise(SSY = sum((ozone - mean(ozone))^2))
SSY

# A tibble: 1 x 1
    SSY
  <dbl>
1    44

SSE, unexplained variance

SSE <- gardenAB %>%
  group_by(garden) %>%
  summarise(SS = sum((ozone - mean(ozone))^2)) %>%
  pull(SS) %>% sum()
SSE

[1] 24

The question now is ’how much of this 44 is attributable to differences between means of A & B (SSA = explained variation) and how much is sampling error (SSE = unexplained variation) Michael J. Crawley

ANOVA detailed

explained variation

treatment is the main effect: SSA

i.e the total variation minus the individual errors.

(SSA <- 44 - (12 + 12))

[1] 20

ANOVA table

Source	Sum of Squares	Degrees of freedom	Mean Square	F ratio
Garden (SSA)	20	1 (#levels -1)	20	15
Error (SSE)	24	18 (n - #levels)	1.33333 (= s^2)	NA
Total (SSY)	44	19	NA	NA

Mean Square = Sum of Squares / degrees of freedom

\(s^2\) stands for residual mean square

\[F = \frac{Explained\; by\; effect}{Unexplained} = \frac{\bar{var}(SSA)}{\bar{var}(SSE)}\]

ANOVA test

we observe a F ratio of 15.

Look up the F table for critical value, note that is a one-tail test

one-tail test

qf(0.95, df1 = 1, df2 = 18)

[1] 4.413873

How much significant?

1 - pf(15, df1 = 1, df2 = 18)

[1] 0.001114539

ANOVA in R

summary(aov(ozone ~ garden, data = gardenAB))

            Df Sum Sq Mean Sq F value  Pr(>F)   
garden       1     20  20.000      15 0.00111 **
Residuals   18     24   1.333                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

all we know is that treatment has a significant effet, the means are different. To display all factors, use summary.lm

linear regression formula

\[ Y = \beta_0 + \beta_1 x_1\]

where \(x_1\) is continuous

ANOVA, the equation becomes

\[ Y = \beta_0 + \beta_{11} x_{11} + \beta_{12} x_{12} \]

where \(x_{11}\) and \(x_{12}\) are dummies for each levels of the factor \(x_1\)

Remember: in R

the (Intercept) term is the mean of the first level, i.e. \(\beta_0 + \beta_{11} x_{11}\)

ANOVA seen as a linear regression

`summary.lm(aov())`

Call:
aov(formula = ozone ~ garden, data = gardenAB)

Residuals:
   Min     1Q Median     3Q    Max 
    -2     -1      0      1      2 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)     3.0000     0.3651   8.216 1.67e-07 ***
gardengardenB   2.0000     0.5164   3.873  0.00111 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.155 on 18 degrees of freedom
Multiple R-squared:  0.4545,    Adjusted R-squared:  0.4242 
F-statistic:    15 on 1 and 18 DF,  p-value: 0.001115

garden	mean
gardenA	3
gardenB	5

(Intercept) is mean(level1) = mean(gardenA) = 3
mean(gardenB) = mean(gardenA) + 2 = 5

gardengardenB is seen as the difference between means

ANOVA seen as a linear regression

3 levels, t.test is out

Call:
aov(formula = ozone ~ garden, data = .)

Residuals:
   Min     1Q Median     3Q    Max 
 -4.00  -1.75   0.00   1.00   6.00 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)     3.0000     0.7503   3.998 0.000444 ***
gardengardenB   2.0000     1.0611   1.885 0.070258 .  
gardengardenC   2.0000     1.0611   1.885 0.070258 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.373 on 27 degrees of freedom
Multiple R-squared:  0.1493,    Adjusted R-squared:  0.08624 
F-statistic: 2.368 on 2 and 27 DF,  p-value: 0.1128

garden	mean
gardenA	3
gardenB	5
gardenC	5

(Intercept) still mean(level1)
next estimates are relative the 1st level

F tells the 3 means are not different

ANOVA summary table

3 levels

Df	Sum Sq	Mean Sq	F value	Pr(>F)
2	26.667	13.333	2.368	0.113
27	152.000	5.630	NA	NA

two-ways ANOVA

quinn example

variables

density of adult limpets (patella)
during 2 seasons
record egg mass production
at intertidal phase
in triplicate
link: quinn dataset
mind reading the factors…

source: Murray Logan 2009 inspired by Quinn and Keough 2002

two-ways ANOVA

questions

is the egg production affected by the adult density?
is the egg production affected by season?
is the egg production affected by the adult density similarly at all seasons?

source: Murray Logan 2009 inspired by Quinn and Keough 2002

translate in R formula

two-ways ANOVA, limpets

answers

is the egg production affected by the adult density?
is the egg production affected by season?
is the egg production affected by the adult density similarly at all seasons?

aov(EGGS ~ DENSITY * SEASON, data = quinn) %>% 
  summary()

               Df Sum Sq Mean Sq F value   Pr(>F)    
DENSITY         3  5.284   1.761   9.669 0.000704 ***
SEASON          1  3.250   3.250  17.842 0.000645 ***
DENSITY:SEASON  3  0.165   0.055   0.301 0.823955    
Residuals      16  2.915   0.182                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

yes, egg production depends on the adult density
yes, egg production depends on season
the effect of adult density is consistent across seasons

two-ways ANOVA, limpets

diagnostic plot

from the anova model:
- get the residuals
- get fitted values
- one function from broom does it well
plot residuals ~ fitted
visible trends?

two-ways ANOVA, limpets

diagnostic plot

aov(EGGS ~ DENSITY * SEASON,
    data = quinn) %>%
  augment() %>%
  ggplot(aes(.fitted, .resid)) + 
  geom_point() +
  geom_smooth(method = "loess")

satisfactory?

yes, no visible trends

two-ways ANOVA

limpets, final plot

final plot

from the anova model:
- get fitted values
- get the std error of fitted (default, not computed)
plot DENSITY ~ fitted, colored by SEASON
- joined by a line
- add std error on each fitted point
do you observe an interaction between predictors?

two-ways ANOVA

limpets, final plot

Using predicted EGG production

aov(EGGS ~ DENSITY * SEASON,
    data = quinn) %>%
  augment(se_fit = TRUE) %>%
  ggplot(aes(DENSITY, .fitted,
             colour = SEASON)) + 
  geom_pointrange(aes(ymin = .fitted - .se.fit,
                      ymax = .fitted + .se.fit)) +
  geom_line(aes(group = SEASON)) +
  theme(legend.justification = c(1, 1),
        legend.position = c(.9, .9))

note that parallel lines = no interaction

using means

quinn %>% group_by(DENSITY, SEASON) %>% 
  summarise(M_PROD = mean(EGGS)) %>%
  ggplot() +
  geom_tile(aes(x = DENSITY, y = SEASON,
                fill = M_PROD)) +
  scale_fill_viridis_c() +
  scale_x_discrete(expand = c(0, 0)) +
  scale_y_discrete(expand = c(0, 0))

using ANOVA

aov(EGGS ~ DENSITY * SEASON, data = quinn) %>%
  augment() %>%
  ggplot() +
  geom_tile(aes(x = DENSITY, y = SEASON, 
                fill = .fitted)) +
  scale_fill_viridis_c() +
  scale_x_discrete(expand = c(0, 0)) +
  scale_y_discrete(expand = c(0, 0))

how can we know which mean-comparisons are significant?

Limitations of signifiance

When we conclude that a term is signifiant, all we know is that the means are different between levels. We may be interested in knowing which mean is lower/higher

lm(EGGS ~ DENSITY * SEASON, data = quinn) %>% summary()

Call:
lm(formula = EGGS ~ DENSITY * SEASON, data = quinn)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.6667 -0.2612 -0.0610  0.2292  0.6647 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)             2.41667    0.24642   9.807  3.6e-08 ***
DENSITY15              -0.23933    0.34849  -0.687  0.50206    
DENSITY30              -0.85133    0.34849  -2.443  0.02655 *  
DENSITY45              -1.21700    0.34849  -3.492  0.00301 ** 
SEASONsummer           -0.58333    0.34849  -1.674  0.11358    
DENSITY15:SEASONsummer -0.41633    0.49284  -0.845  0.41069    
DENSITY30:SEASONsummer -0.17067    0.49284  -0.346  0.73363    
DENSITY45:SEASONsummer -0.02367    0.49284  -0.048  0.96229    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4268 on 16 degrees of freedom
Multiple R-squared:  0.749, Adjusted R-squared:  0.6392 
F-statistic: 6.822 on 7 and 16 DF,  p-value: 0.000745

# same with: summary.lm(aov(EGGS ~ DENSITY * SEASON, data = quinn))

post-hoc tests

raw output

Tukey Honest Significant Diff

aov(EGGS ~ DENSITY * SEASON, data = quinn) %>% TukeyHSD()

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = EGGS ~ DENSITY * SEASON, data = quinn)

$DENSITY
            diff        lwr         upr     p adj
15-8  -0.4475000 -1.1525067  0.25750672 0.3020713
30-8  -0.9366667 -1.6416734 -0.23165995 0.0076800
45-8  -1.2288333 -1.9338401 -0.52382662 0.0006998
30-15 -0.4891667 -1.1941734  0.21584005 0.2342310
45-15 -0.7813333 -1.4863401 -0.07632662 0.0273273
45-30 -0.2921667 -0.9971734  0.41284005 0.6441191

$SEASON
                diff      lwr        upr     p adj
summer-spring -0.736 -1.10538 -0.3666196 0.0006453

$`DENSITY:SEASON`
                          diff        lwr         upr     p adj
15:spring-8:spring  -0.2393333 -1.4458501  0.96718341 0.9962324
30:spring-8:spring  -0.8513333 -2.0578501  0.35518341 0.2856238
45:spring-8:spring  -1.2170000 -2.4235167 -0.01048326 0.0472578
8:summer-8:spring   -0.5833333 -1.7898501  0.62318341 0.7025671
15:summer-8:spring  -1.2390000 -2.4455167 -0.03248326 0.0419597
30:summer-8:spring  -1.6053333 -2.8118501 -0.39881659 0.0054858
45:summer-8:spring  -1.8240000 -3.0305167 -0.61748326 0.0016303
30:spring-15:spring -0.6120000 -1.8185167  0.59451674 0.6546904
45:spring-15:spring -0.9776667 -2.1841834  0.22885008 0.1613268
8:summer-15:spring  -0.3440000 -1.5505167  0.86251674 0.9699272
15:summer-15:spring -0.9996667 -2.2061834  0.20685008 0.1450778
30:summer-15:spring -1.3660000 -2.5725167 -0.15948326 0.0208894
45:summer-15:spring -1.5846667 -2.7911834 -0.37814992 0.0061579
45:spring-30:spring -0.3656667 -1.5721834  0.84085008 0.9587034
8:summer-30:spring   0.2680000 -0.9385167  1.47451674 0.9925771
15:summer-30:spring -0.3876667 -1.5941834  0.81885008 0.9446738
30:summer-30:spring -0.7540000 -1.9605167  0.45251674 0.4194243
45:summer-30:spring -0.9726667 -2.1791834  0.23385008 0.1652262
8:summer-45:spring   0.6336667 -0.5728501  1.84018341 0.6178730
15:summer-45:spring -0.0220000 -1.2285167  1.18451674 1.0000000
30:summer-45:spring -0.3883333 -1.5948501  0.81818341 0.9442053
45:summer-45:spring -0.6070000 -1.8135167  0.59951674 0.6631287
15:summer-8:summer  -0.6556667 -1.8621834  0.55085008 0.5803457
30:summer-8:summer  -1.0220000 -2.2285167  0.18451674 0.1300342
45:summer-8:summer  -1.2406667 -2.4471834 -0.03414992 0.0415823
30:summer-15:summer -0.3663333 -1.5728501  0.84018341 0.9583182
45:summer-15:summer -0.5850000 -1.7915167  0.62151674 0.6998230
45:summer-30:summer -0.2186667 -1.4251834  0.98785008 0.9978410

what p adj means / why?
which correction is used?

post-hoc tests

brooming & filtering

tidy anova and filter relevant

aov(EGGS ~ DENSITY * SEASON, data = quinn) %>%
  TukeyHSD() %>% 
  tidy() %>%
  arrange(adj.p.value) %>%
  filter(adj.p.value < 0.05)

# A tibble: 11 x 6
   term          comparison         estimate conf.low conf.high adj.p.value
   <chr>         <chr>                 <dbl>    <dbl>     <dbl>       <dbl>
 1 SEASON        summer-spring        -0.736    -1.11   -0.367     0.000645
 2 DENSITY       45-8                 -1.23     -1.93   -0.524     0.000700
 3 DENSITY:SEAS… 45:summer-8:spring   -1.82     -3.03   -0.617     0.00163 
 4 DENSITY:SEAS… 30:summer-8:spring   -1.61     -2.81   -0.399     0.00549 
 5 DENSITY:SEAS… 45:summer-15:spri…   -1.58     -2.79   -0.378     0.00616 
 6 DENSITY       30-8                 -0.937    -1.64   -0.232     0.00768 
 7 DENSITY:SEAS… 30:summer-15:spri…   -1.37     -2.57   -0.159     0.0209  
 8 DENSITY       45-15                -0.781    -1.49   -0.0763    0.0273  
 9 DENSITY:SEAS… 45:summer-8:summer   -1.24     -2.45   -0.0341    0.0416  
10 DENSITY:SEAS… 15:summer-8:spring   -1.24     -2.45   -0.0325    0.0420  
11 DENSITY:SEAS… 45:spring-8:spring   -1.22     -2.42   -0.0105    0.0473

post-hoc tests

plot

snake plot

from the Tukey post-hoc tests
tidy them
sort comparison by estimate
plot estimate ~ comparison
pointrange estimate +/- CI
color by < 0.05 TRUE/FALSE
add a dashed horizontal line at y = 0
consider flipping axes for readability

post-hoc tests

plotting

tidy data.frame

aov(EGGS ~ DENSITY * SEASON, data = quinn) %>%
  TukeyHSD() %>% 
  tidy() %>%
  ggplot(aes(x = fct_reorder(comparison, 
                             estimate), 
             y = estimate)) +
  geom_pointrange(aes(ymin = conf.low,
                      ymax = conf.high,
          color = adj.p.value < 0.05)) +
  geom_hline(yintercept = 0, 
             linetype = "dashed") +
  coord_flip() +
  labs(x = NULL,
       colour = "padj < .05")

post-hoc tests

facet plot

snake plot and facetting

from the previous snakeplot
facet by term

post-hoc tests

plotting

adding `facet_wrap(~ term)`

ANCOVA

short introduction

ANCOVA

continuous and categorical predictors

We estimate a slope and a intercept for each level of one or more factor Michael Crawley

A continous variable added to an ANOVA is a covariate

source: M. Logan, Chapter 15

On the right panel, could predict an interaction?

ANCOVA to increase explained variance

Adjust mean estimate of levels using the covariate’s mean

source: M. Logan, Chapter 15

ANCOVA, covariate must be in the range

source: M. Logan, Chapter 15

Before we stop

Acknowledgments

Some of the figures in this presentation are taken from:

Biostatistic Design and Analysis Using R. (Wiley-Blackwell, 2011) Murray Logan

People

Hákon Jónsson
Luc Buissière

Learning objectives

You will learn to:

Reading

t-tests seen as linear regression

garden A & B

t-tests seen as linear regression

garden A & C

t-test = linear models

does the difference in intercept ≠ 0

F = t square for simple linear regression

F & t square in a shiny app

For categorical data, R create a dummy variable based on factors

the (Intercept) term is the mean of the first level

original dummy coding

Using factors

forcats

see if we inverse the dummy coding

using fct_relevel()

ANOVA

ANalysis Of VAriance

ANOVA

response

but predictors

Goal: compare 2 or more means

ANOVA

assumptions

residuals normally distributed

equal variance

independence

one-way ANOVA with 2 levels = linear model

using lm()

R function for computing an ANOVA is aov()

ANOVA with a factor with 2 levels

Variance within

SSY

Variance within

SSY

Variance between

SSE

Variance between

SSE

Analysis of variance works as

Total variance = explained variance (main effect) + unexplained variance

ANOVA detailed

total and unexplained variation

SSY, total variance

SSE, unexplained variance

ANOVA detailed

explained variation

treatment is the main effect: SSA

ANOVA table

ANOVA test

one-tail test

How much significant?

ANOVA in R

linear regression formula

ANOVA, the equation becomes

Remember: in R

ANOVA seen as a linear regression

summary.lm(aov())

ANOVA seen as a linear regression

3 levels, t.test is out

ANOVA summary table

3 levels

two-ways ANOVA

quinn example

variables

two-ways ANOVA

questions

questions

two-ways ANOVA, limpets

answers

two-ways ANOVA, limpets

diagnostic plot

two-ways ANOVA, limpets

diagnostic plot

two-ways ANOVA

limpets, final plot

final plot

two-ways ANOVA

the `(Intercept)` term is the mean of the first level

`forcats`

using `fct_relevel()`

using `lm()`

R function for computing an ANOVA is `aov()`

`summary.lm(aov())`

adding `facet_wrap(~ term)`