October 2019

Biostatistics

You will learn about:

  • the basics of statistics
  • term definitions
  • theoretical distributions
  • comparing two samples

Reading

Statistics

Clearly identify response variables from explanatory (or predictor) variables.

Below, which variable is explanatory and which is the response?

  • blood pressure
  • cholesterol blood concentration

solution

  • blood pressure: response (y axis)
  • cholesterol blood concentration: explanatory (x axis)

Fuel consumption and car weight example

geom_smooth() with linear modelling

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm")

  • Miles per US gallon (mpg) is the response
  • Weight (wt) is a predictor

Correlation

variable order does not matter

cor.test(mtcars$mpg, mtcars$wt)
    Pearson's product-moment correlation

data:  mtcars$mpg and mtcars$wt
t = -9.559, df = 30, p-value = 1.294e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9338264 -0.7440872
sample estimates:
       cor 
-0.8676594 
cor.test(mtcars$wt, mtcars$mpg)
    Pearson's product-moment correlation

data:  mtcars$wt and mtcars$mpg
t = -9.559, df = 30, p-value = 1.294e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9338264 -0.7440872
sample estimates:
       cor 
-0.8676594 

Derive an equation to predict response y

linear regression

summary(lm(mtcars$mpg ~ mtcars$wt))
Call:
lm(formula = mtcars$mpg ~ mtcars$wt)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
mtcars$wt    -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

\(cor^2 = R^2\)

cor(mtcars$wt, mtcars$mpg)
[1] -0.8676594
cor(mtcars$wt, mtcars$mpg)^2
[1] 0.7528328

Correlation is not causation

Variables

Continuous

  • any real-numbered value
women$height[1:10]
 [1] 58 59 60 61 62 63 64 65 66 67

Categorical

  • factor with levels
ToothGrowth$supp[1:10]
 [1] VC VC VC VC VC VC VC VC VC VC
Levels: OJ VC
str(ToothGrowth$supp[1:10])
 Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2

Convert numbers to categories

mtcars$cyl[1:10]
 [1] 6 6 4 6 8 6 8 4 4 6
factor(mtcars$cyl[1:10])
 [1] 6 6 4 6 8 6 8 4 4 6
Levels: 4 6 8

Convert factor to strings

as.character(ToothGrowth$supp[1:10])
 [1] "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC"

Appropriate statistical methods

Explanatory                Response        Method
continuous                 continuous      regression
continuous                 continuous      correlation
categorical                continuous      analysis of variance (ANOVA)
continuous / categorical   continuous      analysis of covariance (ANCOVA)
continuous                 proportion      logistic regression
continuous                 count           log-linear (Poisson) regression
continuous                 time-at-death   survival analysis

What for?

hypotheses

A good hypothesis is a falsifiable hypothesis (Karl Popper)

Which of these two is a good hypothesis?

#1

  • There are vultures in the local park

#2

  • There are no vultures in the local park

Null hypotheses

the null hypothesis, \(H_0\), is similar to “nothing is happening”

  • What matters: null hypothesis must be falsifiable.
  • We reject the null hypothesis when the data shows it is sufficiently unlikely.

examples

  • comparison of two means
  • correlation between x and y
  • comparison of three means
  • comparison of two proportions
  • slopes are zero

Significance

characteristics

  • sufficient unlikeliness is determined by the significance level of the statistical test
  • if the observed result would occur less than 5% of the time under \(H_0\), we reject \(H_0\)
  • the significance level is a threshold chosen by the user
  • it has to be compared to the p value obtained

p values are probabilities (\(0 \leq p \leq 1\)) that a particular result occurred by chance if \(H_0\) is true.

In practice, \(H_0\) being exactly true is rare; the data usually reflect a mixture of the null and alternative hypotheses

Type of errors and interpretation

Two types of errors can be made

and are made at a certain threshold

  • reject \(H_0\) while it was true, type I
  • accept \(H_0\) while it was false, type II

Errors types

                      Null hypothesis is
Decision on H0        True             False
Reject                Type I error     Correct
Accept                Correct          Type II error

Which one is

  • the false positive?
  • the false negative?

the power of a test is \(1 - \beta\)

Power of t-tests

a difference 10 times smaller needs roughly 100 times more observations

Initial tweet by John Myles White; R code modified from the gist by David Robinson.
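
This claim can be checked with R's built-in power.t.test() (a minimal sketch; the exact numbers depend on the chosen power and significance level):

# n per group to detect a difference of 1 sd, at 80% power
power.t.test(delta = 1, sd = 1, power = 0.8)$n
# about 17; for a difference 10 times smaller:
power.t.test(delta = 0.1, sd = 1, power = 0.8)$n
# about 1571, roughly 100 times more observations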

Assessing the true difference of means

fixed standard deviation = 1

# requires the tidyverse, broom and scales packages
crossing(n = 3:30,
         delta = NULL, # left empty: power.t.test() solves for the difference
         sig.level = .05,
         power = seq(.3, .9, .2),
         sd = 1,
         type = c("two.sample")) %>%
  pmap(.l = ., .f = power.t.test) %>%
  map_df(tidy) %>%
  ggplot(aes(x = n, y = delta)) +
  geom_line(aes(colour = percent(power)),
            size = 1.1) +
  geom_hline(yintercept = 2^0.5, 
             linetype = "dashed") +
  annotate("text", x = 25, y = 1.6, 
   label = "log[2]*\" = 0.5\"",
   parse = TRUE, size = 6) +
  scale_y_continuous(labels = percent) +
  scale_x_continuous(breaks = c(3,5,10,20,30)) +
  theme_minimal(20) +
  labs(x = "# observations (both samples)",
       y = "true difference of means",
       color = "power",
       title = "Power of 2-samples t-test ",
       subtitle = "for sd = 1")

Generate data

sample of population

Your experimental design should take care of two key concepts

replication

  • to increase reliability
  • capture as many events as you can afford
  • must be independent

randomization

  • to reduce bias, since we cannot test everything

others

  • appropriate controls
  • modelling:
    • parsimony
    • as few parameters as possible
    • linear over non-linear

Mind pseudo-replication

Population versus samples

terms

parameter

  • characteristic of a population
  • Greek letters
  • mean \(\mu\), sd \(\sigma\), var \(\sigma^2\)

statistic

  • characteristic of a sample
  • Latin letters
  • mean \(\bar{x}\), sd \(s\), var \(s^2\)

models

simulate population

This implies that we have models: known distributions that can be used to assess our data.
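
As a minimal sketch (illustrative parameter values), rnorm() can simulate such a population and a sample drawn from it:

set.seed(123)
# simulate a population with mu = 10 and sigma = 2
population <- rnorm(10000, mean = 10, sd = 2)
# draw a sample of 30 observations
x <- sample(population, 30)
mean(x) # estimates mu
sd(x)   # estimates sigma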

normal distribution

Two parameters: the mean \(\mu\) and the standard deviation \(\sigma\)

Student t distribution

For smaller sample sizes, one more parameter: degrees of freedom. \[df = n - p\]

  • \(n\) is the sample size
  • \(p\) nb parameters estimated from data

Thus, 3 parameters:

  • df
  • mean
  • standard deviation

Normal versus t distribution

Larger uncertainty leads to heavier tails

comparison of the normal and Student's t distributions
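
A minimal sketch of how such a comparison could be drawn (df = 3 chosen only for illustration; assumes the tidyverse is loaded):

tibble(x = seq(-4, 4, length.out = 200),
       normal  = dnorm(x),
       student = dt(x, df = 3)) %>%
  gather(distribution, density, -x) %>%
  ggplot(aes(x, density, colour = distribution)) +
  geom_line() +
  theme_bw(18)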

Degrees of freedom

Let’s say we have 5 observations with a known average of 4.

first, fetch a 2

1 2 3 4 5
2

average = 2

second, fetch a 7

1 2 3 4 5
2 7

average = 4.5

third, fetch a 4

1 2 3 4 5
2 7 4

average = 4.3333333

fourth, fetch a 0

1 2 3 4 5
2 7 4 0

average = 3.25

Degrees of freedom

But now, only one spot remains available

and the average must be 4

for 5 elements

we must have a sum of 20.

\[20 - (2 + 7 + 4 + 0) = 7\]

final

1 2 3 4 5
2 7 4 0 7

average = 4

formula

\[df = n - 1 = 5 - 1 = 4\]

Mean / Median / Mode

summary statistics

mean

sensitive to extreme values

\[\mu = \frac{\sum x}{n}\]

median

middle value; sort the values (see the example after this list):

  • even length: average of the two centered values
  • odd length: middle value
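
For example, with base R:

median(c(1, 3, 5, 7))    # even length: average of 3 and 5
[1] 4
median(c(1, 3, 5, 7, 9)) # odd length: the middle value
[1] 5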

mode

value that occurs the most frequently

Mean / Median

plotted

set.seed(123)
dens <- tibble(x = c(rnorm(500), rnorm(200, 3, 3)))
ggplot(dens) +
  geom_line(aes(x), stat = "density") +
  geom_vline(aes(xintercept = mean(x), colour = "mean"), size = 1.1) +
  geom_vline(aes(xintercept = median(x), colour = "median"), size = 1.1) -> p
p

with mode

# mode: the x value at the maximum of the estimated density
dens_mode <- density(dens$x)$x[which.max(density(dens$x)$y)]
p + geom_vline(aes(xintercept = dens_mode, colour = "mode"), size = 1.1)

Variance

measures variability

  • dispersion around a central tendency
  • measured as the sum of squared differences to the mean; squaring removes the sign issue

formula

\[variance = \frac{sum\,of\,squares}{degrees\,of\,freedom}\]

the individual deviations are also called residuals

\[residual = observed\,value - predicted\,value\]

Sum of squares
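
A minimal sketch of the formula in R (mtcars$mpg is just an illustrative vector):

x <- mtcars$mpg
# sum of squared deviations to the mean (sum of squares)
ss <- sum((x - mean(x))^2)
# variance = sum of squares / degrees of freedom
ss / (length(x) - 1)
# identical to the built-in function
var(x)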

Standard error of the mean

the unreliability of a measure is called its standard error

  • For the mean, standard errors decrease when the sample size increases
  • could be written as:

x has a mean of \(3.0 \pm 0.365\) (1 s.e., n = 10)

formula

\[SE_\bar{y} = \sqrt{\frac{s^2}{n}}\]
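
A minimal sketch in R, using the gardenA sample from the garden example shown below, which reproduces the 0.365 quoted above:

a <- garden$gardenA      # n = 10, mean = 3
sqrt(var(a) / length(a)) # standard error of the mean
[1] 0.3651484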

Confidence intervals

unreliability can also be represented as confidence intervals

  • intervals get wider as the unreliability goes up
  • confidence intervals are always two-tailed
  • usually associated with t values from Student’s distribution

A CI corresponds to t standard errors, for the chosen confidence level

formula

\[confidence\,interval = t \times standard\,error\]

Confidence interval

garden example from J. Crawley

garden
# A tibble: 10 x 3
   gardenA gardenB gardenC
     <dbl>   <dbl>   <dbl>
 1       3       5       3
 2       4       5       3
 3       4       6       2
 4       3       7       1
 5       2       4      10
 6       3       4       4
 7       1       3       3
 8       3       5      11
 9       5       6       3
10       2       5      10

mean for garden A

mean(garden$gardenA)
[1] 3

t 2.5% for \(\;\alpha = 0.05\)

qt(0.025, df = 9)
[1] -2.262157
qt(0.975, df = 9)
[1] 2.262157
qt(0.975, df = 9) * sqrt(var(garden$gardenA) / length(garden$gardenA))
[1] 0.826023

confidence interval, mean gardenA

\(\;\bar{x} = 3.0 \pm 0.83 (CI_{95\%})\)

Parameters / Statistics

population

  • mean, \(\mu = \frac{\sum(x)}{N}\)
  • variance, \(\sigma^2 = \frac{\sum(x - \mu)^2}{N}\)
  • standard deviation, \(\sigma = \sqrt{\frac{\sum(x - \mu)^2}{N}}\)

sample (p = 1, mean estimated)

  • mean, \(\bar{x} = \frac{\sum(x)}{n}\)
  • variance, \(s^2 = \frac{\sum(x - \bar{x})^2}{n - 1}\)
  • standard deviation, \(s = \sqrt{\frac{\sum(x - \bar{x})^2}{n - 1}}\)

Functions in R

base

mean()
var()
sd()
# median absolute deviation
mad()

example

a <- garden$gardenA
mean(a)
[1] 3
var(a)
[1] 1.333333
sd(a)
[1] 1.154701
mad(a) # median absolute deviation
[1] 1.4826

Central Limit Theorem
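
A minimal simulation sketch (a uniform population is chosen here only for illustration): the means of repeated samples tend towards a normal distribution even though the population itself is not normal.

set.seed(123)
# means of 1000 samples of n = 30 drawn from a uniform distribution
sample_means <- replicate(1000, mean(runif(30)))
ggplot(tibble(x = sample_means), aes(x)) +
  geom_histogram(bins = 30) +
  theme_bw(18)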

Normal distribution

properties

  • median = mean = mode
  • \(\pm 1 \sigma \approx 68\%\)
  • \(\pm 2 \sigma \approx 95\%\)
  • \(\pm 3 \sigma \approx 99.7\%\)

z-scores

also called standardized scores, they express how many standard deviations a value lies from the mean.

Formula to obtain a z-score from a normal distribution

\[z = \frac{x - \mu}{\sigma}\]

z-scores

using R

with 2 \(\sigma\)

1 - 2 * pnorm(-2)
[1] 0.9544997

more precisely

z <- qnorm(0.025)
z
[1] -1.959964
1 - 2 * pnorm(z)
[1] 0.95

relationship between p, q, r for the normal distribution

by Eric Koncina
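
In short (a minimal reminder of the convention): pnorm() goes from a quantile to a probability, qnorm() is its inverse, and rnorm() draws random values.

pnorm(1.96)  # probability below the quantile 1.96
[1] 0.9750021
qnorm(0.975) # quantile below which 97.5% of the values lie
[1] 1.959964
rnorm(3)     # three random draws from the standard normal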

Working example

Suppose we have measured the height of 100 people: The mean height is 170 cm and the standard deviation is 8 cm.

what proportion of the population is shorter than 160 cm?

Solution

Calculate the z-score from the distribution

z <- (160 - 170) / 8
z
[1] -1.25

z is a negative value

  • what can we predict about the proportion of people?
  • More or less than 50%?

from z to p, associated probability

i.e. the area under the bell curve.

pnorm(z)
[1] 0.1056498
paste(round(pnorm(z) * 100, 2), "%") 
[1] "10.56 %"

One / Two side tests

One sample

assumption: normality of the residuals

set.seed(123)
rn <- rnorm(100)
shapiro.test(rn)
    Shapiro-Wilk normality test

data:  rn
W = 0.99388, p-value = 0.9349

one-sample t-test: comparison to mean = 0

t.test(garden$gardenA)
    One Sample t-test

data:  garden$gardenA
t = 8.2158, df = 9, p-value = 1.788e-05
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 2.173977 3.826023
sample estimates:
mean of x 
        3 

One-sample t-test

Test if the mean of a sample is different from a population parameter

\[\mu \neq \bar{x}\]

using R

mean(mtcars$mpg)
[1] 20.09062
t.test(mtcars$mpg, mu = 18, alternative = "two.sided")
    One Sample t-test

data:  mtcars$mpg
t = 1.9622, df = 31, p-value = 0.05876
alternative hypothesis: true mean is not equal to 18
95 percent confidence interval:
 17.91768 22.26357
sample estimates:
mean of x 
 20.09062 

compare 2 means

Two samples

comparing variances

assumptions

  • normality
  • independence
  • similar variances

Fisher’s F test

var.test(sample1, sample2)

Divide the larger variance by the smaller one.

compare the variances of garden A & C

using the garden dataset.

Variance test

doing it by hand

var(garden$gardenA)
[1] 1.333333
var(garden$gardenC)
[1] 14.22222
F.ratio <- var(garden$gardenC) / var(garden$gardenA)
F.ratio
[1] 10.66667

F distribution

tibble(x = seq(-2, 10, length.out = 500),
       y = df(x, 9, 9)) %>%
  ggplot() +
  geom_line(aes(x = x, y = y)) +
  theme_bw(18)

why no values < 0?

no values < 0: an F ratio is a ratio of variances, which cannot be negative. Mind the degrees of freedom!

Variance test

doing it by hand

Hypothesis

  • \(H_0\): the two variances are not statistically different
  • \(H_1\): the two variances are statistically different

define the critical value of the F distribution for a risk of \(\alpha = 0.05\)

qf(0.975, 9, 9)
[1] 4.025994

Could you already accept or reject \(H_0\)?

Variance test

doing it by hand

compute the exact p-value

2 * (1 - pf(F.ratio, 9, 9))
[1] 0.001624199

now using the built-in R function

var.test(garden$gardenA, garden$gardenC)
    F test to compare two variances

data:  garden$gardenA and garden$gardenC
F = 0.09375, num df = 9, denom df = 9, p-value = 0.001624
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.02328617 0.37743695
sample estimates:
ratio of variances 
           0.09375 

Student’s t test

comparing two means

assumptions

  • the two samples must be independent
  • the two variances have to be similar
  • error must be normally distributed

methods

  • study design
  • F test
  • normality test

if not normal?

non parametric tests

  • Wilcoxon rank sum test
    • samples must be independent
    • errors are not normally distributed

Student’s t test

rationale

  • for small sample sizes (n < 30)
  • concerns most biological studies
  • inferences about the population variance \(\,\sigma^2\) are needed, using the sample variance \(\,s^2\)
  • more uncertainty, since population parameters cannot be known

The t test statistic

  • provides the number of standard errors by which two sample means are separated.

\[ t = \frac{difference\,between\,two\,means}{standard\,error\,of\,the\,difference} = \frac{ \overline{y_A} - \overline{y_B}}{s.e_{diff}}\]

Standard error of a difference

For two independent samples

the variance of their difference is the sum of their variances.

\[s.e_{diff} = \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}\]

Example

garden

Gardens A and C have significantly different variances

but let’s check gardens A and B

A vs B

var.test(garden$gardenA, garden$gardenB)
    F test to compare two variances

data:  garden$gardenA and garden$gardenB
F = 1, num df = 9, denom df = 9, p-value = 1
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.2483859 4.0259942
sample estimates:
ratio of variances 
                 1 

testing normality

library(broom)
tidy(shapiro.test(garden$gardenA))
# A tibble: 1 x 3
  statistic p.value method                     
      <dbl>   <dbl> <chr>                      
1     0.953   0.703 Shapiro-Wilk normality test
tidy(shapiro.test(garden$gardenB))
# A tibble: 1 x 3
  statistic p.value method                     
      <dbl>   <dbl> <chr>                      
1     0.953   0.703 Shapiro-Wilk normality test

Example

garden

independence

since the samples come from two different gardens

compute it!

first, the difference of means

diff <- mean(garden$gardenA) - mean(garden$gardenB)
diff
[1] -2

compute t statistic

t <- diff / (sqrt((var(garden$gardenA) / 10) + (var(garden$gardenB) / 10)))
t # sign does not matter, how many df?
[1] -3.872983

deduced p-value

2 * pt(t, df = 18)
[1] 0.001114539

Compare 2 means, using R

garden example

Welch correction

  • unequal variances are compensated by lowering the degrees of freedom
  • it is ON by default (var.equal = FALSE)
t.test(garden$gardenA, garden$gardenB,
       var.equal = FALSE)
    Welch Two Sample t-test

data:  garden$gardenA and garden$gardenB
t = -3.873, df = 18, p-value = 0.001115
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.0849115 -0.9150885
sample estimates:
mean of x mean of y 
        3         5 
t.test(garden$gardenA, garden$gardenB,
       var.equal = TRUE)
    Two Sample t-test

data:  garden$gardenA and garden$gardenB
t = -3.873, df = 18, p-value = 0.001115
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.0849115 -0.9150885
sample estimates:
mean of x mean of y 
        3         5 

Two dependent samples

paired t test

  • the independence assumption is violated when the samples share more than they should.
  • Such as:
    • two samples from the same individual
    • two samples from the same biotope …

parametric test

t.test(x, y, paired = TRUE)

equivalent to a one-sample t test on the differences between samples (see the sketch below)
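
A minimal sketch of this equivalence with simulated values (illustrative data, not from the course):

set.seed(123)
before <- rnorm(10, mean = 5)
after  <- before + rnorm(10, mean = 1)
t.test(before, after, paired = TRUE)$p.value
# identical to a one-sample t test on the differences
t.test(before - after)$p.value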

Mann and Whitney rank sum

non-parametric

When?

  • when the errors are not normally distributed

Non-parametric alternative to the Student’s t test

wilcox.test()
  • default paired = FALSE, for independent samples, aka Mann-Whitney’s test.
  • with paired = TRUE, for dependent samples, aka Wilcoxon.
  • process:
    • label the values
    • mix
    • rank them. The main issue is ties.

Wilcoxon

example

garden %>%
  # go for long format
  gather(garden, ozone) %>% 
  # focus on A and B
  filter(garden != "gardenC") %>% 
  mutate(rank = rank(ozone, ties.method = "average")) %>% 
  # do the grouping AFTER ranking
  group_by(garden) %>%
  # get the sum!
  summarise(sum(rank)) 
# A tibble: 2 x 2
  garden  `sum(rank)`
  <chr>         <dbl>
1 gardenA          66
2 gardenB         144

compared to the critical value

to keep \(H_0\), both rank sums have to be below this critical value (two samples of \(n = 10\))

qwilcox(0.975, 10, 10)
[1] 76

conclusions

  • compare this critical value to the two rank sums
  • if one of the sums is higher than 76, we reject \(H_0\)
  • Gardens A and B have different means

Non-parametric versus parametric

if we compare the two p-values we obtain

bind_rows(
  tidy(wilcox.test(
    garden$gardenA,
    garden$gardenB)),
  tidy(t.test(
    garden$gardenA,
    garden$gardenB)))
# A tibble: 2 x 10
  statistic p.value method alternative estimate estimate1 estimate2
      <dbl>   <dbl> <chr>  <chr>          <dbl>     <dbl>     <dbl>
1     11    0.00299 Wilco… two.sided         NA        NA        NA
2     -3.87 0.00111 Welch… two.sided         -2         3         5
# … with 3 more variables: parameter <dbl>, conf.low <dbl>,
#   conf.high <dbl>

non-parametric tests are more conservative and have less power

for \(n = 3\) the power is zero; the minimum sample size is \(n = 4\)

Compare two distributions

makes no assumption about the distribution shapes, which can remain unknown

  • non parametric test
  • Kolmogorov-Smirnov test
  • compares two distributions by working on their cumulative distribution functions
ks.test()
tibble(x = seq(-5, 5, length.out = 100),
       `1` = pnorm(x),
       `2` = pnorm(x, 0, 2),
       `3` = pnorm(x,  0, 3)) %>%
  gather(sd, y, -x) %>%
  ggplot(aes(x, y, colour = sd)) +
  geom_line() +
  theme_bw(18)
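
A minimal usage sketch on the garden data (ks.test() will warn about ties with these repeated integer values):

ks.test(garden$gardenA, garden$gardenB)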