November 3019

Rationale

Until now

We expected the response to be continuous & normally distributed.

What if the response is

  • bounded at 0?
  • only takes integer values?

Contingency tables

testing association between categorical variables

count data

numbers are only positive integers.

contingency tables

in statistics, contingencies are all the events that could happen.

is there a relationship between gender and school’s type?

d <- tibble(gender = c("Male", "Female"), 
            private = c(77, 54),
            public  = c(52, 65))
d
# A tibble: 2 x 3
  gender private public
  <chr>    <dbl>  <dbl>
1 Male        77     52
2 Female      54     65

Contingency tables

1.

Compute margins

d %>%
  mutate(total = private + public) %>%
  add_row(gender = "Total", 
          private = sum(.$private), 
          public = sum(.$public), 
          total = sum(.$total)) -> d
formattable::formattable(d)
gender private public total
Male 77 52 129
Female 54 65 119
Total 131 117 248

Contingency tables

for Male in private school

if the two variables were independent then the probability is the product of the probability of being a male and being in a private school.

expected values

(129 / 248) * (131 / 248) * 248
[1] 68.14113

Simplification

previous equation could be simplified by \[E = \frac{rowTotal \times colTotal}{grand\; total}\]

(129 * 131) / 248
[1] 68.14113

Contingency tables

2.

expected values for all cells

(d_expect <- d %>%
   mutate_if(is.numeric, 
             funs(total * sum(.) / sum(total))))
Warning: funs() is soft deprecated as of dplyr 0.8.0
Please use a list of either functions or lambdas: 

  # Simple named list: 
  list(mean = mean, median = median)

  # Auto named with `tibble::lst()`: 
  tibble::lst(mean, median)

  # Using lambdas
  list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
This warning is displayed once per session.
# A tibble: 3 x 4
  gender private public total
  <chr>    <dbl>  <dbl> <dbl>
1 Male      68.1   60.9   129
2 Female    62.9   56.1   119
3 Total    131    117     248

as always, sum the differences between observed and expected.

For count data, it follows \(\chi^2\) distribution

\[\chi^2 = \sum{\frac{(O-E)^2}{E}}\]

\(\chi^2\) distribution

crossing(tibble(x  = 1:25),
         tibble(df = c(2, 5, 10, 15))) %>%
  mutate(dchsq = dchisq(x, df = df)) %>%
  ggplot(aes(x, dchsq, fill = factor(df))) + geom_col(position = "dodge", width = 2.2, alpha = 0.8) +
  labs(x = NULL, fill = "df") + theme_classic(16)

Contingency tables

3.

Compute Squares of differences

pivot_longer(d, starts_with("p"), "school", values_to = "observed") %>%
  inner_join(pivot_longer(d_expect, starts_with("p"), "school", values_to = "expected")) %>%
  filter(gender != "Total") %>%
  mutate(`o-e`    = (observed - expected),
        `o-e^2`   = (`o-e`)^2,
        `o-e^2/e` =  `o-e^2` / expected) -> dchsq
pander::pander(dchsq)
gender total school observed expected o-e o-e^2 o-e^2/e
Male 129 private 77 68.14 8.859 78.48 1.152
Male 129 public 52 60.86 -8.859 78.48 1.29
Female 119 private 54 62.86 -8.859 78.48 1.249
Female 119 public 65 56.14 8.859 78.48 1.398

Contingency tables

4. compute Sum of Squares

\(\chi^2\)

is the sum of last column

degrees of freedom

\(df = (ncol - 1) \times (nrow - 1)\)

\(\chi^2\)

5.0876587

degrees of freedom

\(df = (2 - 1) \times (2 - 1) = 1\)

To be compared to the critical value of the \(\chi^2\) distribution

qchisq(0.95, 1)
[1] 3.841459

we reject \(H_0\)

gender and school type are not independent

Exact p-value?

\(\chi^2\) test in R

uses matrix

m <- matrix(c(77, 52, 54, 65), nrow = 2, byrow = TRUE)
m
     [,1] [,2]
[1,]   77   52
[2,]   54   65
chisq.test(m, correct = FALSE)
    Pearson's Chi-squared test

data:  m
X-squared = 5.0877, df = 1, p-value = 0.0241

Warning

if at least one expected value is < 5

  • use Fisher’s exact test fisher.test()

Generalized Linear Models

GLMs (glims)

Are an extension of the linear model where the modelling of error is not Gaussian.

Actually, lm is equivalent to glm(family = "gaussian")

lm vs glm-gaussian

lm(mpg ~ poly(wt, 2), data = mtcars)
Call:
lm(formula = mpg ~ poly(wt, 2), data = mtcars)

Coefficients:
 (Intercept)  poly(wt, 2)1  poly(wt, 2)2  
      20.091       -29.116         8.636  
glm(mpg ~ poly(wt, 2), data = mtcars, family = "gaussian")
Call:  glm(formula = mpg ~ poly(wt, 2), family = "gaussian", data = mtcars)

Coefficients:
 (Intercept)  poly(wt, 2)1  poly(wt, 2)2  
      20.091       -29.116         8.636  

Degrees of Freedom: 31 Total (i.e. Null);  29 Residual
Null Deviance:      1126 
Residual Deviance: 203.7    AIC: 158

GLM, fitting and link

Species count

ANCOVA

response is a count of the number of plant species that have different biomass (a continuous covariate) and different soil pH (categorical variable with 3 levels) Michael Crawley

Species count

ANCOVA

species <- species %>%
  # high as 1st factor
  mutate(PH = fct_inorder(PH))
ggplot(species, aes(x = BIOMASS, 
                    y = RICHNESS, 
                    colour = PH)) + 
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE)

warning

  • SPECIES is bounded at 0
  • \(log(0)\) is a no go
  • lm will predict negative richness for high biomass

interaction

  • does BIOMASS and pH interact?
  • could be fooled by linear-space
  • perform both models

Poisson distribution

variance = mean

crossing(tibble(x = 1:25),
               tibble(lambda = c(2, 5, 10, 15))) %>%
  mutate(dpoi = dpois(x, lambda = lambda)) %>%
  ggplot(aes(x, dpoi, fill = factor(lambda))) + geom_col(position = "dodge", width = 2.2, alpha = 0.8) +
  labs(x = NULL, fill = "lambda") + theme_classic(16)

below 5, the distribution is asymmetrical, since bounded at 0

Poisson distribution, increasing variance

animation

GLM with Poisson errors

Species example

the full model, with interaction. Diagnostic plots with the boot library.

glm1 <- glm(RICHNESS ~ BIOMASS * PH, data = species, family = "poisson")
boot::glm.diag.plots(glm1)

GLM with Poisson errors, model simplification

Species example

Without interaction

glm2 <- glm(RICHNESS ~ BIOMASS + PH, data = species, family = "poisson")
boot::glm.diag.plots(glm2)

GLM with Poisson errors, model simplification

Species example

Check if interaction is necessary

anova(glm1, glm2, test = "Chisq")
Analysis of Deviance Table

Model 1: RICHNESS ~ BIOMASS * PH
Model 2: RICHNESS ~ BIOMASS + PH
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1        84     83.201                          
2        86     99.242 -2   -16.04 0.0003288 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

F test is not appropriate

use the \(\chi^2\) test instead

GLM with Poisson errors

Species example

With interaction eventually. Mind the link to be in the response scale

glm1 %>% augment(type.predict = "response") %>%
  inner_join(glm2 %>% broom::augment(type.predict = "response"), 
             by = c("RICHNESS", "BIOMASS", "PH"), suffix = c("_glm1", "_glm2")) %>%
  gather(interaction, fitted, starts_with(".fitted")) %>% 
  mutate(interaction = recode(interaction, .fitted_glm1 = "with", .fitted_glm2 = "without")) %>%
  ggplot(aes(BIOMASS, RICHNESS, colour = PH)) + geom_point() +
  geom_line(aes(y = fitted, linetype = interaction)) 

Look at interaction in log space

Species example

linear space was misleading, in log space with the fullrange option, we see that pH:low is off

ggplot(species, aes(x = BIOMASS, y = log(RICHNESS), colour = PH)) +
  geom_point() + geom_smooth(method = "lm", fullrange = TRUE) +
  scale_x_continuous(expand = c(0, 0))

GLM’s regression happened in log space

If you don’t change the link, we need to exponentiate values to return to the response scale

glm1 %>% augment() %>%
  ggplot(aes(BIOMASS, RICHNESS)) + geom_point(aes(colour = PH)) +
  geom_line(aes(y = exp(.fitted), colour = PH)) 

For model with interaction, add the CI95%

(qt_pH <- qt(0.975, glance(glm1)$df.residual))
[1] 1.98861
glm1 %>%
  augment(type.predict = "response", se_fit = TRUE) %>%
  ggplot(aes(BIOMASS, RICHNESS)) +
  geom_point(aes(colour = PH)) +
  geom_ribbon(aes(
    ymin = .fitted - qt_pH * .se.fit,
    ymax = .fitted + qt_pH * .se.fit,
    fill = PH), 
    alpha = 0.3) +
  geom_line(aes(y = .fitted, colour = PH)) +
  scale_x_continuous(expand = c(0, 0))

GLM with Poisson errors, coefficients

intercepts

tidy(glm1)
# A tibble: 6 x 5
  term          estimate std.error statistic  p.value
  <chr>            <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     3.77      0.0615     61.2  0.      
2 BIOMASS        -0.107     0.0125     -8.58 9.72e-18
3 PHmid          -0.331     0.0922     -3.60 3.23e- 4
4 PHlow          -0.816     0.103      -7.93 2.18e-15
5 BIOMASS:PHmid  -0.0319    0.0231     -1.38 1.67e- 1
6 BIOMASS:PHlow  -0.155     0.0400     -3.87 1.08e- 4
# intercept for pH high
exp(3.76812359)
[1] 43.29874
# intercept for pH mid
exp(3.76812359 - 0.33146257)
[1] 31.083
# intercept for pH low
exp(3.76812359 - 0.81557124)
[1] 19.15478

GLM with Poisson errors, coefficients

slopes pH = high

\(log(RICHNESS) = \beta_0 + (\beta_1 + \beta_{interaction}) \times BIOMASS\)

  • intercept pH high = 3.76812359
  • slope BIOMASS = - 0.10712979
glm1 %>% augment(type.predict = "response") %>%
  ggplot(aes(BIOMASS, RICHNESS)) + geom_point(aes(colour = PH)) + 
  geom_line(aes(y = .fitted, colour = PH)) +
  geom_line(aes(x = BIOMASS, y = exp(3.76812359 - 0.10712979 * BIOMASS)))

GLM with Poisson errors, coefficients

slopes pH = mid

glm1 %>% augment(type.predict = "response") %>%
  ggplot(aes(BIOMASS, RICHNESS)) + geom_point(aes(colour = PH)) + 
  geom_line(aes(y = .fitted, colour = PH)) +
  geom_line(aes(x = BIOMASS, y = exp((3.76812359 - 0.33146257) + ((-0.10712979 - 0.03189202) * BIOMASS))))

GLM with Poisson errors, coefficients

slopes pH = low

glm1 %>% augment(type.predict = "response") %>%
  ggplot(aes(BIOMASS, RICHNESS)) + geom_point(aes(colour = PH)) + 
  geom_line(aes(y = .fitted, colour = PH)) +
  geom_line(aes(x = BIOMASS, y = exp((3.76812359 - 0.81557124) + ((-0.10712979 - 0.15502818) * BIOMASS))))

Logistic regression

Binary response

for such responses, we have typically 0 / 1, Presence / Absence, True / False

M. Logan, page 486

Warning

  • lot of 0!
  • no normality
  • predictions outside range

Binary response

isolation example

response is a incidence, 1 island is occupied, 0 means bird did not breed there. One explanatory: isolation distance from mainland (km) Michael Crawley

Warning

  • INCIDENCE is binary
  • \(log(0)\) is a no go
  • lm will predict just bad

GLMs with binomial errors

no interaction

mbino <- glm(INCIDENCE ~ ISOLATION, family = "binomial", data = isolation)
boot::glm.diag.plots(mbino)

GLMs with binomial errors

plotting

mbino %>% augment(type.predict = "response", se_fit = TRUE) %>%
  ggplot(aes(x = ISOLATION, y = INCIDENCE), alpha = 0.6) +
  geom_point() +
  geom_line(aes(y = .fitted), colour = "blue") +
  geom_ribbon(aes(ymin = .fitted - 1.96 * .se.fit, ymax = .fitted + 1.96 * .se.fit), 
    alpha = 0.3)

PCA

unsupervised learning

Unsupervised learning

We have nothing to predict, no response variables. Only a bunch of predictors.

Goal

  • find interesting predictors
  • way to visualize the data
  • reducing dimensions

Challenges

  • no guidance
  • so the unsupervised problem
  • so no validation!

intuition behind PCA

François Husson

FactoMiner R package developed by F. Husson to perform PCA

Principal Components

Figure out the dimensions in which the data are highly variables.

Components

  • green is the first one
  • blue is the second one
  • must be orthogonal
  • but actually n components

Pincipal Components Analysis

rotation

data are rotated, so the first component becomes parallel to the x-axis

Plane rotation

James, D. Witten, T. Hastie and R. Tibshirani (2013)

Finding largest variance

Warning

  • predictors must be scaled

Once all predictors are zero-centered, find largest variance, find the orthogonal one and so on.

Scaling

  • scaling predictors so standard deviation is 1 is optional but recommended

Scaling, example

James, D. Witten, T. Hastie and R. Tibshirani (2013)

Proportion of variance explained

screeplot

James, D. Witten, T. Hastie and R. Tibshirani (2013)

PCA, temperature example

François Husson

instructions

  • load the temperature dataset
    • mind the separator
  • for FactoMineR::PCA(), the cities must be the rownames
    • convert the tibble to a data.frame to avoid the warning
    • move the column city as rownames
  • display the dataset, see which column are to be included in the PCA

example and dataset from François Husson

Temperature dataset, first 6 cities

Attaching package: 'kableExtra'
The following object is masked from 'package:dplyr':

    group_rows
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Ann Amp Lat Lon Are
Amsterdam 2.9 2.5 5.7 8.2 12.5 14.8 17.1 17.1 14.5 11.4 7.0 4.4 9.9 14.60 (#05) 52.2 4.5 West
Athens 9.1 9.7 11.7 15.4 20.1 24.5 27.4 27.2 23.8 19.2 14.6 11.0 17.8 18.30 (#03) 37.6 23.5 South
Berlin -0.2 0.1 4.4 8.2 13.8 16.0 18.3 18.0 14.4 10.0 4.2 1.2 9.1 18.50 (#02) 52.3 13.2 West
Brussels 3.3 3.3 6.7 8.9 12.8 15.6 17.8 17.8 15.0 11.1 6.7 4.4 10.3 14.40 (#06) 50.5 4.2 West
Budapest -1.1 0.8 5.5 11.6 17.0 20.2 22.0 21.3 16.9 11.3 5.1 0.7 10.9 23.10 (#01) 47.3 19.0 East
Copenhagen -0.4 -0.4 1.3 5.8 11.1 15.4 17.1 16.6 13.3 8.8 4.1 1.3 7.8 17.50 (#04) 55.4 12.3 North

PCA, temperature example

François Husson

instructions

  • install the FactoMineR package
  • perform the PCA, save as res
    • be sure to scale the variances
    • quali.sup is column 17
    • quanti.sup are col 13 to 16
    • ind.sup are 24 to 35
  • install the package factoextra for ggplot2 outputs
  • look at the screeplot, variance explained per axes
  • use function fviz_screeplot()

PCA, temperature example

library(FactoMineR)
library(factoextra)
res <- PCA(temp, ind.sup = 24:35, 
           quanti.sup = 13:16,
           quali.sup = 17,
           scale.unit = TRUE,
           graph = FALSE)
fviz_screeplot(res)

PCA, temperature example

François Husson

instructions

  • display the variables
  • use function fviz_pca_var()
    • color by \(cos^2\) (col.var = "cos2")

PCA, temperature example

fviz_pca_var(res, 
             repel = TRUE,
             gradient.cols = c("#00AFBB", 
                               "#E7B800", 
                               "#FC4E07"),
             col.var = "cos2")

PCA, temperature example

François Husson

instructions

  • display individuals (cities)
  • use function fviz_pca_ind
    • on the res object as 1st arg
    • and habillage = 17 as option for regions
    • set addEllipses to TRUE,
    • use ellipse.type = "confidence"
    • and ellipse.level = 0.95

PCA, temperature example

fviz_pca_ind(res,
             habillage = 17,
             ellipse.type = "confidence",
             ellipse.level = 0.95,
             addEllipses = TRUE,
             repel = TRUE)

PCA, temperature example

François Husson

instructions

  • display PCA biplot, var and ind together
  • use function fviz_pca
    • on the res object as 1st arg
    • display cities with a good projection (\(cos^2 >= 0.96\))
    • and habillage = 17 as option for regions
    • set addEllipses to FALSE

PCA, temperature example

fviz_pca(res,
         select.ind = list(cos2 = 0.96),
         habillage = 17,
         repel = TRUE)

Scatterplot, 2 main variables

instructions

  • find out, which are the 2 main variables
    • seek them in the quali.sup
    • along the dimensions 1 & 2
    • greatly projected
  • convert the data.frame back to a tidy tibble
    • i. e move back the rownames to its column
  • scatterplot them with points
    • with colored by Area
    • use the package ggrepel to label the dots with city names
  • conclude on how the PCA helped us to find the main variables

library(ggrepel)
temp %>% 
  rownames_to_column(var = "city") %>% 
  ggplot(aes(x = Annual, 
             y = Amplitude,
             colour = Area)) + 
  geom_point() +
  theme(legend.position = c(.99, .99),
        legend.justification = c(1, 1)) +
  geom_text_repel(aes(label = city),
                  show.legend = FALSE)

see why ggrepel is great try replacing geom_text_repel() by:

geom_text(aes(label = city), nudge_x = 0.5, nudge_y = 0.5)

Before we stop

Acknowledgments

Some of the figures in this presentation are taken from:

  • Biostatistic Design and Analysis Using R. (Wiley-Blackwell, 2011) Murray Logan
  • An Introduction to Statistical Learning, with applications in R (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani

People

  • Hákon Jónsson
  • François Husson