You will learn:
- what a model is
- the model categories, depending on the data
- how to fit linear models
- how to assess linear models
- when you should move beyond linearity
Reading
- An Introduction to Statistical Learning by James, Witten, Hastie & Tibshirani
November 2019
\[Y = f(X) + \epsilon\]
where \(Y\) is the response, \(X\) the predictors and \(\epsilon\) the error term. The estimate of the response is:
\[\hat{Y} = \hat{f}(X)\]
The goal is to minimize: \[E(Y - \hat{Y})^2 = [f(X) - \hat{f}(X)]^2 + Var(\epsilon)\]
where \(Var(\epsilon)\) is the irreducible error.
To estimate \(f\), one usually uses a limited number of observations.
This purple \(\hat{f}\) is different from the true blue \(f\).
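A minimal simulation (a sketch, not from the slides) makes the decomposition concrete: here we know the true \(f\), so the reducible and irreducible parts of the error can be separated.

set.seed(1)
x <- runif(200, 0, 10)
f <- function(x) 2 + 3 * sin(x)    # the "true" f, unknown in practice
y <- f(x) + rnorm(200, sd = 1)     # irreducible noise: Var(epsilon) = 1
f_hat <- loess(y ~ x)              # one possible estimate of f
mean((f(x) - predict(f_hat))^2)    # reducible error: small, but not zero
mean((y - predict(f_hat))^2)       # total error: roughly reducible part + Var(epsilon)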
The first step of the parametric approach is to assume a functional form for \(f\), here linear. The general formula with \(p\) predictors gives \(p\) coefficients (slopes) plus the intercept \(\beta_0\):
\[f(X) = \beta_0 + \beta_1 \times X_1 + \beta_2 \times X_2 + \ldots + \beta_p \times X_p\]
The model tries to minimize the reducible error, i.e. the residual sum of squares (RSS):
source: Logan M.
\[RSS = \sum_{i=1}^n{(y_i - \hat{y_i})^2}\]
We estimate the intercept and the slope; the slope is the increase in the response for a one-unit increase of the predictor.
read_csv("data/Advertising.csv", col_types = cols()) %>% lm(Sales ~ TV, data = .) %>% summary()
Call:
lm(formula = Sales ~ TV, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.3860 -1.9545 -0.1913  2.0671  7.2124 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 7.032594   0.457843   15.36   <2e-16 ***
TV          0.047537   0.002691   17.67   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.259 on 198 degrees of freedom
Multiple R-squared:  0.6119,  Adjusted R-squared:  0.6099 
F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16
\[RSE = \sqrt{\frac{RSS}{n - 2}}\] Why \(n - 2\)? Because two parameters, the intercept and the slope, were estimated.
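A quick numeric check of this formula on the Advertising fit (a sketch; `fit_tv` is a name introduced here, not used in the original code):

fit_tv <- read_csv("data/Advertising.csv", col_types = cols()) %>%
  lm(Sales ~ TV, data = .)
sqrt(sum(residuals(fit_tv)^2) / (nrow(model.frame(fit_tv)) - 2))  # RSE by hand
sigma(fit_tv)  # same value, the "Residual standard error" (3.259) shown by summary()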
\[SE(\hat{\beta_1})^2 = \frac{RSE^2}{\sum_{i=1}^n(x_i - \bar{x})^2}\]
With this standard error (one per coefficient), we can compute confidence intervals:
read_csv("data/Advertising.csv") %>% lm(Sales ~ TV, data = .) %>% # tidy linear model using broom::tidy() into tibbles tidy() %>% mutate(CI95 = std.error * qt(0.975, df = (200 - 2)), CI.L = estimate - CI95, CI.R = estimate + CI95)
Parsed with column specification:
cols(
  obs = col_double(),
  TV = col_double(),
  Radio = col_double(),
  Newspaper = col_double(),
  Sales = col_double()
)
# A tibble: 2 x 8
  term        estimate std.error statistic  p.value    CI95   CI.L   CI.R
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>   <dbl>  <dbl>  <dbl>
1 (Intercept)   7.03     0.458        15.4 1.41e-35 0.903   6.13   7.94  
2 TV            0.0475   0.00269      17.7 1.47e-42 0.00531 0.0422 0.0528
and the t-statistic for each coefficient \[t = \frac{\hat{\beta_1} - 0}{SE(\hat{\beta_1})}\]
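As a sketch, this ratio can be recomputed from the tidy() output (reusing the `fit_tv` object introduced above):

tidy(fit_tv) %>%
  mutate(t_manual = estimate / std.error)  # matches the `statistic` column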
RSE is a measure of the lack of fit. But it is expressed in the units of \(Y\), so what counts as a good value depends on the data at hand. The \(R^2\) provides a unit-free alternative:
\[R^2 = 1- \frac{RSS}{\sum(y_i - \bar{y})^2} = 1- \frac{RSS}{TSS}\]
For simple linear regression, the squared correlation coefficient \(r^2\) is identical to the \(R^2\), and so are the p-values.
Both regression and correlation examine the strength and significance of a relationship; regression analysis also explores the functional nature of the relationship. (Murray Logan)
cor(mtcars$mpg, mtcars$wt)
[1] -0.8676594
cor(mtcars$mpg, mtcars$wt, method = "spearman")
[1] -0.886422
cor(mtcars$mpg, mtcars$wt, method = "kendall")
[1] -0.7278321
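A small check of the link between correlation and regression (a sketch): for a simple linear regression, the squared Pearson correlation equals \(R^2\), and cor.test() returns the same p-value as the slope's t-test.

cor(mtcars$mpg, mtcars$wt)^2                    # ≈ 0.753, the R^2 of lm(mpg ~ wt)
summary(lm(mpg ~ wt, data = mtcars))$r.squared  # same value
cor.test(mtcars$mpg, mtcars$wt)$p.value         # same p-value as the slope's t-test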
source: Logan M.
par(mfrow = c(2, 2))
plot(lm(mpg ~ wt, data = mtcars))
library(ggfortify)
autoplot(lm(mpg ~ wt, data = mtcars))
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
What about a car that weighs 4,500 lb (~2 tons)?
fit <- lm(mpg ~ wt, data = mtcars) # create linear model
Use the function coef() to extract the parameters: the intercept and the slope.
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  # new unknown car
  geom_vline(xintercept = 4.5) +
  geom_hline(yintercept = coef(fit)[1] + coef(fit)[2] * 4.5, linetype = "dashed")
coef(fit)[1] + coef(fit)[2] * 4.5
(Intercept) 
     13.235 
predict(fit, data.frame(wt = c(3, 4.5)))
       1        2 
21.25171 13.23500 
The fitted line can also be drawn manually with geom_abline():

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_abline(slope = coef(fit)[2], intercept = coef(fit)[1]) +
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed")
The broom package provides three main functions:
- tidy(fit): coefficient estimates and hypothesis testing
- glance(fit): one-row summary of the model
- augment(fit): per-observation information such as fitted values and residuals
head(augment(fit), 20)
# A tibble: 20 x 10
   .rownames   mpg    wt .fitted .se.fit  .resid   .hat .sigma .cooksd
   <chr>     <dbl> <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>   <dbl>
 1 Mazda RX4  21    2.62   23.3    0.634 -2.28   0.0433   3.07 1.33e-2
 2 Mazda RX…  21    2.88   21.9    0.571 -0.920  0.0352   3.09 1.72e-3
 3 Datsun 7…  22.8  2.32   24.9    0.736 -2.09   0.0584   3.07 1.54e-2
 4 Hornet 4…  21.4  3.22   20.1    0.538  1.30   0.0313   3.09 3.02e-3
 5 Hornet S…  18.7  3.44   18.9    0.553 -0.200  0.0329   3.10 7.60e-5
 6 Valiant    18.1  3.46   18.8    0.555 -0.693  0.0332   3.10 9.21e-4
 7 Duster 3…  14.3  3.57   18.2    0.573 -3.91   0.0354   3.01 3.13e-2
 8 Merc 240D  24.4  3.19   20.2    0.539  4.16   0.0313   3.00 3.11e-2
 9 Merc 230   22.8  3.15   20.5    0.540  2.35   0.0314   3.07 9.96e-3
10 Merc 280   19.2  3.44   18.9    0.553  0.300  0.0329   3.10 1.71e-4
11 Merc 280C  17.8  3.44   18.9    0.553 -1.10   0.0329   3.09 2.30e-3
12 Merc 450…  16.4  4.07   15.5    0.719  0.867  0.0558   3.09 2.53e-3
13 Merc 450…  17.3  3.73   17.4    0.610 -0.0502 0.0401   3.10 5.92e-6
14 Merc 450…  15.2  3.78   17.1    0.624 -1.88   0.0419   3.08 8.73e-3
15 Cadillac…  10.4  5.25    9.23   1.26   1.17   0.170    3.09 1.84e-2
16 Lincoln …  10.4  5.42    8.30   1.35   2.10   0.195    3.07 7.19e-2
17 Chrysler…  14.7  5.34    8.72   1.31   5.98   0.184    2.84 5.32e-1
18 Fiat 128   32.4  2.2    25.5    0.783  6.87   0.0661   2.80 1.93e-1
19 Honda Ci…  30.4  1.62   28.7    1.05   1.75   0.118    3.08 2.49e-2
20 Toyota C…  33.9  1.84   27.5    0.942  6.42   0.0956   2.83 2.60e-1
# … with 1 more variable: .std.resid <dbl>
# create linear model
fit <- lm(mpg ~ wt, data = mtcars)
# try this command:
predict(fit)[1:10]
predict: calling predict() without arguments returns the predicted fuel consumption for the original car weights that were used to build the model.
predict(fit)
          Mazda RX4       Mazda RX4 Wag          Datsun 710 
          23.282611           21.919770           24.885952 
     Hornet 4 Drive   Hornet Sportabout             Valiant 
          20.102650           18.900144           18.793255 
         Duster 360           Merc 240D            Merc 230 
          18.205363           20.236262           20.450041 
           Merc 280           Merc 280C          Merc 450SE 
          18.900144           18.900144           15.533127 
         Merc 450SL         Merc 450SLC  Cadillac Fleetwood 
          17.350247           17.083024            9.226650 
Lincoln Continental   Chrysler Imperial            Fiat 128 
           8.296712            8.718926           25.527289 
        Honda Civic      Toyota Corolla       Toyota Corona 
          28.653805           27.478021           24.111004 
   Dodge Challenger         AMC Javelin          Camaro Z28 
          18.472586           18.926866           16.762355 
   Pontiac Firebird           Fiat X1-9       Porsche 914-2 
          16.735633           26.943574           25.847957 
       Lotus Europa      Ford Pantera L        Ferrari Dino 
          29.198941           20.343151           22.480940 
      Maserati Bora          Volvo 142E 
          18.205363           22.427495 
\[residuals = original - predicted\]
my_residuals <- mtcars$mpg - predict(fit)
head(my_residuals)
        Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive 
       -2.2826106        -0.9197704        -2.0859521         1.2973499 
Hornet Sportabout           Valiant 
       -0.2001440        -0.6932545 
R has a residuals() function that does the job:
head(residuals(fit))
        Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive 
       -2.2826106        -0.9197704        -2.0859521         1.2973499 
Hornet Sportabout           Valiant 
       -0.2001440        -0.6932545 
See also summary(residuals(fit)), which reproduces the Residuals section at the top of summary(fit).
\[RSS = \sum_{i=1}^n{(y_i - \hat{y_i})^2}\]
sum(residuals(fit)^2)
[1] 278.3219
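For a linear model, the same quantity is returned by deviance(), and dividing by the residual degrees of freedom recovers the RSE (the `sigma` reported by glance() just below):

deviance(fit)                           # identical to sum(residuals(fit)^2)
sqrt(deviance(fit) / df.residual(fit))  # the RSE, reported as `sigma` by glance()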
library(broom)
glance(fit)
# A tibble: 1 x 11
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>  <dbl> <dbl> <dbl>
1     0.753         0.745  3.05      91.4 1.29e-10     2  -80.0  166.  170.
# … with 2 more variables: deviance <dbl>, df.residual <int>
tibble(fitted = predict(fit), residuals = residuals(fit)) %>%
  ggplot() +
  geom_point(aes(fitted, residuals))
tibble(x = residuals(fit)) %>%
  ggplot() +
  stat_qq(aes(sample = x))
We want the predicted values in red, residuals in dashed black
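One possible solution (a sketch, not the only way to draw it): plot the fitted values in red and the residuals as dashed vertical segments.

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # residuals: dashed segments from observed to fitted values
  geom_segment(aes(xend = wt, yend = predict(fit)), linetype = "dashed") +
  # predicted values in red
  geom_point(aes(y = predict(fit)), colour = "red") +
  geom_smooth(method = "lm", se = FALSE)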
Minimizing the error sum of squares does not solve everything.
anscombe
   x1 x2 x3 x4    y1   y2    y3    y4
1  10 10 10  8  8.04 9.14  7.46  6.58
2   8  8  8  8  6.95 8.14  6.77  5.76
3  13 13 13  8  7.58 8.74 12.74  7.71
4   9  9  9  8  8.81 8.77  7.11  8.84
5  11 11 11  8  8.33 9.26  7.81  8.47
6  14 14 14  8  9.96 8.10  8.84  7.04
7   6  6  6  8  7.24 6.13  6.08  5.25
8   4  4  4 19  4.26 3.10  5.39 12.50
9  12 12 12  8 10.84 9.13  8.15  5.56
10  7  7  7  8  4.82 7.26  6.42  7.91
11  5  5  5  8  5.68 4.74  5.73  6.89
Is this data.frame tidy?
anscombe_tidy <- anscombe %>%
  pivot_longer(
    everything(),
    # .value is a placeholder that will receive x and y
    names_to = c(".value", "set"),
    # regular expression to capture the variable letter and the set number
    names_pattern = "(.)(.)"
  ) %>%
  as_tibble()
anscombe_tidy
# A tibble: 44 x 3
   set       x     y
   <chr> <dbl> <dbl>
 1 1        10  8.04
 2 2        10  9.14
 3 3        10  7.46
 4 4         8  6.58
 5 1         8  6.95
 6 2         8  8.14
 7 3         8  6.77
 8 4         8  5.76
 9 1        13  7.58
10 2        13  8.74
# … with 34 more rows
Use broom::tidy to get a tibble back:

anscombe_tidy %>%
  group_nest(set) %>%
  mutate(model = map(data, ~ lm(y ~ x, data = .x)),
         tidy = map(model, tidy)) %>%
  unnest(tidy, .drop = TRUE)
Warning: The `.drop` argument of `unnest()` is deprecated as of tidyr 1.0.0.
All list-columns are now preserved.
This warning is displayed once per session.
Call `lifecycle::last_warnings()` to see where this warning was generated.
# A tibble: 8 x 8
  set   data            model term      estimate std.error statistic p.value
  <chr> <list>          <lis> <chr>        <dbl>     <dbl>     <dbl>   <dbl>
1 1     <tibble [11 ×…  <lm>  (Interce…    3.00      1.12       2.67 0.0257 
2 1     <tibble [11 ×…  <lm>  x            0.500     0.118      4.24 0.00217
3 2     <tibble [11 ×…  <lm>  (Interce…    3.00      1.13       2.67 0.0258 
4 2     <tibble [11 ×…  <lm>  x            0.5       0.118      4.24 0.00218
5 3     <tibble [11 ×…  <lm>  (Interce…    3.00      1.12       2.67 0.0256 
6 3     <tibble [11 ×…  <lm>  x            0.500     0.118      4.24 0.00218
7 4     <tibble [11 ×…  <lm>  (Interce…    3.00      1.12       2.67 0.0256 
8 4     <tibble [11 ×…  <lm>  x            0.500     0.118      4.24 0.00216
compare the \(R^2\) using glance
anscombe_tidy %>%
  group_nest(set) %>%
  mutate(model = map(data, ~ lm(y ~ x, data = .x)),
         glance = map(model, glance)) %>%
  unnest(glance, .drop = TRUE)
# A tibble: 4 x 14
  set   data  model r.squared adj.r.squared sigma statistic p.value    df
  <chr> <lis> <lis>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>
1 1     <tib… <lm>      0.667         0.629  1.24      18.0 0.00217     2
2 2     <tib… <lm>      0.666         0.629  1.24      18.0 0.00218     2
3 3     <tib… <lm>      0.666         0.629  1.24      18.0 0.00218     2
4 4     <tib… <lm>      0.667         0.630  1.24      18.0 0.00216     2
# … with 5 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
#   deviance <dbl>, df.residual <int>
all \(R^2\) are identical!
ggplot(anscombe_tidy, aes(x, y)) + geom_point() + facet_wrap(~ set, ncol = 2) + geom_smooth(method = "lm", se = FALSE)
Luc Buissière has a nice way of saying it, from his former teacher Prof. Glenn Morris Univ. of Toronto:
set.seed(123)
x <- seq(1, 10, length.out = 100) # our predictor
# our response: a function of x plus some random noise of increasing sd
map_dfr(seq(0.5, 20, length.out = 20),
        ~ lm(2 + 1.2 * x + rnorm(100, 0, sd = .x) ~ x) %>%
          broom::glance()) %>%
  mutate(sigma = seq(0.5, 20, length.out = 20)) %>%
  ggplot(aes(x = sigma, y = r.squared)) +
  geom_point(size = 2) +
  geom_line()
rlm() from MASS (Huber M-estimator) down-weights the outliers:
library(MASS)
anscombe_tidy %>%
  group_nest(set) %>%
  mutate(model = map(data, ~ rlm(y ~ x, data = .x)),
         tidy = map(model, tidy)) %>%
  unnest(tidy)
# A tibble: 8 x 7
  set   data              model  term        estimate std.error statistic
  <chr> <list>            <list> <chr>          <dbl>     <dbl>     <dbl>
1 1     <tibble [11 × 2]> <rlm>  (Intercept)    2.98   1.45          2.06
2 1     <tibble [11 × 2]> <rlm>  x              0.506  0.152         3.34
3 2     <tibble [11 × 2]> <rlm>  (Intercept)    3.06   1.31          2.33
4 2     <tibble [11 × 2]> <rlm>  x              0.5    0.137         3.64
5 3     <tibble [11 × 2]> <rlm>  (Intercept)    4.00   0.00404     990.  
6 3     <tibble [11 × 2]> <rlm>  x              0.346  0.000424    816.  
7 4     <tibble [11 × 2]> <rlm>  (Intercept)    3.00   1.26          2.38
8 4     <tibble [11 × 2]> <rlm>  x              0.500  0.132         3.80
ggplot(anscombe_tidy, aes(x, y)) + geom_point() + facet_wrap(~ set, ncol = 2) + geom_smooth(method = "rlm", se = FALSE)
boot_set <- function(tib, dataset, rep = 1000) {
  # per set using tidyeval
  filter(tib, set == {{dataset}}) %>%
    # generate `rep` datasets randomly
    modelr::bootstrap(rep) %>%
    mutate(model = map(strap, ~ lm(y ~ x, .x)),
           tidy = map(model, tidy)) %>%
    unnest(tidy)
}
#boot_set(anscombe_tidy, "I", 6)
my_sets <- set_names(1:4, nm = c("I", "II", "III", "IV"))
lm_boot <- map_dfr(my_sets, ~ boot_set(anscombe_tidy, .x), .id = "set")
filter(anscombe_tidy, set == 3) %>%
  modelr::bootstrap(12) %>%
  mutate(boot = map(strap, as_tibble)) %>%
  unnest(boot) %>%
  ggplot(aes(x = x, y = y)) +
  geom_point(alpha = 0.6, size = 1.2) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ .id)
Only set I displays a normal distribution of its bootstrapped coefficients.
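This can be visualised from the `lm_boot` object created above, for example by plotting the distribution of the bootstrapped slopes per set (a sketch):

lm_boot %>%
  filter(term == "x") %>%
  ggplot(aes(x = estimate)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ set, scales = "free")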
crossing(set = 1:4, prop = c(0.25, 0.5)) %>%
  mutate(cross = map2(set, prop,
                      ~ crossv_mset(tib = anscombe_tidy, dataset = .x, prop = .y)),
         MSE_slope = map_dbl(cross, "MSE"),
         MSE_inter = map_dbl(cross, "MSEIntercept")) %>%
  pivot_longer(cols = starts_with("MSE"),
               names_to = "term",
               values_to = "MSE") %>%
  ggplot(aes(x = set, y = MSE)) +
  geom_col(aes(fill = factor(prop)), position = "dodge") +
  facet_wrap(~ term, ncol = 1) +
  theme(legend.position = "bottom") +
  labs(x = NULL,
       fill = "prop of test/train",
       title = "Anscombe",
       subtitle = "100 samples per data split")
ref: David Robinson
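`crossv_mset()` is not defined in these slides; a hypothetical sketch of such a helper, built on modelr::crossv_mc(), could look like the following. Assumptions: `prop` is the proportion of rows held out for testing, and MSE / MSEIntercept are the mean squared errors of the slope and intercept estimates across 100 train/test splits, measured against the full-data fit.

crossv_mset <- function(tib, dataset, prop, n = 100) {
  one_set <- filter(tib, set == dataset)
  # reference coefficients from the full data of this set
  full_coef <- coef(lm(y ~ x, data = one_set))
  # n random train/test splits, `prop` of the rows held out for testing
  splits <- modelr::crossv_mc(one_set, n = n, test = prop)
  coefs <- map(splits$train, ~ coef(lm(y ~ x, data = as.data.frame(.x))))
  list(
    MSE          = mean(map_dbl(coefs, ~ (.x["x"] - full_coef["x"])^2)),
    MSEIntercept = mean(map_dbl(coefs, ~ (.x["(Intercept)"] - full_coef["(Intercept)"])^2))
  )
}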
When the trend is obviously non-linear, as seen in the residuals, it is time to move beyond linearity.
source: Logan M.
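A small illustration (a sketch with simulated data, not from the slides): fitting a straight line to a clearly curved relationship leaves an obvious pattern in the residuals.

set.seed(42)
curved <- tibble(x = runif(100, 0, 10),
                 y = 2 + 0.5 * x^2 + rnorm(100, sd = 2))
lin_fit <- lm(y ~ x, data = curved)
tibble(fitted = fitted(lin_fit), residuals = residuals(lin_fit)) %>%
  ggplot(aes(fitted, residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")
# a polynomial term, e.g. lm(y ~ poly(x, 2), data = curved), removes the pattern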
Some of the figures in this presentation are taken from:
An Introduction to Statistical Learning, with applications in R (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani
Biostatistical Design and Analysis Using R: A Practical Guide (Wiley-Blackwell, 2011), Murray Logan