To try out linear models in R, we are going to use the blood_fat dataset, which contains the age (in years), weight (in kg) and measured fat concentrations (units were not mentioned) in blood samples of different subjects.
We would like to determine whether the fat concentration in blood is related to the age of the subjects.
Load the blood_fat dataset (csv) available here in R.
Add a regression line (without the confidence interval ribbon).
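As a minimal sketch of the scatter plot with its regression line: the data below are made up so the example runs on its own, and the column names age and fat are assumptions about how the real csv is laid out.

```r
library(ggplot2)

# Made-up stand-in for the real data (column names are assumptions).
# With the real file, a call such as read.csv() on the downloaded csv
# would replace these three lines.
set.seed(1)
blood_fat <- data.frame(age = round(runif(25, 20, 60)))
blood_fat$fat <- 100 + 5 * blood_fat$age + rnorm(25, sd = 40)

p <- ggplot(blood_fat, aes(x = age, y = fat)) +
  geom_point() +
  # method = "lm" fits a straight line; se = FALSE hides the confidence ribbon
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x)
p
```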
Add a column to the blood_fat data frame containing the expected fat levels from the linear model.
Using geom_segment(), connect the expected values to the measured values as dotted lines.
Use the relevant broom function after fitting the linear model to fetch all this information in one tibble.
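The steps above could be sketched as follows. The data are made up, the column name expected is a hypothetical choice, and broom::augment() is assumed to be the broom function meant here. Base R assignment is used to keep the example light; mutate() would add the column just as well.

```r
library(ggplot2)

# Made-up stand-in for the real data (column names are assumptions)
set.seed(1)
blood_fat <- data.frame(age = round(runif(25, 20, 60)))
blood_fat$fat <- 100 + 5 * blood_fat$age + rnorm(25, sd = 40)

fit <- lm(fat ~ age, data = blood_fat)

# Expected (fitted) fat level for each subject
blood_fat$expected <- fitted(fit)

p <- ggplot(blood_fat, aes(x = age, y = fat)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x) +
  # one vertical dotted segment per subject, from measured to expected value
  geom_segment(aes(xend = age, yend = expected), linetype = "dotted")
p

# broom::augment() returns the data plus .fitted, .resid and other
# per-observation columns in a single tibble
if (requireNamespace("broom", quietly = TRUE)) {
  print(head(broom::augment(fit)))
}
```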
Calculate the slope and intercept of the regression line
Draw a dashed lightblue line, explicitly using the values you calculated in the previous question.
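One way to obtain the two coefficients by hand is the usual least-squares result, slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x). The sketch below uses made-up data with assumed column names, checks the result against coef(), and passes the values to geom_abline().

```r
library(ggplot2)

# Made-up stand-in for the real data (column names are assumptions)
set.seed(1)
blood_fat <- data.frame(age = round(runif(25, 20, 60)))
blood_fat$fat <- 100 + 5 * blood_fat$age + rnorm(25, sd = 40)

# Least-squares estimates computed by hand
slope <- cov(blood_fat$age, blood_fat$fat) / var(blood_fat$age)
intercept <- mean(blood_fat$fat) - slope * mean(blood_fat$age)

# The same values as reported by lm()
fit <- lm(fat ~ age, data = blood_fat)
coef(fit)

p <- ggplot(blood_fat, aes(x = age, y = fat)) +
  geom_point() +
  geom_abline(intercept = intercept, slope = slope,
              linetype = "dashed", colour = "lightblue")
p
```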
You learned that \(R^2\) can be calculated as follows:
\[R^2 = 1- \frac{\sum(y_i - \hat{y_i})^2}{\sum(y_i - \bar{y})^2} = 1- \frac{RSS}{TSS}\]
Using mutate(), add the length of each dotted line to the blood_fat data frame.
What do these lengths represent? Which function in R generates these values?
Use geom_segment() to connect all points representing the measured values to their projection on this horizontal line (using a darkgreen color with alpha = 0.2 and a size of 2).
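A possible sketch of this step, assuming the horizontal line sits at the mean of the measured fat levels (the \(\bar{y}\) of the formula above). The data are made up and the column names are assumptions. Recent versions of ggplot2 call the line thickness linewidth rather than size.

```r
library(ggplot2)

# Made-up stand-in for the real data (column names are assumptions)
set.seed(1)
blood_fat <- data.frame(age = round(runif(25, 20, 60)))
blood_fat$fat <- 100 + 5 * blood_fat$age + rnorm(25, sd = 40)

# Assumption: the horizontal line is drawn at the mean fat level
mean_fat <- mean(blood_fat$fat)

p <- ggplot(blood_fat, aes(x = age, y = fat)) +
  geom_point() +
  geom_hline(yintercept = mean_fat) +
  # one vertical segment per subject, from the measured value to the mean
  geom_segment(aes(xend = age, yend = mean_fat),
               colour = "darkgreen", alpha = 0.2, linewidth = 2)
p
```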
Using mutate(), add the length of each translucent green line to the blood_fat data frame. What do these lengths represent? And what does the sum of their squared lengths represent?
Calculate \(RSS\), \(TSS\) and \(R^2\)
What does the value of \(R^2\) tell you?
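The calculation can be sketched directly from the formula above. The data are made up, and the column names residual and deviation are hypothetical. The result is compared with the value reported by summary().

```r
# Made-up stand-in for the real data (column names are assumptions)
set.seed(1)
blood_fat <- data.frame(age = round(runif(25, 20, 60)))
blood_fat$fat <- 100 + 5 * blood_fat$age + rnorm(25, sd = 40)

fit <- lm(fat ~ age, data = blood_fat)

# Signed length of each dotted line: measured minus expected value
blood_fat$residual <- blood_fat$fat - fitted(fit)
# Signed length of each green line: measured value minus the overall mean
blood_fat$deviation <- blood_fat$fat - mean(blood_fat$fat)

rss <- sum(blood_fat$residual^2)   # residual sum of squares
tss <- sum(blood_fat$deviation^2)  # total sum of squares
r2 <- 1 - rss / tss

c(RSS = rss, TSS = tss, R2 = r2)
summary(fit)$r.squared  # should match r2
```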
The ggfortify package (an extension of ggplot2) and its function autoplot(fit) can produce the four classic diagnostic plots with no effort.
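A short sketch of both routes, using made-up data with assumed column names. The fallback to base R's plot() method for lm objects is an addition for machines where ggfortify is not installed.

```r
# Made-up stand-in for the real data (column names are assumptions)
set.seed(1)
blood_fat <- data.frame(age = round(runif(25, 20, 60)))
blood_fat$fat <- 100 + 5 * blood_fat$age + rnorm(25, sd = 40)

fit <- lm(fat ~ age, data = blood_fat)

if (requireNamespace("ggfortify", quietly = TRUE)) {
  library(ggfortify)
  # Four panels: residuals vs fitted, normal Q-Q,
  # scale-location, residuals vs leverage
  print(autoplot(fit))
} else {
  # Base R equivalent, arranged on a 2 x 2 grid
  par(mfrow = c(2, 2))
  plot(fit)
}
```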
Change the predictor and use weight instead of age to predict the fat concentration.
What does the ADF method tell you?
Check the summary and diagnostic plots for this regression.
Can the fat concentration in blood be explained by the weight of the subject?
In this exercise, we will use the diamonds dataset provided in the ggplot2 library. We would like to analyse the relationship between the price of diamonds and their weight (in carats), and limit our study to diamonds with a weight lower than or equal to 2.5 carats. This exercise is adapted from Hadley Wickham's example.
Create a new data frame diamonds2 containing only diamonds with a \(weight \leq 2.5\).
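The filtering step might look like the sketch below. In the diamonds data frame the weight is stored in the carat column. The variable n_removed is a hypothetical name used to compare the two data frames.

```r
library(ggplot2)  # provides the diamonds data frame

# Keep only diamonds weighing at most 2.5 carats
diamonds2 <- subset(diamonds, carat <= 2.5)

# Number of rows dropped by the filter
n_removed <- nrow(diamonds) - nrow(diamonds2)
n_removed
```

With dplyr loaded, filter(diamonds, carat <= 2.5) gives the same result.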
How many diamonds were excluded from diamonds2 when compared to the original diamonds data frame?
Plot the price against the weight using geom_hex() instead of geom_point(). It might also be appropriate to override the default number of bins in geom_hex() (set it for example to 50). You might need to install the package hexbin if your console tells you so. For the filling, the viridis palette offers a much better alternative to the default.
Add a red linear regression line to the plot
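A sketch of the binned plot with its regression line. Putting carat on the x axis and price on the y axis is an assumption, and scale_fill_viridis_c() is one way of getting the viridis palette. The plot is only printed when hexbin is installed, since geom_hex() relies on it.

```r
library(ggplot2)

diamonds2 <- subset(diamonds, carat <= 2.5)

p <- ggplot(diamonds2, aes(x = carat, y = price)) +
  geom_hex(bins = 50) +       # 50 bins instead of the default
  scale_fill_viridis_c() +    # viridis palette for the counts
  geom_smooth(method = "lm", colour = "red", formula = y ~ x)

# geom_hex() needs the hexbin package when the plot is drawn
if (requireNamespace("hexbin", quietly = TRUE)) {
  print(p)
}
```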
Draw the residual diagnostic plots.
Transform the data using a base 2 logarithm (log2).
Create two new columns lcarat and lprice containing the log2 transformed weights and prices respectively.
Redraw the density plots using the log2 transformed values
Add a red linear regression line again, using the log2 transformed values.
Redraw the diagnostic plots to analyse how the residuals are distributed.
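The transformed analysis could be sketched as below. The object names fit_log and p_log are hypothetical. The plot is only printed when hexbin is installed.

```r
library(ggplot2)

diamonds2 <- subset(diamonds, carat <= 2.5)

# log2 transformed weight and price
diamonds2$lcarat <- log2(diamonds2$carat)
diamonds2$lprice <- log2(diamonds2$price)

p_log <- ggplot(diamonds2, aes(x = lcarat, y = lprice)) +
  geom_hex(bins = 50) +
  scale_fill_viridis_c() +
  geom_smooth(method = "lm", colour = "red", formula = y ~ x)

if (requireNamespace("hexbin", quietly = TRUE)) {
  print(p_log)
}

# Linear model on the transformed scale; its residuals are what the
# diagnostic plots are drawn from
fit_log <- lm(lprice ~ lcarat, data = diamonds2)
coef(fit_log)
summary(residuals(fit_log))
```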
Use geom_line() to draw such a line.
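One reading of this step, stated here as an assumption, is that the line shows the predictions of the log2 model converted back to the original scale. The sketch below builds a hypothetical grid of 50 weights, predicts lprice for each, back-transforms with 2^, and draws the result with geom_line().

```r
library(ggplot2)

diamonds2 <- subset(diamonds, carat <= 2.5)
diamonds2$lcarat <- log2(diamonds2$carat)
diamonds2$lprice <- log2(diamonds2$price)
fit_log <- lm(lprice ~ lcarat, data = diamonds2)

# Hypothetical grid of weights covering the observed range
grid <- data.frame(
  carat = seq(min(diamonds2$carat), max(diamonds2$carat), length.out = 50)
)
grid$lcarat <- log2(grid$carat)
# Predictions are on the log2 scale, so 2^ converts them back to prices
grid$price <- 2^predict(fit_log, newdata = grid)

p <- ggplot(diamonds2, aes(x = carat, y = price)) +
  geom_hex(bins = 50) +
  scale_fill_viridis_c() +
  geom_line(data = grid, colour = "red")

if (requireNamespace("hexbin", quietly = TRUE)) {
  print(p)
}
```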
Using the model, calculate the price of a diamond weighing 1.75 carats.
Add this point to the previous plot.
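A sketch of the prediction, assuming "the model" is the linear model fitted on the log2 transformed values and that the previous plot uses the original carat and price scales. The names new_diamond and pred_price are hypothetical.

```r
library(ggplot2)

diamonds2 <- subset(diamonds, carat <= 2.5)
diamonds2$lcarat <- log2(diamonds2$carat)
diamonds2$lprice <- log2(diamonds2$price)
fit_log <- lm(lprice ~ lcarat, data = diamonds2)

# The model works on log2 values: transform the weight going in,
# and back-transform the predicted price coming out
new_diamond <- data.frame(carat = 1.75, lcarat = log2(1.75))
pred_price <- 2^predict(fit_log, newdata = new_diamond)
new_diamond$price <- unname(pred_price)
new_diamond

p <- ggplot(diamonds2, aes(x = carat, y = price)) +
  geom_hex(bins = 50) +
  scale_fill_viridis_c() +
  # the predicted diamond, drawn on top of the binned data
  geom_point(data = new_diamond, colour = "red", size = 3)

if (requireNamespace("hexbin", quietly = TRUE)) {
  print(p)
}
```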