Simple linear model

To try out linear models in R, we will use the blood_fat dataset, which contains the age (in years), the weight (in kg) and the fat concentration measured in blood samples (units not specified) of different subjects.

Relation of blood fat content and age

We would like to determine whether the fat concentration in blood is related to the age of the subjects.

Load the data (CSV file), available here, into R.

Visualization

If the goal is to predict the fat concentration from someone's age, identify the response and predictor variables and draw the corresponding scatter plot.
Add a regression line (without the confidence interval ribbon).
Add a new column to the blood_fat data frame containing the expected fat levels from the linear model.
- For each subject in the data frame, add these predicted values as red points on your previous plot.
- Using geom_segment(), connect the expected values to the measured values as dotted lines.

Tip

Remember to use the appropriate broom function on the fitted linear model to fetch all this information in one tibble.

Calculate the slope and intercept of the regression line.
Draw a dashed lightblue line, explicitly using the values you calculated in the previous question.
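The steps above could be sketched as follows. This is only one possible approach: the data frame is a small simulated stand-in (not the real blood_fat data), and the column names age and fat are assumptions about how the CSV is laid out.

```r
library(dplyr)
library(ggplot2)
library(broom)

# Simulated stand-in for blood_fat (NOT the real data); column names are assumed
set.seed(1)
blood_fat <- tibble(age = sample(20:60, 25, replace = TRUE),
                    fat = 100 + 5 * age + rnorm(25, sd = 40))

# fat is the response, age the predictor
fit <- lm(fat ~ age, data = blood_fat)

# augment() returns the data plus .fitted and .resid, one row per subject
blood_fat_aug <- augment(fit, data = blood_fat)

# Intercept and slope of the regression line
intercept <- coef(fit)[["(Intercept)"]]
slope <- coef(fit)[["age"]]

p <- ggplot(blood_fat_aug, aes(x = age, y = fat)) +
  geom_point() +
  # regression line without the confidence ribbon
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  # expected values as red points
  geom_point(aes(y = .fitted), colour = "red") +
  # dotted segments from each measured value to its expected value
  geom_segment(aes(xend = age, yend = .fitted), linetype = "dotted") +
  # same line drawn explicitly from the computed coefficients
  geom_abline(intercept = intercept, slope = slope,
              colour = "lightblue", linetype = "dashed")
```

Printing p should show the dashed lightblue line lying exactly on top of the geom_smooth() line, which is a quick check that the coefficients were extracted correctly.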

Calculate \(R^2\)

You learned that \(R^2\) can be calculated as follows:

\[R^2 = 1- \frac{\sum(y_i - \hat{y_i})^2}{\sum(y_i - \bar{y})^2} = 1- \frac{RSS}{TSS}\]

The lengths of the dotted lines you just drew are related to a term of this equation. Which one?
Using mutate(), add the length of each dotted line to the blood_fat data frame.
What do these lengths represent? Which R function generates these values?
Draw a darkgreen horizontal line showing the mean fat concentration of all measurements.
Use geom_segment() to connect all points representing the measured values to their projection on this horizontal line (using a darkgreen color with alpha = 0.2 and a size of 2).
Using mutate(), add the length of each translucent green line to the blood_fat data frame. What do they represent? And what does the sum of their squared lengths represent?

Tip

These lines are the deviations from the global mean of the response.

Calculate \(RSS\), \(TSS\) and \(R^2\)
Is there really a relationship between blood fat content and age?
What does the value of \(R^2\) tell you?
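As a minimal sketch, the formula above can be computed by hand and compared with the value reported by the model. The data frame is again a simulated stand-in (not the real blood_fat data), and the column names age, fat and dev_mean are assumptions made for illustration.

```r
library(dplyr)
library(broom)

# Simulated stand-in for blood_fat (NOT the real data); column names are assumed
set.seed(1)
blood_fat <- tibble(age = sample(20:60, 25, replace = TRUE),
                    fat = 100 + 5 * age + rnorm(25, sd = 40))

fit <- lm(fat ~ age, data = blood_fat)

sums <- augment(fit, data = blood_fat) %>%
  # deviation of each measure from the global mean (the green segments)
  mutate(dev_mean = fat - mean(fat)) %>%
  summarise(RSS = sum(.resid^2),      # residual sum of squares (dotted segments)
            TSS = sum(dev_mean^2)) %>% # total sum of squares
  mutate(R2 = 1 - RSS / TSS)

sums

# Cross-check against the R squared reported by broom
glance(fit)$r.squared
```

Both values should agree up to floating point error, since glance() reports the same quantity for a model with an intercept.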

Checking the residuals of the model

Mean of the residuals
- What is the expected value of the mean of the residuals?
- Compute the mean of the residuals for the model explaining fat by age.
Can the measurements be modelled appropriately in this way? Draw two diagnostic plots using ggplot2:
- In the first one, draw the residuals on the y axis and the estimated values on the x axis.
- The second one should be a quantile-quantile plot.

Tip

The ggfortify package and its autoplot(fit) function can produce the four classic diagnostic plots with no effort.
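For a manual version of the two diagnostic plots, a sketch along these lines might help. It relies on a simulated stand-in for blood_fat (not the real data) with assumed column names age and fat, and uses the standardised residuals for the quantile-quantile plot, which is one choice among several.

```r
library(dplyr)
library(ggplot2)
library(broom)

# Simulated stand-in for blood_fat (NOT the real data); column names are assumed
set.seed(1)
blood_fat <- tibble(age = sample(20:60, 25, replace = TRUE),
                    fat = 100 + 5 * age + rnorm(25, sd = 40))

fit <- lm(fat ~ age, data = blood_fat)
diag <- augment(fit)

# For a least-squares fit with an intercept, this is zero up to rounding error
mean(diag$.resid)

# Residuals against fitted values
p_resid <- ggplot(diag, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")

# Quantile-quantile plot of the standardised residuals
p_qq <- ggplot(diag, aes(sample = .std.resid)) +
  geom_qq() +
  geom_qq_line()
```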

Relation of blood fat content and weight

Change the predictor and use weight instead of age to predict the fat concentration.
What does the ADF method tell you?
Check the summary and the diagnostic plots for this regression.
Can the fat concentration in blood be explained by the weight of the subject?

Linear models and data transformation

In this exercise, we will use the diamonds dataset provided in the ggplot2 package. We would like to analyse the relationship between the price of diamonds and their weight (in carats), and we limit our study to diamonds with a weight lower than or equal to 2.5 carats. This exercise is adapted from Hadley Wickham's example.

Create a data frame diamonds2 containing only diamonds with a \(weight \leq 2.5\).
- How many entries are in this data frame?
- What is the proportion of entries contained in diamonds2 when compared to the original diamonds data frame?
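One way to build the subset and answer both questions is sketched below. The names n_kept and prop_kept are hypothetical helpers introduced only for this illustration.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Keep only diamonds weighing at most 2.5 carats
diamonds2 <- diamonds %>%
  filter(carat <= 2.5)

# Number of entries kept and their share of the original data frame
n_kept <- nrow(diamonds2)
prop_kept <- n_kept / nrow(diamonds)

n_kept
prop_kept
```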
Create a plot showing the price of diamonds being explained by their weight.

Tip

As you have seen, the dataset still contains a lot of entries, so try using geom_hex() instead of geom_point(). It might also be appropriate to override the default number of bins in geom_hex() (for example, set it to 50). You might need to install the hexbin package if your console tells you so. For the fill, the viridis palette offers a much better alternative to the default.
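A possible version of this plot is sketched here, assuming the hexbin package is installed (it is needed when the plot is rendered) and using scale_fill_viridis_c() from ggplot2 as one way to get the viridis palette.

```r
library(dplyr)
library(ggplot2)

diamonds2 <- filter(diamonds, carat <= 2.5)

# Hexagonal binning avoids overplotting tens of thousands of points
p_hex <- ggplot(diamonds2, aes(x = carat, y = price)) +
  geom_hex(bins = 50) +
  scale_fill_viridis_c()
```

Printing p_hex renders the plot. A red regression line can then be layered on top with geom_smooth(method = "lm", colour = "red").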

Add a red linear regression line to the plot.
Draw the residual diagnostic plots.
What do you conclude from these plots?
The hexagon binning plot already showed that a linear regression might not be appropriate and that diamonds with a low weight and a low price are over-represented.
- Draw a density plot showing how the weights are distributed.
- Draw a density plot showing how the prices are distributed.
To better discriminate lower weight/price values without excluding higher weights/prices, we can try to apply a log transformation (here, log2).
- Create two new columns lcarat and lprice containing the log2 transformed weights and prices respectively.
- Redraw the density plots using the log2 transformed values.
Redraw the first plot (price of diamonds being explained by their weight) using the log2 transformed values.
Add a red linear regression line again, using the log2 transformed values.
Redraw the diagnostic plots to analyse how the residuals are distributed.

What do you conclude from these plots?
Analyse the output of the linear model in base R.
- What are your conclusions?
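The log2 transformation and the corresponding model could look like the sketch below. The object names fit_log and p_dens are hypothetical, and only the density plot of the transformed prices is shown; the one for lcarat follows the same pattern.

```r
library(dplyr)
library(ggplot2)
library(broom)

# Subset and add the log2 transformed columns
diamonds2 <- diamonds %>%
  filter(carat <= 2.5) %>%
  mutate(lcarat = log2(carat),
         lprice = log2(price))

# Density of the transformed prices
p_dens <- ggplot(diamonds2, aes(x = lprice)) +
  geom_density()

# Linear model on the log2 scale
fit_log <- lm(lprice ~ lcarat, data = diamonds2)

summary(fit_log)  # base R output
tidy(fit_log)     # same coefficients as a tibble
```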

Getting back to the original linear scale

Now we would like to draw the plot using the raw values (without transformation) while showing the appropriate (back-transformed) regression line.
- You have already drawn, several times, points that lie on the regression line.
- Similarly, use all these points to draw a connecting red line.
- Do not forget to back-transform the predictions appropriately.

Tip

You might want to use geom_line() to draw such a line
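As a sketch of the back-transformation, the predictions of a model fitted on log2 values can be raised to the power of 2 to return to the original price scale. The column name pred_price and the object names are assumptions introduced for this example, and hexbin must be installed to render the plot.

```r
library(dplyr)
library(ggplot2)

diamonds2 <- diamonds %>%
  filter(carat <= 2.5) %>%
  mutate(lcarat = log2(carat),
         lprice = log2(price))

fit_log <- lm(lprice ~ lcarat, data = diamonds2)

# predict() returns log2(price); 2^x undoes the log2 transformation
diamonds2 <- diamonds2 %>%
  mutate(pred_price = 2^predict(fit_log))

p_back <- ggplot(diamonds2, aes(x = carat, y = price)) +
  geom_hex(bins = 50) +
  scale_fill_viridis_c() +
  # the back-transformed predictions form a curve on the raw scale
  geom_line(aes(y = pred_price), colour = "red")
```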

Estimating the price of diamonds

Graphically, what would be the price of a diamond weighing 2 carats?
Using the model, calculate the price of a diamond weighing 1.75 carats.
Add the point to the previous plot.
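The prediction for a single diamond could be obtained as sketched below, assuming the model was fitted on log2 values as in the previous section. The names new_diamond, price_175 and p_point are hypothetical, and annotate() is just one way to add a single point to the plot.

```r
library(dplyr)
library(ggplot2)

diamonds2 <- diamonds %>%
  filter(carat <= 2.5) %>%
  mutate(lcarat = log2(carat),
         lprice = log2(price))

fit_log <- lm(lprice ~ lcarat, data = diamonds2)

# The new observation needs the same predictor column as the model
new_diamond <- tibble(carat = 1.75) %>%
  mutate(lcarat = log2(carat))

# Predict on the log2 scale, then back-transform to a price
price_175 <- unname(2^predict(fit_log, newdata = new_diamond))
price_175

# Add the predicted point to a plot on the raw scale
p_point <- ggplot(diamonds2, aes(x = carat, y = price)) +
  geom_hex(bins = 50) +
  scale_fill_viridis_c() +
  annotate("point", x = 1.75, y = price_175, colour = "red", size = 3)
```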

TD - simple linear models

Eric Koncina, Aurélien Ginolhac

2019-11-14