---
title: "My answers"
author: "My name"
date: '`r format(Sys.time(), "%d %B, %Y")`'
output: html_document
---


> Objective: summarise a large transcriptomics study using linear regressions

## Yeast data


In [2008, Brauer et al.](http://www.molbiolcell.org/content/19/1/352.abstract) used microarrays to test the effect of starvation on the growth rate of yeast. 
For example, they tried limiting the yeast's supply of glucose (sugar to metabolize into energy), of leucine (an essential amino acid), or of ammonia (a source of nitrogen).


### Retreive the data

- Download [`Brauer2008_DataSet1.tds`](data/Brauer2008_DataSet1.tds) inside a `data` folder you should create.

Load the [`Brauer2008_DataSet1.tds`] file as a `tibble` named `original_data`. 
This is the exact data that was published with the paper (though for some reason the link on the journal's page is broken).
It thus serves as a good example of tidying a biological dataset "found in the wild".


```{r}
# Write your answer here
```


```{r}
# Write your answer here
```


### Tidying the data

Have a look at the dataset. Is the data "tidy"?

```{r}
# Write your answer here
```

```{r}
# Write your answer here
```


#### Many variables are stored in one column NAME

* **Gene name** e.g. _SFB2_. Note that not all genes have a name.
* **Biological process** e.g. "proteolysis and peptidolysis"
* **Molecular function** e.g. "metalloendopeptidase activity"
* **Systematic ID** e.g. _YNL049C_. Unlike a gene name, every gene in this dataset has a systematic ID.
* **Another ID number** e.g. `1082129`. We don't know what this number means, and it's not annotated in the paper. Oh, well.

1. Use the appropriate function provided in the `tidyr` library to split these values and generate a column for each variable. Preferred names for new columns are

* "name"
* "BP"
* "MF"
* "systematic_name"
* "number"

```{block, opts.label = "tip"}
Special characters such as pipes (`|`) must be despecialized as they have a specific meaning. 
In R, you **have** to use 2 backslashes like `\\|` for one pipe `|`
```


```{r}
# Write your answer here
```


2. Once you separated the variables delimited by two "`||`", check closer the new values: You will see that they might start and/or end with white-spaces which might be inconvenient during the subsequent use.
    + To remove these white-spaces, R base provides a function called `trimws()`. Let's test how the function works:
    + `dplyr` allows us to apply a function (in our case `trimws()`) to all columns. In other words, we would like to modify the content of each column with the output of the function `trimws()`. How can you achieve this? Save the result in a `tibble` called `cleaned_data`.

<!-- Moving chunk down as it breaks the indentation in HTML (Is it working in Pdf?) -->
```{r trimws demo}
# Creating test string with whitespaces:
s <- "  Removing whitespaces at both ends "
s
trimws(s)
```

```{r}
# Write your answer here
```


3. We are not going to use every column of the dataframe. Remove the unnecessary columns: `number`, `GID`, `YORF` and `GWEIGHT`.Save as `cleaned_data`

```{r}
# Write your answer here
```


Look at the column names.  
Do you think that our dataset is now "tidy"?

```{r}
# Write your answer here
```


#### Column headers are values, not variable names

+ Keep care to build a tibble with each column representing a variable: 
At this point we are storing the sample name (will contain `G0.05` ...) as a different column `sample` associated to values in `expression` column. Save as `cleaned_data_melt`

```{r}
# Write your answer here
```


Now look at the content of the `sample` column. We are again facing the problem that two variables are stored in a single column. The `nutrient` in the first character, then the `growth rate`.  

```{r}
# Write your answer here
```


Use the same function as before to split the `sample` column into two variables `nutrient` and `rate` (use the appropriate delimitation in `sep`. Save as `cleaned_data_melt`

```{block, opts.label = "tip"}
Consider using the `convert` argument. It allows to convert strings to number when relevant like here.
```

```{r}
# Write your answer here
```


## 1.3 Turn nutrient letters into more comprehensive words

Right now, the nutrients are designed by a single letter. It would be nice to have the full word instead.
One could use a full mixture of `if` and `else` such as `if_else(nutrient == "G", "Glucose", if_else(nutrient == "L", "Leucine", etc ...))`
But, that would be cumbersome. 

using the following correspondences and `dplyr::recode()`, recode all nutrient names with their full explicit names.


```{r}
# Write your answer here
```


#### Cleaning up missing data

Two variables must be present for the further analysis:

- gene expression named as `expression`
- systematic id named as `systematic_name`

delete observations that are missing any of the two mandatory variables. How many rows did you remove?

```{r}
# Write your answer here
```


### 2 Representing the data

Tidying the data is a crucial step allowing easy handling and representing.

#### 2.1 Plot the expression data of the _LEU1_ gene

Extract the data corresponding to the gene called _LEU1_ and draw a line for each nutrient showing the expression in function of the growth rate.

```{r}
# Write your answer here
```


#### 2.2 Plot the expression data of a biological process

For this, we don't need to filter by single gene names as the raw data provides us some information on the biological process for each gene.  
Extract all the genes in the **leucine biosynthesis** process and plot the expression in function of the growth rate for each nutrient.

```{r}
# Write your answer here
```


#### 2.3 Perform a linear regression in top of the plots

Let's play with the graph a little more. These trends look vaguely linear.  
Add a linear regression with the appropriate `ggplot2` function and carefully adjust the `method` argument.

```{r}
# Write your answer here
```


#### 2.4 Switch to another biological process

Once the dataset is tidy, it is very easy to switch to another biological process.
Instead of the "leucine biosynthesis", plot the data corresponding to **sulfur metabolism**.

```{block, opts.label = "tip"}
you can combine the facet headers using `+` in `facet_wrap()`. Adding the systematic name allows to get a name when the gene name is missing.
```

```{r}
# Write your answer here
```


### 3. Linear models

We can see that most genes follow a linear trend under a starvation stress. Instead of manipulating all data points, we would like to estimate just the 2 parameters of a linear model.
What are they? And what do they represent for this experiment?

```{r}
# Write your answer here
```


#### 3.1 Perform all linear models

Perform a linear regression of `expression` explained by `rate` for all group of name, systematic_name and nutrient.
By sure to convert the linear models to a `tibble` using `broom` as saved as `cleaned_lm`

```{block, opts.label = "warning"}
The computation takes ~ 60 sec on my macbook pro. For test purposes, you may want to subsample per group a small number of genes
```

```{block,  opts.label = "tip"}
the appropriate workflow is to **group** the data, **nest** them, then add a column with the linear models objects.
From this model column, you can extract the coefficients, r-square etc using the broom functions
```

```{r}
# Write your answer here
```


```{r}
# Write your answer here
```


- How many models did you perform?
```{r}
# Write your answer here
```


- bonus question: you have observed a warning `essentially perfect fit: summary may be unreliable`, why? How can we clean up the dataset to avoid it?

```{r}
# Write your answer here
```


```{r}
# Write your answer here
```


#### 3.2 Explore models, intercept values

For the gene _LEU1_, let's look again at the linear models per nutrient and add as a dashed black line for mean of all intercepts.

```{block,  opts.label = "tip"}
from the main tibble, you could remove the `data` and `model` column and unnest the `tidy` one
```

```{r}
# Write your answer here
```


```{r}
# Write your answer here
```


We can see that when **leucine** is the limiting factor, the yeast expresses massively this gene _LEU1_ to compensate.
Remember that all experiments are performed at constant growth rates in this chemostat. 

Now, compute the difference between the intercept and the mean(intercept) per systematic name, so for all nutrient per gene.
And select the top 20 highest intercept values to their means. Save as `top_20_intercept`

```{r}
# Write your answer here
```


- join those 20 genes with the `cleaned_data_melt` by the `systematic_name` to retrieved the original data points
- plot the data points and linear trends for those top 20 genes. Instead of the systematic name in the header, use the molecular function (`MF`)

```{r}
# Write your answer here
```


- what can conclude?

```{r}
# Write your answer here
```


#### 3.3 Explore models, slope estimates

We can now have a look at the slope estimates.

- first, filter the term for slope estimate `rate` and remove _p-value_ that are missing. Save as `rate_slopes`

```{r}
# Write your answer here
```


- display the _p-values_ histograms per nutrient

```{r}
# Write your answer here
```


- looking at those histograms, how do you estimate the ratio of _null_ and _alternative_ hypotheses?
```{r}
# Write your answer here
```

- how can retrieved the genes that belong most probably to the _alternative hypothesis?
```{r}
# Write your answer here
```


```{r}
# Write your answer here
```


- display the histograms of both _p_ and _q_ _values_ per nutrient

```{r}
# Write your answer here
```


```{r}
# Write your answer here
```


### Empirical bayesian estimation


```{r}
# Write your answer here
```


```{r}
# Write your answer here
```


```{r}
# Write your answer here
```


```{r}
# Write your answer here
```


```{r}
# Write your answer here
```


```{r}
# Write your answer here
```


```{r}
# Write your answer here
```


```{r}
# Write your answer here
```