TD - Simpson’s paradox

1 The UCBAdmissions dataset

The UCBAdmissions dataset (which ships with R) contains the number of applicants to the graduate school at Berkeley for the six largest departments in fall 1973 (Bickel, Hammel, and O’Connell 1975). The data are cross-classified by admission status, sex, and department.

The UCBAdmissions dataset is a 3-dimensional array.

Coerce UCBAdmissions to a tibble.
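One possible way to do the coercion is sketched here, assuming the tibble package is installed. The object name `ucb` is our own choice and not part of the exercise.

```r
library(tibble)

# UCBAdmissions is a 2 x 2 x 6 table with dimensions Admit, Gender and Dept
dim(UCBAdmissions)
names(dimnames(UCBAdmissions))

# as_tibble() has a method for tables: one row per cell, counts stored in `n`
ucb <- as_tibble(UCBAdmissions)
ucb
```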

## # A tibble: 24 x 4
##    Admit    Gender Dept      n
##    <chr>    <chr>  <chr> <dbl>
##  1 Admitted Male   A       512
##  2 Rejected Male   A       313
##  3 Admitted Female A        89
##  4 Rejected Female A        19
##  5 Admitted Male   B       353
##  6 Rejected Male   B       207
##  7 Admitted Female B        17
##  8 Rejected Female B         8
##  9 Admitted Male   C       120
## 10 Rejected Male   C       205
## # … with 14 more rows

To determine whether there was discrimination against women in graduate admissions, which statistical test would you use?

2 Overall analysis

2.1 Contingency table

First, we would like to organise this dataset in order to visualise the overall admittance to UC Berkeley for males and females.

Count the number of admissions and rejections for both genders.
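A minimal sketch of this step, assuming dplyr and tibble are available. The names `ucb` and `overall` are illustrative only.

```r
library(dplyr)
library(tibble)

ucb <- as_tibble(UCBAdmissions)

# Add up the counts over the six departments for each Gender / Admit pair
overall <- ucb %>%
  group_by(Gender, Admit) %>%
  summarise(n = sum(n), .groups = "drop")

overall
```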

2.2 Proportion table

To render a better overview, instead of the number of occurrences, calculate the proportion of admission for each gender.
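The proportions could be obtained along these lines. This is a sketch rather than the reference solution; `overall_prop` and `prop` are names we introduce here.

```r
library(dplyr)
library(tibble)

ucb <- as_tibble(UCBAdmissions)

overall_prop <- ucb %>%
  group_by(Gender, Admit) %>%
  # "drop_last" keeps the result grouped by Gender only
  summarise(n = sum(n), .groups = "drop_last") %>%
  # within each gender, divide by that gender's total number of applicants
  mutate(prop = n / sum(n)) %>%
  ungroup()

overall_prop
```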

2.3 Graphical representations

We would like to visualise the data.

Use geom_col() to represent the data as distinct columns. Use different facets to draw the counts and the proportions on the same plot.

Tip

If you display both absolute numbers and proportions using facets, free the scales so that each facet gets its own y-axis scale.

Try adjusting the position argument of geom_col() (the "stack" and "fill" options) to see the difference.
Based on the tables and graphical representations what would you conclude?
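The plotting steps above might be sketched as follows, assuming dplyr, tidyr, ggplot2 and tibble are installed. The long-format layout and the names `overall_long`, `measure` and `value` are our own assumptions about how to get both quantities into facets.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(tibble)

overall_long <- as_tibble(UCBAdmissions) %>%
  group_by(Gender, Admit) %>%
  summarise(count = sum(n), .groups = "drop_last") %>%
  mutate(proportion = count / sum(count)) %>%
  ungroup() %>%
  # one row per Gender / Admit / measure so that `measure` can define the facets
  pivot_longer(c(count, proportion), names_to = "measure", values_to = "value")

p_overall <- ggplot(overall_long, aes(x = Gender, y = value, fill = Admit)) +
  geom_col(position = "stack") +
  # free_y gives the count and proportion facets their own y-axis
  facet_wrap(~ measure, scales = "free_y")

p_overall
```

Replacing `position = "stack"` with `position = "fill"` rescales every bar to a height of 1, which makes the counts facet look like the proportions facet.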

2.4 Statistical analysis

To perform the \(\chi^2\) test, we will use the chisq.test() function from base R.

Tip

Have a look at the chisq.test() help page to see how to perform the test. Notice that there are different ways to supply the data to the function.

The help page states that chisq.test() accepts a matrix. In addition, in the details, we can read that if x is a matrix, it is considered as a 2D contingency table.
First, let’s check whether a data frame formatted like a contingency table is a suitable argument (x) for the chisq.test() function.

Use the count data (you generated before) to create a contingency table (one categorical variable as column names, the remaining one in a column).

chisq.test() expects a matrix. Take care to retain only the numeric columns before supplying your data frame to the function.
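As a sketch of this route, assuming dplyr, tidyr and tibble: the counts are spread into one column per admission status, the non-numeric column is dropped, and the result is handed to chisq.test(), which converts a data frame to a matrix internally. The names `wide` and `res_df` are illustrative.

```r
library(dplyr)
library(tidyr)
library(tibble)

wide <- as_tibble(UCBAdmissions) %>%
  group_by(Gender, Admit) %>%
  summarise(n = sum(n), .groups = "drop") %>%
  # Admit levels become column names, Gender stays as a regular column
  pivot_wider(names_from = Admit, values_from = n)

wide

# Keep only the numeric columns before running the test
res_df <- chisq.test(select(wide, -Gender))
res_df
```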

The stats package in base R provides the function xtabs() to generate contingency tables that can be further used in the chisq.test() function.

xtabs() expects a formula. Construct the formula as usual: the response (here, the counts) on the left-hand side and the factors on the right-hand side.
Generate the contingency table using the formula and xtabs().
Perform the \(\chi^2\) test again using this contingency table.
What is your conclusion?
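The xtabs() route described above could look like this. It is a sketch; `tab` and `res_xt` are names chosen for illustration.

```r
library(tibble)

ucb <- as_tibble(UCBAdmissions)

# Counts on the left-hand side, classifying factors on the right-hand side;
# xtabs() sums n over the variable left out of the formula (Dept)
tab <- xtabs(n ~ Gender + Admit, data = ucb)
tab

res_xt <- chisq.test(tab)
res_xt
```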

3 In-depth analysis

Now we would like to determine whether the overall acceptance trend is representative of each single department.

3.1 Graphical representations

First, similar to your previous representations, show the frequencies and proportions for each of the 6 departments.

Tip

The bars will be coloured by admission status (Admitted and Rejected). By default, ggplot2 assigns red to Admitted.
You might want to change this behaviour by either reordering the levels or changing the colour scale (scale_fill_manual()).

Note that, again, we don’t need to compute the proportions ourselves: position = "fill" does it for us.
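A possible sketch of the per-department plots, assuming ggplot2 and tibble. The two colours are an arbitrary choice of ours, made only to avoid red for Admitted.

```r
library(ggplot2)
library(tibble)

ucb <- as_tibble(UCBAdmissions)

p_base <- ggplot(ucb, aes(x = Gender, y = n, fill = Admit)) +
  facet_wrap(~ Dept) +
  # named vector: colours are matched to the Admit values by name
  scale_fill_manual(values = c(Admitted = "forestgreen", Rejected = "firebrick"))

# Frequencies: stacked bars on the original count scale
p_counts <- p_base + geom_col(position = "stack") + labs(y = "Applicants")

# Proportions: position = "fill" rescales each bar to sum to 1
p_props <- p_base + geom_col(position = "fill") + labs(y = "Proportion")

p_counts
p_props
```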

3.2 Statistical analysis

Now, perform the appropriate statistical test on the data for each single department.

Using purrr, adapt the method you just used so that the statistical test is run for each department.
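One way to run the test per department is sketched below, assuming purrr, dplyr and tibble. Splitting the tibble with split() is our choice; nesting with tidyr would work as well. The names `tabs`, `tests` and `p_values` are hypothetical.

```r
library(dplyr)
library(purrr)
library(tibble)

ucb <- as_tibble(UCBAdmissions)

# One 2 x 2 contingency table per department
tabs <- ucb %>%
  split(.$Dept) %>%
  map(~ xtabs(n ~ Gender + Admit, data = .x))

# One chi-squared test per table
tests <- map(tabs, chisq.test)

# Extract the p-values into a named numeric vector (names are the departments)
p_values <- map_dbl(tests, "p.value")
p_values
```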

Based on the representations and statistical tests, what is your conclusion?

3.3 Explaining the apparent discrepancy

This apparent discrepancy is known as Simpson’s paradox. How can we try to explain it?

Visualise the success rate (gender independent) for each department
Visualise the gender preference for the applications to each department
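Both visualisations can be sketched directly from the tibble by letting position = "fill" compute the shares. This is one option among several, assuming ggplot2 and tibble; `p_success` and `p_gender` are illustrative names.

```r
library(ggplot2)
library(tibble)

ucb <- as_tibble(UCBAdmissions)

# Success rate per department, both genders pooled:
# geom_col() stacks the male and female counts within each Admit colour
p_success <- ggplot(ucb, aes(x = Dept, y = n, fill = Admit)) +
  geom_col(position = "fill") +
  labs(y = "Proportion of applicants")

# Share of male and female applicants per department
p_gender <- ggplot(ucb, aes(x = Dept, y = n, fill = Gender)) +
  geom_col(position = "fill") +
  labs(y = "Proportion of applicants")

p_success
p_gender
```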

Finally, try to visualise a possible correlation between the women’s application preference and the success rate for the 6 departments:

compute the preference as the number of applications per Department and Gender
compute the frequency of each number of applications compared to all applications
filter for the women frequencies, saved as women_pref

Tip

Now, we need to compute the proportions. This is done by the group_by() instruction on the two key parameters used to calculate the success rate: Dept and Admit. Then, the summarise() function returns \(6 \times 2 = 12\) lines (6 levels for Dept and 2 for Admit).
It is also very important to recall that summarise() peels off one grouping level from the right. Afterwards, we can use mutate() to compute the proportions since the data is grouped only by Dept at this stage. The same is done for the women’s preference, except that the grouping key variables are Dept and Gender.

compute the success rate for each department
join the women_pref tibble: you will end up with a tibble showing for each department the success rate together with the women’s application preference.
plot the women’s preference as a function of the success rate.
test how much the observed trend is supported.
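The steps above can be sketched as follows, assuming dplyr, ggplot2 and tibble. Using cor.test() for the last step is our assumption; the exercise does not name a specific test, and a linear model would be another option. Apart from women_pref, all object and column names are illustrative.

```r
library(dplyr)
library(ggplot2)
library(tibble)

ucb <- as_tibble(UCBAdmissions)

# Share of women among the applicants of each department
women_pref <- ucb %>%
  group_by(Dept, Gender) %>%
  summarise(n = sum(n), .groups = "drop_last") %>%  # still grouped by Dept
  mutate(pref = n / sum(n)) %>%
  ungroup() %>%
  filter(Gender == "Female") %>%
  select(Dept, pref)

# Share of admitted applicants in each department, genders pooled
success <- ucb %>%
  group_by(Dept, Admit) %>%
  summarise(n = sum(n), .groups = "drop_last") %>%  # still grouped by Dept
  mutate(success_rate = n / sum(n)) %>%
  ungroup() %>%
  filter(Admit == "Admitted") %>%
  select(Dept, success_rate)

dept_summary <- inner_join(success, women_pref, by = "Dept")
dept_summary

p_trend <- ggplot(dept_summary, aes(x = success_rate, y = pref, label = Dept)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  geom_text() +
  labs(x = "Success rate", y = "Share of female applicants")
p_trend

# Pearson correlation test on the six departments (assumed choice of test)
trend_test <- cor.test(dept_summary$success_rate, dept_summary$pref)
trend_test
```

With only six departments, any such test has little power, so the estimate and the plot deserve as much attention as the p-value.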

What is your final conclusion?

Bickel, P. J., E. A. Hammel, and J. W. O’Connell. 1975. “Sex Bias in Graduate Admissions: Data from Berkeley.” Science 187 (4175): 398–404. doi:10.1126/science.187.4175.398.