The UCBAdmissions dataset (which ships with R) contains the number of applicants to the graduate school at Berkeley for the six largest departments in fall 1973 (Bickel, Hammel, and O’Connell 1975). The data are cross-classified by admission status, sex, and department.
The UCBAdmissions dataset is a 3-dimensional array.
Convert UCBAdmissions to a tibble.
## # A tibble: 24 x 4
##    Admit    Gender Dept      n
##    <chr>    <chr>  <chr> <dbl>
##  1 Admitted Male   A       512
##  2 Rejected Male   A       313
##  3 Admitted Female A        89
##  4 Rejected Female A        19
##  5 Admitted Male   B       353
##  6 Rejected Male   B       207
##  7 Admitted Female B        17
##  8 Rejected Female B         8
##  9 Admitted Male   C       120
## 10 Rejected Male   C       205
## # … with 14 more rows
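One possible way to obtain the tibble shown above is sketched here. It assumes the tibble package is installed; the object name `ucb` is only an illustrative choice.

```r
library(tibble)

# UCBAdmissions is a 3-dimensional table: Admit x Gender x Dept
dim(UCBAdmissions)
dimnames(UCBAdmissions)

# as_tibble() on a table returns one row per cell, with the counts in column `n`
ucb <- as_tibble(UCBAdmissions)  # `ucb` is a hypothetical name
ucb
```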
First, we would like to organise this dataset in order to visualise the overall admittance to UC Berkeley for males and females.
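A minimal sketch of this reorganisation, assuming dplyr and tibble are available: the counts are summed over departments, then turned into proportions within each gender. The names `ucb`, `overall`, and `prop` are illustrative, not part of the original exercise.

```r
library(dplyr)
library(tibble)

ucb <- as_tibble(UCBAdmissions)

overall <- ucb %>%
  group_by(Gender, Admit) %>%
  # sum over departments; "drop_last" keeps the grouping by Gender
  summarise(n = sum(n), .groups = "drop_last") %>%
  # proportion of admitted / rejected within each gender
  mutate(prop = n / sum(n)) %>%
  ungroup()

overall
```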
We would like to visualise the data.
Use geom_col() to represent the data as distinct columns. Use different facets to draw the counts and the proportions on the same plot. Try adjusting the position argument of geom_col() to see the difference between the stack and fill options.
Based on the tables and graphical representations, what would you conclude?
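One way to draw counts and proportions side by side is to reshape the summary to long format and facet on the kind of measure. This is only a sketch under the assumption that dplyr, tidyr, and ggplot2 are installed; `overall`, `overall_long`, `measure`, and `value` are hypothetical names.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(ggplot2)

overall <- as_tibble(UCBAdmissions) %>%
  group_by(Gender, Admit) %>%
  summarise(n = sum(n), .groups = "drop_last") %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# one row per Gender x Admit x measure ("n" or "prop")
overall_long <- overall %>%
  pivot_longer(c(n, prop), names_to = "measure", values_to = "value")

p <- ggplot(overall_long, aes(x = Gender, y = value, fill = Admit)) +
  geom_col(position = "stack") +            # try position = "fill" as well
  facet_wrap(~ measure, scales = "free_y")  # counts and proportions need separate y scales
p
```

With position = "fill", ggplot2 rescales each bar to height 1, so the raw counts alone would already display as proportions.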
To perform the \(\chi^2\) test, we will use the chisq.test() function implemented in R.
Read the chisq.test() help page to see how to perform the test. Notice that there are different ways to supply the data to the function.
The help page states that chisq.test() accepts a matrix. In addition, in the details, we can read that if x is a matrix, it is considered as a 2D contingency table.
First, let’s check whether a data frame formatted like a contingency table is a suitable argument (x) for the chisq.test() function.
chisq.test() expects a matrix. Take care to retain only the numeric columns before supplying your data frame to the function.
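The following sketch illustrates this first approach, assuming dplyr, tidyr, and tibble are installed. The data frame is spread into one row per gender and only the two numeric columns are passed on; chisq.test() converts a data frame to a matrix internally. `gender_wide` and `res_df` are illustrative names.

```r
library(dplyr)
library(tidyr)
library(tibble)

gender_wide <- as_tibble(UCBAdmissions) %>%
  group_by(Gender, Admit) %>%
  summarise(n = sum(n), .groups = "drop") %>%
  # one row per gender, one column per admission status
  pivot_wider(names_from = Admit, values_from = n)

res_df <- gender_wide %>%
  select(Admitted, Rejected) %>%  # keep only the numeric columns
  chisq.test()

res_df
```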
The stats package in base R provides the function xtabs() to generate contingency tables that can be further used in the chisq.test() function.
xtabs() expects a formula. Construct the formula as usual: the response (here, the counts) on the left-hand side and the factors on the right-hand side. Generate the contingency table using the formula and xtabs().
Perform the \(\chi^2\) test again using this contingency table.
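A minimal sketch of the xtabs() route, assuming the tibble package for the conversion step. The counts `n` go on the left of the formula and the two classifying factors on the right, so xtabs() sums over the departments. `gender_tab` and `res_tab` are hypothetical names.

```r
library(tibble)

ucb <- as_tibble(UCBAdmissions)

# 2 x 2 contingency table: Gender (rows) by Admit (columns)
gender_tab <- xtabs(n ~ Gender + Admit, data = ucb)
gender_tab

res_tab <- chisq.test(gender_tab)
res_tab
```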
What is your conclusion?
Now we would like to determine whether the overall acceptance trend is representative of each single department.
By default, ggplot2 will assign red to Admitted (this can be changed with scale_fill_manual()).
Note that, again, we don’t need to compute the proportions: position = "fill" does it for us.
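A possible per-department plot is sketched below, assuming ggplot2 and tibble are installed. The colour values are arbitrary illustrative choices, and `p_dept` is a hypothetical name.

```r
library(tibble)
library(ggplot2)

ucb <- as_tibble(UCBAdmissions)

p_dept <- ggplot(ucb, aes(x = Gender, y = n, fill = Admit)) +
  geom_col(position = "fill") +   # bars rescaled to proportions
  facet_wrap(~ Dept) +            # one panel per department
  scale_fill_manual(values = c(Admitted = "steelblue", Rejected = "grey60")) +
  labs(y = "Proportion")
p_dept
```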
Now, perform the appropriate statistical test on the data for each single department.
Adjust the method you just used to perform the statistical test using purrr in order to run it for each department.
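One way to adapt the xtabs() approach with purrr is to split the data by department and map the test over the pieces. This is a sketch assuming purrr and tibble are installed; `by_dept`, `dept_tests`, and `dept_p` are illustrative names.

```r
library(tibble)
library(purrr)

ucb <- as_tibble(UCBAdmissions)

# named list with one tibble per department
by_dept <- split(ucb, ucb$Dept)

# one chi-squared test per department on its Gender x Admit table
dept_tests <- map(by_dept, function(d) {
  chisq.test(xtabs(n ~ Gender + Admit, data = d))
})

# extract the p-values as a named numeric vector
dept_p <- map_dbl(dept_tests, function(res) res$p.value)
dept_p
```

Since six tests are run, a multiple-testing adjustment such as p.adjust() may be worth considering when interpreting the p-values.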
Based on the representations and statistical tests, what is your conclusion?
This apparent discrepancy is known as Simpson’s paradox. How can we try to explain it?
Visualise the success rate (gender independent) for each department
Visualise the gender preference for the applications to each department
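Both views can be sketched with position = "fill", mapping fill to Admit for the success rate and to Gender for the application preference. The example assumes ggplot2 and tibble are installed; `p_success` and `p_pref` are hypothetical names.

```r
library(tibble)
library(ggplot2)

ucb <- as_tibble(UCBAdmissions)

# success rate per department, genders pooled
p_success <- ggplot(ucb, aes(x = Dept, y = n, fill = Admit)) +
  geom_col(position = "fill") +
  labs(y = "Proportion of applicants")

# share of male and female applicants per department
p_pref <- ggplot(ucb, aes(x = Dept, y = n, fill = Gender)) +
  geom_col(position = "fill") +
  labs(y = "Proportion of applicants")

p_success
p_pref
```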
Finally, try to visualise a possible correlation between the women’s application preference and the success rate for the 6 departments:
Use a group_by() instruction on the two key variables used to calculate the success rate: Dept and Admit. Then, the summarise() function returns \(6 \times 2 = 12\) rows (6 levels for Dept and 2 for Admit). summarise() also peels off one grouping level from the right. Afterwards, we can use mutate() to compute the proportions, since the data is grouped only by Dept at this stage. The same is done for the women’s preference (the women_pref tibble), except that the grouping variables are Dept and Gender.
Join the women_pref tibble: you will end up with a tibble showing, for each department, the success rate together with the women’s application preference.
Plot the women’s preference as a function of the success rate.
Test how strongly the observed trend is supported.
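The steps above could be sketched as follows, assuming dplyr, tibble, and ggplot2 are installed. Apart from women_pref, all names (`success`, `rates`, `success_rate`, `p_trend`, `trend`) are illustrative, and the Pearson correlation test via cor.test() is only one possible choice of test; the original exercise does not specify which one to use.

```r
library(dplyr)
library(tibble)
library(ggplot2)

ucb <- as_tibble(UCBAdmissions)

# success rate per department (genders pooled)
success <- ucb %>%
  group_by(Dept, Admit) %>%
  summarise(n = sum(n), .groups = "drop_last") %>%  # 12 rows, grouped by Dept
  mutate(success_rate = n / sum(n)) %>%
  ungroup() %>%
  filter(Admit == "Admitted") %>%
  select(Dept, success_rate)

# share of women among the applicants of each department
women_pref <- ucb %>%
  group_by(Dept, Gender) %>%
  summarise(n = sum(n), .groups = "drop_last") %>%
  mutate(women_pref = n / sum(n)) %>%
  ungroup() %>%
  filter(Gender == "Female") %>%
  select(Dept, women_pref)

rates <- inner_join(success, women_pref, by = "Dept")

p_trend <- ggplot(rates, aes(x = success_rate, y = women_pref, label = Dept)) +
  geom_point() +
  geom_text(vjust = -1)
p_trend

# assumption: Pearson correlation as a measure of support for the trend
trend <- cor.test(rates$success_rate, rates$women_pref)
trend
```

With only six departments, any such test has little power, so the result is best read alongside the plot.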
Bickel, P. J., E. A. Hammel, and J. W. O’Connell. 1975. “Sex Bias in Graduate Admissions: Data from Berkeley.” Science 187 (4175): 398–404. doi:10.1126/science.187.4175.398.