The UCBAdmissions dataset (which ships with base R) contains the number of applicants to the graduate school at Berkeley for the six largest departments in fall 1973 (Bickel, Hammel, and O’Connell 1975). The data is classified by admission and sex.
The UCBAdmissions dataset is a 3-dimensional array (a 2 × 2 × 6 table with dimensions Admit, Gender and Dept).
Convert UCBAdmissions to a tibble.
## # A tibble: 24 x 4
## Admit Gender Dept n
## <chr> <chr> <chr> <dbl>
## 1 Admitted Male A 512
## 2 Rejected Male A 313
## 3 Admitted Female A 89
## 4 Rejected Female A 19
## 5 Admitted Male B 353
## 6 Rejected Male B 207
## 7 Admitted Female B 17
## 8 Rejected Female B 8
## 9 Admitted Male C 120
## 10 Rejected Male C 205
## # … with 14 more rows
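The conversion that produces the output above can be sketched as follows. This assumes the tibble package is installed; the object name ucb is an illustrative choice, not one taken from the original.

```r
library(tibble)

# UCBAdmissions ships with base R (datasets package) as a 2 x 2 x 6 table
dim(UCBAdmissions)

# as_tibble() on a table returns one row per cell, with the counts in column n
# (ucb is a hypothetical name used for illustration)
ucb <- as_tibble(UCBAdmissions)
ucb
```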
First, we would like to organise this dataset in order to summarise the overall admission to UC Berkeley for male and female applicants.
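One possible way to build this summary is to sum the counts over departments and then compute the proportion of admitted and rejected applicants within each gender. This is a sketch assuming dplyr and tibble are available; the names ucb, overall and prop are illustrative.

```r
library(tibble)
library(dplyr)

ucb <- as_tibble(UCBAdmissions)

overall <- ucb %>%
  group_by(Gender, Admit) %>%
  # sum over the six departments; keep the grouping by Gender
  summarise(n = sum(n), .groups = "drop_last") %>%
  # proportion of each admission status within a gender
  mutate(prop = n / sum(n)) %>%
  ungroup()

overall
```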
We would like to visualise the data.
Use geom_col() to represent the data as distinct columns. Use different facets to draw the counts and the proportions on the same plot.
Try adjusting the position argument of geom_col() to see the difference between the stack and fill options.
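A minimal sketch of such a plot is given below. It assumes ggplot2, dplyr, tidyr and tibble are installed; reshaping to a long format with a measure column is one way among several to get counts and proportions into facets, and all object names are illustrative.

```r
library(tibble)
library(dplyr)
library(tidyr)
library(ggplot2)

overall <- as_tibble(UCBAdmissions) %>%
  group_by(Gender, Admit) %>%
  summarise(n = sum(n), .groups = "drop_last") %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# one row per Gender x Admit x measure, so that facets can split on measure
overall_long <- overall %>%
  pivot_longer(cols = c(n, prop), names_to = "measure", values_to = "value")

p <- ggplot(overall_long, aes(x = Gender, y = value, fill = Admit)) +
  geom_col(position = "dodge") +   # try "stack" or "fill" here
  facet_wrap(~ measure, scales = "free_y")

print(p)
```

With position = "fill", the bars are rescaled to sum to 1, so the counts facet then shows the same picture as the proportions facet.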
Based on the tables and graphical representations, what would you conclude?
To perform the \(\chi^2\) test, we will use the chisq.test() function implemented in R.
Read the chisq.test() help page to see how to perform the test. Notice that there are different ways to supply the data to the function.
The help page states that chisq.test() accepts a matrix. In addition, in the details, we can read that if x is a matrix, it is considered as a 2D contingency table.
First, let’s check whether a data frame formatted like a contingency table is suitable as the x argument of the chisq.test() function.
chisq.test() expects a matrix. Take care to retain only the numeric columns before supplying your data frame to the function.
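The data frame route could look like the sketch below, assuming dplyr, tidyr and tibble are installed. The counts are spread into one column per admission status and the Gender column is dropped, so that only numeric columns remain; the object names are illustrative.

```r
library(tibble)
library(dplyr)
library(tidyr)

overall_wide <- as_tibble(UCBAdmissions) %>%
  group_by(Gender, Admit) %>%
  summarise(n = sum(n), .groups = "drop") %>%
  # one row per gender, one column per admission status
  pivot_wider(names_from = Admit, values_from = n)

# keep only the numeric columns (Admitted, Rejected)
counts <- overall_wide %>% select(where(is.numeric))

# chisq.test() converts a data frame to a matrix internally
res_df <- chisq.test(counts)
res_df
```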
The stats package in base R provides the function xtabs() to generate contingency tables that can be further used in the chisq.test() function.
xtabs() expects a formula. Construct the formula as usual: the response (here, the counts) on the left-hand side and the factors on the right-hand side.
Generate the contingency table using the formula and xtabs().
Perform the \(\chi^2\) test again using this contingency table.
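As a sketch of the xtabs() route, the counts n can be cross-tabulated by Gender and Admit directly from the long tibble; xtabs() sums n over the departments. The object names are illustrative, and the tibble package is assumed to be installed.

```r
library(tibble)

ucb <- as_tibble(UCBAdmissions)

# counts on the left-hand side, classifying factors on the right-hand side
overall_tab <- xtabs(n ~ Gender + Admit, data = ucb)
overall_tab

res_tab <- chisq.test(overall_tab)
res_tab
```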
What is your conclusion?
Now we would like to determine whether the overall acceptance trend is representative of each single department.
By default, ggplot2 will assign red to Admitted (the colours can be changed with scale_fill_manual()).
Note that, again, we don’t need to compute the proportions: position = "fill" does it for us.
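A per-department version of the plot might be sketched as follows, assuming ggplot2 and tibble are installed. The two colours are arbitrary choices made for illustration, not those of the original figure.

```r
library(tibble)
library(ggplot2)

ucb <- as_tibble(UCBAdmissions)

p_dept <- ggplot(ucb, aes(x = Gender, y = n, fill = Admit)) +
  # position = "fill" rescales each bar to proportions
  geom_col(position = "fill") +
  facet_wrap(~ Dept, nrow = 1) +
  # illustrative colours, overriding the default palette
  scale_fill_manual(values = c(Admitted = "steelblue", Rejected = "firebrick")) +
  labs(y = "Proportion of applicants")

print(p_dept)
```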
Now, perform the appropriate statistical test on the data for each single department.
Adjust the method you just used to perform the statistical test, using purrr to run it for each department.
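One way to do this, sketched under the assumption that purrr and tibble are installed, is to split the tibble by department and map the xtabs() and chisq.test() steps over the pieces. The object names are illustrative.

```r
library(tibble)
library(purrr)

ucb <- as_tibble(UCBAdmissions)

# one tibble per department, named A to F
by_dept <- split(ucb, ucb$Dept)

# build the 2 x 2 table and run the test within each department
tests <- map(by_dept, ~ chisq.test(xtabs(n ~ Gender + Admit, data = .x)))

# extract the p-value of each test by name
p_values <- map_dbl(tests, "p.value")
p_values
```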
Based on the representations and statistical tests, what is your conclusion?
This apparent discrepancy is known as Simpson’s paradox. How can we try to explain it?
Visualise the success rate (regardless of gender) for each department.
Visualise the gender preference for the applications to each department.
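Both views can be sketched with geom_col(position = "fill"), which stacks the counts and rescales them to proportions, so no manual aggregation is needed. This assumes ggplot2 and tibble are installed; the object names are illustrative.

```r
library(tibble)
library(ggplot2)

ucb <- as_tibble(UCBAdmissions)

# success rate per department, both genders pooled
p_success <- ggplot(ucb, aes(x = Dept, y = n, fill = Admit)) +
  geom_col(position = "fill") +
  labs(y = "Proportion of applicants")

# share of male and female applicants per department
p_gender <- ggplot(ucb, aes(x = Dept, y = n, fill = Gender)) +
  geom_col(position = "fill") +
  labs(y = "Proportion of applicants")

print(p_success)
print(p_gender)
```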
Finally, try to visualise a possible correlation between the women’s application preference and the success rate for the 6 departments:
Compute the success rate for each department, then the women’s application preference (women_pref). Use a group_by() instruction on the two key variables used to calculate the success rate: Dept and Admit. Then, the summarise() function returns \(6 \times 2 = 12\) lines (6 levels for Dept and 2 for Admit). summarise() also peels off one grouping level from the right. Afterwards, we can use mutate() to compute the proportions, since the data is grouped only by Dept at this stage. The same is done for the women’s preference, except that the grouping variables are Dept and Gender.
Join the women_pref tibble to the success rates: you will end up with a tibble showing, for each department, the success rate together with the women’s application preference.
Plot the women’s preference as a function of the success rate.
Test how strongly the observed trend is supported.
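The steps above can be sketched as follows, assuming dplyr, ggplot2 and tibble are installed. Apart from women_pref, the object and column names are illustrative, and the choice of a Pearson correlation test via cor.test() is an assumption, since the text does not name a specific test.

```r
library(tibble)
library(dplyr)
library(ggplot2)

ucb <- as_tibble(UCBAdmissions)

# success rate: share of admitted applicants within each department
success_rate <- ucb %>%
  group_by(Dept, Admit) %>%
  summarise(n = sum(n), .groups = "drop_last") %>%  # 12 rows, grouped by Dept
  mutate(success = n / sum(n)) %>%
  ungroup() %>%
  filter(Admit == "Admitted") %>%
  select(Dept, success)

# women's preference: share of female applicants within each department
women_pref <- ucb %>%
  group_by(Dept, Gender) %>%
  summarise(n = sum(n), .groups = "drop_last") %>%
  mutate(women = n / sum(n)) %>%
  ungroup() %>%
  filter(Gender == "Female") %>%
  select(Dept, women)

dept_summary <- inner_join(success_rate, women_pref, by = "Dept")

p_corr <- ggplot(dept_summary, aes(x = success, y = women, label = Dept)) +
  geom_point() +
  geom_text(vjust = -1)
print(p_corr)

# assumption: Pearson correlation as the test of the trend (only 6 points)
trend <- cor.test(dept_summary$success, dept_summary$women)
trend
```

With only six departments, such a test has little power, so the result is best read alongside the plot rather than on its own.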
Bickel, P. J., E. A. Hammel, and J. W. O’Connell. 1975. “Sex Bias in Graduate Admissions: Data from Berkeley.” Science 187 (4175): 398–404. doi:10.1126/science.187.4175.398.