I want to conduct statistical inference on a two-way table of frequencies (i.e., a contingency table), so I'd like to use the chi-squared test of independence. I've found a very nice streamlined procedure with the {infer} package from tidymodels.
However, I cannot follow the tutorial's example with my own data. The tutorial assumes that the starting point is a dataset with two categorical columns (i.e., of type factor):
library(dplyr, warn.conflicts = FALSE)
data(ad_data, package = "modeldata")
ad_data_gen_class <- ad_data |>
  select(Genotype, Class)
ad_data_gen_class
#> # A tibble: 333 x 2
#> Genotype Class
#> <fct> <fct>
#> 1 E3E3 Control
#> 2 E3E4 Control
#> 3 E3E4 Control
#> 4 E3E4 Control
#> 5 E3E3 Control
#> 6 E4E4 Impaired
#> 7 E2E3 Control
#> 8 E2E3 Control
#> 9 E3E3 Control
#> 10 E2E3 Impaired
#> # ... with 323 more rows
However, my starting point is already a contingency table. In other words, imagine that the following ad_data_xtab is a given:
ad_data_xtab <-
  ad_data_gen_class |>
  table()
ad_data_xtab
#> Class
#> Genotype Impaired Control
#> E2E2 0 2
#> E2E3 7 30
#> E2E4 1 7
#> E3E3 34 133
#> E3E4 41 65
#> E4E4 8 5
My question: given ad_data_xtab as the starting point of my analysis, how can I still use the {infer} procedure as demonstrated in the tutorial?
One way, I guess, would be to somehow "untable" ad_data_xtab back into ad_data_gen_class. This has at least two limitations:
- When "un-table-ing" ad_data_xtab, is it guaranteed that we get exactly ad_data_gen_class?
- Unlike ad_data_xtab, my real data's contingency table has much larger counts. If I "un-table" it, the result would have one row per observation, i.e., millions of rows, eating up my computer's memory (likely crashing it) for apparently no good reason.
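For concreteness, the "un-table" idea above can be sketched in base R as follows. This is only a minimal sketch under stated assumptions: `xtab` is a stand-in rebuilt by hand from the counts printed above, and `cells` and `long` are names made up for illustration. It suggests that re-tabulating the expanded rows recovers the same counts, while the original row order of ad_data_gen_class cannot be recovered.

```r
# Stand-in for ad_data_xtab, rebuilt by hand from the counts shown above
xtab <- as.table(matrix(
  c(0, 7, 1, 34, 41, 8,      # Impaired
    2, 30, 7, 133, 65, 5),   # Control
  ncol = 2,
  dimnames = list(
    Genotype = c("E2E2", "E2E3", "E2E4", "E3E3", "E3E4", "E4E4"),
    Class    = c("Impaired", "Control")
  )
))

# One row per cell, with the cell count in the `Freq` column
cells <- as.data.frame(xtab)

# Repeat each cell's row `Freq` times; cells with a count of 0 contribute no rows
long <- cells[rep(seq_len(nrow(cells)), cells$Freq), c("Genotype", "Class")]
rownames(long) <- NULL

nrow(long)                # one row per observation
all(table(long) == xtab)  # re-tabulating gives back the same counts
```

Because `long` has the same two factor columns as ad_data_gen_class, it could presumably be passed to the tutorial's {infer} code unchanged. Its size still grows with the total count, so this sketch does nothing about the memory concern in the second point.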
What else can I do?