Analyzing a contingency table with `{infer}`: What to do when my data is already a contingency table?

emman · November 30, 2022, 10:05am

I want to conduct statistical inference on a two-way table of frequencies (i.e., a contingency table), so I'd like to use the chi-squared test of independence. I've found a very nice streamlined procedure with the {infer} package from tidymodels.

However, I cannot follow the tutorial's example given my own data. The tutorial assumes that the starting point is a dataset with two categorical columns (i.e., of type factor):

library(dplyr, warn.conflicts = FALSE)

data(ad_data, package = "modeldata")

ad_data_gen_class <- ad_data |> 
  select(Genotype, Class)

ad_data_gen_class
#> # A tibble: 333 x 2
#>    Genotype Class   
#>    <fct>    <fct>   
#>  1 E3E3     Control 
#>  2 E3E4     Control 
#>  3 E3E4     Control 
#>  4 E3E4     Control 
#>  5 E3E3     Control 
#>  6 E4E4     Impaired
#>  7 E2E3     Control 
#>  8 E2E3     Control 
#>  9 E3E3     Control 
#> 10 E2E3     Impaired
#> # ... with 323 more rows

However, my starting point is already a contingency table. In other words, imagine that the following ad_data_xtab is a given:

ad_data_xtab <- 
  ad_data_gen_class |>
  table()

ad_data_xtab
#>         Class
#> Genotype Impaired Control
#>     E2E2        0       2
#>     E2E3        7      30
#>     E2E4        1       7
#>     E3E3       34     133
#>     E3E4       41      65
#>     E4E4        8       5

My question: given ad_data_xtab as the starting point of my analysis, how can I nevertheless use the {infer} procedure as demonstrated in the tutorial?

One way, I guess, would be to somehow "untable" ad_data_xtab back into ad_data_gen_class. This has at least two limitations:

1. When "un-table-ing" ad_data_xtab, is it guaranteed that we get exactly ad_data_gen_class?
2. Unlike ad_data_xtab, my real data's contingency table has much larger counts. If I were to "un-table" it, the result would be a data frame with millions of rows, eating up my computer's memory (and likely crashing it) for apparently no good reason.

What else can I do?

simoncouch · December 6, 2022, 10:44pm

Glad to hear that you've appreciated working with the package! A co-author responded on the cross-posted GitHub issue, and I've excerpted his response here:

I think that depends on how you go about untabling. My inclination would be to try pivot_longer() then uncount(). I'm pretty sure they constitute reliable inverse operations to table().
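A minimal sketch of that round trip, for illustration only. It rebuilds the table from the counts printed in the question (so it does not need the modeldata package) and uses as.data.frame() in place of pivot_longer(), since a table object converts directly to one row per cell; the names ad_data_long and ad_data_untabled are made up for this example. Untabling recovers the same counts and factor levels, but not the original row order.

```r
library(tidyr)

# Rebuild the contingency table from the counts shown in the question
ad_data_xtab <- as.table(matrix(
  c(0L, 7L, 1L, 34L, 41L, 8L,    # Impaired
    2L, 30L, 7L, 133L, 65L, 5L), # Control
  ncol = 2,
  dimnames = list(
    Genotype = c("E2E2", "E2E3", "E2E4", "E3E3", "E3E4", "E4E4"),
    Class = c("Impaired", "Control")
  )
))

# One row per cell, with the cell count in a Freq column
ad_data_long <- as.data.frame(ad_data_xtab)

# Repeat each row Freq times and drop the Freq column
ad_data_untabled <- uncount(ad_data_long, Freq)

# Tabulating the untabled data should reproduce the original counts
all(table(ad_data_untabled) == ad_data_xtab)
nrow(ad_data_untabled)
```

The untabled data frame has one row per observation, so its size grows with the total count, which is the memory concern raised in the question.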

Uf, I think here you're running into a fundamental limitation of the way infer works right now. It's built so that it's data frame in, data frame out. That means that you'll need to process that table into a data frame before sending it through an infer pipeline. It also means that the output of the generate() function can be a very large data frame (it has the number of rows in the original data frame * reps). There are benefits to this approach - it allows for inspection of those data frames generated under the null - but there are costs in terms of performance. We had at one point discussed adding an option that would do the simulation through an efficient iteration process, bypassing the big data frame, but haven't done that yet (to my knowledge).
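To make the size issue concrete, here is a rough sketch of an infer pipeline on untabled data. The data frame ad_counts and the other object names are made up for this example, the counts are copied from the table in the question, and reps = 100 is an arbitrary small value chosen to keep the example quick.

```r
library(tidyr)
library(infer)

# Cell counts copied from the question, one row per cell
ad_counts <- data.frame(
  Genotype = rep(c("E2E2", "E2E3", "E2E4", "E3E3", "E3E4", "E4E4"), times = 2),
  Class = rep(c("Impaired", "Control"), each = 6),
  Freq = c(0L, 7L, 1L, 34L, 41L, 8L, 2L, 30L, 7L, 133L, 65L, 5L),
  stringsAsFactors = TRUE
)

# One row per observation (333 rows for this table)
ad_data_untabled <- uncount(ad_counts, Freq)

# Permutation resamples under the null of independence
null_dist <- ad_data_untabled |>
  specify(Genotype ~ Class) |>
  hypothesize(null = "independence") |>
  generate(reps = 100, type = "permute")

# generate() stacks every replicate: rows = original rows * reps
nrow(null_dist)

# One chi-squared statistic per replicate
null_stats <- null_dist |>
  calculate(stat = "Chisq")
```

With millions of observations and a typical number of replicates, the stacked data frame from generate() becomes very large, which is the cost described above.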

This might be a place where chisq.test() makes more sense. It permits tabular inputs and defaults to using the asymptotic chi-square distribution of the test statistic, which should be a very good approximation if your counts are very large.
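As a sketch of that alternative, chisq.test() from base R can be called on the table directly. The table below is rebuilt from the counts in the question, and res is just an illustrative name. For this small example R is likely to warn that the chi-squared approximation may be incorrect, because some expected counts are small; that should not be a concern with the much larger counts described in the question.

```r
# Rebuild the contingency table from the counts shown in the question
ad_data_xtab <- as.table(matrix(
  c(0L, 7L, 1L, 34L, 41L, 8L,    # Impaired
    2L, 30L, 7L, 133L, 65L, 5L), # Control
  ncol = 2,
  dimnames = list(
    Genotype = c("E2E2", "E2E3", "E2E4", "E3E3", "E3E4", "E4E4"),
    Class = c("Impaired", "Control")
  )
))

# Chi-squared test of independence on the table itself, no untabling needed
res <- chisq.test(ad_data_xtab)

res$statistic  # test statistic
res$parameter  # degrees of freedom: (6 - 1) * (2 - 1) = 5
res$p.value    # p-value from the asymptotic chi-squared distribution
res$expected   # expected counts under independence
```

If a simulation-based p-value is still wanted, chisq.test() also accepts simulate.p.value = TRUE, which works from the table rather than from a data frame of individual rows.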
