So, I have the following reshape function:
library(dplyr)
library(tidyr)

reshaped <- function(df) {
  df %>%
    select(subjects, diseases) %>%
    group_by(subjects, diseases) %>%   # one group per subject-disease pair
    count() %>%                        # n = number of occurrences in that pair
    ungroup() %>%
    spread(diseases, n, drop = TRUE)   # wide: one column per disease, NA where absent
}
which essentially reshapes the data from long to wide, filling each cell with the number of occurrences of that disease for that subject, or NA where the disease never occurs. Now, when I spread the full data and print it to the console, I get something like:
subjects  disease1  disease2  disease3  ...
 1111111         2         3         1
 1111112        21         4         2
 1111115         2         1        15
 1111117         1         3         1
Now, the problem is that these numbers are false. Subject 1111111 does not have 2 instances of disease1 or 3 instances of disease2, and so on; in the expected output, all of these entries should be NA. In fact, disease1, disease2 and disease3 are super-rare diseases that hardly occur anywhere in the data; they just happen to come first in alphabetical order (they all start with A).
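For reference, this is the kind of direct check on the long data that tells me the wide counts are wrong (a sketch, using the subjects/diseases column names from the function above and the disease names from the example output):

# Count the rows for one subject-disease pair in the long data;
# for subject 1111111 and disease1 this does not give the 2 that
# the wide table reports.
df %>%
  filter(subjects == 1111111, diseases == "disease1") %>%
  nrow()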
There are some other things to be aware of. First, this is an extremely large dataset: the reshaped version has around 38,000 columns and 1.2 million rows, I use around 400 GB of RAM to process it (in chunks), and it weighs around 117 GB in compressed RDS format. I do not get any memory errors, however. Interestingly, when I sample, say, 100,000 random subjects (rows), the issue disappears and the numbers I see are quite realistic (mostly NAs). This suggests that the problem is not with the function itself or with the data, but with the number of observations.
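The sampling check looks roughly like this (a sketch; the chunked processing is omitted and sampled_subjects is just an illustrative name):

# Draw 100,000 random subjects and reshape only their rows;
# on this subset the resulting counts look realistic (mostly NA).
set.seed(1)
sampled_subjects <- df %>%
  distinct(subjects) %>%
  sample_n(100000)

df %>%
  semi_join(sampled_subjects, by = "subjects") %>%
  reshaped()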
Can anyone suggest why this might be happening, or how I can solve it? To me it looks like some form of overflow, where R does not know what to do when the data gets too big and just puts garbage values in. The problem with this hypothesis is that these values, although unreasonable, are not entirely random: they are all less than 10 or so.
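For reference, here is a small toy example of the behaviour I expect from the function, with NA wherever a subject never has a given disease (made-up values):

# Toy long data: two subjects, three disease records in total.
toy <- data.frame(
  subjects = c(1, 1, 2),
  diseases = c("A", "A", "B"),
  stringsAsFactors = FALSE
)

reshaped(toy)
# Expected: subject 1 has A = 2 and B = NA,
#           subject 2 has A = NA and B = 1.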