So, I have the following reshape function:
library(dplyr)
library(tidyr)

reshaped <- function(df) {
  df %>%
    select(subjects, diseases) %>%
    group_by(subjects, diseases) %>%   # one group per subject-disease pair
    count() %>%                        # n = number of occurrences in that pair
    ungroup() %>%
    spread(diseases, n, drop = TRUE)   # wide: one column per disease, NA where absent
}
which essentially reshapes the data from long to wide, filling each cell with the number of occurrences of that disease for that subject, or NA where the disease never occurs. Now, when I spread the full data and print it to the console, I get something like:
subjects  disease1  disease2  disease3  ...
 1111111         2         3         1
 1111112        21         4         2
 1111115         2         1        15
 1111117         1         3         1
Now, the problem is that these numbers are false. Subject 1111111 does not have 2 instances of disease1 or 3 instances of disease2, and so on; in the expected output, all of these entries should be NA. In fact, disease1, disease2 and disease3 are super-rare diseases that hardly occur anywhere in the data; they just happen to come first in alphabetical order (they all start with A).
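For reference, this is the kind of direct check on the long data that tells me the wide counts are wrong (a sketch, using the subjects/diseases column names from the function above and the disease names from the example output):

# Count the rows for one subject-disease pair in the long data;
# for subject 1111111 and disease1 this does not give the 2 that
# the wide table reports.
df %>%
  filter(subjects == 1111111, diseases == "disease1") %>%
  nrow()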
There are some other things to be aware of. First, this is an extremely large dataset: the reshaped version has around 38,000 columns and 1.2 million rows, I use around 400 GB of RAM to process it (in chunks), and it weighs around 117 GB in compressed RDS format. I do not get any memory errors, however. Interestingly, when I sample, say, 100,000 random subjects (rows), the issue disappears and the numbers I see are quite realistic (mostly NAs). This suggests that the problem is not with the function itself or with the data, but with the number of observations.
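The sampling check looks roughly like this (a sketch; the chunked processing is omitted and sampled_subjects is just an illustrative name):

# Draw 100,000 random subjects and reshape only their rows;
# on this subset the resulting counts look realistic (mostly NA).
set.seed(1)
sampled_subjects <- df %>%
  distinct(subjects) %>%
  sample_n(100000)

df %>%
  semi_join(sampled_subjects, by = "subjects") %>%
  reshaped()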
Can anyone suggest why this might be happening, or how I can solve it? To me it looks like some form of overflow, where R does not know what to do when the data gets too big and just puts garbage values in. The problem with this hypothesis is that these values, although unreasonable, are not entirely random: they are all less than 10 or so.
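For reference, here is a small toy example of the behaviour I expect from the function, with NA wherever a subject never has a given disease (made-up values):

# Toy long data: two subjects, three disease records in total.
toy <- data.frame(
  subjects = c(1, 1, 2),
  diseases = c("A", "A", "B"),
  stringsAsFactors = FALSE
)

reshaped(toy)
# Expected: subject 1 has A = 2 and B = NA,
#           subject 2 has A = NA and B = 1.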