dplyr memory usage blows up

ibayramli · April 15, 2020, 11:23am

So, I have a dataset df 1.4 GBs big, and I am trying to reshape it using the following function:

reshaped <- function(df){
  df %>%
    select(subject_num, concept_code) %>% 
    group_by(subject_num, concept_code) %>%
    count() %>% 
    spread(concept_code, n, fill=0)
  
  return(df)
}

df = read_rds('df.RDs') %>% 
         mutate(a=paste(a, b, sep="|"))
df <- reshaped(df)
write_rds(df, 'df_reshaped.RDs')

I get: Error: cannot allocate vector of size 1205.6 GB. While debugging I discovered that the code gets stuck at the spread statement inside the reshaped function. I don't see how a dataset of 1.4 GB could ask for 1205.6 GB of memory inside the dplyr code that I wrote. Could anyone explain why is this happening and suggest a possible solution?

P.S. I send this code to a cluster to run with 400 GB RAM, so high but more reasonable memory usage would be fine.

THX1188 · April 15, 2020, 3:27pm

R RDS Files, are in some way gzip compressed files

if now your compressed file alredy has the size of 1,4G then
unpacking it in memory you run into quick problems

/THX1188

ibayramli · April 15, 2020, 4:30pm

Yeah, but I never had any issues with data manipulations on the uncompressed version with the amount of RAM available to me. Unpacked version usually occupies around 50 GB or so of memory for which I have enough space. Also, the function runs till spread without any problems, so compression is not an issue here.

siddharthprabhu · April 15, 2020, 4:41pm

These are kind of random suggestions but could you try adding an ungroup() before calling spread(); just to ensure that we're not trying to spread a grouped data frame.

Also, count() automatically calls group_by() so you could remove the latter and just add the grouping variable names into the count() function.

ibayramli · April 15, 2020, 8:06pm

tried, didn't work((

nirgrahamuk · April 15, 2020, 8:36pm

Is pivot_wider any better ?

siddharthprabhu · April 16, 2020, 5:01am

I just realized that your function returns the input data frame unchanged. That's because you haven't assigned the result of the spread() operation back to df.

Since R functions return the last value that is computed, you could fix this by removing the return(df) statement altogether. Could you try that?

ibayramli · April 16, 2020, 5:04am

Good catch! But I don't think this would fix anything as the function doesn't even get to the return statement. It gets stuck at spread.

ibayramli · April 16, 2020, 5:07am

pivot_wider gave me this error:

Error: Can't assign to elements that don't exist.
[31mx[39m The locations 1237409, 2474817, 3712225, 4949634, 6187042, etc. don't exist.
[34mℹ[39m There are only 1 element.
Backtrace:
[90m     [39m█
[90m  1. [39m├─global::reshaped(dx)
[90m  2. [39m│ └─`%>%`(...)
[90m  3. [39m│   ├─base::withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
[90m  4. [39m│   └─base::eval(quote(`_fseq`(`_lhs`)), env, env)
[90m  5. [39m│     └─base::eval(quote(`_fseq`(`_lhs`)), env, env)
[90m  6. [39m│       └─`_fseq`(`_lhs`)
[90m  7. [39m│         └─magrittr::freduce(value, `_function_list`)
[90m  8. [39m│           ├─base::withVisible(function_list[[k]](value))
[90m  9. [39m│           └─function_list[[k]](value)
[90m 10. [39m│             └─tidyr::pivot_wider(...)
[90m 11. [39m│               └─tidyr::pivot_wider_spec(...)
[90m 12. [39m│                 └─vctrs::`vec_slice<-`(...)
[90m 13. [39m└─
In addition: Warning messages:
1: In nrow * ncol : NAs produced by integer overflow
2: In nrow * (val_id$col - 1L) : NAs produced by integer overflow
3: In val_id$row + nrow * (val_id$col - 1L) :
  NAs produced by integer overflow

nirgrahamuk · April 16, 2020, 6:15am

Thanks. I'm curious about two things.
At the point where you've created a df with count

can you get the dimensions of that frame ? dim()?
can you tell the highest count value in there ? Perhaps with max()?

system · May 7, 2020, 6:15am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.