Sudden crashes in R Notebook when working with large data

Heads up: this was cross-posted on Stack Overflow about two weeks ago with no response.

I have searched thoroughly for answers to this issue, which has plagued me for a long time, but have not found any solutions. As the problem is random in nature, I unfortunately cannot give a minimal reproducible example.

In an R Notebook (run interactively within RStudio) I sometimes encounter the following error:

cannot open compressed file '/home/my_user/.rstudio-desktop/notebooks/6CD068F3-my_analysis/1/6C3E483C332D2BCE/c3ijbz2c9hida_t/_rs_rdf_5924697c2a9a.rdf', probable reason 'No such file or directory'Error in gzfile(file, "wb") : cannot open the connection

Observations on the issue:

  • The error appears seemingly at random: a part of the code might run fine at first, and then the error suddenly appears with no code change
  • Once it has appeared, I can no longer run any commands until I restart the R session
  • The error occurs when I am working with relatively large datasets (~100 MB)
  • Things work fine again after restarting the R session, and I can generally run through the analysis if I run it in one go
  • The issue tends to appear when rerunning certain parts of the analysis repeatedly (for example, when debugging a specific part)

Additional observation: I often debug the code using the browser() statement, and it is not uncommon for this issue to appear after some time spent debugging interactively.
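
Roughly, the workflow looks like this (the function below is a made-up stand-in, not code from my actual analysis):

```r
# Illustrative only: pause inside a function, inspect objects interactively
# (printing data frames so they render below the chunk), then continue with `c`
inspect_subset <- function(df) {
  browser()   # execution pauses here when the function is called from a chunk
  head(df)    # printed output is rendered below the notebook cell
}
inspect_subset(mtcars)
```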

Any input on this hair-pulling issue would be highly appreciated.

RStudio generates files of this form in Notebook mode when rendering, for example, an R data.frame. Normally, we only record a subset of rows; how much is recorded is controlled by the max.print option.

What is the output of getOption("max.print") in your session?

As an aside, is there any chance that you're running low on disk space on your system? (In case that's preventing RStudio from writing out some of the files it's using here.)
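
For reference, both can be checked from the console; the df call below assumes a Unix-like system, and the path shown is just an example:

```r
# How many entries get recorded when a data frame is rendered in a notebook chunk
getOption("max.print")

# Free space on the filesystem holding the notebook cache
# (Unix-like systems only; adjust the path to wherever your .rstudio-desktop lives)
system("df -h ~/.rstudio-desktop")
```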

For the max.print option:

> getOption("max.print")
[1] 1000

I currently have 8.4 GB free on the drive where the .rstudio-desktop folder is stored, and normally have 5-10 GB free, so it does not seem to be acutely full. I have also encountered this issue on my laptop (also while working with large datasets), where I have a similar amount of free space. The derivative data files generated in the notebook can take a couple of gigabytes, so if everything held in memory were written to disk, there might not be enough space.

I will try to clear up the drive so that I have plenty of space, and see whether the issue pops up again.

Alright, I managed to reproduce the error after freeing up ~30 GB of space.

To generate the error I did the following:

  1. Ran through a complete analysis, loading approximately 1-2 GB into memory (according to gc())
  2. Put a browser() statement inside a function
  3. Executed that function, printing various values within it (including the data frame itself, which with my current settings is shown below the cell in the notebook document)
  4. Went to the source of the function (while still in browser mode) and added print statements to it
  5. Tried to continue interacting with it, at which point I got the crash and had to restart R

I tried reproducing the procedure with a minimal function, but there I did not run into any problems, even after extensively editing it while in browser mode.
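
The minimal test was along these lines (reconstructed from memory, so treat it as illustrative only):

```r
# Minimal test: print a data frame from inside the debugger, then continue
f_minimal <- function() {
  browser()
  mtcars
}
f_minimal()
```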

Maybe not relevant, but I am posting the original and updated functions for reference (I don't think the exact code matters here; I have run into the error in a range of different cases).

```r
# Requires dplyr (for %>%, filter, and select)
get_protein_and_batch <- function(rdf, ddf, protein, target_batch, name) {

    browser()
    filter_ddf <- ddf[ddf$batch == target_batch, ]
    target_vals <- rdf %>%
        filter(query_short == protein) %>%
        dplyr::select(filter_ddf$sample) %>%
        unlist()
    data.frame(
        value = target_vals,
        batch = name,
        fertility = filter_ddf$fertility)
}
```
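
In the analysis the function gets called along these lines (the argument values here are invented placeholders, just to show the shape of the call):

```r
# Hypothetical call; the protein ID and batch labels are placeholders
res <- get_protein_and_batch(rdf, ddf,
                             protein = "PROT1",
                             target_batch = "batch_A",
                             name = "batch_A")
```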

After making the changes (and obtaining the crash):

```r
get_protein_and_batch <- function(rdf, ddf, protein, target_batch, name) {

    browser()
    filter_ddf <- ddf[ddf$batch == target_batch, ]
    print(filter_ddf)
    print(filter_ddf)
    target_vals <- rdf %>%
        filter(query_short == protein) %>%
        dplyr::select(filter_ddf$sample) %>%
        unlist()
    data.frame(
        value = target_vals,
        batch = name,
        fertility = filter_ddf$fertility)
}
```

Any further input is highly appreciated. Thank you for your time!

Thanks! I am now having some success in reproducing this bug. Here are the steps I took. First, create a document with the following contents:

```{r}
f <- function() {
  browser()
  mtcars
}
```

```{r}
f()
```

Run the two chunks. You'll be placed into the R debugger after running the second chunk. While the debugger is running, go back to the first chunk, and highlight and delete the browser() text. Then, continue execution.

Some small percentage of the time, I will see:

cannot open compressed file '/Users/kevinushey/scratch/blogra/.Rproj.user/shared/notebooks/537F6E18-doc/1/00B78924B974880D/c6azm2iax03j4_t/_rs_rdf_14a4658a9e341.rdf', probable reason 'No such file or directory'Error in gzfile(file, "wb") : cannot open the connection

printed into the console, with chunks no longer being runnable thereafter.

Frustratingly, this still only reproduces maybe ~20% of the time, so it's definitely an odd one!


I've filed a bug report: https://github.com/rstudio/rstudio/issues/6260


Yes, that looks exactly like it, thanks!
