Reading feather files from S3 from EC2 instance on Connect

I am running RStudio Connect with R 3.5.1 on an EC2 instance. For my Shiny apps I'm reading large feather files, the largest being almost 2 GB. When I read the files locally, feather is by far the fastest format, but when I read from the server, feather files take more than 7 times as long as they do locally, and RData and Rds are faster. Below are the benchmark times for local reads and for reads on the Connect server.

Local results (times in seconds); feather is the fastest:

expr         min  lq   mean  median  uq   max  neval
readCSV      13   13   14    13      14   16   10
readrCSV     5.1  5.2  5.7   5.4     5.9  7.5  10
fread        1.8  1.8  1.9   1.9     2.1  2.3  10
loadRdata    7.3  7.4  7.6   7.5     7.7  7.8  10
readRds      7.3  7.4  7.5   7.5     7.6  7.8  10
readFeather  1.3  1.3  1.7   1.5     1.8  3.6  10
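For reference, a benchmark like the one above can be produced with the microbenchmark package. This is a sketch; the file names are placeholders, not the poster's actual files:

```r
# Compare read times for the same data saved in several formats.
# File names below are assumptions for illustration.
library(microbenchmark)

microbenchmark(
  readCSV     = read.csv("data.csv"),
  readrCSV    = readr::read_csv("data.csv"),
  fread       = data.table::fread("data.csv"),
  loadRdata   = load("data.RData"),
  readRds     = readRDS("data.rds"),
  readFeather = feather::read_feather("data.feather"),
  times = 10,
  unit  = "s"
)
```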

Here are the results run from Connect, which includes pulling the file from S3:

In the logs I do see that I frequently get the following error:

Error in value[[3L]](cond) : IO error: Memory mapping file failed
06/20 02:19:46.934
Calls: local ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous>
06/20 02:19:46.934
Execution halted

These three issues seem to be somewhat related to what I'm seeing, but they sound like problems with the feather format itself, and it appears they will likely not be resolved.

Because all my apps have fairly large data sets, I'm hoping I can find a solution that gets me much closer to the 1.7 s average read time I see locally. 14 seconds is not ideal for the end-user experience. Any thoughts on this would be appreciated!

When you want to work with larger datasets inside a Shiny app, we generally recommend investigating other options, including:

  • Put your data inside a database, then bring into memory only the rows you need
  • Create aggregated summaries with a scheduled R Markdown, then read the aggregations into your shiny app
  • Use caching, memoisation and shiny promises to at least maintain responsiveness to other users
  • Read the data only once, in the Shiny app global scope, and use a Shiny reactive poll to refresh the data as necessary
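The last option above might look something like this sketch: read once per R process in the global scope, then use `reactivePoll` to refresh only when the file on disk actually changes. The file path and polling interval are assumptions:

```r
library(shiny)

# global.R -- runs once per R process, shared by all sessions
dat <- feather::read_feather("data/app_data.feather")

# server.R
server <- function(input, output, session) {
  app_data <- reactivePoll(
    intervalMillis = 60 * 1000,   # check once a minute
    session = session,
    # cheap check: has the file's modification time changed?
    checkFunc = function() file.mtime("data/app_data.feather"),
    # expensive read: only runs when checkFunc's value changes
    valueFunc = function() feather::read_feather("data/app_data.feather")
  )
}
```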

In your specific case, you may also want to investigate downloading the feather file to the local machine, then importing from local, instead of over the network.

Thanks @andrie for your thoughts.

  1. I haven't investigated the DB option but that may be a future path.
  2. I'm currently aggregating and filtering the data as much as I can with scheduled R markdown, putting the data in S3 and reading those aggregations from S3. (we have a lot of data).
  3. I've researched caching/promises etc. It seems like I'd have to fully revamp my code to use promises, so I haven't done this yet, but it is potentially a solution.
  4. I am reading in the data globally and use reactive poll to refresh the data but it still takes ~20-30 seconds for subsequent users to load the app, so this hasn't helped with time a ton.

Can you clarify the last piece? If I download the feather file locally, wouldn't that be the same as using the output files from my R Markdown ETL process directly instead of pushing them to S3? That would mean storing a copy of all my data on the server instead of elsewhere. Or are you saying to temporarily copy it locally, read the file, and then remove it? Am I understanding this correctly?

Thank you very much for your thoughts!

It just makes me think of this post:

You can try the new parquet format accessible from R with the new arrow :package:
I don't know if it could help here

Also, there is the fst format that may be competitive if you want to stay with a file format.
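For the fst suggestion, usage is a small sketch like this: `write_fst()` takes a compression level from 0 to 100, and `read_fst()` can also read a subset of rows or columns without loading the whole file. The file name is an assumption:

```r
library(fst)

# Write with maximum compression (0-100 scale, default is 50)
write_fst(mtcars, "mtcars.fst", compress = 100)

# Full read
full <- read_fst("mtcars.fst")

# Partial read: two columns, first 10 rows only
part <- read_fst("mtcars.fst", columns = c("mpg", "cyl"), from = 1, to = 10)
```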

A database would be a good fit: send your data in and retrieve only what is needed, when needed.

Hope it helps.

Yes, that's my suggestion. If I understand your benchmarks correctly, then reading the feather file from the local file system is relatively fast, but very slow from S3.

So my suggestion is to download the file in the background, read it and discard it once done. You should be able to do the download part asynchronously (using promises), if you design the app carefully.
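The download-read-discard approach could be sketched like this, using the aws.s3 package. The bucket and object names are assumptions:

```r
library(aws.s3)
library(feather)

# Download the feather file from S3 to a temp file, read it from local
# disk (so feather can memory-map it), then discard the local copy.
read_feather_from_s3 <- function(object, bucket) {
  tmp <- tempfile(fileext = ".feather")
  on.exit(unlink(tmp))                  # discard the local copy when done
  save_object(object = object, bucket = bucket, file = tmp)
  read_feather(tmp)
}

dat <- read_feather_from_s3("app_data.feather", "my-data-bucket")
```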

Also, regardless of how large your data is, I suspect that any specific user is most likely only looking at a portion of that data at any given time. You should consider whether you can use a database to retrieve only the portion the user cares about at that time, rather than reading all of it for every user.
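A minimal sketch of that database pattern, pulling only the slice the current user needs. The connection details, table, and column names are all assumptions:

```r
library(DBI)

# Connect to the database (hypothetical host/database)
con <- dbConnect(RPostgres::Postgres(),
                 host = "db.example.com", dbname = "appdata")

# Parameterised query: fetch just the rows for the selected region,
# rather than reading the full data set into memory for every user
slice <- dbGetQuery(
  con,
  "SELECT * FROM sales WHERE region = $1",
  params = list(input$region)
)

dbDisconnect(con)
```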


You said the largest feather file is 2 GB. How large is the corresponding RDS file? I have the impression that most of the time for feather files is spent downloading the data, since RDS files are compressed and feather files are not.

@cderv thank you for pointing out those other two options I will look into it and add those to my benchmark. I'll share results here, hopefully can dedicate some time to testing that out in the next few days.

@andrie I will look into the promises piece/downloading the file in the background and see if that could be a solution. The database option is definitely something I could do in the future as that would be a more ideal solution, just more time consuming to setup up front.

@rstub the RDS file is about a quarter of the size. I understand these are compressed, but they typically read much slower, which is why I went with feather. I'd like the S3 read and download to be closer to the time it takes to read a file that large locally, i.e. ~2 seconds.

Thank you all for your ideas and feedback. I will hopefully test some things out this week and report back here. Appreciate the help!


Quick update: I've been trying to deploy a test to RStudio Connect that reads fst files from S3, and I get the following error on deploy. I tried changing all factors to characters in my data frame and still get the error on Connect. It works locally; performance isn't better locally, but it might be on the server, if I could get it to deploy.

09/04 13:49:01.855
Error in `levels<-`(`*tmp*`, value = as.character(levels)) :
09/04 13:49:01.855
factor level [8] is duplicated

I then went on to try the new arrow package. It is again slower locally, and I cannot get this to deploy to Connect either. I tried including the Rcpp library, not including it, and also adding a call to arrow::install_arrow(), but I still get the error message below.

09/16 13:58:06.293
Error in io___MemoryMappedFile__Open(fs::path_abs(path), mode) :
09/16 13:58:06.293
Cannot call io___MemoryMappedFile__Open(). Please use arrow::install_arrow() to install required runtime libraries.
09/16 13:58:06.293
Calls: local ... shared_ptr -> shared_ptr_is_null -> io___MemoryMappedFile__Open

Has anyone gotten fst or the new arrow feather/parquet formats to work on Connect?

Additionally, I did go through the tasks outlined in this document to get the necessary libraries installed on our Connect server. Still no luck.

An alternative for reading parquet files would be this package: no dependencies, and quite a bit faster.

@rstub, that would be great, but it only provides a way to read parquet, so I would need a way to write the parquet files from Connect in my ETL. I'll continue to research options; thank you for pointing this out.

I wanted to share an update: I got fst files to work on Connect. I compared 0 compression (readFST0), 50 compression (the default, readFST), and 100 compression (readFST100) against feather. With maximum compression I can read files about 4 times faster. I will look to implement this in my Shiny apps and do additional profiling to measure the load-time improvement. Thanks for the help on this; I'm hoping this will be a significant improvement in the meantime.


Even though this thread is old, I thought it would be worthwhile to share the updated benchmarking I was able to do. After watching Neal Richardson's talk at RStudio Conf, I was hopeful I could finally get Apache Arrow to deploy to RStudio Connect. With the new version on CRAN as of Friday, I was able to benchmark arrow readFeather and arrow readParquet. It appears parquet is the fastest option, with fst compressed to 100 the second-best option.
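For anyone landing here later, reading and writing parquet with the arrow package looks roughly like this sketch (file names are placeholders):

```r
library(arrow)

# Write a data frame to parquet
write_parquet(mtcars, "mtcars.parquet")

# Read it back
df <- read_parquet("mtcars.parquet")

# read_parquet() can also select columns at read time, which helps
# keep memory down with large files
df2 <- read_parquet("mtcars.parquet", col_select = c("mpg", "cyl"))
```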

For reference, these are the associated file sizes in each format: