Stability of feather format for data storage

The feather format for storing data is awesome, but the post introducing it last year warned that it might not be stable for long-term data storage (https://blog.rstudio.com/2016/03/29/feather/). Neither the updates I've seen since (Wes McKinney, "2017 Outlook: pandas, Arrow, Feather, Parquet, Spark, Ibis") nor the current GitHub repo (wesm/feather, "Feather: fast, interoperable binary data frame storage for Python, R, and more powered by Apache Arrow") repeats that warning, but they also don't say that it is now suitable for storage. So I'm basically asking: if I save something to the feather format, how likely is it that reading it into R and using it a year or so from now will cause problems?


I don't think that I can answer this directly, but I might be able to add some additional thoughts.

I think that the future of Feather is an even tighter integration with Apache Arrow. That might be why they can't guarantee stability right now: not everything has been built or integrated yet. Specifically, from one of the blog posts you linked to:

"I'm planning to merge the Feather implementation into the Arrow codebase, which will enable me to provide better performance and new features to Feather users."

There is also some interesting discussion in a Feather issue, in particular a few comments from Wes:

"This will be easy to experiment with as soon as we have preliminary Rcpp bindings for the Arrow C++ libraries (where Feather's future lies). I stand ready to add more features (multithreaded reads for R, column buffer compression, etc.) but I need the R community to help out with packaging and development lifecycle concerns. The Python version of Feather got significantly faster in many cases after moving to the Arrow codebase."

"The really interesting future for Arrow in the Python and R worlds will be as a language-agnostic native format for in-memory analytics. So you won't need to convert an Arrow table to an R data frame or pandas data frame, simply evaluate analytics (like dplyr expressions) directly on the Arrow memory (which may live in a memory map or some other virtual memory space / shared memory)"

These things are really exciting, since dplyr already seems to be set up for backend-agnostic data manipulation. I have no idea how long it might take for someone in the R community to get Rcpp bindings to Arrow's C++ libraries working, but I did see that Jim Hester has started a small package, called rarrow, that tinkers with the idea. I assume that when it happens, it will come from the RStudio team unless someone else steps up.


The community should pool its pun resources to come up with a punny name for rarrow. Maybe sombrarrow? I'm picturing a table wearing a sombrero.

I'm also fond of arrrow. Maybe bow?