The purpose of Apache Arrow?

I have been trying to understand Apache Arrow for a while now, and I can't seem to get my head around what it is for.

The first sentence on their website is "Apache Arrow is a cross-language development platform for in-memory and larger-than-memory data".

Reading throughout the website and watching many YouTube videos (which all use the same NY taxi data), I am led to believe that I can do data analysis and manipulation on data that doesn't fit in memory (the NY taxi data is something like 100-200 GB, and people analyse it on their laptops), and at much faster speeds.

I now have 3 data sets saved on the R server in .parquet format (each around 200 MB). I then read the data in both ways:

read_parquet(the_file_to_read, as_data_frame = FALSE)
read_parquet(the_file_to_read, as_data_frame = TRUE)

Both options take the same amount of memory (25 GB) when read. Why would I even want to read data with as_data_frame = FALSE?

Reading through the Apache Arrow website, I got the idea it should take very little memory.

Is it only good for storing files?


You're absolutely correct that Arrow can be used for working with larger-than-memory data at fast speeds.

When you read in a file using the read_*() functions (e.g. read_parquet(), read_csv_arrow(), etc.), they load the data into memory. The data is first read in as an Arrow Table and then pulled into your R session as a tibble. If you set as_data_frame = FALSE, you just read it in as an Arrow Table without pulling it into your R session. This can take up less space than the tibble and take advantage of some of the ways Arrow stores data efficiently internally.
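As a minimal sketch of the difference (the file path here is a placeholder; substitute one of your own Parquet files):

```r
library(arrow)

# Read as an Arrow Table: the data stays in Arrow's memory format,
# outside the R heap, until you explicitly convert it
tbl <- read_parquet("your_file.parquet", as_data_frame = FALSE)
class(tbl)  # an Arrow Table, not a data.frame

# Convert to a tibble only when you actually need it in R
df <- as.data.frame(tbl)
```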

Generally, though, this isn't how you'll get the most benefit from using Arrow. If you use open_dataset() to work with the data, you can create dplyr pipelines which only pull data into memory once you call collect() at the end. There's more information in this vignette.
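For example, a lazy pipeline over a directory of Parquet files might look like this (the directory path and column names here are assumptions based on the NYC taxi data, not your actual files):

```r
library(arrow)
library(dplyr)

# open_dataset() scans the files' metadata but reads no rows yet
ds <- open_dataset("path/to/taxi_parquet/")

result <- ds %>%
  filter(passenger_count > 1) %>%            # evaluated by Arrow, not R
  group_by(payment_type) %>%
  summarise(mean_fare = mean(fare_amount)) %>%
  collect()                                  # rows enter R memory only here
```

Only the aggregated result ever materialises as a tibble, so the full dataset never has to fit in memory.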

Further benefits come from taking advantage of partitioning: instead of having large files, break your dataset into smaller files, which Arrow can treat as a single dataset. As Parquet files have metadata built into them, Arrow can work out which files it needs to read based on any filters in your dplyr pipeline and the metadata in each file.
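A quick sketch of writing and querying a partitioned dataset (assuming, hypothetically, that the data has a `year` column to partition on):

```r
library(arrow)
library(dplyr)

ds <- open_dataset("path/to/taxi_parquet/")

# Write the data back out as one subdirectory of Parquet files per year
write_dataset(ds, "path/to/taxi_by_year/", partitioning = "year")

# A filter on the partition column lets Arrow skip whole directories:
# only the year == 2019 files are ever opened
open_dataset("path/to/taxi_by_year/") %>%
  filter(year == 2019) %>%
  collect()
```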

There are some excellent blog posts by Francois Michonneau that go into more detail about datasets:

