Best format for exporting large rows

meitei · May 8, 2023, 11:03pm

I have a data frame with 2 million rows.
I was wondering if there is a better format than a CSV file.
If I use a CSV file, will any data be lost when I read the file back in?

library(data.table)
# one column of 2 million random normal values
df = data.frame(matrix(rnorm(2*10^6), nrow=2*10^6))
fwrite(df, "try.csv")

AlexisW · May 9, 2023, 1:11am

Depends on your definition of "better". The advantage of csv is that it's just plaintext and can be reopened anywhere.

For more speed, you can look at the {fst} and {qs} packages. But these formats can only be read in R (with the same package).
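As a rough sketch of what that looks like, assuming both packages are installed (the toy data frame and the temporary file paths are made up for illustration, they are not from the original question):

```r
# Assumes install.packages(c("fst", "qs")) has been run; each part is
# skipped if the corresponding package is missing.
df <- data.frame(x = rnorm(1000),
                 y = sample(letters, 1000, replace = TRUE))

fst_path <- tempfile(fileext = ".fst")
qs_path  <- tempfile(fileext = ".qs")

if (requireNamespace("fst", quietly = TRUE)) {
  fst::write_fst(df, fst_path)
  # read_fst() can pull a subset of columns and rows from the file
  part <- fst::read_fst(fst_path, columns = "x", from = 1, to = 100)
}

if (requireNamespace("qs", quietly = TRUE)) {
  qs::qsave(df, qs_path)      # serializes any R object, not just data frames
  df_qs <- qs::qread(qs_path)
}
```

The `columns`, `from` and `to` arguments are handy when you only need a slice of a big table.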

If you want speed with more interoperability, the parquet and feather formats are also optimized for large datasets, and they have libraries available in other programming languages.
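A minimal sketch using the {arrow} package, which is one way (not the only one) to read and write both formats from R. The data frame and file paths are placeholders I made up:

```r
# Assumes install.packages("arrow") has been run; skipped otherwise.
if (requireNamespace("arrow", quietly = TRUE)) {
  df <- data.frame(x = rnorm(1000),
                   g = sample(c("a", "b"), 1000, replace = TRUE))

  parquet_path <- tempfile(fileext = ".parquet")
  feather_path <- tempfile(fileext = ".feather")

  arrow::write_parquet(df, parquet_path)
  arrow::write_feather(df, feather_path)

  # Both readers return a data frame (a tibble by default)
  df_parquet <- arrow::read_parquet(parquet_path)
  df_feather <- arrow::read_feather(feather_path)
}
```

The resulting files can then be opened outside R, for example with pyarrow or pandas in Python.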

If you're only going to use part of this dataset at once, a database like duckdb will allow you to keep the data on disk and only load what you need when you need it.
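Something along these lines, assuming the {duckdb} and {DBI} packages are installed. The table name `measurements`, the columns and the database path are hypothetical:

```r
# Assumes install.packages(c("duckdb", "DBI")) has been run; skipped otherwise.
if (requireNamespace("duckdb", quietly = TRUE) &&
    requireNamespace("DBI", quietly = TRUE)) {
  df <- data.frame(id = 1:1000, x = rnorm(1000))

  # The database lives in a file on disk, not in R's memory
  db_path <- tempfile(fileext = ".duckdb")
  con <- DBI::dbConnect(duckdb::duckdb(), dbdir = db_path)

  DBI::dbWriteTable(con, "measurements", df)

  # Only the rows matching the query are loaded into R
  subset_df <- DBI::dbGetQuery(
    con, "SELECT id, x FROM measurements WHERE id <= 10"
  )

  DBI::dbDisconnect(con, shutdown = TRUE)
}
```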

There are other possibilities that could be "better" in some context, e.g. disk.frame or SQLite.

If it's only numbers (as in your example), there shouldn't be any data loss (unless the decimal separator gets mixed up between dot and comma, or stray row names end up written as an extra column). If you have R objects with attributes (i.e. not a simple data.frame), or if some columns should have a particular class (for example integer vs double), that information will be lost. You can usually keep it by using RDS or qs.
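To see the class issue for yourself, here is a small round-trip check. It uses base R's write.csv()/read.csv() rather than fwrite()/fread() so it has no package dependencies. The column names are made up, and fread() may guess some types differently:

```r
# Small data frame with column types that plain text cannot describe
df <- data.frame(
  id    = c(1, 2, 3),                  # double that happens to hold whole numbers
  group = factor(c("a", "b", "a")),    # factor
  day   = as.Date("2023-05-08") + 0:2  # Date
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write.csv(df, csv_path, row.names = FALSE)
saveRDS(df, rds_path)

from_csv <- read.csv(csv_path)
from_rds <- readRDS(rds_path)

sapply(df, class)        # numeric, factor, Date
sapply(from_csv, class)  # classes are re-guessed from the text on reading
identical(df, from_rds)  # RDS keeps the object exactly as it was
```

With the defaults in R 4.0 or later, `id` comes back from the CSV as integer and `group` and `day` as character, while the RDS copy is identical to the original.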

DavoWW · May 10, 2023, 8:17am

If you will only be using the saved data in R, then R's built-in binary formats may also be used.
See help(save), help(saveRDS), and help(load).
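A quick sketch of the difference between the two, using a toy data frame and temporary files (both invented for this example):

```r
df <- data.frame(x = rnorm(5))

rds_path   <- tempfile(fileext = ".rds")
rdata_path <- tempfile(fileext = ".RData")

# saveRDS() stores a single object; readRDS() returns it,
# so you choose the name it gets when reading
saveRDS(df, rds_path)
df_copy <- readRDS(rds_path)

# save() stores one or more objects by name; load() recreates them
# under those same names (here in a separate environment, to avoid
# overwriting anything in the workspace)
save(df, file = rdata_path)
restored_env <- new.env()
loaded_names <- load(rdata_path, envir = restored_env)

identical(df, df_copy)
identical(df, restored_env$df)
loaded_names  # names of the objects that load() restored
```

For a single data frame, saveRDS()/readRDS() is usually the more convenient pair because nothing in your workspace gets silently replaced.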
