.sas7bdat --> .parquet

I have a ~400 GB .sas7bdat file I would like to convert to a .parquet file for easier work in R. The problem is that I really don't have the memory to read a 400 GB .sas7bdat file into R just to convert it with the arrow::write_parquet() function.

I tried passing the path to the .sas7bdat file directly to write_parquet(), but I don't think that works if the SAS file hasn't been read into my R environment first.

tf1 <- tempfile(fileext = '.parquet')
write_parquet(paste0(D, 'file.sas7bdat'), tf1)

Yes. The first argument of write_parquet() needs to be a data.frame, RecordBatch, or Table. In your example, paste0(D, 'file.sas7bdat') is just a character string. You'd need to do something like this:

sasdat_in <- haven::read_sas(paste0(D, 'file.sas7bdat'))
write_parquet(sasdat_in, tf1)

I see. My issue is that I cannot read a SAS file that large into R. It either won't finish running, or if it does, it will take something like a week.

My first thought was that haven::read_sas() has the arguments skip = 0L and n_max = Inf, so you can read in chunks.
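A minimal sketch of what that could look like (the path, chunk size, and output directory are placeholders; note that skip still has to scan past the earlier rows on every pass, so this trades time for memory):

library(haven)
library(arrow)

sas_path <- 'file.sas7bdat'   # placeholder path to your SAS file
out_dir  <- 'parquet_chunks'  # placeholder output directory
dir.create(out_dir, showWarnings = FALSE)

chunk_size <- 1e6  # rows per chunk; tune to available memory
i <- 0
repeat {
  # read one slice of rows from the SAS file
  chunk <- read_sas(sas_path, skip = i * chunk_size, n_max = chunk_size)
  if (nrow(chunk) == 0) break
  # write each slice out as its own parquet file
  write_parquet(chunk, file.path(out_dir, sprintf('part-%05d.parquet', i)))
  i <- i + 1
}

# later, query the chunk files as one dataset without loading them all
ds <- open_dataset(out_dir)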

My second thought was that parquet is a column-based format and read_sas() has col_select, so perhaps you can just do one column at a time.

Agreed. I was just going to add that if you don't need all the columns, you can use col_select to limit which columns are read in. Benchmarking with microbenchmark(), this is generally a much faster method. If you don't know the column names, read in just a few rows to learn them.
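For example (a sketch; the path and column names here are hypothetical, and col_select uses tidyselect syntax):

library(haven)
library(arrow)

sas_path <- 'file.sas7bdat'  # placeholder path

# peek at a handful of rows just to learn the column names cheaply
head_rows <- read_sas(sas_path, n_max = 5)
names(head_rows)

# read only the columns you actually need, then write parquet
dat <- read_sas(sas_path, col_select = c(id, visit_date, outcome))  # hypothetical names
write_parquet(dat, 'file_subset.parquet')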

There's this package that essentially reads it by chunks: parquetize, via its table_to_parquet() function ("Convert an input file to parquet format — table_to_parquet • parquetize").
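A sketch of how that might look (the paths are placeholders, and the argument names are from the parquetize docs as I recall them; check ?table_to_parquet for your installed version):

# install.packages('parquetize')
library(parquetize)

table_to_parquet(
  path_to_file    = 'file.sas7bdat',   # placeholder input path
  path_to_parquet = 'parquet_output',  # placeholder output location
  max_rows        = 1e6                # rows held in memory per chunk
)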
