Hello all,
I'm trying to export queried data from a BigQuery table. Since the result can be large (2.5 GB and more), I followed the "Larger datasets" suggestion in the bq_table_download() help and used bq_table_save() to write the data to multiple files in Google Cloud Storage.
While trying out bq_table_save(), I discovered an export option that is not mentioned on its help page: destination_format = "PARQUET", in place of "NEWLINE_DELIMITED_JSON" or "CSV".
With this parameter, bq_table_save() correctly saves the data in multiple Parquet files.
Can I use this option without problems? It seems to work very well: it is fast, and because Parquet files embed the column types, they save me a lot of work checking data types.
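For what it's worth, if I read the bigrquery reference correctly, bq_table_save() passes its extra arguments on to the lower-level bq_perform_extract(), where destination_format is a documented argument, so "PARQUET" may simply be inherited from there rather than being truly unsupported. Here is a minimal sketch of the equivalent low-level call (the bucket path is a placeholder):

library(bigrquery)

# Same export via the lower-level API; bq_perform_extract() returns a
# bq_job, and bq_job_wait() blocks until the extract job finishes
job <- bq_perform_extract(
  tb,  # the bq_table returned by bq_project_query()
  destination_uris = "gs://<destination bucket>/folder/filename_*.parquet",
  destination_format = "PARQUET"
)
bq_job_wait(job)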
The following code summarizes what I did to export the data to a Google Cloud Storage bucket:
library(bigrquery)

project_id <- "<project identifier>"
sql_dwn <- "SELECT * FROM <table from which to extract data>"
tb <- bq_project_query(project_id, sql_dwn)  # run the query; returns a bq_table
# The * lets BigQuery shard the output across multiple files; note the gs:// prefix on the URI
bq_table_save(tb, destination_uris = "gs://<destination bucket>/folder/filename_*.parquet", destination_format = "PARQUET")
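To double-check that the export preserved the column types, I read the files back with the arrow package. This is just my read-back sketch, not part of bigrquery; it assumes an arrow build with GCS support (otherwise copy the files to a local folder first, e.g. with gsutil, and open that path instead):

library(arrow)
library(dplyr)

# Open all exported Parquet shards as one dataset; Parquet embeds the
# schema, so the column types come back without any manual parsing
ds <- open_dataset("gs://<destination bucket>/folder/")
print(ds$schema)   # inspect the recovered column types
df <- collect(ds)  # materialize only if it fits in memory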
Thank you in advance for your suggestions/hints.
Enrico