How can I efficiently read a large CSV file from Azure Blob Storage into R?

Hi everyone,

I have the following function to read a CSV file from Azure:

library(AzureStor)   # storage_download()
library(data.table)  # fread()

read_csv_from_azure <- function(file_path, container) {
  
  # Try to download the file and handle potential errors
  tryCatch({
    # Download the blob into memory as a raw vector (dest = NULL)
    downloaded_file <- storage_download(container, src = file_path, dest = NULL)
    
    # Convert the raw bytes to a character string
    file_content <- rawToChar(downloaded_file)
    
    # Parse the CSV content with data.table's fread
    data <- fread(text = file_content, sep = ",")
    
    return(data)
    
  }, error = function(e) {
    # Report a readable error message if the download or parse fails
    message("An error occurred while downloading or reading the file: ",
            conditionMessage(e))
    return(NULL)
  })
}

However, this function's performance isn't sufficient for my needs; it takes too long to read a single CSV file. The files are around 30 MB each.

How can I make it more efficient?

Thanks

You could have a look at the CSV-import options in DuckDB.
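
For instance, here is a minimal sketch of what that could look like from R, assuming the blob has first been downloaded to a local file ('local_copy.csv' is just a placeholder name):

library(DBI)
library(duckdb)

# Open an in-memory DuckDB connection
con <- dbConnect(duckdb())

# read_csv_auto() infers the column types and parses the file with DuckDB's
# parallel CSV reader, returning an ordinary data.frame
data <- dbGetQuery(con, "SELECT * FROM read_csv_auto('local_copy.csv')")

dbDisconnect(con, shutdown = TRUE)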

I don't have much experience with Azure, but you should try not to save/export large datasets as CSV. It's much better to use the Parquet format instead and then import the files into R with the arrow package. Parquet appears to be supported in Azure: Parquet format - Azure Data Factory & Azure Synapse | Microsoft Learn
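
A rough sketch of what that could look like, reusing AzureStor's storage_download() and assuming the data has been re-exported as Parquet (the function name is just illustrative):

library(AzureStor)
library(arrow)

read_parquet_from_azure <- function(file_path, container) {
  # Download the blob to a temporary local file
  tmp <- tempfile(fileext = ".parquet")
  on.exit(unlink(tmp), add = TRUE)
  storage_download(container, src = file_path, dest = tmp)
  
  # Parquet is columnar and compressed, so the download is smaller and the
  # read is typically much faster than parsing the equivalent CSV
  read_parquet(tmp)
}

Because Parquet stores column types in the file, you also skip the type-guessing that a CSV reader has to do on every load.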