How can I efficiently read a large CSV file from Azure Blob Storage into R?

Hi everyone,

I have the following function to read a CSV file from Azure:

library(AzureStor)   # storage_download()
library(data.table)  # fread()

read_csv_from_azure <- function(file_path, container) {
  
  # Try to download the file and handle potential errors
  tryCatch({
    # Download the blob into memory as a raw vector (dest = NULL)
    downloaded_file <- storage_download(container, src = file_path, dest = NULL)
    
    # Convert the raw bytes to a character string
    file_content <- rawToChar(downloaded_file)
    
    # Parse the CSV content with data.table's fread
    data <- fread(text = file_content, sep = ",")
    
    return(data)
    
  }, error = function(e) {
    # Report a readable error message if the download or parse fails
    message("An error occurred while downloading or reading the file: ",
            conditionMessage(e))
    return(NULL)
  })
}

However, this function's performance isn't sufficient for my needs; it takes too long to read a single CSV file. The files are around 30 MB each.

How can I make it more efficient?

Thanks

You could have a look at the CSV-import options in DuckDB.
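
For instance, here is a minimal sketch of what that could look like from R, assuming the blob has first been downloaded to a local file ('local_copy.csv' is just a placeholder name):

library(DBI)
library(duckdb)

# Open an in-memory DuckDB connection
con <- dbConnect(duckdb())

# read_csv_auto() infers the column types and parses the file with DuckDB's
# parallel CSV reader, returning an ordinary data.frame
data <- dbGetQuery(con, "SELECT * FROM read_csv_auto('local_copy.csv')")

dbDisconnect(con, shutdown = TRUE)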

I don't have much experience with Azure, but you should try not to save/export large datasets as CSV. It's much better to use the Parquet format instead and then import the files into R with the arrow package. Parquet appears to be supported in Azure: Parquet format - Azure Data Factory & Azure Synapse | Microsoft Learn
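
A rough sketch of what that could look like, reusing AzureStor's storage_download() and assuming the data has been re-exported as Parquet (the function name is just illustrative):

library(AzureStor)
library(arrow)

read_parquet_from_azure <- function(file_path, container) {
  # Download the blob to a temporary local file
  tmp <- tempfile(fileext = ".parquet")
  on.exit(unlink(tmp), add = TRUE)
  storage_download(container, src = file_path, dest = tmp)
  
  # Parquet is columnar and compressed, so the download is smaller and the
  # read is typically much faster than parsing the equivalent CSV
  read_parquet(tmp)
}

Because Parquet stores column types in the file, you also skip the type-guessing that a CSV reader has to do on every load.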