I am reading a lot of CSV files on an HPC cluster. I have a for loop that reads the files, and every time it reads one, I get output like this:
indexing 00017942.csv [=====================================] 8.00GB/s, eta
indexing 00025702.csv [=====================================] 7.50GB/s, eta
indexing 00041004.csv [=====================================] 4.49GB/s, eta
indexing 00031140.csv [=====================================] 8.78GB/s, eta
What is this indexing? It seems to be taking longer than it should to read the files.
A naked for loop requires some function call to actually read any particular data format. What function are you relying on to read your CSV files?
My guess is readr's read_csv(). For performance, you might benefit from adopting data.table's fread() function.
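For what the swap would look like: this is a minimal sketch, assuming the data.table package is installed. It writes a small example CSV to a temporary file (the file and its columns are made up for illustration) and reads it back with fread().

```r
# Sketch of the suggested swap, assuming data.table is installed.
library(data.table)

# Create a small example CSV (hypothetical data, just for the demo).
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, value = c(2.5, 3.1, 4.8)),
          tmp, row.names = FALSE)

# fread() parses the file eagerly and returns a data.table
# (which is also a data.frame), with no "indexing" progress bar.
dt <- fread(tmp)
print(dim(dt))
```

fread() guesses the separator and column types itself, which is part of why it tends to be fast on small files.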
Yes, sorry for not clarifying: I am using read_csv.
I am fine switching to fread, though that would require a new dependency. These are small CSV files, and it doesn't seem like reading will be the bottleneck in my code overall. But this indexing is new and perplexing.
But can I get some confirmation that this indexing is required by read_csv? Or more information about what's going on?
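For reference, here is a minimal sketch of the read_csv arguments that relate to that output, assuming readr >= 2.0 (the edition that reads files via the vroom package, whose progress bar is labelled "indexing"). The example file is made up for illustration; progress and lazy are real read_csv parameters in that version.

```r
# Sketch, assuming readr >= 2.0 (the vroom-backed edition).
library(readr)

# Create a small example CSV (hypothetical data, just for the demo).
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, value = c(2.5, 3.1, 4.8)),
          tmp, row.names = FALSE)

# progress = FALSE hides the progress bar; lazy = FALSE asks for an
# eager read, so the whole file is parsed up front rather than being
# indexed for later lazy access. show_col_types = FALSE quiets the
# column-specification message.
df <- read_csv(tmp, progress = FALSE, lazy = FALSE,
               show_col_types = FALSE)
print(nrow(df))
```

Setting `options(readr.show_progress = FALSE)` once at the top of a script is another way to suppress the bar globally, if the version of readr in use supports it.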