I am reading a lot of CSV files on an HPC cluster. I have a for loop that reads the files, and every time it reads one, I get output like this:
indexing 00017942.csv [=====================================] 8.00GB/s, eta
indexing 00025702.csv [=====================================] 7.50GB/s, eta
indexing 00041004.csv [=====================================] 4.49GB/s, eta
indexing 00031140.csv [=====================================] 8.78GB/s, eta
What is this indexing? It seems to be taking longer than it should to read the files.
A naked for loop requires some function call to actually read any particular data format. What function are you relying on to read your CSV files?
My guess is readr's read_csv(). For performance, you might benefit from adopting data.table's fread() function.
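For what the swap would look like: this is a minimal sketch, assuming the data.table package is installed. It writes a small example CSV to a temporary file (the file and its columns are made up for illustration) and reads it back with fread().

```r
# Sketch of the suggested swap, assuming data.table is installed.
library(data.table)

# Create a small example CSV (hypothetical data, just for the demo).
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, value = c(2.5, 3.1, 4.8)),
          tmp, row.names = FALSE)

# fread() parses the file eagerly and returns a data.table
# (which is also a data.frame), with no "indexing" progress bar.
dt <- fread(tmp)
print(dim(dt))
```

fread() guesses the separator and column types itself, which is part of why it tends to be fast on small files.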
Yes, sorry for not clarifying: I am using read_csv.
I am fine switching to fread, though that would require a new dependency. These are small CSV files, and it doesn't seem like reading will be the bottleneck in my code overall. But this indexing is new and perplexing.
But can I get some confirmation that this indexing is required by read_csv? Or more information about what's going on?
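For reference, here is a minimal sketch of the read_csv arguments that relate to that output, assuming readr >= 2.0 (the edition that reads files via the vroom package, whose progress bar is labelled "indexing"). The example file is made up for illustration; progress and lazy are real read_csv parameters in that version.

```r
# Sketch, assuming readr >= 2.0 (the vroom-backed edition).
library(readr)

# Create a small example CSV (hypothetical data, just for the demo).
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, value = c(2.5, 3.1, 4.8)),
          tmp, row.names = FALSE)

# progress = FALSE hides the progress bar; lazy = FALSE asks for an
# eager read, so the whole file is parsed up front rather than being
# indexed for later lazy access. show_col_types = FALSE quiets the
# column-specification message.
df <- read_csv(tmp, progress = FALSE, lazy = FALSE,
               show_col_types = FALSE)
print(nrow(df))
```

Setting `options(readr.show_progress = FALSE)` once at the top of a script is another way to suppress the bar globally, if the version of readr in use supports it.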