I don't know how to make a reprex for this, since I can't figure out what the problem is. Using both readr and vroom, reading a fixed-width file of approx. 17 million rows adds about 6 million empty rows, starting at around row 600,000. I've looked at it in a text editor and can't find anything wrong in the areas where it's happening; there seems to be no pattern to when it happens and when it doesn't. I also tried setting the last column width to NA in vroom_fwf(), and it still happened. It will add an empty row for every row for a while, then get back in sync, then do it again.
There are ragged line endings, none of which end before the beginning of the last column. I tried reading it using fwf_widths() and fwf_positions(), and neither avoids the problem.
I am using a MacBook and the line endings are Windows CR/LF. I tried specifying the encoding as ISO-8859-1, which is what my text editor reports, but it doesn't seem to matter.
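One way to check whether the line endings are really all CR/LF is to count them at the byte level. The following is only a minimal sketch in base R: the function name count_line_endings and the path in the trailing comment are made up for illustration, and a file this large would probably need to be read in chunks rather than all at once.

```r
# Count line-ending styles in a raw byte vector (base R only).
count_line_endings <- function(bytes) {
  n <- length(bytes)
  is_cr <- bytes == as.raw(0x0d)
  is_lf <- bytes == as.raw(0x0a)
  # For each byte: is the following byte LF / is the preceding byte CR?
  next_is_lf <- c(is_lf[-1], FALSE)
  prev_is_cr <- c(FALSE, is_cr[-n])
  c(
    crlf    = sum(is_cr & next_is_lf),   # proper Windows line endings
    lone_cr = sum(is_cr & !next_is_lf),  # CR not followed by LF
    lone_lf = sum(is_lf & !prev_is_cr)   # LF not preceded by CR
  )
}

# Small in-memory demonstration with mixed endings.
sample_bytes <- charToRaw("ab\r\ncd\nef\rgh\r\n")
print(count_line_endings(sample_bytes))

# For a real file (hypothetical path), something like:
# bytes <- readBin("data.txt", what = "raw", n = file.size("data.txt"))
# count_line_endings(bytes)
```

If lone_cr or lone_lf comes back non-zero, the file has mixed line endings, which might be worth comparing against the places where the empty rows show up.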
The data it's reading appears to be accurate, and filtering for the empty rows is easy enough. But it just makes me nervous.
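The kind of filtering meant here can be sketched in base R as below. This is an illustration under assumptions, not the exact code used: the helper name drop_empty_rows and the demo columns are hypothetical, and a row counts as "empty" when every column is NA or a blank string.

```r
# Drop rows in which every column is NA or a blank string (base R only).
drop_empty_rows <- function(df) {
  # One logical vector per column: TRUE where the cell is NA or blank.
  blank_by_col <- lapply(df, function(col) {
    is.na(col) | trimws(as.character(col)) == ""
  })
  # A row is empty only if it is blank in every column.
  all_blank <- Reduce(`&`, blank_by_col)
  message("Dropping ", sum(all_blank), " empty row(s)")
  df[!all_blank, , drop = FALSE]
}

# Small demonstration: rows 2 and 4 are empty.
demo <- data.frame(
  id   = c("001", NA, "002", NA),
  name = c("alpha", NA, "beta", ""),
  stringsAsFactors = FALSE
)
cleaned <- drop_empty_rows(demo)
print(cleaned)
```

Comparing the number of rows left after filtering with the line count of the original file would be one sanity check that nothing real is being dropped.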
I've looked at the original data file in a text editor and at its hex representation, and can't see any difference between the places where an empty row gets added and the places where it doesn't.
TIA
Sarah