Reading in files with embedded NUL's

angelotrivelli · July 20, 2023, 5:08pm

Got myself into a situation where I need to read in some text files that happen to have NUL chars embedded within them. My favorite readr function, read_file(), interprets NUL as the end-of-file and stops reading upon encountering the first NUL.

These are my source data files, and I prefer to leave them exactly as they are (NUL's and everything), so I can't just strip the NUL's out of these files, save them, and then process them (plus, there are a lot of them).

Since I can't use readr::read_file(), I see that read_file() has a lesser known cousin, read_file_raw(). It ALMOST works...

What I have here is a tibble, logs, that has the full file path in the file column. I want to put the contents of each file into the content column as a string.

logs <- logs |>
  mutate(content = map(file, ~read_file_raw(.))) |>
  unnest(cols = c('content')) |>
  mutate(content = rawToChar(content))

This appears to read in the WHOLE file. That's great. The bad news is that I don't see how to strip out the NUL chars. I had hoped that rawToChar() would just do it or provide an option to strip out NUL's, but no...

Error in `mutate()`:
ℹ In argument: `content = rawToChar(content)`.
Caused by error in `rawToChar()`:
! embedded nul in string:

Googling around, I see some base R stuff that can take binary data and strip out the NUL's (\0's).

Unfortunately, I've deliberately forgotten most of base R and only want to use tidyverse. The base R examples are stuff like r[r != as.raw(0)]. I don't see how to incorporate that into my dplyr stanza. Is there an "easy button" to fix this?
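
For reference, the base R idiom mentioned above can be tried on its own, outside of any pipeline. This is only a minimal sketch: the five-byte vector r is made up for illustration and stands in for the bytes that read_file_raw() would return for a real file.

```r
# A made-up raw vector: "ab", an embedded NUL (0x00), then "cd"
r <- as.raw(c(0x61, 0x62, 0x00, 0x63, 0x64))

# rawToChar(r) would fail here with "embedded nul in string",
# so keep only the bytes that are not NUL first
clean <- r[r != as.raw(0)]

# With the NUL gone, the conversion to a string succeeds
result <- rawToChar(clean)
print(result)  # "abcd"
```

The filter works on one raw vector at a time, so in a tibble it has to be applied to each element of a list column rather than to the column as a whole.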

jrkrideau · July 20, 2023, 9:05pm

Well, it's not tidyverse, but there is a suggestion that fread() in the data.table package might help. See the Stack Overflow question "'Embedded nul in string' error when importing csv with fread".

angelotrivelli · July 20, 2023, 9:29pm

OK, I got it....

logs <- logs |>
  mutate(content = map(file, ~ read_file_raw(.))) |>
  mutate(content = map(content, ~ .[. != as.raw(0)])) |>
  mutate(content = map(content, ~ rawToChar(.))) |>
  unnest(cols = c('content'))

The problem was that I reflexively do an unnest() every time after I do a map() in a mutate().

The above seems to do the job of stripping NUL's without venturing too much outside of the tidyverse!
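
The same steps could also be folded into one small helper, so that a single map_chr() call returns a character column directly and no unnest() is needed. This is only a sketch under a few assumptions: read_file_drop_nul() is a hypothetical name (it is not part of readr), and the temporary file below is invented so the example can run by itself.

```r
library(dplyr)
library(purrr)
library(readr)

# Hypothetical helper (not a readr function): read a file as raw bytes,
# drop every NUL byte, and return what is left as a single string.
read_file_drop_nul <- function(path) {
  bytes <- read_file_raw(path)
  rawToChar(bytes[bytes != as.raw(0)])
}

# Invented test file: "ab", an embedded NUL, "cd", and a newline.
tmp <- tempfile(fileext = ".log")
writeBin(as.raw(c(0x61, 0x62, 0x00, 0x63, 0x64, 0x0a)), tmp)

# map_chr() returns a plain character vector, so content ends up as an
# ordinary character column, one string per file.
logs <- tibble(file = tmp) |>
  mutate(content = map_chr(file, read_file_drop_nul))

print(logs$content)  # "abcd\n"
```

The source files themselves are left untouched on disk; only the in-memory copy has the NUL's removed.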

angelotrivelli · July 20, 2023, 9:32pm

Oh, I had forgotten about data.table!

It looks like fread() really wants tabular files. Unfortunately, my files need a lot more janitorial work before I can get them into tables.

I will definitely revisit data.table next time I have more well-behaved input files!

Thanks!
