# Create toy text file("CR")
write.table(mtcars, file = "toy_text.TXT",
            col.names = FALSE, row.names = FALSE,
            quote = FALSE, eol = "\r")
When I execute the code above, it generates a text file with CR line endings because of eol = "\r".
I am trying to convert CR to LF on a Windows machine. Adapting the solution from the Stack Overflow question "How to convert CRLF to LF on a Windows machine in Python", the Python code shown below works for me. If I understand correctly, the code simply replaces \r with \n in binary mode.
How do I achieve the same result using R?
# replacement strings
WINDOWS_LINE_ENDING = b'\r' # CR
UNIX_LINE_ENDING = b'\n' # LF
# relative or absolute file path, e.g.:
file_path = "toy_text.txt"
with open(file_path, 'rb') as open_file:
    content = open_file.read()
# Windows ➡ Unix
content = content.replace(WINDOWS_LINE_ENDING, UNIX_LINE_ENDING)
with open(file_path, 'wb') as open_file:
    open_file.write(content)
@AlexisW Thanks a lot! Your code works perfectly for small datasets. But when I apply it to a larger dataset, e.g., ggplot2::diamonds, it produces a file of only 4 KB.
xx <- readBin("toy_text.TXT", what = "raw",
              n = 32*11*10)
The way I understand it, when you call readBin(), R first reserves memory for n items. It then starts reading the content of the file and storing it in that memory. If it reaches EOF (the end of the file), it stops reading; if it has already read n items, it also stops, even if the file continues. That would explain the truncated output: with n = 32*11*10, at most 3520 bytes are read, which is roughly 4 KB.
So you need to guesstimate the size of the file before you start reading, overestimating the real size. That's what I did with n=32*11*10, because I knew the file should contain a 32 x 11 data frame, with typically less than 10 bytes per field.
Now if you really don't know anything about the file beforehand, you could try using file.size().
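For what it's worth, here is a minimal sketch of how the Python snippet might translate to base R when file.size() supplies n. The helper name cr_to_lf() is hypothetical (my own, not from any package), and it assumes the file uses CR-only line endings like the toy example; on a CRLF file it would produce doubled LFs.

```r
# Hypothetical helper: replace every CR byte with LF, working on raw bytes.
cr_to_lf <- function(path_in, path_out = path_in) {
  # file.size() returns the byte count, so n covers the whole file
  bytes <- readBin(path_in, what = "raw", n = file.size(path_in))
  bytes[bytes == as.raw(0x0d)] <- as.raw(0x0a)  # CR (0x0d) -> LF (0x0a)
  writeBin(bytes, path_out)                     # binary write, no translation
  invisible(path_out)
}

# Usage on a small CR-terminated file (temporary path, for illustration only)
tmp <- tempfile(fileext = ".txt")
write.table(mtcars, file = tmp, col.names = FALSE, row.names = FALSE,
            quote = FALSE, eol = "\r")
cr_to_lf(tmp)
```

Because both the read and the write are binary, nothing else in the file is touched, which mirrors what the 'rb'/'wb' modes do in the Python version.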
Recommended
Anyway, working with binary being a bit of a pain, I strongly recommend you stick with string functions. I still don't know why my previous writeLines() didn't work, but you can get it with:
xx <- readr::read_lines("toy_text.TXT")
xx2 <- stringr::str_replace_all(xx, "\r", "\n")
readr::write_lines(xx2, "toy_text2.TXT")
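If you would rather avoid the extra packages, a base R sketch along the same lines is below. One possible reason a plain writeLines() call does not give LF on Windows (an assumption on my part, since it depends on how the file was opened) is that a text-mode connection translates \n into \r\n on output; opening the connection in binary mode ("wb") avoids that. The file paths here are temporary placeholders.

```r
# Assumed input: a CR-terminated file like the toy example
src <- tempfile(fileext = ".txt")
dst <- tempfile(fileext = ".txt")
write.table(mtcars, file = src, col.names = FALSE, row.names = FALSE,
            quote = FALSE, eol = "\r")

lines <- readLines(src)          # readLines() accepts LF, CRLF, or CR endings

con <- file(dst, open = "wb")    # binary mode: no \n -> \r\n translation
writeLines(lines, con, sep = "\n")
close(con)
```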
The size of the file on the disk should correspond to the number of bytes in that file (and thus the number of bytes that need to be read). See examples at the end.
But I would still add:
it might be a good idea to overestimate a bit more in case something weird happens, e.g. n = 10*file.size("myfile.TXT")
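A small sketch to illustrate the point about n, using a temporary file as a stand-in for the real one: an over-estimate returns exactly the bytes in the file, while an under-estimate truncates without any warning.

```r
tmp <- tempfile(fileext = ".txt")
write.table(mtcars, file = tmp, col.names = FALSE, row.names = FALSE,
            quote = FALSE, eol = "\r")

exact <- readBin(tmp, what = "raw", n = file.size(tmp))
over  <- readBin(tmp, what = "raw", n = 10 * file.size(tmp))
under <- readBin(tmp, what = "raw", n = 100)

identical(over, exact)  # over-estimating n still stops at the end of the file
length(under)           # under-estimating n silently cuts the content short
```

The only cost of over-estimating is that R reserves storage for n items up front, so a very large multiplier on a very large file is best avoided.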
In any case, working directly with binary is more error-prone; the solution above with read_lines() and write_lines() is probably always preferable.
write.table(ggplot2::diamonds, file = "toy_text_long.TXT",
            col.names = FALSE, row.names = FALSE,
            quote = FALSE, eol = "\r")
write.table(mtcars, file = "toy_text_short.TXT",
            col.names = FALSE, row.names = FALSE,
            quote = FALSE, eol = "\r")
# read correctly short one
read_as_bin_to_text <- readBin("toy_text_short.TXT",
                               what = "raw",
                               n = file.size("toy_text_short.TXT")) |>
  rawToChar() |>
  strsplit("\r") |>
  (\(.x) .x[[1]])()
read_as_text <- readLines("toy_text_short.TXT")
all.equal(read_as_bin_to_text, read_as_text)
#> [1] TRUE
# read long one, but with length of short (wrong)
read_as_bin_to_text <- readBin("toy_text_long.TXT",
                               what = "raw",
                               n = file.size("toy_text_short.TXT")) |>
  rawToChar() |>
  strsplit("\r") |>
  (\(.x) .x[[1]])()
read_as_text <- readLines("toy_text_long.TXT")
all.equal(read_as_bin_to_text, read_as_text)
#> [1] "Lengths (28, 53940) differ (string compare on first 28)"
#> [2] "1 string mismatch"
# read correctly long one
read_as_bin_to_text <- readBin("toy_text_long.TXT",
                               what = "raw",
                               n = file.size("toy_text_long.TXT")) |>
  rawToChar() |>
  strsplit("\r") |>
  (\(.x) .x[[1]])()
read_as_text <- readLines("toy_text_long.TXT")
all.equal(read_as_bin_to_text, read_as_text)
#> [1] TRUE