Problem in saving file - showing unneeded characters

sbegmanov · December 28, 2021, 6:25am

Step 1: This works fine and reads the Cyrillic text correctly.
lines <- readLines("D:/R_folder/data/KR_murajat.txt", encoding = "UTF-8")
words <- strsplit(lines, " ")
Unlist = unlist(words, use.names = FALSE)

Step 2: When I save the unlisted vector to a csv or txt file, the saved file shows unneeded characters instead of the normal text.

write.table(Unlist, file = "D:/R_folder/data/KR_murajat_list.csv",
fileEncoding = "UTF-8", col.names = TRUE, row.names = FALSE)
Any recommendations concerning the second step? Or does Step 1 need changes?

AlexisW · December 29, 2021, 4:31am

What kind of unneeded characters are they?

First, one weird quirk: if you print a UTF-8 string in the console, it appears as expected, but if the same string is a column of a data.frame, it is printed as Unicode code points, e.g. <U+1234>. For example:

# create a data frame
x <- data.frame(a = "кириллица")

# display it
x

#> 1 <U+043A><U+0438><U+0440><U+0438><U+043B><U+043B><U+0438><U+0446><U+0430>

x$a
#> [1] "кириллица"
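To answer the "what kind of characters" question, it can help to inspect the string itself rather than how it prints. The following is a minimal base R sketch, not part of the original reply; the variable name `s` is made up for illustration, and the text is written with `\u` escapes so the example does not depend on the encoding of the script file.

```r
# "кириллица" spelled with \u escapes, so the literal is locale-independent
s <- "\u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430"

Encoding(s)               # the encoding mark R has attached to the string
validUTF8(s)              # TRUE if the bytes form valid UTF-8
nchar(s, type = "chars")  # number of characters, not bytes

# The code points behind the <U+....> display
codes <- sprintf("U+%04X", utf8ToInt(s))
codes
```

If the code points match what the data frame prints, the data is intact and only the display differs.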

As for saving, encoding usually works better using the functions from the {readr} package:

# save it in a file
myfile <- tempfile()

readr::write_tsv(x, myfile)

xx <- read.table(myfile, encoding = "UTF-8")

# display it
xx
#>                                                                         V1
#> 1                                                                        a
#> 2 <U+043A><U+0438><U+0440><U+0438><U+043B><U+043B><U+0438><U+0446><U+0430>

xx$V1
#> [1] "a"         "кириллица"
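The column name "a" shows up as a data row above because `read.table()` was called without `header = TRUE`. A small base R sketch of reading the same kind of file with the header recognised; the file is created here with `writeLines()` instead of {readr} so the example is self-contained, and `tsv_file` and `s` are illustrative names only.

```r
s <- "\u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430"  # "кириллица"

# Write a one-column file with a header line, keeping the UTF-8 bytes as-is
tsv_file <- tempfile(fileext = ".tsv")
writeLines(enc2utf8(c("a", s)), tsv_file, useBytes = TRUE)

# header = TRUE makes the first line the column name instead of a data row
xx <- read.table(tsv_file, header = TRUE, sep = "\t",
                 encoding = "UTF-8", stringsAsFactors = FALSE)

names(xx)
xx$a
```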

Or as a vector:

readr::write_lines(x$a, myfile)
readLines(myfile, encoding = "UTF-8")
#> [1] "кириллица"

If you really want to use base R only, it should be possible to convert beforehand with iconv(), but that might get more complicated if trying to process a whole data frame (you might have to convert column-by-column):

writeLines(iconv(x$a, to = "UTF-8"), myfile)
readLines(myfile, encoding = "UTF-8")
#> [1] "кириллица"

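# iconv() coerces the data frame with as.character(), which works here only
# because x has a single one-row column; write.table() then names that column "x"
write.table(iconv(x, to = "UTF-8"), myfile)
xx <- read.table(myfile, encoding = "UTF-8")
xx$x
#> [1] "кириллица"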
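For the plain-vector case from the original question, a base R sketch that mirrors Step 1 and Step 2 might look like the following. It assumes the garbling comes from R re-encoding the text through a non-UTF-8 native locale when writing; that cause is a guess, not something confirmed in this thread. The input lines and the `in_file` / `out_file` temp paths are stand-ins for the real files.

```r
# Stand-in for the input file: two lines of Cyrillic text, stored as UTF-8
in_file <- tempfile(fileext = ".txt")
writeLines(enc2utf8(c("\u043f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440",
                      "\u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430")),
           in_file, useBytes = TRUE)

# Step 1: read the lines and split them into words
lines <- readLines(in_file, encoding = "UTF-8")
words <- unlist(strsplit(lines, " ", fixed = TRUE), use.names = FALSE)

# Step 2: write one word per line; useBytes = TRUE writes the UTF-8 bytes
# as-is instead of translating them to the native encoding first
out_file <- tempfile(fileext = ".txt")
writeLines(enc2utf8(words), out_file, useBytes = TRUE)

# Round trip to check that nothing was altered
back <- readLines(out_file, encoding = "UTF-8")
identical(enc2utf8(back), enc2utf8(words))
```

This writes no header line and no quotes, unlike `write.table()`; if the result is meant to be opened in a spreadsheet program, that program also has to be told the file is UTF-8.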

Side note

For a reason I totally don't understand (fonts?), on this forum, the same text is written differently if it's preceded by a #, but both appear correctly in the R console:

кириллица
# кириллица
