What kind of unneeded characters are they?
First, one weird quirk: if you print a UTF-8 string in the console, it appears as expected, but if the string is a column of a data.frame, it is printed as Unicode code points, e.g. <U+1234>. For example:
# create a data frame
x <- data.frame(a = "кириллица")
# display it
x
#> 1 <U+043A><U+0438><U+0440><U+0438><U+043B><U+043B><U+0438><U+0446><U+0430>
x$a
#> [1] "кириллица"
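If you want to convince yourself that this is only a display issue and the stored data is intact, a quick check along these lines should do. This is just a sketch: I set stringsAsFactors = FALSE explicitly so the column is a character vector on older R versions too, and the result of Encoding() can depend on your platform and locale.

```r
x <- data.frame(a = "кириллица", stringsAsFactors = FALSE)

# the stored value still compares equal to the original string
identical(x$a, "кириллица")

# number of characters in the string, not of <U+....> escapes
nchar(x$a)

# declared encoding mark of the string, e.g. "UTF-8" or "unknown"
Encoding(x$a)
```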
As for saving, encoding is usually handled more reliably by the functions from the {readr} package:
# save it in a file
myfile <- tempfile()
readr::write_tsv(x, myfile)
xx <- read.table(myfile, encoding = "UTF-8")
# display it
xx
#> V1
#> 1 a
#> 2 <U+043A><U+0438><U+0440><U+0438><U+043B><U+043B><U+0438><U+0446><U+0430>
xx$V1
#> [1] "a" "кириллица"
Or as a vector:
readr::write_lines(x$a, myfile)
readLines(myfile, encoding = "UTF-8")
#> [1] "кириллица"
If you really want to use base R only, it should be possible to convert beforehand with iconv(), but that gets more complicated when processing a whole data frame (you might have to convert column by column):
writeLines(iconv(x$a, to = "UTF-8"), myfile)
readLines(myfile, encoding = "UTF-8")
#> [1] "кириллица"
# iconv() coerces the data frame to a character vector, so the column is written as "x"
write.table(iconv(x, to = "UTF-8"), myfile)
xx <- read.table(myfile, encoding = "UTF-8")
xx$x
#> [1] "кириллица"
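For a data frame with several columns, the column-by-column conversion mentioned above could look roughly like this. It's a minimal sketch: convert_chr_cols() is a hypothetical helper name of my own, not a base R function, and I haven't checked how it behaves on every platform or locale.

```r
# Hypothetical helper: apply the same iconv() call to every character
# column of a data frame and leave the other columns untouched.
convert_chr_cols <- function(df, to = "UTF-8") {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], iconv, to = to)
  df
}

x <- data.frame(a = c("кириллица", "текст"), n = 1:2,
                stringsAsFactors = FALSE)

# write the converted data frame and read it back
myfile <- tempfile()
write.table(convert_chr_cols(x), myfile)
xx <- read.table(myfile, encoding = "UTF-8", stringsAsFactors = FALSE)
xx$a
```

Unlike calling iconv() on the whole data frame, this keeps the column names and the non-character columns as they are.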
Side note
For a reason I totally don't understand (fonts?), on this forum the same text is displayed differently if it's preceded by a #, but both appear correctly in the R console:
кириллица
# кириллица