Creating a package that works with non-English characters?

Is there a way to develop a package whose functions contain non-English characters? I asked this on Stack Overflow and put a bounty on it (https://stackoverflow.com/questions/66361411/encoding-problem-when-your-package-contains-functions-with-non-english-character), but no one has answered so far. I'm reaching out here to see if I can find an answer or an example package that I could reference.

Basically, the functions make heavy use of dplyr::mutate and dplyr::filter with non-English characters, in this case Korean. A comment suggested this was purely a Windows issue. In that case, is Unix-based development the only way, perhaps over in RStudio Cloud? I also saw the post from Tomas Kalibera here and am wondering if using the experimental build of R (May 2020) is the way forward.

(If you have an answer, please copy-paste to SO as well so that I can award you my bounty! Don't want it to evaporate :-P)

I feel your pain. While I can't speak (or read, for that matter) Korean, I come from a non-ASCII (or rather, greatly extended ASCII) language background.

Without delving (too much) into character encoding hell: using Unicode escape sequences seems to be the way to go. These can be generated with stringi::stri_escape_unicode().
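As a minimal sketch of that step (assuming the stringi package is installed and the session can represent Korean text), the snippet below escapes a string and checks that unescaping recovers it. The variable names are only illustrative.

```r
library(stringi)

# The original Korean string; in package code, only its escaped form would appear.
korean <- "연결재무제표 주석"

# Convert every non-ASCII character to a \uXXXX escape sequence.
escaped <- stri_escape_unicode(korean)

# cat() shows the raw escape sequences; print() would display each backslash doubled.
cat(escaped, "\n")

# Round trip: unescaping should recover the original string.
identical(stri_unescape_unicode(escaped), korean)
```

The text shown by cat() is what gets pasted into the package source, inside an ordinary double-quoted string, so the source file itself stays pure ASCII.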

So if I were in your shoes, I would consider something along these lines. Note that since my understanding of Korean is very limited and your function is somewhat language-specific, I can't actually confirm that my approach works.

sampleprob <- function(url) {
  # sample url: "http://dart.fss.or.kr/dsaf001/main.do?rcpNo=20200330003851"
  # escapes generated with stringi::stri_escape_unicode("연결재무제표 주석");
  # single backslashes let the R parser turn each \uXXXX into the actual character
  page_text <- rvest::html_text(rvest::read_html(url))
  grepl("\uc5f0\uacb0\uc7ac\ubb34\uc81c\ud45c \uc8fc\uc11d", page_text)
}

Genius! I didn't know about that particular function in stringi before. If you would like, please feel free to paste this answer to Stack Overflow so that I can award the bounty. Thank you very much!

Thank you for your kind words, and I'm glad that the little trick worked!

Another alternative is to bundle all your Korean strings into an .rds, .zip, or .tar.gz file and put that in your package. Then, at package load time, you read the file and use the strings as desired. This works because only R files (and maybe other text files) are checked for encoding issues; binary files like archives don't matter.
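A rough sketch of the .rds variant, using only base R. The list element notes_label, the file name strings.rds, and the package name mypkg are hypothetical, and a temporary file stands in for a file that would normally ship inside the package (for example under inst/extdata/).

```r
# Development time: collect the strings and save them to a binary .rds file.
# The string is written with escapes here only to keep this example ASCII;
# it could equally be typed directly on a machine that handles Korean.
strings <- list(
  notes_label = "\uc5f0\uacb0\uc7ac\ubb34\uc81c\ud45c \uc8fc\uc11d"
)
rds_path <- tempfile(fileext = ".rds")
saveRDS(strings, rds_path)

# Load time: read the strings back. In a package this could happen in .onLoad(),
# with the path coming from something like
# system.file("extdata", "strings.rds", package = "mypkg")  # hypothetical names
loaded <- readRDS(rds_path)

# The loaded value behaves like any other character string.
page_text <- paste0("5. ", loaded$notes_label, " (1)")
found <- grepl(loaded$notes_label, page_text, fixed = TRUE)
```

Here fixed = TRUE treats the label as a literal string rather than a regular expression, which seems like the safer default when the pattern comes from a data file.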
