Creating package that works with non-English characters?

sskim47 · March 21, 2021, 8:26pm

Is there a way to develop a package whose functions contain non-English characters? A question was put up on Stack Overflow, and I put a bounty on it (https://stackoverflow.com/questions/66361411/encoding-problem-when-your-package-contains-functions-with-non-english-character) but no one has answered it so far, so I am reaching out here to see if I can find an answer or an example package that I could reference.

Basically, the functions do a lot of dplyr::mutate and dplyr::filter but with non-English characters, in this case Korean. A comment suggested this was a purely Windows issue. In that case, is Unix-based development the only way, perhaps over in RStudio Cloud? I also saw the post from Tomas Kalibera here and am wondering if using the experimental build of R (May 2020) is the way forward.

(If you have an answer, please copy-paste to SO as well so that I can award you my bounty! Don't want it to evaporate :-P)

jlacko · March 21, 2021, 8:53pm

I feel your pain. While I can't speak (or read, for that matter) Korean, I come from a non-ASCII (or rather greatly extended ASCII) language background.

Without delving (too much) into character encoding hell: using Unicode escape sequences seems to be the way to go. These can be generated with the help of stringi::stri_escape_unicode().

So if I were in your shoes, I would consider something along these lines. Note that since my understanding of Korean is very limited, and your function is somewhat language-specific, I can't actually confirm or rule out that this approach works for your case.

sampleprob <- function(url) {
  # sample url: "http://dart.fss.or.kr/dsaf001/main.do?rcpNo=20200330003851"
  # escapes generated with stringi::stri_escape_unicode("연결재무제표 주석");
  # written with single backslashes so R parses them as Unicode characters
  page_text <- rvest::html_text(rvest::read_html(url))
  result <- grepl("\uc5f0\uacb0\uc7ac\ubb34\uc81c\ud45c \uc8fc\uc11d", page_text)
  return(result)
}
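To make the escaping step concrete, here is a minimal base-R sketch of what the escaped form looks like and how it behaves in a search. The helper escape_non_ascii() is hypothetical and written only for illustration. It is a simplified stand-in for stringi::stri_escape_unicode(): it assumes every character is at or below U+FFFF (true for Hangul syllables) and, unlike stringi, it does not escape backslashes or control characters.

```r
# Hypothetical helper (not part of any package): turn each non-ASCII
# character into a \uXXXX escape. Assumes code points <= U+FFFF.
escape_non_ascii <- function(x) {
  cps <- utf8ToInt(enc2utf8(x))          # integer code points of one string
  stopifnot(all(cps <= 0xFFFF))
  out <- ifelse(cps < 128L,
                intToUtf8(cps, multiple = TRUE),  # keep ASCII as-is
                sprintf("\\u%04x", cps))          # escape everything else
  paste(out, collapse = "")
}

# The Korean phrase from the function above, written with ASCII-only escapes.
korean <- "\uc5f0\uacb0\uc7ac\ubb34\uc81c\ud45c \uc8fc\uc11d"

# Text to paste into package source (literal backslashes, ASCII only).
escaped <- escape_non_ascii(korean)

# The escaped literal is an ordinary string at run time, so matching works
# as usual; fixed = TRUE treats it as plain text rather than a regex.
found <- grepl(korean, paste("prefix", korean, "suffix"), fixed = TRUE)
```

The source file stays pure ASCII, while the string R sees at run time is the original Korean text.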

sskim47 · March 21, 2021, 9:46pm

Genius! I didn't know about that particular function in stringi before. If you would like, please feel free to paste this answer to Stack Overflow so that I can award the bounty. Thank you very much!

jlacko · March 22, 2021, 7:43am

Thank you for your kind words, and I'm glad that the little trick worked!

Hong · March 22, 2021, 7:54am

Another alternative is to bundle all your Korean strings into an rds, zip or tar.gz file, and put that in your package. Then, at package load time, you read the file and use the strings as desired. This works because only R files (and maybe other text files) are checked for encoding issues; binary files like archives don't matter.
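A rough sketch of the rds variant, assuming a single named list of strings. The names notes_heading, korean_strings.rds and has_notes() are made up for illustration, and a temporary directory stands in for the package's installed files so the snippet runs on its own.

```r
# Build step: run once by the developer, outside the package's R/ code.
# (Escapes are used here only to keep this example ASCII; a build script
# that is not shipped in R/ could contain the Korean text directly.)
strings <- list(
  notes_heading = "\uc5f0\uacb0\uc7ac\ubb34\uc81c\ud45c \uc8fc\uc11d"
)
rds_path <- file.path(tempdir(), "korean_strings.rds")
saveRDS(strings, rds_path)

# Load step: in a real package this would probably sit in .onLoad() and
# locate the file with something like
#   system.file("extdata", "korean_strings.rds", package = "yourpkg")
# where "yourpkg" is a placeholder name.
pkg_strings <- readRDS(rds_path)

# Package functions then refer to the loaded strings by name.
has_notes <- function(text) {
  grepl(pkg_strings$notes_heading, text, fixed = TRUE)
}

hit  <- has_notes(paste("report:", strings$notes_heading))
miss <- has_notes("plain ASCII text")
```

The R source contains only ASCII names, and the Korean text travels inside the binary file.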
