Detect Unicode or UTF8 characters within R markdown document (not data)

owasow · April 20, 2021, 4:14am

I've recently had a few students break their ability to knit by inserting obscure UTF8 characters into their documents. I suspect they're copying and pasting text from the web and, as a result, accidentally inserting characters that are nearly undetectable to the human eye but break regular LaTeX (see the example Rmd below). The copying and pasting is one issue but, apart from that, I wonder if there is any way to "Zap Gremlins", as the Mac text editor BBEdit calls the feature. Basically, I just want to be able to see the non-ASCII characters in the document.

---
title: "Unicode Test"
output: pdf_document
---

## How to detect Unicode?

While the comma space at the end of this clause looks normal 
to the eye ->，<- it is actually a single Unicode character U+FF0C. 
It's hard to even search for and replace because the comma-space 
is a single character (try selecting it). For more, see:

https://codepoints.net/U+FF0C?lang=en

This can easily break knitting R Markdown to a pdf (unless you 
specify xelatex as the latex_engine). Locating the offending 
Unicode within RStudio is hard. Is there any kind of `find 
Unicode Gremlins` addin or option for RStudio?

DavoWW · April 20, 2021, 7:05am

Hi @owasow,
There are some useful suggestions here:
https://stackoverflow.com/questions/17291287/how-to-identify-delete-non-utf-8-characters-in-r

owasow · April 20, 2021, 3:10pm

Thank you @DavoWW. It appears, however, that those suggestions on Stack Overflow are for stripping UTF8 from imported data, not from the R Markdown document itself.

I guess one interim solution would be to treat the R Markdown files as external text file data and run a UTF8 cleaning function on the document (code below uses this approach). Ideally, though, there would be a package / add-in so that students could easily scrub their own documents within RStudio.

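# import Rmd as data (read as UTF-8 so multibyte characters stay intact)
file_to_check <- readLines("path/to/my_file.Rmd", encoding = "UTF-8")

# replace every character outside printable ASCII (space through ~) with GREMLIN;
# tabs are allowed through so they are not reported as false positives
file_checked  <- gsub('[^\t\x20-\x7E]', 'GREMLIN', file_to_check)

# identify rows with `GREMLIN`
rows_to_check <- grep("GREMLIN", file_checked, fixed = TRUE)
rows_to_check

# show the flagged lines
file_checked[rows_to_check]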
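
The approach above shows which rows contain a gremlin, but not where in the row it sits or which character it is. A minimal sketch that reports the position and code point of each non-ASCII character might look like the following. The helper name `find_gremlins` is made up for illustration and is not part of any package, and it assumes the file is UTF-8 encoded.

```r
# Hypothetical helper (not from any package): list every non-ASCII character
# in a character vector of lines, with its line, column, and code point.
# Assumes the input is (or can be converted to) UTF-8.
find_gremlins <- function(lines) {
  lines <- enc2utf8(lines)
  hits <- lapply(seq_along(lines), function(i) {
    cps <- utf8ToInt(lines[i])   # integer code points of line i
    bad <- which(cps > 127L)     # positions outside 7-bit ASCII
    if (length(bad) == 0L) return(NULL)
    data.frame(
      line      = i,
      column    = bad,
      char      = intToUtf8(cps[bad], multiple = TRUE),
      codepoint = sprintf("U+%04X", cps[bad]),
      stringsAsFactors = FALSE
    )
  })
  # One row per offending character; NULL when nothing was found
  do.call(rbind, hits)
}

# Usage: same placeholder path as in the code above
# find_gremlins(readLines("path/to/my_file.Rmd", encoding = "UTF-8"))
```

For a quick check without writing any helper, base R's `tools` package also has `showNonASCIIfile("path/to/my_file.Rmd")`, which prints the lines that contain non-ASCII characters along with their line numbers.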
