I have a dataframe of tweets that I want to tokenize using tidytext. Many of the tweets contain emojis in the text field, and many of those emojis have no space between them and the adjacent emojis/text, which makes tokenizing difficult.
example <- "My priorities Saftey First\U0001f1fa\U0001f1f8\U0001f64f What were yours?"
I would like to use str_replace_all (or another option) to add a space before each emoji, as below:
"My priorities Saftey First \U0001f1fa \U0001f1f8 \U0001f64f What were yours?"
I have tried using the following but get an error:
str_replace_all(example, "\\U", " \\U")
Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement), :
Unrecognized backslash escape sequence in pattern. (U_REGEX_BAD_ESCAPE_SEQUENCE)
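As far as I can tell, the problem is that \U0001f1fa in the string literal becomes a single character once R parses it, so there is no literal "\U" text left to match, and \U is also not a recognized escape in the ICU regex engine that stringr/stringi use (which is what U_REGEX_BAD_ESCAPE_SEQUENCE is complaining about). A quick check of that assumption (a sketch):
nchar("\U0001f1fa")
#> [1] 1
grepl("\\U", example, fixed = TRUE)
#> [1] FALSE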
Working off of this example, I also tried the approach below, but it did not seem to alter the text.
To identify the \U characters, I assume they are non-ASCII. I split the string into single characters, identify the non-ASCII ones, modify them, and then paste everything back into the original form. I assume there is a better way to do this.
library(tidyverse)
example <- "My priorities Saftey First\U0001f1fa\U0001f1f8\U0001f64f What were yours?"
example
#> [1] "My priorities Saftey First\U0001f1fa\U0001f1f8\U0001f64f What were yours?"
# split the string into single characters and flag the non-ASCII ones
splitwords <- strsplit(example, split = "")[[1]]
nonascii <- grepl("[^\001-\177]", splitwords)
# prepend a space to each flagged character, then collapse back into one string
d1 <- map2_chr(splitwords, nonascii, function(x, y) if_else(y, paste0(" ", x), x))
d1 %>% paste(collapse = "")
#> [1] "My priorities Saftey First \U0001f1fa \U0001f1f8 \U0001f64f What were yours?"