Hello! This is my first post. Thank you in advance for your time and expertise!
In the dataframe dict below, I am trying to separate the character strings in the root_letters column so that each letter appears in its own column (please click here to see the desired outcome). I am using tidyr, but I tagged stringi as well in case it is a more suitable package.
Unfortunately, there are several issues with my output:
- The biggest one is that the separate() function doesn't seem to recognize the Arabic text (see the "bad offset" warnings in the output), although it does recognize the English text. In my actual dataframe I am only using Arabic text; the English text is included here for the reprex.
- I would like the text to split from right to left, such that the first letter of the word goes in column r1, the second in r2, etc. (following Arabic text direction).
- The last letter in snails is not showing (does separate() automatically parse to n-1?)
Again, thank you for your help! Please let me know if I need to provide more information.
# Load packages
library(tidyr)
# Create sample dataframe
root_letters <- c("أ", "آب", "أباجور", "دار", "cat", "doggy", "snails")
entry <- c(1:7)
dict <- data.frame(entry, root_letters)
dict # display dataframe
#> entry root_letters
#> 1 1 أ
#> 2 2 آب
#> 3 3 أباجور
#> 4 4 دار
#> 5 5 cat
#> 6 6 doggy
#> 7 7 snails
# Separate strings ------------------------------
# Find the maximum number of letters in a root
long <- max(nchar(dict$root_letters))
# Separate function in tidyr package
dict_sep <- dict %>% separate(
  root_letters,                 # column to separate
  sep = "",                     # separate every character
  into = paste0("r", long:1),   # names of the new variables, as a character vector
  remove = FALSE,               # keep the original input column
  extra = "drop",               # drop any extra values without a warning
  fill = "left")                # fill missing values on the left
#> Warning in gregexpr(pattern, x, perl = TRUE): PCRE error
#> 'bad offset into UTF string'
#> for element 1
#> Warning in gregexpr(pattern, x, perl = TRUE): PCRE error
#> 'bad offset into UTF string'
#> for element 2
#> Warning in gregexpr(pattern, x, perl = TRUE): PCRE error
#> 'bad offset into UTF string'
#> for element 3
#> Warning in gregexpr(pattern, x, perl = TRUE): PCRE error
#> 'bad offset into UTF string'
#> for element 4
dict_sep # display outcome
#> entry root_letters r6 r5 r4 r3 r2 r1
#> 1 1 أ <NA> <NA> <NA> <NA> أ
#> 2 2 آب <NA> <NA> <NA> <NA> آب
#> 3 3 أباجور <NA> <NA> <NA> <NA> أباجور
#> 4 4 دار <NA> <NA> <NA> <NA> دار
#> 5 5 cat <NA> <NA> c a t
#> 6 6 doggy d o g g y
#> 7 7 snails s n a i l
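
For comparison, here is a minimal sketch of the per-letter split described above that does not go through separate() or a regex separator at all. It is an illustration rather than a confirmed fix: it assumes that base R's strsplit() with an empty split string handles the UTF-8 strings correctly in the current locale, and the names letters_list and padded are introduced here only for this example.

```r
# Sample data copied from the reprex above
root_letters <- c("أ", "آب", "أباجور", "دار", "cat", "doggy", "snails")
dict <- data.frame(entry = 1:7, root_letters, stringsAsFactors = FALSE)

# Split every string into single characters.
# Assumption: the session uses a UTF-8 locale, so each Arabic letter is
# treated as one character rather than as several bytes.
letters_list <- strsplit(dict$root_letters, "")

# The longest root decides how many r-columns are needed
long <- max(lengths(letters_list))

# Pad each character vector with NA on the right so that every row has
# the same length, then stack the rows into a character matrix
padded <- t(vapply(
  letters_list,
  function(x) c(x, rep(NA_character_, long - length(x))),
  character(long)
))

# r1 holds the first letter of each word, r2 the second, and so on
colnames(padded) <- paste0("r", seq_len(long))

# Attach the letter columns to the original dataframe
dict_sep <- cbind(dict, padded, stringsAsFactors = FALSE)
```

In this sketch r1 always holds the first letter of the word; if the columns should also be displayed in r6 ... r1 order, they can be reversed with padded[, long:1] before the cbind() call. Note that strsplit() splits by code point, so letters written with separate combining diacritics would end up in separate columns. If that matters, stringi::stri_split_boundaries(x, type = "character") may be worth trying instead, since it splits on character boundaries as defined by ICU.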