Just to follow up here, @Zoe_Turner, and to leave a reference for others: I tried the arabicStemR package you'd kindly suggested, and it worked like a charm!
The transliterate() function simply follows a 1-to-1 transliteration scheme to render the Arabic letters as Latin letters. I switched the fill argument of separate() back to "right", and now the letter sequencing matches up correctly.
The drawbacks are that the transliteration scheme is somewhat hard to read and that, when I tried to convert back to Arabic with the reverse.transliterate() function, it interpreted the entire column as a single string.
# Load packages
library(tidyverse)
library(arabicStemR)
# Create sample dataframe
root_letters <- c("أ", "آب", "أباجور", "دار")
entry <- 1:4
dict <- data.frame(entry, root_letters)
# Use the transliterate() function from arabicStemR to render the roots in Latin letters
dict <- dict %>%
  mutate(trans_roots = transliterate(root_letters))
dict # display dataframe
#> entry root_letters trans_roots
#> 1 1 أ a
#> 2 2 آب ab
#> 3 3 أباجور abajwr
#> 4 4 دار dar
# Separate strings courtesy of arabicStemR's transliteration -------
# Find the maximum number of letters in a root
long <- max(nchar(dict$root_letters)) + 1 # requires 1 extra so the last letter isn't lost
dict_sep <- dict %>% separate(
  trans_roots,                       # column to separate
  sep = "",                          # split between every character
  into = paste0("r", (long + 1):1),  # names of the new columns, as a character vector
  remove = FALSE,                    # keep the original input column
  extra = "drop",                    # drop any extra pieces without a warning
  fill = "right")                    # pad missing pieces with NA on the right
dict_sep # display outcome
#> entry root_letters trans_roots r8 r7 r6 r5 r4 r3 r2 r1
#> 1 1 أ a a <NA> <NA> <NA> <NA> <NA> <NA>
#> 2 2 آب ab a b <NA> <NA> <NA> <NA> <NA>
#> 3 3 أباجور abajwr a b a j w r <NA>
#> 4 4 دار dar d a r <NA> <NA> <NA> <NA>
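
For anyone who would rather not juggle the extra padding columns, here is a minimal base R sketch of the same letter-by-letter split. It is only an illustration, not part of the solution above. The transliterated strings are copied by hand from the output above instead of being produced by transliterate(), the names letters_list, max_len, padded, letter_cols and letter_df are made up for this example, and the columns are numbered left to right (r1, r2, ...) rather than in the descending order used above.

```r
# Transliterated roots, copied from the output above (no arabicStemR needed here)
trans_roots <- c("a", "ab", "abajwr", "dar")

# Split each string into single characters
letters_list <- strsplit(trans_roots, "")

# Pad every character vector with NA up to the length of the longest root
max_len <- max(lengths(letters_list))
padded <- lapply(letters_list, function(x) {
  length(x) <- max_len  # extending a vector fills the new slots with NA
  x
})

# Bind the padded vectors into a matrix, one row per root, and name the columns
letter_cols <- do.call(rbind, padded)
colnames(letter_cols) <- paste0("r", seq_len(max_len))

# Combine with the original strings into a data frame
letter_df <- data.frame(trans_roots, letter_cols, stringsAsFactors = FALSE)
letter_df # display outcome
```

Because strsplit() returns exactly one piece per character, the number of letter columns here equals the length of the longest root, so no "+ 1" adjustment is needed in this version.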