Separate letters into columns for Arabic character strings

Lizz_Huntley · December 31, 2020, 2:52pm

Just to follow up here @Zoe_Turner and to provide a future reference to others - I tried the arabicStemR package you'd kindly suggested - worked like a charm!

The transliterate() function simply follows a 1-to-1 transliteration scheme to render the Arabic letters as Latin letters. I switched the fill argument of separate() back to "right" and now the letter sequencing matches up correctly.

The drawbacks are that the transliteration scheme is somewhat hard to read and that, when I tried to convert back to Arabic with the reverse.transliterate() function, it treated the entire column as a single string.

# Load packages
library(tidyverse)
library(arabicStemR)

# Create sample dataframe
root_letters <- c("أ", "آب", "أباجور", "دار")
entry <- 1:4
dict <- data.frame(entry, root_letters)

# Use transliterate() from the arabicStemR package to render the roots in Latin letters
dict <- dict %>% 
        mutate(trans_roots = transliterate(root_letters))

dict # display dataframe
#>   entry root_letters trans_roots
#> 1     1            أ           a
#> 2     2           آب          ab
#> 3     3       أباجور      abajwr
#> 4     4          دار         dar


# Separate strings courtesy of arabicStemR's transliteration -------

# Find the maximum number of letters in a root, plus 1: splitting on ""
# yields an empty first piece, so without the extra column the last letter is lost
long <- max(nchar(dict$root_letters)) + 1

dict_sep <- dict %>% separate(
     trans_roots, # column to separate
     sep = "", # separate every character
     into = paste0("r", (long + 1):1), # names of the new columns, numbered in descending order
     remove = FALSE, # keep original input column
     extra = "drop", # drop any extra values without a warning
     fill = "right") # pad short roots with NA on the right

dict_sep # display outcome
#>   entry root_letters trans_roots r8 r7   r6   r5   r4   r3   r2   r1
#> 1     1            أ           a     a <NA> <NA> <NA> <NA> <NA> <NA>
#> 2     2           آب          ab     a    b <NA> <NA> <NA> <NA> <NA>
#> 3     3       أباجور      abajwr     a    b    a    j    w    r <NA>
#> 4     4          دار         dar     d    a    r <NA> <NA> <NA> <NA>
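
For anyone who would rather avoid the empty leading column (r8 above), here is a minimal base-R sketch of the same split. It assumes the roots have already been transliterated, so it starts from a plain character vector rather than calling arabicStemR. The names letters_list, width, padded and letters_df, and the left-to-right column names r1, r2, ..., are hypothetical and only for illustration.

```r
# Minimal sketch: split each transliterated root into letters with base R.
# The column names r1, r2, ... are hypothetical and run left to right,
# unlike the descending r8..r1 names used above.
trans_roots <- c("a", "ab", "abajwr", "dar")

# strsplit() on "" returns one element per character, with no empty first piece
letters_list <- strsplit(trans_roots, "")

# Width of the longest root, so no "+ 1" adjustment is needed
width <- max(lengths(letters_list))

# Pad every root with NA on the right, then stack the rows into a matrix
padded <- do.call(rbind, lapply(letters_list, function(x) {
  length(x) <- width
  x
}))
colnames(padded) <- paste0("r", seq_len(width))

# Attach the letter columns to the original strings
letters_df <- data.frame(trans_roots, padded)
letters_df
```

Because strsplit() does not produce the empty first piece, the number of columns equals the length of the longest root. I have not tried this on the untransliterated Arabic strings, so treat it as a starting point rather than a drop-in replacement for the separate() call.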