Separate letters into columns for Arabic character strings

Lizz_Huntley · December 28, 2020, 10:39pm

Hello! This is my first post. Thank you in advance for your time and expertise!

In the dataframe dict below, I am trying to separate the character strings in the root_letters column so that each letter will appear in its own column (please click here to see the desired outcome) using tidyr (I tagged stringi here as well in case that might be a more suitable package).

Unfortunately, there are several issues with my output:

1. The biggest one is that the separate() function doesn't seem to recognize the Arabic text (see the "bad offset" warnings in the output). However, it does recognize the English text. (My actual dataframe contains only Arabic text; the English text is included here for the reprex.)
2. I would like the text to split from right to left, such that the first letter of the word goes in column r1, the second in r2, etc. (following Arabic text direction).
3. The last letter in snails is not showing (does stringr automatically parse to n-1?).

Again, thank you for your help! Please let me know if I need to provide more information.


# Load packages
library(tidyr)

# Create sample dataframe
root_letters <- c("أ", "آب", "أباجور", "دار", "cat", "doggy", "snails")
entry <- c(1:7)

dict <- data.frame(entry, root_letters)
dict # display dataframe
#>   entry root_letters
#> 1     1            أ
#> 2     2           آب
#> 3     3       أباجور
#> 4     4          دار
#> 5     5          cat
#> 6     6        doggy
#> 7     7       snails

# Separate strings  ------------------------------
# Find the maximum number of letters in a root
long <- max(nchar(dict$root_letters))

# Separate function in tidyr package
dict_sep <- dict %>% separate(
     root_letters, # column to separate
     "", # separate every character
     into = paste0("r", long:1), # names of new variables to create as character vector,
     remove = F, # keep original input column
     extra = "drop", # drop any extra values without a warning.
     fill = "left") # fill values on the left
#> Warning in gregexpr(pattern, x, perl = TRUE): PCRE error
#>  'bad offset into UTF string'
#>  for element 1
#> Warning in gregexpr(pattern, x, perl = TRUE): PCRE error
#>  'bad offset into UTF string'
#>  for element 2
#> Warning in gregexpr(pattern, x, perl = TRUE): PCRE error
#>  'bad offset into UTF string'
#>  for element 3
#> Warning in gregexpr(pattern, x, perl = TRUE): PCRE error
#>  'bad offset into UTF string'
#>  for element 4

dict_sep # display outcome
#>   entry root_letters   r6   r5   r4   r3 r2     r1
#> 1     1            أ <NA> <NA> <NA> <NA>         أ
#> 2     2           آب <NA> <NA> <NA> <NA>        آب
#> 3     3       أباجور <NA> <NA> <NA> <NA>    أباجور
#> 4     4          دار <NA> <NA> <NA> <NA>       دار
#> 5     5          cat <NA> <NA>         c  a      t
#> 6     6        doggy         d    o    g  g      y
#> 7     7       snails         s    n    a  i      l
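
A possible explanation for the missing last letter, sketched in base R. This assumes (it is not confirmed anywhere in this thread) that splitting on the empty pattern yields an empty leading piece, which is what the blank r6 cells in the output above suggest. The variable names are illustrative only.

```r
# ASSUMPTION: separate() with sep = "" produces an empty first piece,
# as the blank r6 column for "doggy" and "snails" above suggests.
word <- "snails"
letters_only <- strsplit(word, split = "")[[1]]  # "s" "n" "a" "i" "l" "s"
pieces <- c("", letters_only)                    # simulated pieces: 7 in total

# With only nchar(word) column names, extra = "drop" discards the 7th piece.
n_columns <- nchar(word)                         # 6
kept <- pieces[seq_len(n_columns)]
kept
# "" "s" "n" "a" "i" "l"   (the final "s" is dropped)

# One extra column leaves room for the empty piece, so no letter is lost.
kept_with_extra <- pieces[seq_len(n_columns + 1)]
kept_with_extra
# "" "s" "n" "a" "i" "l" "s"
```

If that assumption holds, it is not an n-1 rule; the empty piece simply uses up one of the columns.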

Zoe_Turner · December 29, 2020, 4:37pm

Welcome! I think the solution to this is related to how the letters are encoded (Unicode); unfortunately, they are not treated as just one letter. For example, I get this output:

dict # display dataframe
# entry                                     root_letters
# 1     1                                         <U+0623>
# 2     2                                 <U+0622><U+0628>
# 3     3 <U+0623><U+0628><U+0627><U+062C><U+0648><U+0631>
# 4     4                         <U+062F><U+0627><U+0631>
# 5     5                                              cat
# 6     6                                            doggy
# 7     7                                           snails

I have the same issue with my name as I have the letter ë which some systems can't cope with and I get things addressed as Zok sometimes.

I've looked for packages that can deal with Arabic letters and this https://cran.r-project.org/web/packages/arabicStemR/arabicStemR.pdf looks useful.
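
As a small check on the encoding point, here is a sketch in base R. It uses a \u escape instead of a typed Arabic letter so that it behaves the same regardless of how the script file is saved. Each Arabic letter here is a single Unicode code point, but it takes more than one byte in UTF-8, which may be why byte-oriented matching reports a "bad offset".

```r
# "\u0623" is ARABIC LETTER ALEF WITH HAMZA ABOVE, the first entry in dict
alif_hamza <- "\u0623"

nchar(alif_hamza, type = "chars")   # 1 (one character)
nchar(alif_hamza, type = "bytes")   # 2 (two bytes in UTF-8)

# The code point behind the <U+0623> display
code_point <- utf8ToInt(alif_hamza)            # 1571
label <- sprintf("U+%04X", code_point)         # "U+0623"

# A six-letter word is six characters, even though it is twelve bytes
word <- "\u0623\u0628\u0627\u062c\u0648\u0631"
nchar(word, type = "chars")   # 6
nchar(word, type = "bytes")   # 12
```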

Lizz_Huntley · December 29, 2020, 9:36pm

Thank you for your response, Zok (jk, Zoe)! Luckily, I can get the Arabic to show up in my data.frame() output. The issue is that I can't seem to get tidyr to recognize the Arabic letters the way that it recognizes the English ones when I use the separate() function

Zoe_Turner · December 29, 2020, 9:56pm

I quite like Zok! I've had a go at resolving the tidyr thing and I also can't get it to work but I have the beginning of a workaround I think. I had to split the data into Latin and Arabic:

# Load packages
library(tidyr)
library(tidyverse)
library(stringi)

# Create sample dataframe
root_letters <- c("أ", "آب", "أباجور", "دار", "cat", "doggy", "snails")
entry <- c(1:7)

dict <- data.frame(entry, root_letters)
dict # display dataframe

# I see the format <U+...>
# entry                                     root_letters
# 1     1                                         <U+0623>
# 2     2                                 <U+0622><U+0628>
# 3     3 <U+0623><U+0628><U+0627><U+062C><U+0648><U+0631>
# 4     4                         <U+062F><U+0627><U+0631>
# 5     5                                              cat
# 6     6                                            doggy
# 7     7                                           snails

# Separate strings  ------------------------------
# Find the maximum number of letters in a root
# The Latin script requires 1 extra or the last letter is lost 

long <- max(nchar(dict$root_letters)) + 1

# This checks for Arabic and says TRUE if it is
dict <- dict %>% 
  mutate(arabic = grepl("\\p{Arabic}", root_letters, perl = TRUE))

# Starting with Latin script , no changes were needed to the separate() part
latin <- dict %>% 
  filter(arabic == FALSE) %>% 
  select(-arabic) %>% 
  separate(root_letters, # column to separate
                  "", # separate every character
                  into = paste0("r", long:1), # names of new variables to create as character vector,
                  remove = F, # keep original input column
                  extra = "drop", # drop any extra values without a warning.
                  fill = "left") # fill values on the left
         
# Arabic only works in separate() in the unicode form so mutate before and after to convert 
arabic <- dict %>% 
  filter(arabic == TRUE) %>% 
  select(-arabic) %>% 
  mutate(new_col = stri_escape_unicode(gsub("<U\\+(....)>", "\\\\u\\1", root_letters))) %>% 
  separate(new_col, # column to separate
         "\\\\u", # separate every character
         into = paste0("r", long:1), # names of new variables to create as character vector,
         remove = F, # keep original input column
         extra = "drop", # drop any extra values without a warning.
         fill = "left") %>% # fill values on the left
  mutate(another_col = paste0("<U+", r1, ">"),
         another_col_1 = stri_unescape_unicode(gsub("<U\\+(....)>", "\\\\u\\1", another_col)))

This is only the beginning as it requires going through each column/letter changing it to the unicode <U+....> format and then back to the Arabic character. Also, the Latin script reads left to right and I'm not sure if the Arabic does too. I noticed in your output requirements both needed to be right to left so that would need changing but I suspect it may do that automatically on your system if you regularly use Arabic.
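
The conversion from a hex code such as 0623 back to the Arabic character could also be done in one step with base R, which might save repeating the paste0()/gsub() pair for every column. This is only a sketch: hex_to_char() is a hypothetical helper name, not a function from tidyr or stringi, and it treats empty or missing cells as NA.

```r
# Hypothetical helper: turn hex code strings ("0623") into characters.
hex_to_char <- function(hex) {
  vapply(hex, function(h) {
    # Empty cells and NAs (from fill = "left") stay missing
    if (is.na(h) || !nzchar(h)) return(NA_character_)
    intToUtf8(strtoi(h, base = 16L))
  }, character(1), USE.NAMES = FALSE)
}

# Values like those in the r1, r2, ... columns produced by separate()
codes <- c("0623", "", NA, "0628")
converted <- hex_to_char(codes)
converted
# "\u0623" NA NA "\u0628"  (i.e. the letters alef-hamza and ba)
```

The same helper could then be applied to each r column in turn, for example with lapply() over the column names.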

I hope this helps. Perhaps it's worth putting this as an issue to the developers as a feature request, or perhaps an explanation through a vignette if we've missed something? https://github.com/tidyverse/tidyr/issues

Lizz_Huntley · December 30, 2020, 9:00pm

Hi Zoe - thank you for taking a stab at this! I'm very grateful for your time and assistance. I may have accidentally made your task more difficult by including Latin letters (my actual data file, an Arabic dictionary, only has Arabic letters - I included the Latin letters here to try and pinpoint if the problem was in my code or in the letters).

I tried your code with a modified, Arabic-only version of the sample dataframe (see below, with notes added to help me parse your code). Just wanted to let you know that:

Your code does indeed return the correct Arabic rendering of the letter.

As you suspected, the separate() function does indeed parse from left to right, such that it interprets the last letter in the Arabic word as the first. This is a bit trickier, as it means that the letter sequences (first, second) never line up properly. I will keep mulling on this, and likely repost to GitHub at your suggestion.

# Load packages
library(tidyr)
library(tidyverse)
library(stringi)

# Create sample dataframe
#root_letters <- c("أ", "آب", "أباجور", "دار", "cat", "doggy", "snails")
#entry <- c(1:7)
root_letters <- c("أ", "آب", "أباجور", "دار")
entry <- c(1:4)

dict <- data.frame(entry, root_letters)
dict # display dataframe
#>   entry root_letters
#> 1     1            أ
#> 2     2           آب
#> 3     3       أباجور
#> 4     4          دار

# Separate strings  (Zok's way) ------------------------------
# Find the maximum number of letters in a root
long <- max(nchar(dict$root_letters)) + 1 # requires 1 extra so the last letter isn't lost

dict_sep <- dict %>% 
        mutate(new_col = # create new column for converted Arabic text
                       stri_escape_unicode( # escapes all Unicode code points
                gsub("<U\\+(....)>", "\\\\u\\1", # perform replacement of all matches (essentially, convert Arabic text)
                     root_letters))) %>% 
        separate(new_col, # column to separate
                 "\\\\u", # separate every character
                 into = paste0("r", long:1), # names of new variables to create as character vector,
                 remove = F, # keep original input column
                 extra = "drop", # drop any extra values without a warning.
                 fill = "left") %>%  # fill values on the left
        mutate(another_col = # create a new column
                       paste0("<U+", r1, ">"), # convert values in column r1?
               another_col_1 = # create a new column
                       stri_unescape_unicode( # unescape unique points
                               gsub("<U\\+(....)>", "\\\\u\\1", # convert back to Arabic
                                    another_col))) %>%
        mutate(another_col = paste0("<U+", r2, ">"),  # repeat for remaining columns
               another_col_2 = 
                       stri_unescape_unicode(gsub("<U\\+(....)>", "\\\\u\\1", another_col))) %>%
        mutate(another_col = paste0("<U+", r3, ">"), 
               another_col_3 = 
                       stri_unescape_unicode(gsub("<U\\+(....)>", "\\\\u\\1", another_col))) %>%
        mutate(another_col = paste0("<U+", r4, ">"), 
               another_col_4 = 
                       stri_unescape_unicode(gsub("<U\\+(....)>", "\\\\u\\1", another_col))) %>%
        mutate(another_col = paste0("<U+", r5, ">"), 
               another_col_5 = 
                       stri_unescape_unicode(gsub("<U\\+(....)>", "\\\\u\\1", another_col))) %>%
        mutate(another_col = paste0("<U+", r6, ">"), 
               another_col_6 = 
                       stri_unescape_unicode(gsub("<U\\+(....)>", "\\\\u\\1", another_col)))

dict_sep # print output
#>   entry root_letters                                    new_col   r7   r6   r5
#> 1     1            أ                                    \\u0623 <NA> <NA> <NA>
#> 2     2           آب                             \\u0622\\u0628 <NA> <NA> <NA>
#> 3     3       أباجور \\u0623\\u0628\\u0627\\u062c\\u0648\\u0631      0623 0628
#> 4     4          دار                      \\u062f\\u0627\\u0631 <NA> <NA> <NA>
#>     r4   r3   r2   r1 another_col another_col_1 another_col_2 another_col_3
#> 1 <NA> <NA>      0623      <U+NA>             أ          <U+>        <U+NA>
#> 2 <NA>      0622 0628      <U+NA>             ب             آ          <U+>
#> 3 0627 062c 0648 0631    <U+0623>             ر             و             ج
#> 4      062f 0627 0631      <U+NA>             ر             ا             د
#>   another_col_4 another_col_5 another_col_6
#> 1        <U+NA>        <U+NA>        <U+NA>
#> 2        <U+NA>        <U+NA>        <U+NA>
#> 3             ا             ب             أ
#> 4          <U+>        <U+NA>        <U+NA>

Lizz_Huntley · December 31, 2020, 2:52pm

Just to follow up here @Zoe_Turner and to provide a future reference to others - I tried the arabicStemR package you'd kindly suggested - worked like a charm!

The transliterate() function simply follows a 1-to-1 transliteration scheme to render the Arabic letters as Latin letters. I switched the fill argument of separate() back to "right" and now the letter sequencing matches up correctly.

Drawbacks here are that the transliteration scheme is somewhat hard to read, and when I tried to revert to Arabic through the reverse.transliterate() function, it interpreted the entire column as a single string.

# Load packages
library(tidyverse)
library(arabicStemR)

# Create sample dataframe
root_letters <- c("أ", "آب", "أباجور", "دار")
entry <- c(1:4)
dict <- data.frame(entry, root_letters)

# Use transliterate() function in arabicStemR package to render the Arabic letters as Latin letters
dict <- dict %>% 
        mutate(trans_roots = transliterate(root_letters))

dict # display dataframe
#>   entry root_letters trans_roots
#> 1     1            أ           a
#> 2     2           آب          ab
#> 3     3       أباجور      abajwr
#> 4     4          دار         dar


# Separate strings courtesy of arabicStemR's transliteration -------

# Find the maximum number of letters in a root
long <- max(nchar(dict$root_letters)) + 1 # requires 1 extra so the last letter isn't lost

dict_sep <- dict %>% separate(
     trans_roots, # column to separate
     "", # separate every character
     into = paste0("r", (long + 1):1), # names of new variables to create as character vector,
     remove = F, # keep original input column
     extra = "drop", # drop any extra values without a warning.
     fill = "right") # fill values on the right

dict_sep # display outcome
#>   entry root_letters trans_roots r8 r7   r6   r5   r4   r3   r2   r1
#> 1     1            أ           a     a <NA> <NA> <NA> <NA> <NA> <NA>
#> 2     2           آب          ab     a    b <NA> <NA> <NA> <NA> <NA>
#> 3     3       أباجور      abajwr     a    b    a    j    w    r <NA>
#> 4     4          دار         dar     d    a    r <NA> <NA> <NA> <NA>

AlexisW · December 31, 2020, 9:15pm

Using a specialized library is probably best, but just in case, it does seem to work using base R and a somewhat manual approach:

ind_chars <- strsplit(dict$root_letters, split = "")
max_long <- max(sapply(ind_chars, length))
filled_chars <- lapply(ind_chars,
                       function(x) rev(c(rep(NA, max_long - length(x)), x)))
do.call(rbind, filled_chars)
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] "أ"  NA   NA   NA   NA   NA  
# [2,] "ب"  "آ"  NA   NA   NA   NA  
# [3,] "ر"  "و"  "ج"  "ا"  "ب"  "أ" 
# [4,] "ر"  "ا"  "د"  NA   NA   NA  
# [5,] "t"  "a"  "c"  NA   NA   NA  
# [6,] "y"  "g"  "g"  "o"  "d"  NA  
# [7,] "s"  "l"  "i"  "a"  "n"  "s"

(it also works with stringr functions, the problem is that separate() seems to call gregexpr() with perl=TRUE,useBytes=FALSE):

library(stringr)
str_split_fixed(dict$root_letters, "", max_long)
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] "أ"  ""   ""   ""   ""   ""  
# [2,] "آ"  "ب"  ""   ""   ""   ""  
# [3,] "أ"  "ب"  "ا"  "ج"  "و"  "ر" 
# [4,] "د"  "ا"  "ر"  ""   ""   ""  
# [5,] "c"  "a"  "t"  ""   ""   ""  
# [6,] "d"  "o"  "g"  "g"  "y"  ""  
# [7,] "s"  "n"  "a"  "i"  "l"  "s"
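
To get from the matrix back to something like the requested layout (first letter in r1, second in r2, and so on), the base R approach could be extended roughly as follows. This is a sketch under a few assumptions: the sample data are a shortened version of dict written with \u escapes, padding goes on the right, and the names padded, letters_mat and dict_sep are illustrative.

```r
# Shortened sample data, written with \u escapes to avoid encoding problems
root_letters <- c("\u0623", "\u0622\u0628", "\u062f\u0627\u0631")
dict <- data.frame(entry = seq_along(root_letters), root_letters,
                   stringsAsFactors = FALSE)

# Split every string into its characters (stored first-to-last)
chars <- strsplit(dict$root_letters, split = "")
max_long <- max(lengths(chars))

# Pad on the right with NA so that column 1 always holds the first letter
padded <- lapply(chars, function(x) {
  c(x, rep(NA_character_, max_long - length(x)))
})
letters_mat <- do.call(rbind, padded)
colnames(letters_mat) <- paste0("r", seq_len(max_long))

# Attach the letter columns to the original data frame
dict_sep <- cbind(dict, as.data.frame(letters_mat, stringsAsFactors = FALSE))
```

Here dict_sep$r1 holds the first letter of each word and shorter words get NA in the later columns.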

Zoe_Turner · January 4, 2021, 9:00am

That's a nice solution! Glad the package recommendation worked out for some of it and I really appreciate the follow up as it's so lovely to see.

Lizz_Huntley · January 6, 2021, 4:07pm

Hi @AlexisW - thank you so much for taking a stab at this! It took me a while to work through your code and figure out what you were doing, but I really like the fact that your solution doesn't require transliterating the Arabic characters back and forth. My actual dataframe only has Arabic strings, so I am reproducing a simplified version here

Regarding your first solution, I played around with it a bit. If you take out the rev() function in the user-designed function within lapply() at Step 3, you actually get the letters parsed correctly from right-to-left (woo hoo!)

# Create sample dataframe
root_letters <- c("أ", "آب", "أباجور", "دار")
entry <- c(1:4)
dict <- data.frame(entry, root_letters)
dict # display dataframe
#>   entry root_letters
#> 1     1            أ
#> 2     2           آب
#> 3     3       أباجور
#> 4     4          دار

# Step 1: split character strings of the root_letters column into substrings using strsplit() in base R
ind_chars <- strsplit( # create a list of vectors of split character strings
        dict$root_letters, # from this column
        split = "") # split at every character


# Step 2: determine length of substrings to find length of longest substring

# sapply(): applies a function (either built-in or user-defined) to input (list, vector or data frame) and returns a vector or a matrix
max_long <- max(sapply( # find the maximum value from the vector created by...
                ind_chars, # taking the list of split character strings...
                length)) # and finding the length of each element (built-in function)


# Step 3 (original): ensure all substrings are the same length by filling in the empty elements with NA

## lapply(): applies a function (either built-in or user-defined) to input (list, vector or data frame) and returns a list object
filled_chars <- lapply(ind_chars, # Apply a function to all the elements of the input
                       function(x) 
                               rev( # reverse elements in the output
                                   c(rep(NA, max_long - length(x)), x))) # make vectors equal length by replacing remaining elements with NA

# Step 4 (original) : turn the filled in list into a matrix
dict1 <- do.call(rbind, filled_chars)
dict1 # print output
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] "أ"  NA   NA   NA   NA   NA  
#> [2,] "ب"  "آ"  NA   NA   NA   NA  
#> [3,] "ر"  "و"  "ج"  "ا"  "ب"  "أ" 
#> [4,] "ر"  "ا"  "د"  NA   NA   NA

# Step 3 (without rev): ensure all substrings are the same length by filling in the empty elements with NA
filled_chars2 <- lapply(ind_chars, function(x) c(rep(NA, max_long - length(x)), x))

# Step 4 (without rev) : turn the filled in list into a matrix
dict2 <- do.call(rbind, filled_chars2)
dict2 # print output
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] NA   NA   NA   NA   NA   "أ" 
#> [2,] NA   NA   NA   NA   "آ"  "ب" 
#> [3,] "أ"  "ب"  "ا"  "ج"  "و"  "ر" 
#> [4,] NA   NA   NA   "د"  "ا"  "ر"

As for your second solution, I'm somewhat puzzled by the output. When I click to view the matrix in the source pane the letters are (mostly) appropriately parsed (R recognizes that the right-most letter is the beginning of the word, although the letters are still split from left-to-right [the direction of split can be reversed by simply reordering the columns, so this isn't a problem]).

Strangely, however, when I print the output the parsing order changes: R incorrectly interprets the left-most letter as the beginning of the word. I'm not sure how to illustrate my source pane in a reprexable way, so I will just describe it instead:

# Separate strings - AlexisW's 2nd way ----
# Load packages
library(tidyverse)

dict3 <- stringr::str_split_fixed(dict$root_letters, "", max_long)
# When I click on dict3 in the environment pane to view it in the source pane, the parsing is correct

dict3 # when I print the output, the parsing order has been reversed (strings are matched by last letter, not by first)
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] "أ"  ""   ""   ""   ""   ""  
#> [2,] "آ"  "ب"  ""   ""   ""   ""  
#> [3,] "أ"  "ب"  "ا"  "ج"  "و"  "ر" 
#> [4,] "د"  "ا"  "ر"  ""   ""   ""
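
One way to check which letter R itself treats as the first, independent of how a viewer or console draws right-to-left text, is to index into the string directly. The sketch below uses only base R and \u escapes; it suggests that the stored order is first-letter-first and that only the display differs.

```r
# The word from row 4 of dict: dal (U+062F), alef (U+0627), ra (U+0631)
word <- "\u062f\u0627\u0631"

# Position 1 is the first letter of the word as it is read,
# even though it is drawn at the right-hand edge on screen.
first_letter <- substr(word, 1, 1)
last_letter  <- substr(word, nchar(word), nchar(word))

# Code points in stored order
code_points <- sprintf("U+%04X", utf8ToInt(word))
code_points
# "U+062F" "U+0627" "U+0631"
```

So column 1 of the matrix holding the first letter is consistent with both views; the printed matrix just lays its columns out left to right.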

Lizz_Huntley · January 13, 2021, 5:33pm

For the poor soul who stumbles across this post in search of a solution x months from now, here it is: using the function stri_split_boundaries() in the stringi package:

# Load packages
library(stringi)

# Create sample dataframe
root_letters <- c("أ", "آب", "أباجور", "دار")
entry <- c(1:4)
dict <- data.frame(entry, root_letters)

# Separate using stringi
dict_sep <- stri_split_boundaries(dict$root_letters, 
                                  type = "character",
                                  tokens_only = T, simplify = T)

The columns then need to be reordered, but what is important is that the Arabic words have been correctly parsed such that R correctly identifies the right-most letter as the first!
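
For completeness, a sketch of the reordering step, assuming the stringi package is installed. The sample data are a shortened version of dict written with \u escapes, and the names letters_mat and reversed are illustrative. Empty padding cells are converted to NA by hand here.

```r
library(stringi)

# Shortened sample data, written with \u escapes
root_letters <- c("\u0623", "\u0622\u0628", "\u062f\u0627\u0631")

# One column per character; shorter words are padded with ""
letters_mat <- stri_split_boundaries(root_letters,
                                     type = "character",
                                     simplify = TRUE)

# Mark the padding as missing and name the columns r1, r2, ...
letters_mat[letters_mat == ""] <- NA
colnames(letters_mat) <- paste0("r", seq_len(ncol(letters_mat)))

# Reorder so the columns run r3, r2, r1 (first letter right-most)
reversed <- letters_mat[, rev(colnames(letters_mat)), drop = FALSE]
```

Either matrix can then be joined to dict with cbind(), depending on which column order is easier to read.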
