Split a string in a dataframe column

Tazinho · May 4, 2018, 1:24pm

Within the tidyverse you can approach this problem via stringr::str_split() or tidyr::separate(). (I chose the latter, because it's the obvious high level solution).
It is also advantageous to know a little bit about regular expressions, specifically look aheads, see here for example. Also the stringr cheatsheet is a good place to get an overview.

I modified your test data a bit since it is not clear to me, if the length of your substrings is always one, if it is possible that no underscore will appear in your data or in case of one underscore one side might be "empty".

The regex used in the following solution matches every last underscore in a string that might be followed by a group of characters which doesn't contain an underscore.

Note that a warning is thrown, because no underscore is matched in the first row and hence there is nothing to separate. Also note that in case of one underscore in connection with an "empty side" around the underscore you get an empty string "" in the regarding new column.

suppressMessages(library(tidyverse))

df <- tibble(
  x = c(
    "a",
    "a_",
    "_a",
    "a_b",
    "a_b_c",
    "a_bk_c_d",
    "ak_b_l",
    "a_b_q_r_s_hk"
  )
)

df %>% separate(x, "_(?=[^_]*$)",
                into = c("the_rest", "last_value"), 
                remove = FALSE)
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [1].
#> # A tibble: 8 x 3
#>   x            the_rest  last_value
#>   <chr>        <chr>     <chr>     
#> 1 a            a         <NA>      
#> 2 a_           a         ""        
#> 3 _a           ""        a         
#> 4 a_b          a         b         
#> 5 a_b_c        a_b       c         
#> 6 a_bk_c_d     a_bk_c    d         
#> 7 ak_b_l       ak_b      l         
#> 8 a_b_q_r_s_hk a_b_q_r_s hk

Created on 2018-05-04 by the reprex package (v0.2.0).