Access big string variable and extract value using various delimiter

Hello everyone
I have a very particular problem with my dataset. One variable is a long string that is separated by a comma and then further by a semicolon. The problem I have is to find a particular key and extract the respective value.

Minimum Example:

code <- c("a;10,b;20,c;30","b;20,c;30","c;30,b;40","a;40,b;50")
var_1 <- c(1,4,7,10)
var_2 <- c(2,5,8,11)
df <- data.frame(code,var_1,var_2)

The String can vary in size (i.e,. there could be one with a,b,c,d,e,f in there). Let's say the codes that I want to find is "a" and "b" and I want to extract the value after the semicolon as value into a new column. In the case there are multiple keys, I want to take the sum of them. I already have success finding if a key even exists in the string by using the grepl command.

The final result should be something like this:

solution <- c(30,20,40,90)
df <- data.frame(df, solution)

Thanks!

library(tidyverse)

decode <- function(code) {
  # return a list of data frames for the decoded string vector code
  code %>% 
    str_split(",") %>% 
    map(str_split, ";", simplify = TRUE) %>% 
    map(as.data.frame) %>% 
    map(set_names, c("name", "value")) %>%
    map(mutate, value = as.numeric(value))
}

sum_these <- function(df1, to_sum1) {
  # for one data frame of name, value (df1), sum the values for name in to_sum1
  df1 %>% filter(name %in% to_sum1) %>% pull(value) %>% as.numeric() %>% sum()
}

code <- c("a;10,b;20,c;30","b;20,c;30","c;30,b;40","a;40,b;50")
var_1 <- c(1,4,7,10)
var_2 <- c(2,5,8,11)

df <- tibble(
  code, var_1, var_2, 
  val = decode(code), 
  to_sum = list(c("a", "b")), 
  result = map2_dbl(val, to_sum, sum_these)
)

print(df)
#> # A tibble: 4 x 6
#>   code           var_1 var_2 val          to_sum    result
#>   <chr>          <dbl> <dbl> <list>       <list>     <dbl>
#> 1 a;10,b;20,c;30     1     2 <df [3 x 2]> <chr [2]>     30
#> 2 b;20,c;30          4     5 <df [2 x 2]> <chr [2]>     20
#> 3 c;30,b;40          7     8 <df [2 x 2]> <chr [2]>     40
#> 4 a;40,b;50         10    11 <df [2 x 2]> <chr [2]>     90

Created on 2022-01-31 by the reprex package (v2.0.1)

1 Like
df <- data.frame(
  var_1 = c(1, 4, 7, 10),
  var_2 = c(2, 5, 8, 11),
  code = c("a;10,b;20,c;30",
                     "b;20,c;30","c;30,b;40","a;40,b;50")
)

library(tidyverse)


mlen <- max(str_count(df$code,","))
df1 <- df %>% separate(col="code",into=paste0("v_",seq_len(mlen+1)),
                                            sep=",")
df2 <- map_dfc(names(df1)[startsWith(names(df1),"v_")],
           ~separate(select(df1,.x),col=.x,
                     into = paste0(.x,letters[1:2]),
                     sep = ";"))

(index_df <- df2 %>% select(ends_with("a")) %>% as.matrix)
(value_df <- df2 %>% select(ends_with("b")) %>% mutate_all(as.integer) %>% as.matrix)

#remove values not related to the index entries we are matching on 
value_df[! index_df %in% c("a","b")] <- 0
df$values <- rowSums(value_df)
df
1 Like

Thank you for posting this. Super weird! When I copy&paste exactly your code (with fresh environment), I get a different output (the "result" is wrong).

library(tidyverse)

decode <- function(code) {
  # return a list of data frames for the decoded string vector code
  code %>% 
    str_split(",") %>% 
    map(str_split, ";", simplify = TRUE) %>% 
    map(as.data.frame) %>% 
    map(set_names, c("name", "value")) %>%
    map(mutate, value = as.numeric(value))
}

sum_these <- function(df1, to_sum1) {
  # for one data frame of name, value (df1), sum the values for name in to_sum1
  df1 %>% filter(name %in% to_sum1) %>% pull(value) %>% as.numeric() %>% sum()
}

code <- c("a;10,b;20,c;30","b;20,c;30","c;30,b;40","a;40,b;50")
var_1 <- c(1,4,7,10)
var_2 <- c(2,5,8,11)

df <- tibble(
  code, var_1, var_2, 
  val = decode(code), 
  to_sum = list(c("a", "b")), 
  result = map2_dbl(val, to_sum, sum_these)
)

print(df)
# A tibble: 4 × 6
  code           var_1 var_2 val          to_sum    result
  <chr>          <dbl> <dbl> <list>       <list>     <dbl>
1 a;10,b;20,c;30     1     2 <df [3 × 2]> <chr [2]>      3
2 b;20,c;30          4     5 <df [2 × 2]> <chr [2]>      1
3 c;30,b;40          7     8 <df [2 × 2]> <chr [2]>      2
4 a;40,b;50         10    11 <df [2 × 2]> <chr [2]>      3

I updated all packages. Do you have an idea ?

Is there a conflict with set_names? You need set_names from rlang and not magrittr. Maybe restart R and try again.

Can also print(df$val[[1]]). Should have columns name and value.

I tried restarting a few times. Also, I tried to alter the code and include rlang::set_names. The print(df$val[[1]]) gives the following output.

print(df$val[[1]])
  name value
1    a     1
2    b     2
3    c     3

That's weird. value should be 10, 20, 30.

I'm stumped!

I tried it on my windows machine and get the correct output there. I need to look into this.

as.data.frame will be subject to defaults relating to interpreting stringsAsFactors (or not).
this can be forced like so

decode <- function(code) {
  # return a list of data frames for the decoded string vector code
  code %>% 
    str_split(",") %>% 
    map(str_split, ";", simplify = TRUE) %>% 
    map(as.data.frame) %>% 
    map(set_names, c("name", "value")) %>%
    map(~mutate_all(.,as.character)) %>%
    map(mutate, value = as.numeric(value))
}
2 Likes

This solved the issue! I was even deleting my whole packages folder and only installing tidyverse (weirdly it also did not work).

But this code solves it!
Thanks @arthur.t @nirgrahamuk

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.