I have a data of multiple sentences like "here are your pins:[p-123],[sp123] and [sp124]". I would like to extract strings and numerals between square brackets [....] and separate them in to different columns according to the initials (e.g., 'sp' or 'p-').
here are the example data:
res=data.frame(sms=c('here are your pins:[p-123],[sp123] and [sp124]','here are your pins:[p-128],[p-129] and [sp125]'),p=c('p-123','p-128,p-129'),sp=c('sp124,sp125'))
column 'sms' is the original text and column 'p' and 'sp' are the desired results.
I tried to use str_extract with somethings like "(?<=\().*(?=\))" but I cannot get rid of 'and '. Any suggests to get the desired results are appreciated.
I believe the following gets to your intended outcome.
library(tidyverse)
d = data.frame(
sms = c('here are your pins:[p-123],[sp123] and [sp124]',
'here are your pins:[p-128],[p-129] and [sp125]')
)
out = d %>%
# extract content between square brackets
mutate(content = str_extract_all(sms, '(?<=\\[)(.*?)(?=\\])')) %>%
# create one row for each piece of content identified
unnest(content) %>%
# classify each row based on content initials
mutate(initials = ifelse(substr(content,1,2) == 'p-', 'p', 'sp')) %>%
# paste and collapse content for each sms/initials group
group_by(sms, initials) %>%
mutate(output = paste(content, collapse = ',')) %>%
ungroup() %>%
# eliminate the content column and take distinct
select(-content) %>%
distinct() %>%
# widen by pivoting on initials (creates 'p' and 'sp' columns)
pivot_wider(names_from = initials, values_from = output)
out
#> # A tibble: 2 × 3
#> sms p sp
#> <chr> <chr> <chr>
#> 1 here are your pins:[p-123],[sp123] and [sp124] p-123 sp123,sp124
#> 2 here are your pins:[p-128],[p-129] and [sp125] p-128,p-129 sp125