Complex Categorisation issue. Is Grepl the answer?

xvalda · February 4, 2022, 1:50pm

Hi there,

I would suggest an alternative approach using tidyverse functions:

Create your example data. (For future posts, please include example data yourself, as it makes it much easier for the community to help you. See this link: FAQ: How to do a minimal reproducible example ( reprex ) for beginners.)

library(tidyverse)
data <- tibble(reference = c("US2", "L1_US24", "US2_0", "US24", "US245", "US245", "US24 L", "US3"))

Create a dictionary of sorts:

#I assume from your example that ...
#everything that contains US2 or variations thereof (excluding any digit directly after US2) is category 1
#everything that contains US24 ... is category 2
#everything that contains US245 ... is category 3
#everything that contains US3 ... is category 5
# ... you can add more of course
reference_dictionary <- tibble(short_reference = c("US2", "US24", "US245", "US3"), category = c(1, 2, 3, 5))

In your data frame, create a column that extracts the short reference. I named it USX_reference here; I could have called it short_reference, as in the reference_dictionary object, but I wanted to show the join syntax for when the column names differ. Then left_join the two data frames and remove the USX_reference column, which is no longer needed.

data %>% mutate(USX_reference = str_extract(reference, "US\\d+")) %>% 
  left_join(reference_dictionary, by = c("USX_reference" = "short_reference")) %>% 
  select(-USX_reference)
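
To see why the extraction step works, here is a minimal sketch that inspects the intermediate values. The names `extracted` and `result` are only illustrative and are not part of the code above. The pattern "US\\d+" matches "US" followed by one or more digits, so the match stops at the first non-digit character such as "_" or a space.

```r
library(tidyverse)

data <- tibble(reference = c("US2", "L1_US24", "US2_0", "US24", "US245", "US245", "US24 L", "US3"))
reference_dictionary <- tibble(short_reference = c("US2", "US24", "US245", "US3"), category = c(1, 2, 3, 5))

# Look at what the regex pulls out of each reference on its own
extracted <- str_extract(data$reference, "US\\d+")
extracted
# "US2" "US24" "US2" "US24" "US245" "US245" "US24" "US3"

# Same pipeline as above, stored in an object so it can be inspected
result <- data %>% 
  mutate(USX_reference = str_extract(reference, "US\\d+")) %>% 
  left_join(reference_dictionary, by = c("USX_reference" = "short_reference")) %>% 
  select(-USX_reference)

result$category
# 1 2 1 2 3 3 2 5
```

Because the join is on the exact extracted string, "US24" is never confused with "US2" or "US245", which is the difficulty you would run into with grepl on partial matches.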

I think this is a more maintainable way to work: to add a new category, you only need to add a row to the reference_dictionary object, and the rest of the code stays the same.
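
As a rough sketch of what extending the dictionary could look like: the code "US4", its category 6, and the `new_data` references below are made up for illustration and do not come from your example. Any reference with no dictionary entry ends up with NA in category after the left_join, which gives you a simple way to spot codes that still need a row.

```r
library(tidyverse)

reference_dictionary <- tibble(short_reference = c("US2", "US24", "US245", "US3"), category = c(1, 2, 3, 5))

# Hypothetical new entry: "US4" and category 6 are assumptions for this example
reference_dictionary <- reference_dictionary %>% 
  add_row(short_reference = "US4", category = 6)

# Hypothetical new data: "US9" deliberately has no dictionary entry
new_data <- tibble(reference = c("US4_1", "US9"))

categorised <- new_data %>% 
  mutate(USX_reference = str_extract(reference, "US\\d+")) %>% 
  left_join(reference_dictionary, by = c("USX_reference" = "short_reference")) %>% 
  select(-USX_reference)

# References that did not match any dictionary row have category NA
unmatched <- categorised %>% filter(is.na(category))
unmatched
```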

Hope it helps.