Complex Categorisation issue. Is Grepl the answer?

kezzer · February 4, 2022, 12:19pm

Hi there.

I've got a large dataset (which grows every day) where I need to create a categorisation column based on the values in another column. An example of the dataset is as follows:

  v1 
1  US2  
2  L1_US24  
3  US2_0   
4  US24
5  US245
6  US245
7  US24 L
8  US3

What I would like to do is be able to explicitly declare what is contained within the cell to be entered in the categorisation column. I'd like to achieve a result like this:

  v1          Cat 
1  US2         1  
2  L1_US24     2  
3  US2_0       1  
4  US24        2  
5  US245       3  
6  US245       3
7  US24 L      2
8  US3         5

In an ideal world, grepl would do it for me (I realise this isn't the how grepl works exactly):

Cat <- if(grepl("US24", df$v1) "2")

If anyone has any advice I'd be greatly appreciative

phiggins · February 4, 2022, 12:44pm

This seems like a nice opportunity to use case_when()

df %>%
mutate(Cat = case_when(
v1 contains("US245") ~ 3,
v1 contains("US24") ~ 2,
v1 contains("US2") ~ 1,
v1 contains("US3") ~ 5)

pathos · February 4, 2022, 1:22pm

Seems like something is missing in between v1 and contains(), right?

kezzer · February 4, 2022, 1:45pm

Seems like we're heading in the right direction, though I keep getting an "error unexpected '='" on the "mutate(Cat = case_when" line

xvalda · February 4, 2022, 1:50pm

Hi there,

I would suggest an alternative approach using tidyverse functions:

Create your example data (note: for your future postings, best would be that you actually provide example data, this way it makes it easier for the community to help you, see this link: FAQ: How to do a minimal reproducible example ( reprex ) for beginners)

library(tidyverse)
data <- tibble(reference = c("US2", "L1_US24", "US2_0", "US24", "US245", "US245", "US24 L", "US3"))

Create a dictionary of sort,

#I assume from your example that ...
#everything that contains US2 or variations thereof (excluding any digit directly after US2) is category 1
#everything that contains US24 ... is category 2
#everything that contains US245 ... is category 3
#everything that contains US3 ... is category 5
# ... you can add more of course
reference_dictionary <- tibble(short_reference = c("US2", "US24", "US245", "US3"), category = c(1, 2, 3, 5))

In your dataframe, create column that extracts the "short_reference" (I named it USX_reference in this case, I could have called it short_reference like in the reference_dictionary object but I wanted to show the structure should you have differing column names), then a left_join between the two dataframes and remove the USX_reference column that is not necessary any longer.

data %>% mutate(USX_reference = str_extract(reference, "US\\d+")) %>% 
  left_join(reference_dictionary, by = c("USX_reference" = "short_reference")) %>% 
  select(-USX_reference)

I think this is a more sustainable way to operate to add any potential new categories in your reference_dictionary object.

Hope it helps.

phiggins · February 8, 2022, 5:21pm

Had to work this one out to find the problem.
Works better with str_detect. Not enough caffeine, I am afraid.

data %>%
mutate(cat = case_when(str_detect(v1, "US245") ~3,
str_detect(v1, "US24") ~2,
str_detect(v1, "US2") ~1,
str_detect(v1, "US3") ~5))

system · February 22, 2022, 11:29am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.