I've got a large dataset (which grows every day) where I need to create a categorisation column based on the values in another column. An example of the dataset is as follows:
What I would like to do is be able to explicitly declare what is contained within the cell to be entered in the categorisation column. I'd like to achieve a result like this:
#I assume from your example that ...
#everything that contains US2 or variations thereof (excluding any digit directly after US2) is category 1
#everything that contains US24 ... is category 2
#everything that contains US245 ... is category 3
#everything that contains US3 ... is category 5
# ... you can add more of course
reference_dictionary <- tibble(short_reference = c("US2", "US24", "US245", "US3"), category = c(1, 2, 3, 5))
In your dataframe, create column that extracts the "short_reference" (I named it USX_reference in this case, I could have called it short_reference like in the reference_dictionary object but I wanted to show the structure should you have differing column names), then a left_join between the two dataframes and remove the USX_reference column that is not necessary any longer.
data %>% mutate(USX_reference = str_extract(reference, "US\\d+")) %>%
left_join(reference_dictionary, by = c("USX_reference" = "short_reference")) %>%
select(-USX_reference)
I think this is a more sustainable way to operate to add any potential new categories in your reference_dictionary object.