How to use str_detect and case_when in R

jokoade · August 31, 2022, 11:02am

Hello everyone, could you help me convert this function to R? I want to match a text body against a list of keywords and then label it with the matching category. Here is the function in Excel:

https://exceljet.net/formula/categorize-text-with-keywords

scottyd22 · August 31, 2022, 12:14pm

The solution below does not use str_detect and case_when, but it does categorize the expenses as in the example. The approach takes each expense text string, separates it into a row for each word, joins the categories, and then keeps those rows with a match.

library(tidyverse)

expenses = data.frame(
  Expense = c('DEBIT PURCHASE AT SHELL', 'NETFLIX Payment', 'MERCHANT KROGER', 'CENTRAL WATER PAYMENT')
  )

expenses
#>                   Expense
#> 1 DEBIT PURCHASE AT SHELL
#> 2         NETFLIX Payment
#> 3         MERCHANT KROGER
#> 4   CENTRAL WATER PAYMENT

categories = data.frame(
  Keyword = c('chevron', 'costco', 'kroger', 'netflix', 'shell', 'water'),
  Category = c('Auto', 'Groceries', 'Groceries', 'Entertainment', 'Auto', 'Utilities')
)

categories
#>   Keyword      Category
#> 1 chevron          Auto
#> 2  costco     Groceries
#> 3  kroger     Groceries
#> 4 netflix Entertainment
#> 5   shell          Auto
#> 6   water     Utilities

# categorize the expenses
out = expenses %>%
  mutate(Keyword = Expense) %>%
  separate_rows(Keyword, sep = ' ') %>%
  # make lowercase to match the keywords in the categories data frame
  mutate(Keyword = tolower(Keyword)) %>%
  left_join(categories) %>%
  filter(!is.na(Category)) %>%
  select(-Keyword)
#> Joining, by = "Keyword"

out
#> # A tibble: 4 × 2
#>   Expense                 Category     
#>   <chr>                   <chr>        
#> 1 DEBIT PURCHASE AT SHELL Auto         
#> 2 NETFLIX Payment         Entertainment
#> 3 MERCHANT KROGER         Groceries    
#> 4 CENTRAL WATER PAYMENT   Utilities

Created on 2022-08-31 with reprex v2.0.2.9000
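Because the approach above splits each expense on spaces, a keyword only matches when it is a whole word. If substring matching is wanted instead, one possible sketch uses str_detect against the lookup table directly. The helper name `categorize` and the extra "ATM WITHDRAWAL" row are made up for illustration, and the keywords are treated as regular expressions here, which is only safe because they are plain lowercase words.

```r
library(stringr)

expenses <- data.frame(
  Expense = c("DEBIT PURCHASE AT SHELL", "NETFLIX Payment",
              "MERCHANT KROGER", "CENTRAL WATER PAYMENT",
              "ATM WITHDRAWAL")  # extra row with no keyword, for illustration
)

categories <- data.frame(
  Keyword  = c("chevron", "costco", "kroger", "netflix", "shell", "water"),
  Category = c("Auto", "Groceries", "Groceries", "Entertainment", "Auto", "Utilities")
)

# Hypothetical helper: return the label of the first keyword found in `text`,
# or NA when none of the keywords occurs in it.
categorize <- function(text, keywords, labels) {
  # str_detect recycles the single string against every keyword
  hit <- str_detect(tolower(text), keywords)
  if (any(hit)) labels[which(hit)[1]] else NA_character_
}

expenses$Category <- vapply(
  expenses$Expense, categorize, character(1),
  keywords = categories$Keyword, labels = categories$Category,
  USE.NAMES = FALSE
)

expenses
```

When an expense contains several keywords, this sketch keeps the first one in the order of the categories table; expenses without any keyword get NA rather than being dropped.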

pieterjanvc · August 31, 2022, 1:04pm

Hi,

Just to answer the title question:

#Using str_detect
#*****************
library(stringr)

#Test if the word "some" is present
text = "this is some text"
str_detect(text, "some")
#> [1] TRUE

#Use RegEx to see if the string ends with a period
text = "this is some text"
str_detect(text, "\\.$")
#> [1] FALSE

#Using case_when
#*****************
library(dplyr)

x = 80
case_when(
  x < 50 ~ "low",
  x >= 50 & x < 100 ~ "medium",
  TRUE ~ "high"
)
#> [1] "medium"

Created on 2022-08-31 by the reprex package (v2.0.1)
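The two functions can also be combined for the original question. The following is only a sketch, assuming the keyword-to-category mapping from the expense example earlier in this thread; the "Other" fallback label and the "ATM WITHDRAWAL" row are assumptions added for illustration.

```r
library(dplyr)
library(stringr)

expenses <- data.frame(
  Expense = c("DEBIT PURCHASE AT SHELL", "NETFLIX Payment",
              "MERCHANT KROGER", "CENTRAL WATER PAYMENT",
              "ATM WITHDRAWAL")  # extra row with no keyword, for illustration
)

out <- expenses %>%
  mutate(
    # Conditions are checked top to bottom; the first TRUE one sets the label
    Category = case_when(
      str_detect(Expense, regex("shell|chevron", ignore_case = TRUE)) ~ "Auto",
      str_detect(Expense, regex("kroger|costco", ignore_case = TRUE)) ~ "Groceries",
      str_detect(Expense, regex("netflix", ignore_case = TRUE)) ~ "Entertainment",
      str_detect(Expense, regex("water", ignore_case = TRUE)) ~ "Utilities",
      TRUE ~ "Other"  # assumed fallback label for unmatched expenses
    )
  )

out
```

This works well for a handful of categories. With a long keyword list, hard-coding every pattern becomes tedious, and a lookup table is probably easier to maintain.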
