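Whatever happens, you will have some form of nested looping, over TC_3 and over the keywords. One way to make the keywords loop more efficient is to use %in%. The only catch is that you may have to account for case (upper/lowercase) explicitly: in the example below, rows 1 and 2 come back FALSE because "bananas" and "coconut" are lower-case in the text.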
library(tidyverse)
df <- read.table(text = "SL_NO Index_No TC_1 TC_2 TC_3
1 17002 … … The trees in the plantation are bananas
2 25003 … … There are coconut trees 30 miles from here
3 58016 … … Sugarcane needs a lot of water to grow",
                 header = TRUE,
                 row.names = NULL,
                 sep = "\t")
df
#> SL_NO Index_No TC_1 TC_2 TC_3
#> 1 1 17002 … … The trees in the plantation are bananas
#> 2 2 25003 … … There are coconut trees 30 miles from here
#> 3 3 58016 … … Sugarcane needs a lot of water to grow
keywords <- scan(text = "Sugarcane
Coconut
Bananas",what = "character")
keywords
#> [1] "Sugarcane" "Coconut" "Bananas"
df |>
  mutate(words_in_TC_3 = str_split(TC_3, " "),
         has_match = map_lgl(words_in_TC_3,
                             ~any(.x %in% keywords)))
#> SL_NO Index_No TC_1 TC_2 TC_3
#> 1 1 17002 … … The trees in the plantation are bananas
#> 2 2 25003 … … There are coconut trees 30 miles from here
#> 3 3 58016 … … Sugarcane needs a lot of water to grow
#> words_in_TC_3 has_match
#> 1 The, trees, in, the, plantation, are, bananas FALSE
#> 2 There, are, coconut, trees, 30, miles, from, here FALSE
#> 3 Sugarcane, needs, a, lot, of, water, to, grow TRUE
Created on 2022-05-09 by the reprex package (v2.0.1)
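To handle the case issue mentioned above, one option is to lower-case both the text and the keywords before comparing. This is only a minimal sketch: it rebuilds a cut-down version of the example data (just the TC_3 column), and the name keywords_lower is introduced here for illustration.

```r
library(tidyverse)

# Cut-down copy of the example data: only TC_3 is needed for the match
df <- tibble(
  TC_3 = c("The trees in the plantation are bananas",
           "There are coconut trees 30 miles from here",
           "Sugarcane needs a lot of water to grow")
)
keywords <- c("Sugarcane", "Coconut", "Bananas")

# Lower-case the keywords once, outside the per-row loop
keywords_lower <- str_to_lower(keywords)

result <- df |>
  mutate(words_in_TC_3 = str_split(str_to_lower(TC_3), " "),
         has_match = map_lgl(words_in_TC_3,
                             ~any(.x %in% keywords_lower)))

result$has_match
#> [1] TRUE TRUE TRUE
```

Note that splitting on a single space keeps any punctuation attached to a word (e.g. "bananas."), so real text may need extra cleaning before the %in% comparison.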
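Another approach would be to make the loop over TC_3 more efficient by letting the vectorised str_detect() handle it, so that you only loop explicitly over the keywords, with something like: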
map_dfc(keywords,
        ~str_detect(df$TC_3, .x)) |>
  as.matrix() |>
  matrixStats::rowAnys()
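A possible variation on the same idea, not part of the code above, is to collapse the keywords into a single regular expression so that str_detect() is called only once, with case ignored. This sketch assumes the keywords contain no regex metacharacters (otherwise they would need escaping); tc_3 and pattern are names made up for the example.

```r
library(stringr)

# Stand-in for df$TC_3 from the example data
tc_3 <- c("The trees in the plantation are bananas",
          "There are coconut trees 30 miles from here",
          "Sugarcane needs a lot of water to grow")
keywords <- c("Sugarcane", "Coconut", "Bananas")

# Build one alternation pattern: "Sugarcane|Coconut|Bananas"
pattern <- str_c(keywords, collapse = "|")

# One vectorised pass over the text, ignoring upper/lowercase
has_match <- str_detect(tc_3, regex(pattern, ignore_case = TRUE))
has_match
#> [1] TRUE TRUE TRUE
```

Unlike the %in% version, this matches substrings rather than whole words, so "Coconut" would also match "coconuts".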