Welcome @archishman!
The help for str_subset says it is "Vectorised over string and pattern". Since the string vector sentences has more elements, colours is recycled along the length of sentences. So, for example, the first element of sentences is tested against the pattern "^red$", the second element of sentences is tested against the pattern "orange", etc., and the pattern vector gets recycled for every successive group of six elements of sentences.
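The recycling described above can be sketched in base R. This is only an illustration: the vector x is hypothetical toy data standing in for sentences, and rep_len plus mapply stand in for what str_subset does internally (an assumption about behaviour, not its actual implementation).

```r
# Hypothetical toy data standing in for `sentences` and `colours`
x <- c("green apple", "orange peel", "a red car", "green ink",
       "red", "purple rain", "red", "an orange cat")
colours <- c("^red$", "orange", "yellow", "green", "blue", "purple")

# Recycle the six patterns along the length of x:
# element 1 gets "^red$", element 2 gets "orange", ..., element 7 gets "^red$" again
recycled <- rep_len(colours, length(x))

# Test each string against its own recycled pattern only
hits <- mapply(grepl, recycled, x, USE.NAMES = FALSE)

x[hits]
```

Note that "green apple" is dropped even though it contains "green": it sits in position 1, so it is only ever compared with "^red$".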
Thus, for example, "green" is in the 4th out of 6 positions in colours, so it will match any element of sentences that contains the word "green" and whose index leaves a remainder of 4 when divided by 6:
sentences[grepl("green", sentences) & 1:length(sentences) %% 6 == 4]
[1] "The spot on the blotter was made by green ink."
And this is exactly what your second example matches for "green":
has_colour_test
[1] "The spot on the blotter was made by green ink." "A man in a blue sweater sat at the desk."
[3] "The sky in the west is tinged with orange red."
We can do something similar for the other colours. For example:
sentences[grepl("orange", sentences) & 1:length(sentences) %% 6 == 2]
[1] "The sky in the west is tinged with orange red."
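The same position check can be run for every colour at once. The sketch below uses hypothetical toy data (x is not the real sentences vector) and base R only; the helper name by_position is made up for illustration.

```r
# Hypothetical toy data standing in for `sentences` and `colours`
x <- c("green apple", "orange peel", "a red car", "green ink",
       "red", "purple rain", "red", "an orange cat")
colours <- c("^red$", "orange", "yellow", "green", "blue", "purple")

n   <- length(colours)
idx <- seq_along(x)

# For pattern number k, keep the strings that match it AND whose index
# lines up with it under recycling (k, k + n, k + 2n, ...).
# k %% n handles the last pattern, whose indices leave a remainder of 0.
by_position <- lapply(seq_len(n), function(k) {
  x[grepl(colours[k], x) & idx %% n == k %% n]
})
names(by_position) <- colours

by_position
```

Here "red" at position 5 is not picked up by "^red$", because position 5 is paired with "blue"; only the "red" at position 7 is.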
On the other hand, your first example provides a single regular expression to str_detect:
colour_match
"^red$|orange|yellow|green|blue|purple"
Every element of sentences is therefore tested against that single regular expression.
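For contrast, here is a minimal base R sketch of the single-pattern case. It assumes colour_match was built by collapsing colours with "|" (the usual way to get such a pattern, though your code may differ), and x is hypothetical toy data rather than the real sentences vector.

```r
# Hypothetical toy data standing in for `sentences` and `colours`
x <- c("green apple", "orange peel", "a red car", "green ink",
       "red", "purple rain", "red", "an orange cat")
colours <- c("^red$", "orange", "yellow", "green", "blue", "purple")

# Collapse the six patterns into one alternation (assumed construction)
colour_match <- paste(colours, collapse = "|")

# A length-one pattern is applied to every string, so position no longer matters
x[grepl(colour_match, x)]
```

Every string that matches any of the alternatives is kept, wherever it sits in the vector; only "a red car" is dropped, because "^red$" requires the whole string to be "red".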
To use colours but have the result come out the way we want, we could use map to check each element of sentences separately against each element of colours, or each element of colours against each element of sentences. Neither of these is as fast as str_subset(sentences, colour_match), but there may be other approaches that are faster.
library(tidyverse)

# Check each sentence, one at a time, against all of colours.
# Takes 40 times as long as str_subset(sentences, colour_match)
sentences %>%
  map(~ str_subset(.x, colours)) %>%
  compact() %>%
  unlist()

# Check all of sentences against each colour in turn.
# Takes twice as long as str_subset(sentences, colour_match)
colours %>%
  map(~ str_subset(sentences, .x)) %>%
  unlist() %>%
  unique()
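One more possibility, sketched in base R with hypothetical toy data (x is not the real sentences vector): build a logical matrix with one column per colour and keep the rows with at least one match. I have not timed this against the versions above, so treat it as an untested alternative rather than a faster one.

```r
# Hypothetical toy data standing in for `sentences` and `colours`
x <- c("green apple", "orange peel", "a red car", "green ink",
       "red", "purple rain", "red", "an orange cat")
colours <- c("^red$", "orange", "yellow", "green", "blue", "purple")

# One column per pattern, one row per string: m[i, k] is TRUE when
# string i matches pattern k
m <- sapply(colours, grepl, x = x)

# Keep every string that matches at least one pattern, in original order
x[rowSums(m) > 0]
```

Because the filter is applied to x itself, the result keeps the original order and repeated strings, unlike the unlist-then-unique version, which groups results by colour and drops duplicates.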