filter function and subset of variables that do not contain a certain string of characters

alexandria23 · February 2, 2021, 10:32am

Hello,

I have a dataset with two columns of interest, one a "Response" column (where participants in a task could respond through typing what they believed a presented image was - so the class being a "character" for their responses). The second column is an "Image" column (containing the name of the actual image presented).

What I would like to do is see how many of the Responses do not match what the image actually was. As there are multiple words participants can use to characterise and name an object, I would also like to allow several acceptable responses per image. What I have done so far is to use the filter function for each of the 300 images that have been presented, keeping the responses to one individual image that contain the correct word. See below:

Image1CorrectAnswers <- data %>% filter(data$Image == "Image1.jpg", data$Response == "bike")

What I was wondering however, is 1) whether it is possible to use the filter function for responses that do not contain the correct word for that specific image? 2) As well as whether I can have multiple different "acceptable" words to "filter" out correct responses from the incorrect ones (as different participants can answer differently to the same image, and yet both be correct). The goal is to have a final variable for each of the 300 images containing only the incorrect responses.

Thank you in advance.

lars · February 2, 2021, 11:50am

There are different ways to approach this, and knowing some basics of regex can take you a long way.

One way could be to use the filter() statement with the str_detect() function and use the ! to negate your search criteria.

library(tidyverse)
df <- tibble(sentences = head(sentences, 10))

df
#> # A tibble: 10 x 1
#>    sentences                                  
#>    <chr>                                      
#>  1 The birch canoe slid on the smooth planks. 
#>  2 Glue the sheet to the dark blue background.
#>  3 It's easy to tell the depth of a well.     
#>  4 These days a chicken leg is a rare dish.   
#>  5 Rice is often served in round bowls.       
#>  6 The juice of lemons makes fine punch.      
#>  7 The box was thrown beside the parked truck.
#>  8 The hogs were fed chopped corn and garbage.
#>  9 Four hours of steady work faced us.        
#> 10 Large size in stockings is hard to sell.

## keep any sentences which do contain the, lowercase, word 'the'
df %>% filter(str_detect(sentences, "the"))
#> # A tibble: 4 x 1
#>   sentences                                  
#>   <chr>                                      
#> 1 The birch canoe slid on the smooth planks. 
#> 2 Glue the sheet to the dark blue background.
#> 3 It's easy to tell the depth of a well.     
#> 4 The box was thrown beside the parked truck.

## remove any sentences which do contain the, lowercase, word 'the'
df %>% filter(!str_detect(sentences, "the"))
#> # A tibble: 6 x 1
#>   sentences                                  
#>   <chr>                                      
#> 1 These days a chicken leg is a rare dish.   
#> 2 Rice is often served in round bowls.       
#> 3 The juice of lemons makes fine punch.      
#> 4 The hogs were fed chopped corn and garbage.
#> 5 Four hours of steady work faced us.        
#> 6 Large size in stockings is hard to sell.

Created on 2021-02-02 by the reprex package (v1.0.0)

If you'd like to know more about these possibilities, you might find the section "Matching patterns with regular expressions" in chapter 14 of the R for Data Science book quite useful.
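Applied to the question's own column names, the same negation might look like the sketch below. The `responses` tibble and its values are made up for illustration; only the `Image` and `Response` column names come from the original post.

```r
library(dplyr)
library(stringr)

# Hypothetical data; only the column names are taken from the question
responses <- tibble(
  Image    = c("Image1.jpg", "Image1.jpg", "Image1.jpg", "Image2.jpg"),
  Response = c("bike", "blue bike", "scooter", "dog")
)

# Keep responses to Image1.jpg that do NOT contain "bike"
image1_incorrect <- responses %>%
  filter(Image == "Image1.jpg", !str_detect(Response, "bike"))

image1_incorrect$Response
```

Note that columns are referred to by bare name inside filter(), so the `data$` prefix is not needed.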

lars · February 2, 2021, 12:15pm

And to answer your second question, yes you can:

library(tidyverse)
df <- tibble(sentences = head(sentences, 10))

## use `|` as an OR-operator 
df %>% filter(str_detect(sentences, "the|of|in"))
#> # A tibble: 8 x 1
#>   sentences                                  
#>   <chr>                                      
#> 1 The birch canoe slid on the smooth planks. 
#> 2 Glue the sheet to the dark blue background.
#> 3 It's easy to tell the depth of a well.     
#> 4 Rice is often served in round bowls.       
#> 5 The juice of lemons makes fine punch.      
#> 6 The box was thrown beside the parked truck.
#> 7 Four hours of steady work faced us.        
#> 8 Large size in stockings is hard to sell.

Created on 2021-02-02 by the reprex package (v1.0.0)
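For the image task, the list of acceptable words could be kept in a character vector and collapsed into one `|` pattern, then negated to keep only the incorrect responses. This is a minimal sketch: the data and the `acceptable` vector are invented for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical responses to a single image
image1 <- tibble(
  Image    = "Image1.jpg",
  Response = c("bike", "red bicycle", "scooter", "car")
)

# Assumed list of acceptable answers for this image
acceptable <- c("bike", "bicycle")

# Collapse the words into a single OR-pattern: "bike|bicycle"
pattern <- str_c(acceptable, collapse = "|")

# Keep only the responses that match none of the acceptable words
image1_incorrect <- image1 %>%
  filter(!str_detect(Response, pattern))

image1_incorrect$Response
```

Because the match is partial, a response such as "motorbike" would also count as containing "bike"; word boundaries (`\\b`) can be added to the pattern if that is not wanted.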

alexandria23 · February 2, 2021, 12:19pm

Thank you very much! One last question: do you know if it is possible to have a function that captures whether the Response contains any of a set of words at all, without being case sensitive, for example? Or if they answer "blue bike" instead of just "bike": as the Response "blue bike" contains the word "bike", is it possible to capture that as well?

So far, I am only able to capture the Responses when they are written specifically to how I have written in the code. So case sensitive and only the specific word.

Thank you very much in advance again and for your previous response.

lars · February 2, 2021, 1:06pm

Filtering on partial matches already works, because str_detect() matches the pattern anywhere in the string. See the example below, which returns sentences containing 'The' or 'These':

library(tidyverse)
df <- tibble(sentences = head(sentences, 10))

## keep any sentences which do have a partial match of the word 'The'
df %>% filter(str_detect(sentences, "The"))
#> # A tibble: 5 x 1
#>   sentences                                  
#>   <chr>                                      
#> 1 The birch canoe slid on the smooth planks. 
#> 2 These days a chicken leg is a rare dish.   
#> 3 The juice of lemons makes fine punch.      
#> 4 The box was thrown beside the parked truck.
#> 5 The hogs were fed chopped corn and garbage.

Created on 2021-02-02 by the reprex package (v1.0.0)

If you want to match the exact word (use \\b word boundaries) regardless of case (use str_to_lower()), it might look like this:

library(tidyverse)
df <- tibble(sentences = head(sentences, 10))

## keep any sentences which contain the whole word 'the', in upper or lower case
## (\\b on both sides, so words that merely start or end with 'the' are excluded)
df %>% filter(str_detect(str_to_lower(sentences), "\\bthe\\b"))
#> # A tibble: 6 x 1
#>   sentences                                  
#>   <chr>                                      
#> 1 The birch canoe slid on the smooth planks. 
#> 2 Glue the sheet to the dark blue background.
#> 3 It's easy to tell the depth of a well.     
#> 4 The juice of lemons makes fine punch.      
#> 5 The box was thrown beside the parked truck.
#> 6 The hogs were fed chopped corn and garbage.

Created on 2021-02-02 by the reprex package (v1.0.0)
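As an alternative to str_to_lower(), stringr's regex() helper accepts `ignore_case = TRUE`, which leaves the original text untouched. A small sketch with made-up responses:

```r
library(dplyr)
library(stringr)

# Hypothetical responses with mixed case and extra words
df <- tibble(Response = c("Bike", "blue BIKE", "motorbike", "scooter"))

# Match the whole word 'bike' in any case; \\b keeps 'motorbike' out
matches <- df %>%
  filter(str_detect(Response, regex("\\bbike\\b", ignore_case = TRUE)))

matches$Response
```

Dropping the `\\b` markers would turn this back into a partial match, so "motorbike" would be kept as well.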

There are other ways of achieving this, so I wouldn't be surprised if more questions arise, especially once you get a taste of the power of regex.
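One such alternative for the 300-image case, instead of writing one filter() call per image, might be to keep the acceptable words in a lookup table and join it to the responses. This is only a sketch under assumptions: the `answer_key` table, its `Pattern` column, and all values are hypothetical.

```r
library(dplyr)
library(stringr)

# Hypothetical responses; column names follow the question
responses <- tibble(
  Image    = c("Image1.jpg", "Image1.jpg", "Image2.jpg", "Image2.jpg"),
  Response = c("Blue bike", "scooter", "a dog", "cat")
)

# Hypothetical answer key: one lower-case regex per image
answer_key <- tibble(
  Image   = c("Image1.jpg", "Image2.jpg"),
  Pattern = c("\\b(bike|bicycle)\\b", "\\b(dog|puppy)\\b")
)

# Attach each image's pattern, then keep responses that do not match it
incorrect <- responses %>%
  left_join(answer_key, by = "Image") %>%
  filter(!str_detect(str_to_lower(Response), Pattern))

# Number of incorrect responses per image
incorrect_counts <- incorrect %>% count(Image)
incorrect_counts
```

Images with no entry in the answer key get an NA pattern after the join, and filter() drops those rows, so it is worth checking that the key covers every image.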

Regex can be quite daunting to use correctly (at least, for me that will always be the case), and for that reason I use the RStudio addin RegExplain. For a quick and comprehensive overview of this addin, I highly recommend visiting the RegExplain site first.

Note also that RStudio offers several handy cheatsheets under the Help menu and you'll find more (including one for regex) under Browse Cheatsheets.
