Trying to loop over list of patterns in str_detect( )

pone · February 4, 2024, 2:53am

I'm trying to create a list of patterns that I want to detect within the strings stored in a list-column. I want to write a function that, for each pattern in the list, uses sum(str_detect( )) to count how many strings in a list contain that pattern. Then I want to sum the counts that are greater than 1 and divide that by the sum of all the counts. I want to apply this across a list-column, where each observation's entry is a list of strings.

A toy example of what I'm trying to do is below:

  library(magrittr)
  library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
  library(tidyr)
#> 
#> Attaching package: 'tidyr'
#> The following object is masked from 'package:magrittr':
#> 
#>     extract
  library(rebus)
#> 
#> Attaching package: 'rebus'
#> The following object is masked from 'package:magrittr':
#> 
#>     or
  library(foreach)
  library(stringr)
#> 
#> Attaching package: 'stringr'
#> The following object is masked from 'package:rebus':
#> 
#>     regex
###Creating example tibble
example_tibble <- tibble(id = 1:2, strings = list(c("The cat scratched the dog", "It was a dark and stormy night", "Cats kill birds"), 
                                                  c("A big, scary dog", "The dog chased the kitty")))
###Creating list of patterns to match
PatternsList<-list(c("dog"), c("cat"), c("bird"))
String_Comparison<-function(x, PatternsList){
  DescriptorCounts<-foreach(i = seq_along(PatternsList)) %do% { 
    sum(str_detect(x, regex(pattern = PatternsList[i], ignore_case = TRUE)))
  }
  ###Using if statement instead of filter
  common_descriptors_sum <- if(any(unlist(DescriptorCounts) > 1)) {
    sum(unlist(DescriptorCounts[unlist(DescriptorCounts) > 1]))
  }
  ###Get ratio 
  common_ratio <- common_descriptors_sum / sum(unlist(DescriptorCounts))
  return(common_ratio)
}
ExampleTibble_WithComparedStrings <- example_tibble %>% 
  rowwise() %>% 
  mutate(StringsCompared = list(String_Comparison(strings, PatternsList)))
#> Warning: There were 6 warnings in `mutate()`.
#> The first warning was:
#> ℹ In argument: `StringsCompared = list(String_Comparison(strings,
#>   PatternsList))`.
#> ℹ In row 1.
#> Caused by warning in `regex()`:
#> ! Coercing `pattern` to a plain character vector.
#> ℹ Run `dplyr::last_dplyr_warnings()` to see the 5 remaining warnings.
ExampleTibble_WithComparedStrings
#> # A tibble: 2 × 3
#> # Rowwise: 
#>      id strings   StringsCompared
#>   <int> <list>    <list>         
#> 1     1 <chr [3]> <dbl [0]>      
#> 2     2 <chr [2]> <dbl [0]>
###Returns NotANumber, which is not what I expect
###Isolating DescriptorCounts to demonstrate issue
DescriptorCounts <- function(x, PatternsList) {
  foreach(i = seq_along(PatternsList)) %do% { 
    sum(str_detect(x, regex(pattern = PatternsList[[i]], ignore_case = TRUE)))
  }
}
###Will generate lists of [0,0,0]
Output <- example_tibble %>% 
  rowwise() %>% 
  mutate(Output = list(DescriptorCounts(x = strings, PatternsList = PatternsList)))
Output$Output
#> [[1]]
#> [[1]][[1]]
#> [1] 1
#> 
#> [[1]][[2]]
#> [1] 2
#> 
#> [[1]][[3]]
#> [1] 1
#> 
#> 
#> [[2]]
#> [[2]][[1]]
#> [1] 2
#> 
#> [[2]][[2]]
#> [1] 0
#> 
#> [[2]][[3]]
#> [1] 0
###Okay the actual values are in there, but irretrievable?????

Created on 2024-02-04 with reprex v2.1.0

I think I have the basic structure right. When I mutate( ) with String_Comparison on the strings column, I want a column containing the proportion of all occurrences that is accounted for by descriptors with multiple occurrences, i.e. (the sum of the per-pattern counts sum(str_detect(strings, pattern)) that are greater than 1) divided by (the sum of all the per-pattern counts). Instead I get a column of Not a Number values. After isolating DescriptorCounts, it seems the issue is with how I'm using str_detect( ). I get a list of [0,0,0] in the output column in both rows. However, I expected a list of [1,2,1] in row one of the output column and a list of [2,0,0] in row two. This should result in common_descriptors_sum equaling 2 for each row, and common_ratio equaling 0.5 for row one and 1 for row two.
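
As a quick check of the expected numbers, here is a minimal base-R sketch that computes the ratio directly from a vector of counts. The helper name `common_ratio_from_counts` is hypothetical and not part of the code above, and returning NA when nothing matches at all is an assumption about the desired behaviour. The last lines show a general R fact that may be relevant: an `if` with no `else` yields NULL when its condition is FALSE, and NULL divided by a number is a zero-length numeric vector, which would be consistent with the `<dbl [0]>` cells in the output whenever no count exceeds 1.

```r
# Hypothetical helper: share of all occurrences that comes from patterns
# matched more than once. Returns NA when there are no matches at all
# (an assumption about what is wanted in that case).
common_ratio_from_counts <- function(counts) {
  total <- sum(counts)
  if (total == 0) {
    return(NA_real_)
  }
  # counts[counts > 1] is empty when no pattern repeats, and sum() of an
  # empty vector is 0, so no separate if statement is needed here.
  sum(counts[counts > 1]) / total
}

common_ratio_from_counts(c(1, 2, 1))  # 0.5 (expected for row one)
common_ratio_from_counts(c(2, 0, 0))  # 1   (expected for row two)
common_ratio_from_counts(c(1, 1, 0))  # 0   (no pattern matched twice)

# An if without else gives NULL when the condition is FALSE,
# and NULL / number is numeric(0).
no_else <- if (FALSE) 1
is.null(no_else)     # TRUE
length(no_else / 4)  # 0
```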

nirgrahamuk · February 5, 2024, 11:33am

When you lifted out DescriptorCounts to play with it on its own, you actually fixed it, because you changed
PatternsList[i] to PatternsList[[i]].
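
To illustrate the difference, here is a minimal sketch. The `strings` vector copies row one of the toy tibble, and `vapply()` stands in for `foreach()`; that swap is my substitution for brevity, not something the original code requires. Single brackets keep the list wrapper, so regex() receives a list rather than a character vector, which is presumably what triggers the "Coercing `pattern` to a plain character vector" warning. Double brackets extract the character string itself.

```r
library(stringr)

PatternsList <- list("dog", "cat", "bird")
strings <- c("The cat scratched the dog",
             "It was a dark and stormy night",
             "Cats kill birds")

# `[` returns a list of length one; `[[` returns the element inside it.
class(PatternsList[1])    # "list"
class(PatternsList[[1]])  # "character"

# For each pattern, count the strings that contain it (case-insensitive).
# Iterating over the list with vapply() hands each element to the function
# directly, so no indexing is needed, and the result is a plain integer
# vector rather than a nested list.
counts <- vapply(
  PatternsList,
  function(p) sum(str_detect(strings, regex(p, ignore_case = TRUE))),
  integer(1)
)
counts  # 1 2 1

# Ratio of occurrences from patterns seen more than once to all occurrences.
sum(counts[counts > 1]) / sum(counts)  # 0.5
```

Because `counts` is an ordinary vector here, the unlist() calls in String_Comparison would no longer be needed.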
