I'm trying to detect a list of patterns within the strings stored in a list-column. I want a function that, for each pattern in the list, uses sum(str_detect( )) to count how many strings in a list contain that pattern. Then I want to sum the counts that are greater than 1 and divide that by the sum of all the counts. I want to apply this to every row of a list-column, where each row holds the list of strings for one observation.
A toy example of what I'm trying to do is below:
library(magrittr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
#>
#> Attaching package: 'tidyr'
#> The following object is masked from 'package:magrittr':
#>
#> extract
library(rebus)
#>
#> Attaching package: 'rebus'
#> The following object is masked from 'package:magrittr':
#>
#> or
library(foreach)
library(stringr)
#>
#> Attaching package: 'stringr'
#> The following object is masked from 'package:rebus':
#>
#> regex
###Creating example tibble
example_tibble <- tibble(id = 1:2, strings = list(c("The cat scratched the dog", "It was a dark and stormy night", "Cats kill birds"),
c("A big, scary dog", "The dog chased the kitty")))
###Creating list of patterns to match
PatternsList<-list(c("dog"), c("cat"), c("bird"))
String_Comparison <- function(x, PatternsList) {
  DescriptorCounts <- foreach(i = seq_along(PatternsList)) %do% {
    sum(str_detect(x, regex(pattern = PatternsList[i], ignore_case = TRUE)))
  }
  ###Using if statement instead of filter
  common_descriptors_sum <- if (any(unlist(DescriptorCounts) > 1)) {
    sum(unlist(DescriptorCounts[unlist(DescriptorCounts) > 1]))
  }
  ###Get ratio
  common_ratio <- common_descriptors_sum / sum(unlist(DescriptorCounts))
  return(common_ratio)
}
ExampleTibble_WithComparedStrings <- example_tibble %>%
rowwise() %>%
mutate(StringsCompared = list(String_Comparison(strings, PatternsList)))
#> Warning: There were 6 warnings in `mutate()`.
#> The first warning was:
#> ℹ In argument: `StringsCompared = list(String_Comparison(strings,
#> PatternsList))`.
#> ℹ In row 1.
#> Caused by warning in `regex()`:
#> ! Coercing `pattern` to a plain character vector.
#> ℹ Run `dplyr::last_dplyr_warnings()` to see the 5 remaining warnings.
ExampleTibble_WithComparedStrings
#> # A tibble: 2 × 3
#> # Rowwise:
#> id strings StringsCompared
#> <int> <list> <list>
#> 1 1 <chr [3]> <dbl [0]>
#> 2 2 <chr [2]> <dbl [0]>
###Returns empty numeric vectors (<dbl [0]>), which is not what I expect
###Isolating DescriptorCounts to demonstrate issue
DescriptorCounts <- function(x, PatternsList) {
  foreach(i = seq_along(PatternsList)) %do% {
    sum(str_detect(x, regex(pattern = PatternsList[[i]], ignore_case = TRUE)))
  }
}
###Will generate lists of [0,0,0]
Output <- example_tibble %>%
rowwise() %>%
mutate(Output = list(DescriptorCounts(x = strings, PatternsList = PatternsList)))
Output$Output
#> [[1]]
#> [[1]][[1]]
#> [1] 1
#>
#> [[1]][[2]]
#> [1] 2
#>
#> [[1]][[3]]
#> [1] 1
#>
#>
#> [[2]]
#> [[2]][[1]]
#> [1] 2
#>
#> [[2]][[2]]
#> [1] 0
#>
#> [[2]][[3]]
#> [1] 0
###Okay, the actual values are in there, but nested in a list. How do I retrieve them?
Created on 2024-02-04 with reprex v2.1.0
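Two base R behaviours look relevant to the output above, so here is a small check that does not depend on stringr or dplyr. This is only a sketch of what I suspect, not a confirmed diagnosis: the `regex()` warning suggests that `PatternsList[i]` hands over a list rather than a string, and a `<dbl [0]>` result would be consistent with dividing `NULL` by a number when the `if` has no `else` branch. The names `single_el`, `double_el`, `no_else`, and `empty_ratio` are made up for this illustration.

```r
PatternsList <- list(c("dog"), c("cat"), c("bird"))

# Single brackets keep the list wrapper; double brackets extract the element
single_el <- PatternsList[1]
double_el <- PatternsList[[1]]
is.character(single_el)  # FALSE: still a list of length 1
is.character(double_el)  # TRUE: the bare string "dog"

# An `if` with no `else` evaluates to NULL when the condition is FALSE
no_else <- if (FALSE) 1
is.null(no_else)         # TRUE

# Arithmetic on NULL gives a zero-length numeric vector, not NaN
empty_ratio <- no_else / 4
length(empty_ratio)      # 0
```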
I think I have the basic structure right. When I mutate( ) with String_Comparison on the strings column, I expect a column holding, for each row, the proportion of all pattern occurrences that come from patterns with multiple occurrences, i.e. the sum of the counts for which sum(str_detect(strings, pattern = Patterns[i])) > 1 is TRUE, divided by the sum of all the counts. Instead I get a column of empty numeric vectors (<dbl [0]>) rather than numbers. After isolating DescriptorCounts, I suspected the issue was with how I'm using str_detect( ), because I was getting a list of [0,0,0] in the output column in both rows. I expected a list of [1, 2, 1] in row one of the output column and a list of [2, 0, 0] in row two, which is what the isolated version in the reprex above returns. That should result in common_descriptors_sum equaling 2 for each row, and common_ratio equaling 0.5 for row one and 1 for row two.
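For reference, here is a minimal sketch of the calculation I'm after, written with vapply instead of foreach so that the counts come back as a plain integer vector. `common_ratio_sketch` is a hypothetical name, and returning NA when no pattern matches at all is my own assumption about what should happen in that case.

```r
library(stringr)

# Hypothetical helper: share of all pattern hits that come from patterns
# matched in more than one string.
common_ratio_sketch <- function(x, patterns) {
  # [[i]] extracts the bare string; vapply returns one integer count per pattern
  counts <- vapply(
    seq_along(patterns),
    function(i) sum(str_detect(x, regex(patterns[[i]], ignore_case = TRUE))),
    integer(1)
  )
  total <- sum(counts)
  # Assumed behaviour: avoid 0 / 0 when nothing matches
  if (total == 0) return(NA_real_)
  sum(counts[counts > 1]) / total
}

strings <- list(
  c("The cat scratched the dog", "It was a dark and stormy night", "Cats kill birds"),
  c("A big, scary dog", "The dog chased the kitty")
)
patterns <- list("dog", "cat", "bird")

# One ratio per observation
ratios <- vapply(strings, common_ratio_sketch, numeric(1), patterns = patterns)
ratios
#> [1] 0.5 1.0
```

On the toy data this gives the 0.5 and 1 I described above, so the remaining question is why my original String_Comparison does not.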