str_extract_all getting stuck

Tira · October 23, 2023, 2:49pm

Hi all,
I'm trying to extract a section of text from larger documents: I have many text files and want to pull the shareholder letter out of each one.

start_words = c( "dear" ,...)
end_words = c("best regards",... )

start_pattern = paste0("\\b(?:", paste(start_words, collapse = "|"), ")\\b")
end_pattern = paste0("\\b(?:", paste(end_words, collapse = "|"), ")\\b")

This works for most files, but sometimes it gets stuck. There is no error message, and I cannot get more detail from str_extract_all to see exactly where the problem is or what is taking so long (maybe it's not stuck and just takes forever). The problem could be that the file is too large, or that the file contains a specific word or character encoding that my code can't handle.

I figured the most useful workaround would be to implement a break as soon as it takes more than 1 minute to extract the letter. I tried to do that in the source code of str_extract_all but that failed. Ideally I would like to find out why this specific file causes the trouble.
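
One way to sketch the one-minute cut-off in base R is setTimeLimit() wrapped in tryCatch(). The helper name with_time_limit is made up for illustration, and this is a hedged sketch rather than a guaranteed fix: R only checks the limit at points where a user interrupt could occur, so a long-running call into compiled code may not be stopped promptly.

```r
# Hypothetical helper (not part of stringr): evaluate `expr`, but give up
# once roughly `seconds` of elapsed time have passed.
with_time_limit <- function(expr, seconds = 60) {
  # Always clear the limit again when the helper exits.
  on.exit(setTimeLimit(cpu = Inf, elapsed = Inf, transient = FALSE), add = TRUE)
  setTimeLimit(elapsed = seconds, transient = TRUE)
  tryCatch(
    expr,  # lazily evaluated here, i.e. after the limit has been set
    error = function(e) {
      # Any error, including "reached elapsed time limit", ends up here.
      message("Skipped: ", conditionMessage(e))
      NULL
    }
  )
}

# Usage: a deliberately slow expression is abandoned after about half a second.
slow <- with_time_limit({ Sys.sleep(3); "finished" }, seconds = 0.5)
fast <- with_time_limit("finished", seconds = 5)
```

In the loop, the call to the extraction function would go where the Sys.sleep() placeholder is. It may also be worth checking whether your installed stringi documents a time_limit option in stri_opts_regex(), since that would act inside the regex engine itself.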

This is in essence the code:

for (subfolder in sorted_subfolders) {
  # "\\.txt$" matches only names that end in ".txt"
  text_files = list.files(subfolder, pattern = "\\.txt$", full.names = TRUE)
  for (text_file in text_files) {
    funcLetter(text_file, DF1, start_pattern, end_pattern)
    i = i + 1
  }
}
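
To narrow down which file causes the trouble, one option is to print each file name and size before processing it and to record the elapsed time afterwards. The sketch below is only an illustration: process_one is a hypothetical stand-in for funcLetter(), and the two demo files are made up.

```r
# Hypothetical stand-in for funcLetter(): it only reads the file and
# counts characters, so the sketch runs on its own.
process_one <- function(path) {
  sum(nchar(readLines(path, warn = FALSE)))
}

# Time each file and collect the results in a data frame.
time_files <- function(paths, fun) {
  rows <- lapply(paths, function(path) {
    # Printed before the call, so the last name shown is the file that hangs.
    message("Processing ", basename(path), " (", file.size(path), " bytes)")
    elapsed <- system.time(fun(path))[["elapsed"]]
    data.frame(file = basename(path), bytes = file.size(path), seconds = elapsed)
  })
  do.call(rbind, rows)
}

# Usage with two throwaway files in a temporary folder.
demo_dir <- file.path(tempdir(), "letters_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("Dear shareholders, ... Best regards", file.path(demo_dir, "a.txt"))
writeLines("No letter in this file", file.path(demo_dir, "b.txt"))

timings <- time_files(list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE),
                      process_one)
timings[order(-timings$seconds), ]  # slowest files first
```

Comparing the bytes and seconds columns should show whether the slow files are simply the large ones or whether a small file is unexpectedly slow.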

Thank you for thinking along!

Tira · October 23, 2023, 2:51pm

This is how the text is processed:

AlexisW · October 23, 2023, 5:54pm

Personally I always get confused by the lookahead/behind, so in this case I would try to extract the position of the start/end words, which would also allow me to look between steps. Here is an example with random letters from the internet (note that this code is a draft, the end word is not captured appropriately):

text <- c(" Greeting
Dear Mr./Ms. Last Name,

Body of Message
Your message should be two or three paragraphs at most and should explain why you’re writing and what you’re requesting.

Closing
Sincerely, ",
"Dear Mr. Andrews,

I’m writing to resign from my position as customer service representative, effective September 16, 2022.

I’ve recently decided to go back to school, and my program starts in late September. I’m tendering my resignation now so that I can be as helpful as possible to you during the transition.

I’ve truly enjoyed my time working with you and everyone else on our team at LMK. It’s rare to find a customer service role that offers as much opportunity to grow and learn, and perhaps more rare to find such a positive, inspiring team of people to grow and learn with.

I’m particularly grateful for your guidance while I was considering furthering my education. Your support has meant so much to me. 

Please let me know if there’s anything I can do to help you find and train my replacement.

Thanks and best wishes,

Signature (hard copy letter)")



start_words = c( "dear")
end_words = c("best regards", "best wishes", "sincerely")

match_start <- regexec(paste(start_words, collapse = "|"), text, ignore.case = TRUE)
match_end <- regexec(paste(end_words, collapse = "|"), text, ignore.case = TRUE)

purrr::pmap(list(match_start, match_end, text),
           \(start, end, txt){
             substr(txt,
                    start = start[which.min(start)],
                    stop = end[which.max(end)])
           })
#> [[1]]
#> [1] "Dear Mr./Ms. Last Name,\n\nBody of Message\nYour message should be two or three paragraphs at most and should explain why you’re writing and what you’re requesting.\n\nClosing\nS"
#> 
#> [[2]]
#> [1] "Dear Mr. Andrews,\n\nI’m writing to resign from my position as customer service representative, effective September 16, 2022.\n\nI’ve recently decided to go back to school, and my program starts in late September. I’m tendering my resignation now so that I can be as helpful as possible to you during the transition.\n\nI’ve truly enjoyed my time working with you and everyone else on our team at LMK. It’s rare to find a customer service role that offers as much opportunity to grow and learn, and perhaps more rare to find such a positive, inspiring team of people to grow and learn with.\n\nI’m particularly grateful for your guidance while I was considering furthering my education. Your support has meant so much to me. \n\nPlease let me know if there’s anything I can do to help you find and train my replacement.\n\nThanks and b"

Created on 2023-10-23 with reprex v2.0.2

Tira · October 24, 2023, 11:09am

This sounds like a good approach. How would you then look at the different matches that are found in the text? I'm not familiar with purrr, and I'm unable to see the actual words in the text that matched. Also, there can be multiple matches, which I believe your code does not handle?

AlexisW · October 24, 2023, 3:34pm

Sorry I should have used gregexec(), not regexec().

regexec() gives you the position of the match:

match_start
#> [[1]]
#> [1] 11
#> attr(,"match.length")
#> [1] 4
#> 
#> [[2]]
#> [1] 1
#> attr(,"match.length")
#> [1] 4

Here you get a list of length 2, because there are 2 input letters in text. With gregexec(), each element of that list holds the starting positions of all the matches instead of only the first one:

text <- c("Hello greeting random word dear")

start_words = c("dear", "hello", "greeting")

gregexec(paste(start_words, collapse = "|"), text, ignore.case = TRUE)
#> [[1]]
#>      [,1] [,2] [,3]
#> [1,]    1    7   28
#> attr(,"match.length")
#>      [,1] [,2] [,3]
#> [1,]    5    8    4
#> attr(,"useBytes")
#> [1] TRUE
#> attr(,"index.type")
#> [1] "chars"

You can see you have a list of length 1 because there is a single element in text, and you have matches at positions 1, 7, and 28 (and their lengths are 5, 8, and 4 characters).
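
To see the actual words that matched, and not only their positions, base R's regmatches() can be applied to the match data. The sketch below swaps in gregexpr(), which returns plain vectors of positions rather than the matrices from gregexec(); the variable names hits and found_words are illustrative.

```r
text <- c("Hello greeting random word dear")
start_words <- c("dear", "hello", "greeting")
pattern <- paste(start_words, collapse = "|")

# gregexpr() finds every match in each element of `text`.
hits <- gregexpr(pattern, text, ignore.case = TRUE)

# regmatches() turns the positions back into the matched substrings.
found_words <- regmatches(text, hits)
found_words[[1]]
#> [1] "Hello"    "greeting" "dear"

# Positions and words side by side, for the first element of `text`.
data.frame(position = as.integer(hits[[1]]), word = found_words[[1]])
```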

Then in my map(), I use start[which.min(start)], which means take the starts, and keep the first one. That was also a bit of a mistake, it's just as easy to take the min() directly:

text <- c("as Hello greeting random word dear")

start_words = c("dear", "hello", "greeting")

match_start <- gregexec(paste(start_words, collapse = "|"), text, ignore.case = TRUE)

min(match_start[[1]])
#> [1] 4

Here the min is 4, because "Hello" starts on the 4th character of the string.

So the idea is to use gregexec() to find the positions of all the start and end words, then use min() and max() to select the most relevant ones, and finally use substr() to extract the text between these positions. Depending on whether you want to include the start and end word, you might have to play with the length of the match too. For example, to exclude the start word:

> match <- match_start[[1]]
> first_start_word <- which.min(match)
> first_start_word
[1] 1
> pos_first_start_word <- match[first_start_word]
> length_first_start_word <- attr(match, "match.length")[first_start_word]
> pos_first_start_word
[1] 4
> length_first_start_word
[1] 5
> 
> substr(text,
+        start = pos_first_start_word + length_first_start_word,
+        stop = nchar(text))
[1] " greeting random word dear"
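
The same bookkeeping works at the other end. A minimal sketch, assuming you want to keep the closing phrase: add the match length to the position of the last end word. It uses gregexpr() for its plain position vectors, and the sample letter is made up.

```r
letter <- "Intro text. Dear Mr. Andrews, thank you for everything. Best wishes, A. Writer"
end_words <- c("best regards", "best wishes", "sincerely")

end_match <- gregexpr(paste(end_words, collapse = "|"), letter, ignore.case = TRUE)[[1]]

# Index of the last end word, its position, and its length.
last_end <- which.max(end_match)
pos_last_end <- end_match[last_end]
length_last_end <- attr(end_match, "match.length")[last_end]

# The match covers pos .. pos + length - 1, so stop at its final character.
with_end <- substr(letter, start = 1, stop = pos_last_end + length_last_end - 1)
with_end
#> [1] "Intro text. Dear Mr. Andrews, thank you for everything. Best wishes"
```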

So the nice thing is that you can run gregexec() only for start words, then only for end words, see if one of them is the problem. Then you can manually check the start and end positions to ensure they make sense before running substr() on them. And if needed you can subset these matches to find which one is problematic.
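
Those manual checks can also be sketched as a small function. The name extract_letter is hypothetical. It uses gregexpr(), returns NA when a start or end word is missing or when the last end word comes before the first start word, and keeps both words in the result.

```r
extract_letter <- function(txt, start_words, end_words) {
  start <- gregexpr(paste(start_words, collapse = "|"), txt, ignore.case = TRUE)[[1]]
  end   <- gregexpr(paste(end_words, collapse = "|"), txt, ignore.case = TRUE)[[1]]

  # gregexpr() reports -1 when nothing matched.
  if (start[1] == -1 || end[1] == -1) return(NA_character_)

  first_start <- min(start)
  last <- which.max(end)
  last_end <- end[last] + attr(end, "match.length")[last] - 1

  # An end word before the first start word is not a usable letter.
  if (last_end <= first_start) return(NA_character_)

  substr(txt, first_start, last_end)
}

ok <- extract_letter("Report. Dear all, results were good. Sincerely, CEO",
                     c("dear"), c("best regards", "sincerely"))
ok
#> [1] "Dear all, results were good. Sincerely"

missing_start <- extract_letter("No greeting here. Sincerely, CEO",
                                c("dear"), c("sincerely"))
missing_start
#> [1] NA
```

Files that come back as NA are then easy to list and inspect by hand.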

purrr::map() and pmap() are the {purrr} counterparts of lapply() and Map(): they run a function on each element of a list (pmap() walks through several lists in parallel). My code above was:

res <- purrr::pmap(list(match_start, match_end, text),
           \(start, end, txt){
             substr(txt,
                    start = start[which.min(start)],
                    stop = end[which.max(end)])
           })

It could be rewritten as:

n <- length(text)
stopifnot(length(match_start) == n)
stopifnot(length(match_end) == n)

res <- vector(mode = "list", length = n)
for(i in seq_len(n)){
  start <- match_start[[i]]
  end <- match_end[[i]]
  txt <- text[[i]]

  res[[i]] <- substr(txt,
                     start = start[which.min(start)],
                     stop = end[which.max(end)])
}

There are some advantages to using map(): for example, you can easily ask for a progress bar. But if you're not familiar with {purrr}, a for loop is fine too.
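
If you stay with the for loop, base R can still show progress. A minimal sketch using txtProgressBar() from the utils package; the toy items list and the nchar() call are placeholders for the letters and the real extraction step.

```r
items <- list("first letter", "second letter", "third letter")
res <- vector(mode = "list", length = length(items))

pb <- txtProgressBar(min = 0, max = length(items), style = 3)
for (i in seq_along(items)) {
  res[[i]] <- nchar(items[[i]])   # placeholder for the real extraction
  setTxtProgressBar(pb, i)        # advance the bar after each element
}
close(pb)
```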
