Optimizing a regex of a long list pattern

roggim · January 3, 2020, 4:04pm

I have a file of 500,000 queries, and I want to flag the ones that mention entries from several long lists: companies (30,000), locations, and more. However, the labelling takes a very long time. Is there a better way to do this?

g_query_samp[str_detect(g_query_samp$search_q, regex(paste0(Loc$cities, collapse = ".*|.*"), ignore_case = T)),  "city"] <- 1

technocrat · January 3, 2020, 8:25pm

Hi, and welcome!

A reproducible example, called a reprex, yields more and better answers than a code fragment.

Without knowing the structure of g_query_samp, how its search_q variable is delimited, or what assigning 1 to the matching rows is supposed to represent, I'd be speculating too much to provide a useful answer.

All I can say is that, in general, you are better off vectorizing columns than subsetting. Do you have some representative data you could share in a reprex? It doesn't have to be big, just enough to show what needs parsing.
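To illustrate the vectorizing point, here is a minimal sketch. It assumes search_q is a plain character column and that the city names contain no regex metacharacters. The toy data frames below are hypothetical stand-ins, since the real data wasn't shared. The pattern drops the ".*" padding because str_detect() already matches anywhere in the string, so a plain alternation detects the same rows.

```r
library(stringr)

# Hypothetical stand-ins for the poster's data; the real structure is unknown.
g_query_samp <- data.frame(
  search_q = c("hotels in Berlin", "cheap flights", "paris restaurants"),
  stringsAsFactors = FALSE
)
Loc <- data.frame(
  cities = c("Berlin", "Paris", "Rome"),
  stringsAsFactors = FALSE
)

# Build one alternation pattern. No ".*" is needed around each alternative,
# because str_detect() looks for a match anywhere in the string.
# Assumption: the city names contain no regex metacharacters.
city_pattern <- regex(paste(Loc$cities, collapse = "|"), ignore_case = TRUE)

# Fill the whole column in one vectorized step instead of subset-assigning:
# 1 where a city is mentioned, 0 otherwise.
g_query_samp$city <- as.integer(str_detect(g_query_samp$search_q, city_pattern))
```

Whether this is fast enough for 30,000 alternatives would need to be measured on the real data; the sketch only shows the shape of a whole-column assignment.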
