I have a string and need to match only 2 words, metabolism and increase/decrease, and skip all the other words. I will then pass this pattern to str_detect to split my dataframe.
Sample string: The metabolism of Drug b can be decreased when combined with Drug a.
My RE: \b(?!Drug|when|combined|with|a|b|of|can|The)\b\S+
With this I can capture metabolism and decreased for the sample string.
But the real dataframe does not contain any Drug a or Drug b. It contains the real names of the drugs, and the number of words in a drug name can vary from drug to drug! For example, Cyclosporine Brexpiprazole and Ivabradine are 2 drug names!
So are you also trying to extract the drug names? I'm confused by your question. You ask about metabolism and increase/decrease. But then go on to talk about drug names.
Maybe you should type out a row as close to the real data as you can, and then write out the output you want.
I guess I'm also confused about why you aren't doing inclusive extraction instead of exclusive. Why not search for the terms you want instead of searching for everything that is not those terms?
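To make the inclusive idea concrete, here is a minimal sketch on the sample sentence from the question. The variable names (x, has_metabolism, direction) are only illustrative, and the patterns assume the wording always contains "metabolism" and a form of "increase" or "decrease".

```r
library(stringr)

x <- "The metabolism of Drug b can be decreased when combined with Drug a."

# Inclusive patterns: name the words you want instead of listing the ones you don't.
has_metabolism <- str_detect(x, "\\bmetabolism\\b")

# Matching the stem covers increase, increases, increased (and the decrease forms).
direction <- str_extract(x, "\\b(increase|decrease)")

has_metabolism
direction
```

Because nothing here depends on the drug names, it should not matter how many words each name has.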
To extract drug names: they are all proper names, right? So you could do something like...
library(data.table)
library(dplyr)
library(stringr)

df <-
  data.table(
    text = c(
      "During experiments we found that Cipid Duotyllyl increases metabolism when combined with Heptaichreelgynthraene Kaspliorhite",
      "During experiments we found that Sipthyde Frustrur decreases metabolism when combined with Isopheduacceite",
      "During experiments we found that Philphin increases metabolism when combined with Monoapuphyodeptin Heptawonthitharhycin",
      "During experiments we found that Glolfide Diifludran decreases metabolism when combined with Monoichuxyrlumphein",
      "During experiments we found that Octacliusplodein increases metabolism when combined with Fonhesgechlid Diizirdolfygor"
    )
  )

# A drug name here is one capitalised word, optionally followed by a second one.
# Each match includes the leading space.
df %>%
  mutate(
    drug1 = str_extract_all(text, "\\s[A-Z][a-z]*(\\s[A-Z][a-z]*)?", simplify = TRUE)[, 1],
    drug2 = str_extract_all(text, "\\s[A-Z][a-z]*(\\s[A-Z][a-z]*)?", simplify = TRUE)[, 2]
  )
Thank you for your reply. Yes, your regex is working fine to detect the drug name.
No, I do not need the name of drugs. I only need to extract metabolism and then the word increase or decrease.
That is because some strings say "The metabolism of Drug b can be decreased when combined with Drug a" and others say "The metabolism of Drug b can be increased when combined with Drug a".
So I want to split my dataframe specifically on these 2 words, i.e. on whether a row says metabolism increased or metabolism decreased.
library(data.table)
library(dplyr)
library(stringr)

df <-
  data.table(
    text = c(
      "During experiments we found that Cipid Duotyllyl increases metabolism when combined with Heptaichreelgynthraene Kaspliorhite",
      "During experiments we found that Sipthyde Frustrur decreases metabolism when combined with Isopheduacceite",
      "During experiments we found that Philphin increases metabolism when combined with Monoapuphyodeptin Heptawonthitharhycin",
      "During experiments we found that Glolfide Diifludran decreases metabolism when combined with Monoichuxyrlumphein",
      "During experiments we found that Octacliusplodein increases metabolism when combined with Fonhesgechlid Diizirdolfygor"
    )
  )

df %>%
  mutate(
    # NA for rows that never mention metabolism
    metabolism = str_extract(text, "metabolism"),
    # matches the stem, so increases/increased and decreases/decreased are all covered
    inc_dec = str_extract(text, "increase|decrease")
  )
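From there, one possible way to actually split the dataframe is sketched below. It uses a plain data.frame with base split() to stay self-contained. The first two rows use the sentence forms from the question; the third row is made up purely to show a non-matching case, and the column name group and the variable by_group are my own choices, not anything from the real data.

```r
library(stringr)

df <- data.frame(
  text = c(
    "The metabolism of Drug b can be decreased when combined with Drug a.",
    "The metabolism of Drug b can be increased when combined with Drug a.",
    # Hypothetical row without "metabolism", only here to show what gets left out
    "The serum concentration of Drug b can be increased when combined with Drug a."
  ),
  stringsAsFactors = FALSE
)

# Inclusive matching: look for the two words of interest directly.
mentions_metabolism <- str_detect(df$text, "\\bmetabolism\\b")
direction <- str_extract(df$text, "\\b(increase|decrease)")

# Label each row; rows that do not mention metabolism get NA.
df$group <- ifelse(mentions_metabolism, paste("metabolism", direction), NA_character_)

# split() drops rows whose group is NA and returns one data frame per label.
by_group <- split(df, df$group)
names(by_group)
```

by_group[["metabolism decrease"]] and by_group[["metabolism increase"]] are then the two pieces. If the real sentences word things differently, the two patterns would need adjusting.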