Remove one word when it appears in a sentence with another word, regardless of their order or how many words are between them

gocoyd · March 26, 2023, 7:36am

I have a list of strings like this:

string <- c("tasty apple", "tasty orange", "yellow banana", "red tasty peach", "tasty banana apple", "tasty apple yellow banana", "yellow orange banana", "peach tasty apple", "yellow banana tasty peach")

When there is just one type of fruit in the string, it is fine. However, when there is more than one, I have a list of co-occurring words and their replacements (it works like a dictionary):

pattern <- c("banana apple", "banana orange", "peach apple", "banana peach")
replacement <- c("apple", "banana", "peach", "banana")

I can remove one of the fruits when they are next to each other in the string. However, in my data there can be other words between them, and I do not know how to remove the unnecessary word. The order of the words in the string might differ as well.

I want it to be like this:

Before	After
tasty apple	tasty apple
tasty orange	tasty orange
yellow banana	yellow banana
red tasty peach	red tasty peach
tasty banana apple	tasty apple
tasty apple yellow banana	tasty apple yellow
yellow orange banana	yellow banana
peach tasty apple	peach tasty
yellow banana tasty peach	yellow banana tasty

Maybe I can use some kind of regular expression to match the words in between? But I need to keep those words and delete only the unnecessary one.
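
A minimal sketch of the regular-expression idea, assuming base R only. Instead of matching the words in between, each word is tested on its own with word-boundary anchors (`\\b`), so order and intervening words stop mattering. `drop_if_both` is a hypothetical helper name, not something from this thread.

```r
# Hypothetical helper: remove `loser` from x only where `winner` is also present.
drop_if_both <- function(x, winner, loser) {
  winner_rx <- paste0("\\b", winner, "\\b")
  loser_rx  <- paste0("\\b", loser, "\\b")
  # Both words must occur as whole words, in any order, any distance apart.
  hit <- grepl(winner_rx, x, perl = TRUE) & grepl(loser_rx, x, perl = TRUE)
  x[hit] <- gsub(loser_rx, "", x[hit], perl = TRUE)
  # Collapse the gap left by the removed word and trim the ends.
  trimws(gsub("\\s+", " ", x))
}

demo <- drop_if_both(
  c("tasty apple yellow banana", "tasty banana apple", "yellow banana"),
  winner = "apple", loser = "banana"
)
print(demo)
```

The third string keeps its banana because no apple is present.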

technocrat · March 26, 2023, 10:12am

If an element of string has more than one fruit name, what is the decision rule? Does the first fruit win, or the second?

These are inconsistent.

gocoyd · March 26, 2023, 2:35pm

Hello @technocrat! The order here does not matter. What matters is which types of fruit are present in the string. For example, when banana and apple are in the same string, only apple should be kept, no matter what.

However, I can modify my dictionary to cover 2 scenarios: when banana is first and when apple is first. In both cases the replacement will be apple. But that does not solve the problem of other words between them.

technocrat · March 27, 2023, 7:57am

OK, the rules then are

apple knocks out banana
banana knocks out orange
peach knocks out apple

I'll work on that
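
As a sketch, the rules can also be read straight off the `pattern` and `replacement` vectors from the question, assuming each pattern holds exactly two words and the replacement is one of them. Doing so gives a fourth rule as well (banana knocks out peach). Note that apple beats banana, banana beats peach, and peach beats apple, so the rules form a cycle and cannot be expressed as a single priority ordering.

```r
pattern <- c("banana apple", "banana orange", "peach apple", "banana peach")
replacement <- c("apple", "banana", "peach", "banana")

# Split each pattern into its two words; the word that is not the
# replacement is the one to drop.
pairs <- strsplit(pattern, " ", fixed = TRUE)
rules <- data.frame(
  keep = replacement,
  drop = mapply(function(pair, keep) setdiff(pair, keep),
                pairs, replacement, USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
print(rules)
```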

nirgrahamuk · March 27, 2023, 9:09am

I came up with some code that follows...
I initially got 'yellow orange banana' for 7, as apparently orange is a fruit rather than a colour and so should be an option for being dropped. So I put it as the 4th priority to resolve that.
I have a remaining discrepancy on 8, as 'peach tasty apple' goes to 'tasty apple' rather than 'peach tasty', owing to apple being prioritised above peach ...

string <- c("tasty apple", 
            "tasty orange", 
            "yellow banana", 
            "red tasty peach", 
            "tasty banana apple", 
            "tasty apple yellow banana", 
            "yellow orange banana", 
            "peach tasty apple", 
            "yellow banana tasty peach")

priority <- c(
  "apple",
  "banana",
  "peach",
  "orange"
)


library(tidyverse)
(pr_df <- expand_grid(
  p1 = priority,
  drop = priority
) |>
  filter(p1 != drop) |>
  group_by(p1) |>
  mutate(rn = row_number()) |>
  pivot_wider(
    values_from = "drop",
    names_from = "rn"
  ) |>
  mutate(drops = list(str_c(pick(everything())))) |>
  select(p1, drops))


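# For each string: keep the highest-priority fruit present, remove the others.
map_chr(string, \(x){
  # First entry of pr_df$p1 (in priority order) that occurs in x
  priority_keep <- pr_df$p1[head(which(
    stringi::stri_detect_fixed(x, pr_df$p1)),
    n = 1)]
  # No fruit found: return the string unchanged
  if (length(priority_keep) == 0) {
    return(x)
  }
  # Every other fruit in the list is a candidate for removal
  drops_to_drop <- filter(
    pr_df,
    p1 == !!priority_keep
  ) |>
    pull(drops) |>
    unlist()
  for (d in drops_to_drop) {
    x <- str_replace_all(x,
      pattern = d,
      replacement = ""
    )
  }
  # Trim and collapse any run of spaces left by the removed words
  str_squish(x)
})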
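
One way to resolve the discrepancy on 8 might be to drop the single priority list and apply the pairwise rules directly. The following is a sketch under assumptions, not a drop-in replacement: it uses base R only, the `keep`/`drop` vectors are transcribed from the question's dictionary, and `word_rx` and `apply_rules` are hypothetical helper names.

```r
string <- c("tasty apple", "tasty orange", "yellow banana", "red tasty peach",
            "tasty banana apple", "tasty apple yellow banana",
            "yellow orange banana", "peach tasty apple",
            "yellow banana tasty peach")

# Pairwise rules: when keep[i] and drop[i] are both present, remove drop[i].
keep <- c("apple", "banana", "peach", "banana")
drop <- c("banana", "orange", "apple", "peach")

# Whole-word pattern, so "apple" would not match inside "pineapple".
word_rx <- function(w) paste0("\\b", w, "\\b")

apply_rules <- function(x) {
  # Evaluate every rule against the original string first ...
  losers <- character(0)
  for (i in seq_along(keep)) {
    if (grepl(word_rx(keep[i]), x, perl = TRUE) &&
        grepl(word_rx(drop[i]), x, perl = TRUE)) {
      losers <- c(losers, drop[i])
    }
  }
  # ... then remove the losing words and tidy the whitespace.
  for (w in unique(losers)) x <- gsub(word_rx(w), "", x, perl = TRUE)
  trimws(gsub("\\s+", " ", x))
}

result <- vapply(string, apply_rules, character(1), USE.NAMES = FALSE)
print(data.frame(Before = string, After = result))
```

Because the rules are cyclic, a string containing apple, banana, and peach together would lose all three under this sketch; that case does not occur in the sample data and would need its own rule.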
