How to force `str_replace_all` to replace a whole number with a decimal, not just part of it?

Is there a way to force str_replace_all to match the whole number,
rather than having to change the order of the lookup dictionary?

library(stringr)
str <- "98-98.1-56"
lookup <- c('98' = 'A', '98.1' = 'B', '56' = 'C')
str_replace_all(str, lookup)
#> "A-A.1-C"

The desired output is:

#> "A-B-C"

Thank you :slight_smile:

Do you have a more general goal in mind? I ask because a solution to this specific question may not help with a more general approach you may be trying to implement.

lookup2 <- lookup[order(nchar(names(lookup)), decreasing = TRUE)]
str_replace_all(str, lookup2)

str_replace_all applies the patterns in lookup one after another, in the order they appear. Because '98' comes before '98.1' in lookup, every '98' in str (including the one inside '98.1') is replaced by 'A' before the '98.1' pattern is tried, so that pattern no longer finds anything to match.

Sorting lookup in descending order of the length of its names makes str_replace_all apply the longer, more specific patterns first.
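The ordering effect can be reproduced by hand. This is a minimal sketch, assuming the named vector is applied one pattern at a time in the order given; the intermediate names step1 to step3 are only for illustration.

```r
library(stringr)

str <- "98-98.1-56"
lookup <- c('98' = 'A', '98.1' = 'B', '56' = 'C')

# Apply each pattern separately, in the same order as lookup
step1 <- str_replace_all(str,   "98",   "A")  # both occurrences of "98" are replaced
step2 <- str_replace_all(step1, "98.1", "B")  # nothing left to match
step3 <- str_replace_all(step2, "56",   "C")

step1
#> [1] "A-A.1-56"
step3
#> [1] "A-A.1-C"
```

The last value is the same as the output of str_replace_all(str, lookup) in the original question.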

This may help with your goal in simple cases. I agree with @dromano that a better understanding of your goal is helpful in recommending a more robust solution.


Apologies @Hassanhijazi, I missed the part where you specified avoiding re-ordering the lookup.

I can't think of a way to get str_replace_all to work as desired without re-ordering, and I can't see a way to rule out edge cases where a simple string replace would fail.

Would this alternative approach work for you instead?

library(stringr)  # str_split()
library(purrr)    # map()
library(dplyr)    # coalesce()

# split your string into separate terms on the '-' delimiter
str_vec <- str_split(str, '-')

# iterate over each term and replace with lookup value
test <- map(str_vec, \(.x) as.character(lookup[.x]))

# handle cases where the lookup doesn't cover all terms (just in case)
test <- coalesce(test[[1]], str_vec[[1]])

# reconstitute the string using the lookup values
str_output <- paste(test, collapse = '-')
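The split, look up, and rejoin steps above could also be wrapped in a small function that handles several strings at once. This is only a sketch in base R: replace_terms() is a hypothetical helper name, not part of any package, and it assumes the terms are always separated by a single literal delimiter.

```r
# Hypothetical helper (not from any package): replace whole terms via a lookup
replace_terms <- function(x, lookup, sep = "-") {
  vapply(strsplit(x, sep, fixed = TRUE), function(terms) {
    # exact-name lookup, so "98" can never match part of "98.1"
    hits <- unname(lookup[terms])
    # keep the original term wherever the lookup has no entry
    paste(ifelse(is.na(hits), terms, hits), collapse = sep)
  }, character(1))
}

lookup <- c('98' = 'A', '98.1' = 'B', '56' = 'C')
replace_terms(c("98-98.1-56", "56-12-98.1"), lookup)
#> [1] "A-B-C"  "C-12-B"
```

Because each term is matched by name rather than by regular expression, the order of lookup does not matter here.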

Since it would be just as much work to produce the lookup vector by hand as to edit the str vector by hand, I'm assuming you're either supplied with the lookup vector or you're creating the lookup vector programmatically (meaning, not by hand) — is that right?


There is, but it seems to me that @craig.parylo 's approach is more natural for this task, which raises my earlier question again:


Thank you very much @dromano and @craig.parylo.

@dromano your concern makes total sense.
The final goal is to replace numbers which represent masses with their respective names.
Some software provides the mass rounded to the first decimal. So I built a lookup vector with the most common masses, but after rounding the numbers I saw that str_replace_all() was either misbehaving or I was missing something. It turns out this is the expected behavior, and I have to either do what @craig.parylo suggested or order the lookup, which is the easiest.

Thank you guys.

Then I think @craig.parylo 's second solution is the way to go, but just for completeness, here's a str_replace_all() solution (the full reprex follows):

library(stringr)
str <- "98-98.1-56"

masses <- 
  # split string into character vector
  str_split_1(str, '-')

masses
#> [1] "98"   "98.1" "56"
masses_regex <- 
  masses |> 
  # replace period by regular expression representing a period:
  #  (the period is not treated by str_replace_all() as a literal period, but
  #  as a regular expression that matches any single character)
  #  (\\ blocks the special meaning of the regular expression '.' (\\.), and 
  #  also blocks its own special meaning in a regular expression (\\\\))
  str_replace_all('\\.', '\\\\.')

masses_regex
#> [1] "98"     "98\\.1" "56"

masses_regex <- 
  # add regular expressions that capture characters before and after masses:
  #  (^ and $ match the empty positions at the beginning and end of the string)
  #  (| means OR)
  str_c('(^|-)', masses_regex, '(-|$)')

masses_regex
#> [1] "(^|-)98(-|$)"     "(^|-)98\\.1(-|$)" "(^|-)56(-|$)"
elements <- LETTERS[1:3]

elements
#> [1] "A" "B" "C"

elements_regex <- 
  # add regular expressions that replace characters captured first (before 
  # masses) and second (after masses)
  str_c('\\1', elements, '\\2')

elements_regex
#> [1] "\\1A\\2" "\\1B\\2" "\\1C\\2"

library(purrr)
lookup_regex <- 
  elements_regex |> 
  # use masses_regex to supply names for elements_regex
  set_names(masses_regex)

lookup_regex
#>     (^|-)98(-|$) (^|-)98\\.1(-|$)     (^|-)56(-|$) 
#>        "\\1A\\2"        "\\1B\\2"        "\\1C\\2"

str |> str_replace_all(lookup_regex)
#> [1] "A-B-C"

Created on 2024-06-12 with reprex v2.0.2
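A possible variation on the reprex above is to build the patterns directly from the names of the original lookup vector and to use lookarounds instead of capture groups. This is a sketch under assumptions: it treats a term boundary as "not next to a digit or a period", and the names patterns and lookup_bounded are made up for illustration. Since lookarounds do not consume the delimiter, a term that immediately follows another replaced term (as in "98-98") should still be matched.

```r
library(stringr)

lookup <- c('98' = 'A', '98.1' = 'B', '56' = 'C')

# escape the period so it matches a literal "." (same step as in the reprex)
patterns <- str_replace_all(names(lookup), "\\.", "\\\\.")

# require that the match is not preceded or followed by a digit or a period
patterns <- str_c("(?<![0-9.])", patterns, "(?![0-9.])")

# name the replacements by their bounded patterns
lookup_bounded <- setNames(unname(lookup), patterns)

str_replace_all("98-98.1-56", lookup_bounded)
#> [1] "A-B-C"
str_replace_all("98-98", lookup_bounded)
#> [1] "A-A"
```

The order of lookup is left unchanged, which was the original requirement.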
