How to force `str_replace_all` to replace a whole decimal number, not part of it?

Hassanhijazi · June 11, 2024, 6:56pm

Is there a way to force str_replace_all to match the whole number, other than changing the order of the lookup dictionary?

library(stringr)

str <- "98-98.1-56"
lookup <- c('98' = 'A', '98.1' = 'B', '56' = 'C')
str_replace_all(str, lookup)
#> "A-A.1-C"

The desired output is:

#> "A-B-C"

Thank you

dromano · June 11, 2024, 7:47pm

Do you have a more general goal in mind? I ask because a solution to this specific question may not help with a more general approach you may be trying to implement.

craig.parylo · June 12, 2024, 9:23am

lookup2 <- lookup[order(nchar(names(lookup)), decreasing = TRUE)]
str_replace_all(str, lookup2)

str_replace_all applies the replacements in lookup one at a time, in the order they appear. Because '98' comes before '98.1' in lookup, every '98' in str (including the one inside '98.1') has already been replaced by A by the time the '98.1' pattern is tried.

Sorting lookup in descending order of the length of its names makes str_replace_all apply the longer, more specific patterns first.

This may help with your goal in simple cases. I agree with @dromano that a better understanding of your goal is helpful in recommending a more robust solution.
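
To make the effect of the ordering concrete, here is a minimal sketch that applies the three replacements by hand, one at a time, in the same order as lookup. The intermediate names step1 to step3 are only for illustration, and the sketch assumes (as I understand stringr's behavior) that a named vector is applied pattern by pattern, in order.

```r
library(stringr)

str <- "98-98.1-56"
lookup <- c('98' = 'A', '98.1' = 'B', '56' = 'C')

# Apply each pattern by hand, in the order it appears in lookup
step1 <- str_replace_all(str, '98', 'A')      # both occurrences of "98" become "A"
step2 <- str_replace_all(step1, '98.1', 'B')  # nothing left for this pattern to match
step3 <- str_replace_all(step2, '56', 'C')

step3
#> [1] "A-A.1-C"

# Sorting by name length puts '98.1' first, so it is replaced before '98'
lookup2 <- lookup[order(nchar(names(lookup)), decreasing = TRUE)]
str_replace_all(str, lookup2)
#> [1] "A-B-C"
```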

craig.parylo · June 12, 2024, 10:15am

Apologies @Hassanhijazi, I missed the part where you specified avoiding re-ordering the lookup.

I can't think of a way to get str_replace_all to work as desired without re-ordering. With a simple string replace, I can't see how to avoid edge cases where the replacement goes wrong.

Would this alternative approach work for you instead?

library(stringr)
library(purrr)
library(dplyr)

# split your string into separate terms on the '-' delimiter
str_vec <- str_split(str, '-')

# iterate over each term and replace with lookup value
test <- map(str_vec, \(.x) as.character(lookup[.x]))

# handle cases where the lookup doesn't cover all terms (just in case)
test <- coalesce(test[[1]], str_vec[[1]])

# reconstitute the string using the lookup values
str_output <- paste(test, collapse = '-')
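
If several strings need converting at once, the same split, look up, and rejoin idea could be wrapped in a small function. This is only a sketch: replace_terms() is a hypothetical helper name, not part of stringr, and it assumes the terms are always separated by a single delimiter.

```r
library(stringr)
library(purrr)

# Hypothetical helper: replace whole delimiter-separated terms using a named lookup
replace_terms <- function(x, lookup, sep = "-") {
  map_chr(str_split(x, fixed(sep)), \(terms) {
    hits <- unname(lookup[terms])  # NA where a term has no entry in lookup
    paste(ifelse(is.na(hits), terms, hits), collapse = sep)
  })
}

lookup <- c('98' = 'A', '98.1' = 'B', '56' = 'C')
replace_terms(c("98-98.1-56", "56-12-98.1"), lookup)
#> [1] "A-B-C"  "C-12-B"
```

Because each term is looked up by exact name, the order of lookup no longer matters, and terms without an entry (here "12") are passed through unchanged.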

dromano · June 12, 2024, 12:46pm

Since it would be just as much work to produce the lookup vector by hand as to edit the str vector by hand, I'm assuming you're either supplied with the lookup vector or you're creating the lookup vector programmatically (meaning, not by hand) — is that right?

dromano · June 12, 2024, 2:42pm

There is a way, but it seems to me that @craig.parylo's approach is more natural for this task, which raises my earlier question again: do you have a more general goal in mind?

Hassanhijazi · June 12, 2024, 8:34pm

Thank you very much @dromano and @craig.parylo.

@dromano your concern makes total sense.
The final goal is to replace numbers which represent masses with their respective names.
Some software reports masses rounded to the first decimal place, so I built a lookup vector with the most common masses. When I rounded the numbers, I thought that either str_replace_all() was misbehaving or I was missing something. It turns out this is the expected behavior, so I have to either do what @craig.parylo suggested or reorder the lookup, which is the easiest.

Thank you guys.

dromano · June 12, 2024, 10:50pm

Then I think @craig.parylo's second solution is the way to go, but just for completeness, here's a str_replace_all() solution (full reprex below):

library(stringr)
str <- "98-98.1-56"

masses <- 
  # split string into character vector
  str_split_1(str, '-')

masses
#> [1] "98"   "98.1" "56"

masses_regex <- 
  masses |> 
  # escape each period so that it is matched as a literal period:
  #  (an unescaped '.' in a pattern is not a literal period, but a regular
  #  expression that matches any single character)
  #  (the pattern '\\.' matches a literal period; the replacement '\\\\.'
  #  inserts a literal backslash followed by a period)
  str_replace_all('\\.', '\\\\.')

masses_regex
#> [1] "98"     "98\\.1" "56"

masses_regex <- 
  # add regular expressions that capture characters before and after masses:
  #  (^ and $ are anchors that match the start and end of the string)
  #  (| means OR)
  str_c('(^|-)', masses_regex, '(-|$)')

masses_regex
#> [1] "(^|-)98(-|$)"     "(^|-)98\\.1(-|$)" "(^|-)56(-|$)"

elements <- LETTERS[1:3]

elements
#> [1] "A" "B" "C"

elements_regex <- 
  # add regular expressions that replace characters captured first (before 
  # masses) and second (after masses)
  str_c('\\1', elements, '\\2')

elements_regex
#> [1] "\\1A\\2" "\\1B\\2" "\\1C\\2"

library(purrr)
lookup_regex <- 
  elements_regex |> 
  # use masses_regex to supply names for elements_regex
  set_names(masses_regex)

lookup_regex
#>     (^|-)98(-|$) (^|-)98\\.1(-|$)     (^|-)56(-|$) 
#>        "\\1A\\2"        "\\1B\\2"        "\\1C\\2"

str |> str_replace_all(lookup_regex)
#> [1] "A-B-C"

Created on 2024-06-12 with reprex v2.0.2
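
If the named lookup vector from the original question is already available, the same regex lookup could probably be built from it directly instead of from the split string. The following is a sketch under that assumption; it relies on str_escape(), which I believe is available from stringr 1.5.0 onwards, and uses base R's setNames() in place of purrr's set_names().

```r
library(stringr)

str <- "98-98.1-56"
lookup <- c('98' = 'A', '98.1' = 'B', '56' = 'C')

# escape regex metacharacters (here the '.') in the names, then wrap each
# name in the groups that capture the delimiter before and after it
patterns <- str_c('(^|-)', str_escape(names(lookup)), '(-|$)')

# put the captured delimiters back around each replacement value
replacements <- str_c('\\1', unname(lookup), '\\2')

lookup_regex <- setNames(replacements, patterns)

str_replace_all(str, lookup_regex)
#> [1] "A-B-C"
```

One thing to watch with this pattern: because the groups consume the delimiter, two identical masses in a row (for example "98-98") would, as far as I can tell, leave the second one unreplaced.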

DaniMori · June 19, 2024, 9:15am

Strictly answering your question, @Hassanhijazi: what you are looking for is a regular expression with a negative lookahead:

Negative lookahead is indispensable if you want to match something not followed by something else.

Which is exactly what you are asking for, "match something" (e.g. '98'), "not followed by something else" (e.g. '.1').

So your first regular expression must match '98', but must include a "negative lookahead" to make explicit that the matching pattern '98' must not be followed by '.1'. Therefore, your "lookup" vector must look like:

lookup <- c('98(?!\\.1)' = 'A', '98\\.1' = 'B', '56' = 'C')

Let's analyze the "weird element" here, the '98(?!\\.1)' string:

98 is the pattern you want to match; on its own, it matches every occurrence of "98", whatever follows it.
(?!<string>) is the "negative lookahead" assertion; it specifies that the preceding pattern matches except when it is followed by <string>.
\\.1 is <string>, the negative lookahead pattern; it specifies what the matching pattern must not be followed by (without making it part of the match).
\\. is the pattern that matches a dot (.). As the dot is itself a special character in regular expressions, it must be escaped with a backslash.
\\ is a backslash; as the backslash is itself an escape character in R string syntax, it must also be escaped.
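
To see the lookahead in isolation, here is a small sketch that counts and replaces matches of '98' with and without the assertion, using only the example string from the question:

```r
library(stringr)

str <- "98-98.1-56"

# plain '98' matches twice: the standalone 98 and the start of 98.1
str_count(str, '98')
#> [1] 2

# with the negative lookahead, only the standalone 98 matches
str_count(str, '98(?!\\.1)')
#> [1] 1

# the lookahead is not part of the match, so 98.1 is left untouched
str_replace_all(str, '98(?!\\.1)', 'A')
#> [1] "A-98.1-56"
```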

Here you have a reprex with the solution:

library(stringr)

str <- "98-98.1-56"
lookup <- c('98(?!\\.1)' = 'A', '98\\.1' = 'B', '56' = 'C')

str |> str_replace_all(lookup)
#> [1] "A-B-C"

There are of course other solutions; for example, you can use a "positive lookahead" to require that '98' be followed by a dash (-), i.e. '98(?=-)'. However, I think the solution I proposed is the one that most meaningfully represents "98 as a whole number", because it explicitly says "not followed by a decimal marker and the digit 1". It is also the easiest to generalize to other cases (e.g. "any whole number, not followed by a decimal marker and another digit").
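
As a rough sketch of that generalization, the lookarounds could be added to every name in the lookup programmatically. whole_number() is a hypothetical helper name, the added lookbehind is my own assumption about what "whole" should mean on the left-hand side, and str_escape() is, I believe, available from stringr 1.5.0 onwards.

```r
library(stringr)

lookup <- c('98' = 'A', '98.1' = 'B', '56' = 'C')

# Hypothetical helper: match a number only when it is not part of a longer one
whole_number <- function(x) {
  str_c(
    '(?<![\\d.])',  # not preceded by a digit or a decimal point
    str_escape(x),  # the number itself, with '.' escaped
    '(?!\\.?\\d)'   # not followed by a digit, or by a decimal point and a digit
  )
}

lookup_whole <- setNames(unname(lookup), whole_number(names(lookup)))

str_replace_all("98-98.1-56", lookup_whole)
#> [1] "A-B-C"
str_replace_all("98-98-56.2", lookup_whole)
#> [1] "A-A-56.2"
```

Since lookarounds do not consume any characters, the result should not depend on the order of lookup, and repeated values next to each other are both replaced.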

Hope it helps!

dromano · June 19, 2024, 11:18am

Thank you for the information about negative lookahead, @DaniMori. From @Hassanhijazi's example, you can see that the question was about replacing "whole" decimal numbers, like 98.1, with a letter, rather than replacing whole numbers in the mathematical sense. Still, your solution does point the way to an alternative approach.
