regex confusion with sub function

Pearl · September 15, 2022, 11:17am

Hello all,

I am getting a result I do not understand with my regex pattern and I am hoping that someone can explain to me what I am missing. In the code below, "anelope" is yielded for the answer when I would expect it to be "anteeope". I have spent quite a bit of time contemplating how that is the correct answer, but I am just not seeing it as the solution. Can anyone explain to me what I am missing?

My argument for it being "anteeope" is as follows. The pattern matched is the vowel followed by the l, in this case an e. The replacement pattern is to replace the first captured group, which is the vowel, twice. This should yield the double e. The l is dropped.

As to why it should not be "anelope": while there is a t in the second part of the pattern, it is not directly preceded by a vowel in this word, but rather by an n, so how is the t matching the pattern?

Thank you for your time. It is much appreciated.

animals <- c('cat', 'moose', 'impala', 'antelope', 'kiwi bird', 'dog', 'goose', 'hawk')
sub(pattern = '([aeiou]*)[slwt]', replacement = '\\1\\1', x = animals )

scottyd22 · September 15, 2022, 12:19pm

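I think the issue is the asterisk (*), which lets [aeiou] match zero or more times. With zero vowels allowed, the pattern can match a bare [slwt] character, so in "antelope" the first match is the "t" on its own with an empty captured group, and replacing it with '\\1\\1' simply deletes it. If you drop the asterisk, you get your intended result of "anteeope".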

sub(pattern = '([aeiou])[slwt]', replacement = '\\1\\1', x = animals )
#> [1] "caa"       "moooe"     "impaaa"    "anteeope"  "kiii bird" "dog"      
#> [7] "goooe"     "haak"

Created on 2022-09-15 with reprex v2.0.2.9000
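
If it helps to see this directly, here is a minimal sketch using base R's regexpr() and regmatches() to show what the original pattern matches first in "antelope". The names x and m are only for illustration.

```r
x <- "antelope"

# Locate the first match of the original pattern (asterisk inside the group)
m <- regexpr("([aeiou]*)[slwt]", x)

as.integer(m)             # start position of the first match
attr(m, "match.length")   # number of characters matched
regmatches(x, m)          # the matched text itself

# The captured group is empty here, so the replacement removes the match
sub("([aeiou]*)[slwt]", "\\1\\1", x)
```

The match starts at the "t" and is one character long, which is why the "t" disappears from the result.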

Pearl · September 15, 2022, 2:08pm

Thank you so much for your response. You are correct, dropping the asterisk does solve the problem, and logically it is not needed here either. I was trying to account for double vowels, but really the pattern just picks up the vowel immediately before the consonant and makes the replacement. If I want to keep the asterisk in there, it looks like I need to use a backreference like so:

sub(pattern = '([aeiou])\\1*[slwt]', replacement = '\\1\\1', x = animals )

The documentation for the asterisk clearly states that it makes the preceding item match zero or more times. So I guess what is happening is that it matches zero vowels in front of the t in the nt combination. My poor recollection was that the asterisk required at least one match; however, that is the plus sign. Using the plus sign also solves the problem.

sub(pattern = '([aeiou])+[slwt]', replacement = '\\1\\1', x = animals )
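
For comparison, a small sketch of how the quantifier and its position relative to the parentheses change the result. The second pattern, with the plus sign inside the group, is not from the thread and is added only for contrast. The vector two_animals is just an illustrative subset.

```r
two_animals <- c("moose", "antelope")

# '*' inside the group: the group may be empty, so a bare 't' can match
star_inside <- sub("([aeiou]*)[slwt]", "\\1\\1", two_animals)

# '+' inside the group: the whole vowel run is captured and doubled
plus_inside <- sub("([aeiou]+)[slwt]", "\\1\\1", two_animals)

# '+' outside the group: at least one vowel is required, one vowel is captured
plus_outside <- sub("([aeiou])+[slwt]", "\\1\\1", two_animals)

print(star_inside)
print(plus_inside)
print(plus_outside)
```

Only the first pattern turns "antelope" into "anelope". The other two require at least one vowel before the consonant and differ only in how a double vowel such as the "oo" in "moose" is handled.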

Thank you again and cheers!
