I know you already have an answer which works for you, but I just thought I would chime in with the base R version:
files <- c("S1_2_S1_test.txt", "S2_2_S18_test.txt", "S3_2_S9_test.txt")
gsub("\\w(\\d+_\\d+_)\\w+", "\\1new", files)
#> [1] "1_2_new.txt" "2_2_new.txt" "3_2_new.txt"
Created on 2022-04-20 by the reprex package (v2.0.0.9000)
Now, I'll explain what everything here does so you hopefully are more comfortable with regular expressions going forward.
In the pattern:
\\w(\\d+_\\d+_)\\w+
\w is the character class of all "word" characters: a–z, A–Z, 0–9, and _. Written once, with no modifier, it matches exactly one word character. In an R string the backslash is itself the escape character, so you need to escape it or R will complain about an unrecognized escape. The first backslash is for R, the second is for the regex engine, so we write \\w. Here it matches the first 'S' in each of your file names, for example the leading S in S1_2_S1_test.txt.
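To make the double escaping concrete, here is a small sketch using only base R. The variable names are just for illustration; it checks what the string "\\w" really contains and what a lone \w matches in the sample file names.

```r
# The R string "\\w" holds two characters: a backslash and a w.
pattern <- "\\w"
nchar(pattern)
#> [1] 2
writeLines(pattern)  # what the regex engine actually receives
#> \w

files <- c("S1_2_S1_test.txt", "S2_2_S18_test.txt", "S3_2_S9_test.txt")

# regexpr() locates the first match in each string; regmatches() extracts it.
first_word_char <- regmatches(files, regexpr(pattern, files))
first_word_char
#> [1] "S" "S" "S"
```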
\\w(\\d+_\\d+_)\\w+
The parentheses () create what is known as a "capture group." This is useful because it lets us refer back to the text matched inside them (that's what the \\1 in the replacement pattern does later on).
\\w(\\d+_\\d+_)\\w+
Inside the capture group we have the character class \\d, which is any digit 0–9, with the modifier +, which means "one or more", so \\d+ matches any non-empty run of digits. (The modifier * means "zero or more", so \\d* would also match when there is no digit at that position.) Next comes the string literal _, the underscore. The \\d+_ sequence then repeats, because we want two runs of digits, each followed by an underscore.
In S1_2_S1_test.txt, the group matches 1_2_.
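If the difference between + and * is unclear, a quick check like the one below may help. It is only a sketch: the test strings "12_" and "_" are made up for illustration, and the ^ and $ anchors are added so the whole string has to match.

```r
# + requires at least one digit before the underscore; * also accepts none.
x <- c("12_", "_")
plus_ok <- grepl("^\\d+_$", x)
plus_ok
#> [1]  TRUE FALSE
star_ok <- grepl("^\\d*_$", x)
star_ok
#> [1] TRUE TRUE

# On its own, \d+_\d+_ picks out the digits-and-underscores prefix.
files <- c("S1_2_S1_test.txt", "S2_2_S18_test.txt", "S3_2_S9_test.txt")
prefix <- regmatches(files, regexpr("\\d+_\\d+_", files))
prefix
#> [1] "1_2_" "2_2_" "3_2_"
```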
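Incidentally, you could write the capture group as
((\d+_){2})
(in an R string, ((\\d+_){2})), where the outer () create the capture group we want, the inner () create another group, and the {2} is a modifier saying to match the inner group exactly twice. Groups are numbered by their opening parenthesis, so the outer group is still group 1.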
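As a quick sanity check, the two spellings of the capture group can be compared directly. This is a sketch with illustrative variable names, using the same sample file names as above.

```r
files <- c("S1_2_S1_test.txt", "S2_2_S18_test.txt", "S3_2_S9_test.txt")

# Explicit repetition versus the {2} quantifier on an inner group.
original <- gsub("\\w(\\d+_\\d+_)\\w+", "\\1new", files)
compact  <- gsub("\\w((\\d+_){2})\\w+", "\\1new", files)

# \\1 refers to the outer group in the compact form, so both should agree.
identical(original, compact)
#> [1] TRUE
```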
\\w(\\d+_\\d+_)\\w+
Finally, the last \\w+ matches as many of the remaining word characters as possible. It stops before the . since that is not a "word" character.
In S1_2_S1_test.txt, it matches S1_test.
So the entire pattern matches S1_2_S1_test, that is, everything before .txt, and saves 1_2_ as "capture group" 1.
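One way to see both the whole match and the capture group is regexec(), which is not used in the answer itself and is swapped in here purely for inspection. A minimal sketch:

```r
files <- c("S1_2_S1_test.txt", "S2_2_S18_test.txt", "S3_2_S9_test.txt")

# regexec() records the whole match plus each capture group;
# regmatches() turns that into one character vector per input string.
m <- regmatches(files, regexec("\\w(\\d+_\\d+_)\\w+", files))
m[[1]]
#> [1] "S1_2_S1_test" "1_2_"

whole  <- vapply(m, function(x) x[1], character(1))  # entire match
group1 <- vapply(m, function(x) x[2], character(1))  # capture group 1
whole
#> [1] "S1_2_S1_test"  "S2_2_S18_test" "S3_2_S9_test"
group1
#> [1] "1_2_" "2_2_" "3_2_"
```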
Then, in the replacement pattern we use
\\1new
where \1 is the first regex capture group, in this case 1_2_ (note we need to escape it in R as well, so we write \\1), and new is just the literal string new.
So, we replace everything before .txt in the file name with our capture group and the string literal new: S1_2_S1_test.txt becomes 1_2_new.txt.
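To see what each piece of the replacement contributes, it can help to vary it on a single file name. The sketch below uses sub() rather than gsub(); since the pattern matches only once per name, the two behave the same here. The variable names are illustrative.

```r
file <- "S1_2_S1_test.txt"
pattern <- "\\w(\\d+_\\d+_)\\w+"

# Capture group followed by the literal "new", as in the answer.
both <- sub(pattern, "\\1new", file)
both
#> [1] "1_2_new.txt"

# Capture group only: the matched text shrinks to "1_2_".
group_only <- sub(pattern, "\\1", file)
group_only
#> [1] "1_2_.txt"

# Literal only: the whole match is replaced by "new".
literal_only <- sub(pattern, "new", file)
literal_only
#> [1] "new.txt"
```

In every case the .txt survives because the pattern never matched it.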
If you want to learn more about regular expressions, I particularly like the site https://regex101.com.
Here is a link to regex101 with your sample file names and this regex already entered. It does a great job of explaining the regex, and the Quick Reference section is invaluable if you don't write regex often.