Extract part of a file name using grep

I have a series of file names that look like this

files<-c("S1_2_S1_test.txt", "S2_2_S18_test.txt", "S3_2_S9_test.txt")
#> [1] "S1_2_S1_test.txt"  "S2_2_S18_test.txt" "S3_2_S9_test.txt"

Created on 2022-04-16 by the reprex package (v2.0.1)

I want to remove certain parts of the file names and make them look like this

"1_2_new.txt"  "2_2_new.txt" "3_2_new.txt"

Regex is a real pain for me. I have been looking at tutorials but have not found anything that fits my case.
Any help or direction is appreciated.

This is as far as I have gotten:

files %>% 

[1] "1_2_1_new.txt"  "2_2_18_new.txt" "3_2_9_new.txt" 

Can I do this using grep or str_extract in one line?
How do I remove the number after the second "_"?

Thank you in advance.

Here is one solution.

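library(stringr)
files <- c("S1_2_S1_test.txt", "S2_2_S18_test.txt", "S3_2_S9_test.txt")
# Drop the leading "S", keep the two underscore-separated numbers, discard the rest
str_replace(files, pattern = "S(\\d+_\\d+).+", replacement = "\\1_new.txt")
#> [1] "1_2_new.txt" "2_2_new.txt" "3_2_new.txt"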


For this example you could try

files %>% str_extract("\\d+_\\d+") %>% str_c("_new.txt")



I know you already have an answer which works for you, but I just thought I would chime in with the base R version:

files <- c("S1_2_S1_test.txt", "S2_2_S18_test.txt", "S3_2_S9_test.txt")
gsub("\\w(\\d+_\\d+_)\\w+", "\\1new", files)
#> [1] "1_2_new.txt" "2_2_new.txt" "3_2_new.txt"

Created on 2022-04-20 by the reprex package (v2.0.0.9000)

Now, I'll explain what everything here does so you hopefully are more comfortable with regular expressions going forward.

In the pattern \\w(\\d+_\\d+_)\\w+:


\w is the character class of all "word" characters: a–z, A–Z, 0–9, and _. In R you need to "escape" the "escape character" (\) or you'll get errors. The first backslash is for R, the second is for the regex engine, so we write \\w. Here it matches the leading 'S' in each of your file names.
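To make the double escaping concrete, here is a small base R sketch (the variable names are only for illustration). It shows that the R string "\\w" holds just two characters, a backslash and a w, which is what the regex engine actually receives.

```r
files <- c("S1_2_S1_test.txt", "S2_2_S18_test.txt", "S3_2_S9_test.txt")

# The R source text "\\w" is a two-character string: a backslash and a "w".
pattern <- "\\w"
n_chars <- nchar(pattern)
print(n_chars)

# cat() prints the string the way the regex engine sees it.
cat(pattern, "\n")

# sub() replaces only the first match, so this drops the leading "S".
without_first <- sub(pattern, "", files)
print(without_first)
```

Using sub() rather than gsub() here is deliberate: gsub() would remove every word character, not just the first one.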


\\w(\\d+_\\d+_)\\w+

() parentheses create what is known as a "capture group." This is useful because it allows us to reference things in the original string later (that's what the \\1 is in the replacement pattern later on).


Inside the capture group we have the character class \\d, which is any digit 0–9, with the modifier +, which means "one or more", so it matches any non-empty run of digits. (The modifier * means "zero or more", so it would still match even if there were no digit in that position.) Next comes the string literal _, the underscore. This digits-then-underscore sequence appears twice because we want two runs of digits, each followed by an underscore.

In S1_2_S1_test.txt, the capture group holds 1_2_.
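The difference between + and * is easier to see on an input that lacks a digit. The following is a minimal sketch in base R. The second file name is hypothetical and was made up for this comparison, and the patterns are anchored with ^ only to keep the test simple.

```r
# The second name is a made-up example with no digit after the leading "S".
x <- c("S1_2_S1_test.txt", "S_2_S1_test.txt")

# "+" requires at least one digit before the first underscore.
plus_match <- grepl("^\\w\\d+_", x)
print(plus_match)

# "*" also accepts zero digits, so both names match.
star_match <- grepl("^\\w\\d*_", x)
print(star_match)
```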

Incidentally, you could write the capture group as,

((\\d+_){2})

where the outer () create the capture group we want, the inner () create another group, and the {} is a modifier saying to match the inner group exactly twice.
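As a quick check, the sketch below compares the two ways of writing the capture group. It assumes the nested form is ((\\d+_){2}), which is my reading of the description above, and the variable names are illustrative.

```r
files <- c("S1_2_S1_test.txt", "S2_2_S18_test.txt", "S3_2_S9_test.txt")

# Capture group written out explicitly: digits, underscore, digits, underscore.
explicit <- gsub("\\w(\\d+_\\d+_)\\w+", "\\1new", files)

# Same idea with a repeated inner group; \\1 still refers to the outer group.
repeated <- gsub("\\w((\\d+_){2})\\w+", "\\1new", files)

print(explicit)
print(identical(explicit, repeated))
```

Note that the nested version creates a second group (\\2), which only holds the last repetition of the inner pattern, so \\1 is the one to use in the replacement.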


Finally, the last \\w+ matches as many remaining word characters as possible. It stops before the . since that is not a "word" character.

In S1_2_S1_test.txt, this final \\w+ matches S1_test.

So the entire pattern matches:

S1_2_S1_test (everything before .txt)

and saves 1_2_ as "capture group" 1.
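If you want to inspect the full match and the capture group yourself, one option in base R is regexec() together with regmatches(). This is just a sketch for exploring the pattern, not part of the solution; the helper variable names are my own.

```r
files <- c("S1_2_S1_test.txt", "S2_2_S18_test.txt", "S3_2_S9_test.txt")

# regexec() records the position of the full match and of each capture group;
# regmatches() turns those positions into the matched substrings.
m <- regmatches(files, regexec("\\w(\\d+_\\d+_)\\w+", files))

# For each file name, element 1 is the full match and element 2 is group 1.
full_match <- vapply(m, function(v) v[1], character(1))
group_1 <- vapply(m, function(v) v[2], character(1))

print(full_match)
print(group_1)
```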

Then, in the replacement pattern we use,

\\1new


\1 is the first regex capture group, in this case 1_2_ (note we need to escape it in R as well so we write \\1).


new is just the literal string new.

So, we replace everything before .txt in the file name with our capture group and the string literal new.

S1_2_S1_test.txt becomes 1_2_new.txt

If you want to learn more about regular expressions, I particularly like the site https://regex101.com.

Here is a link to regex101 with your sample file names and this regex already entered. It does a great job of explaining the regex, and the Quick Reference section is invaluable if you don't write regex often.
