Extracting a specific part from a file name

ricdob · March 8, 2022, 9:48am

Hey guys,

Is there an easy way to extract a specific part from file names?
1800_10-Q_2012-05-08_0001104659-12-034444
I'm trying to extract everything after the last _ (0001104659-12-034444).
My code so far: str_extract(df$document, pattern = "(?<=-)\\d+(?=\\.)")
Result: 034444
I can't get it to extract the whole number at the end. If I use "_" as the starting point, it just returns NA.

Thanks a lot for all help!

ricdob · March 8, 2022, 10:00am

Managed to get it to work:
str_extract(df$document, '([0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][-][0-9][0-9][-])\\d+(?=\\.txt)')
Might not be the most elegant, but at least it's working.

andresrcs · March 9, 2022, 12:15am

This regex is much simpler:

library(stringr)

str_extract("1800_10-Q_2012-05-08_0001104659-12-034444", "[^_]+$")
#> [1] "0001104659-12-034444"

Created on 2022-03-08 by the reprex package (v2.0.1)
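
For later readers, here is a rough breakdown of why "[^_]+$" works, as a sketch rather than the only way to do it. The two sample names are taken from this thread with any extension left off, and the base R sub() line is an alternative I am adding for comparison, not something from the posts above.

```r
library(stringr)

# Sample names from this thread, without a file extension
files <- c("1800_10-Q_2012-05-08_0001104659-12-034444",
           "38079_10-Q_2006-05-10_0001104659-06-033149")

# [^_]+ : one or more characters that are not an underscore
# $     : the match has to end at the end of the string
# Together: the last underscore-free run, i.e. everything after the last "_"
with_stringr <- str_extract(files, "[^_]+$")

# Base R alternative (an assumption, not from the thread):
# the greedy ".*_" consumes everything up to and including the last "_"
with_base <- sub(".*_", "", files)

with_stringr
with_base
```

Both calls should return the same trailing piece for names like these. Note that neither one knows anything about file extensions, so a trailing ".txt" or ".html" would be kept.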

ricdob · March 9, 2022, 9:31am

Thanks a lot,
actually your regex gives me a result that includes .html (which I forgot to mention in my first post). I'll try to adjust it for my purpose.
38079_10-Q_2006-05-10_0001104659-06-033149.html <- that's the whole file name, sorry, my bad

andresrcs · March 10, 2022, 12:26am

In case you need it:

library(stringr)

str_extract(c("1800_10-Q_2012-05-08_0001104659-12-034444.html",
              "1800_10-Q_2012-05-08_0001104659-12-034444.txt"),
            pattern = "[^_]+(?=\\..+$)")
#> [1] "0001104659-12-034444" "0001104659-12-034444"

Created on 2022-03-09 by the reprex package (v2.0.1)
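
If the lookahead feels hard to read, another option is to do it in two steps: strip the extension first, then take the part after the last underscore. This is only a sketch under a few assumptions: df and its document column mirror the names used earlier in the thread, the accession column name is made up for illustration, and tools::file_path_sans_ext() is a base R helper that was not used in the posts above.

```r
library(stringr)

# Hypothetical data frame modelled on the df$document column from the question
df <- data.frame(
  document = c("1800_10-Q_2012-05-08_0001104659-12-034444.txt",
               "38079_10-Q_2006-05-10_0001104659-06-033149.html")
)

# Step 1: drop the file extension (".txt", ".html", ...)
no_ext <- tools::file_path_sans_ext(df$document)

# Step 2: keep everything after the last underscore
# ("accession" is an illustrative column name, not from the thread)
df$accession <- str_extract(no_ext, "[^_]+$")

df$accession
```

The single-regex version with the lookahead does the same job in one call; splitting it up mainly makes each step easier to check on its own.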
