Extract strings

strings <- c("/run/media/bb/cc/GA/DrRao/JOBS/Edgar filings_full text/Form S-1/10/10_S-1_2013-11-20_0001104659-13-086087.txt", "/run/media/bb/cc/GA/DrRao/JOBS/Edgar filings_full text/Form S-1/1001172/1001172_S-1_2013-01-20_0001104659-13-086087.txt")

I need to extract the number that comes after "Form S-1", between the first / and the second / that follow it. For the two paths above, that is 10 and 1001172. How could I achieve this? Thanks!

You can use a regular expression with a look-behind assertion, which has the form (?<=...). It means "match only text that is immediately preceded by whatever pattern replaces the three dots"; the preceding text itself is not included in the match.

library(stringr)
strings <- c("/run/media/bb/cc/GA/DrRao/JOBS/Edgar filings_full text/Form S-1/10/10_S-1_2013-11-20_0001104659-13-086087.txt", "/run/media/bb/cc/GA/DrRao/JOBS/Edgar filings_full text/Form S-1/1001172/1001172_S-1_2013-01-20_0001104659-13-086087.txt")
# (?<=S-1/) requires "S-1/" immediately before the match; \\d+ then matches the run of digits
numbers <- str_extract(strings, "(?<=S-1/)\\d+")
numbers
#> [1] "10"      "1001172"

Created on 2019-10-28 by the reprex package (v0.3.0.9000)
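
For comparison, here is a rough base R sketch of the same extraction, in case stringr is not available. It is not part of the answer above, and the variable names (with_prefix, with_lookbehind, with_group, from_path) are made up for illustration. It contrasts a plain pattern with the look-behind version, then shows two alternatives: a capture group with sub(), and reading the parent directory name directly, which assumes the number is always the folder that contains the file.

```r
strings <- c(
  "/run/media/bb/cc/GA/DrRao/JOBS/Edgar filings_full text/Form S-1/10/10_S-1_2013-11-20_0001104659-13-086087.txt",
  "/run/media/bb/cc/GA/DrRao/JOBS/Edgar filings_full text/Form S-1/1001172/1001172_S-1_2013-01-20_0001104659-13-086087.txt"
)

# Without a look-behind, the "S-1/" prefix is part of the match
with_prefix <- regmatches(strings, regexpr("S-1/\\d+", strings, perl = TRUE))

# With a look-behind, "S-1/" must precede the digits but is not returned
# (perl = TRUE is needed for look-behind support in base R)
with_lookbehind <- regmatches(strings, regexpr("(?<=S-1/)\\d+", strings, perl = TRUE))

# Capture group: match the whole string and keep only the digits in parentheses
with_group <- sub(".*S-1/(\\d+)/.*", "\\1", strings, perl = TRUE)

# No regex at all: assumes the number is the name of the file's parent directory
from_path <- basename(dirname(strings))

print(with_prefix)
print(with_lookbehind)
print(with_group)
print(from_path)
```

The last three vectors should all hold the same two values for these paths. The look-behind and capture-group versions depend on the "S-1/" text, while the directory-based version depends only on the folder layout.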

Are there any tutorials for understanding regular expressions?

There are lots of resources online, but if you are looking for a book, I would recommend this one:

https://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124

Thanks. I think I will stick with the stringr cheat sheet.

This article is more specific in that regard, but it is not meant to be a regex tutorial:

https://stringr.tidyverse.org/articles/regular-expressions.html

You can get in some practice with Regex Golf. It's a common game (the link is just one person's version) with the goal of writing a regular expression to match everything in one list and nothing in another. It's scored by the number of characters in the expression, so it's like golf in that lower scores are better.
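
To make the rules concrete, here is a small sketch of how one attempt could be checked in R. The word lists and the golf_score() helper are invented for illustration, and the scoring follows only the description above (pattern length, lower is better); a particular version of the game may score differently.

```r
# Hypothetical word lists, made up for this example
match_these <- c("cat", "cart", "coat")
avoid_these <- c("dog", "act", "tack")

# Returns the pattern length if the pattern matches every string in `yes`
# and no string in `no`; returns NA when the attempt is invalid.
golf_score <- function(pattern, yes, no) {
  hits_all  <- all(grepl(pattern, yes, perl = TRUE))
  hits_none <- !any(grepl(pattern, no, perl = TRUE))
  if (hits_all && hits_none) nchar(pattern) else NA_integer_
}

score_anchored <- golf_score("^c", match_these, avoid_these)  # valid attempt
score_loose    <- golf_score("c", match_these, avoid_these)   # also hits "act" and "tack"

print(score_anchored)
print(score_loose)
```

Here "^c" is a valid two-character solution, while the shorter "c" is rejected because it also matches words in the second list.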
