Extract text between headings

ricdob · April 20, 2022, 11:23am

Hey guys!
I'm trying to extract the parts of a text that sit between different headings. The headings have the form "Item N. Title", with N running from 1 to 15.
I started by finding a pattern that matches the "Item" part: str_extract_all(a, "(Item\\s\\d+\\.[[:blank:]])").

I just can't get it to extract the whole text between those headings.

Thanks in advance for the help!

nirgrahamuk · April 20, 2022, 12:12pm

Is this HTML? The text of interest seems to be in a different font style from the rest, so I would probably use the associated tags to get to the content rather than treating it as a single block of text to cut up with regular expressions.
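A minimal sketch of the tag-based idea, assuming the xml2 package is installed. The markup here is hypothetical: it assumes each heading is a `<p>` containing a `<b>` element and that the section body is the plain `<p>` elements that follow. The real file may mark headings differently, so the XPath would need adjusting.

```r
library(xml2)

# Hypothetical markup: bold <p> headings followed by plain <p> paragraphs.
html <- "<html><body>
<p><b>Item 1. Business</b></p>
<p>We make widgets.</p>
<p>We sell them worldwide.</p>
<p><b>Item 2. Properties</b></p>
<p>We own one factory.</p>
</body></html>"

doc   <- read_html(html)
paras <- xml_find_all(doc, "//p")
txt   <- xml_text(paras, trim = TRUE)

# A paragraph counts as a heading when it has a <b> child (an assumption).
is_heading <- xml_find_lgl(paras, "boolean(./b)")

# Number the sections: every heading starts a new one.
section_id <- cumsum(is_heading)
n_sections <- sum(is_heading)

# Group the non-heading paragraphs by section. The explicit factor levels
# keep headings that have no body text; text before the first heading
# (section 0) is dropped.
body <- split(
  txt[!is_heading],
  factor(section_id[!is_heading], levels = seq_len(n_sections))
)

# One string per section, named by its heading.
sections <- vapply(body, paste, character(1), collapse = "\n")
names(sections) <- txt[is_heading]
```

`sections[["Item 1. Business"]]` then holds the text of that section, with its paragraphs joined by newlines.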

ricdob · April 20, 2022, 12:18pm

Yes, it's an HTML file. I converted it to text format using htm2txt in R. I'm now trying to separate all the text blocks and then sort them by title. I guess I still need to extract the text blocks first.
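For the plain-text route, one possible sketch in base R is below. It assumes every heading sits on its own line in the form "Item N. Title"; the sample text and the variable names are made up for illustration. Rather than matching the text between headings directly, it locates each heading and takes everything up to the start of the next one.

```r
# Made-up sample standing in for the converted text.
txt <- paste(
  "Item 1. Business",
  "We make widgets.",
  "Item 2. Properties",
  "We own one factory.",
  "Item 15. Exhibits",
  "See the attached list.",
  sep = "\n"
)

# "Item", a number, a period, blanks, then the title up to the line end.
# (?m) lets ^ match at the start of every line.
heading_re <- "(?m)^Item\\s+\\d+\\.[ \\t]+[^\\n]*"

# Positions of all headings (this sketch assumes at least one match).
m      <- gregexpr(heading_re, txt, perl = TRUE)[[1]]
starts <- as.integer(m)
ends   <- starts + attr(m, "match.length") - 1L

headings <- substring(txt, starts, ends)

# Each body runs from just after its heading to just before the next
# heading, or to the end of the text for the last one.
body_start <- ends + 1L
body_end   <- c(starts[-1] - 1L, nchar(txt))
bodies     <- trimws(substring(txt, body_start, body_end))

# Named vector: heading -> text block.
sections <- setNames(bodies, headings)

# Sorting by title afterwards is then a one-liner.
sections_sorted <- sections[order(names(sections))]
```

Working from match positions keeps any text before the first heading out of the result, and a heading with nothing under it simply gets an empty string.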
