Selecting all text after first instance of specific pattern in string

Hi everyone,

I have some messy data that looks like this after importing:

string <- "\n                \n                  FIRSTNAME LASTNAME\n                  AFFILIATION\n\n                  ADDRESS LINE 1\n                  ADDRESS LINE 2\n                  ADDRESS LINE 3\n\n                  CITY,\n                  STATE\n                  POSTCODE\n                  COUNTRY\n                  \n                  \n\n                  Phone: 123456789\n                  EMAIL@ADDRESS"

Created on 2023-08-06 with reprex v2.0.2

I want to isolate the address (i.e. everything from ADDRESS LINE 1 through to COUNTRY). There may or may not be a phone number or email afterwards, and the details of the address vary from person to person (e.g. they may not have entered a country or postcode). To clarify, my desired output would be something like this:

c("ADDRESS LINE 1 ADDRESS LINE 2 ADDRESS LINE 3 CITY STATE POSTCODE COUNTRY")

Within every string, the address is preceded by \n\n, so I was hoping to write a regex that isolates everything after the first \n\n, and then do some further cleaning to isolate the address. I'm not having any luck figuring it out, though. This is the closest I've got, but it only returns the first line and I can't figure out how to get the rest of the string:

stringr::str_extract(string, pattern = "\\n\\n(.*)")
#> [1] "\n\n                  ADDRESS LINE 1"

Created on 2023-08-06 with reprex v2.0.2
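
On why the attempt above stops after the first line: in stringr's regex engine, `.` does not match a newline by default, so `(.*)` ends at the first \n after the match. Here is a minimal sketch of two ways around that, using a shortened stand-in string rather than the real data. The names `rest` and `rest2` are only illustrative.

```r
library(stringr)

# Shortened stand-in for the real data (an assumption for this example)
string <- "HEADER\n\n  ADDRESS LINE 1\n  ADDRESS LINE 2\n\n  CITY"

# dotall = TRUE lets `.` match newlines, so `.*` runs to the end of the string
rest <- str_extract(string, regex("\\n\\n(.*)", dotall = TRUE))

# Alternative without a regex flag: n = 2 splits only at the first "\n\n",
# and the second piece is everything after it
rest2 <- str_split(string, fixed("\n\n"), n = 2)[[1]][2]
```

Note that `str_extract()` returns the whole match, so `rest` still starts with the two newlines, while `rest2` does not.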

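# Separator between blocks: a blank line (two consecutive newlines)
dbl_nl <- "\n\n"
# Regex for runs of three or more whitespace characters (newline plus indentation)
multis <- "\\s{3,}"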

string <- "\n                \n                  FIRSTNAME LASTNAME\n                  AFFILIATION\n\n                  ADDRESS LINE 1\n                  ADDRESS LINE 2\n                  ADDRESS LINE 3\n\n                  CITY,\n                  STATE\n                  POSTCODE\n                  COUNTRY\n                  \n                  \n\n                  Phone: 123456789\n                  EMAIL@ADDRESS"

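# Split on blank lines, keep the 2nd and 3rd chunks (street lines, then city
# through country), replace each long whitespace run with one space, and trim.
# The `_` placeholder requires R 4.2 or later.
result <- strsplit(string, dbl_nl)[[1]][2:3] |>
  gsub(multis, " ", x = _) |>
  trimws()

# Join the two chunks into a single address string
paste(result[1], result[2])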
#> [1] "ADDRESS LINE 1 ADDRESS LINE 2 ADDRESS LINE 3 CITY, STATE POSTCODE COUNTRY"

Created on 2023-08-07 with reprex v2.0.2
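
The code above assumes the address always sits in the second and third blocks. Since the question mentions that parts of the address and the phone/email lines may be missing, here is a rough base R sketch that keeps every block after the first and drops anything that looks like contact details. `extract_address()` is a hypothetical helper, the sample strings are made up, and the rule that contact blocks start with "Phone:" or contain "@" is an assumption that would need checking against the real data.

```r
# Hypothetical helper: returns the address from one raw string
extract_address <- function(x) {
  # Split into blocks at blank lines; drop the first block (name and affiliation)
  blocks <- strsplit(x, "\n\n", fixed = TRUE)[[1]][-1]
  # Collapse every whitespace run to one space and trim the ends
  blocks <- trimws(gsub("\\s+", " ", blocks))
  # Assumption: contact blocks start with "Phone:" or contain an "@"
  keep <- nzchar(blocks) & !grepl("^Phone:|@", blocks)
  paste(blocks[keep], collapse = " ")
}

# Made-up test strings: one with phone/email, one without
raw <- c(
  "\n  NAME\n  AFFILIATION\n\n  1 MAIN ST\n  UNIT 2\n\n  CITY,\n  STATE\n\n  Phone: 123\n  E@MAIL",
  "\n  NAME\n  AFFILIATION\n\n  1 MAIN ST\n\n  CITY,\n  STATE\n  COUNTRY"
)

# Apply to each string; vapply guarantees one character value per input
addresses <- vapply(raw, extract_address, character(1), USE.NAMES = FALSE)
```

If the comma after the city is unwanted, `gsub(",", "", addresses, fixed = TRUE)` removes it.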


This is exactly what I'm after, thank you technocrat! It seems to have worked for all test cases so far so fingers crossed...
