Help with Regex to Split Address Column into Multiple Variables in R (Handling Edge Cases)

Hi everyone!

I have a column of addresses that I need to split into three components:

  1. no_logradouro – the street name (can have multiple words)
  2. nu_logradouro – the number (can be missing, or 'SN' for "sem número", i.e. "no number")
  3. complemento – the complement (can include things like "CASA 02" or "BLOCO 02")

Here’s an example of a single address:

RUA DAS ORQUIDEAS 15 CASA 02

It should be split into:

# no_logradouro:  "RUA DAS ORQUIDEAS"

# nu_logradouro:  "15"

# complemento:  "CASA 02"

I am using the following regex inside R:

"^(.+?)(?:\\s+(\\d+|SN))(.*)$"

Which works for simple cases like:

"RUA DAS ORQUIDEAS 15 CASA 02"

However, when I test it on a larger set of examples, the regex doesn't handle all cases correctly. For instance, consider the following:

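library(stringr)

resultado <- str_match(
  c("AV 12 DE SETEMBRO 25 BLOCO 02",
    "RUA JOSE ANTONIO 132 CS 05",
    "AV CAXIAS 02 CASA 03",
    "AV 11 DE NOVEMBRO 2032 CASA 4",
    "RUA 05 DE OUTUBRO 25 CASA 02",
    "RUA 15",
    "AVENIDA 3 PODERES"),
  "^(.+?)(?:\\s+(\\d+|SN))(.*)$"
)
# Name the columns: the full match, then the three capture groups
colnames(resultado) <- c("address", "no_logradouro", "nu_logradouro", "complemento")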

This gives the following output (shown with dput()):

structure(c("AV 12 DE SETEMBRO 25 BLOCO 02", "RUA JOSE ANTONIO 132 CS 05",
"AV CAXIAS 02 CASA 03", "AV 11 DE NOVEMBRO 2032 CASA 4", "RUA 05 DE OUTUBRO 25 CASA 02",
"RUA 15", "AVENIDA 3 PODERES", "AV", "RUA JOSE ANTONIO", "AV CAXIAS",
"AV", "RUA", "RUA", "AVENIDA", "12", "132", "02", "11", "05",
"15", "3", " DE SETEMBRO 25 BLOCO 02", " CS 05", " CASA 03",
" DE NOVEMBRO 2032 CASA 4", " DE OUTUBRO 25 CASA 02", "", " PODERES"),
dim = c(7L, 4L), dimnames = list(NULL, c("address", "no_logradouro",
"nu_logradouro", "complemento")))

As you can see, the regex doesn’t work correctly for addresses such as:

"AV 12 DE SETEMBRO 25 BLOCO 02"
"RUA 15"
"AVENIDA 3 PODERES"

The expected output would be:

# 1. "AV 12 DE SETEMBRO 25 BLOCO 02" → no_logradouro: "AV 12 DE SETEMBRO";  nu_logradouro: "25";  complemento: "BLOCO 02".

# 2. "RUA 15" → no_logradouro: "RUA 15";  nu_logradouro: "";  complemento: "".

# 3. "AVENIDA 3 PODERES" → no_logradouro: "AVENIDA 3 PODERES";  nu_logradouro: "";  complemento: "".
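
The gap between this and what the pattern returns seems to come from the lazy first group. Because `(.+?)` is lazy, it stops as soon as the rest of the pattern can match, which is at the first number in the string rather than at the house number. A minimal check on the first failing address (the variable names `padrao` and `m` are just chosen for this example):

```r
library(stringr)

padrao <- "^(.+?)(?:\\s+(\\d+|SN))(.*)$"

# Columns of the result: full match, then the three capture groups
m <- str_match("AV 12 DE SETEMBRO 25 BLOCO 02", padrao)

m[1, 2]  # "AV": the lazy group stops before the first number
m[1, 3]  # "12": taken as the house number, although it belongs to the name
m[1, 4]  # " DE SETEMBRO 25 BLOCO 02": everything else lands in the complement
```

The same thing happens with "RUA 15" and "AVENIDA 3 PODERES": the only number in the string is consumed as nu_logradouro.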

How can I adapt my regex to handle these edge cases?

Thanks a lot for your help!

This is cross-posted and has an answer.
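
For anyone landing here, below is one possible approach. It is a sketch, not necessarily the solution given in the cross-posted answer. It assumes that a number immediately after the first word (the street type) belongs to the street name, and that the house number is the next standalone number or `SN` after that. The names `enderecos` and `padrao` are made up for this example.

```r
library(stringr)

enderecos <- c(
  "AV 12 DE SETEMBRO 25 BLOCO 02",
  "RUA JOSE ANTONIO 132 CS 05",
  "AV CAXIAS 02 CASA 03",
  "AV 11 DE NOVEMBRO 2032 CASA 4",
  "RUA 05 DE OUTUBRO 25 CASA 02",
  "RUA 15",
  "AVENIDA 3 PODERES"
)

# Group 1: street type, an optional number right after it (assumed to be
#          part of the name), then any further words that are not a
#          standalone number or "SN".
# Group 2: optional house number (digits or "SN").
# Group 3: whatever is left, taken as the complement.
padrao <- paste0(
  "^(\\S+(?:\\s+\\d+\\b)?(?:\\s+(?!(?:\\d+|SN)\\b)\\S+)*)",
  "(?:\\s+(\\d+|SN)\\b)?",
  "\\s*(.*)$"
)

resultado <- str_match(enderecos, padrao)
colnames(resultado) <- c("address", "no_logradouro", "nu_logradouro", "complemento")

# An optional group that did not take part in the match comes back as NA;
# turn those into empty strings to match the expected output in the question.
resultado[is.na(resultado)] <- ""

resultado[, -1]
```

This reproduces the expected split for the seven addresses in the question. The assumption has limits, though: an address like "RUA 15 CASA 02" is ambiguous under this rule (is 15 part of the name or the house number?), so it is worth checking the result against real data, possibly with a list of known complement keywords such as CASA, BLOCO or CS.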
