I often want to use a regex to pull parts of the strings in a chr column into their own columns. It's similar to separate
but more general. I can get it to work, but I find it awkward and wonder whether there's a better way.
library(tidyverse)

df <- tribble(
  ~filename,
  "2008_some_name_author1.xlsx",
  "2008_some_name_author2.xlsx",
  "2008_some_name_author3.xlsx"
)
pattern <- "(\\d+).*_([^_]*)\\.xlsx"
df %>%
  pull(filename) %>%
  str_match(pattern)

df %>%
  # this is ugly: mutate(author = filename %>% str_match(pattern)[,2])
  mutate(year = filename %>% str_match(pattern) %>% as.tibble %>% pull(2),
         author = filename %>% str_match(pattern) %>% as.tibble %>% pull(3))

# really want something like:
# regex_separate(filename, into = c("year", "author"), pattern)
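For concreteness, here is a minimal sketch of what I understand the two functions to return, and why the code above picks columns 2 and 3. The two-element filenames vector is only for illustration, and the dot before xlsx is escaped so that it matches a literal dot.

```r
library(stringr)

filenames <- c("2008_some_name_author1.xlsx", "2008_some_name_author2.xlsx")
pattern <- "(\\d+).*_([^_]*)\\.xlsx"

# str_extract returns only the full match, as a character vector
full <- str_extract(filenames, pattern)

# str_match returns a character matrix: column 1 is the full match,
# columns 2 and 3 hold the first and second capture groups
m <- str_match(filenames, pattern)

years <- m[, 2]
authors <- m[, 3]
```

So the capture groups are only reachable through str_match, and only by column position.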
I've looked over the str_match documentation fairly closely but I haven't seen a usage example like this one. A couple of things:
- Why str_match and not str_extract? (I think of this as "extracting bits of the string".)
- Is getting the nth column of the matrix the way to use str_match in this context? If so, is there anything better than the two options above ([,2] and %>% as.tibble %>% pull(2))?
- Is there any way to do "both at once"? The above code runs the regex twice.
- Does anyone think something like separate would be useful, using paren capturing rather than splitting? I love that separate is explicit but avoids 'magic numbers' etc. Something like:
df %>%
  regex_separate(filename, into = c("year", "author"), pattern)
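A rough sketch of what I have in mind, assuming a hypothetical helper built on str_match. regex_separate is a name I made up, not a function from any package. It runs the regex once and names the capture groups via into, so there are no column indices at the call site.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical helper, not part of any package: runs the regex once and
# adds one column per capture group, named by `into`.
regex_separate <- function(data, col, into, pattern) {
  values <- pull(data, {{ col }})
  # Drop column 1 (the full match) and keep only the capture groups
  groups <- str_match(values, pattern)[, -1, drop = FALSE]
  stopifnot(ncol(groups) == length(into))
  colnames(groups) <- into
  bind_cols(data, as_tibble(groups))
}

df <- tribble(
  ~filename,
  "2008_some_name_author1.xlsx",
  "2008_some_name_author2.xlsx"
)

out <- df %>%
  regex_separate(filename, into = c("year", "author"),
                 pattern = "(\\d+).*_([^_]*)\\.xlsx")
```

Here out keeps filename and gains chr columns year and author. Rows that don't match the pattern would come back as NA, since that is what str_match returns for them.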
Or am I going about this entirely the wrong way?