Dong
September 4, 2018, 11:53pm
1
Hi,
I am trying to use extract() to handle optional substrings. For example,
library(tidyverse)
df <- as.tibble(c(
"A = X0 -> X1",
"B = Y1" ))
df %>% extract(value, into = c("variable", "v0", "v1"),
regex = "(\\w+) = (\\w+) -> (\\w+)")
#> # A tibble: 2 x 3
#> variable v0 v1
#> <chr> <chr> <chr>
#> 1 A X0 X1
#> 2 <NA> <NA> <NA>
Created on 2018-09-04 by the reprex package (v0.2.0).
I want to match the second row as well, extracting the values "B" and "Y1" into "variable" and "v1" and leaving v0 empty (NA).
This is probably a general regex question beyond my skill level. Please share your suggestions/solutions.
Thanks,
Dong
markdly
September 5, 2018, 12:56am
2
Hi Dong,
For me, I'd tackle this problem by using separate() a couple of times instead of extract().
library(tidyverse)
df <- as.tibble(c(
"A = X0 -> X1",
"B = Y1" ))
df %>%
separate(value, into = c("variable", "rhs"), " = ") %>%
separate(rhs, into = c("v0", "v1"), " -> ", fill = "left")
#> # A tibble: 2 x 3
#> variable v0 v1
#> <chr> <chr> <chr>
#> 1 A X0 X1
#> 2 B <NA> Y1
Created on 2018-09-05 by the reprex package (v0.2.0).
Personally I prefer this slightly more verbose approach as it allows me to keep the regexes simpler too (as my regex skills certainly aren't the greatest)!
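As a minimal sketch of what fill = "left" does in the second separate() call, the snippet below compares it with fill = "right" on the right-hand sides alone. The object names (d, left, right) are only illustrative.

```r
library(tidyr)
library(tibble)

# Right-hand sides only: one with two pieces, one with a single piece
d <- tibble(rhs = c("X0 -> X1", "Y1"))

# fill = "left": a row with too few pieces gets its NA in the leftmost column
left <- separate(d, rhs, into = c("v0", "v1"), sep = " -> ", fill = "left")

# fill = "right": the NA goes in the rightmost column instead
right <- separate(d, rhs, into = c("v0", "v1"), sep = " -> ", fill = "right")

left
right
```

With fill = "left" the single value "Y1" ends up in v1, which is the layout asked for in the original question. With fill = "right" it would end up in v0.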
6 Likes
Dong
September 5, 2018, 5:18am
3
Wow! Nice and simple. Many thanks!
I did not know about fill = "left" before, which appears to be the key here.
Still wondering how regex would handle it...
1 Like
cderv
September 5, 2018, 7:24pm
4
Yes you can do that with a regex:
library(tidyverse)
df <- as.tibble(c(
"A = X0 -> X1",
"B = Y1" ))
df %>% extract(value, into = c("variable", "v0", "v1"),
regex = "(\\w+) = (\\w+)(?: -> (\\w+))?")
#> # A tibble: 2 x 3
#> variable v0 v1
#> <chr> <chr> <chr>
#> 1 A X0 X1
#> 2 B Y1 <NA>
Created on 2018-09-05 by the reprex package (v0.2.0).
The two additional tricks I used:
(?: ...) is a non-capturing group: it groups part of the pattern without producing a column in the extraction
(...)? makes a group optional
That way, (?: -> (\\w+))? matches only if there is a word after ->
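To see the effect of the non-capturing group in isolation, here is a small sketch using stringr::str_match(). That function is not used in the posts above, so it is an assumption about what is convenient for inspecting groups. It returns the full match in the first column and one column per capturing group.

```r
library(stringr)

x <- c("A = X0 -> X1", "B = Y1")

# Plain parentheses around the arrow part: the wrapper is itself a capturing
# group, so the result has 5 columns (full match + 4 groups)
with_capture <- str_match(x, "(\\w+) = (\\w+)( -> (\\w+))?")

# (?: ...) groups without capturing: 4 columns (full match + 3 groups)
without_capture <- str_match(x, "(\\w+) = (\\w+)(?: -> (\\w+))?")

ncol(with_capture)
ncol(without_capture)
without_capture
```

As far as I know, extract() expects one capturing group per name in into, so the non-capturing form is what keeps the pattern at three groups. The optional group that does not match for "B = Y1" shows up as NA.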
3 Likes
markdly
September 5, 2018, 11:34pm
5
Nice example @cderv! I adjusted your regex slightly so the optional group (?: ...)? includes the (\\w+) term before -> rather than after. This puts NA in the v0 column and Y1 in the v1 column (which I think is the desired output):
library(tidyverse)
df <- as.tibble(c(
"A = X0 -> X1",
"B = Y1" ))
df %>% extract(value, into = c("variable", "v0", "v1"),
regex = "(\\w+) = (?:(\\w+) -> )?(\\w+)")
#> # A tibble: 2 x 3
#> variable v0 v1
#> <chr> <chr> <chr>
#> 1 A X0 X1
#> 2 B <NA> Y1
Created on 2018-09-06 by the reprex package (v0.2.0).
(BTW, I'm not really familiar with using optional groups so I learnt a lot by trying to tweak your regex example. Thanks for putting it up as a solution!)
4 Likes
Dong
September 5, 2018, 11:38pm
6
Thank you both @markdly and @cderv . I certainly learned quite a bit from you.
cderv
September 6, 2018, 5:38am
7
Oh, you're right, it is the first part that is missing! Good catch!
Regex is powerful and there are several ways to achieve the same extraction. Which one to pick depends on how generic or specific the regex should be here.
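As a rough illustration of the generic versus specific trade-off, the sketch below compares a strict pattern with a more tolerant one. The third input string and both pattern names (strict, loose) are made up for the example, and stringr::str_match() is used only to inspect the groups.

```r
library(stringr)

x <- c("A = X0 -> X1", "B=Y1", "C =  Z0->Z1")

# Specific: exactly one space around "=" and "->", anchored to the whole string
strict <- "^(\\w+) = (?:(\\w+) -> )?(\\w+)$"

# More generic: any amount of whitespace (including none) around the operators
loose <- "^(\\w+)\\s*=\\s*(?:(\\w+)\\s*->\\s*)?(\\w+)$"

# Drop the first column (the full match) to keep only the three groups
strict_m <- str_match(x, strict)[, -1]
loose_m <- str_match(x, loose)[, -1]

strict_m
loose_m
```

The strict pattern only matches the first string, while the loose one matches all three. The stricter form is safer when malformed rows should surface as NA rather than be parsed silently.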
Glad I could help !
cderv
September 6, 2018, 5:38am
8
If your question's been answered, would you mind choosing a solution? It helps other people see which questions still need help, or find solutions if they have similar problems. Here’s how to do it:
If your question has been answered, don't forget to mark the solution!
How do I mark a solution?
Find the reply you want to mark as the solution and look for the row of small gray icons at the bottom of that reply. Click the one that looks like a box with a checkmark in it:
[image]
Hovering over the mark solution button shows the label, "Select if this reply solves the problem". If you don't see the mark solution button, try clicking the three dots button ( ••• ) to expand the full set of options.
When a solution is chosen, the icon turns green and the hover label changes to: "Unselect if this reply no longer solves the problem". Success!