trent
February 20, 2020, 1:56am
1
Hi all.
I've got a large dataset with a field containing drug names and ATC categories, like so:
>head(unique(state$drug))
[1] "Amphotericin B (A01AB04)" "Nystatin (A07AA02)" "Clotrimazole (G01AF02)" "Doxycycline (J01AA02)" "Ampicillin (J01CA01)"
[6] "Amoxicillin (J01CA04)"
The pattern is always the same: a name (which may or may not include spaces, see "Amphotericin B" above), followed by a space, followed by seven characters in parentheses.
What I don't want:
>head(str_trunc(state$drug, 10, side = "right"))
[1] "Amphote..." "Nystati..." "Clotrim..." "Doxycyc..." "Ampicil..." "Amoxici..."
> head(str_trunc(state$drug, 13, side = "left"))
[1] "... (A01AB04)" "... (A07AA02)" "... (G01AF02)" "... (J01AA02)" "... (J01CA01)" "... (J01CA04)"
This is the inverse of what I want (without the ellipsis)
> head(str_split_fixed(state$drug, " \\(", n=2))
[,1] [,2]
[1,] "Amphotericin B" "A01AB04)"
[2,] "Nystatin" "A07AA02)"
[3,] "Clotrimazole" "G01AF02)"
[4,] "Doxycycline" "J01AA02)"
[5,] "Ampicillin" "J01CA01)"
[6,] "Amoxicillin" "J01CA04)"
This would also do, if the second string still included the opening parenthesis, or omitted the final one.
What am I missing?
Thanks in advance.
Hi @trent, please see the FAQ: What's a reproducible example (`reprex`) and how do I do one? Reprexes are very helpful.
I may have misunderstood what is wanted and what is unwanted. Assuming what is wanted is "(A01AB04)":
library(stringr)
vec <- c("Amphotericin B (A01AB04)","Nystatin (A07AA02)","Clotrimazole (G01AF02)","Doxycycline (J01AA02)","Ampicillin (J01CA01)")
pattern <- "\\(.*\\)$"
str_extract(vec,pattern)
#> [1] "(A01AB04)" "(A07AA02)" "(G01AF02)" "(J01AA02)" "(J01CA01)"
Created on 2020-02-19 by the reprex package (v0.3.0)
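If the name is wanted as well as the code, the same idea works in reverse. This is only a minimal sketch, assuming every entry ends with a parenthesised code; the `drugs` vector is sample data copied from the question, not the full dataset, and the names `drug_name` and `drug_code` are just illustrative.

```r
library(stringr)

# Sample values copied from the question (an assumption: the real data
# follows the same "name (code)" layout throughout)
drugs <- c("Amphotericin B (A01AB04)", "Nystatin (A07AA02)", "Clotrimazole (G01AF02)")

# Remove the trailing "(code)" and any whitespace before it, keeping the name
drug_name <- str_remove(drugs, "\\s*\\(.*\\)$")

# Extract the trailing code, parentheses included
drug_code <- str_extract(drugs, "\\(.*\\)$")
```

Both results are plain character vectors of the same length as the input, so they can be assigned straight back into columns of the data frame.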
Is this what you want?
library(stringr)
sample_text <- c("Amphotericin B (A01AB04)", "Nystatin (A07AA02)", "Clotrimazole (G01AF02)",
"Doxycycline (J01AA02)", "Ampicillin (J01CA01)", "Amoxicillin (J01CA04)")
str_match(sample_text, "(.+)\\s+(\\(.+\\))")
#> [,1] [,2] [,3]
#> [1,] "Amphotericin B (A01AB04)" "Amphotericin B" "(A01AB04)"
#> [2,] "Nystatin (A07AA02)" "Nystatin" "(A07AA02)"
#> [3,] "Clotrimazole (G01AF02)" "Clotrimazole" "(G01AF02)"
#> [4,] "Doxycycline (J01AA02)" "Doxycycline" "(J01AA02)"
#> [5,] "Ampicillin (J01CA01)" "Ampicillin" "(J01CA01)"
#> [6,] "Amoxicillin (J01CA04)" "Amoxicillin" "(J01CA04)"
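Since the `str_split_fixed()` attempt in the question was already close, another option might be to split on the space without consuming the parenthesis, using a lookahead. This is a sketch, assuming the space before the code is the only space followed by "(" in each entry; `parts` is just an illustrative name.

```r
library(stringr)

# Two sample values copied from the question
sample_text <- c("Amphotericin B (A01AB04)", "Nystatin (A07AA02)")

# Split on a space that is followed by "(".
# The lookahead (?=...) checks for "(" but does not consume it,
# so the parenthesis stays at the start of the second piece.
parts <- str_split_fixed(sample_text, " (?=\\()", n = 2)
```

`parts` is a character matrix with the name in column 1 and the code, both parentheses intact, in column 2.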
dromano
February 20, 2020, 4:21am
4
Here's a possible solution that uses the tidyverse package:
library(tidyverse)
vec <-
c("Amphotericin B (A01AB04)",
"Nystatin (A07AA02)",
"Clotrimazole (G01AF02)",
"Doxycycline (J01AA02)",
"Ampicillin (J01CA01)"
)
tibble(vec) %>%
separate(vec, into = c('name', 'code'), sep = " \\(" ) %>%
mutate(code = paste0('(', code))
#> # A tibble: 5 x 2
#> name code
#> <chr> <chr>
#> 1 Amphotericin B (A01AB04)
#> 2 Nystatin (A07AA02)
#> 3 Clotrimazole (G01AF02)
#> 4 Doxycycline (J01AA02)
#> 5 Ampicillin (J01CA01)
Created on 2020-02-19 by the reprex package (v0.3.0)
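A variation that avoids pasting the parenthesis back on is `tidyr::extract()`, which fills one column per capture group. This is a sketch only, assuming the code is always exactly seven characters in parentheses, as described in the question; the column name `drug` and the object names are illustrative.

```r
library(tidyr)

# Two sample values copied from the question, in a one-column tibble
drugs <- tibble::tibble(drug = c("Amphotericin B (A01AB04)", "Nystatin (A07AA02)"))

# One capture group per output column:
#   group 1 = everything before the final space
#   group 2 = seven characters wrapped in parentheses (parentheses kept)
result <- tidyr::extract(
  drugs, drug,
  into  = c("name", "code"),
  regex = "^(.+) (\\(.{7}\\))$"
)
```

Entries that do not match the pattern come back as `NA` in both new columns, which makes it easy to spot rows that break the assumed layout.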