I could use help with cleaning some dirty string data. I want to extract the first ID value). The first 9 rows are precisely what I need. However, rows 10-21 are the problem.
I've tried using functions like str_extract and str_remove from the stringr package, but I can't figure out a pattern to remove the unwanted strings.
Could someone help me with the Stringr and/or Regex formula that will help me achieve this?
Thanks in advance,
James
Reprex
structure(list(con_number = c("CON16552", "CON15607", "CON15607",
"CON014592", "CON012146", "CON014085", "SP00012088", "SP00012088",
"SP00012088", "CON016107/CON017440", "CON016107/CON017440", "CON016107/CON017440",
"CON015304 (primary CON#)", "CON014838 (previous CON)", "CON015304 (primary CON#)",
"CON012407 ( Amendment: CON017074)",
"CON012407 ( Amendment: CON017074)",
"CON012407 ( Amendment: CON017074)",
"CON015103 [CON012429 (CON number for this award - this is a supplement)]",
"CON015103 [CON012429 (CON number for this award - this is a supplement)]",
"CON015103 [CON012429 (CON number for this award - this is a supplement)]"
)), row.names = c(NA, -21L), class = c("tbl_df", "tbl", "data.frame"
), na.action = structure(22:24, names = c("22", "23", "24"), class = "omit"))
# more convenient to work with vectors in dealing
# with a single column data frame
v <- c(
"CON16552", "CON15607", "CON15607",
"CON014592", "CON012146", "CON014085", "SP00012088", "SP00012088",
"SP00012088", "CON016107/CON017440", "CON016107/CON017440", "CON016107/CON017440",
"CON015304 (primary CON#)", "CON014838 (previous CON)", "CON015304 (primary CON#)",
"CON012407 ( Amendment: CON017074)",
"CON012407 ( Amendment: CON017074)",
"CON012407 ( Amendment: CON017074)",
"CON015103 [CON012429 (CON number for this award - this is a supplement)]",
"CON015103 [CON012429 (CON number for this award - this is a supplement)]",
"CON015103 [CON012429 (CON number for this award - this is a supplement)]"
)
# match everything from the first blank through
# the end of the string
omit <- " .*$"
# substitute the target string with nothing
# create a new data frame with the column
# name con_number
d <- data.frame(con_number = gsub(omit,"",v))
d
#> con_number
#> 1 CON16552
#> 2 CON15607
#> 3 CON15607
#> 4 CON014592
#> 5 CON012146
#> 6 CON014085
#> 7 SP00012088
#> 8 SP00012088
#> 9 SP00012088
#> 10 CON016107/CON017440
#> 11 CON016107/CON017440
#> 12 CON016107/CON017440
#> 13 CON015304
#> 14 CON014838
#> 15 CON015304
#> 16 CON012407
#> 17 CON012407
#> 18 CON012407
#> 19 CON015103
#> 20 CON015103
#> 21 CON015103