Hello,
I've been struggling to extract the name of the body of water from a column of station names in a data frame. Here's an example data set.
library(dplyr)    # provides mutate() and %>%
library(stringr)  # provides str_extract()
df <- data.frame(Station = c("Santa Cruz Creek at Doe Ave",
"Mendocino Creek below Oroville Reservoir",
"Banos Stream along Foothill Drive",
"San Diego Creek by San Diego Creek Trail", "San Mateo Creek at Creekside",
"Los Angeles River below Santa Clara River"))
I've been using this piece of code to extract the body of water.
df %>% mutate(waterbody = str_extract(Station, "[\\w\\s]+(Creek|Stream|River)"))
Most of the time, it works pretty well in extracting the body of water's name and designation, but unfortunately, when there's more than one instance of creek/stream/river in the station name, it has a tendency to capture too much.
With the example data, it works well for the first three stations, but for the last three it captures everything up to and including the last Creek/Stream/River (for "San Mateo Creek at Creekside", even the "Creek" inside "Creekside"). I've been struggling to find a solution and was wondering how I could fix my str_extract call so that it stops at the first Creek/Stream/River.
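For reference, here is a minimal sketch of one direction that might work. It assumes the body of water always comes first in the station name, which holds for my example data but may not hold in general. The idea is that `+` is greedy and runs to the last designation, while a lazy `+?` should stop at the first one. The `^` anchor and the `\\b` word boundaries are my own additions to keep "Creekside" from matching, and `result` is just an illustrative name.

```r
library(dplyr)
library(stringr)

df <- data.frame(Station = c("Santa Cruz Creek at Doe Ave",
                             "Mendocino Creek below Oroville Reservoir",
                             "Banos Stream along Foothill Drive",
                             "San Diego Creek by San Diego Creek Trail",
                             "San Mateo Creek at Creekside",
                             "Los Angeles River below Santa Clara River"))

# Assumption: the water body name starts the string.
# ^        anchor at the start of the station name
# +?       lazy quantifier: take as few characters as possible
# \\b...\\b  match Creek/Stream/River only as whole words
result <- df %>%
  mutate(waterbody = str_extract(Station,
                                 "^[\\w\\s]+?\\b(Creek|Stream|River)\\b"))

print(result$waterbody)
```

If a station name ever starts with something other than the water body (for example a site code), the `^` anchor would need to be dropped or adjusted.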