The problem is that not every row in the data frame column contains the matching pattern, so the returned vector is shorter than the data frame itself, which results in an error:
Error in `$<-.data.frame`(`*tmp*`, Prep, value = c(4L, 22L, 41L, 67L, :
  replacement has 685 rows, data has 700
Is there a way to avoid this? Is there a way to return an empty string or NA when the searched string doesn't contain the matching words?
Thank you
There's a tidy way to do this. I made a toy example to illustrate:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(stringr)
library(tibble)
pattern <- "the preparation was"
phrases <- c("the preparation was", "the outcome will be")
my.df <- enframe(phrases)
colnames(my.df) <- c("index", "phrase")
my.df <- my.df %>% select(-index)
my.df <- my.df %>% filter(phrase == pattern)
my.df
#> # A tibble: 1 x 1
#> phrase
#> <chr>
#> 1 the preparation was
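The filter() call above drops the rows that don't match. If the goal is instead to keep every row and mark the non-matching ones with NA, a minimal sketch along the same lines could look like this. The column name prep is made up for illustration, and fixed() is used on the assumption that the pattern is a literal string rather than a regex.

```r
library(dplyr)
library(stringr)
library(tibble)

pattern <- "the preparation was"
phrases <- c("the preparation was", "the outcome will be")

# Keep every row; rows without the pattern get NA instead of being dropped.
# `prep` is a hypothetical column name used only for this sketch.
my.df <- tibble(phrase = phrases) %>%
  mutate(prep = if_else(str_detect(phrase, fixed(pattern)), phrase, NA_character_))

my.df
```

Because mutate() works row by row on the existing table, the new column always has the same number of rows as the data, so the length mismatch cannot occur.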
In Python, the way I would do this is the following (pseudocode):
try:
    # the function above, i.e. something similar to
    # Data$Prep <- grep("the preparation was", unlist(strsplit(Data$REPORT_TEXT, '(?<=\\.)\\s+', perl=TRUE)), value=TRUE, ignore.case = TRUE)
except:
    # something to return NA in case the above function returns an error because there is no matching text
If "the preparation was" was just an example rather than a literal string, note that stringr supports regex. I'm not sure I follow your Python example, because your original example did include the search string, and the problem you were trying to solve was to exclude the records without it. As @andresrcs suggests, a reproducible example, called a reprex, would be a great help.
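In the meantime, here is a sketch of the regex route, assuming the aim is one value per row of Data$REPORT_TEXT, with NA where nothing matches. The data below is invented, since the real reports were not posted, and the pattern is a guess at "the sentence containing the phrase". No try/except equivalent is needed, because str_extract() returns NA for any element without a match.

```r
library(stringr)

# Hypothetical stand-in for the real Data$REPORT_TEXT column
Data <- data.frame(
  REPORT_TEXT = c(
    "Patient arrived. The preparation was good. No issues.",
    "The outcome will be reviewed.",
    "Notes follow. the preparation was poor."
  ),
  stringsAsFactors = FALSE
)

# Assumed pattern: everything between full stops around the phrase,
# matched case-insensitively
pattern <- regex("[^.]*the preparation was[^.]*\\.?", ignore_case = TRUE)

# str_extract() returns one result per input element (NA when there is
# no match), so the new column has the same number of rows as Data
Data$Prep <- str_trim(str_extract(Data$REPORT_TEXT, pattern))

Data$Prep
```

The original grep(..., value = TRUE) call fails for two reasons: it returns only the matching elements, and unlist(strsplit(...)) has already broken the link between sentences and rows. Working on the column directly keeps the row alignment.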