Hello, I am working with some heavy databases and I am mostly a loop user which means that my code takes ages to run...
Here I want to check if each cases has a letter in it and if so I want to report the letter in an another variable and to remove it from the cell.
Here an example and the code :
library(tidyverse)
data <- data.frame(stringsAsFactors = FALSE,
Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5),
Sepal.Width = c("3.5 s", "3", "3.2 s", "3.1 t", "3.6")
)
data$code<-NA
for (i in 1:nrow(data)) {
if (any(str_detect(data[i,ncol(data)-1],letters))) {
data$code[i]<-str_sub(data[i,ncol(data)-1],start = -1)
data[i,ncol(data)-1]<-str_sub(data[i,ncol(data)-1],end = -3)
}else{data$code[i]<-NA}
}
How can we vectorize this loop so that the code runs instantly ?
Here's a method that seems 75x faster; you could also add in the line I commented out to have the number part be a numeric type.
EDIT : I added a third which is a more direct vectorisation of your original code; its twice as fast but not as fast as the tidyverse, split and conquer approach
You already were using str_detect from stringr/tidyverse package.
I wanted to leverage the seperate_wider_delim() function for easy splitting of well delimited variables; I realised that there were two cases, ones that were splittable and ones that were not; Rather than try to deal with differences inline , so to speak, I made the decision to try to divide the data into these two cases up front, and approach them seperately before recombining; this is just something that I learned can work from experience. In general breaking a problem into easier to solve parts is convenient to programmers if only from an understanding and maintenance basis; it turns out to not hurt the speed in this case.
Certainly! You can achieve a much faster and more efficient solution by utilizing the mutate() function from the dplyr package along with the str_extract() function from the stringr package. This approach avoids the need for loops and takes advantage of vectorized operations. Here's how you can do it:
In this code, we use mutate() to create a new column code based on the condition you specified. We use ifelse() to check if a letter exists in the Sepal.Width column. If it does, we extract the last character using str_sub() to get the letter and remove the last two characters from Sepal.Width. If not, we assign NA to the code column.
This vectorized approach will run much faster compared to the loop, especially for large datasets, and is more in line with the tidyverse style of data manipulation.