vectorizing loops in R

remsssteack · August 16, 2023, 9:08am

Hello, I am working with some heavy databases and I am mostly a loop user which means that my code takes ages to run...
Here I want to check if each cases has a letter in it and if so I want to report the letter in an another variable and to remove it from the cell.

Here an example and the code :

library(tidyverse)

data <- data.frame(stringsAsFactors = FALSE,
                 Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5),
                 Sepal.Width = c("3.5 s", "3", "3.2 s", "3.1 t", "3.6")
)

data$code<-NA

for (i in 1:nrow(data)) {
  if (any(str_detect(data[i,ncol(data)-1],letters))) {
    data$code[i]<-str_sub(data[i,ncol(data)-1],start = -1)
    data[i,ncol(data)-1]<-str_sub(data[i,ncol(data)-1],end = -3)
  }else{data$code[i]<-NA}
}

How can we vectorize this loop so that the code runs instantly ?

Thank you,

nirgrahamuk · August 16, 2023, 9:57am

Here's a method that seems 75x faster; you could also add in the line I commented out to have the number part be a numeric type.
EDIT : I added a third which is a more direct vectorisation of your original code; its twice as fast but not as fast as the tidyverse, split and conquer approach

library(tidyverse)

datasmall <- data.frame(stringsAsFactors = FALSE,
                   Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5),
                   Sepal.Width = c("3.5 s", "3", "3.2 s", "3.1 t", "3.6")
)

data <- purrr::map_dfr(1:1000,\(x)datasmall)

first_f <- function(data){
data$code<-NA
for (i in 1:nrow(data)) {
  if (any(str_detect(data[i,ncol(data)-1],letters))) {
    data$code[i]<-str_sub(data[i,ncol(data)-1],start = -1)
    data[i,ncol(data)-1]<-str_sub(data[i,ncol(data)-1],end = -3)
  }else{data$code[i]<-NA}
}
data 
}




second_f <- function(data){
  
  grpd <- data |> 
    mutate(rn=row_number()) |> 
    split(~str_detect(Sepal.Width,"[a-z]"))
  grpd$`FALSE` <- grpd$`FALSE` |> mutate(code = NA_character_)
  grpd$`TRUE` <- grpd$`TRUE` |> separate_wider_delim(cols=Sepal.Width,
                                                     delim=" ",
                                                     names=c("Sepal.Width","code"))
  bind_rows(grpd) |> 
    # mutate(Sepal.Width=readr::parse_number(Sepal.Width)) |>
    arrange(rn) |> select(-rn)
  
}

third_f <- function(data){
  data$code<-NA
  testvec <- sapply(data[,ncol(data)-1],\(x)any(str_detect(x,letters)))
  data$code[testvec] <-str_sub(data[testvec,ncol(data)-1],start = -1)
  data[testvec,ncol(data)-1]<-str_sub(data[testvec,ncol(data)-1],end = -3)
  data
}

library(bench)
mark(
  first_f(data),
second_f(data),
third_f(data)
)

# A tibble: 3 × 13
  expression          min   median `itr/sec` mem_alloc 
  <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt> 
1 first_f(data)   875.4ms  875.4ms      1.14  231.98MB 
2 second_f(data)   11.7ms   12.2ms     80.2     1.17MB 
3 third_f(data)     378ms  382.6ms      2.61    1.78MB

remsssteack · August 16, 2023, 10:36am

Thank you so much, I knew it was slow but not THAT slow....

Further do you have a resource to learn about the so called tidyverse, split and conquer approach ?

Thank you so much again!

nirgrahamuk · August 16, 2023, 2:02pm

I learnt tidyverse from this resource :
r4ds link

You already were using str_detect from stringr/tidyverse package.
I wanted to leverage the seperate_wider_delim() function for easy splitting of well delimited variables; I realised that there were two cases, ones that were splittable and ones that were not; Rather than try to deal with differences inline , so to speak, I made the decision to try to divide the data into these two cases up front, and approach them seperately before recombining; this is just something that I learned can work from experience. In general breaking a problem into easier to solve parts is convenient to programmers if only from an understanding and maintenance basis; it turns out to not hurt the speed in this case.

Asif478360 · August 16, 2023, 2:05pm

remsssteack:

Hello, I am working with some heavy databases and I am mostly a loop user which means that my code takes ages to run...
Here I want to check if each cases has a letter in it and if so I want to report the letter in an another variable and to remove it from the cell.

Here an example and the code :
library(tidyverse)

data <- data.frame(stringsAsFactors = FALSE,
                 Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5),
                 Sepal.Width = c("3.5 s", "3", "3.2 s", "3.1 t", "3.6")
)

data$code<-NA

for (i in 1:nrow(data)) {
  if (any(str_detect(data[i,ncol(data)-1],letters))) {
    data$code[i]<-str_sub(data[i,ncol(data)-1],start = -1)
    data[i,ncol(data)-1]<-str_sub(data[i,ncol(data)-1],end = -3)
  }else{data$code[i]<-NA}
}
How can we vectorize this loop so that the code runs instantly ?

Certainly! You can achieve a much faster and more efficient solution by utilizing the mutate() function from the dplyr package along with the str_extract() function from the stringr package. This approach avoids the need for loops and takes advantage of vectorized operations. Here's how you can do it:

rCopy code

library(dplyr)
library(stringr)

data <- data.frame(stringsAsFactors = FALSE,
                   Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5),
                   Sepal.Width = c("3.5 s", "3", "3.2 s", "3.1 t", "3.6")
)

data <- data %>%
  mutate(code = ifelse(str_detect(Sepal.Width, letters),
                       str_sub(Sepal.Width, start = -1),
                       NA_character_),
         Sepal.Width = ifelse(str_detect(Sepal.Width, letters),
                              str_sub(Sepal.Width, end = -3),
                              Sepal.Width))

In this code, we use mutate() to create a new column code based on the condition you specified. We use ifelse() to check if a letter exists in the Sepal.Width column. If it does, we extract the last character using str_sub() to get the letter and remove the last two characters from Sepal.Width. If not, we assign NA to the code column.

This vectorized approach will run much faster compared to the loop, especially for large datasets, and is more in line with the tidyverse style of data manipulation.

system · August 23, 2023, 2:06pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.