purrr detect function()

WNA · October 12, 2022, 2:06am

The detect() function seems to be unduly slow. I wonder if I am doing something wrong, or if this slowness is simply part of the function.
In my intended application I am trying to find the first nonzero element in each (row, column) cross section of a (600, 600, 6) array. This computation will be repeated at least 1000 times in a simulation, and more likely 10,000 times. The detecting the first non-zero element is in the innermost loop, so the timing really matters.

In the reprex below I repeat just 10 times, using the detect() function two different ways, using the first() function, and using a handwritten function to do the same thing. The timing difference is dramatic. (3 minutes vs. 7 seconds) Unless I learn to more effectively use detect(), I will use the fourth method.

A couple notes:

The real application does a couple things with the output matrix, but those are pretty fast, and irrelevant here.
The data structure of a 3-dimensional array is natural in the intended application.
I am fully prepared to be told that I am doing something stupid. My prior, based on over 60 years of programming experience, is that stupidity is highly likely.
The computations were performed in R version 4.2.1, and used tidyverse 1.3.2 and purrr version 0.3.4.

# File R code timing minimal
# Bill Anderson -- October 2022

require(tidyverse)
#> Loading required package: tidyverse

########################### Functions to return the first non-zero entry in a numeric vector

# use the detect() function with an an anonymous function
FirstinCrossSection_alta <- function(vecotr){
  vecotr %>% detect(~.x != 0, .default = 0)
} 

# use the detect() function with a named function 
nonzero <- function(x){
  x != 0
}
FirstinCrossSection_altb <- function(vecotr){
  vecotr %>% detect(nonzero, .default = 0)
}

# use  the first() function
FirstinCrossSection_altc <- function(vecotr){
  first(vecotr[vecotr != 0], default = 0)
}

# use a hand-written loop
FirstinCrossSection_altd <- function(vecotr){
  for (i in 1:length(vecotr)){
    if (vecotr[i] != 0) return(vecotr[i])
  }
  return(0)
}

############################ Generate the data
# In a real application these numbers would come from trial data
set.seed(10000)
nsubjects <- 600 # might be up to a few thousand in a real application
ncomponents <- 6  # about as large as this ever would be
nreps <- 10 # make this large enough to get randomness out of the timing
outcomevector <- sample(c(-1, 0, 1), ncomponents*nsubjects^2, prob = c(1, 1, 1), replace = TRUE)
winarray <- array(outcomevector, dim <- c(nsubjects, nsubjects, ncomponents))

########################## Evaluate the various algorithms

# use the detect() function with an an anonymous function
starttime <- Sys.time()
for (i in 1:nreps){
  winmatrixa <- apply(winarray, c(1, 2), FirstinCrossSection_alta)
}
(timea <- Sys.time() - starttime)
#> Time difference of 3.405451 mins

# use the detect() function with a named function 
starttime <- Sys.time()
for (i in 1:nreps){
  winmatrixb <- apply(winarray, c(1, 2), FirstinCrossSection_altb)
}
(timeb <- Sys.time() - starttime)
#> Time difference of 2.566472 mins

# use the first() function
starttime <- Sys.time()
for (i in 1:nreps){
  winmatrixc <- apply(winarray, c(1, 2), FirstinCrossSection_altc)
}
(timec <- Sys.time() - starttime)
#> Time difference of 12.53203 secs

# use a hand-written loop
starttime <- Sys.time()
for (i in 1:nreps){
  winmatrixd <- apply(winarray, c(1, 2), FirstinCrossSection_altd)
}
(timed <- Sys.time() - starttime)
#> Time difference of 7.226184 secs

^{Created on 2022-10-11 with reprex v2.0.2}

nirgrahamuk · October 12, 2022, 10:46am

detect is more general purpose than the other functions, justifying its slower speed.
for example if list(0,"xyz") was the input, the method based on detect would succeed and return "xyz" while the others would fail.
Since your intention is to process Matrix objects, with integer contents, then it makes sense that a more tailored (less general) function would be fast.
I think the following would be competitive with the handwritten loop.

FirstinCrossSection_alte <- function(x){
  xx <- head(which(x!=0),n=1)
  if(identical(integer(0),xx)) return(0)
  return(x[xx])
}

as an additional sidenote; I prefer to use library(bench) for similar benchmarking of this kind; it dispenses with handwritten system time math, repeats timing experiments multiple times for statistical benefit, and by default checks that the functions being compared give equivalent results ( you dont want to optimise at the cost of incorrect outputs)

WNA · October 17, 2022, 3:30am

Nir,

Thanks for the helpful comment. I simply had not realized how much overhead was involved with the detect() function. Your suggestion is still slower than the FirstinCrossSection_altd(); I guess because of the overhead of first() and which().

I am a true tidyverse believer, and I really liked the clean interface provided by detect(). But for this particular simulation I will stick with the fastest code, which still appears to be the function FirstinCrossSection_altd() given above.

system · November 7, 2022, 3:30am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.