Setting seed when using future

DBScan · October 23, 2024, 9:08am

Hi, I have a basic question about seeds when using parallelization. Suppose I would like to create a new column named SUM based on the numeric columns of the iris dataset:

library(tidyverse)
library(future)
library(furrr)

sum_function <- function(a, b, c, d){
  return (a + b + c + d + rnorm(1))
}

plan("multisession", workers = 1)

set.seed(42, kind = "L'Ecuyer-CMRG")

new_iris <- iris %>% 
  mutate(SUM = pmap(.l = list(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width),
                    .f = sum_function))

future_iris <- iris %>% 
  mutate(SUM = future_pmap(.l = list(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width),
                    .f = sum_function, 
                    .options = furrr_options(seed = 42)))

Unfortunately, the two data frames differ in their values in the SUM column. How can I fix this?

eric-hunt · October 31, 2024, 12:11am

I think you'll need to set the seed within your sum function.

Edit:
Actually, @DBScan, I apologize: that is hacky advice for making parallel processing behave like purrr's sequential processing. I think furrr uses a different random number generation scheme when working in parallel, so you won't ever see purrr and furrr produce the same random draws. In your case, passing furrr_options(seed = 42) ensures that the random numbers generated within your furrr call are the same whether you use plan(sequential) or plan(multisession).
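To illustrate the point about plan(sequential) versus plan(multisession), here is a minimal sketch. The helper add_noise and the input xs are made up for illustration and are not from the original question; the worker count of 2 is also an arbitrary choice.

```r
library(future)
library(furrr)

# Hypothetical helper for illustration: adds one random draw to its input.
add_noise <- function(x) x + rnorm(1)

xs <- 1:5

# Run once on the sequential backend with a fixed furrr seed.
plan(sequential)
res_seq <- future_map_dbl(xs, add_noise, .options = furrr_options(seed = 42))

# Run again on background R sessions with the same furrr seed.
plan(multisession, workers = 2)
res_par <- future_map_dbl(xs, add_noise, .options = furrr_options(seed = 42))

# Switch back to sequential so the background workers are shut down.
plan(sequential)

# Compare the two result vectors.
identical(res_seq, res_par)
```

If the seed handling works as described above, the comparison on the last line should come out the same way no matter which plan is active, which is the kind of reproducibility the seed option is meant for.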

DBScan · November 6, 2024, 2:12pm

Indeed, this approach is working. But what's the point of setting seeds via furrr_options then?

nirgrahamuk · November 7, 2024, 1:21pm

The point is for reproducibility. So you can run the same furrr code again and get the same result.
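As a rough sketch of what that means for the example in the question: the code below reuses sum_function from the original post, but calls future_pmap_dbl directly on the iris columns instead of going through mutate, and uses 2 workers. Both of those are choices made for this illustration only.

```r
library(future)
library(furrr)

# Same function as in the original question.
sum_function <- function(a, b, c, d) {
  a + b + c + d + rnorm(1)
}

# The four numeric iris columns, passed positionally to sum_function.
cols <- list(iris$Sepal.Length, iris$Sepal.Width,
             iris$Petal.Length, iris$Petal.Width)

plan(multisession, workers = 2)

# Two separate runs of the same furrr code with the same seed.
run1 <- future_pmap_dbl(cols, sum_function, .options = furrr_options(seed = 42))
run2 <- future_pmap_dbl(cols, sum_function, .options = furrr_options(seed = 42))

# Shut down the background workers.
plan(sequential)

# Check whether the two runs produced the same values.
identical(run1, run2)
```

The comparison is between two furrr runs, not between furrr and purrr; the seed option is not expected to make future_pmap match a pmap call that follows set.seed().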
