I have written the following code, in which I calculate and store the r-squared of 18 random forest models:
library(pacman)
pacman::p_load(terra, atakrig, parallel, doParallel, tools, fs, dplyr, rfUtilities, VSURF, data.table, tidymodels, foreach, ranger, randomForest)
wd <- "path/"
mwd <- "path/la/"
vectList <- list.files(path = mwd, pattern = "la_small_3309.shp$", all.files = TRUE,
full.names = TRUE)
v <- terra::vect(vectList)
plot(v)
provoliko <- crs(v)
eq1 <- ntl ~ .
########################################## big folder ##########################################
# Load the data
df_big <- read.csv(paste0(wd, "block.data.psf.csv"))
# sint <- subset(block.data, select = c(x, y))
df_big <- df_big[, 3:ncol(df_big)]
########################################## small folder ##########################################
###################### read the csv containing the coarse res data ######################
df_small <- fread("path/block.data.psf.csv")
sint <- subset(df_small, select = c(x, y))
df_small <- df_small[, 3:ncol(df_small)]
# for reproduciblity
set.seed(123)
r2.df <- NULL
# loop over the column-name suffixes 030, 040, ..., 200 (%do% runs sequentially)
foreach (i = seq(30, 200, by = 10)) %do% {
std <- sprintf("%03.0f", i)
print(paste("Running for", std))
column_names_for <- names(df_big)[grepl(std, names(df_big))]
testVect = c("ntl",
column_names_for)
subBlockData <- subset(df_big, select = testVect)
set.seed(234)
ames_split <- initial_split(subBlockData, prop = .8, strata = "ntl")
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
# for reproduciblity
set.seed(345)
features <- setdiff(names(ames_train), "ntl")
m1 <- ranger(
formula = eq1,
data = ames_train,
keep.inbag = TRUE,
write.forest = TRUE,
num.threads = 15,
num.trees = 2501
)
num_trees <- m1$num.trees
mse <- numeric(num_trees)
# scan every tree count for the lowest training MSE; note that this
# re-predicts the full training set once per tree count
for (j in seq_len(num_trees)) {
pred <- predict(m1,
data = ames_train,
num.trees = j)$predictions
mse[j] <- mean((pred - ames_train$ntl)^2)
}
btree <- which.min(mse)
# keep btree odd; bump even values by one
if ((btree %% 2) == 0) {
btree <- btree + 1
print(paste("The new btree is", btree))
} else {
print(paste(btree, "is odd"))
}
ames_ranger <- ranger(
formula = eq1,
data = ames_train,
num.trees = btree,
mtry = floor(length(features) / 3),
num.threads = 15
)
# subset the small csv to the ntl column plus the columns for this std
df_small_for <- df_small[, ..testVect]
p <- predict(ames_ranger, df_small_for, type = "response")
r_squared <- 1 - sum((df_small_for$ntl - p$predictions)^2) / sum((df_small_for$ntl - mean(df_small_for$ntl))^2)
r2.df <- rbind(r2.df, data.frame(std = i/100, r2 = r_squared))
}
write.csv(r2.df, "path/r2.csv", row.names = FALSE)
The issue is that every iteration takes approximately 5 minutes, which is very time-consuming, and I am wondering why that might be. When I run each iteration manually (in a separate script) it takes far less time. One thought is that it is not very efficient to have a for loop inside the foreach (that is just a guess). I was wondering if there is an alternative/better way to parallelize/speed up the above code.
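For the parallelization part, something along these lines is what I had in mind, although I haven't tested it. fit_and_score() is a hypothetical wrapper around the body of one iteration above (subset the columns, split, fit the two models, predict on the small csv and return the r-squared); the worker count of 4 is an arbitrary choice, and since ranger is itself multithreaded I assume num.threads inside the workers would have to be reduced to avoid oversubscription:

library(foreach)
library(doParallel)

cl <- makeCluster(4)  # arbitrary worker count
registerDoParallel(cl)
# .combine = rbind collects the one-row data frames, so no rbind() inside the loop
r2.df <- foreach(s = seq(30, 200, by = 10), .combine = rbind,
                 .packages = c("ranger", "tidymodels", "data.table")) %dopar% {
  std <- sprintf("%03.0f", s)
  # fit_and_score() is a hypothetical helper, not defined in my script
  data.frame(std = s / 100, r2 = fit_and_score(std))
}
stopCluster(cl)

Is something like that the right direction, or is there a better pattern?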
A little bit of info about the code and what it does:

- Initially, I read a csv from a folder called big_folder, which contains 291 columns and 6714 rows.
- I read a csv from a folder called small_folder, which contains the same number of columns but fewer rows.
- Then I create an RF model by taking the column called ntl and all the columns that contain the string 030 in their names. I build an initial model with 2501 trees and then (here comes the for loop) I select the number of trees with the lowest mse. Using that number (which I call btree) I fine-tune another model (see the sketch after this list for what I think this scan could be replaced with).
- Finally, using the fine-tuned model, I make predictions using the csv from the small_folder and I store the r-squared.
- The process repeats for the columns containing the string 040, 050, ..., 200 in their names.
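Regarding the tree-selection scan (the inner for loop): I suspect the 2501 separate predict() calls could be collapsed into a single one using ranger's predict.all = TRUE, which returns one prediction per tree, so cumulative means give the ensemble prediction at every tree count. A sketch of what I mean, assuming a regression forest and reusing the names from my code (untested; it also holds an n_obs x num_trees matrix in memory):

all_pred <- predict(m1, data = ames_train, predict.all = TRUE)$predictions
# all_pred is n_obs x num_trees; cumulative means over trees give the
# ensemble prediction after 1, 2, ..., num_trees trees
cum_mean <- apply(all_pred, 1, cumsum) / seq_len(m1$num.trees)  # num_trees x n_obs
mse <- rowMeans(sweep(cum_mean, 2, ames_train$ntl)^2)
btree <- which.min(mse)

Would that give the same btree as the loop above?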
R 4.3.2, RStudio 2023.12.1 Build 402, Windows 11.