I have written the following code, in which I calculate and store the r-squared of 18 random forest models:
library(pacman)
pacman::p_load(terra, atakrig, parallel, doParallel, tools, fs, dplyr, rfUtilities, VSURF, data.table, tidymodels, foreach, ranger, randomForest)
wd <- "path/"
mwd <- "path/la/"
vectList <- list.files(path = mwd, pattern = "la_small_3309.shp$", all.files = TRUE,
full.names = TRUE)
v <- terra::vect(vectList)
plot(v)
provoliko <- crs(v)
eq1 <- ntl ~ .
########################################## big folder ##########################################
# Load the data
df_big <- read.csv(paste0(wd, "block.data.psf.csv"))
# sint <- subset(block.data, select = c(x, y))
df_big <- df_big[, 3:ncol(df_big)]
########################################## small folder ##########################################
###################### read the csv containing the coarse res data ######################
df_small <- fread("path/block.data.psf.csv")
sint <- subset(df_small, select = c(x, y))
df_small <- df_small[, 3:ncol(df_small)]
# for reproduciblity
set.seed(123)
r2.df <- NULL
# loop over the column-name suffixes 030, 040, ..., 200 (%do% runs sequentially)
foreach (i = seq(30, 200, by = 10)) %do% {
std <- sprintf("%03.0f", i)
print(paste("Running for", std))
column_names_for <- names(df_big)[grepl(std, names(df_big))]
testVect = c("ntl",
column_names_for)
subBlockData <- subset(df_big, select = testVect)
set.seed(234)
ames_split <- initial_split(subBlockData, prop = .8, strata = "ntl")
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
# for reproduciblity
set.seed(345)
features <- setdiff(names(ames_train), "ntl")
m1 <- ranger(
formula = eq1,
data = ames_train,
keep.inbag = TRUE,
write.forest = TRUE,
num.threads = 15,
num.trees = 2501
)
num_trees <- m1$num.trees
mse <- numeric(num_trees)
# scan every tree count for the lowest training MSE; note that this
# re-predicts the full training set once per tree count
for (j in seq_len(num_trees)) {
pred <- predict(m1,
data = ames_train,
num.trees = j)$predictions
mse[j] <- mean((pred - ames_train$ntl)^2)
}
btree <- which.min(mse)
# keep btree odd; bump even values by one
if ((btree %% 2) == 0) {
btree <- btree + 1
print(paste("The new btree is", btree))
} else {
print(paste(btree, "is odd"))
}
ames_ranger <- ranger(
formula = eq1,
data = ames_train,
num.trees = btree,
mtry = floor(length(features) / 3),
num.threads = 15
)
# subset the small csv to the ntl column plus the columns for this std
df_small_for <- df_small[, ..testVect]
p <- predict(ames_ranger, df_small_for, type = "response")
r_squared <- 1 - sum((df_small_for$ntl - p$predictions)^2) / sum((df_small_for$ntl - mean(df_small_for$ntl))^2)
r2.df <- rbind(r2.df, data.frame(std = i/100, r2 = r_squared))
}
write.csv(r2.df, "path/r2.csv", row.names = FALSE)
The issue is that every iteration takes approximately 5 minutes, which is very time-consuming, and I am wondering why that might be. When I run each iteration manually (in a separate script) it takes far less time. One thought is that it is not very efficient to have a for loop inside the foreach (that is just a guess). I was wondering if there is an alternative/better way to parallelize/speed up the above code.
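For the parallelization part, something along these lines is what I had in mind, although I haven't tested it. fit_and_score() is a hypothetical wrapper around the body of one iteration above (subset the columns, split, fit the two models, predict on the small csv and return the r-squared); the worker count of 4 is an arbitrary choice, and since ranger is itself multithreaded I assume num.threads inside the workers would have to be reduced to avoid oversubscription:

library(foreach)
library(doParallel)

cl <- makeCluster(4)  # arbitrary worker count
registerDoParallel(cl)
# .combine = rbind collects the one-row data frames, so no rbind() inside the loop
r2.df <- foreach(s = seq(30, 200, by = 10), .combine = rbind,
                 .packages = c("ranger", "tidymodels", "data.table")) %dopar% {
  std <- sprintf("%03.0f", s)
  # fit_and_score() is a hypothetical helper, not defined in my script
  data.frame(std = s / 100, r2 = fit_and_score(std))
}
stopCluster(cl)

Is something like that the right direction, or is there a better pattern?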
A little bit of info about the code and what it does:

- Initially, I read a csv from a folder called big_folder, which contains 291 columns and 6714 rows.
- I read a csv from a folder called small_folder, which contains the same number of columns but fewer rows.
- Then I create an RF model by taking the column called ntl and all the columns that contain the string 030 in their names. I build an initial model with 2501 trees and then (here comes the for loop) I select the number of trees with the lowest mse. Using that number (which I call btree) I fine-tune another model (see the sketch after this list for what I think this scan could be replaced with).
- Finally, using the fine-tuned model, I make predictions using the csv from the small_folder and I store the r-squared.
- The process repeats for the columns containing the string 040, 050, ..., 200 in their names.
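Regarding the tree-selection scan (the inner for loop): I suspect the 2501 separate predict() calls could be collapsed into a single one using ranger's predict.all = TRUE, which returns one prediction per tree, so cumulative means give the ensemble prediction at every tree count. A sketch of what I mean, assuming a regression forest and reusing the names from my code (untested; it also holds an n_obs x num_trees matrix in memory):

all_pred <- predict(m1, data = ames_train, predict.all = TRUE)$predictions
# all_pred is n_obs x num_trees; cumulative means over trees give the
# ensemble prediction after 1, 2, ..., num_trees trees
cum_mean <- apply(all_pred, 1, cumsum) / seq_len(m1$num.trees)  # num_trees x n_obs
mse <- rowMeans(sweep(cum_mean, 2, ames_train$ntl)^2)
btree <- which.min(mse)

Would that give the same btree as the loop above?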
R 4.3.2, RStudio 2023.12.1 Build 402, Windows 11.