num_workers in torch's dataloader does not seem to parallelize batch loading

Hi there,

I want to train CNNs on a big dataset via transfer learning using torch in R. Since my dataset is too big to be loaded all at once, I have to load each sample from the SSD in the dataloader. But loading one batch from my SSD takes about 20x as long as processing it (forward pass, backprop, optimizer step). Therefore asynchronous parallel data loading would be advisable.

As far as I understand torch, this can be done in the dataloader via the `num_workers` parameter. But using it did not decrease the loading time of a batch in the training loop; it only introduced a big overhead before the first batch is returned (probably because that is when the workers are created).

Example:

library(torchvision)
library(torch)

dl<-torchvision::image_folder_dataset(
  root="./data/processed/satalite_images/to_use",
  loader=function(path){
    # I have images of size 299x299 with 13 channels.
    # Optimizing this loading step yielded no significant improvement.
    return(array(readRDS(path), dim=c(13,299,299))*1.0)
  },
  target_transform = function(x){a<-c(0.0,1.0)[x];dim(a)<-1;return(a)}
)
# Here I set num_workers to different values, but that did not change the loading time
dl2<-torch::dataloader(dl, batch_size=110L, shuffle = TRUE, num_workers = 15L)
# just a random pretrained model for transfer learning
model_torch = torchvision::model_alexnet(pretrained = TRUE)
model_torch$parameters |>
  purrr::walk(function(param) param$requires_grad_(FALSE))

# replacing the last layer to my desired classifier

inFeat =model_torch$classifier$'6'$in_features
model_torch$classifier$'6' = nn_linear(inFeat, out_features = 1L)

# I have 13 input channels, so I replace the first conv layer with an equivalent one that takes 13 input channels
conv1<-torch::nn_conv2d(in_channels=13L, out_channels=model_torch[[1]]$`0`$out_channels,
                        kernel_size = model_torch[[1]]$`0`$kernel_size,
                        stride = model_torch[[1]]$`0`$stride,
                        padding = model_torch[[1]]$`0`$padding,
                        dilation = model_torch[[1]]$`0`$dilation,
                        groups = model_torch[[1]]$`0`$groups, bias = TRUE)
model_torch[[1]]$`0`<-conv1

model_torch<-model_torch$to(device = "cuda")
opt = optim_adam(params = model_torch$parameters, lr = 0.01)

# training loop
for(e in 1:1){
  losses = c()
  # track how much time the loop spends on computing vs. data loading
  end<-Sys.time()
  coro::loop(
    for(batch in dl2){
      start<-Sys.time()
      #this is the time it takes to load a batch
      print(start-end)
      print("computing")
      opt$zero_grad()
      pred = model_torch(batch[[1]]$to(device="cuda"))
      res=batch[[2]]$to(device = "cuda")
      loss = nnf_binary_cross_entropy(input=torch_sigmoid(pred),target=res)
      loss$backward()
      opt$step()
      losses = c(losses, loss$item())
      end<-Sys.time()
      #this is the time it takes to process a batch
      print(end-start)
      print("loading")
    }
  )
}

To my understanding, the time it takes to load a batch should (after the first few batches) decrease significantly if I use parallel batch loading through `num_workers`, compared to `num_workers = 0`.

But the printed time stays roughly the same no matter the number of workers used.
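
To separate the loader from the model, a stripped-down timing check could look like the sketch below. It is not the real pipeline: `slow_ds`, its `delay` argument, the small 3x32x32 tensors, and the helper `time_loader()` are all hypothetical placeholders, with `Sys.sleep()` standing in for the slow per-sample read from the SSD. It only measures the wall-clock time of one full pass over the dataloader for a given `num_workers`, including any worker start-up cost.

```r
library(torch)

# Hypothetical dataset that simulates a slow per-sample read (assumption:
# Sys.sleep() stands in for readRDS() on a large file).
slow_ds <- dataset(
  name = "slow_ds",
  initialize = function(n = 32, delay = 0.02) {
    self$n <- n
    self$delay <- delay
  },
  .getitem = function(i) {
    Sys.sleep(self$delay)
    # Small placeholder tensors, not the real 13x299x299 images.
    list(x = torch_zeros(3, 32, 32), y = torch_tensor(0))
  },
  .length = function() {
    self$n
  }
)

# Hypothetical helper: time one full pass over the dataloader, no model step.
time_loader <- function(num_workers) {
  dl <- dataloader(slow_ds(), batch_size = 8, num_workers = num_workers)
  t0 <- Sys.time()
  coro::loop(for (batch in dl) {
    # intentionally empty: only loading is measured
  })
  as.numeric(difftime(Sys.time(), t0, units = "secs"))
}

# Compare sequential loading with two workers.
timings <- sapply(c(0, 2), time_loader)
print(timings)
```

If the pass with workers is not faster even in such an isolated setup, the bottleneck is more likely in worker start-up or in moving batches between processes than in the per-sample read itself; that reading is an assumption to verify, not a confirmed diagnosis.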

I would be glad if anyone could help me!