How make more fast this loop for extract meta data with exiftoolr?

M_AcostaCH · February 12, 2024, 3:02pm

Hi community, I have this loop to extract only the filezise from many files. But the total files are like 300,000 with .hips format. So, in a simple rule of three I am getting all the time it is 5.15 days to run the code.

Is there any way to make this loop faster, maybe with map family functions?
Others?

With this function I first get all the directories, then I create the path entering the blobs folder and there I select only the .hips files. After that I capture only the size of each file.

Examples files

library(tidyverse)
library(exiftoolr)

 ### TODOS DIRECTORIOS
    rootPath = '//ALLIANCEDFS.ALLIANCE.CGIAR.ORG/CL07_GRPhistoric/GRP_OPERATION/SEEDS/Videometro/Blobs/BlobCollections Secundary & Tertiary color/'
    
    list_dir_2 <- data.frame(root=list.dirs(rootPath)) |> 
      mutate(folder = gsub(rootPath, "", root)) 

    list_dir_2_blobs <- list_dir_2 |> 
      filter(grepl("/blobs", folder)) |> 
      mutate(folder_final=paste0(root,'/'))

# function 

    get_hips_info <- function(directory) {
      file_names <- list.files(path.expand(directory), pattern = "\\.hips$")
      
      FileSize <- paste0(directory, file_names) %>%
        lapply(function(x) {
          cat("Reading file:", x, "\n")  # Print a message when reading each file
          exif_read(x)$FileSize
        }) %>% unlist

      hips <- data.frame(file = file_names, FileSize = FileSize) %>%
        mutate(measure_1 = paste0('bytes'))
  
      return(hips)
    }
    
 inicio <-  Sys.time() #  start

    #  fuction for each directory
    result_list <- lapply(list_dir_2_blobs $folder_final, get_hips_info)
    
    fin <-  Sys.time() # fin 
    
    tiempo_ejecucion <- fin-inicio;tiempo_ejecucion # time of run
    
    #  dataframe
    result_df <- bind_rows(result_list)

The histogram for the first 10265 .hips file:

Tnks!

nirgrahamuk · February 12, 2024, 3:30pm

you should profile your code, try profvis

spsanderson · February 13, 2024, 3:29pm

This will do it for you and quickly, directory is my own:

 hips_path <- "C:\\Users\\ssanders\\Desktop\\test_hips\\"
tictoc::tic()
df <- list.files(
    path = hips_path,
    full.names = TRUE,
    pattern = "\\.hips$"
  ) |>
  dplyr::as_tibble() |>
  dplyr::mutate(fsize = file.size(value))
tictoc::toc()

opt_breaks <- healthyR::opt_bin(.data = df, .value_col = fsize)[["value"]]

hist(df$fsize, 
     breaks = opt_breaks, 
     main = "File size distribution", 
     xlab = "File size (bytes)", 
     ylab = "Frequency",
     probability = TRUE
     )
lines(density(df$fsize), col = "red")

That took 0.01 seconds for 10 files.

M_AcostaCH · February 13, 2024, 3:45pm

Hi @spsanderson, this solution was very fast.

For 280716 .hips this is the result

> dim(df)
[1] 280716 
> tictoc::toc()
455.19 sec elapsed
7.5865 min

Really amazing. Tnks!

spsanderson · February 13, 2024, 4:46pm

I am happy it worked for you

system · February 20, 2024, 4:46pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.