For Loop taking hours to run (JSON data)

I'm currently trying to read in a JSON file (typically around ~250 MB) and extract parts of the data into a usable dataframe. I'm using a for loop for each "list" within the JSON file, but the issue is that the code takes hours to run.

Would you say this is a normal amount of time to loop through a file of this size? Or is there a faster way to do this, perhaps using purrr? Or is there some other time-consuming part of my code that I could restructure?

The code for my for loop is below:

library(jsonlite)
library(purrr)

file <- fromJSON("somejsonfile.JSON", simplifyVector = FALSE)

jsonfunction <- function(file) {
  x <- list()
  num <- length(file)
  for (i in seq_len(num)) {
    num1 <- length(file[[i]]$Summary$Y)
    m <- list()
    for (j in seq_len(num1)) {
      m[[j]] <- unlist(file[[i]]$Summary$Y[[j]])
      m[[j]][["UTC"]] <- as.character(file[[i]][["Summary"]][["Time"]])
    }
    x <- rbind(x, m)  # x grows on every pass through the outer loop
    m <- list()
  }
  df <- map_df(x, unlist)
  df[df == "NULL"] <- NA
  df
}

Thank you if you are able to help!!

For data at this scale, functionals like purrr::map aren't necessarily faster than loops (see §24.1 of r4ds). However, using them helps you avoid the biggest time sink of hand-written loops: growing the output at each iteration.
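As a rough sketch of that approach (the field names `Summary`, `Y`, and `Time` are assumed from the question, and the toy input below stands in for the real JSON), the nested loops can be collapsed into purrr calls that build the data frame in one pass, with no `rbind` inside a loop:

```r
library(purrr)
library(dplyr)

# Toy input mimicking the structure implied by the question:
# a list of elements, each with $Summary$Time and a list of records in $Summary$Y.
file <- list(
  list(Summary = list(
    Time = "2020-01-01 00:00:00",
    Y = list(list(a = 1, b = 2), list(a = 3, b = 4))
  )),
  list(Summary = list(
    Time = "2020-01-02 00:00:00",
    Y = list(list(a = 5, b = 6))
  ))
)

# One row per record in each element's Summary$Y, carrying the element's
# Summary$Time along as a UTC column. map_dfr binds all rows at the end,
# instead of growing the result on every iteration.
df <- map_dfr(file, function(el) {
  map_dfr(el$Summary$Y, function(y) {
    c(as.list(unlist(y)), UTC = as.character(el$Summary$Time))
  })
})
```

With the toy input this yields a 3-row tibble with columns `a`, `b`, and `UTC`; the same shape of code should apply to the real file once the field names are confirmed.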

The output: output <- vector("double", length(x)). Before you start the loop, you must always allocate sufficient space for the output. This is very important for efficiency: if you grow the output at each iteration using c() (for example), your for loop will be very slow.
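A minimal base-R illustration of that advice (toy functions, not the original code): growing the result with c() copies it on every pass, while preallocating with vector() does a single allocation up front.

```r
# Growing the result inside the loop re-allocates and copies it every
# iteration, so total work is roughly O(n^2).
slow_squares <- function(n) {
  out <- double(0)
  for (i in seq_len(n)) out <- c(out, i^2)  # copies `out` each time
  out
}

# Preallocating the output once keeps the loop O(n).
fast_squares <- function(n) {
  out <- vector("double", n)  # allocate sufficient space before the loop
  for (i in seq_len(n)) out[[i]] <- i^2
  out
}

identical(slow_squares(1000), fast_squares(1000))  # TRUE: same result, very different cost
```

The same pattern applies to the question's rbind(x, m) call: binding everything once at the end (or preallocating a list of known length) avoids the repeated copies.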

