Split hundreds of millions of rows into lists to apply function

I am trying to run a prophet model to forecast demand for each store-item pair. This is the function I need to apply to each of the 1.5 million store-item pairs:

library(prophet)
library(dplyr)

# holidays: a data frame of holidays, defined elsewhere
prophet_model <- function(df) {
  #Sort by Date
  df <- df[order(df$ds), ]
  #Divide data into test and train
  test <- tail(df, ceiling(nrow(df) * 0.06))
  train <- df[!df$ds %in% test$ds, ]
  #Train model
  model_prophet <- prophet(train[, c('ds', 'y')], holidays = holidays, daily.seasonality = FALSE)
  #Test Model
  test_forecast <- predict(model_prophet, test)
  #Predict for next week (predict() expects a column named ds)
  dates <- data.frame(ds = seq(Sys.Date() + 1, by = "day", length.out = 7))
  forecast <- predict(model_prophet, dates)
  forecast <- forecast[, c("ds", "yhat", "yhat_upper", "yhat_lower")]
  forecast <- forecast %>% mutate(item = unique(factor(df$item)), store = unique(factor(df$store)))
  #Test accuracy
  # (the original post is cut off here; returning forecast is an assumption)
  forecast
}

I need to split 180 million rows into a list of data frames, one per unique pair of columns, and then apply a function to each element using parLapply(). But the R session crashes or just keeps running when I try to split the data frame. I have tried split() and group_split() so far:

data<-df %>% group_split(col1,col2)

data <- split(df, list(df$col1, df$col2))

I am trying to use parLapply() but can't run it without first splitting the data frame into a list. Also, since I am working on Windows, it is difficult to load this data on each cluster worker.

result <- parLapply(cl, data, prophet_model)

I also tried applying the function directly using do(), but it showed an estimated 1000 hours to completion:

data <- df %>% group_by(col1, col2) %>% do(prophet_model(.))

The function works on a small dataset. I have tried both parallel processing and do() for a few pairs and they worked fine.
Please let me know if there is any other way of splitting this large dataset or applying the function to it.

I don't know what you are doing, but I'll bet that it is probably not required for you to actually split data into lists. Can you give a small example of what you are trying to do and why it's only possible with lapply? If your data fits into memory you can also use future/furrr to run things in parallel.

If you do need to do it all at once, then I would first make sure that it works on a subset of the data. However, doing something with that much data will always be a challenge, so you might want to do it in a more performant language or use something like Spark via SparkR or sparklyr.
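To make the future/furrr suggestion a bit more concrete, here is a minimal sketch on toy data. It assumes the dplyr, future and furrr packages are installed, and `fit_one()` is a hypothetical stand-in for your per-pair model function (it only computes a mean so that the example runs quickly). The column names `col1`, `col2`, `ds` and `y` follow the question; the toy values are made up.

```r
library(dplyr)
library(future)
library(furrr)

# Toy data standing in for the real table: 4 pairs x 30 days.
set.seed(42)
df <- expand.grid(
  ds = seq(as.Date("2023-01-01"), by = "day", length.out = 30),
  col1 = c("s1", "s2"),
  col2 = c("a", "b"),
  stringsAsFactors = FALSE
)
df$y <- rnorm(nrow(df))

# Hypothetical stand-in for the real per-pair function.
fit_one <- function(d) {
  data.frame(col1 = d$col1[1], col2 = d$col2[1], mean_y = mean(d$y))
}

# multisession starts background R sessions, so it also works on Windows.
plan(multisession, workers = 2)

pieces <- group_split(df, col1, col2)    # one data frame per pair
results <- future_map(pieces, fit_one)   # each piece is sent to a worker
out <- bind_rows(results)

plan(sequential)                         # shut the workers down again
```

Note that this still builds the full list of pieces in the main session first, so it only helps if that split itself fits into memory.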

Thanks for your response. I have added more details about my problem in the above description. I am trying to scale the implementation of my model. I was trying to split it into lists so as to apply parallel processing.

I will try future/furrr to see if that works. Will it work on the entire dataset directly?

Yes, the way furrr works is that it uses future to distribute the computation. You can even distribute to a remote cluster to speed up the process. But if you are trying to use prophet on 180 million rows, you are going to have a bad time. prophet uses Bayesian statistics/Stan, which is quite computationally heavy, so regardless of what you do, it'll take a very (very) long time.
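As a rough back-of-the-envelope check, you can take the 1000 hours that do() estimated for the 1.5 million pairs and see what it implies per pair and per core count. The sketch below assumes perfect scaling across cores, which is optimistic because it ignores the cost of shipping data to the workers; the core counts are just illustrative.

```r
n_pairs <- 1.5e6          # store-item pairs, from the question
total_hours <- 1000       # sequential estimate reported by do()

# Implied time per model fit, in seconds.
secs_per_fit <- total_hours * 3600 / n_pairs

# Best-case wall-clock time if the work splits evenly over the cores.
cores <- c(1, 8, 64)      # illustrative core counts (assumption)
est <- data.frame(
  cores = cores,
  hours = total_hours / cores,
  days  = total_hours / cores / 24
)
print(secs_per_fit)
print(est)
```

That works out to about 2.4 seconds per pair, so even with 8 cores and ideal scaling you are still looking at several days.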

So, either simplify your problem (e.g., take only x days/weeks/months), rent a beefy server with lots and lots of cores and RAM, or prepare to wait for days.
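For the "simplify your problem" route, one possible approach is to trim the history first and then work through the pairs in batches, so that only one batch is ever split into a list at a time. This is a sketch in base R on toy data; the 28-day window, the batch size and `fit_one()` (a hypothetical stand-in for the per-pair model function) are all assumptions to adjust.

```r
# Toy data standing in for the real table: 4 pairs x 100 days.
set.seed(1)
df <- expand.grid(
  ds = seq(as.Date("2023-01-01"), by = "day", length.out = 100),
  col1 = c("s1", "s2"),
  col2 = c("a", "b"),
  stringsAsFactors = FALSE
)
df$y <- rnorm(nrow(df))

# Hypothetical stand-in for the real per-pair function.
fit_one <- function(d) {
  data.frame(col1 = d$col1[1], col2 = d$col2[1],
             n_days = nrow(d), mean_y = mean(d$y))
}

# 1. Simplify: keep only the most recent 28 days (window is an assumption).
recent <- df[df$ds >= max(df$ds) - 27, ]

# 2. Assign each unique pair to a batch.
pairs <- unique(recent[, c("col1", "col2")])
batch_size <- 2
batch_id <- ceiling(seq_len(nrow(pairs)) / batch_size)

# 3. Split and process one batch at a time.
results <- vector("list", max(batch_id))
for (b in seq_len(max(batch_id))) {
  keys <- pairs[batch_id == b, ]
  chunk <- merge(recent, keys, by = c("col1", "col2"))
  # drop = TRUE skips col1/col2 combinations that never occur together
  pieces <- split(chunk, list(chunk$col1, chunk$col2), drop = TRUE)
  results[[b]] <- do.call(rbind, lapply(pieces, fit_one))
}
out <- do.call(rbind, results)
```

Inside the loop, `lapply(pieces, fit_one)` could be swapped for `parLapply(cl, pieces, fit_one)`, so that only the current batch is sent to the workers rather than the whole dataset.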

