Dear All, I am trying to run a function (growth over years) on each row of a data.frame / data.table / tbl, and the data are huge (1.5 million rows). My main problem is that it takes a long time.
I am wondering if someone can help me with this, whether via mutate in the tidyverse, by_row in purrrlyr, or any function in data.table.
library(tidyverse)
library(purrrlyr)
library(reprex)
library(data.table)
growth.ls <- function(values, n = seq_along(values)) {
  # values <- as.numeric(values)
  if (anyNA(values)) return(NA_real_)
  if (any(!is.finite(values))) return(NA_real_)
  if (any(values <= 0)) return(NA_real_)
  # regress the natural log of the values on the time index n
  z <- lm(log(values) ~ n)
  # exponentiate the slope, subtract 1, express as a percentage
  (exp(z$coefficients[[2]]) - 1) * 100
}
set.seed(45L)
y <- data.table(
  x  = sample(letters[1:10], 10^6, replace = TRUE),
  x1 = sample(0.5:4, 10^6, replace = TRUE),
  x2 = rnorm(10^6),
  x3 = rnorm(10^6)
)
y %>% by_row(.collate = "rows", ..f = function(this_row) {
  this_row[2:4] %>% unlist() %>% growth.ls()
})
Can you provide a smaller example showing the desired output?
x <- data.table(
  y = letters[1:4],
  x1990 = c(1, 1, 1, 2),
  x1991 = c(2, 1, 1, 1),
  x1992 = c(3, 3, 3, 0.5),
  x1993 = c(5, 2, 2, 4),
  x1994 = c(7, 3, 5, NA_real_),
  x1995 = c(9, 8, 10, 1)
)
If I apply growth.ls to the year columns (x1990:x1995):
apply(x[, paste0("x", 1990:1995), with = FALSE], 1, growth.ls)
# [1] 56.88514 53.77536 58.00203 NA
What if x has 1.5 million rows? What is the best way to calculate a new column "growth" other than using apply?
I do not mind using any package, as long as it helps me deliver my work.
Thanks in advance
One thing I can suggest is to take a look at the "R for Data Science" book, specifically the "Many Models" chapter. I think it does more or less what you want to do.
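A minimal sketch of that "many models" idea, applied to the small table x from above (I am assuming the year columns x1990:x1995 from that example; adjust the names for the real data):
library(tidyverse)
as_tibble(x) %>%
  gather(year, value, x1990:x1995) %>%    # reshape: one row per id/year pair
  group_by(y) %>%
  nest() %>%                              # one small data frame per original row
  mutate(growth = map_dbl(data, ~ growth.ls(.x$value)))
The same result can be had without nesting via group_by(y) %>% summarise(growth = growth.ls(value)); the nested form just mirrors the chapter more closely.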
However, one general note is that when you do anything 1.5 million times it is quite obviously going to take a long time. What you can do to speed it up a little (depending on the number of cores you have) is to use one of the parallel-processing options available in R, e.g. the future package or parallel::mclapply.
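A minimal sketch with the base parallel package (I am assuming the small table x and the growth.ls function from above; parApply runs on a PSOCK cluster, which also works on Windows, whereas mclapply forks and is effectively Unix-only):
library(parallel)
library(data.table)
vals <- as.matrix(x[, paste0("x", 1990:1995), with = FALSE])
cl <- makeCluster(detectCores() - 1)    # leave one core free
clusterExport(cl, "growth.ls")          # the workers need to see the function
x[, growth := parApply(cl, vals, 1, growth.ls)]
stopCluster(cl)
Whether this pays off depends on how expensive each call is; for a very cheap function the cluster overhead can eat most of the gain.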
Finally, you can try some big-data solutions, such as Spark or h2o, but then, of course, you need to find a way to translate what you are doing into the appropriate code. On a similar note, you can try to reformulate your problem to build one huge model instead of 1.5 million small ones. With Spark and h2o this approach is likely to be more efficient, since they will do a lot of the optimisation for you.
Is it possible to provide an example I could follow?
Package ‘multicore’ was removed from the CRAN repository.
Formerly available versions can be obtained from the archive.
Consider using package ‘parallel’ instead.
However, when I look for packages to install via RStudio (Windows), there are a lot of them (parallelDist, ParallelForest, etc.), and the parallel package itself is not listed.
I followed this post to use the future package with purrr.
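A minimal sketch of what that future + purrr combination could look like, using the furrr package (furrr::future_map_dbl is my assumption here; the linked post may wire future and purrr together differently):
library(data.table)
library(furrr)
plan(multisession)                       # multisession also works on Windows
vals <- as.matrix(x[, paste0("x", 1990:1995), with = FALSE])
# asplit() (R >= 3.6) turns the matrix into a list of row vectors
x[, growth := future_map_dbl(asplit(vals, 1), growth.ls)]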