I have a database, a function, and from that, I can get coef
value (it is calculated through lm
function). There are two ways of calculating: the first is if I want a specific coefficient depending on an ID
, date
and Category
and the other way is calculating all possible coef
, according to subset_df1
.
The code is working. For the first way, it is calculated instantly, but for the calculation of all coefs
, it takes a reasonable amount of time, as you can see. I used the tictoc
function just to show you the calculation time, which gave 633.38 sec elapsed
. An important point to highlight is that df1
is not such a small database, but for the calculation of all coef
I filter, which in this case is subset_df1
.
I made explanations in the code so you can better understand what I'm doing. The idea is to generate coef
values for all dates >=
to date1
.
Finally, I would like to try to reasonably decrease this processing time for calculating all coef
values.
library(dplyr)
library(tidyr)
library(lubridate)
library(tictoc)
#database
df1 <- data.frame( Id = rep(1:5, length=900),
date1 = as.Date( "2021-12-01"),
date2= rep(seq( as.Date("2021-01-01"), length.out=450, by=1), each = 2),
Category = rep(c("ABC", "EFG"), length.out = 900),
Week = rep(c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
"Saturday", "Sunday"), length.out = 900),
DR1 = sample( 200:250, 900, repl=TRUE),
setNames( replicate(365, { sample(0:900, 900)}, simplify=FALSE),
paste0("DRM", formatC(1:365, width = 2, format = "d", flag = "0"))))
return_coef <- function(df1,idd,dmda,CategoryChosse) {
# First idea: Calculate the median of the values resulting from the subtraction between DR01 and the values of the DRM columns
subsetDRM<- df1 %>% select(starts_with("DRM"))
DR1_subsetDRM<-cbind (df1, setNames(df1$DR1 - subsetDRM, paste0(names(subsetDRM), "_PV")))
subset_PV<-select(DR1_subsetDRM,Id, date2,Week, Category, DR1, ends_with("PV"))
result_median<-subset_PV %>%
group_by(Id,Category,Week) %>%
dplyr::summarize(dplyr::across(ends_with("PV"), median),.groups = 'drop')
# Second idea: After obtaining the median, I add the values found with the values of the DRM columns of my df1 database.
Sum_DRM_result_median<-df1%>%
inner_join(result_median, by = c('Id','Category', 'Week')) %>%
mutate(across(matches("^DRM\\d+$"), ~.x + get(paste0(cur_column(), '_PV')),
.names = '{col}_{col}_PV')) %>%
select(Id:Category, DRM01_DRM01_PV:last_col())
Sum_DRM_result_median<-data.frame(Sum_DRM_result_median)
# Third idea: The idea here is to specifically filter a line from Sum_DRM_result_median, which will depend on what the user chooses, for that it will be necessary to choose an Id, date and Category.
# This code `remove_values_0` I use because sometimes i have values equal to zero so i remove these columns
remove_values_0 <- df1 %>%
dplyr::filter(Id==idd,date2 == ymd(dmda), Category == CategoryChosse) %>%
select(starts_with("DRM")) %>%
pivot_longer(cols = everything()) %>%
arrange(desc(row_number())) %>%
mutate(cs = cumsum(value)) %>%
dplyr::filter(cs == 0) %>%
pull(name)
(dropnames <- paste0(remove_values_0,"_",remove_values_0, "_PV"))
filterid_date_category <- Sum_DRM_result_median %>%
filter(Id==idd,date2 == ymd(dmda), Category == CategoryChosse) %>%
select(-any_of(dropnames))
#Fourth idea: After selecting the corresponding row, I need to select the datas for coef calculation. For this, I delete some initial lines, which will depend on the day chosen.
datas <-filterid_date_category %>%
filter(Id==idd,date2 == ymd(dmda)) %>%
group_by(Category) %>%
summarize(across(starts_with("DRM"), sum),.groups = 'drop') %>%
pivot_longer(cols= -Category, names_pattern = "DRM(.+)", values_to = "val") %>%
mutate(name = readr::parse_number(name))
colnames(datas)[-1]<-c("days","numbers")
datas <- datas %>%
group_by(Category) %>%
slice((ymd(dmda) - min(as.Date(df1$date1) [
df1$Category == first(Category)])):max(days)+1) %>%
ungroup
# After I calculate the datas dataset, I used the lm function to obtain the coef value.
mod <- lm(numbers ~ I(days^2), datas)
coef<-coef(mod)[1]
val<-as.numeric(coef(mod)[1])
return(val)
}
To calculate the coef
of a specific ID
, Date
and Category
in my df1
database, I do:
return_coef(df1,"2","2021-12-10","ABC")
[1] 209.262 # This value may vary, as the values in my df1 database vary
To calculate all the coef
, I do:
tic()
subset_df1 <- subset(df1, date2 >= date1)
All<-subset_df1%>%
transmute(
Id,date2,Category,
coef = mapply(return_coef, list(cur_data()), Id, as.Date(date2), Category))
toc()
633.38 sec elapsed