Hello,
I am new to R, so hopefully this is the correct place to post general R questions. The issue I am running into is the (seemingly excessive) time it takes to build a data.frame using my current strategy. I have a data.frame of essentially HR data (I'll simply refer to it as HRdf):
ID  | DepartmentStartDate | DepartmentEndDate | Department |
____________________________________________________________
1-1 | 2016-04-03          | 2018-08-30        | 871        |
1-2 | 2018-08-30          | 2040-05-15        | 9914       |
2-1 | 2016-04-03          | 2018-08-30        | 871        |
...
where this data.frame has ~300,000 rows. From it, I want a data.frame that contains, for each date in a range, the number of employees in a given department (I'll refer to this as Countdf):
Dates      | Department | Count |
_________________________________
2020-01-01 | 135        | 2500  |
2021-01-01 | 135        | 2700  |
...
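In case it helps to reproduce the problem, a tiny stand-in for HRdf with the same column layout can be built like this (the IDs, dates, and department codes below are made up):

HRdf <- data.frame(
  ID                  = c("1-1", "1-2", "2-1"),
  DepartmentStartDate = as.Date(c("2016-04-03", "2018-08-30", "2016-04-03")),
  DepartmentEndDate   = as.Date(c("2018-08-30", "2040-05-15", "2018-08-30")),
  Department          = c("871", "9914", "871"),
  stringsAsFactors    = FALSE
)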
Specifically, I choose my daterange and departmentrange by
daterange <- seq.Date(as.Date("2020-01-01"), as.Date(range(HRdf$DepartmentEndDate)[2]), "years")
departmentrange <- c("772", "6871", "6872", "6468")
To generate the data.frame of interest I do the following:

library(dplyr)
library(reshape2)

Countdf <- data.frame(
  dates = daterange,
  sapply(departmentrange, function(iDept) sapply(daterange, function(iDate) {
    count(HRdf[HRdf$DepartmentStartDate < iDate &
                 iDate < HRdf$DepartmentEndDate &
                 HRdf$Department == iDept, ])
  }))
)
colnames(Countdf) <- c("dates", departmentrange)
Countdf <- reshape2::melt(Countdf, id.vars = "dates")
which yields the result I expect. However, working through the ~300,000 rows takes well over 2 hours, whereas similar manipulations in pandas run within minutes on the same machine.
So my question is: is there an obvious bottleneck in my method that is significantly slowing down the filtering and the building of the data.frame?
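In case it helps to reproduce the slowdown without my real data, here is a rough, self-contained way to time the same nested-sapply approach on randomly generated data of a similar size (HRdf_big below is made up purely for timing, not my real data):

# Made-up 300,000-row data set with the same columns, purely for timing
set.seed(1)
n <- 300000
HRdf_big <- data.frame(
  ID                  = paste0(seq_len(n), "-1"),
  DepartmentStartDate = as.Date("2016-01-01") + sample(0:2000, n, replace = TRUE),
  DepartmentEndDate   = as.Date("2025-01-01") + sample(0:2000, n, replace = TRUE),
  Department          = sample(c("772", "871", "6871", "6872", "6468"), n, replace = TRUE),
  stringsAsFactors    = FALSE
)

# Time the repeated filter-and-count over every date/department combination
system.time(
  sapply(departmentrange, function(iDept) sapply(daterange, function(iDate) {
    count(HRdf_big[HRdf_big$DepartmentStartDate < iDate &
                     iDate < HRdf_big$DepartmentEndDate &
                     HRdf_big$Department == iDept, ])
  }))
)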