Hello,
I am new to R, so hopefully this is the correct place to post general R questions. The issue I am running into is the (seemingly excessive) time it takes to build a data.frame using my current strategy. I have a data.frame of essentially HR data (I'll simply refer to it as HRdf):
ID  | DepartmentStartDate | DepartmentEndDate | Department |
____________________________________________________________
1-1 | 2016-04-03          | 2018-08-30        | 871        |
1-2 | 2018-08-30          | 2040-05-15        | 9914       |
2-1 | 2016-04-03          | 2018-08-30        | 871        |
...
where this data.frame has ~300,000 rows. From it, I want a data.frame that contains, for each date in a range, the number of employees in a given department (I'll refer to this as Countdf):
Dates      | Department | Count |
_________________________________
2020-01-01 | 135        | 2500  |
2021-01-01 | 135        | 2700  |
...
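In case it helps to reproduce the problem, a tiny stand-in for HRdf with the same column layout can be built like this (the IDs, dates, and department codes below are made up):

HRdf <- data.frame(
  ID                  = c("1-1", "1-2", "2-1"),
  DepartmentStartDate = as.Date(c("2016-04-03", "2018-08-30", "2016-04-03")),
  DepartmentEndDate   = as.Date(c("2018-08-30", "2040-05-15", "2018-08-30")),
  Department          = c("871", "9914", "871"),
  stringsAsFactors    = FALSE
)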
Specifically, I choose my daterange and departmentrange by
daterange <- seq.Date(as.Date("2020-01-01"), as.Date(range(HRdf$DepartmentEndDate)[2]), "years")
departmentrange <- c("772", "6871", "6872", "6468")
To generate the data.frame of interest I do the following:

library(dplyr)
library(reshape2)

Countdf <- data.frame(
  dates = daterange,
  sapply(departmentrange, function(iDept) sapply(daterange, function(iDate) {
    count(HRdf[HRdf$DepartmentStartDate < iDate &
                 iDate < HRdf$DepartmentEndDate &
                 HRdf$Department == iDept, ])
  }))
)
colnames(Countdf) <- c("dates", departmentrange)
Countdf <- reshape2::melt(Countdf, id.vars = "dates")
which yields the result I expect. However, working through the ~300,000 rows takes well over 2 hours, whereas similar manipulations in pandas run within minutes on the same machine.
So my question is: is there an obvious bottleneck in my method that is significantly slowing down the filtering and the building of the data.frame?
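In case it helps to reproduce the slowdown without my real data, here is a rough, self-contained way to time the same nested-sapply approach on randomly generated data of a similar size (HRdf_big below is made up purely for timing, not my real data):

# Made-up 300,000-row data set with the same columns, purely for timing
set.seed(1)
n <- 300000
HRdf_big <- data.frame(
  ID                  = paste0(seq_len(n), "-1"),
  DepartmentStartDate = as.Date("2016-01-01") + sample(0:2000, n, replace = TRUE),
  DepartmentEndDate   = as.Date("2025-01-01") + sample(0:2000, n, replace = TRUE),
  Department          = sample(c("772", "871", "6871", "6872", "6468"), n, replace = TRUE),
  stringsAsFactors    = FALSE
)

# Time the repeated filter-and-count over every date/department combination
system.time(
  sapply(departmentrange, function(iDept) sapply(daterange, function(iDate) {
    count(HRdf_big[HRdf_big$DepartmentStartDate < iDate &
                     iDate < HRdf_big$DepartmentEndDate &
                     HRdf_big$Department == iDept, ])
  }))
)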