Create subsets of records occurring on same dates and different dates

April · October 4, 2019, 3:36pm

I have a dataset which has clock in and clock out times of employees. Some of these employees have clocked in and clocked out on same day, some have clocked in on one day and clocked out on another day (for example clocked in today and clocked out tomorrow) and some have forgot to either entry their clock in or clock out time. I need to separate these 3 cases into three different datasets and calculate the number of hours an employee has worked in a day. Below is how the dataset looks like:

| Id |    Date   |                Time | Time_event_type |
|----|:---------:|--------------------:|-----------------|
| 82 | 3/11/2019 | 1899-12-31 10:41:51 | Clock-In        |
| 82 | 3/11/2019 | 1899-12-31 20:15:16 | Clock-Out       |
| 82 | 3/12/2019 | 1899-12-31 10:51:12 | Clock-In        |
| 82 | 3/12/2019 | 1899-12-31 20:10:07 | Clock-Out       |
| 19 | 3/11/2019 | 1899-12-31 07:00:00 | Clock-Out       |
| 19 | 3/12/2019 | 1899-12-31 05:45:57 | Clock-In        |
| 19 | 3/12/2019 | 1899-12-31 18:05:15 | Clock-Out       |
| 19 | 3/13/2019 | 1899-12-31 16:23:09 | Clock-In        |
| 19 | 3/14/2019 | 1899-12-31 18:05:15 | Clock-Out       |
| 19 | 3/18/2019 | 1899-12-31 17:44:23 | Clock-In        |
| 68 | 3/12/2019 | 1899-12-31 16:52:39 | Clock-In        |
| 68 | 3/13/2019 | 1899-12-31 05:00:50 | Clock-Out       |
| 68 | 3/13/2019 | 1899-12-31 17:03:01 | Clock-In        |
| 68 | 3/14/2019 | 1899-12-31 04:53:25 | Clock-Out       |
| 68 | 3/14/2019 | 1899-12-31 17:02:52 | Clock-In        |
| 68 | 3/15/2019 | 1899-12-31 05:20:22 | Clock-Out       |

which would be:

Id = c("82","82","82","82","19","19","19","19","19","19","68","68","68","68","68","68")
Date = c("3/11/2019","3/11/2019","3/12/2019","3/12/2019","3/11/2019","3/12/2019","3/12/2019",
         "3/13/2019","3/14/2019","3/18/2019","3/12/2019","3/13/2019","3/13/2019","3/14/2019","3/14/2019","3/15/2019")
Time = as.POSIXct(c("1899-12-31 10:41:51","1899-12-31 20:15:16","1899-12-31 10:51:12","1899-12-31 20:10:07",
"1899-12-31 07:00:00","1899-12-31 05:45:57","1899-12-31 18:05:15","1899-12-31 16:23:09","1899-12-31 18:05:15",
"1899-12-31 17:44:23","1899-12-31 16:52:39","1899-12-31 05:00:50","1899-12-31 17:03:01","1899-12-31 04:53:25",
"1899-12-31 17:02:52","1899-12-31 05:20:22"))
Time_event_type = c("Clock-In","Clock-Out","Clock-In","Clock-Out","Clock-Out","Clock-In", "Clock-Out",
  "Clock-In","Clock-Out","Clock-In","Clock-In", "Clock-Out","Clock-In","Clock-Out","Clock-In","Clock-Out")
df2=data.frame(Id,Date,Time,Time_event_type)
df2$Date= as.Date(df2$Date, format = "%m/%d/%Y")

The output should be as follows:

1st case: When the employees clock in and clock out on same date (grouped on Id and Date)

| Id |    Date   |                Time | Time_event_type | Diff(In seconds) |
|----|:---------:|--------------------:|-----------------|------------------|
| 82 | 3/11/2019 | 1899-12-31 10:41:51 | Clock-In        | NA               |
| 82 | 3/11/2019 | 1899-12-31 20:15:16 | Clock-Out       | 34405            |
| 82 | 3/12/2019 | 1899-12-31 10:51:12 | Clock-In        | NA               |
| 82 | 3/12/2019 | 1899-12-31 20:10:07 | Clock-Out       | 33535            |
| 19 | 3/12/2019 | 1899-12-31 05:45:57 | Clock-In        | NA               |
| 19 | 3/12/2019 | 1899-12-31 18:05:15 | Clock-Out       | 44358            |

2nd Case: When the employee clocks in and clocks out on different dates

| Id |    Date   |                Time | Time_event_type | Diff (In Seconds) |
|----|:---------:|--------------------:|-----------------|-------------------|
| 19 | 3/13/2019 | 1899-12-31 16:23:09 | Clock-In        | NA                |
| 19 | 3/14/2019 | 1899-12-31 18:05:15 | Clock-Out       | 6126              |
| 68 | 3/12/2019 | 1899-12-31 16:52:39 | Clock-In        | NA                |
| 68 | 3/13/2019 | 1899-12-31 05:00:50 | Clock-Out       | 43691             |
| 68 | 3/13/2019 | 1899-12-31 17:03:01 | Clock-In        | NA                |
| 68 | 3/14/2019 | 1899-12-31 04:53:25 | Clock-Out       | 42624             |
| 68 | 3/14/2019 | 1899-12-31 17:02:52 | Clock-In        | NA                |
| 68 | 3/15/2019 | 1899-12-31 05:20:22 | Clock-Out       | 44250             |

3rd Case:
When the employee forgets to either clock in or clock out (I don't need to calculate difference in this case)

| Id |    Date   |                Time | Time_event_type |
|----|:---------:|--------------------:|-----------------|
| 19 | 3/11/2019 | 1899-12-31 07:00:00 | Clock-Out       |
| 19 | 3/18/2019 | 1899-12-31 17:44:23 | Clock-In        |

In order to get the 1st case I tried the following code, but it gives all the duplicate records, even the ones on different dates.

df2_same=df2 %>% group_by(Id,Date) %>% mutate(dupe = n()>1)

For the 2nd case, I tried

df2_diff=df2[!(duplicated(df2[c("Id","Date")]) | duplicated(df2[c("Id","Date")], fromLast = TRUE)), ]

but this also lists out the 3rd case, which I want separately. Can someone please help me in getting the desired outputs?

andresrcs · October 5, 2019, 2:49pm

This seems tricky since there is no identifier for a complete "In-Out" cycle so I can't come up with a reliable method but maybe this would get you one step closer.

Id = c("82","82","82","82","19","19","19","19","19","19","68","68","68","68","68","68")
Date = c("3/11/2019","3/11/2019","3/12/2019","3/12/2019","3/11/2019","3/12/2019","3/12/2019",
         "3/13/2019","3/14/2019","3/18/2019","3/12/2019","3/13/2019","3/13/2019","3/14/2019","3/14/2019","3/15/2019")
Time = as.POSIXct(c("1899-12-31 10:41:51","1899-12-31 20:15:16","1899-12-31 10:51:12","1899-12-31 20:10:07",
                    "1899-12-31 07:00:00","1899-12-31 05:45:57","1899-12-31 18:05:15","1899-12-31 16:23:09","1899-12-31 18:05:15",
                    "1899-12-31 17:44:23","1899-12-31 16:52:39","1899-12-31 05:00:50","1899-12-31 17:03:01","1899-12-31 04:53:25",
                    "1899-12-31 17:02:52","1899-12-31 05:20:22"))
Time_event_type = c("Clock-In","Clock-Out","Clock-In","Clock-Out","Clock-Out","Clock-In", "Clock-Out",
                    "Clock-In","Clock-Out","Clock-In","Clock-In", "Clock-Out","Clock-In","Clock-Out","Clock-In","Clock-Out")
df2=data.frame(Id,Date,Time,Time_event_type)
df2$Date= as.Date(df2$Date, format = "%m/%d/%Y")

library(tidyverse)
library(lubridate)

df2 %>%
    mutate(date_time = ymd_hms(paste(Date, str_extract(Time, "\\s.+$")))) %>% 
    group_by(Id) %>% 
    mutate(incomplete = (Time_event_type == "Clock-In" & is.na(lead(Time_event_type))) |
               (Time_event_type == "Clock-Out" & is.na(lag(Time_event_type))) |
               (Time_event_type == "Clock-Out" & lag(Time_event_type) == "Clock-Out") |
               (Time_event_type == "Clock-In" & lead(Time_event_type) == "Clock-In")
           ) %>% 
    filter(incomplete == FALSE) %>%
    ungroup() %>% 
    mutate(event = rep(1:(nrow(.)/2), each = 2)) %>% 
    pivot_wider(id_cols = c(Id, event), names_from = Time_event_type, values_from = date_time) %>% 
    select(-event) %>% 
    mutate(diff = difftime(`Clock-Out`, `Clock-In`, units = "hour"))
#> # A tibble: 7 x 4
#>   Id    `Clock-In`          `Clock-Out`         diff           
#>   <fct> <dttm>              <dttm>              <drtn>         
#> 1 82    2019-03-11 10:41:51 2019-03-11 20:15:16  9.556944 hours
#> 2 82    2019-03-12 10:51:12 2019-03-12 20:10:07  9.315278 hours
#> 3 19    2019-03-12 05:45:57 2019-03-12 18:05:15 12.321667 hours
#> 4 19    2019-03-13 16:23:09 2019-03-14 18:05:15 25.701667 hours
#> 5 68    2019-03-12 16:52:39 2019-03-13 05:00:50 12.136389 hours
#> 6 68    2019-03-13 17:03:01 2019-03-14 04:53:25 11.840000 hours
#> 7 68    2019-03-14 17:02:52 2019-03-15 05:20:22 12.291667 hours

^{Created on 2019-10-05 by the reprex package (v0.3.0.9000)}

pieterjanvc · October 5, 2019, 3:23pm

Hi,

I came up with a method that will create the three datasets and takes the incomplete ones into account.

library(dplyr)
library(lubridate)
library(stringr)
library(purrr)

Id = c("82","82","82","82","19","19","19","19","19","19","68","68","68","68","68","68")
Date = c("3/11/2019","3/11/2019","3/12/2019","3/12/2019","3/11/2019","3/12/2019","3/12/2019",
         "3/13/2019","3/14/2019","3/18/2019","3/12/2019","3/13/2019","3/13/2019","3/14/2019","3/14/2019","3/15/2019")
Time = as.POSIXct(c("1899-12-31 10:41:51","1899-12-31 20:15:16","1899-12-31 10:51:12","1899-12-31 20:10:07",
                    "1899-12-31 07:00:00","1899-12-31 05:45:57","1899-12-31 18:05:15","1899-12-31 16:23:09","1899-12-31 18:05:15",
                    "1899-12-31 17:44:23","1899-12-31 16:52:39","1899-12-31 05:00:50","1899-12-31 17:03:01","1899-12-31 04:53:25",
                    "1899-12-31 17:02:52","1899-12-31 05:20:22"))
Time_event_type = c("Clock-In","Clock-Out","Clock-In","Clock-Out","Clock-Out","Clock-In", "Clock-Out",
                    "Clock-In","Clock-Out","Clock-In","Clock-In", "Clock-Out","Clock-In","Clock-Out","Clock-In","Clock-Out")
df2=data.frame(Id,Date,Time,Time_event_type)
df2$Date= as.Date(df2$Date, format = "%m/%d/%Y")

#Merge the date time and order by Id and date
df2 = df2 %>% mutate(dateTime = as.POSIXct(paste(Date, str_extract(Time, "\\d+:\\d+:\\d+")))) %>% 
  arrange(Id, dateTime) %>% mutate(order = 1:n())

#Check all Clock-In and see if check out is available  
myData = map_df(df2 %>% filter(Time_event_type == "Clock-In") %>% pull(order), function(x){
  
  if(is.na(df2$Time_event_type[x+1]) || df2$Id[x] != df2$Id[x+1]){
    #Last record of employee is Clock-In
    data.frame()
    
  } else if(df2$Time_event_type[x+1] == "Clock-Out"){
    #Next record is clock out with same Id
    data.frame(Id = df2$Id[x], 
               clockIn = df2$dateTime[x], 
               clockOut = df2$dateTime[x+1],
               order = df2$order[x])
  } else {
    #Missing data
    data.frame()
  }
}) 

#Check if clock in and out are on same day (use as_date from lubridate!)
myData = myData %>% mutate(timeDiff = interval(clockIn, clockOut) / seconds(1),
              sameDay = as_date(clockIn) == as_date(clockOut))

#Create the same and different dates datasets
sameDay = myData %>% filter(sameDay == T) %>% select(-order, -sameDay)
diffDay = myData %>% filter(sameDay == F) %>% select(-order, -sameDay)

#Any recored not appearing either of the above datasets is an incomplete one
incompleteDay = map_df(setdiff(df2$order, c(myData$order, myData$order +1)), function(x){
  if(df2$Time_event_type[x] == "Clock-In"){
    data.frame(Id = df2$Id[x], 
               clockIn = df2$dateTime[x], 
               clockOut = NA)
  } else {
    data.frame(Id = df2$Id[x], 
               clockIn = NA, 
               clockOut = df2$dateTime[x])
  }
})

#RESULT
sameDay
  Id             clockIn            clockOut timeDiff
1 19 2019-03-12 05:45:57 2019-03-12 18:05:15    44358
2 82 2019-03-11 10:41:51 2019-03-11 20:15:16    34405
3 82 2019-03-12 10:51:12 2019-03-12 20:10:07    33535

diffDay
  Id             clockIn            clockOut timeDiff
1 19 2019-03-13 16:23:09 2019-03-14 18:05:15    92526
2 68 2019-03-12 16:52:39 2019-03-13 05:00:50    43691
3 68 2019-03-13 17:03:01 2019-03-14 04:53:25    42624
4 68 2019-03-14 17:02:52 2019-03-15 05:20:22    44250

incompleteDay
  Id             clockIn            clockOut
1 19                <NA> 2019-03-11 07:00:00
2 19 2019-03-18 17:44:23                <NA>

Hope this helps,
PJ

April · October 7, 2019, 7:58pm

Thanks a lot! This works perfectly

system · October 14, 2019, 7:58pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.