Calculating new group variable based on timestamps

jwin · August 28, 2019, 6:01pm

I have a data frame containing continuous data of subjects' emotional responses to different stimuli. Some of the stimuli have three parts. I now want to create a new variable, based on the timestamps in the dataset, that assigns each row to the corresponding stimulus part. I've created a test dataset to illustrate my problem:

#create variable subject
subject=c("VP01", "VP01", "VP01", "VP01", "VP01", "VP01", "VP01", "VP01", "VP01", "VP01", "VP01", "VP02", "VP02", "VP02", "VP02", "VP02", "VP02", "VP02", "VP02", "VP02", "VP02", "VP02", "VP02", "VP02", "VP02")

#create variable event
event=c("calib", "calib", "stim1", "stim1", "stim1", "stim2", "stim2", "stim2", "stim2", "stim2", "stim2", "calib", "calib", "stim1", "stim1", "stim1", "stim3", "stim3", "stim3", "stim3", "stim3", "stim3", "stim3", "stim3", "stim3")

#create variable sad
sad=c(0, 0, 1, 1, 2, 3, 3, 6, 6, 4, 7, 1, 1, 2, 1, 1, 4, 7, 2, 4, 6, 7, 5, 4, 6)

#create variable happy
happy=c(0, 1, 1, 0, 2, 3, 4, 6, 7, 4, 6, 1, 1, 2, 5, 1, 4, 6, 2, 7, 4, 7, 5, 2, 3)

#create variable time
time=c("00:10:49.863", "00:10:50.863", "00:10:51.863", "00:10:52.863", "00:10:53.863", "00:10:54.863", "00:10:55.863", "00:10:56.863", "00:10:57.863", "00:10:58.863", "00:10:59.863", "00:11:00.863", "00:11:01.863", "00:11:02.863", "00:11:03.863", "00:11:04.863", "00:11:05.863", "00:11:06.863", "00:11:07.863", "00:11:08.863", "00:11:09.863", "00:11:10.863", "00:11:11.863", "00:11:12.863", "00:11:13.863")

#create test data set
testdata <- data.frame(subject,event,time,sad,happy)

My first problem is that currently the timestamps are string variables, but I guess I would have to somehow convert them into actual timestamps because of my second problem. The format here is hh:mm:ss.000, but actually hh:mm:ss would suffice.
The variable "event" indicates which stimulus each row belongs to. stim2 and stim3 are each divided into three parts. The start of part 1 is relative, so its timestamp differs between subjects; for stim2 I would infer it from the first row in which stim2 appears for a subject. Part 1 of stim2 always ends after 2 seconds, part 2 ends after 1 second, and part 3 ends at the last row in which "stim2" appears in the column "event" for that subject. stim3 is similar: part 1 begins at the first appearance of "stim3" in the column "event", part 1 ends after 1 second, part 2 ends after 2 seconds, and part 3 ends at the last row in which "stim3" appears in "event" for that subject.

So what I would like to end up with is a dataframe that looks like testdata2:

part=c("calib", "calib", "stim1", "stim1", "stim1", "stim2_1", "stim2_1", "stim2_1", "stim2_2", "stim2_3", "stim2_3","calib", "calib", "stim1", "stim1", "stim1", "stim3_1", "stim3_1", "stim3_2", "stim3_2", "stim3_2", "stim3_3", "stim3_3", "stim3_3", "stim3_3")

#create test data set
testdata2 <- data.frame(subject,event,part,time,sad,happy)

My actual data is much more fine-grained (7-8 rows per second), but I tried to simplify things for the sake of this example; I hope it works anyway. I'm an absolute beginner with R and coding in general and I don't know how to go about this at all. Any help is greatly appreciated!

technocrat · August 28, 2019, 6:51pm

Hi, I suggest first taking a look at the lubridate package to convert your strings into datetime objects. It's pretty easy. Then, if you can put your sample into a reproducible example, called a reprex, it will be easier for everyone to weigh in.

valeri · August 28, 2019, 7:52pm

The logic you describe in step 2 is difficult to follow. I hope this at least brings you a step closer: it shows where each event starts and ends for each subject. You will then need to add further logic for the middle parts of stim2 and stim3.

library(tidyverse)

#create variable subject
subject=c("VP01", "VP01", "VP01", "VP01", "VP01", "VP01", "VP01", "VP01", "VP01", "VP01", "VP01", "VP02", "VP02", "VP02", "VP02", "VP02", "VP02", "VP02", "VP02", "VP02", "VP02", "VP02", "VP02", "VP02", "VP02")

#create variable event
event=c("calib", "calib", "stim1", "stim1", "stim1", "stim2", "stim2", "stim2", "stim2", "stim2", "stim2", "calib", "calib", "stim1", "stim1", "stim1", "stim3", "stim3", "stim3", "stim3", "stim3", "stim3", "stim3", "stim3", "stim3")

#create variable sad
sad=c(0, 0, 1, 1, 2, 3, 3, 6, 6, 4, 7, 1, 1, 2, 1, 1, 4, 7, 2, 4, 6, 7, 5, 4, 6)

#create variable happy
happy=c(0, 1, 1, 0, 2, 3, 4, 6, 7, 4, 6, 1, 1, 2, 5, 1, 4, 6, 2, 7, 4, 7, 5, 2, 3)

#create variable time
time=c("00:10:49.863", "00:10:50.863", "00:10:51.863", "00:10:52.863", "00:10:53.863", "00:10:54.863", "00:10:55.863", "00:10:56.863", "00:10:57.863", "00:10:58.863", "00:10:59.863", "00:11:00.863", "00:11:01.863", "00:11:02.863", "00:11:03.863", "00:11:04.863", "00:11:05.863", "00:11:06.863", "00:11:07.863", "00:11:08.863", "00:11:09.863", "00:11:10.863", "00:11:11.863", "00:11:12.863", "00:11:13.863")

#create test data set
testdata <- tibble(subject, event, time, sad, happy) %>%
  mutate(time = as.POSIXct(strptime(substr(time, 1, 8), "%H:%M:%S"))) %>%
  group_by(subject) %>%
  mutate(dt_secs = as.numeric(difftime(time, lag(time), units = "secs"))) %>%
  tidyr::replace_na(list(dt_secs = 0)) %>%
  group_by(subject, event) %>%
  mutate(cum_time = cumsum(dt_secs),
         is_first_for_event = cum_time == min(cum_time),
         is_last_for_event = cum_time == max(cum_time))

testdata
# A tibble: 25 x 9
# Groups:   subject, event [6]
   subject event time                  sad happy dt_secs cum_time is_first_for_event is_last_for_event
   <chr>   <chr> <dttm>              <dbl> <dbl>   <dbl>    <dbl> <lgl>              <lgl>            
 1 VP01    calib 2019-08-28 00:10:49     0     0       0        0 TRUE               FALSE            
 2 VP01    calib 2019-08-28 00:10:50     0     1       1        1 FALSE              TRUE             
 3 VP01    stim1 2019-08-28 00:10:51     1     1       1        1 TRUE               FALSE            
 4 VP01    stim1 2019-08-28 00:10:52     1     0       1        2 FALSE              FALSE            
 5 VP01    stim1 2019-08-28 00:10:53     2     2       1        3 FALSE              TRUE             
 6 VP01    stim2 2019-08-28 00:10:54     3     3       1        1 TRUE               FALSE            
 7 VP01    stim2 2019-08-28 00:10:55     3     4       1        2 FALSE              FALSE            
 8 VP01    stim2 2019-08-28 00:10:56     6     6       1        3 FALSE              FALSE            
 9 VP01    stim2 2019-08-28 00:10:57     6     7       1        4 FALSE              FALSE            
10 VP01    stim2 2019-08-28 00:10:58     4     4       1        5 FALSE              FALSE            
# … with 15 more rows
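
A minimal sketch of the additional logic for the middle parts, written in base R so that it runs without extra packages. It is not the code anyone in this thread used. The helper `to_seconds()` and the lookup `part_ends` are hypothetical names, and the sketch rests on three assumptions: each event forms one block per subject, each part's duration is counted from the end of the previous part, and a row that falls exactly on a boundary belongs to the earlier part.

```r
# Rebuild the relevant columns of the thread's test data (same values, written compactly).
subject <- rep(c("VP01", "VP02"), times = c(11, 14))
event <- c(rep(c("calib", "stim1", "stim2"), times = c(2, 3, 6)),
           rep(c("calib", "stim1", "stim3"), times = c(2, 3, 9)))
offsets <- 649L + 0:24  # whole seconds since midnight, 00:10:49 to 00:11:13
time <- sprintf("00:%02d:%02d.863", offsets %/% 60L, offsets %% 60L)
testdata <- data.frame(subject, event, time, stringsAsFactors = FALSE)

# Hypothetical helper: convert "hh:mm:ss.sss" strings to seconds since midnight.
to_seconds <- function(x) {
  vapply(strsplit(x, ":", fixed = TRUE),
         function(p) sum(as.numeric(p) * c(3600, 60, 1)),
         numeric(1))
}
testdata$secs <- to_seconds(testdata$time)

# Seconds elapsed since the first row of each subject/event block.
# Rounding guards the boundary comparisons against floating-point noise.
testdata$elapsed <- round(
  ave(testdata$secs, testdata$subject, testdata$event,
      FUN = function(s) s - min(s)),
  3)

# Cumulative end times (seconds) of parts 1 and 2; part 3 runs to the last row.
# ASSUMPTION: durations are counted from the end of the previous part, and a
# row exactly on a boundary belongs to the earlier part.
part_ends <- list(stim2 = c(2, 2 + 1), stim3 = c(1, 1 + 2))

testdata$part <- testdata$event
for (ev in names(part_ends)) {
  rows <- testdata$event == ev
  ends <- part_ends[[ev]]
  el <- testdata$elapsed[rows]
  part_no <- 1 + (el > ends[1]) + (el > ends[2])  # 1, 2 or 3
  testdata$part[rows] <- paste0(ev, "_", part_no)
}

print(testdata[, c("subject", "event", "time", "part")])
```

Under these assumptions the stim2 rows match the desired `part` column from the original post, while for stim3 the row at 00:11:09 is labelled stim3_3 rather than stim3_2, so the boundary rule may need adjusting to the actual design.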

jwin · August 30, 2019, 9:32am

Thank you! It does look pretty straightforward; I just used hms(time). However, the variable is still of type "character", and I'm not sure whether that is supposed to change for a datetime variable.
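
One possible explanation, offered as an assumption because the thread does not show the exact code that was run: conversion functions in R return a new object rather than changing the column in place, so a column stays character unless the result is assigned back to it. The sketch below uses base R's `as.POSIXct()` as a stand-in for `hms()`; the data frame `df` is purely illustrative.

```r
df <- data.frame(time = c("00:10:49.863", "00:10:50.863"),
                 stringsAsFactors = FALSE)

# Calling the converter without assigning the result leaves the column unchanged.
invisible(as.POSIXct(df$time, format = "%H:%M:%OS", tz = "UTC"))
class_before <- class(df$time)
print(class_before)

# Assigning the result back replaces the character column with a date-time column.
df$time <- as.POSIXct(df$time, format = "%H:%M:%OS", tz = "UTC")
class_after <- class(df$time)
print(class_after)
```

With lubridate's `hms()` the class after assignment would be a Period rather than POSIXct, but the same rule about assigning the result applies.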

jwin · August 30, 2019, 9:40am

Awesome, thank you, this really helps!!

jwin · August 30, 2019, 9:50am

Actually, I have one more question: Is there a way to not include the date in the time variable?

valeri · August 30, 2019, 10:05am

Yes, good point. As far as I can see, handling times alone (without an associated date) is difficult in R. You can always strip the "yyyy-mm-dd" part from the time column, but the result then has to be a string variable rather than a date-time class such as POSIXct or POSIXlt, since those require a proper date. Again, I'm not 100% certain, but that's how I see it. This may help: https://stackoverflow.com/questions/22659947/r-how-to-handle-times-without-dates
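
Two base R workarounds, sketched under the assumption that only the time of day matters: store the time as numeric seconds since midnight (which still supports arithmetic), or keep a date-free character column purely for display. The dummy date 1970-01-01 and the variable names are illustrative choices, not something from this thread.

```r
time <- c("00:10:49.863", "00:10:50.863", "00:11:00.863")

# Anchor every time to the same dummy date in UTC so the date carries no information.
stamp <- as.POSIXct(paste("1970-01-01", time),
                    format = "%Y-%m-%d %H:%M:%OS", tz = "UTC")

# Option 1: numeric seconds since midnight; differences and comparisons work as usual.
secs <- as.numeric(stamp)

# Option 2: a character column without the date, for display only.
clock <- format(stamp, "%H:%M:%S")

print(data.frame(time, secs, clock))
```

Because the dummy date is the start of the Unix epoch in UTC, `as.numeric(stamp)` is directly the number of seconds since midnight.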

jwin · August 30, 2019, 10:45am

All right, good to know! Thanks for the hint and the code; this has been immensely helpful!

jwin · August 30, 2019, 11:47am

Sorry for asking so many questions, but maybe you could help me one more time: I managed to add the additional logic for the middle parts, and it works for the test dataset. However, I'm having trouble applying it to my actual dataset. I assume that's because the dataset I'm working with was not created from scratch, so the variables "time", "subject" etc. don't exist as separate objects in my environment; I just imported a txt file as a data frame in R. How would I have to modify the code if the only object in my environment were the complete data frame "testdata"?

valeri · August 30, 2019, 11:50am

If I understand you correctly, then just do (you can also assign to a different df like testdata2 if you don't want to overwrite testdata)

# add the timing columns to the existing data frame
testdata <- testdata %>%
  mutate(time = as.POSIXct(strptime(substr(time, 1, 8), "%H:%M:%S"))) %>%
  group_by(subject) %>%
  mutate(dt_secs = as.numeric(difftime(time, lag(time), units = "secs"))) %>%
  tidyr::replace_na(list(dt_secs = 0)) %>%
  group_by(subject, event) %>%
  mutate(cum_time = cumsum(dt_secs),
         is_first_for_event = cum_time == min(cum_time),
         is_last_for_event = cum_time == max(cum_time))
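
This works without standalone vectors because column names used inside mutate() are looked up in the data frame itself. Below is a minimal base R sketch of the same idea, assuming a tab-delimited file with a header row. The temporary file is a made-up stand-in for the real txt file, and `transform()` stands in for `mutate()`.

```r
# Write a tiny tab-delimited file to stand in for the imported txt file.
path <- tempfile(fileext = ".txt")
writeLines(c("subject\tevent\ttime",
             "VP01\tcalib\t00:10:49.863",
             "VP01\tstim1\t00:10:51.863"),
           path)

# Import it as a data frame; the columns exist only inside `imported`.
imported <- read.delim(path, stringsAsFactors = FALSE)

# `time` here refers to the column of `imported`, not to an object in the environment.
imported <- transform(imported, time_hms = substr(time, 1, 8))

print(imported)
```

If the real file uses a different separator or has no header row, the arguments to `read.delim()` (or `read.table()`) would need to change accordingly.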

jwin · August 30, 2019, 12:29pm

That worked perfectly, thank you so much!

technocrat · August 31, 2019, 4:56am

(Glad others could weigh in while I was offline).

Please never be sorry for asking clear questions that illuminate the answer. Those who come after may find your thread and see the path to a solution more easily.

Which is why, when the final piece of the puzzle is complete, it's very helpful to mark that as the "solution." (And nobody gets their feelings hurt who had earlier pieces of the complete answer; as far as I've seen no one asserts bragging rights!)
