Hey!
I have data from an online form that users may fill in at any time. Some submit their answers daily, while others do so more sparsely (e.g. twice a week or once a month). Here's some made-up data that loosely resembles what I have:
library(tidyverse)
# generate data frame
id <- c(1,1,1,1,2,2,3,4,5,5,5,5,1,1,1,1) # User ID
date <- c("2021-12-26", "2021-12-19", "2021-12-15", "2021-12-07", "2021-11-11", "2021-11-05", "2021-09-17","2021-09-17", "2021-10-08", "2021-10-06", "2021-10-01", "2021-09-30", "2022-01-30", "2022-01-24", "2022-01-18", "2022-01-13") # Date the form was submitted
variable1 <- c(10, NA, NA, NA, 8, NA, 7, 6, 9, NA, NA, NA, 6, 8, NA, NA)
variable2 <- c(5,2,3,4,6,7,8,9,1,4,3,2,5,6,5,4)
sample_data <- data.frame(id, date, variable1, variable2)
sample_data <- sample_data %>%
  mutate(date = as.Date(date, format = "%Y-%m-%d"))
#   id       date variable1 variable2
#1   1 2021-12-26        10         5
#2   1 2021-12-19        NA         2
#3   1 2021-12-15        NA         3
#4   1 2021-12-07        NA         4
#5   2 2021-11-11         8         6
#6   2 2021-11-05        NA         7
#7   3 2021-09-17         7         8
#8   4 2021-09-17         6         9
#9   5 2021-10-08         9         1
#10  5 2021-10-06        NA         4
#11  5 2021-10-01        NA         3
#12  5 2021-09-30        NA         2
#13  1 2022-01-30         6         5
#14  1 2022-01-24         8         6
#15  1 2022-01-18        NA         5
#16  1 2022-01-13        NA         4
Each row in the data frame is a separate entry (i.e., a separate form submission). Each user is identified by a unique user ID (the id column). As you can see, there are multiple rows with the same ID, representing separate form submissions by the same user. The date the form was submitted is also available (the date column).
Then, there are two separate variables of interest (variable1 and variable2).
variable1 is a numeric variable, corresponding to the answer the user submits to the question "How have you been feeling during the past 4 weeks?". Replying to this question is optional when submitting the form and there are, therefore, some missing values. variable2 is also a numeric variable, corresponding to "How are you feeling today?". This is a required field in the form, hence no missing values.
So, I have one variable looking at a 4-week period (variable1) and another looking at a 1-day period (variable2). My question is: is variable1 accurately measuring how the users felt in the past 4 weeks?
In order to find out, I need to compare each user's data from variable1 with their data from variable2 for the 4-week period prior to them submitting data for variable1. For example, for user #1 in the sample data, there is an entry for variable1 on 2021-12-26 (on that day, they said they've been feeling a "10" over the past 4 weeks). Luckily, I also have one entry of daily data (variable2) for one day in each of the 4 weeks before the 26th of December ("5" on the 26th, "2" on the 19th, "3" on the 15th and "4" on the 7th of December).
Basically, I think I should filter the data frame for users with at least one valid variable1 entry and take the date of that entry. This could go in a date2 column, like so:
sample_data <- sample_data %>%
  # date2 = submission date when variable1 was answered, NA otherwise
  mutate(date2 = if_else(!is.na(variable1), date, as.Date(NA)))
Then, I would get all the entries for the 28 days (7 days × 4 weeks) before this date. Finally, because I'm only interested in data from users who have at least one entry each week during those 28 days, I need to keep only those users who have at least one entry in each of these ranges: date2 to date2 - 7 days, date2 - 7 to date2 - 14, date2 - 14 to date2 - 21, and date2 - 21 to date2 - 28.
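To make the steps above concrete, here is a minimal sketch of one way they could be written. It is not meant as a definitive solution, and it rests on a few assumptions that are not in the description above: the "weeks" are treated as non-overlapping 7-day bins counted back from date2 (days 0-6, 7-13, 14-20, 21-27), the day of the variable1 entry itself counts as day 0, and the names anchors, entries, windows, coverage and comparison are made up for illustration. The sample data is rebuilt so the block runs on its own.

```r
library(dplyr)

# Same made-up data as above, rebuilt so this sketch is self-contained
sample_data <- data.frame(
  id = c(1, 1, 1, 1, 2, 2, 3, 4, 5, 5, 5, 5, 1, 1, 1, 1),
  date = as.Date(c("2021-12-26", "2021-12-19", "2021-12-15", "2021-12-07",
                   "2021-11-11", "2021-11-05", "2021-09-17", "2021-09-17",
                   "2021-10-08", "2021-10-06", "2021-10-01", "2021-09-30",
                   "2022-01-30", "2022-01-24", "2022-01-18", "2022-01-13")),
  variable1 = c(10, NA, NA, NA, 8, NA, 7, 6, 9, NA, NA, NA, 6, 8, NA, NA),
  variable2 = c(5, 2, 3, 4, 6, 7, 8, 9, 1, 4, 3, 2, 5, 6, 5, 4)
)

# "Anchor" rows: submissions where variable1 was answered (date2 is that date)
anchors <- sample_data %>%
  filter(!is.na(variable1)) %>%
  select(id, date2 = date, variable1)

# All daily entries, one row per submission
entries <- sample_data %>%
  select(id, date, variable2)

# Pair every anchor with all entries of the same user (base merge on id),
# then keep entries from the 28 days up to and including date2
windows <- merge(anchors, entries, by = "id") %>%
  mutate(days_before = as.integer(difftime(date2, date, units = "days"))) %>%
  filter(days_before >= 0, days_before < 28) %>%
  # ASSUMPTION: week 1 = days 0-6 before date2, ..., week 4 = days 21-27
  mutate(week = days_before %/% 7 + 1)

# Per anchor: how many of the 4 weekly bins contain at least one entry,
# plus the mean of variable2 over the window
coverage <- windows %>%
  group_by(id, date2, variable1) %>%
  summarise(
    weeks_covered = n_distinct(week),
    mean_variable2 = mean(variable2),
    .groups = "drop"
  )

# Keep only anchors with at least one entry in every weekly bin
comparison <- coverage %>%
  filter(weeks_covered == 4)
```

One thing this sketch brings out is that the definition of a "week" matters. With these strict 7-day bins, user #1's entries before 2021-12-26 fall 0, 7, 11 and 19 days back, so the 21-27 day bin is empty and that anchor covers only 3 of the 4 bins. On the sample data, comparison therefore comes back with zero rows, while coverage still shows how close each anchor gets. Using calendar weeks instead would change which anchors qualify.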
The problem is, I've been using R for a few months "only", and I have no idea how to approach this problem code-wise.
Does anyone know of the best way to do this?
Thanks in advance.