Hello all,
I hope someone can provide me with a great solution to my problem:
Say I have a data frame with columns: Activity and Date
I want to create a new column called Coding that's value is "G" if the row has Activity == "Game" or make the value "G1" if the row has a Date of 1 less than the date for a row with Activity == "Game". "G2" if it is 2 days before....and so on
So, basically I am looking for a way to conditionally add a column based on the two columns: Activity and Date. I feel like date labeling is popular and maybe there is something out there already that knows how to label something based on days away from that target date.
The basic structure would look like:
activity <- c("Game", "Practice", "Game", "Practice")
date <- c("02/19/2019", "02/21/2019", "02/22/2019", "02/23/2019")
df <- data.frame(activity,date)
I would need to make df$coding == "G" or "G1" or "G2" or "G3"....etc. based on the above conditions
Using mutate() and case_when() functions from the dplyr package is my preferred way to make conditional statements and the glue package to make labels. Lubridate is my goto library for handling dates. I have used the lubridate::today() function to make it dynamic but if you want to save a specific date then you just replace that part with the name of the date object created.
Thanks for the reply.This is the output. I need it to be "G1" if the date is 1 day before a game date and "G2" if it is 2 days before the game date...etc. Any idea on this? I think the lubridate::today() is not necessary.
you haven't really explained clearly the relationship between rows,dates, and activity labels. Seems like you want to ignore practice. Easy enough.
every Game activity needs a code next to it starting with G.
Fine too.
The code should be extend to numerically encode what ?
the number of days between that record (and the current date as in the provided code you were given) .
the number of days between that record and the previous Game ?
the number of days between that record and the first Game in the dataset ?
there are probably more possible rules, would be good to understand your intentions!
All rows with Game should be labeled G...easy enough.
Rows with Practice need to be labeled G1, G2, G3, G4, G5, G6, or G7 depending on how many days it is before the Game. So, I see a benefit to ordering these rows by date as a first step. This was obviously a small example, my data is a bit larger with a lot more practices. For instance, usually 6/week with 1 game a week. And I am trying to find a way to label these practices each week depending on how many days is it is before a game. I hope this helps.
library(tidyverse)
library(lubridate)
set.seed(42)
activity_source <- sample.int(10,50,replace=TRUE)
activity <- ifelse(activity_source<2,"Game","Practice")
d1 <- cumsum(sample.int(5,50,replace=TRUE))
d2 <- d1 - max(d1)
date <- Sys.Date() + days(d2)
df <- data.frame(activity,date)
df <- arrange(df,desc(date))
df$nextdate <- lag(df$date)
df$days_between <- df$nextdate -df$date
df$gdays <- NA
for( i in 1:nrow(df)){
if (df$activity[[i]] == "Game")
df$gdays[[i]] <- 0
else {
if (i>1)
df$gdays[[i]] = df$gdays[[i-1]] + df$days_between[[i]]
else df$gdays[[i]] = NA
}
}
df$gcode <- ifelse(df$gdays>0,paste0("G",df$gdays),"G")
df <- arrange(df,date)
> df
activity date nextdate days_between gdays gcode
1 Game 2019-09-14 2019-09-18 4 days 0 G
2 Practice 2019-09-18 2019-09-21 3 days 3 G3
3 Game 2019-09-21 2019-09-26 5 days 0 G
4 Practice 2019-09-26 2019-09-28 2 days 12 G12
5 Practice 2019-09-28 2019-09-30 2 days 10 G10
6 Practice 2019-09-30 2019-10-02 2 days 8 G8
7 Practice 2019-10-02 2019-10-07 5 days 6 G6
8 Practice 2019-10-07 2019-10-08 1 days 1 G1
9 Game 2019-10-08 2019-10-09 1 days 0 G
10 Practice 2019-10-09 2019-10-13 4 days 62 G62
11 Practice 2019-10-13 2019-10-18 5 days 58 G58
12 Practice 2019-10-18 2019-10-20 2 days 53 G53
13 Practice 2019-10-20 2019-10-21 1 days 51 G51
14 Practice 2019-10-21 2019-10-26 5 days 50 G50
15 Practice 2019-10-26 2019-10-30 4 days 45 G45
16 Practice 2019-10-30 2019-11-03 4 days 41 G41
17 Practice 2019-11-03 2019-11-04 1 days 37 G37
18 Practice 2019-11-04 2019-11-07 3 days 36 G36
19 Practice 2019-11-07 2019-11-10 3 days 33 G33
20 Practice 2019-11-10 2019-11-15 5 days 30 G30
21 Practice 2019-11-15 2019-11-20 5 days 25 G25
22 Practice 2019-11-20 2019-11-24 4 days 20 G20
23 Practice 2019-11-24 2019-11-29 5 days 16 G16
24 Practice 2019-11-29 2019-12-03 4 days 11 G11
25 Practice 2019-12-03 2019-12-05 2 days 7 G7
26 Practice 2019-12-05 2019-12-07 2 days 5 G5
27 Practice 2019-12-07 2019-12-08 1 days 3 G3
28 Practice 2019-12-08 2019-12-10 2 days 2 G2
29 Game 2019-12-10 2019-12-15 5 days 0 G
30 Practice 2019-12-15 2019-12-20 5 days 55 G55
31 Practice 2019-12-20 2019-12-24 4 days 50 G50
32 Practice 2019-12-24 2019-12-25 1 days 46 G46
33 Practice 2019-12-25 2019-12-29 4 days 45 G45
34 Practice 2019-12-29 2019-12-31 2 days 41 G41
35 Practice 2019-12-31 2020-01-02 2 days 39 G39
36 Practice 2020-01-02 2020-01-05 3 days 37 G37
37 Practice 2020-01-05 2020-01-10 5 days 34 G34
38 Practice 2020-01-10 2020-01-15 5 days 29 G29
39 Practice 2020-01-15 2020-01-18 3 days 24 G24
40 Practice 2020-01-18 2020-01-22 4 days 21 G21
41 Practice 2020-01-22 2020-01-27 5 days 17 G17
42 Practice 2020-01-27 2020-01-28 1 days 12 G12
43 Practice 2020-01-28 2020-02-01 4 days 11 G11
44 Practice 2020-02-01 2020-02-02 1 days 7 G7
45 Practice 2020-02-02 2020-02-03 1 days 6 G6
46 Practice 2020-02-03 2020-02-08 5 days 5 G5
47 Game 2020-02-08 2020-02-11 3 days 0 G
48 Practice 2020-02-11 2020-02-12 1 days NA <NA>
49 Practice 2020-02-12 2020-02-14 2 days NA <NA>
50 Practice 2020-02-14 <NA> NA days NA <NA>
WOW! Thank you so much for this. I am still trying to understand how this works! I have not heard of the lag function before either! @nirgrahamuk could you help me understand this? I replicated this and if the first row is practice, i is less than 1 and thus the gdays should be NA, but it calculates correctly somehow.
for( i in 1:nrow(df)){
if (df$activity[[i]] == "Game")
df$gdays[[i]] <- 0
else {
if (i>1)
df$gdays[[i]] <- df$gdays[[i-1]] + df$days_between[[i]] # Confused how this row calculates properly
else df$gdays[[i]] = NA
}
}
For me the key to think about is that we are counting from practice to future games. Which is easier if you reverse time and count from games to past practices. So that's the general concept of my approach
But for the last screenshot I sent. The 1st iteration of the for loop i will = 1.... since it is for i in 1:50 for example. So activity was not Game and 1 is not >1 so then shouldn't the value be NA and not the 13 gdays? Can you help me understand how the 13 was received in that first row of my results?
Your print out is of the very last step right. This is the arrange function. When the i iteration was happening the frame was in another order.
Drop in some extra print statements along the way