How to create a new column in a data frame depending on other columns' values

marcrocalleva · May 9, 2022, 10:44pm

I have a dataset recording the years in which different subjects received a certain treatment. I need a column that gives each subject's first year of treatment, or 0 if the subject has never been treated.

Let's say I have this dataset:

subject <- c(A, A, A, A, A, B, B, B, B, B, C, C, C, C, C)
year <- c(2000, 2001, 2002, 2003, 2004, 2000, 2001, 2002, 2003, 2004, 2000, 2001, 2002, 2003, 2004)
treat <- c(0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0)
df1 <- data.frame(subject, year, treat)

I want to obtain this:

subject <- c(A, A, A, A, A, B, B, B, B, B, C, C, C, C, C)
year <- c(2000, 2001, 2002, 2003, 2004, 2000, 2001, 2002, 2003, 2004, 2000, 2001, 2002, 2003, 2004)
treat <- c(0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0)
first_treat <- c(2003, 2003, 2003, 2003, 2003, 2001, 2001, 2001, 2001, 2001, 0, 0, 0, 0, 0)
df1 <- data.frame(subject, year, treat, first_treat)

In my original dataset I have multiple subjects, so I would like code that does this without having to refer to specific rows or column values.

Thanks!

zoowalk · May 10, 2022, 6:49am

Does this help?

library(tidyverse)
subject <- c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C")
year <- c(2000, 2001, 2002, 2003, 2004, 2000, 2001, 2002, 2003, 2004, 2000, 2001, 2002, 2003, 2004)
treat <- c(0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0)
df1 <- data.frame(subject, year, treat)

first_treat <- c(2003, 2003, 2003, 2003, 2003, 2001, 2001, 2001, 2001, 2001, 0, 0, 0, 0, 0)
df2 <- data.frame(subject, year, treat, first_treat)

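# OPTION 1: flag the year in which treat switches from 0 to 1, then spread it over the group
df1 %>% 
  group_by(subject) %>% 
  arrange(year, .by_group = TRUE) %>% 
  mutate(first_treat=case_when(
    # default = 0 so that a subject treated from the first year is also caught
    treat==1 & lag(treat, default = 0)==0 ~ year,
    TRUE ~ 0
  )) %>% 
  mutate(first_treat=max(first_treat))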
#> # A tibble: 15 × 4
#> # Groups:   subject [3]
#>    subject  year treat first_treat
#>    <chr>   <dbl> <dbl>       <dbl>
#>  1 A        2000     0        2003
#>  2 A        2001     0        2003
#>  3 A        2002     0        2003
#>  4 A        2003     1        2003
#>  5 A        2004     1        2003
#>  6 B        2000     0        2001
#>  7 B        2001     1        2001
#>  8 B        2002     1        2001
#>  9 B        2003     1        2001
#> 10 B        2004     1        2001
#> 11 C        2000     0           0
#> 12 C        2001     0           0
#> 13 C        2002     0           0
#> 14 C        2003     0           0
#> 15 C        2004     0           0

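# OPTION 2: take the earliest treated year per subject; min() returns Inf for never-treated subjects
df1 %>% 
  group_by(subject) %>% 
  arrange(year, .by_group = TRUE) %>% 
  mutate(first_treat=min(year[treat==1])) %>% 
  # replace the Inf of never-treated subjects with 0
  mutate(first_treat=case_when(first_treat==Inf ~ 0,
                               TRUE ~ first_treat))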
#> Warning in min(year[treat == 1]): no non-missing arguments to min; returning Inf
#> # A tibble: 15 × 4
#> # Groups:   subject [3]
#>    subject  year treat first_treat
#>    <chr>   <dbl> <dbl>       <dbl>
#>  1 A        2000     0        2003
#>  2 A        2001     0        2003
#>  3 A        2002     0        2003
#>  4 A        2003     1        2003
#>  5 A        2004     1        2003
#>  6 B        2000     0        2001
#>  7 B        2001     1        2001
#>  8 B        2002     1        2001
#>  9 B        2003     1        2001
#> 10 B        2004     1        2001
#> 11 C        2000     0           0
#> 12 C        2001     0           0
#> 13 C        2002     0           0
#> 14 C        2003     0           0
#> 15 C        2004     0           0

Created on 2022-05-10 by the reprex package (v2.0.1)
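
A variant of Option 2 that avoids the `min()` warning for never-treated subjects could look like the sketch below. It checks `any(treat == 1)` within each subject before taking the minimum. The `if`/`else` inside `mutate()` and the name `result` are illustrative choices, not something taken from the posts above, and the sample data is rebuilt with `rep()` only to keep the block self-contained.

```r
library(dplyr)

# Same sample data as above, built with rep() for brevity
df1 <- data.frame(
  subject = rep(c("A", "B", "C"), each = 5),
  year    = rep(c(2000, 2001, 2002, 2003, 2004), times = 3),
  treat   = c(0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0)
)

result <- df1 %>%
  dplyr::group_by(subject) %>%
  dplyr::mutate(
    # Evaluated once per subject: earliest treated year, or 0 if never treated
    first_treat = if (any(treat == 1)) min(year[treat == 1]) else 0
  ) %>%
  dplyr::ungroup()
```

Because the minimum is taken over treated years only, the result does not depend on row order, so no `arrange()` step is needed here.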

marcrocalleva · May 10, 2022, 9:03am

Hi! In both cases I get this error: "Error in order(year, .by_group = T) : argument lengths differ". How can I solve it?

zoowalk · May 10, 2022, 9:18am

Do you get the error when running the code on the sample data, or when running it on your own, more comprehensive dataset? I haven't seen this error message before.

marcrocalleva · May 10, 2022, 9:28am

I get it in both cases, with my dataset and with the sample I provided here. Maybe there is an alternative way to do it? Is there an additional package I should install?

zoowalk · May 10, 2022, 9:31am

OK, I think I know why it occurs. If you look at my example, you'll see that I wrapped the subject letters in quotation marks ("A", "A", etc.) so that R treats them as character strings rather than object names. Otherwise, it won't work.
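
To illustrate the quoting point, here is a small sketch, assuming no object named `A` exists in the session. The names `unquoted` and `quoted` are only for illustration.

```r
# Unquoted: R looks up an object called A and signals an error if none exists.
# tryCatch() captures the error message instead of stopping the script.
unquoted <- tryCatch(c(A, A), error = function(e) conditionMessage(e))

# Quoted: "A" is a character string, so this builds a character vector.
quoted <- c("A", "A")
```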

marcrocalleva · May 10, 2022, 9:54am

I still get the same error even if I set the subject variable as ("A", "A", ...).

zoowalk · May 10, 2022, 9:55am

If you just copy my code, paste it into your RStudio, and run it - you get the error?

marcrocalleva · May 10, 2022, 10:14am

Yes, I have copied and run it in RStudio and I still get the same error.
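
One possible cause, not confirmed in this thread: the error names the call `order(year, .by_group = T)`, and `plyr::arrange()` passes its arguments on to `order()`. If plyr were attached after dplyr, its `arrange()` would mask dplyr's and could produce exactly this message. Running `find("arrange")` in the affected session shows which attached packages define `arrange()`. As a sketch of an alternative that needs no extra packages and is not affected by masking, the same column can be built in base R. The helper names `treated` and `first` are illustrative.

```r
# Same sample data as in the question, with subject as character strings
df1 <- data.frame(
  subject = rep(c("A", "B", "C"), each = 5),
  year    = rep(c(2000, 2001, 2002, 2003, 2004), times = 3),
  treat   = c(0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0)
)

# Diagnostic: lists the attached packages that define arrange()
find("arrange")

# Earliest treated year per subject, computed on treated rows only
treated <- df1[df1$treat == 1, ]
first   <- tapply(treated$year, as.character(treated$subject), min)

# Look up each row's subject by name; never-treated subjects come back as NA
df1$first_treat <- unname(first[as.character(df1$subject)])
df1$first_treat[is.na(df1$first_treat)] <- 0
```

If the diagnostic lists plyr ahead of dplyr, prefixing the verbs in the earlier code (`dplyr::arrange()`, `dplyr::mutate()`) would be another way to rule out masking.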
