assign numbers/id based on a grouping variable

Dallak · May 5, 2022, 9:35pm

Dear all,

I have a dataset of this type:

subject     start_time     word     label
s1             1.00         go        g
s1             1.00         go        g        
s1             1.20         eat       t
s1             1.20         eat       t
s2             1.00         go        g
s2             1.00         go        g        
s2             1.20         eat       t
s2             1.20         eat       t
s3             1.00         go        g
s3             1.00         go        g        
s3             1.20         eat       t
s3             1.20         eat       t

and I would like to create an id column based subject and start_time. The output I am looking for is similar to this:

subject     start_time     word         id
s1             1.00         go           1
s1             1.00         go           1
s1             1.20         eat          2
s1             1.20         eat          2
s2             1.00         go           1
s2             1.00         go           1
s2             1.20         eat          2
s2             1.20         eat          2
s3             1.00         go           1
s3             1.00         go           1
s3             1.20         eat          2
s3             1.20         eat          2

Note that for each duplicate start_time within subject I want to assign one fixed number/value. I tried to use

dat %>% 
  group_by(subject, start_time) %>% 
  mutate(
    id= row_number())

But this did not work.

I also tried using this:

dat %>%
    group_by(subject, start_time) %>%
    mutate(id= cur_group_id())

it worked but the number continues progressively for each subject. for s1 it gives 1 1 2 2, for s2 it gives 3 3 4 4, for s3 it gives 5 5 6 6, and so on, which is not the thing I am looking for.

I would appreciate your help.
Thanks in advance!

jrmuirhead · May 6, 2022, 12:01am

Hi @Dallak ,
I would approach it by grouping by start _time regardless of which subject it is, and assigning the group id.

library("tidyverse")

d <- tibble::tribble(
  ~subject, ~start_time, ~word, ~label,
   "s1",            1.00,      "go"   ,     "g",
   "s1",            1.00,         "go",      "g",
   "s1",             1.20,         "eat",       "t",
   "s1",             1.20,         "eat" ,      "t",
   "s2",             1.00,         "go"   ,     "g",
   "s2",             1.00,         "go"    ,    "g",
   "s2",            1.20,         "eat"     ,  "t",
   "s2",            1.20,         "eat"   ,    "t",
   "s3",            1.00,         "go"    ,    "g",
   "s3",            1.00,         "go"    ,    "g",
   "s3",             1.20,         "eat"  ,     "t",
   "s3",             1.20,         "eat"  ,     "t"
  )

d <- d %>% 
  group_by(start_time) %>%
  mutate(id = cur_group_id())

d
#> # A tibble: 12 × 5
#> # Groups:   start_time [2]
#>    subject start_time word  label    id
#>    <chr>        <dbl> <chr> <chr> <int>
#>  1 s1             1   go    g         1
#>  2 s1             1   go    g         1
#>  3 s1             1.2 eat   t         2
#>  4 s1             1.2 eat   t         2
#>  5 s2             1   go    g         1
#>  6 s2             1   go    g         1
#>  7 s2             1.2 eat   t         2
#>  8 s2             1.2 eat   t         2
#>  9 s3             1   go    g         1
#> 10 s3             1   go    g         1
#> 11 s3             1.2 eat   t         2
#> 12 s3             1.2 eat   t         2

^{Created on 2022-05-05 by the reprex package (v2.0.1)}

Dallak · May 6, 2022, 12:44am

Thanks, @jrmuirhead, for this.
This works on this simple case. But in my actual data the numbering continues. For example, subject1 has about 8 different values for start_time the number works fine from 1 o 8, but with the next subject, the numbering continues from 9 to n. It does not restart from 1 to n. Note that each subject has their own start_time which does not always correspond to other subject.

I want it to restart from 1 to n for each start_time value within subject and word. Sorry if the example I provided earlier did not represent the data correctly.

Here is a more representative data building on yours:

d <- tibble::tribble(
    ~subject, ~start_time, ~word, ~label,
    "s1",            1.00,      "go"   ,     "g",
    "s1",            1.00,         "go",      "g",
    "s1",             1.20,         "eat",       "t",
    "s1",             1.20,         "eat" ,      "t",
    "s2",             2.00,         "go"   ,     "g",
    "s2",             2.00,         "go"    ,    "g",
    "s2",            2.20,         "eat"     ,  "t",
    "s2",            2.20,         "eat"   ,    "t",
    "s3",            3.00,         "go"    ,    "g",
    "s3",            3.00,         "go"    ,    "g",
    "s3",             3.20,         "eat"  ,     "t",
    "s3",             3.20,         "eat"  ,     "t"
)

when I follow the approach you kindly suggested, it gives the following output:

d <- d %>% 
    rowid_to_column(., var = "rowid") %>%
    group_by(start_time) %>%
    mutate(id = cur_group_id()) %>%
    ungroup() %>%
    arrange(rowid)

d

# A tibble: 12 x 6
   rowid subject start_time word  label    id
   <int> <chr>        <dbl> <chr> <chr> <int>
 1     1 s1             1   go    g         1
 2     2 s1             1   go    g         1
 3     3 s1             1.2 eat   t         2
 4     4 s1             1.2 eat   t         2
 5     5 s2             2   go    g         3
 6     6 s2             2   go    g         3
 7     7 s2             2.2 eat   t         4
 8     8 s2             2.2 eat   t         4
 9     9 s3             3   go    g         5
10    10 s3             3   go    g         5
11    11 s3             3.2 eat   t         6
12    12 s3             3.2 eat   t         6

Note how the numbering continues. Here is the output that I want to get.

# A tibble: 12 x 6
   rowid subject start_time word  label    id
   <int> <chr>        <dbl> <chr> <chr> <int>
 1     1 s1             1   go    g         1
 2     2 s1             1   go    g         1
 3     3 s1             1.2 eat   t         2
 4     4 s1             1.2 eat   t         2
 5     5 s2             2   go    g         1
 6     6 s2             2   go    g         1
 7     7 s2             2.2 eat   t         2
 8     8 s2             2.2 eat   t         2
 9     9 s3             3   go    g         1
10    10 s3             3   go    g         1
11    11 s3             3.2 eat   t         2
12    12 s3             3.2 eat   t         2

I apologize again for not making this clear earlier.

jrmuirhead · May 6, 2022, 1:12am

Hi @Dallak ,
Thanks for the more representative sample. In order to address the different start times, I based the id on the dense rank of the start_time within each subject.

library("tidyverse")

d <- tibble::tribble(
    ~subject, ~start_time, ~word, ~label,
    "s1",            1.00,      "go"   ,     "g",
    "s1",            1.00,         "go",      "g",
    "s1",             1.20,         "eat",       "t",
    "s1",             1.20,         "eat" ,      "t",
    "s2",             2.00,         "go"   ,     "g",
    "s2",             2.00,         "go"    ,    "g",
    "s2",            2.20,         "eat"     ,  "t",
    "s2",            2.20,         "eat"   ,    "t",
    "s3",            3.00,         "go"    ,    "g",
    "s3",            3.00,         "go"    ,    "g",
    "s3",             3.20,         "eat"  ,     "t",
    "s3",             3.20,         "eat"  ,     "t"
)

d <- d %>% 
  group_by(subject) %>%
  mutate(id = dense_rank(start_time))

d
#> # A tibble: 12 × 5
#> # Groups:   subject [3]
#>    subject start_time word  label    id
#>    <chr>        <dbl> <chr> <chr> <int>
#>  1 s1             1   go    g         1
#>  2 s1             1   go    g         1
#>  3 s1             1.2 eat   t         2
#>  4 s1             1.2 eat   t         2
#>  5 s2             2   go    g         1
#>  6 s2             2   go    g         1
#>  7 s2             2.2 eat   t         2
#>  8 s2             2.2 eat   t         2
#>  9 s3             3   go    g         1
#> 10 s3             3   go    g         1
#> 11 s3             3.2 eat   t         2
#> 12 s3             3.2 eat   t         2

^{Created on 2022-05-05 by the reprex package (v2.0.1)}

Dallak · May 6, 2022, 1:19am

Awesome!
That is a very clever approach and thinking, @jrmuirhead!
Thank you so much!

system · May 13, 2022, 1:19am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.