Transforming data from wide to long format with multiple grouping variables

Hi, I'm working on data that is in a wide format and needs to be converted into a long format for analysis. I have used pivot_longer() for the same tasks previously which works well. In my present analysis, I need to convert the data into a long format where there is more than 1 variable that needs to be converted.

Example data

mydata <- 
  data.frame(
    id = c(1, 2, 3),
    week1 = c(1, 4, 2), # days
    week1_lab1 = c(120, 140, 257),
    week1_lab2 = c(10, 24, 14),
    week1_lab3 = c(50, 34, 64),
    week2 = c(8, 9, 13), # days
    week2_lab1 = c(122, 124, 143),
    week2_lab2 = c(18, 13, 14),
    week2_lab3 = c(74, 75, 84),
    week3 = c(16, 19, 20), # days
    week3_lab1 = c(198, 125, 231),
    week3_lab2 = c(19, 13, 15),
    week3_lab3 = c(47, 84, 37)
  )

> mydata

  id week1 week1_lab1 week1_lab2 week1_lab3 week2 week2_lab1 week2_lab2
1  1     1        120         10         50     8        122         18
2  2     4        140         24         34     9        124         13
3  3     2        257         14         64    13        143         14
  week2_lab3 week3 week3_lab1 week3_lab2 week3_lab3
1         74    16        198         19         47
2         75    19        125         13         84
3         84    20        231         15         37

The expected output

  id days  week lab1 lab2 lab3
1  1    1 week1  120   10   50
2  1    8 week2  122   18   74
3  1   16 week3  198   19   47
4  2    4 week1  140   24   34
5  2    9 week2  124   13   75
6  2   19 week3  125   13   84
7  3    2 week1  257   14   64
8  3   13 week2  143   14   84
9  3   20 week3  231   15   37

The real data does not have a structured naming convention for the columns as given in the example data.

I tried using the pivot_longer approach which works well for single variable, but could not find a solution to the above output which involves multiple variables.

mydata_lab_long <-
  pivot_longer(
    mydata,
    cols = c("week1_lab1","week2_lab1","week3_lab1"),
    names_to = "lab1_test",
    values_to = "lab1_values"
  )
> mydata_lab_long
# A tibble: 9 × 12
     id week1 week1_lab2 week1_lab3 week2 week2_lab2 week2_lab3 week3 week3_lab2 week3_lab3 lab1_test  lab1_values
  <dbl> <dbl>      <dbl>      <dbl> <dbl>      <dbl>      <dbl> <dbl>      <dbl>      <dbl> <chr>            <dbl>
1     1     1         10         50     8         18         74    16         19         47 week1_lab1         120
2     1     1         10         50     8         18         74    16         19         47 week2_lab1         122
3     1     1         10         50     8         18         74    16         19         47 week3_lab1         198
4     2     4         24         34     9         13         75    19         13         84 week1_lab1         140
5     2     4         24         34     9         13         75    19         13         84 week2_lab1         124
6     2     4         24         34     9         13         75    19         13         84 week3_lab1         125
7     3     2         14         64    13         14         84    20         15         37 week1_lab1         257
8     3     2         14         64    13         14         84    20         15         37 week2_lab1         143
9     3     2         14         64    13         14         84    20         15         37 week3_lab1         231

Any help is much appreciated

Please give a realistic example of column names. Are the column names totally random with no patterns at all? Does at least the position of the columns form a pattern?

Most of the column names follow a similar pattern, but cannot be sure of it.

For the day entries, it follows weekNumber_day and for the lab entries it follows weekNumber_labResultName

Example:

2wk_day represents days in the second week
8wk_day represents days in the eighth week

2wk_lab1 represents the lab1 value on week 2
4wk_lab2 represents the lab2 value on week 4

The position of the columns follows the same pattern, week followed by all the lab values in the week

Example:

week1, week1_lab1, week1_lab2, week2, week2_lab1,...

Does this help you?

mydata <- 
  data.frame(
    id = c(1, 2, 3),
    `1wk_day` = c(1, 4, 2), # days
    `1wk_lab1` = c(120, 140, 257),
    `1wk_lab2` = c(10, 24, 14),
    `1wk_lab11` = c(50, 34, 64),
    `2wk_day` = c(8, 9, 13), # days
    `2wk_lab1` = c(122, 124, 143),
    `2wk_lab2` = c(18, 13, 14),
    `2wk_lab11` = c(74, 75, 84),
    `10wk_day` = c(16, 19, 20), # days
    `10wk_lab1` = c(198, 125, 231),
    `10wk_lab2` = c(19, 13, 15),
    `10wk_lab11` = c(47, 84, 37),check.names = FALSE
  )
mydata  
#>   id 1wk_day 1wk_lab1 1wk_lab2 1wk_lab11 2wk_day 2wk_lab1 2wk_lab2 2wk_lab11
#> 1  1       1      120       10        50       8      122       18        74
#> 2  2       4      140       24        34       9      124       13        75
#> 3  3       2      257       14        64      13      143       14        84
#>   10wk_day 10wk_lab1 10wk_lab2 10wk_lab11
#> 1       16       198        19         47
#> 2       19       125        13         84
#> 3       20       231        15         37
library(tidyr)
library(dplyr)
Long1 <- mydata |> pivot_longer(cols = -id, names_pattern = "(.+)_(.+)", 
                                names_to = c("Week", "Second"))
Days <- Long1 |> filter(Second == "day") |> 
  rename(Day = value) |> select(-Second)
Labs <- Long1 |> filter(Second != "day") |> 
  pivot_wider(names_from = "Second", values_from = "value")
Final <- inner_join(Days, Labs, by = c("id","Week"))
Final
#> # A tibble: 9 × 6
#>      id Week    Day  lab1  lab2 lab11
#>   <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1     1 1wk       1   120    10    50
#> 2     1 2wk       8   122    18    74
#> 3     1 10wk     16   198    19    47
#> 4     2 1wk       4   140    24    34
#> 5     2 2wk       9   124    13    75
#> 6     2 10wk     19   125    13    84
#> 7     3 1wk       2   257    14    64
#> 8     3 2wk      13   143    14    84
#> 9     3 10wk     20   231    15    37

Created on 2023-06-13 with reprex v2.0.2

1 Like

This approach works fine.

But i have a question.
The dataset I'm working on has 50+ columns and all the column names follow something _something pattern. Should I exclude all the other columns in a similar approach cols = -id or is there a better way to do that?

I'm not sure I understand your question. Use the cols argument to designate the columns you want to pivot. I used -id to exclude the id column and pivot all the other columns. I could have written 1wk_day:10wk_lab11 to include all the columns between the two listed ones. You can use whatever is most convenient for your data.

Thank you, I'm able to transform only the required columns

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.