How can I condense all the information in a single row? I know I have to use pivot_longer but I don't know how to solve it.

juandmaz · January 13, 2024, 5:32am

Hi, I'm stuck with this problem and I don't know how to solve it.
I have this base

id	math	X	history	Y	geo	Z
1	1/12/2023	A	9/10/2023	A	14/9/2023	A
1	15/9/2023	B	4/7/2023	D	19/2/2023	D
2	18/12/2023	A	6/7/2023	C	10/4/2023	A
2	30/8/2023	C	5/6/2023	B	24/12/2023	C
3	5/6/2023	D	19/3/2023	A	27/5/2023	D
3	11/11/2023	B	14/6/2023	A	27/9/2023	B

and I need each id to have only one observation like this

id	math_1	math_2	math_x_1	math_x_2	history_1	history_2	history_y_1	history_y_2	geo_1	geo_2	geo_z_1	geo_z_2
1	1/12/2023	15/9/2023	A	B	9/10/2023	4/7/2023	A	D	14/9/2023	19/2/2023	A	D
2	18/12/2023	30/8/2023	A	C	6/7/2023	5/6/2023	C	B	10/4/2023	24/12/2023	A	C
3	5/6/2023	11/11/2023	D	B	19/3/2023	14/6/2023	A	A	27/5/2023	27/9/2023	D	B

How can I condense all the information in a single row? I know I have to use pivot_longer but I don't know how to solve it.

FJCC · January 13, 2024, 4:54pm

I would do it like this:

library(dplyr)
library(tidyr)

DF <- read.csv("~/R/Play/Dummy.csv")
DF
#>   id       math X   history Y        geo Z
#> 1  1  1/12/2023 A 9/10/2023 A  14/9/2023 A
#> 2  1  15/9/2023 B  4/7/2023 D  19/2/2023 D
#> 3  2 18/12/2023 A  6/7/2023 C  10/4/2023 A
#> 4  2  30/8/2023 C  5/6/2023 B 24/12/2023 C
#> 5  3   5/6/2023 D 19/3/2023 A  27/5/2023 D
#> 6  3 11/11/2023 B 14/6/2023 A  27/9/2023 B
colnames(DF)[seq(3,7,2)] <- paste(colnames(DF)[seq(2,6,2)], 
                                  colnames(DF)[seq(3,7,2)], sep = "_")
DF
#>   id       math math_X   history history_Y        geo geo_Z
#> 1  1  1/12/2023      A 9/10/2023         A  14/9/2023     A
#> 2  1  15/9/2023      B  4/7/2023         D  19/2/2023     D
#> 3  2 18/12/2023      A  6/7/2023         C  10/4/2023     A
#> 4  2  30/8/2023      C  5/6/2023         B 24/12/2023     C
#> 5  3   5/6/2023      D 19/3/2023         A  27/5/2023     D
#> 6  3 11/11/2023      B 14/6/2023         A  27/9/2023     B

DF |> group_by(id) |> mutate(subrow = row_number()) |> 
  pivot_longer(cols = -c("id", "subrow"), names_to = "Var") |> 
  pivot_wider(names_from = c("Var", "subrow"), names_sep = "_", values_from = "value")
#> # A tibble: 3 × 13
#> # Groups:   id [3]
#>      id math_1     math_X_1 history_1 history_Y_1 geo_1  geo_Z_1 math_2 math_X_2
#>   <int> <chr>      <chr>    <chr>     <chr>       <chr>  <chr>   <chr>  <chr>   
#> 1     1 1/12/2023  A        9/10/2023 A           14/9/… A       15/9/… B       
#> 2     2 18/12/2023 A        6/7/2023  C           10/4/… A       30/8/… C       
#> 3     3 5/6/2023   D        19/3/2023 A           27/5/… D       11/11… B       
#> # ℹ 4 more variables: history_2 <chr>, history_Y_2 <chr>, geo_2 <chr>,
#> #   geo_Z_2 <chr>

^{Created on 2024-01-13 with reprex v2.0.2}

vedoa · January 13, 2024, 5:03pm

Hi @juandmaz ,

my approach

d <- data.frame(id = c(1,1,2,2,3,3),
                math = c("01.12.2023",	"15.09.2023",	"18.12.2023",	"30.08.2023",	"05.06.2023",	"11.11.2023"),
                X =	c("A",	"B",	"A", "C",	"D",	"B"),
                history = c("09.10.2023","04.07.2023",	"06.07.2023",	"05.06.2023",	"19.03.2023",	"14.06.2023"),
                Y = c("A", "D",	"C", "B", "A", "A"),
                geo = c("14.09.2023",	"19.02.2023",	"10.04.2023",	"24.12.2023",	"27.05.2023",	"27.09.2023"),
                Z = c("A","D","A","C","D","B"))

library(tidyr)
library(dplyr)

dFinal <- d |> 
  group_by(id) |> 
  mutate(forWider = 1:n()) |>
  pivot_wider(names_from = forWider, values_from = !(c(id, forWider)))

dromano · January 13, 2024, 9:34pm

Hi Juan,

I'm curious, if you don't mind sharing: What is your use case? Why do you want to create a single row per id value? In an application to calculating grades (which this looks like), I would think that a longer version would be more useful than a wider version.

juandmaz · January 14, 2024, 4:55am

FJCC:

|> group_by(id) |> mutate(subrow = row_number()) |> 
  pivot_longer(cols = -c("id", "subrow"), names_to = "Var") |> 
  pivot_wider(names_from = c("Var", "subrow"), names_sep = "_", values_from = "value")

Hi, thanks for the help. The problem with the solution you gave me is that in my df the variables "history", "math" and "geo" are dates and in your df there aren't. So I can't use your code because I can't combine dates with character.

juandmaz · January 14, 2024, 5:00am

Hello, no problem with the question. Actually, I have to do this same but with a much longer base and with values of another type. This example I made was just as a reprex. So it may look that it's not worth doing all this for this example, but I really need it for another case.

dromano · January 14, 2024, 3:38pm

I didn't mean to suggest it wasn't worth it, just that sometimes one's actual use case may help folks find a better solution than they might come up with based on an understanding informed by a limited example. It looks like you were satisfied with @ vedoa's solution, but if you're up for it, I would be curious to know more about your use case. Otherwise, best wishes on further progress with your work.

juandmaz · January 15, 2024, 8:26pm

Hi, I have a very large database with many repeated IDs and I find more optimal to have a database with one observation per ID for cleaning, processing, validation and cross analysis of variables.
I think it's easier that way.

system · January 22, 2024, 8:27pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.