Transforming Long Form Data to Wide Form Data

DatAnnihilyst · April 4, 2021, 8:25am

Hi,

I'm trying to perform the above-mentioned task on a large data set of 1.5 million observations of 52 variables. I've provided a small sample of the data to illustrate what I'm trying to achieve. The variables which remain the same for each personal id number are Gender and Status but I'd like the other variables to be spread out so that each row shows all information for each individual.

df_Long <- data.frame(Personal_ID_Number = c(1,1,1,2,2,2,3,3,3),
                      Gender = c(0,0,0,1,1,1,0,0,0),
                      Status = c("A","A","A","C","C","C","B","B","B"),
                      Acct.Type = c("RSV","RSV","STM","FLX","RSV","STM","STM","FLX","RSV"), 
                      Desc = c("Prim","Prim","Sec","Joint","Sec","Prim","Sec","Prim","Joint"), 
                      Currency = c("JMD","JMD","GBP","USD","GBP","JMD","GBP","JMD","USD"),
                      Bal_1 = c(5000,5500,3000,6000,5000,4000,7000,6000,5000),
                      Bal_2 = c(4000,5000,5500,6500,7500,8500,9500,2500,1500))

Please assist if you can.

Cheers!

DA

vlad_aluas · April 4, 2021, 11:20am

Hi,

I have a couple of questions regarding the data you are using.

Will you have only three observations for each ID, or the number of observations can vary from ID to ID?

If the number of observations is fixed, you can simply add additional columns and append a number to them.

If the number of observations for each ID varies, it would not be advisable since you will might end up adding new columns with each new observation.

DatAnnihilyst · April 4, 2021, 3:21pm

Hi @vlad_aluas ,

The number of observations for each ID does vary. Seems as if there will be as many columns added as there are response options for Acct.Type, Desc and Currency (9 extra columns or 6 since old columns will be replaced).

I tried the spread and reshape functions but they only seem to work for using 2 of the dynamic variables, e.g.

df_Wide <- spread(df_Long, Acct.Type, Desc)

vlad_aluas · April 5, 2021, 6:21am

Hi DatAnnihilyst,

Indeed spread will not work with multiple variables.

However, from the data you have I would think a longer format is what you need and it would be the best one in which to have the data.

Is there a specific reason you want it spread like you described?

DatAnnihilyst · April 5, 2021, 7:04am

I want to construct the data that way because i think it would be easier to analyse, create charts and pivots etc. Is this possible in the long form? For example, in the wide format i could create averages of all individuals, not all accounts. But I'm open to any ideas and approaches which can extract useful analysis.

Assist if you can. Thanks for your input so far.

system · April 26, 2021, 7:04am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.