Help with pivot_longer to make questionaire dataset tidy

leonardo79 · March 8, 2024, 5:48pm

Hello everyone, I hope you're doing well. I need help transforming a specific dataframe into a long format. The data pertains to a consumer survey for two or more products. The collection is done via an online form, and I import the generated Excel file into R as a dataframe at the end of the survey.

The structure of the dataframe is as follows, with the variables in sequence:

id (chr): Consumer identification
sample (chr): Identifier of the sample evaluated in the first position
question_1 (num): First question about the sample evaluated in the first position
question_2 (num): Second question about the sample evaluated in the first position
question_3 (chr): Third question about the sample evaluated in the first position
sample: (chr) Identifier of the sample evaluated in the second position
question_1 (num): First question about the sample evaluated in the second position
question_2 (num): Second question about the sample evaluated in the second position
question_3 (chr): Third question about the sample evaluated in the second position

The above example dataframe has 9 columns, with some of them having the same column names but different observations because they are different samples. R automatically renames repeated column names (example: sample, sample.1), despite the question name being the same.

The final dataframe that I need in long format would have the following variables in sequence:
id, sample, question_1, question_2, question_3

I tried using tidyr to code with pivot_longer (based on the example from R4DS bellow), but I couldn't solve the issue. Can anyone help me?

billboard |> 
  pivot_longer(
    cols = starts_with("wk"), 
    names_to = "week", 
    values_to = "rank"
  )

M_AcostaCH · March 8, 2024, 6:06pm

Hi @leonardo79 , Could you put a reproducible example of data?

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items: A minimal dataset, necessary to reproduce the issue The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages. Let's quickly go over each one of these with examples: Minimal Dataset (Sample Data) You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame head(iris) #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.…

# other way is:
# If you data is data, please paste the result of 
dput(head(data,30)) # for first 30 rows with all columns. #

leonardo79 · March 8, 2024, 6:15pm

Hi, @M_AcostaCH ! Thank you for your reply and the link with information. See bellow a small dataset that can be used: (edited to post the complete code that is not working properly)

# Carregando o pacote necessário
library(tidyr)
library(dplyr)

# Criando um exemplo de dataframe
dataset <- data.frame(
 id = c("ID1", "ID2", "ID3"),
 sample = c("Sample1", "Sample2", "Sample1"),
 question_1 = c(5, 4, 7),
 question_2 = c(3, 2, 5),
 question_3 = c("Yes", "No", "No"),
 sample = c("Sample2", "Sample1", "Sample2"),
 question_1 = c(4, 5, 3),
 question_2 = c(2, 3, 4),
 question_3 = c("No", "Yes", "No")
)

# Transformando o dataframe para o formato longo
df_long <- df %>%
 pivot_longer(
   cols = -id,
   names_to = c(".value", "sample_num"),
   names_pattern = "(.*)(\\d)$"
 )

# Visualizando o dataframe transformado
print(df_long)

M_AcostaCH · March 11, 2024, 12:38am

Im try to help and find this form. A close example like you want:
Maybe a super R user could make better.

dataset <- data.frame(
 id = c("ID1", "ID2", "ID3"),
 sample = c("Sample1", "Sample2", "Sample1"),
 question_1 = c(5, 4, 7),
 question_2 = c(3, 2, 5),
 question_3 = c("Yes", "No", "No"),
 sample = c("Sample2", "Sample1", "Sample2"),
 question_1 = c(4, 5, 3),
 question_2 = c(2, 3, 4),
 question_3 = c("No", "Yes", "No")
)
# I think that the two `sample` columns is difficult for get a Q1, Q2, Q3 columns.

dataset_long <- dataset %>%
  pivot_longer(cols = starts_with("question_"), 
               names_to = "question", 
               values_to = "value") %>%
  mutate(question = gsub("question_", "Q", question)) 

#  id    sample  sample.1 question value
#  <chr> <chr>   <chr>    <chr>    <chr>
# 1 ID1   Sample1 Sample2  Q1       5    
# 2 ID1   Sample1 Sample2  Q2       3    
# 3 ID1   Sample1 Sample2  Q3       Yes  
# 4 ID1   Sample1 Sample2  Q1.1     4    
# 5 ID1   Sample1 Sample2  Q2.1     2    
# 6 ID1   Sample1 Sample2  Q3.1     No   
# 7 ID2   Sample2 Sample1  Q1       4    
# 8 ID2   Sample2 Sample1  Q2       2    
# 9 ID2   Sample2 Sample1  Q3       No   
# 10 ID2   Sample2 Sample1  Q1.1     5    
# 11 ID2   Sample2 Sample1  Q2.1     3    
# 12 ID2   Sample2 Sample1  Q3.1     Yes  
# 13 ID3   Sample1 Sample2  Q1       7    
# 14 ID3   Sample1 Sample2  Q2       5    
# 15 ID3   Sample1 Sample2  Q3       No   
# 16 ID3   Sample1 Sample2  Q1.1     3    
# 17 ID3   Sample1 Sample2  Q2.1     4    
# 18 ID3   Sample1 Sample2  Q3.1     No

jrkrideau · March 11, 2024, 2:27am

This may do what you want. Note I exported your data.frame as a .csv file and read it back in as a data.table. You can do the same thing with a data.frame, I just find it easier to use {data.table}

suppressMessages(library(data.table))

DT <- fread("leonardo.csv") 

DT1 <- DT[, .( id,  sample, question_1, question_2, question_3)]
DT2 <- DT[, .(id, sample.1, question_1.1, question_2.1, question_3.1)]

names(DT2) <- names(DT1)

DT3 <- rbind(DT1, DT2)

leonardo79 · March 11, 2024, 8:23pm

@M_AcostaCH Thank you for your kind answer! But the desired format to the new dataframe needs to have only id, sample, question_1, question_2 and question_3 columns.

leonardo79 · March 11, 2024, 8:27pm

Thank you @jrkrideau ! It's what I want (worked for this small example). How can I generalize this answer if I have 30 questions columns and 8 sample columns for example? Do you have any idea?

jrkrideau · March 11, 2024, 9:15pm

I'll have to think about this but you can just think of this but you can just keep repeating what I did for the 8 samples but with all 30 variables.

All I did was do

names(DT)

and copy the names into a template

I think it should be fairly easy to come up with a better method but probably will not be able to get to it until tomorrow morning my time (EDST North America).

leonardo79 · March 11, 2024, 9:20pm

Hi @jrkrideau ! Thank you a lot for your time. I am trying to obtain a better code to use in general situations. If I discover, I'll share here.

jrkrideau · March 12, 2024, 1:44pm

Morning (my time)leonardo79

I seem to be having a mental block on an easy way to split up that data set. However slight modification of my earlier method will cut down on a lot of finicky work. See below:

DT <- fread("leonardo.csv") 
DT1 <- DT[, id, sample:question_3]
DT2 <- DT[, id, sample.1 :question_3.1]

You can do this

names(DT2) <- names(DT3) <- names(DT4) <- names(DT1)
DTtotal <- rbind(DT1, DT2, DT3, DT4)

It is not a general solution but it may let you get some work done until I get hit by some inspiration or someone else comes along and shows us what obvious thing I'm missing.

StatSteph · March 12, 2024, 8:47pm

For this setup, I've come up with the following solution:

library(tidyr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

dataset <- data.frame(
  id = c("ID1", "ID2", "ID3"),
  sample = c("Sample1", "Sample2", "Sample1"),
  question_1 = c(5, 4, 7),
  question_2 = c(3, 2, 5),
  question_3 = c("Yes", "No", "No"),
  sample = c("Sample2", "Sample1", "Sample2"),
  question_1 = c(4, 5, 3),
  question_2 = c(2, 3, 4),
  question_3 = c("No", "Yes", "No")
)

dataset %>%
  mutate(across(where(is.numeric), as.character)) %>%
  pivot_longer(cols = -id, 
               names_to = "question", 
               values_to = "value") %>%
  mutate(
    SampleGp=if_else(stringr::str_detect(question, "\\.1"), "B", "A"),
    question=stringr::str_remove_all(question, "\\.1")
  ) %>%
  pivot_wider(id_cols=c(id, SampleGp), values_from=value, names_from=question) %>%
  mutate(
    sample=readr::parse_number(sample),
    across(c(question_1, question_2), as.numeric)
  ) %>%
  select(-SampleGp)
#> # A tibble: 6 × 5
#>   id    sample question_1 question_2 question_3
#>   <chr>  <dbl>      <dbl>      <dbl> <chr>     
#> 1 ID1        1          5          3 Yes       
#> 2 ID1        2          4          2 No        
#> 3 ID2        2          4          2 No        
#> 4 ID2        1          5          3 Yes       
#> 5 ID3        1          7          5 No        
#> 6 ID3        2          3          4 No

^{Created on 2024-03-12 with reprex v2.1.0}

StatSteph · March 12, 2024, 8:56pm

And this solution would expand to including an arbitrary number of groups though the manual bit still exists converting question_1 and question_2 to numeric at the end.

library(tidyr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

dataset <- data.frame(
  id = c("ID1", "ID2", "ID3"),
  sample = c("Sample1", "Sample2", "Sample1"),
  question_1 = c(5, 4, 7),
  question_2 = c(3, 2, 5),
  question_3 = c("Yes", "No", "No"),
  sample = c("Sample2", "Sample1", "Sample2"),
  question_1 = c(4, 5, 3),
  question_2 = c(2, 3, 4),
  question_3 = c("No", "Yes", "No")
)

dataset %>%
  mutate(across(where(is.numeric), as.character)) %>%
  pivot_longer(cols = -id, 
               names_to = "question", 
               values_to = "value") %>%
  separate_wider_delim(cols=question, delim=".", names=c("Question", "suffix"), too_few="align_start") %>%
  pivot_wider(id_cols=c(id, suffix), values_from=value, names_from=Question) %>%
  mutate(
    sample=readr::parse_number(sample),
    across(c(question_1, question_2), as.numeric)
  ) %>%
  select(-suffix)
#> # A tibble: 6 × 5
#>   id    sample question_1 question_2 question_3
#>   <chr>  <dbl>      <dbl>      <dbl> <chr>     
#> 1 ID1        1          5          3 Yes       
#> 2 ID1        2          4          2 No        
#> 3 ID2        2          4          2 No        
#> 4 ID2        1          5          3 Yes       
#> 5 ID3        1          7          5 No        
#> 6 ID3        2          3          4 No

^{Created on 2024-03-12 with reprex v2.1.0}

leonardo79 · March 13, 2024, 5:25pm

Thank you Stephanie @StatSteph !
An excellent option to solve the problem of keeping the data Tidy. I understood the possibility of generalizing the answer. With the insights obtained here, I am trying to create a code to address more general cases. If I succeed, I will share it here.

jrkrideau · March 13, 2024, 8:06pm

Saved! I knew someone would get me out of my dead-end approach!

system · March 20, 2024, 11:13pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.