[Dplyr + Tidyr] "Filling" Different Columns rowwise() with mutate without a loop

TPDeRamus · July 23, 2024, 2:02am

Hello once again Posit Community.

I have some data that has unfortunately come to me as rather ugly character variables that I need to convert into a dataframe. I think I have something of a solution, but I would like to see if it's possible without resorting to a loop and if the presence of lists are going to be an issue.

First, the data looks something like this:

df <- data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna','Johnathan','Johnathan','Lewis','Lewis'),
  Rating = c("[(4, 0.04), (5, 0.05)]", "[(1, 0.01), (2, 0.02), (3, 0.3), (4, 0.04), (5, 0.05)]", "[(4, 0.04), (5, 0.05)]", "[(2, 0.02), (3, 0.3), (4, 0.04)]", "[(1, 0.01), (4, 0.04), (5, 0.05)]", "[(3, 0.3), (4, 0.04)]", "[(3, 0.3), (4, 0.04), (5, 0.05)]", "[(4, 0.04), (5, 0.05)]")
  )

Participant	Rating
Greg	[(4, 0.04), (5, 0.05)]
Greg	[(1, 0.01), (2, 0.02), (3, 0.3), (4, 0.04), (5, 0.05)]
Donna	[(4, 0.04), (5, 0.05)]
Donna	[(2, 0.02), (3, 0.3), (4, 0.04)]
Jonathan	[(1, 0.01), (4, 0.04), (5, 0.05)]
Jonathan	[(3, 0.3), (4, 0.04)]
Lewis	[(3, 0.3), (4, 0.04), (5, 0.05)]
Lewis	[(4, 0.04), (5, 0.05)]

And it needs to look more like the following:

Participant	Rating_01	Rating_02	Rating_03	Rating_04	Rating_05
Greg	0	0	0	0.04	0.05
Greg	0.01	0.02	0.3	0.04	0.05
Donna	0	0	0	0.04	0.05
Donna	0	0.02	0.3	0.04	0
Jonathan	0.01	0	0	0.04	0.05
Jonathan	0	0	0.3	0.04	0
Lewis	0	0	0.3	0.04	0.05
Lewis	0	0	0	0.04	0.05

I have written a small function to split the data and assign names:

   ratersplit<- function(x) {
     cleaner <- str_replace_all(x,c(`\\[` = "", `\\]` = "", `\\(` = "", `\\)`=""))
     len <- length(unlist(strsplit(cleaner,", ")))
     nameout <- paste0("Rating_",str_pad(unlist(strsplit(cleaner,", "))[c(TRUE,FALSE)], 2, pad = "0"))
     varout <- as.numeric(unlist(strsplit(cleaner,", "))[c(FALSE,TRUE)])
     return(list(ratcols = nameout, ratvars = varout))
   }

Which returns the names and values of any given row that it is given:

> ratersplit(df$Rating[1])
$ratcols
[1] "Rating_04" "Rating_05"

$ratvars
[1] 0.04 0.05

And appended and filled the empty columns with 0's so that they can be "filled" easier:

df[, setdiff((paste0("Rating_",str_pad(1:5,2,pad = "0"))),names(df))] <- 0

Participant	Rating	Rating_01	Rating_02	Rating_03	Rating_04	Rating_05
Greg	[(4, 0.04), (5, 0.05)]	0	0	0	0	0
Greg	[(1, 0.01), (2, 0.02), (3, 0.3), (4, 0.04), (5, 0.05)]	0	0	0	0	0
Donna	[(4, 0.04), (5, 0.05)]	0	0	0	0	0
Donna	[(2, 0.02), (3, 0.3), (4, 0.04)]	0	0	0	0	0
Jonathan	[(1, 0.01), (4, 0.04), (5, 0.05)]	0	0	0	0	0
Jonathan	[(3, 0.3), (4, 0.04)]	0	0	0	0	0
Lewis	[(3, 0.3), (4, 0.04), (5, 0.05)]	0	0	0	0	0
Lewis	[(4, 0.04), (5, 0.05)]	0	0	0	0	0

But I can't think of a way to utilize rowwise() to with mutate_at(c(x$ratcols) = x$ratvals) and fill the corresponding values that might not just error or fill the columns with the lists instead of the raw values.

And I fear that using a loop will have the same problem.

Would happily welcome any suggestions you would be willing to offer.

Thank you in advance!

TPDeRamus · July 23, 2024, 2:25am

As an update, something like this works:

  for(n in 1:nrow(df)){
    temp <- ratersplit(df$Rating[n])
    df[n,c(temp$ratcols)] <- as.list(temp$ratvars)
  }

But it takes a while and seems inefficient.

FJCC · July 23, 2024, 4:18am

Here is a solution that gets most of what you want without loops or rowwise(). It has NA instead of 0 in the missing cases and the columns are not in your desired order, but both of those are easily remedied and may to matter. I'm sure this can be improved.

library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 4.3.3
df <- data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna','Johnathan','Johnathan','Lewis','Lewis'),
  Rating = c("[(4, 0.04), (5, 0.05)]", "[(1, 0.01), (2, 0.02), (3, 0.3), (4, 0.04), (5, 0.05)]", "[(4, 0.04), (5, 0.05)]", "[(2, 0.02), (3, 0.3), (4, 0.04)]", "[(1, 0.01), (4, 0.04), (5, 0.05)]", "[(3, 0.3), (4, 0.04)]", "[(3, 0.3), (4, 0.04), (5, 0.05)]", "[(4, 0.04), (5, 0.05)]")
)


df2 <- df |> mutate(Rating = str_replace_all(Rating, "\\),", ";"),
                    Rating = str_replace_all(Rating, "[\\(\\)\\[\\]]", ""),
                    ROW = row_number()) |> 
  separate_wider_delim(Rating, delim = ";", names = paste0("R_", 1:5),
                       too_few = "align_start") |> 
  pivot_longer(cols = starts_with("R_"), names_to = "R", values_to = "Value") |> 
  na.omit() |> 
  separate_wider_delim(cols = "Value", delim = ",", names = c("Rating", "Score")) |> 
  mutate(Rating = str_trim(Rating), Score = str_trim(Score)) |> 
  select(ROW, Participant, Rating, Score) |> 
  pivot_wider(names_from = "Rating", values_from = "Score", names_prefix = "Rating_")

df2
#> # A tibble: 8 × 7
#>     ROW Participant Rating_4 Rating_5 Rating_1 Rating_2 Rating_3
#>   <int> <chr>       <chr>    <chr>    <chr>    <chr>    <chr>   
#> 1     1 Greg        0.04     0.05     <NA>     <NA>     <NA>    
#> 2     2 Greg        0.04     0.05     0.01     0.02     0.3     
#> 3     3 Donna       0.04     0.05     <NA>     <NA>     <NA>    
#> 4     4 Donna       0.04     <NA>     <NA>     0.02     0.3     
#> 5     5 Johnathan   0.04     0.05     0.01     <NA>     <NA>    
#> 6     6 Johnathan   0.04     <NA>     <NA>     <NA>     0.3     
#> 7     7 Lewis       0.04     0.05     <NA>     <NA>     0.3     
#> 8     8 Lewis       0.04     0.05     <NA>     <NA>     <NA>

^{Created on 2024-07-22 with reprex v2.0.2}

craig.parylo · July 23, 2024, 8:32am

I think @FJCC nailed it. All I can think of is the below where I've tried to improve the extraction of the rating and score using regex:

library(tidyverse)

df <- data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna','Johnathan','Johnathan','Lewis','Lewis'),
  Rating = c("[(4, 0.04), (5, 0.05)]", "[(1, 0.01), (2, 0.02), (3, 0.3), (4, 0.04), (5, 0.05)]", "[(4, 0.04), (5, 0.05)]", "[(2, 0.02), (3, 0.3), (4, 0.04)]", "[(1, 0.01), (4, 0.04), (5, 0.05)]", "[(3, 0.3), (4, 0.04)]", "[(3, 0.3), (4, 0.04), (5, 0.05)]", "[(4, 0.04), (5, 0.05)]")
)

df2 <- df |> 
  mutate(
    id = row_number(),
    # extract contents between brackets as a list
    rating = str_extract_all(
      string = Rating, 
      pattern = '\\d, [0-9|.]*'
    )
  ) |> 
  unnest(cols = rating) |> # unnest to one row each
  separate_wider_delim( # extract rating and scores to separate columns
    cols = rating, 
    delim = ', ', 
    names = c('rating', 'score')
  ) |> 
  mutate(score = score |> as.numeric()) |> # parse as a number
  pivot_wider( # pivot ratings to columns, prefix name and fill blanks
    names_from = rating,
    names_prefix = 'Rating_',
    values_from = score,
    values_fill = 0,
    names_sort = T
  ) |> 
  arrange(id) |> # arrange by order presented in df
  select(-id) # remove the id

df2
#> # A tibble: 8 × 7
#>   Participant Rating                Rating_1 Rating_2 Rating_3 Rating_4 Rating_5
#>   <chr>       <chr>                    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
#> 1 Greg        [(4, 0.04), (5, 0.05…     0        0         0       0.04     0.05
#> 2 Greg        [(1, 0.01), (2, 0.02…     0.01     0.02      0.3     0.04     0.05
#> 3 Donna       [(4, 0.04), (5, 0.05…     0        0         0       0.04     0.05
#> 4 Donna       [(2, 0.02), (3, 0.3)…     0        0.02      0.3     0.04     0   
#> 5 Johnathan   [(1, 0.01), (4, 0.04…     0.01     0         0       0.04     0.05
#> 6 Johnathan   [(3, 0.3), (4, 0.04)]     0        0         0.3     0.04     0   
#> 7 Lewis       [(3, 0.3), (4, 0.04)…     0        0         0.3     0.04     0.05
#> 8 Lewis       [(4, 0.04), (5, 0.05…     0        0         0       0.04     0.05

^{Created on 2024-07-23 with reprex v2.1.0}

nirgrahamuk · July 23, 2024, 9:24am

I noticed that the strings of Ratings, are close to JSON but for the round brackets not being square. so I write a simple function to swap the brackets, this means reading as JSON returns a matrix where first column is index to the rating (1 through 5) , and second column is the rating value.

library(tidyverse)
library(jsonlite)
df <- data.frame(
  Participant = c("Greg", "Greg", "Donna", "Donna", "Johnathan", "Johnathan", "Lewis", "Lewis"),
  Rating = c("[(4, 0.04), (5, 0.05)]", "[(1, 0.01), (2, 0.02), (3, 0.3), (4, 0.04), (5, 0.05)]", "[(4, 0.04), (5, 0.05)]", "[(2, 0.02), (3, 0.3), (4, 0.04)]", "[(1, 0.01), (4, 0.04), (5, 0.05)]", "[(3, 0.3), (4, 0.04)]", "[(3, 0.3), (4, 0.04), (5, 0.05)]", "[(4, 0.04), (5, 0.05)]")
)

round_to_square <- function(x) {
  str_replace_all(x, pattern = c("\\(" = "[", "\\)" = "]"))
}

df |>
  rowwise() |>
  mutate(Rating = list(fromJSON(round_to_square(Rating),
    simplifyMatrix = FALSE
  ))) |>
  ungroup() |>
  mutate(rn = row_number()) |>
  unnest(cols = Rating) |>
  unnest_wider(col = Rating, names_sep = "_") |>
  arrange(Rating_1) |>
  mutate(Rating_1 = paste0("Rating_",Rating_1)) |> 
  pivot_wider(
    id_cols = c(Participant, rn), names_from = Rating_1,
    values_from = Rating_2,
    values_fill = 0
  ) |>
  arrange(rn)

# A tibble: 8 × 7
  Participant    rn Rating_1 Rating_2 Rating_3 Rating_4 Rating_5
  <chr>       <int>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1 Greg            1     0        0         0       0.04     0.05
2 Greg            2     0.01     0.02      0.3     0.04     0.05
3 Donna           3     0        0         0       0.04     0.05
4 Donna           4     0        0.02      0.3     0.04     0   
5 Johnathan       5     0.01     0         0       0.04     0.05
6 Johnathan       6     0        0         0.3     0.04     0   
7 Lewis           7     0        0         0.3     0.04     0.05
8 Lewis           8     0        0         0       0.04     0.05

dromano · July 23, 2024, 11:36am

Here's another variation:

df <- data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna','Johnathan','Johnathan','Lewis','Lewis'),
  Rating = c("[(4, 0.04), (5, 0.05)]", "[(1, 0.01), (2, 0.02), (3, 0.3), (4, 0.04), (5, 0.05)]", "[(4, 0.04), (5, 0.05)]", "[(2, 0.02), (3, 0.3), (4, 0.04)]", "[(1, 0.01), (4, 0.04), (5, 0.05)]", "[(3, 0.3), (4, 0.04)]", "[(3, 0.3), (4, 0.04), (5, 0.05)]", "[(4, 0.04), (5, 0.05)]")
  )

library(tidyverse)
df |> 
  mutate(row = row_number()) |>  # to prevent creation of list columns later
  separate_longer_delim(Rating, '),') |> 
  separate_wider_delim(Rating, ', ', names = c('rating', 'value')) |> 
  mutate(across(rating:value, parse_number)) |> 
  arrange(rating) |> 
  pivot_wider(names_from = rating, names_glue = 'rating_{rating}') |> 
  arrange(row)
#> # A tibble: 8 × 7
#>   Participant   row rating_1 rating_2 rating_3 rating_4 rating_5
#>   <chr>       <int>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
#> 1 Greg            1    NA       NA        NA       0.04     0.05
#> 2 Greg            2     0.01     0.02      0.3     0.04     0.05
#> 3 Donna           3    NA       NA        NA       0.04     0.05
#> 4 Donna           4    NA        0.02      0.3     0.04    NA   
#> 5 Johnathan       5     0.01    NA        NA       0.04     0.05
#> 6 Johnathan       6    NA       NA         0.3     0.04    NA   
#> 7 Lewis           7    NA       NA         0.3     0.04     0.05
#> 8 Lewis           8    NA       NA        NA       0.04     0.05

^{Created on 2024-07-23 with reprex v2.0.2}

TPDeRamus · July 23, 2024, 1:34pm

Unfortunate outcome on the actual data.

Both @FJCC and @craig.parylo's answers are throwing the following error:

Error in `separate_wider_delim()`:
! Expected 2 pieces in each element of `Value`.
! 22 values were too short.
ℹ Use `too_few = "debug"` to diagnose the problem.
ℹ Use `too_few = "align_start"/"align_end"` to silence this message.
! 1303485 values were too long.
ℹ Use `too_many = "debug"` to diagnose the problem.
ℹ Use `too_many = "drop"/"merge"` to silence this message.
Run `rlang::last_trace()` to see where the error occurred.

And @dromano threw the following:

Error in `separate_wider_delim()`:
! Expected 2 pieces in each element of `Rating`.
! 22 values were too short.
ℹ Use `too_few = "debug"` to diagnose the problem.
ℹ Use `too_few = "align_start"/"align_end"` to silence this message.
Run `rlang::last_trace()` to see where the error occurred.

Trying @nirgrahamuk's at the moment but it's taking a bit to run through it all.

dromano · July 23, 2024, 2:03pm

This seems to suggest that some places where the sample data has commas, the actual data does not: Could you double-check to see how the sample and actual rating strings compare?