Assigning Variables to columns based on a nested list using dplyr/tidyverse

Regarding @FJCC's approach,

I got the output, but the total number of rows shrunk by about half (looks like it's doing unique values somehow), and it's producing columns as lists rather than individual doubles:

> typeof(posit$Col_0)
[1] "list"
> class(posit$Col_0)
[1] "list"

With all empty values as NULL rather than NA.

This solution seems to work:

tibble |>
  unnest_wider(nestedvals, names_sep = '_', transform = ~.x[2]) |>
  mutate(across(starts_with('nestedvals'), ~replace_na(., 0)))

Based on:

It just seems to take some time.

Lists are be created by pivot_wider() when deleting the values_from column results in duplicate rows:

library(tidyverse)  
tibble(
  ltr = rep(letters[1:2], each = 2), 
  num = 1:4
) -> small

small
#> # A tibble: 4 × 2
#>   ltr     num
#>   <chr> <int>
#> 1 a         1
#> 2 a         2
#> 3 b         3
#> 4 b         4
small |> 
  pivot_wider(names_from = ltr, values_from = num)
#> Warning: Values from `num` are not uniquely identified; output will contain list-cols.
#> • Use `values_fn = list` to suppress this warning.
#> • Use `values_fn = {summary_fun}` to summarise duplicates.
#> • Use the following dplyr code to identify duplicates.
#>   {data} |>
#>   dplyr::summarise(n = dplyr::n(), .by = c(ltr)) |>
#>   dplyr::filter(n > 1L)
#> # A tibble: 1 × 2
#>   a         b        
#>   <list>    <list>   
#> 1 <int [2]> <int [2]>

Created on 2024-04-19 with reprex v2.0.2

which means there must be ID and col (index) combinations that repeat in your full data.

A fix to @FJCC's solution is to add unique identifiers for each row before running pivot_wider():

df |> 
  unnest(nestedvals) |> 
  rowwise() |> 
  mutate(
    name = nestedvals |> pluck(1), 
    value =  nestedvals |> pluck(2) 
  ) |> 
  ungroup() |> 
  mutate(row = row_number()) |> 
  pivot_wider(names_prefix = 'col', values_fill = 0) |> 
  group_by(id) |> 
  summarise( 
    across(nestedvals, list),
    across(contains('col'), sum)
  ) |> 
  select(contains(c('i', 'n', 1:10)))

Is there a logic behind the:
select(contains(c('i', 'n', 1:10)))
Or is that arbitrary?

I ask because I'd like to know if I should give it a longer range from data that can range from 1-1500 unique values for index.

Yes, the contains() function allows you to match the names of the columns to strings: "i" to id, "n" to nestedvals, and 1:10 to the numbered columns, and the columns appear in the order specified.

An alternative is to use matches() instead, which requires regular expressions:

matches('i|n|[0-9]') # matches 'i' or 'n' or the digits 0-9

but then you may lose control of the of order in which the numbered columns appear.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.