Assigning Variables to columns based on a nested list using dplyr/tidyverse

TPDeRamus · April 19, 2024, 9:13pm

Regarding @FJCC's approach,

I got the output, but the total number of rows shrunk by about half (looks like it's doing unique values somehow), and it's producing columns as lists rather than individual doubles:

> typeof(posit$Col_0)
[1] "list"
> class(posit$Col_0)
[1] "list"

With all empty values as NULL rather than NA.

TPDeRamus · April 19, 2024, 10:33pm

This solution seems to work:

tibble |>
  unnest_wider(nestedvals, names_sep = '_', transform = ~.x[2]) |>
  mutate(across(starts_with('nestedvals'), ~replace_na(., 0)))

Based on:

It just seems to take some time.

dromano · April 19, 2024, 11:08pm

Lists are be created by pivot_wider() when deleting the values_from column results in duplicate rows:

library(tidyverse)  
tibble(
  ltr = rep(letters[1:2], each = 2), 
  num = 1:4
) -> small

small
#> # A tibble: 4 × 2
#>   ltr     num
#>   <chr> <int>
#> 1 a         1
#> 2 a         2
#> 3 b         3
#> 4 b         4

small |> 
  pivot_wider(names_from = ltr, values_from = num)
#> Warning: Values from `num` are not uniquely identified; output will contain list-cols.
#> • Use `values_fn = list` to suppress this warning.
#> • Use `values_fn = {summary_fun}` to summarise duplicates.
#> • Use the following dplyr code to identify duplicates.
#>   {data} |>
#>   dplyr::summarise(n = dplyr::n(), .by = c(ltr)) |>
#>   dplyr::filter(n > 1L)
#> # A tibble: 1 × 2
#>   a         b        
#>   <list>    <list>   
#> 1 <int [2]> <int [2]>

^{Created on 2024-04-19 with reprex v2.0.2}

which means there must be ID and col (index) combinations that repeat in your full data.

A fix to @FJCC's solution is to add unique identifiers for each row before running pivot_wider():

df |> 
  unnest(nestedvals) |> 
  rowwise() |> 
  mutate(
    name = nestedvals |> pluck(1), 
    value =  nestedvals |> pluck(2) 
  ) |> 
  ungroup() |> 
  mutate(row = row_number()) |> 
  pivot_wider(names_prefix = 'col', values_fill = 0) |> 
  group_by(id) |> 
  summarise( 
    across(nestedvals, list),
    across(contains('col'), sum)
  ) |> 
  select(contains(c('i', 'n', 1:10)))

TPDeRamus · April 19, 2024, 11:23pm

Is there a logic behind the:
select(contains(c('i', 'n', 1:10)))
Or is that arbitrary?

I ask because I'd like to know if I should give it a longer range from data that can range from 1-1500 unique values for index.

dromano · April 20, 2024, 12:36am

Yes, the contains() function allows you to match the names of the columns to strings: "i" to id, "n" to nestedvals, and 1:10 to the numbered columns, and the columns appear in the order specified.

An alternative is to use matches() instead, which requires regular expressions:

matches('i|n|[0-9]') # matches 'i' or 'n' or the digits 0-9

but then you may lose control of the of order in which the numbered columns appear.

system · May 11, 2024, 12:37am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.