Assigning Variables to columns based on a nested list using dplyr/tidyverse

TPDeRamus · April 18, 2024, 7:57pm

Hi experts.

I have a tibble with a column containing a nested list (<list<list<double>>> data type to be specific).

It looks something like the following (but in R/Arrow format):

ID	nestedvals
001	[[1]](1,0.1) [[2]](2,0.2) [[3]](3,0.3) [[4]](4,0.4) [[5]](5,0.5)
002	[[1]](1,0.1) [[2]](2,0.2) [[3]](3,0.3) [[4]](4,0.4)
003	[[1]](1,0.1) [[2]](2,0.2) [[3]](3,0.3)
004	[[1]](1,0.1) [[2]](2,0.2)
005	[[1]](1,0.1)

Where if I call row 1 of nestedvals I get:

tibble$nestedvals[1]
<list<list<double>>[1]>
[[1]]
<list<double>[5]>
[[1]]
[1] 1   0.1

[[2]]
[1] 2   0.2

[[3]]
[1] 3   0.3

[[4]]
[1] 4   0.4

[[5]]
[1] 5   0.5

Basically, for the nestedvals column, there is a list containing a list of pairs of doubles, with the first indicating a specific index (e.g. 5), and a certain value (e.g. 0.5).

What I would like to do is generate set of zero-filled columns based on the range of unique indexes of each nested list. e.g.:
col_1, col_2, col_3, col_4, col_5

And then replace each 0 with the value (second number in the nested list), based on the index (first number in each nested list), for each row of the tibble.

I believed the best way to do this involves unlisting the variables and making separate columns with the lists of indexes and lists of values of interest, both so I can find the maximum for name generation in the former and for assignment between to two.

To accomplish this, I wrote a function to split each nested list:

  nestsplit <- function(x, y) {
    unlist(lapply(x, `[[`, y))
  }

And then generate unique columns with the column names (by index) and values of interest to append to the tibble:

tibble <- tibble |> rowwise() |> mutate(index_names = list(paste0("col_",as.character(nestsplit(nestedvals, 1)))),index_values = list(nestsplit(nestedvals, 2)))

But I would like to see if there is an efficient, row-wise, tidyverse/dplyr-based solution to assign the values in index_values to index-based column names using the information in the index_names variable, rather than writing a loop to assign each of them by row.

So that an output like this:

ID	nestedvals	col_1	col_2	col_3	col_4	col_5
001	<Nested list of 5 pairs of values>	0	0	0	0	0
002	<Nested list of 4 pairs of values>	0	0	0	0	0
003	<Nested list of 3 pairs of values>	0	0	0	0	0
004	<Nested list of 2 pairs of values>	0	0	0	0	0
005	<Nested list of 1 pair of values>	0	0	0	0	0

Instead looks like the following:

ID	nestedvals	col_1	col_2	col_3	col_4	col_5
001	<Nested list of 5 pairs of values>	0.1	0.2	0.3	0.4	0.5
002	<Nested list of 4 pairs of values>	0.1	0.2	0.3	0.4	0
003	<Nested list of 3 pairs of values>	0.1	0.2	0.3	0	0
004	<Nested list of 2 pairs of values>	0.1	0.2	0	0	0
005	<Nested list of 1 pair of values>	0.1	0	0	0	0

I would love to give an example to simulate the exact nature of the data, but I'm unfortunately not sure how to recreate this data class/type for an example:

> typeof(tibble$nestedvals)
[1] "list"
> class(tibble$nestedvals)
[1] "arrow_list"    "vctrs_list_of" "vctrs_vctr"    "list"

The closest I have ever been able to get is with:

tibble(ID = c("001", "002", "003", "004", "005"), nestedvals = list(list(c(1,0.1),c(2,0.2),c(3,0.3),c(4,0.4),c(5,0.5)),list(c(1,0.1),c(2,0.2),c(3,0.3),c(4,0.4)),list(c(1,0.1),c(2,0.2),c(3,0.3)),list(c(1,0.1),c(2,0.2)),list(c(1,0.1))))

Which gives a list datatype instead of <list<list<double>>>

FJCC · April 18, 2024, 10:17pm

The output of dput() should provide code to recreate the data. Please run

dput(tibble)

or

dput(head(tibble))

and post the output.

TPDeRamus · April 18, 2024, 11:11pm

Thanks for that life-changing piece of information.

Attached:

structure(list(ID=c(001,002,003,004,005),nestedvals=structure(list(structure(list(c(1,0.1),c(2,0.2),c(3,0.3),c(4,0.4),c(5,0.5)),class=c("arrow_list","vctrs_list_of","vctrs_vctr","list"),ptype=numeric(0)),structure(list(c(1,0.1),c(2,0.2),c(3,0.3),c(4,0.4)),class=c("arrow_list","vctrs_list_of","vctrs_vctr","list"),ptype=numeric(0)),structure(list(c(1,0.1),c(2,0.2),c(3,0.3)),class=c("arrow_list","vctrs_list_of","vctrs_vctr","list"),ptype=numeric(0)),structure(list(c(1,0.1),c(2,0.2)),class=c("arrow_list","vctrs_list_of","vctrs_vctr","list"),ptype=numeric(0)),structure(list(c(1,0.1)),class=c("arrow_list","vctrs_list_of","vctrs_vctr","list"),ptype=numeric(0))),ptype=structure(list(),class=c("arrow_list","vctrs_list_of","vctrs_vctr","list"),ptype=numeric(0)),class=c("arrow_list","vctrs_list_of","vctrs_vctr","list"))),row.names=c(NA,-5L),class=c("tbl_df","tbl","data.frame"))

dromano · April 19, 2024, 1:05am

And could you edit the post with your dput() output by making it copyable?

FJCC · April 19, 2024, 1:15am

I think this gets you what you want. I confess that I got the mutate function via trial and error and I couldn't predict what the unnest() would produce.

df <- structure(list(ID=c(001,002,003,004,005),
                     nestedvals=structure(list(structure(list(c(1,0.1),c(2,0.2),c(3,0.3),c(4,0.4),c(5,0.5)),
                                                         class=c("arrow_list","vctrs_list_of","vctrs_vctr","list"),
                                                         ptype=numeric(0)),
                                               structure(list(c(1,0.1),c(2,0.2),c(3,0.3),c(4,0.4)),
                                                         class=c("arrow_list","vctrs_list_of","vctrs_vctr","list"),
                                                         ptype=numeric(0)),
                                               structure(list(c(1,0.1),c(2,0.2),c(3,0.3)),
                                                         class=c("arrow_list","vctrs_list_of","vctrs_vctr","list"),
                                                         ptype=numeric(0)),
                                               structure(list(c(1,0.1),c(2,0.2)),
                                                         class=c("arrow_list","vctrs_list_of","vctrs_vctr","list"),
                                                         ptype=numeric(0)),
                                               structure(list(c(1,0.1)),
                                                         class=c("arrow_list","vctrs_list_of","vctrs_vctr","list"),
                                                         ptype=numeric(0))),
                                          ptype=structure(list(),class=c("arrow_list","vctrs_list_of","vctrs_vctr","list"),
                                                          ptype=numeric(0)),class=c("arrow_list","vctrs_list_of","vctrs_vctr","list"))),
                row.names=c(NA,-5L),class=c("tbl_df","tbl","data.frame"))

library(tidyverse)

df2 <- unnest(df, cols = nestedvals) |> 
  rowwise() |> 
  mutate(col = nestedvals[1],
         Val = nestedvals[2]) |> 
  select(-nestedvals) |> 
  pivot_wider(names_from = col, values_from = Val, names_prefix = "Col_")
df2
#> # A tibble: 5 × 6
#>      ID Col_1 Col_2 Col_3 Col_4 Col_5
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1   0.1   0.2   0.3   0.4   0.5
#> 2     2   0.1   0.2   0.3   0.4  NA  
#> 3     3   0.1   0.2   0.3  NA    NA  
#> 4     4   0.1   0.2  NA    NA    NA  
#> 5     5   0.1  NA    NA    NA    NA

^{Created on 2024-04-18 with reprex v2.0.2}

dromano · April 19, 2024, 4:38am

I'm curious: Could I ask, if you don't mind, why you'd like to convert your original table into your desired table? Is it simply to be able to view the contents of the list values more easily, or is there another reason?

TPDeRamus · April 19, 2024, 11:46am

Conversion into weighted and dummy variables from some extracted features.

The read-in is originally in pyarrow/python format exported as a parquet and read with the arrow package in R.

I actually thought it would be easier to generate column names from the indices and assign the values with something like contains but it looks like that's not the case.

dromano · April 19, 2024, 1:41pm

I guess my question, which in retrospect is not very clear, is more about why you'd like to convert the lists to a table since list columns can be very convenient and compact ways to store the data. Is it that you just wanted a way to see what is in the lists?

TPDeRamus · April 19, 2024, 1:52pm

So I apologize that I was not able to provide the full data (it contains HIPPA protected info) and that what I provided was a very short example, but when I attempted this solution it threw the following:

Values from `Val` are not uniquely identified; output will contain list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
  {data} |>
  dplyr::summarise(n = dplyr::n(), .by = c(ID,nestedvals, col)) |>
  dplyr::filter(n > 1L)

I assume because the index_names can occur more than once in the full data.

TPDeRamus · April 19, 2024, 1:53pm

This needs to eventually be converted into a model of some kind and the table reflects the transition to this better.

dromano · April 19, 2024, 2:18pm

Thanks. And is it important that the column indices be extracted from the values in nestedvals? Or does the table structure of your actual data mirror the structure of your toy table (in which case only the second double values from each vector of length 2 are needed)?

TPDeRamus · April 19, 2024, 2:32pm

It does.

Both the index (for a variable name) and the value itself are necessary for the output.

TPDeRamus · April 19, 2024, 4:04pm

This seems to be working (from a Stack Overflow thread I also tried):

But it is taking an exceptionally long time:

  ##Stack Answer 2
  stack2 <- tibble |>
    unnest_wider(nestedvals,
                 names_sep = '_',
                 transform = ~ .x[2]) |>
    mutate(across(starts_with('nestedvals'), ~ replace_na(., 0)))

I have discovered this may be due to some data points having missing values:

<list<list<double>>[1]>
[[1]]
<list<double>[0]>

Causing it to hang.

Other commands using map and purr based workflow will just throw an error on that line.

dromano · April 19, 2024, 5:19pm

So to confirm I understand, it sounds like you have a table where the length of a list value in the nestedvalscolumn is independent of the list value's row order and ID values, as well as of the indices contained in the list value, is that right?

For example, does your table look more like this, without the regularity of your toy table?

code for building a toy table, save as 'before'

library(tidyverse)

set.seed(123)
tibble(id = 1:4) |> 
  mutate(size = sample(1:6, 4, replace = T)) |> 
  rowwise() |> 
  mutate(indices = sample(1:6, size) |> list()) |> 
  unnest(indices) |> 
  mutate(val = runif(n())) |>
  mutate(row = row_number()) |> 
  pivot_longer(indices:val) |> 
  group_by(id, row) |> 
  summarise(index_val = value |> list()) |> 
  select(!row) |> 
  group_by(id) |> 
  summarise(nestedvals = index_val |> list()) -> before
#> `summarise()` has grouped output by 'id'. You can override using the `.groups`
#> argument.

before
#> # A tibble: 4 × 2
#>      id nestedvals
#>   <int> <list>    
#> 1     1 <list [3]>
#> 2     2 <list [6]>
#> 3     3 <list [3]>
#> 4     4 <list [2]>

And here are its contents:

before |> 
  pull()
#> [[1]]
#> [[1]][[1]]
#> [1] 2.0000000 0.9942698
#> 
#> [[1]][[2]]
#> [1] 3.0000000 0.6557058
#> 
#> [[1]][[3]]
#> [1] 1.0000000 0.7085305
#> 
#> 
#> [[2]]
#> [[2]][[1]]
#> [1] 4.000000 0.544066
#> 
#> [[2]][[2]]
#> [1] 1.000000 0.594142
#> 
#> [[2]][[3]]
#> [1] 2.0000000 0.2891597
#> 
#> [[2]][[4]]
#> [1] 3.0000000 0.1471136
#> 
#> [[2]][[5]]
#> [1] 6.0000000 0.9630242
#> 
#> [[2]][[6]]
#> [1] 5.000000 0.902299
#> 
#> 
#> [[3]]
#> [[3]][[1]]
#> [1] 3.0000000 0.6907053
#> 
#> [[3]][[2]]
#> [1] 6.0000000 0.7954674
#> 
#> [[3]][[3]]
#> [1] 1.00000000 0.02461368
#> 
#> 
#> [[4]]
#> [[4]][[1]]
#> [1] 4.000000 0.477796
#> 
#> [[4]][[2]]
#> [1] 1.0000000 0.7584595

^{Created on 2024-04-19 with reprex v2.0.2}

TPDeRamus · April 19, 2024, 5:35pm

Correct.

Though it also appears we have instances where the values of said list are empty:

<list<list<double>>[1]>
[[1]]
<list<double>[0]>

And neither of the following ifelse solutions I have tried seem to be able to detect when this is the case:
tibble |> mutate(emptyval = ifelse((list(NULL) %in% (nestedvals)),1,0))
tibble |> mutate(emptyval = ifelse(is.null((unlist(nestedvals)),1,0)))

So that I can filter them out prior to applying any sort of workflow.

dromano · April 19, 2024, 5:37pm

Thanks, that is helpful to know.

dromano · April 19, 2024, 5:55pm

Do you mean that, say, ID 001 could occur more than once in the full data?

TPDeRamus · April 19, 2024, 6:05pm

It was in reference to the first value in each double of each list actually.

dromano · April 19, 2024, 6:16pm

Are you saying it's possible for and ID and index name combination to occur in multiple rows in the full data? Because otherwise, I believe @FJCC's solution should work.

TPDeRamus · April 19, 2024, 7:56pm

That is unfortunately possible now that I think about it.

Index anyway, value not so much.