JSON Data and bind_rows() / unnest() - Bug Report?

seadoo · September 18, 2024, 10:30pm

Here is the link to my rmarkdown that explains this whole thing: RPubs - bind_rows() and JSON data

Topic:
When confronted with NULL values or list() values present in JSON data converted to a list of lists by httr2, bind_rows() fails resulting in empty tibbles and dropped rows/columns with no warnings or error messages.

This is a deeper dive into a similar (unanswered) post here:

How do I contend with NULLs in nested JSON when converting to df using fromJSON, bind_rows, and map? tidyverse

Hi. I am trying to convert some JSON into a data frame but am having trouble with null values downstructure, which show up as a pairlist of length 0 when coming through jsonlite. Jeroen and @jennybryan had some back and forth here, though I wasn't sure if it had been resolved. For now, I'm using this solution but was wondering if there was a more purrrlicious way to do it. Reprex below is supposed to be minimal, but normally I would get my end result by piping the map result into bind_rows(.id = ticker). library(tidyverse) #> Warning: package 'stringr' was built under R version 3.4.4 library(jsonlite) #> #> Attaching package: 'jsonlite' #> The following object is masked from 'package:purr…

This example data object is based on my actual use case, retrieved using httr2 and resp_body_json().

Sample data:

some_list <- list(
  list(
    id = 001,
    name = "bob",
    age = 23,
    country = NULL,
    items = list("apple", "banana", "pear")
  ),
  list(
    id = 002,
    name = "sam",
    age = NULL,
    country = NULL,
    items = list()
  ),
  list(
    id = 003,
    name = "joe",
    age = NULL,
    country = NULL,
    items = list()
  )
)

Desired (Tidy) Result:

# A tibble: 5 × 5
     id name    age country items 
  <dbl> <chr> <dbl> <lgl>   <chr> 
1     1 bob      23 NA      apple 
2     1 bob      23 NA      banana
3     1 bob      23 NA      pear  
4     2 sam      NA NA      NA    
5     3 joe      NA NA      NA

Code required to achieve result - in my opinion this is excessive and should not be necessary to prevent dropped rows and columns:

some_list |> 
  purrr::map( \(sub) purrr::map(sub, \(i) if (length(i) == 0) NA else i) ) |> 
  dplyr::bind_rows() |> 
  dplyr::mutate(items = purrr::map(items, \(i) if (is.null(i)) NA else i)) |> 
  tidyr::unnest(items)

Summary copy/pasted from my rmarkdown post:

bind_rows should not return an empty tibble when a list() column is present
bind_rows should not drop columns where all are NULL
bind_rows should not replace NA with NULL
unnest() should not drop rows where val is NULL in unnest(val)
Most importantly, if none of these behaviors can change, a warning message should be posted where columns / rows are dropped as a result.

At the very least, allowing a fill argument to bind_rows() would bring its functionality very close to rbindlist() for working with JSON data that have been converted to R list structures.

How are you guys handling JSON data within the tidyverse? Are these expectations unreasonable?

I suppose the argument could be made that the resp_body_json function from httr2 should have an option to replace NULL with NA, knowing that JSON allows for NULL (empty), but not NA (missing).

michaelbgarcia · September 19, 2024, 1:20am

Nice write up and I agree with your takes as I have experienced the same frustrations. My previous solutions are specific to the datasets I encountered so I don't have a one-size-fits-all approach. In some cases when I can control the source JSON, I'm better off modifying how it gets written just to not have to deal with any read problems. In any case, here is a different approach for your data:

some_list <- list(
  list(
    id = 001,
    name = "bob",
    age = 23,
    country = NULL,
    items = list("apple", "banana", "pear")
  ),
  list(
    id = 002,
    name = "sam",
    age = NULL,
    country = NULL,
    items = list()
  ),
  list(
    id = 003,
    name = "joe",
    age = NULL,
    country = NULL,
    items = list()
  )
)

purrr::map_depth(.x = some_list,
                    .depth = 2, 
                    .f = \(x) if(is.null(x)) NA else unlist(x)) |> 
  dplyr::bind_rows()
#> # A tibble: 5 × 5
#>      id name    age country items 
#>   <dbl> <chr> <dbl> <lgl>   <chr> 
#> 1     1 bob      23 NA      apple 
#> 2     1 bob      23 NA      banana
#> 3     1 bob      23 NA      pear  
#> 4     2 sam      NA NA      <NA>  
#> 5     3 joe      NA NA      <NA>

^{Created on 2024-09-18 with reprex v2.1.1}

nirgrahamuk · September 19, 2024, 5:27pm

seadoo:

some_list |> 
  purrr::map( \(sub) purrr::map(sub, \(i) if (length(i) == 0) NA else i) ) |> 
  dplyr::bind_rows() |> 
  dplyr::mutate(items = purrr::map(items, \(i) if (is.null(i)) NA else i)) |> 
  tidyr::unnest(items)

there is a loose repetition in what you have in so far the is.null mapper, could be swapped out for the length(i) mapper, and therefore you can look to make your own function that maps over values, and converts length 0 entries (be they null , or not null) to NA. This would help your code readability, and be repeatable for you, if you make your own helper function. i.e.

null2na <- function(sub){
 purrr::map(sub, \(i) if (length(i) == 0) NA else i) 
}

some_list |> 
  purrr::map(null2na ) |> 
  dplyr::bind_rows() |> 
  dplyr::mutate(items = null2na(items)) |>
  unnest(items)

on further thought unnest has keep_empty argument so

some_list |> 
  purrr::map(null2na ) |> 
  dplyr::bind_rows() |> 
  unnest(items,keep_empty = TRUE)

seadoo · September 19, 2024, 8:42pm

Thank you both for these replies! I like each of these suggestions, and I especially feel silly for not noticing the keep_empty argument for unnest. To incorporate them both, we could do:

some_list |>
  purrr::map_depth(2, \(.x) if(length(.x) == 0) NA else .x) |>
  dplyr::bind_rows() |>
  tidyr::unnest(items, keep_empty = TRUE)

which is a whole lot more concise than my solution in the initial post. This brings us down to near-parity to data.table::rbindlist():

some_list |>
  data.table::rbindlist() |>
  tidyr::unnest(items, keep_empty = TRUE)

I am happy to acknowledge that the issue with unnest was user-error, but I still think that bind_rows should at minimum warn when the NULL-ish values like list() break the function.

With some further testing, bind_rows does correctly handle non-list columns where some are NULL by replacing them with NA by default, but something about list-columns short-circuits its behavior.

Without handling list() values (either length or purrr::is_empty) we just unceremoniously drop rows. You can test this further by changing the items entry in either the sam or joe entry of some_list to NULL rather than list(). In that instance the changed row does make it through bind_rows followed by our corrected unnest(keep_empty = TRUE) call.

I would also argue that bind_rows dropping columns by default is inappropriate, especially since there is no argument to negate that behavior like in unnest. I am guessing this is because a common paradigm for removing columns has been df$col <- NULL so that behavior at least is somewhat known, if not accepted or even good.

Also, not to be too nit-picky, but even the documentation for bind_rows states the following:

This is similar to do.call(rbind, dfs), but the output will contain all columns that appear in any of the inputs.

seadoo · September 20, 2024, 3:18pm

As an update - I have been digging through the source for bind_rows() and it looks like the culprit is as_tibble which is called in bind_rows. I will do more digging and update with my findings as to what/where this is coming from.

Relevant section in bind-rows.R (48:51):

  if (is_named(dots) && !all(map_lgl(dots, dataframe_ish))) {
    # This is hit by map_dfr() so we can't easily deprecate
    return(as_tibble(dots))
  }

I'll keep updating this post as I go through it, but I have gone further down the rabbit hole and here is where we stand currently. as_tibble will dispatch to as_tibble.list, which is below (and eventually calls lst_to_tibble:

as_tibble.list <- function(
    x, 
    validate = NULL, 
    ..., 
    .rows = NULL,
    .name_repair = c("check_unique", "unique", "universal", "minimal")
) {
  if (!is.null(validate)) {
    deprecate_stop("2.0.0", "tibble::as_tibble(validate = )", "as_tibble(.name_repair =)")
  }
  # this is the important line, col_lengths is the kicker
  lst_to_tibble(x, .rows, .name_repair, col_lengths(x))
}

lst_to_tibble <- function(x, .rows, .name_repair, lengths = NULL, call = caller_env()) {
  x <- unclass(x)
  x <- set_repaired_names(x, repair_hint = TRUE, .name_repair, call = call)
  x <- check_valid_cols(x, call = call)
  x <- recycle_columns(x, .rows, lengths)
  x
}

In simple print debugging, the lengths argument to lst_to_tibble() seems to be the deciding factor. That lengths argument comes from the function col_lengths which is used at the end of as_tibble.list. I've tested this below:

test <- list(
  id = 003,
  name = "joe",
  items = list()
)
 lst_to_tibble(test, .rows = NULL, lengths = 1)

# Output:
#   id name items
# 1  3  joe  <NA>

To go further, col_lengths is defined below:

col_lengths <- function(x) {
  map_int(x, vctrs::vec_size)
}

col_lengths(test)
# Output:
#  id name items
#   1    1     0

This in turn gets filtered by guess_nrow which is used in recycle_columns

guess_nrow <- function(lengths, .rows) {
  if (!is.null(.rows)) {
    return(.rows)
  }
  if (is_empty(lengths)) {
    return(0)
  }

  nontrivial_lengths <- lengths[lengths != 1L]
  if (is_empty(nontrivial_lengths)) {
    return(1)
  }

  max(nontrivial_lengths)
}

And it's guess_nrow that ends up returning 0 as the final nrow value that defines the dimensions of the resultant tibble.

So ultimately it's vctrs::vec_size that takes down our JSON data containing list(). I found a semi-recent github issue and it seems like it's desired behavior. So in my opinion the best solution is to improve httr2::resp_body_json (or add a fill argument to bind_rows) to work within the confines of the tibble and the tidyverse.

seadoo · September 20, 2024, 8:49pm

OK, final reply. The issue was with httr2.

Opened a github issue but it looks unlikely to be fixed.

json <- '{
  "results": [
    {
      "id": 1,
      "name": "bob",
      "age": 23,
      "country": null,
      "items": ["apple", "banana", "pear"]
    },
    {
      "id": 2,
      "name": "sam",
      "age": null,
      "country": null,
      "items": []
    },
    {
      "id": 3,
      "name": "joe",
      "age": null,
      "country": null,
      "items": []
    }
  ]
}'

# using default params
jsonlite::fromJSON(json)
#> $results
#>   id name age country               items
#> 1  1  bob  23      NA apple, banana, pear
#> 2  2  sam  NA      NA                    
#> 3  3  joe  NA      NA

# using httr2 params
jsonlite::fromJSON(json, simplifyVector = FALSE)
#> $results
#> $results[[1]]
#> $results[[1]]$id
#> [1] 1
#> 
#> $results[[1]]$name
#> [1] "bob"
#> 
#> $results[[1]]$age
#> [1] 23
#> 
#> $results[[1]]$country
#> NULL
#> 
#> $results[[1]]$items
#> $results[[1]]$items[[1]]
#> [1] "apple"
#> 
#> $results[[1]]$items[[2]]
#> [1] "banana"
#> 
#> $results[[1]]$items[[3]]
#> [1] "pear"
#> 
#> 
#> 
#> $results[[2]]
#> $results[[2]]$id
#> [1] 2
#> 
#> $results[[2]]$name
#> [1] "sam"
#> 
#> $results[[2]]$age
#> NULL
#> 
#> $results[[2]]$country
#> NULL
#> 
#> $results[[2]]$items
#> list()
#> 
#> 
#> $results[[3]]
#> $results[[3]]$id
#> [1] 3
#> 
#> $results[[3]]$name
#> [1] "joe"
#> 
#> $results[[3]]$age
#> NULL
#> 
#> $results[[3]]$country
#> NULL
#> 
#> $results[[3]]$items
#> list()

Notice that we get the same list() and NULL values instead of NA like before. The solution then is to not use httr2::resp_body_json and instead just use jsonlite::fromJSON directly.

system · September 27, 2024, 8:50pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.