[unnest/unpack] Completely Flatten Json Data Queried from API

TPDeRamus · January 28, 2025, 11:44pm

Hi Posit Users.

I have a list of NPI numbers I'm querying from an API that pulls provider information:
https://npiregistry.cms.hhs.gov/api-page

I use workflow something like this for each pull:

library(httr)
library(jsonlite)
library(tidyverse)

out = fromJSON(rawToChar(GET("https://npiregistry.cms.hhs.gov/api/?version=2.1", query = list(number = 1992975965))$content), flatten = TRUE, simplifyDataFrame = TRUE)

Which generates a list output of an integer(result_count) and a dataframe(results), the latter of which being what I need.

However, this dataframe has quite a few list variables:

sapply(out$results,class)
         created_epoch       enumeration_type     last_updated_epoch                 number              addresses 
           "character"            "character"            "character"            "character"                 "list" 
     practiceLocations             taxonomies            identifiers              endpoints            other_names 
                "list"                 "list"                 "list"                 "list"                 "list" 
      basic.first_name        basic.last_name      basic.middle_name       basic.credential  basic.sole_proprietor 
           "character"            "character"            "character"            "character"            "character" 
          basic.gender basic.enumeration_date     basic.last_updated           basic.status      basic.name_prefix 
           "character"            "character"            "character"            "character"            "character" 
     basic.name_suffix 
           "character"

Which I would like to flatten as much as possible.

The following does not work due to what I assume is name recycling:

out$results |> unnest_longer(where(is.list))
Error in `unnest_longer()`:
! In row 1, can't recycle input of size 2 to size 0.
Run `rlang::last_trace()` to see where the error occurred.

pivot_longer works and I don't mind the extra rows, but it has the unfortunate side effect of creating a variable called value that is also a list:

sapply(out$results |> pivot_longer(where(is.list)),class)
         created_epoch       enumeration_type     last_updated_epoch                 number       basic.first_name 
           "character"            "character"            "character"            "character"            "character" 
       basic.last_name      basic.middle_name       basic.credential  basic.sole_proprietor           basic.gender 
           "character"            "character"            "character"            "character"            "character" 
basic.enumeration_date     basic.last_updated           basic.status      basic.name_prefix      basic.name_suffix 
           "character"            "character"            "character"            "character"            "character" 
                  name                  value 
           "character"                 "list"

With this particular variable completely unable to be flattened despite my best efforts:

> out$results |> pivot_longer(where(is.list)) |> unnest_longer(where(is.list))
Error in `col_to_long()`:
! Can't combine `..1$value` <data.frame> and `..3$value` <list>.
Run `rlang::last_trace()` to see where the error occurred.

> out$results |> pivot_longer(where(is.list)) |> pivot_longer(where(is.list))
Error in `pivot_longer()`:
! Names must be unique.
✖ These names are duplicated:
  * "name" at locations 16 and 17.
ℹ Use argument `names_repair` to specify repair strategy.
Run `rlang::last_trace()` to see where the error occurred.

> out$results |> pivot_longer(where(is.list)) |> unnest(where(is.list))
Error in `list_unchop()`:
! Can't combine `x[[1]]` <data.frame> and `x[[2]]` <list>.
Run `rlang::last_trace()` to see where the error occurred.

And while unpack will run, the variable value remains a list and seems to behave as if nothing happened:

> sapply(out$results |> pivot_longer(where(is.list)) |> unpack(where(is.list)),class)
         created_epoch       enumeration_type     last_updated_epoch                 number       basic.first_name 
           "character"            "character"            "character"            "character"            "character" 
       basic.last_name      basic.middle_name       basic.credential  basic.sole_proprietor           basic.gender 
           "character"            "character"            "character"            "character"            "character" 
basic.enumeration_date     basic.last_updated           basic.status      basic.name_prefix      basic.name_suffix 
           "character"            "character"            "character"            "character"            "character" 
                  name                  value 
           "character"                 "list"

Would anyone be able to tell me what I am doing incorrectly with this call?

Thank you in advance!

margusl · January 29, 2025, 4:20am

If you take a look at their CSV (i.e. flat file) - JSON mapping - NPPES NPI Registry - it's all just spread wider into individual columns, single row per record. As dataframe simplification in jsonlite rather increases row count, it might be easier to just disable all simplifications and handle rectangling yourself.

For a deeper structure it might make sense to approach this recursively, but here we can just call unnest_wider() few times, taking a glimpse after each step to see how column count in resulting frame increases:

library(httr)
library(tidyverse)

GET("https://npiregistry.cms.hhs.gov/api/?version=2.1", query = list(number = 1992975965)) |> 
  # additional arguments are passed to parser, in this case to jsonlite::fromJSON()
  content(as = "parsed", simplifyVector = FALSE) |> 
  pluck("results") |> 
  tibble(results = _) |> 
  unnest_wider(results) |> 
  glimpse() |> 
  unnest_wider(where(is.list), names_sep = "_") |> 
  glimpse() |> 
  unnest_wider(where(is.list), names_sep = "_") |> 
  glimpse()
#> Rows: 1
#> Columns: 11
#> $ created_epoch      <chr> "1204815636000"
#> $ enumeration_type   <chr> "NPI-1"
#> $ last_updated_epoch <chr> "1489606036000"
#> $ number             <chr> "1992975965"
#> $ addresses          <list> [["US", "United States", "MAILING", "DOM", "409 FOR…
#> $ practiceLocations  <lgl> NA
#> $ basic              <list> ["ELIZABETH", "BARR", "ANN", "LICSW", "YES", "F", "…
#> $ taxonomies         <list> [["101YA0400X", "", "Counselor, Addiction (Substanc…
#> $ identifiers        <lgl> NA
#> $ endpoints          <lgl> NA
#> $ other_names        <list> [["Former Name", "1", "LICSW", "ELIZABETH", "SLOTE"…

#> Rows: 1
#> Columns: 23
#> $ created_epoch          <chr> "1204815636000"
#> $ enumeration_type       <chr> "NPI-1"
#> $ last_updated_epoch     <chr> "1489606036000"
#> $ number                 <chr> "1992975965"
#> $ addresses_1            <list> ["US", "United States", "MAILING", "DOM", "409 …
#> $ addresses_2            <list> ["US", "United States", "LOCATION", "DOM", "409…
#> $ practiceLocations      <lgl> NA
#> $ basic_first_name       <chr> "ELIZABETH"
#> $ basic_last_name        <chr> "BARR"
#> $ basic_middle_name      <chr> "ANN"
#> $ basic_credential       <chr> "LICSW"
#> $ basic_sole_proprietor  <chr> "YES"
#> $ basic_gender           <chr> "F"
#> $ basic_enumeration_date <chr> "2008-03-06"
#> $ basic_last_updated     <chr> "2017-03-15"
#> $ basic_status           <chr> "A"
#> $ basic_name_prefix      <chr> "Mrs."
#> $ basic_name_suffix      <chr> "--"
#> $ taxonomies_1           <list> ["101YA0400X", "", "Counselor, Addiction (Subst…
#> $ taxonomies_2           <list> ["1041C0700X", "", "Social Worker, Clinical", "…
#> $ identifiers            <lgl> NA
#> $ endpoints              <lgl> NA
#> $ other_names_1          <list> ["Former Name", "1", "LICSW", "ELIZABETH", "SL…

#> Rows: 1
#> Columns: 56
#> $ created_epoch                <chr> "1204815636000"
#> $ enumeration_type             <chr> "NPI-1"
#> $ last_updated_epoch           <chr> "1489606036000"
#> $ number                       <chr> "1992975965"
#> $ addresses_1_country_code     <chr> "US"
#> $ addresses_1_country_name     <chr> "United States"
#> $ addresses_1_address_purpose  <chr> "MAILING"
#> $ addresses_1_address_type     <chr> "DOM"
#> $ addresses_1_address_1        <chr> "409 FORTUNE BLVD"
#> $ addresses_1_address_2        <chr> "SUITE 101"
#> $ addresses_1_city             <chr> "MILFORD"
#> $ addresses_1_state            <chr> "MA"
#> $ addresses_1_postal_code      <chr> "017571741"
#> $ addresses_1_telephone_number <chr> "508-259-9282"
#> $ addresses_2_country_code     <chr> "US"
#> $ addresses_2_country_name     <chr> "United States"
#> $ addresses_2_address_purpose  <chr> "LOCATION"
#> $ addresses_2_address_type     <chr> "DOM"
#> $ addresses_2_address_1        <chr> "409 FORTUNE BLVD"
#> $ addresses_2_address_2        <chr> "SUITE 101"
#> $ addresses_2_city             <chr> "MILFORD"
#> $ addresses_2_state            <chr> "MA"
#> $ addresses_2_postal_code      <chr> "017571741"
#> $ addresses_2_telephone_number <chr> "508-259-9282"
#> $ practiceLocations            <lgl> NA
#> $ basic_first_name             <chr> "ELIZABETH"
#> $ basic_last_name              <chr> "BARR"
#> $ basic_middle_name            <chr> "ANN"
#> $ basic_credential             <chr> "LICSW"
#> $ basic_sole_proprietor        <chr> "YES"
#> $ basic_gender                 <chr> "F"
#> $ basic_enumeration_date       <chr> "2008-03-06"
#> $ basic_last_updated           <chr> "2017-03-15"
#> $ basic_status                 <chr> "A"
#> $ basic_name_prefix            <chr> "Mrs."
#> $ basic_name_suffix            <chr> "--"
#> $ taxonomies_1_code            <chr> "101YA0400X"
#> $ taxonomies_1_taxonomy_group  <chr> ""
#> $ taxonomies_1_desc            <chr> "Counselor, Addiction (Substance Use Diso…
#> $ taxonomies_1_state           <chr> "MA"
#> $ taxonomies_1_license         <chr> "114828"
#> $ taxonomies_1_primary         <lgl> FALSE
#> $ taxonomies_2_code            <chr> "1041C0700X"
#> $ taxonomies_2_taxonomy_group  <chr> ""
#> $ taxonomies_2_desc            <chr> "Social Worker, Clinical"
#> $ taxonomies_2_state           <chr> "MA"
#> $ taxonomies_2_license         <chr> "114828"
#> $ taxonomies_2_primary         <lgl> TRUE
#> $ identifiers                  <lgl> NA
#> $ endpoints                    <lgl> NA
#> $ other_names_1_type           <chr> "Former Name"
#> $ other_names_1_code           <chr> "1"
#> $ other_names_1_credential     <chr> "LICSW"
#> $ other_names_1_first_name     <chr> "ELIZABETH"
#> $ other_names_1_last_name      <chr> "SLOTE"
#> $ other_names_1_middle_name    <chr> "ANN"

TPDeRamus · January 29, 2025, 5:50pm

This worked quite splendidly, though it does produce a lot of columns.

Thank you!

system · February 5, 2025, 5:51pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.