Hi! I've solved this problem, but just can't believe there isn't an easier way. So I figured I'd throw it up to the community!
Problem statement: This json contains a list of regions, and the regions/countries they contain, in hierarchy. I need the second entry (geographic hierarchy). I would like to come up with a list of all the regions that apply to a given country. Here's what I tried.
First attempt, to use jq (json query)... but recursion flattens the list and loses the hirearchy. Here's the code:
llibrary(jqr)
library(tidyverse)
jq_try <- url("https://unstats.un.org/SDGAPI/v1/sdg/GeoArea/Tree") %>%
jq(".[1] | .children")
jq_try %>% jq(".[].geoAreaName")
#> [
#> "Africa",
#> "Americas",
#> "Asia",
#> "Antarctica",
#> "Oceania",
#> "Europe"
#> ]
jq_try %>% jq("..|.geoAreaName?") %>% as.list %>% enframe %>% unnest(value)
#> # A tibble: 601 x 2
#> name value
#> <int> <chr>
#> 1 1 "\"Africa\""
#> 2 2 "\"Sub-Saharan Africa\""
#> 3 3 "\"Eastern Africa\""
#> 4 4 "\"British Indian Ocean Territory\""
#> 5 5 "null"
#> 6 6 "\"Burundi\""
#> 7 7 "null"
#> 8 8 "\"Comoros\""
#> 9 9 "null"
#> 10 10 "\"Djibouti\""
#> # ... with 591 more rows
There might be a way to optimize that jq query, but I'm more interested in hearing this community's take on the second approach, using tidyverse and network analysis! What I did was more or less manually dig into the json, renaming fields every step to prevent name collisions. I also manually reconstruct the edge network because I couldn't find any method that digests a full dataframe and infers structure from that. This actually works, and I'll convert it into a function and use this if I can't get anything better:
library(tidyverse)
library(igraph)
#>
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:dplyr':
#>
#> as_data_frame, groups, union
#> The following objects are masked from 'package:purrr':
#>
#> compose, simplify
#> The following object is masked from 'package:tidyr':
#>
#> crossing
#> The following object is masked from 'package:tibble':
#>
#> as_data_frame
#> The following objects are masked from 'package:stats':
#>
#> decompose, spectrum
#> The following object is masked from 'package:base':
#>
#> union
countries_raw <- jsonlite::fromJSON('https://unstats.un.org/SDGAPI/v1/sdg/GeoArea/Tree') %>%
## keep only continental categories:
slice(2) %>%
rename(geocode1 = geoAreaCode, geoName1 = geoAreaName, geoType1 = type) %>% unnest(children, keep_empty = TRUE) %>%
rename(geocode2 = geoAreaCode, geoName2 = geoAreaName, geoType2 = type) %>% unnest(children, keep_empty = TRUE) %>%
rename(geocode3 = geoAreaCode, geoName3 = geoAreaName, geoType3 = type) %>% unnest(children, keep_empty = TRUE) %>%
rename(geocode4 = geoAreaCode, geoName4 = geoAreaName, geoType4 = type) %>% unnest(children, keep_empty = TRUE) %>%
rename(geocode5 = geoAreaCode, geoName5 = geoAreaName, geoType5 = type) %>% unnest(children, keep_empty = TRUE) %>%
rename(geocode6 = geoAreaCode, geoName6 = geoAreaName, geoType6 = type) %>% unnest(children, keep_empty = TRUE)
## now create a structure that can track upwards. First, get rid of first
## level, and codes/types, then group by edgepairs
df.g <- countries_raw %>% select(-contains(c("1", "code", "type")))
df.g <- bind_rows(df.g[,1:2] %>% set_names("from", "to"),
df.g[,2:3] %>% set_names("from", "to"),
df.g[,3:4] %>% set_names("from", "to"),
df.g[,4:5] %>% set_names("from", "to"),
df.g[,5:6] %>% set_names("from", "to")) %>%
na.omit()
df.g <- df.g %>% igraph::graph_from_data_frame()
# library(visNetwork)
# visNetwork::visIgraph(df.g)
vertices <- bfs(graph = df.g, root = "Nigeria",neimode = "in", unreachable = FALSE)$order %>% as.numeric %>% discard(~is.na(.))
V(graph = df.g)[vertices]$name
#> [1] "Nigeria" "Western Africa" "Sub-Saharan Africa"
#> [4] "Africa"
I also thought I could benefit by using squash
to flatten the whole dataframe out... something like this:
library(tidyverse)
library(httr)
countries <- GET("https://unstats.un.org/SDGAPI/v1/sdg/GeoArea/Tree")
countries <- content(countries)
rlang::squash(countries[[2]]) %>% enframe %>%
mutate(value = purrr::map(value, as.character)) %>% unnest(value) %>%
mutate(region = ifelse(value == "Region", lag(value, 1), NA),
country = ifelse(value == "Country", lag(value, 1), NA)) %>%
select(region, country) %>% fill(region) %>% unique %>%
## grab higher levels
mutate(region2 = ifelse(is.na(lag(country, 1)) & is.na(lag(country, 2)), lag(region, 2), NA),
region3 = ifelse(is.na(lag(country, 1)) & is.na(lag(country, 2)), lag(region, 3), NA),
region4 = ifelse(is.na(lag(country, 1)) & is.na(lag(country, 2)) & is.na(lag(country, 3)), lag(region, 4), NA)) %>%
fill(region2, region3, region4) %>%
filter(!is.na(country))
#> Warning: Outer names are only allowed for unnamed scalar atomic inputs
#> # A tibble: 283 x 5
#> region country region2 region3 region4
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Eastern Africa British Indian Ocean Territory Sub-Saharan Af~ Africa World ~
#> 2 Eastern Africa Burundi Sub-Saharan Af~ Africa World ~
#> 3 Eastern Africa Comoros Sub-Saharan Af~ Africa World ~
#> 4 Eastern Africa Djibouti Sub-Saharan Af~ Africa World ~
#> 5 Eastern Africa Eritrea Sub-Saharan Af~ Africa World ~
#> 6 Eastern Africa Ethiopia Sub-Saharan Af~ Africa World ~
#> 7 Eastern Africa French Southern Territories Sub-Saharan Af~ Africa World ~
#> 8 Eastern Africa Kenya Sub-Saharan Af~ Africa World ~
#> 9 Eastern Africa Madagascar Sub-Saharan Af~ Africa World ~
#> 10 Eastern Africa Malawi Sub-Saharan Af~ Africa World ~
#> # ... with 273 more rows
This kinda works, but the whole inferring hierarchies based on NA is super faffy. The above logic is wrong for example. I could work on it and find something that works, but it feels brittle and wrong.
What am I missing?