Converting Nested JSON to DataFrame in R?

akramh · December 19, 2023, 10:36am

I'm currently working on a project where I need to convert a nested JSON structure into a DataFrame using R. I'm facing some issues with the current approach, and I'd appreciate any help or guidance on how to properly handle this conversion.

Json file looks like this :

json_data <- '{
  "resourceType": "QuestionnaireResponse",
  "id": "example-questionnaireresponse",
  "questionnaire": "Questionnaire/example",
  "status": "completed",
  "subject": {
    "reference": "Patient/example"
  },
  "authored": "2023-12-19T12:00:00Z",
  "source": {
    "reference": "Patient/example"
  },
  "item": [
    {
      "linkId": "page1",
      "text": "Page 1",
      "item": [
        {
          "linkId": "1.1",
          "text": "Quel est votre nom?",
          "answer": [
            {
              "valueString": "Participant 1"
            },
            {
              "valueString": "Participant 2"
            }
          ]
        },
        {
          "linkId": "1.2",
          "text": "Quel est votre âge?",
          "answer": [
            {
              "valueInteger": 30
            },
            {
              "valueInteger": 25
            }
          ]
        },
        {
          "linkId": "1.3",
          "text": "Quel est votre lieu de résidence?",
          "answer": [
            {
              "valueString": "Ville A"
            },
            {
              "valueString": "Ville B"
            }
          ]
        }
      ]
    },
    {
      "linkId": "page2",
      "text": "Page 2",
      "item": [
        {
          "linkId": "2.2",
          "text": "Nombre dannées dexpérience?",
          "answer": [
            {
              "valueInteger": 5
            },
            {
              "valueInteger": 2
            }
          ]
        },
        {
          "linkId": "2.3",
          "text": "Êtes-vous heureux?",
          "answer": [
            {
              "valueBoolean": true
            },
            {
              "valueBoolean": false
            }
          ]
        }
      ]
    }
  ]
}'

The json file is the response of two partipants to a questionnaire! I'd like a dataframe where each row contains the responses of a participants while a column contains the responses to a specific question.

my approach :

df <- fromJSON(json_data) %>%
  as_tibble() %>%
  unnest_wider(item) %>%
  unnest_wider(item) %>%
  unnest_wider(answer) %>%
  select(linkId, text, valueString, valueInteger, valueBoolean)

which obviosuly doesnt work ..

How can I convert a FHIR json file structure to dataframe? an unnested dataframe at that.

AlexisW · December 20, 2023, 3:44pm

Let's give it a try. My impression is that all the data you care about is actually in item/item, the other fields are general information about the questionnaire that can be processed separately. So I would extract this part (there might be a better approach):

fromJSON(json_data)[["item"]][["item"]]
#> [[1]]
#>   linkId                              text                       answer
#> 1    1.1               Quel est votre nom? Participant 1, Participant 2
#> 2    1.2               Quel est votre âge?                       30, 25
#> 3    1.3 Quel est votre lieu de résidence?             Ville A, Ville B
#> 
#> [[2]]
#>   linkId                        text      answer
#> 1    2.2 Nombre dannées dexpérience?        5, 2
#> 2    2.3          Êtes-vous heureux? TRUE, FALSE

At this point, two problems: first, we have two data frames, corresponding to the 2 pages, we can use bind_rows() to group them while keeping the columns intact.

fromJSON(json_data)[["item"]][["item"]] |>
  bind_rows()
#>   linkId                              text                       answer
#> 1    1.1               Quel est votre nom? Participant 1, Participant 2
#> 2    1.2               Quel est votre âge?                       30, 25
#> 3    1.3 Quel est votre lieu de résidence?             Ville A, Ville B
#> 4    2.2       Nombre dannées dexpérience?                         5, 2
#> 5    2.3                Êtes-vous heureux?                  TRUE, FALSE

Second, while the linkId and text columns look fine, the answer column has all the answers. It's not obvious here, but it's actually a data.frame of answers. So here, a tempting approach is to remove linkId that we don't really care about, and unnest():

fromJSON(json_data)[["item"]][["item"]] |>
  bind_rows() |>
  select(-linkId) |>
  unnest(answer)
#> # A tibble: 10 × 4
#>    text                              valueString   valueInteger valueBoolean
#>    <chr>                             <chr>                <int> <lgl>       
#>  1 Quel est votre nom?               Participant 1           NA NA          
#>  2 Quel est votre nom?               Participant 2           NA NA          
#>  3 Quel est votre âge?               <NA>                    30 NA          
#>  4 Quel est votre âge?               <NA>                    25 NA          
#>  5 Quel est votre lieu de résidence? Ville A                 NA NA          
#>  6 Quel est votre lieu de résidence? Ville B                 NA NA          
#>  7 Nombre dannées dexpérience?       <NA>                     5 NA          
#>  8 Nombre dannées dexpérience?       <NA>                     2 NA          
#>  9 Êtes-vous heureux?                <NA>                    NA TRUE        
#> 10 Êtes-vous heureux?                <NA>                    NA FALSE

Well, that didn't fully work: the different answers are different types and can't be directly combined.

My suggestion here is not the simplest, but it works: you want to transpose so that rows become columns and columns become rows, but you need to do that on a list, and then you get a list where each element is a column containing a text field and an answer field, the text field is the title of the new column.

fromJSON(json_data)[["item"]][["item"]] |>
  bind_rows() |>
  select(-linkId) |>
  as.list() |>
  list_transpose()
#> [[1]]
#> [[1]]$text
#> [1] "Quel est votre nom?"
#> 
#> [[1]]$answer
#>     valueString
#> 1 Participant 1
#> 2 Participant 2
#> 
#> 
#> [[2]]
#> [[2]]$text
#> [1] "Quel est votre âge?"
#> 
#> [[2]]$answer
#>   valueInteger
#> 1           30
#> 2           25
#> 
#> 
#> [[3]]
#> [[3]]$text
#> [1] "Quel est votre lieu de résidence?"
#> 
#> [[3]]$answer
#>   valueString
#> 1     Ville A
#> 2     Ville B
#> 
#> 
#> [[4]]
#> [[4]]$text
#> [1] "Nombre dannées dexpérience?"
#> 
#> [[4]]$answer
#>   valueInteger
#> 1            5
#> 2            2
#> 
#> 
#> [[5]]
#> [[5]]$text
#> [1] "Êtes-vous heureux?"
#> 
#> [[5]]$answer
#>   valueBoolean
#> 1         TRUE
#> 2        FALSE

We can build the final data frame like this:

fromJSON(json_data)[["item"]][["item"]] |>
  bind_rows() |>
  select(-linkId) |>
  as.list() |>
  list_transpose() |>
  map_dfc(~ tibble("{.x$text}" := .x$answer))
#> # A tibble: 2 × 5
#>   Quel est votre nom?$valueStrin…¹ Quel est votre âge?$…² Quel est votre lieu …³
#>   <chr>                                             <int> <chr>                 
#> 1 Participant 1                                        30 Ville A               
#> 2 Participant 2                                        25 Ville B               
#> # ℹ abbreviated names: ¹`Quel est votre nom?`$valueString,
#> #   ²`Quel est votre âge?`$valueInteger,
#> #   ³`Quel est votre lieu de résidence?`$valueString
#> # ℹ 2 more variables: `Nombre dannées dexpérience?` <df[,1]>,
#> #   `Êtes-vous heureux?` <df[,1]>

You may still need to clean up a bit the column names

system · January 31, 2024, 3:45pm

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.