Error while parsing a json file and converting to a table format

mtoufiq · January 4, 2024, 9:33pm

Hi,

I have a json file, I am interested in converting the json to a table version in the tsv file format. Can you please help me with converting from json to tsv file using R studio. From my knowledge, extracting the gene_name, brief_summary, and scores for all the 15 statements could be an issue here because they are not stored in the consistent manner in the json file exported from a different tool. Earlier, to convert the same I got assistance from Posit community but that was just a test json file I had at that time. Now, we have a real time json file, and I am facing issues converting the same and below R code doesn't work on that.

Probably, the reason for the error is the different patterns detected in the file, for instance, some scores are stored in json file like below or sometimes missing. Furthermore, "brief_summary" also looks a bit unstructured. Is there a way to flexibly parse this json file, and store in the table format? "brief_summary" column can be ignored from the table too if that is too long to print.

\"evidence_scores\"
\"scores\"
\"associations\"
\"statements\"
\"associations\"
\"gene_associations\"
\"statements_scores\"

"AGL": {
  "data-turbo_1": [
    "```json",
    "{",
    "  \"gene_name\": \"AGL\",",
    "  \"brief_summary\": \"AGL (Amylo-Alpha-1, 6-Glucosidase, 4-Alpha-Glucanotransferase) is a gene .......\",",
    "  \"evidence_scores\": {",
    
    
"AGO2": {
  "data-turbo_1": [
    "```json",
    "{",
    "  \"gene_name\": \"AGO2\",",
    "  \"brief_summary\": \"AGO2, also known as Argonaute 2, is a protein-coding gene ........\"",
    "}",
    "",
    "{",

Input json file:

Json file deposited in Google Drive.

R Code

library(jsonlite)
library(rrapply)

# Step 1
jsonToList <- jsonlite::fromJSON(txt = "./Test_v1.json")

# Step 2
extractJson <- lapply(jsonToList, function(x){
  # Step 2a
  markBeginEnd <- which(x[[1]] %in% c("```json", "```"))
  # Step 2b
  getJson <- x[[1]][(markBeginEnd[1]+1):(markBeginEnd[2]-1)]
  # Step 2c
  tempdf <- jsonlite::fromJSON(paste0(getJson, collapse = "\n"))
  names(tempdf) <- c("gene_name", "brief_summary", "scores")
  tempdf
})
# Step 3
finalDF <- rrapply::rrapply(object = extractJson, how = "bind")

Expected out table:

The output should contain 17 columns [gene_name, "brief_summary", followed by 15 score columns]

Thank you very much.

Best Regards,
Toufiq

technocrat · January 4, 2024, 10:21pm

I haven't tried to run your code because I think the file needs to be converted before going into whatever extraction process you're looking for. Take a look at how this works

library(jsonlite)
intake <- read_json("Test v1.json")
validate(intake) # FALSE
class(intake) # Its typeof list
myjson <- toJSON(mtcars)
class(myjson) # typeof json
validate(myjson) #TRUE
retry <- toJSON(intake)
class(retry)
validate(retry)  # TRUE

mtoufiq · January 4, 2024, 11:12pm

@technocrat thank you for the response.

I ran the below code on the json file I am working with it is of typeof list as you mentioned, shall I explore and re-run the tool that printed this file or there is a way to convert this file and obtain the table I am expecting?

library(jsonlite)
intake <- read_json("./Test_v1.json")
validate(intake) # FALSE
class(intake) # Its typeof list
retry <- toJSON(intake)
class(retry)
validate(retry)  # TRUE

technocrat · January 4, 2024, 11:48pm

If you have the functionality it would be far better to export from the source app as CSV. While the data may still need touchup, it is far easier to do by hand, if needed. I can't recommend trying to edit JSON. It's just too deeply nested. You'd have to write a parser.

mtoufiq · January 4, 2024, 11:59pm

@technocrat I enquired, it seems, I might not have a complete access to the source app. I have a dozen of these file types, and need to convert to table form either *.tsv or a *.csv file. Is there any other way? Thank you.

technocrat · January 5, 2024, 12:02am

I'll take a look at the list object and see what I can do

mtoufiq · January 5, 2024, 12:54am

@technocrat thank you very much for looking into this.

technocrat · January 5, 2024, 5:39am

I think what you are looking for is buried three deep.

json_object <- toJSON(intake, auto_unbox = TRUE)

will yield a valid json. Can you work with that?

mtoufiq · January 5, 2024, 12:42pm

@technocrat thanks. Yes, its very deep. I did convert to json, but still finding a way to converting to a table, and experiencing difficulty. As a last resort, I was thinking if there a way to perform pattern matching from this json object: For instance, Store AAK1 value in column A and then store these 15 scoring statements
across columns by pattern or string matching in *csv file because this seems to a flexible way of doing since some of the header or hierarchy is missing in the json object (example below).

technocrat · January 6, 2024, 12:37am

Ok, some progress noted below in the reprex. Why don't you replicate for the variables identified below in the 3,4, and 6 positions, and I'll take 5, which is buried one deeper. Then we need to see if we get the same number of rows from each variable, fill out to the length of the longest, stuff them in a vector. Next will be iterating through all the top level stuff hoping that everything is wrong in a consistent way. If so, we can the do a lapply to corral everything in the receiver data frame.

# libraries
library(jsonlite)
# data
gist <- "https://gist.githubusercontent.com/technocrat/7c6602553011f31dabddf1f12077f6a2/raw/1eb5fb0e673df107844d8048d81aede5568f9f85/test_json.json"
intake <- read_json(gist)
# functions (others to be added as developed)
unbox <- function(x) toJSON(intake[x][[1]][[1]], auto_unbox = TRUE)

# pick one item for development
test_case <- unbox(1)

# dealing with a single element shows that json
# is no longer working; it appears that an attempt
# was made to convert a data frame or csv object
# to json and w2as unsuccesful


subject <- fromJSON(test_case)
# the record is now just a 25-element character vector
str(subject)
#>  chr [1:25] "```json" "{" "  \"gene_name\": \"AAK1\"," ...

# what's there
sapply(subject,print)[1] |> unlist()
#> [1] "```json"
#> [1] "{"
#> [1] "  \"gene_name\": \"AAK1\","
#> [1] "  \"brief_summary\": \"AAK1 (AP2 associated kinase 1) is a gene that encodes a protein kinase involved in endocytosis, a process by which cells internalize molecules and particles from their environment. The AAK1 protein regulates the binding of the AP2 complex to clathrin and the plasma membrane, which is crucial for the formation of clathrin-coated vesicles. This process is important for the internalization of proteins and receptors from the cell surface and their subsequent recycling or degradation.\","
#> [1] "  \"evidence_scores\": {"
#> [1] "    \"This gene is specifically associated with the test case\": 1,"
#> [1] "    \"This gene is specifically associated with the biology of cells\": 1,"
#> [1] "    \"This gene is specifically associated with the biology of proteins\": 1,"
#> [1] "    \"This gene is specifically associated with the biology of metabolites\": 1,"
#> [1] "    \"This gene is involved in mediating the induction\": 2,"
#> [1] "    \"This gene is implicated in mediated damage\": 2,"
#> [1] "    \"This gene is involved in mediating attachment\": 3,"
#> [1] "    \"This gene is implicated in the process of replication\": 2,"
#> [1] "    \"This gene is involved in mediating entry\": 3,"
#> [1] "    \"This gene is implicated in the impairment\": 2,"
#> [1] "    \"This gene is involved in conferring drug resistance\": 2,"
#> [1] "    \"This gene is involved in mediating the response to blockage\": 3,"
#> [1] "    \"This gene is involved in mediating the initiation of cells\": 2,"
#> [1] "    \"This gene is involved in mediating the blockage\": 4,"
#> [1] "    \"This gene is involved in mediating the priming of the cells\": 3"
#> [1] "  }"
#> [1] "}"
#> [1] "```"
#> [1] ""
#> [1] "Please note that the scores provided are not based on extensive scientific evidence but are rather indicative and might be speculative. For an accurate assessment, a thorough review of scientific literature is required. The scores are meant to reflect a general sense of the gene's association with specific biological processes or disease mechanisms based on current knowledge, which may be limited."
#>   ```json 
#> "```json"

# on inspection, the there are five chunks

subject[3][[1]] # first variable
#> [1] "  \"gene_name\": \"AAK1\","
subject[4][[1]] # second variable
#> [1] "  \"brief_summary\": \"AAK1 (AP2 associated kinase 1) is a gene that encodes a protein kinase involved in endocytosis, a process by which cells internalize molecules and particles from their environment. The AAK1 protein regulates the binding of the AP2 complex to clathrin and the plasma membrane, which is crucial for the formation of clathrin-coated vesicles. This process is important for the internalization of proteins and receptors from the cell surface and their subsequent recycling or degradation.\","
# this one may have a hidden numeric variable deeper in
subject[5][[1]] # third variable
#> [1] "  \"evidence_scores\": {"
# fourth variable
# this is the one with a missing label, such as note
subject[6][[1]]
#> [1] "    \"This gene is specifically associated with the test case\": 1,"

# approach
# create a receiver data frame object to hold in
# a structured way the data as its unravelled

# this will be used later, once we know which
# rows might need to be filled out with NA
receiver <- data.frame(
  gene    = as.character(),
  summary = as.character(),
  scores  = as.character(),
  notes   = as.character()
)

# Don't worry about the names yet, wait until time
# for writing up results to make a presentation
# table

# test: extract the gene column

subject[3]
#> [1] "  \"gene_name\": \"AAK1\","

# throw away the column name and unearth 
# the first target

(step1 <- strsplit(subject[3],":")[[1]][2])
#> [1] " \"AAK1\","
(step2 <- gsub("[^A-Z0-9]+",",",step1))
#> [1] ",AAK1,"
(step3 <- gsub(",","",step2))
#> [1] "AAK1"


# demonstrate a function
get_gene <- function(x) {
  step1 = strsplit(subject[x],":")[[1]][2]
  step2 = gsub("[^A-Z0-9]+",",",step1)
  step3 = gsub(",","",step2)
  return(step3)
  }

# proof of concept
get_gene(3)
#> [1] "AAK1"

# parametize

# identify positions of variables in the lowest-most list

pick_list <- c(3,4:6)

# tweak function

get_gene <- function(x,y) {
  step1 = strsplit(x[y],":")[[1]][2]
  step2 = gsub("[^A-Z0-9]+",",",step1)
  step3 = gsub(",","",step2)
  return(step3)
}

# test it
get_gene(subject,3)
#> [1] "AAK1"

^{Created on 2024-01-05 with reprex v2.0.2}

mtoufiq · January 6, 2024, 2:28am

@technocrat thank you for your efforts. I am not very proficient with this, but let me give a try.

technocrat · January 6, 2024, 3:19am

We’ll keep after if. You’ve got the advantage of knowing what the data is supposed to look. Between us we’ll get there.

technocrat · January 6, 2024, 4:29am

This is working now. We'll do vectorizing next when you've got the summary and scores

# test it
oddball <- get_gene(subject,4) # doesn't parse

# look at it unprocessed

subject[6]

# split on :
strsplit(subject[6],":")
# now we see that we need an additional variable in
# addition to note, which is a numeric possibly 
# representing another gene?

# recreate receiver to add

receiver <- data.frame(
  gene    = as.character(),
  summary = as.character(),
  scores  = as.character(),
  notes   = as.character(),
  addn    = as.numeric()
)

# step through

get_note <- function(subject) {
  step1 = strsplit(subject[6],":")
  step2 = lapply(step1,trimws)
  note = gsub('^\"|\"$', '', step2[[1]][1])
  return(note)
}

# make into a function

get_addn <- function(subject) {
  step1 = strsplit(subject[6],":")
  step2 = lapply(step1,trimws)
  addn = as.numeric(gsub(".$","",step2[[1]][2]))
  return(addn)
}

get_addn(subject)

mtoufiq · January 6, 2024, 7:22pm

@technocrat I want to extend my sincere thanks for your ongoing support and assistance with this query we've been working on.

I must share that, despite my best efforts, I have not yet been successful in fulfilling the task as per the requirements. This challenge is somewhat new to me, and my current level of expertise in this area is still evolving. I appreciate your understanding as I navigate through this learning process.

I remain committed to resolving this matter..

vedoa · January 7, 2024, 9:06am

Hi @mtoufiq ,

this would be my approach

library(jsonlite)
library(rrapply)

# Step 1
jsonToList <- jsonlite::fromJSON(txt = "./Test_v1.json")

# Step 2
extractJson <- lapply(jsonToList, function(x){
  # Step 2c
  tempdf = tryCatch({
    # Step 2a
    markBeginEnd <- which(x[[1]] %in% c("```json", "```"))
    # Step 2b
    getJson <- x[[1]][(markBeginEnd[1]+1):(markBeginEnd[2]-1)]
    # Step 2c
    tempdf <- jsonlite::fromJSON(paste0(getJson, collapse = "\n"))
    names(tempdf) <- c("gene_name", "brief_summary", "scores")
    tempdf
  }, error = function(e) {
    "a problem occured"
  })
})
# extract the ones where something is not working
getProblemEntries <- unlist(lapply(extractJson, 
                             function(x){
                               if(class(x) == "character"){
                                 x <- "could not parse mark up (inner) json"
                                 return(x)
                               }else if(length(x) > 3){
                                 x <- "would have more then 3 Columns"
                                 return(x)
                               }else{
                                 # everything is as expected
                                 return(NULL)
                               }
                               }))
# print reason for problem
cat(paste0(names(getProblemEntries), " -> ", getProblemEntries, "\n"))
# AGO2 -> could not parse mark up (inner) json
# AKAP6 -> could not parse mark up (inner) json
# AKIRIN2 -> could not parse mark up (inner) json
# ANOS1 -> could not parse mark up (inner) json
# APEX1 -> could not parse mark up (inner) json
# ARHGAP26 -> would have more then 3 Columns
# ATF3 -> could not parse mark up (inner) json
# ATF7IP2 -> would have more then 3 Columns

# delete the problem entries from list 
extractJsonWhereItWorks <- extractJson[!(names(extractJson) %in% names(getProblemEntries))]
# Step 3
finalDF <- rrapply::rrapply(object = extractJsonWhereItWorks, how = "bind")

So the above code does the same as you wanted but if an error occurs it gives back a string "a problem occured". The vector getProblemEntries contains all the entries where you either have a problem while parsing the json or the json contains more then 3 entries (which is also a problem since you expect only 3 columns).

These 9 entries have to be fixed. On the flip side you would have the rest (179 entries) in the dataframe as intended.

Hope it helps you investigate further and gives at least a partial result since 95% of the data is extracted.

Only gene, summary and scores are extracted - the rest is ignored.

technocrat · January 7, 2024, 9:25am

Is this the output you're looking for each of the 188?

# libraries
library(jsonlite)
# data
gist <- "https://gist.githubusercontent.com/technocrat/7c6602553011f31dabddf1f12077f6a2/raw/1eb5fb0e673df107844d8048d81aede5568f9f85/test_json.json"
intake <- read_json(gist)

# functions
get_addn <- function(x) {
  step1 = strsplit(subject[x],":")
  step2 = lapply(step1,trimws)
  addn = as.numeric(gsub(".$","",step2[[1]][2]))
  return(addn)
}

get_note <- function(x) {
  step1 = strsplit(subject[x],":")
  step2 = lapply(step1,trimws)
  note = gsub("^.|.$","",step2[[1]][1])
  return(note)
}

get_var <- function(x) {
  step1 = strsplit(subject[x],":")[[1]][2]
  step2 = gsub("[^A-Z0-9]+",",",step1)
  step3 = gsub(",","",step2)
  return(step3)
}

populate_receiver <- function(x) {
  new_row = data.frame(
    gene     = get_var(3),
    summary  = get_var(4),
    notes6   = get_note(6),
    addn6    = get_addn(6),
    note7    = get_note(7),
    addn7    = get_addn(7),
    notes8   = get_note(8),
    addn8    = get_addn(8),
    notes9   = get_note(9),
    addn9    = get_addn(9),
    notes10  = get_note(10),
    addn10   = get_addn(10),
    notes11  = get_note(11),
    addn11   = get_addn(11),
    notes12  = get_note(12),
    addn12   = get_addn(12),
    notes13  = get_note(13),
    addn13   = get_addn(13),
    notes14  = get_note(14),
    addn14   = get_addn(14),
    notes15  = get_note(15),
    addn15   = get_addn(15),
    notes16  = get_note(16),
    addn16   = get_addn(16),
    notes17  = get_note(17),
    addn17   = get_addn(17),
    notes18  = get_note(18),
    addn18   = get_addn(18),
    notes19  = get_note(19),
    addn19   = get_addn(19),
    notes20  = get_note(20),
    addn20   = get_addn(20)
  )
  return(new_row)
}

unbox <- function(x) toJSON(intake[x][[1]][[1]], auto_unbox = TRUE)

# pick one item for development
test_case <- unbox(1)

# unpack
subject <- fromJSON(test_case)

# create receiver to add processed items

receiver <- data.frame(
  gene    = as.character(),
  summary = as.character(),
  notes6 = as.character(),
  addn6 = as.numeric(),
  notes7 = as.character(),
  addn7 = as.numeric(),
  notes8 = as.character(),
  addn8 = as.numeric(),
  notes9 = as.character(),
  addn9 = as.numeric(),
  notes10 = as.character(),
  addn10 = as.numeric(),
  notes11 = as.character(),
  addn11 = as.numeric(),
  notes12 = as.character(),
  addn12 = as.numeric(),
  notes13 = as.character(),
  addn13 = as.numeric(),
  notes14 = as.character(),
  addn14 = as.numeric(),
  notes15 = as.character(),
  addn15 = as.numeric(),
  notes16 = as.character(),
  addn16 = as.numeric(),
  notes17 = as.character(),
  addn17 = as.numeric(),
  notes18 = as.character(),
  addn18 = as.numeric(),
  notes19 = as.character(),
  addn19 = as.numeric(),
  notes20 = as.character(),
  addn20 = as.numeric()
)

for (i in 1) receiver = rbind(receiver, populate_receiver(i))

receiver
#>   gene           summary
#> 1 AAK1 AAK1AP21TAAK1AP2T
#>                                                    notes6 addn6
#> 1 This gene is specifically associated with the test case     1
#>                                                            note7 addn7
#> 1 This gene is specifically associated with the biology of cells     1
#>                                                              notes8 addn8
#> 1 This gene is specifically associated with the biology of proteins     1
#>                                                                 notes9 addn9
#> 1 This gene is specifically associated with the biology of metabolites     1
#>                                            notes10 addn10
#> 1 This gene is involved in mediating the induction      2
#>                                      notes11 addn11
#> 1 This gene is implicated in mediated damage      2
#>                                         notes12 addn12
#> 1 This gene is involved in mediating attachment      3
#>                                                 notes13 addn13
#> 1 This gene is implicated in the process of replication      2
#>                                    notes14 addn14
#> 1 This gene is involved in mediating entry      3
#>                                     notes15 addn15
#> 1 This gene is implicated in the impairment      2
#>                                               notes16 addn16
#> 1 This gene is involved in conferring drug resistance      2
#>                                                       notes17 addn17
#> 1 This gene is involved in mediating the response to blockage      3
#>                                                      notes18 addn18
#> 1 This gene is involved in mediating the initiation of cells      2
#>                                           notes19 addn19
#> 1 This gene is involved in mediating the blockage      4
#>                                                       notes20 addn20
#> 1 This gene is involved in mediating the priming of the cells     NA
# for legibility
t(receiver)
#>         [,1]                                                                  
#> gene    "AAK1"                                                                
#> summary "AAK1AP21TAAK1AP2T"                                                   
#> notes6  "This gene is specifically associated with the test case"             
#> addn6   "1"                                                                   
#> note7   "This gene is specifically associated with the biology of cells"      
#> addn7   "1"                                                                   
#> notes8  "This gene is specifically associated with the biology of proteins"   
#> addn8   "1"                                                                   
#> notes9  "This gene is specifically associated with the biology of metabolites"
#> addn9   "1"                                                                   
#> notes10 "This gene is involved in mediating the induction"                    
#> addn10  "2"                                                                   
#> notes11 "This gene is implicated in mediated damage"                          
#> addn11  "2"                                                                   
#> notes12 "This gene is involved in mediating attachment"                       
#> addn12  "3"                                                                   
#> notes13 "This gene is implicated in the process of replication"               
#> addn13  "2"                                                                   
#> notes14 "This gene is involved in mediating entry"                            
#> addn14  "3"                                                                   
#> notes15 "This gene is implicated in the impairment"                           
#> addn15  "2"                                                                   
#> notes16 "This gene is involved in conferring drug resistance"                 
#> addn16  "2"                                                                   
#> notes17 "This gene is involved in mediating the response to blockage"         
#> addn17  "3"                                                                   
#> notes18 "This gene is involved in mediating the initiation of cells"          
#> addn18  "2"                                                                   
#> notes19 "This gene is involved in mediating the blockage"                     
#> addn19  "4"                                                                   
#> notes20 "This gene is involved in mediating the priming of the cells"         
#> addn20  NA

^{Created on 2024-01-07 with [reprex v2.0.2]}

mtoufiq · January 7, 2024, 1:22pm

@technocrat yes, thats correct, except I see mismatch in summary and missing score in addn20 column.

mtoufiq · January 7, 2024, 1:29pm

@vedoa thank you very much. Ofcourse, this is helpful though 9 entries have to be fixed. With this I did test on remaining json files I had, the number varies for each file.

technocrat · January 7, 2024, 7:43pm

I see what happened in summary, which I can fix with a new function using its own regex pattern. I'm not sure why addn 20 isn't showing. I'll check.

technocrat · January 8, 2024, 6:23am

For summary we'll use

get_summary <- function(x){
  strsplit(subject[x],":")[[1]][2] |> 
    gsub("^ \"|\",$",",",x=_)      |> 
    gsub(",","",x=_)
}

and to fix the NA issue, change get_var()

get_var <- function(x) {
  step1 = strsplit(subject[x],":")[[1]][2]
  step2 = gsub("[^A-Z0-9]+",",",step1)
  if(grepl(",$",subject[x]))  step3 = gsub(",","",step2)
  if(!grepl(",$",subject[x])) step3 = step2
  step4 = trimws(step3)
  return(step4)
}

to fix a regex problem with a missing ending comma in 20 source data, but not the others. There may be other inconsistencies that will require further tweaking. I'll be looking.