Slow performance when creating moderately large XML files

dusadrian · July 25, 2019, 9:59pm

I need to create codebooks for social science datasets, which typically have 500-600 variables or more. I have used the base function cat() with good results, but xml2 seems like a much better alternative. However, xml2 needs about 15-16 seconds per dataset, while creating the same XML file with sink() and cat() takes no more than 4-5 seconds.

Below you can find a trimmed down reprex that on my computer takes about 8 seconds to create the XML file. In the real situation, there are many more nodes and attributes to create (some depend on other logical conditions), but this is the bottleneck.

library(xml2)
missing <- c(-1, -2, -3)
values <- c("Very weak" = 1, "Weak" = 2, "Middle" = 3, "Strong" = 4, "Very strong" = 5, "Don't know" = -1)

root <- xml_new_document()
codeBook <- xml_add_child(root, "codeBook")
dataDscr <- xml_add_child(codeBook, "dataDscr")

for (i in seq(600)) {
    var <- xml_add_child(dataDscr, "var", name = paste("V", i, sep = "_"))
    
    if (TRUE) { # something needs to be checked here, as an example
        xml_attr(var, "nature") <- "ordinal"
        xml_attr(var, "representationType") <- "text"
    }

    labl <- xml_add_child(var, "labl")
    xml_text(labl) <- paste("Variable label for V", i, sep = "_")

    for (v in seq(length(values))) {
        ismiss <- is.element(values[v], missing)
        catgry <- xml_add_child(var, "catgry")
        if (ismiss) xml_attr(catgry, "missing") <- "Y"
        catValu <- xml_add_child(catgry, "catValu")
        xml_text(catValu) <- as.character(values[v])
        labl <- xml_add_child(catgry, "labl")
        xml_text(labl) <- names(values)[v]
    }
}

write_xml(root, "test.xml")
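
For comparison, the cat() approach mentioned at the top of this post could look roughly like the sketch below. It is an assumption about what such code might be, not the original script: the function name write_codebook_cat() and its n_vars argument are made up for illustration, and it writes through a file connection rather than sink().

```r
missing <- c(-1, -2, -3)
values <- c("Very weak" = 1, "Weak" = 2, "Middle" = 3, "Strong" = 4, "Very strong" = 5, "Don't know" = -1)

# Hypothetical helper: writes the same structure as the xml2 reprex using cat()
write_codebook_cat <- function(path, n_vars = 600) {
  con <- file(path, open = "w", encoding = "UTF-8")
  on.exit(close(con))

  cat('<?xml version="1.0" encoding="UTF-8"?>\n<codeBook>\n<dataDscr>\n', file = con)
  for (i in seq_len(n_vars)) {
    cat('<var name="V_', i, '" nature="ordinal" representationType="text">\n',
        sep = "", file = con)
    cat("<labl>Variable label for V_", i, "</labl>\n", sep = "", file = con)
    for (v in seq_along(values)) {
      # Only categories whose value is listed in `missing` get the attribute
      miss_attr <- if (values[v] %in% missing) ' missing="Y"' else ""
      cat("<catgry", miss_attr, "><catValu>", values[v], "</catValu><labl>",
          names(values)[v], "</labl></catgry>\n", sep = "", file = con)
    }
    cat("</var>\n", file = con)
  }
  cat("</dataDscr>\n</codeBook>\n", file = con)
  invisible(path)
}

# Small run into a temporary file
out <- write_codebook_cat(tempfile(fileext = ".xml"), n_vars = 3)
```

Nothing here escapes special characters, so labels containing & or < would need extra handling before being written this way.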

pieterjanvc · July 28, 2019, 11:33pm

Hi,

I don't have much experience with XML, but I got intrigued and played with the functions until I found something that might be quicker. I noticed that the first xml_add_child() call in the loop was the bottleneck, and I found out that xml_add_sibling() works much faster (don't ask me why).

In order for a sibling to be added, there needs to be already a child, so I created an empty one that I removed in the end. This led me to rewrite the function like this:

library(xml2)
library(dplyr)
missing <- c(-1, -2, -3)
values <- c("Very weak" = 1, "Weak" = 2, "Middle" = 3, "Strong" = 4, "Very strong" = 5, "Don't know" = -1)

root <- xml_new_document()
codeBook <- xml_add_child(root, "codeBook")
dataDscr <- xml_add_child(codeBook, "dataDscr")

#Add empty child
empty <- xml_add_child(dataDscr, "empty")

for (i in 1:1000) {
  var <-  xml_add_sibling(empty, "var", name = paste("V", i, sep = "_"), .where = "before")
  
  if (TRUE) { # something needs to be checked here, as an example
    xml_attr(var, "nature") <- "ordinal"
    xml_attr(var, "representationType") <- "text"
  }
  
  labl1 <- xml_add_child(var, "labl")
  xml_text(labl1) <- paste("Variable label for V", i, sep = "_")
  
  # Track the most recently added child of var, so that each new catgry is
  # appended after it and the element order matches the original reprex
  last <- labl1
  for (v in seq_along(values)) {
    ismiss <- is.element(values[v], missing)
    catgry <- xml_add_sibling(last, "catgry") # default .where = "after"
    last <- catgry
    if (ismiss) xml_attr(catgry, "missing") <- "Y"
    catValu <- xml_add_child(catgry, "catValu")
    xml_text(catValu) <- as.character(values[v])
    labl <- xml_add_sibling(catValu, "labl") # labl follows catValu, as in the original
    xml_text(labl) <- names(values)[v]
  }
}

#Get rid of empty child
xml_remove(empty)

This code runs much faster now.
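
To check this claim on another machine, the two insertion strategies can be timed in isolation. The sketch below is only an illustration under assumptions: build_with_child() and build_with_sibling() are hypothetical helper names, only the top-level var nodes are created, and timings will vary with the machine and the xml2 version.

```r
library(xml2)

# Hypothetical helper: appends n <var> nodes with xml_add_child()
build_with_child <- function(n) {
  root <- xml_new_document()
  dataDscr <- xml_add_child(root, "dataDscr")
  for (i in seq_len(n)) {
    xml_add_child(dataDscr, "var", name = paste0("V_", i))
  }
  root
}

# Hypothetical helper: inserts n <var> nodes before a placeholder sibling,
# then removes the placeholder
build_with_sibling <- function(n) {
  root <- xml_new_document()
  dataDscr <- xml_add_child(root, "dataDscr")
  placeholder <- xml_add_child(dataDscr, "empty")
  for (i in seq_len(n)) {
    xml_add_sibling(placeholder, "var", name = paste0("V_", i), .where = "before")
  }
  xml_remove(placeholder)
  root
}

n <- 200
time_child <- system.time(doc_child <- build_with_child(n))
time_sibling <- system.time(doc_sibling <- build_with_sibling(n))
print(rbind(child = time_child, sibling = time_sibling)[, "elapsed"])
```

Increasing n should show whether the gap between the two grows with the number of nodes.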

While searching for a solution, I also combined the xml2 functions with paste0() and found a way to create the set almost instantaneously (though this code is much muddier, so I'm not a huge fan):

for (i in 1:1000) {
  
  #Add sibling to empty child (very fast)
  # representationType replaces the attribute that was mistakenly named xml_attr
  empty %>%  xml_add_sibling("var", name = paste("V", i, sep = "_"),
                           nature = if (TRUE) "ordinal", representationType = if (TRUE) "text", .where = "before") %>% 
    xml_add_child("labl", paste0("Variable label for V_", i)) %>% 
    xml_add_sibling(read_xml(
      #Group the categories together to be able to paste them all together (need root)
      paste0("<catgrys>", 
             paste0("<catgry", ifelse(is.element(values, missing), ' missing="Y"', ""), 
                    "><catValu>",values, "</catValu><labl>", names(values),"</labl></catgry>", 
                    collapse = ""), "</catgrys>")
      ))
 
}

Let me know if you find other ways!

PJ

dusadrian · July 29, 2019, 5:36pm

Hi PJ,

Wow, the combination of xml2 and paste functions is really quick. Thanks very much. Please let me digest this for a while; I'll return if I find a more demanding use case.

Thumbs up,
Adrian

dusadrian · July 31, 2019, 7:48am

I did some more testing and found that paste() alone is still the fastest.

Which, with all due respect for the work of the authors of the xml2 package, still raises the question of this package's utility (in terms of speed)...

library(xml2)
missing <- c(-1, -2, -3)
values <- c("Very weak" = 1, "Weak" = 2, "Middle" = 3, "Strong" = 4, "Very strong" = 5, "Don't know" = -1)

# paste0() is used throughout: paste() separates its arguments with a space by
# default, which would put stray spaces inside attribute values and element text
root <- paste0("<?xml version=\"1.0\" encoding=\"UTF-8\"?>", "<codeBook>", "<dataDscr>")

for (i in seq(600)) {
  root <- paste0(root,
    "<var name=\"V_", i, "\" nature=\"ordinal\" representationType=\"text\">",
    "<labl>Variable label for V_", i, "</labl>",
    # no <catgrys> wrapper is needed here, because the fragment is not parsed on its own
    paste0("<catgry", ifelse(is.element(values, missing), ' missing="Y"', ""),
           "><catValu>", values, "</catValu><labl>", names(values), "</labl></catgry>",
           collapse = ""),
    "</var>")
}
root <- paste0(root, "</dataDscr></codeBook>")

# read_xml() parses the string (and fails if it is not well-formed) before writing
write_xml(read_xml(root), "test2.xml")
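
The loop above grows one string step by step. Because paste0() is vectorised, the same text can also be built without a loop. The sketch below is one possible variant, not a benchmarked recommendation: build_codebook() is a hypothetical name, and it assumes (as in the reprex) that every variable has the same categories.

```r
missing <- c(-1, -2, -3)
values <- c("Very weak" = 1, "Weak" = 2, "Middle" = 3, "Strong" = 4, "Very strong" = 5, "Don't know" = -1)

# Hypothetical helper: returns the whole codebook as a single string
build_codebook <- function(n_vars, values, missing) {
  # Assumption: all variables share the same categories, so build them once
  catgry <- paste0(
    "<catgry", ifelse(values %in% missing, ' missing="Y"', ""),
    "><catValu>", values, "</catValu><labl>", names(values), "</labl></catgry>",
    collapse = ""
  )

  # One string per variable, built for all variables in a single paste0() call
  i <- seq_len(n_vars)
  vars <- paste0(
    '<var name="V_', i, '" nature="ordinal" representationType="text">',
    "<labl>Variable label for V_", i, "</labl>",
    catgry,
    "</var>"
  )

  paste0(
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<codeBook><dataDscr>",
    paste0(vars, collapse = ""),
    "</dataDscr></codeBook>"
  )
}

xml_string <- build_codebook(600, values, missing)
```

The resulting string can be passed to read_xml() and write_xml() exactly as in the loop version.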

pieterjanvc · July 31, 2019, 11:41am

Hi,

Yeah, I agree... Since I'm new to XML, I thought it was my ignorance that led me to use the functions improperly, but it does seem that plain paste is much faster.

My only explanation for why xml2 might be slower for a good reason is that it keeps track of metadata such as the hierarchy and constantly checks the validity of the document when performing operations. That would slow things down, but it ensures that the output is valid XML, whereas with the paste technique there is no guarantee that the document has the proper structure and obeys all XML rules.
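
One concrete way the paste technique can break XML rules is text that contains characters such as & or <. A minimal sketch of escaping by hand is shown below; escape_xml() is a hypothetical helper written for illustration, not an xml2 function, and it covers only the four characters listed.

```r
# Hypothetical helper: escapes characters that are special in XML text and
# double-quoted attribute values. "&" must be replaced first, otherwise the
# ampersands introduced by the other replacements would be escaped again.
escape_xml <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  x <- gsub('"', "&quot;", x, fixed = TRUE)
  x
}

label <- 'Income < 1000 & "other"'
node <- paste0("<labl>", escape_xml(label), "</labl>")
```

Without the escaping step, the pasted string for this label would not be well-formed.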

PJ

dusadrian · July 31, 2019, 3:20pm

Hi PJ,

I agree with you, but I also somewhat disagree. A document that obeys all XML rules (that is, a well-formed one) is still not guaranteed to be valid. Usually, such XML files are validated against a schema (where even the order of some entries is important).

As the validation process is mandatory upon creating the XML file, this process would also detect if the document has the proper structure. Not to mention that read_xml() would make sure the document has the proper structure, otherwise it would throw an error.
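
Both checks can be sketched with xml2 itself. The example below is an illustration only: the schema is a toy XSD written for this example (not the schema actually used for codebooks), and the object names are made up.

```r
library(xml2)

# 1. Well-formedness: read_xml() signals an error on a mismatched tag
parse_result <- tryCatch(
  read_xml("<catgry><catValu>1</catgry>"),
  error = function(e) conditionMessage(e)
)
is_parsed <- inherits(parse_result, "xml_document")

# 2. Validity: a toy schema (assumption) requiring catValu before labl
toy_schema <- read_xml('
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="catgry">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="catValu" type="xs:string"/>
        <xs:element name="labl" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>')

good <- read_xml("<catgry><catValu>1</catValu><labl>Very weak</labl></catgry>")
swapped <- read_xml("<catgry><labl>Very weak</labl><catValu>1</catValu></catgry>")

# xml_validate() returns a logical; details of any failure are stored in its
# "errors" attribute
good_ok <- xml_validate(good, toy_schema)
swapped_ok <- xml_validate(swapped, toy_schema)
```

Both documents parse without error, so only the schema check can tell them apart.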

So it all boils down to how fast the document is produced, and apparently paste() is the fastest of the approaches tried here. The xml2 package seems fine for simple, basic examples, but in production, on really demanding tasks, the only usable functions are read_xml() and write_xml().

Best,
Adrian

pieterjanvc · August 1, 2019, 12:54am

Hi,

Interesting info indeed! Well, at least your question taught me more about XML, and we now know to avoid the xml2 package for large operations.

It was a stimulating conversation,
PJ
