Parallelization over {xml2} XML documents

I'm trying to parallelize some parsing of an {xml2} XML document. I know xml2 objects are backed by external pointers, and that doesn't seem to work with the {furrr} package the way I am using it. Is there a way to parallelize processing of a document while keeping the pointer-based objects (i.e. without collapsing it to plain R objects with as_list())? Below is a reprex showing what I want to do: it works in the non-parallelized version and fails in my attempt at a parallelized version.

library(xml2)
library(purrr)
library(furrr)
#> Loading required package: future


# WORKS: local: mapping to elements
xmldoc <- read_xml("<root><child>a</child><child>b</child></root>")
elt_paths <- xml_find_all(xmldoc, "//child") |> map_chr(xml_path)
map(elt_paths, ~{xml_text(xml_find_first(xmldoc, .x))})
#> [[1]]
#> [1] "a"
#> 
#> [[2]]
#> [1] "b"


# DOES NOT WORK: how do I make the xml object visible to the functions running on the workers?
tryCatch({
    xmldoc <- read_xml("<root><child>a</child><child>b</child></root>")
    elt_paths <- xml_find_all(xmldoc, "//child") |> map_chr(xml_path)
    
    plan(multisession, workers = 2)
    res <- future_map(elt_paths, ~{xml_text(xml_find_first(xmldoc, .x))})
    plan(sequential)
    print(res)
}, error=function(e) print(e))
#> <error/purrr_error_indexed>
#> Error:
#> ℹ In index: 1.
#> Caused by error in `xml_ns.xml_document()`:
#> ! external pointer is not valid
#> ---
#> Backtrace:
#>      ▆
#>   1. ├─parallel (local) workRSOCK()
#>   2. │ └─parallel:::workLoop(...)
#>   3. │   └─parallel:::workCommand(master)
#>   4. │     ├─base::tryCatch(...)
#>   5. │     │ └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>   6. │     │   └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>   7. │     │     └─base (local) doTryCatch(return(expr), name, parentenv, handler)
#>   8. │     ├─base::tryCatch(...)
#>   9. │     │ └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>  10. │     │   └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  11. │     │     └─base (local) doTryCatch(return(expr), name, parentenv, handler)
#>  12. │     ├─base::do.call(msg$data$fun, msg$data$args, quote = TRUE)
#>  13. │     └─future (local) `<fn>`(...)
#>  14. │       └─base::eval(expr, envir = envir, enclos = enclos)
#>  15. │         └─base::eval(expr, envir = envir, enclos = enclos)
#>  16. ├─base::tryCatch(...)
#>  17. │ └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>  18. │   └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  19. │     └─base (local) doTryCatch(return(expr), name, parentenv, handler)
#>  20. ├─base::withCallingHandlers(...)
#>  21. ├─base::withVisible(...)
#>  22. ├─base::local(...)
#>  23. │ └─base::eval.parent(substitute(eval(quote(expr), envir)))
#>  24. │   └─base::eval(expr, p)
#>  25. │     └─base::eval(expr, p)
#>  26. └─base::eval(...)
#>  27.   └─base::eval(...)
#>  28.     ├─base::withCallingHandlers(...)
#>  29.     ├─base::do.call(...furrr_map_fn, args)
#>  30.     └─purrr (local) `<fn>`(.x = "/root/child[1]", .f = `<fn>`)
#>  31.       └─purrr:::map_("list", .x, .f, ..., .progress = .progress)
#>  32.         ├─purrr:::with_indexed_errors(...)
#>  33.         │ └─base::withCallingHandlers(...)
#>  34.         ├─purrr:::call_with_cleanup(...)
#>  35.         └─.f(.x[[i]], ...)
#>  36.           └─global ...furrr_fn(...)
#>  37.             ├─xml2::xml_text(xml_find_first(xmldoc, .x))
#>  38.             ├─xml2::xml_find_first(xmldoc, .x)
#>  39.             └─xml2:::xml_find_first.xml_node(xmldoc, .x)
#>  40.               ├─xml2::xml_ns(x)
#>  41.               └─xml2:::xml_ns.xml_document(x)

Created on 2024-05-21 with reprex v2.1.0

This is not possible. The xml2 package represents the XML document with external pointers, which refer to memory inside the current R process. An external pointer cannot be copied to another process, so you cannot share the document with the workers like this.

You'd need to load the same XML document separately on all processes to have access to the data.
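
A minimal sketch of that idea, reusing the reprex document: instead of sending the xml2 object to the workers, send a plain character string and re-parse it inside each worker. The names `xml_string` and `doc` are just illustrative, and re-parsing for every element is only reasonable here because the document is tiny. With a file on disk you could pass the file path and call `read_xml()` on it in the worker instead.

```r
library(xml2)
library(furrr)

xmldoc <- read_xml("<root><child>a</child><child>b</child></root>")
elt_paths <- xml_path(xml_find_all(xmldoc, "//child"))

# A character string is an ordinary R object, so it can be exported
# to the workers, unlike the external pointer inside `xmldoc`.
xml_string <- as.character(xmldoc)

plan(multisession, workers = 2)
res <- future_map(elt_paths, function(path) {
  # Each call builds its own document, with a pointer valid in that process.
  doc <- xml2::read_xml(xml_string)
  xml2::xml_text(xml2::xml_find_first(doc, path))
})
plan(sequential)

print(res)
```

For a large document it would likely be cheaper to split `elt_paths` into one chunk per worker and parse once per chunk rather than once per element. Whether that pays off depends on how expensive the per-element work is compared with parsing.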
