xml2's xml_path() shows asterisks and numbers in brackets instead of names

efg · December 28, 2020, 3:43pm

Why do xml2's xml_path() and xml_find_all() functions "fail" when namespaces are present in an XML file, but work "correctly" when the namespaces are edited out? How can I tell xml2's functions to ignore namespaces?

Here's a six-line version of the original 2000+ line XML file, with the problem output:

library(tidyverse)
library(xml2)

s <- 
'<?xml version="1.0" encoding="utf-8"?>
<Return  xmlns="http://www.irs.gov/efile" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.irs.gov/efile" returnVersion="2016v3.0">
  <ReturnHeader binaryAttachmentCnt="0">
    <ReturnTs>2017-07-05T19:04:21-05:00</ReturnTs>
  </ReturnHeader>   
</Return>     
'

doc <- read_xml(paste(s, collapse = "\n"))  # easier way to do this?
doc %>% xml_find_all('//*')  %>% xml_path()
xml_find_all(doc, "//ReturnTs")

[1] "/*"     "/*/*"   "/*/*/*"
{xml_nodeset (0)}

If I had included two more lines in the sample XML file, the output would also show the numbers in brackets:

[1] "/*"        "/*/*"      "/*/*/*[1]" "/*/*/*[2]" "/*/*/*[3]"

If I edit out the xmlns declarations, I get the parsing and query results I expect from xml2:

s <- 
'<?xml version="1.0" encoding="utf-8"?>
<Return  returnVersion="2016v3.0">
  <ReturnHeader binaryAttachmentCnt="0">
    <ReturnTs>2017-07-05T19:04:21-05:00</ReturnTs>
  </ReturnHeader>   
</Return>     
'

doc <- read_xml(paste(s, collapse = "\n"))  # easier way to do this?
doc %>% xml_find_all('//*')  %>% xml_path()
xml_find_all(doc, "//ReturnTs")

[1] "/Return"                       "/Return/ReturnHeader"         
[3] "/Return/ReturnHeader/ReturnTs"
{xml_nodeset (1)}
[1] <ReturnTs>2017-07-05T19:04:21-05:00</ReturnTs>

How can I tell the xml2 functions to ignore the specified namespaces?

AlexisW · December 30, 2020, 2:30am

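I think this is because the elements have no prefix: they sit in the default namespace (xmlns="http://www.irs.gov/efile"), and in XPath 1.0 an unprefixed name such as ReturnTs only matches elements that are in no namespace at all, as discussed here or there. It works if the elements carry an explicit prefix (note that using xsi: below puts them in the XMLSchema-instance namespace instead of the IRS one, so this only illustrates the effect):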

'<?xml version="1.0" encoding="utf-8"?>
<xsi:Return xmlns="http://www.irs.gov/efile" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.irs.gov/efile" returnVersion="2016v3.0">
  <xsi:ReturnHeader binaryAttachmentCnt="0">
    <xsi:ReturnTs>2017-07-05T19:04:21-05:00</xsi:ReturnTs>
  </xsi:ReturnHeader>   
</xsi:Return>' %>%
  read_xml() %>%
  xml_find_all('//*') %>%
  xml_path()
#> [1] "/xsi:Return"                              
#> [2] "/xsi:Return/xsi:ReturnHeader"             
#> [3] "/xsi:Return/xsi:ReturnHeader/xsi:ReturnTs"

It also works if you call xml_set_namespace() on each node before extracting the path:

'<?xml version="1.0" encoding="utf-8"?>
<Return xmlns="http://www.irs.gov/efile" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.irs.gov/efile" returnVersion="2016v3.0">
  <ReturnHeader binaryAttachmentCnt="0">
    <ReturnTs>2017-07-05T19:04:21-05:00</ReturnTs>
  </ReturnHeader>   
</Return>     
' %>%
  read_xml() %>% xml_find_all('//*') %>%
  map_chr(., ~ xml_set_namespace(.x,prefix="xsi", uri="http://www.w3.org/2001/XMLSchema-instance") %>%
        xml_path())
#> [1] "/xsi:Return"                              
#> [2] "/xsi:Return/xsi:ReturnHeader"             
#> [3] "/xsi:Return/xsi:ReturnHeader/xsi:ReturnTs"
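
Another option that leaves the XML untouched, as a rough sketch: xml_ns() lists the namespaces xml2 found in the document, and as far as I know it names default namespaces d1, d2, and so on, so that generated prefix can be used in the XPath. The variable names (s, doc, ns, ts_nodes) are only for illustration, and the d1 prefix is an assumption worth checking by printing ns first.

```r
library(xml2)

s <- '<?xml version="1.0" encoding="utf-8"?>
<Return xmlns="http://www.irs.gov/efile" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.irs.gov/efile" returnVersion="2016v3.0">
  <ReturnHeader binaryAttachmentCnt="0">
    <ReturnTs>2017-07-05T19:04:21-05:00</ReturnTs>
  </ReturnHeader>
</Return>'

# read_xml() accepts the literal string directly
doc <- read_xml(s)

# Named character vector of prefix -> URI; the default namespace is
# assumed to show up under the generated prefix "d1"
ns <- xml_ns(doc)
print(ns)

# Qualify the element name with that prefix in the XPath
ts_nodes <- xml_find_all(doc, "//d1:ReturnTs", ns)
xml_text(ts_nodes)
```

If print(ns) shows a different name for the http://www.irs.gov/efile entry, use that prefix in the XPath instead.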

Or you can ignore the namespace in the XPath expression itself by matching on local-name():

'<?xml version="1.0" encoding="utf-8"?>
<Return xmlns="http://www.irs.gov/efile" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.irs.gov/efile" returnVersion="2016v3.0">
  <ReturnHeader binaryAttachmentCnt="0">
    <ReturnTs>2017-07-05T19:04:21-05:00</ReturnTs>
  </ReturnHeader>   
</Return>     
' %>%
  read_xml() %>%
  xml_find_all("//*[local-name()='ReturnTs']")  
#> {xml_nodeset (1)}
#> [1] <ReturnTs>2017-07-05T19:04:21-05:00</ReturnTs>
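
Closest to the original question ("ignore namespaces") is probably xml_ns_strip(), which I believe is available in reasonably recent xml2 versions and removes the default namespace declarations from a parsed document. A minimal sketch, assuming that function is present in your installed version; the variable names are only illustrative. Note that it changes doc in place.

```r
library(xml2)

s <- '<?xml version="1.0" encoding="utf-8"?>
<Return xmlns="http://www.irs.gov/efile" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.irs.gov/efile" returnVersion="2016v3.0">
  <ReturnHeader binaryAttachmentCnt="0">
    <ReturnTs>2017-07-05T19:04:21-05:00</ReturnTs>
  </ReturnHeader>
</Return>'

doc <- read_xml(s)

# Strip the default namespace(s); this modifies doc itself
xml_ns_strip(doc)

# Unprefixed XPath names should now match, as in the hand-edited file
ts_nodes <- xml_find_all(doc, "//ReturnTs")
xml_text(ts_nodes)

# Paths of all elements after stripping
xml_path(xml_find_all(doc, "//*"))
```

Only default (unprefixed) namespaces are affected, so prefixed ones such as xsi stay as they are.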
