efg
December 28, 2020, 3:43pm
1
Why do xml2's xml_path() and xml_find_all() functions "fail" when namespaces are present in an XML file, but work "correctly" when the namespaces are edited out? How can I tell xml2's functions to ignore namespaces?
Here's a six-line version of the original 2000+ line XML file, followed by the problem output:
library(tidyverse)
library(xml2)
s <-
'<?xml version="1.0" encoding="utf-8"?>
<Return xmlns="http://www.irs.gov/efile" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.irs.gov/efile" returnVersion="2016v3.0">
<ReturnHeader binaryAttachmentCnt="0">
<ReturnTs>2017-07-05T19:04:21-05:00</ReturnTs>
</ReturnHeader>
</Return>
'
doc <- read_xml(paste(s, collapse = "\n")) # easier way to do this?
doc %>% xml_find_all('//*') %>% xml_path()
#> [1] "/*" "/*/*" "/*/*/*"
xml_find_all(doc, "//ReturnTs")
#> {xml_nodeset (0)}
If I had included two more lines in the sample XML file the output would show the numbers in brackets:
[1] "/*" "/*/*" "/*/*/*[1]" "/*/*/*[2]" "/*/*/*[3]"
If I edit out the xmlns specification, I see the parsing and query results I'm expecting from xml2:
s <-
'<?xml version="1.0" encoding="utf-8"?>
<Return returnVersion="2016v3.0">
<ReturnHeader binaryAttachmentCnt="0">
<ReturnTs>2017-07-05T19:04:21-05:00</ReturnTs>
</ReturnHeader>
</Return>
'
doc <- read_xml(paste(s, collapse = "\n")) # easier way to do this?
doc %>% xml_find_all('//*') %>% xml_path()
#> [1] "/Return" "/Return/ReturnHeader"
#> [3] "/Return/ReturnHeader/ReturnTs"
xml_find_all(doc, "//ReturnTs")
#> {xml_nodeset (1)}
#> [1] <ReturnTs>2017-07-05T19:04:21-05:00</ReturnTs>
How can I tell the xml2 functions to ignore the specified namespaces?
AlexisW
December 30, 2020, 2:30am
2
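I think this is due to the lack of a prefix: the elements sit in the default namespace (xmlns="http://www.irs.gov/efile"), and in XPath 1.0 an unprefixed name such as ReturnTs only matches elements that are in no namespace, as discussed here or there. xml_path() returns readable paths if you specify a prefix explicitly (note that this moves the elements into the xsi namespace, so it changes the document rather than just the query):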
'<?xml version="1.0" encoding="utf-8"?>
<xsi:Return xmlns="http://www.irs.gov/efile" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.irs.gov/efile" returnVersion="2016v3.0">
<xsi:ReturnHeader binaryAttachmentCnt="0">
<xsi:ReturnTs>2017-07-05T19:04:21-05:00</xsi:ReturnTs>
</xsi:ReturnHeader>
</xsi:Return>' %>%
read_xml() %>%
xml_find_all('//*') %>%
xml_path()
#> [1] "/xsi:Return"
#> [2] "/xsi:Return/xsi:ReturnHeader"
#> [3] "/xsi:Return/xsi:ReturnHeader/xsi:ReturnTs"
It also works if you call xml_set_namespace() on each node before extracting its path:
'<?xml version="1.0" encoding="utf-8"?>
<Return xmlns="http://www.irs.gov/efile" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.irs.gov/efile" returnVersion="2016v3.0">
<ReturnHeader binaryAttachmentCnt="0">
<ReturnTs>2017-07-05T19:04:21-05:00</ReturnTs>
</ReturnHeader>
</Return>
' %>%
read_xml() %>% xml_find_all('//*') %>%
map_chr(., ~ xml_set_namespace(.x,prefix="xsi", uri="http://www.w3.org/2001/XMLSchema-instance") %>%
xml_path())
#> [1] "/xsi:Return"
#> [2] "/xsi:Return/xsi:ReturnHeader"
#> [3] "/xsi:Return/xsi:ReturnHeader/xsi:ReturnTs"
Or ignoring the namespace in the XPath formulation:
'<?xml version="1.0" encoding="utf-8"?>
<Return xmlns="http://www.irs.gov/efile" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.irs.gov/efile" returnVersion="2016v3.0">
<ReturnHeader binaryAttachmentCnt="0">
<ReturnTs>2017-07-05T19:04:21-05:00</ReturnTs>
</ReturnHeader>
</Return>
' %>%
read_xml() %>%
xml_find_all("//*[local-name()='ReturnTs']")
#> {xml_nodeset (1)}
#> [1] <ReturnTs>2017-07-05T19:04:21-05:00</ReturnTs>
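Two more options, as a rough sketch rather than a definitive recipe. xml_ns() lists a document's namespaces and gives the default one a generated prefix (d1 here), which can then be used in the XPath. Alternatively, xml_ns_strip() removes default namespaces from the document in place, so unprefixed queries match again. This assumes an xml2 version that provides xml_ns_strip(); the variable names found_prefixed and found_stripped are only illustrative.

```r
library(xml2)

s <- '<?xml version="1.0" encoding="utf-8"?>
<Return xmlns="http://www.irs.gov/efile" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.irs.gov/efile" returnVersion="2016v3.0">
<ReturnHeader binaryAttachmentCnt="0">
<ReturnTs>2017-07-05T19:04:21-05:00</ReturnTs>
</ReturnHeader>
</Return>'

# s is already a single string, so read_xml() can take it directly
doc <- read_xml(s)

# Option 1: keep the namespaces and query with the generated prefix.
# xml_ns() maps the default namespace to "d1" and keeps "xsi" as is.
ns <- xml_ns(doc)
found_prefixed <- xml_find_all(doc, "//d1:ReturnTs", ns)
xml_text(found_prefixed)

# Option 2: strip the default namespace, then use unprefixed names.
# xml_ns_strip() modifies doc in place.
xml_ns_strip(doc)
found_stripped <- xml_find_all(doc, "//ReturnTs")
xml_text(found_stripped)
```

Option 1 leaves the document untouched, which matters if it will be written back out. Option 2 is closer to "ignore the namespaces", at the cost of altering the parsed document.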