How to convert a partly nested XML to data frame using xml2

jchou · August 2, 2019, 7:57pm

Apologies for what is likely a very trivial question with a trivial solution.

I have an XML file that I'd like to read into a data frame using xml2, but despite a few hours Google searching, I was unsuccessful. This is probably because I'm not at all familiar with XML.

And after all that, I found a 2 liner with XML that mostly does what I'd like (reprex below). What it doesn't do is automatically attempt to infer the column types.

My actual XML has thousands of records and hundreds of variables, so manually specifying the column types would be inconvenient.

My questions:

is there a simple way to do the same thing with xml2?
would it be possible for it to semi-intelligently figure out column types?

I think it'd be horrifically ugly to use XML as below, save it as a CSV text file with write.csv, and then import it back with read_csv, to take advantage of inferring the column types.

I'm guessing the solution will be trivial, yet somehow I've been unable to find it!

Thanks!

library(XML)

xml_doc <-"
<DATA>

    <RECORD>
        <VAR1>string1</VAR1>
        <VAR2>1</VAR2>
        <VAR3>2.3</VAR3>
        <VAR4>TRUE</VAR4>
    </RECORD>

    <RECORD>
        <VAR1>string2</VAR1>
        <VAR2>2</VAR2>
        <VAR3>3.4</VAR3>
        <VAR4>FALSE</VAR4>
    </RECORD>

    <RECORD>
        <VAR1>string3</VAR1>
        <VAR2>3</VAR2>
        <VAR3>4.5</VAR3>
        <VAR4>TRUE</VAR4>
    </RECORD>

</DATA>
"

doc <-  xmlParse(xml_doc)
df <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
df
#>      VAR1 VAR2 VAR3  VAR4
#> 1 string1    1  2.3  TRUE
#> 2 string2    2  3.4 FALSE
#> 3 string3    3  4.5  TRUE
str(df)
#> 'data.frame':    3 obs. of  4 variables:
#>  $ VAR1: chr  "string1" "string2" "string3"
#>  $ VAR2: chr  "1" "2" "3"
#>  $ VAR3: chr  "2.3" "3.4" "4.5"
#>  $ VAR4: chr  "TRUE" "FALSE" "TRUE"

pieterjanvc · August 2, 2019, 9:19pm

Hi,

Using dplyr is the tidiest option

library(XML)
library(dplyr)

xml_doc <-"
<DATA>

   <RECORD>
       <VAR1>string1</VAR1>
       <VAR2>1</VAR2>
       <VAR3>2.3</VAR3>
       <VAR4>TRUE</VAR4>
   </RECORD>

   <RECORD>
       <VAR1>string2</VAR1>
       <VAR2>2</VAR2>
       <VAR3>3.4</VAR3>
       <VAR4>FALSE</VAR4>
   </RECORD>

   <RECORD>
       <VAR1>string3</VAR1>
       <VAR2>3</VAR2>
       <VAR3>4.5</VAR3>
       <VAR4>TRUE</VAR4>
   </RECORD>

</DATA>"

myXML = xmlParse(xml_doc)
myData = xmlToDataFrame(myXML, stringsAsFactors = FALSE,) %>% 
                        mutate_all(~type.convert(., as.is = T))

'data.frame':	3 obs. of  4 variables:
 $ VAR1: chr  "string1" "string2" "string3"
 $ VAR2: int  1 2 3
 $ VAR3: num  2.3 3.4 4.5
 $ VAR4: logi  TRUE FALSE TRUE

Set the as.is to FALSE if you prefer factors instead of strings

Hope this is what you're looking for...
PJ

jchou · August 5, 2019, 1:04pm

Thank you! That's a very handy function I wasn't aware of, and does the column type inference perfectly.

I'm a little surprised that the xml2 package doesn't seem to have a simple way to read this apparently relatively common XML structure, but I suppose I can keep using the XML package and then feed into tidier tools.

Thanks again!

cderv · August 5, 2019, 4:30pm

Here is a way to do it using tidyverse more than xml2 itself because read xml is converted as list directly.

This use the new tidyr function currently in dev version for nesting list.
Also, readr as a type_convert function to activate its parsers on a data.frame

xml_doc <-"
<DATA>

    <RECORD>
        <VAR1>string1</VAR1>
        <VAR2>1</VAR2>
        <VAR3>2.3</VAR3>
        <VAR4>TRUE</VAR4>
    </RECORD>

    <RECORD>
        <VAR1>string2</VAR1>
        <VAR2>2</VAR2>
        <VAR3>3.4</VAR3>
        <VAR4>FALSE</VAR4>
    </RECORD>

    <RECORD>
        <VAR1>string3</VAR1>
        <VAR2>3</VAR2>
        <VAR3>4.5</VAR3>
        <VAR4>TRUE</VAR4>
    </RECORD>

</DATA>
"

library(xml2)

# convert xml as a R list
doc <- as_list(read_xml(xml_doc))
str(doc$DATA, 1)
#> List of 3
#>  $ RECORD:List of 4
#>  $ RECORD:List of 4
#>  $ RECORD:List of 4
library(tidyr)
packageVersion("tidyr")
#> [1] '0.8.3.9000'
# put list into a one column tible
tibble::as_tibble(doc) %>%
  # new tidyr function
  unnest_wider(DATA) %>%
  # unnest same length list cols
  unnest(cols = names(.)) %>%
  unnest(cols = names(.)) %>%
  # convert using readr parser
  readr::type_convert()
#> Parsed with column specification:
#> cols(
#>   VAR1 = col_character(),
#>   VAR2 = col_double(),
#>   VAR3 = col_double(),
#>   VAR4 = col_logical()
#> )
#> # A tibble: 3 x 4
#>   VAR1     VAR2  VAR3 VAR4 
#>   <chr>   <dbl> <dbl> <lgl>
#> 1 string1     1   2.3 TRUE 
#> 2 string2     2   3.4 FALSE
#> 3 string3     3   4.5 TRUE

^{Created on 2019-08-05 by the reprex package (v0.3.0)}

Not sure it is more efficient than XML but if you want not to use XML it is an option.
xml2 has no equivalent of xmlToDataFrame as far as I know.

jchou · August 6, 2019, 2:16pm

This tidyverse solution looks very promising, thank you! Will keep an eye out for when unnest_wider makes it into release.

(I'm trying to generate code I can share to people with potentially very limited R experience, so minimizing the number of packages that need to be installed, and keeping the packages as likely as possible to "behave" with each other in the future, is a major consideration.)

But both of the responses are super helpful, thanks again!

system · August 13, 2019, 2:16pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.