Apologies for what is likely a very trivial question with a trivial solution.
I have an XML file that I'd like to read into a data frame using xml2, but despite a few hours Google searching, I was unsuccessful. This is probably because I'm not at all familiar with XML.
And after all that, I found a 2 liner with XML that mostly does what I'd like (reprex below). What it doesn't do is automatically attempt to infer the column types.
My actual XML has thousands of records and hundreds of variables, so manually specifying the column types would be inconvenient.
My questions:
is there a simple way to do the same thing with xml2?
would it be possible for it to semi-intelligently figure out column types?
I think it'd be horrifically ugly to use XML as below, save it as a CSV text file with write.csv, and then import it back with read_csv, to take advantage of inferring the column types.
I'm guessing the solution will be trivial, yet somehow I've been unable to find it!
Thank you! That's a very handy function I wasn't aware of, and does the column type inference perfectly.
I'm a little surprised that the xml2 package doesn't seem to have a simple way to read this apparently relatively common XML structure, but I suppose I can keep using the XML package and then feed into tidier tools.
Here is a way to do it using tidyverse more than xml2 itself because read xml is converted as list directly.
This use the new tidyr function currently in dev version for nesting list.
Also, readr as a type_convert function to activate its parsers on a data.frame
xml_doc <-"
<DATA>
<RECORD>
<VAR1>string1</VAR1>
<VAR2>1</VAR2>
<VAR3>2.3</VAR3>
<VAR4>TRUE</VAR4>
</RECORD>
<RECORD>
<VAR1>string2</VAR1>
<VAR2>2</VAR2>
<VAR3>3.4</VAR3>
<VAR4>FALSE</VAR4>
</RECORD>
<RECORD>
<VAR1>string3</VAR1>
<VAR2>3</VAR2>
<VAR3>4.5</VAR3>
<VAR4>TRUE</VAR4>
</RECORD>
</DATA>
"
library(xml2)
# convert xml as a R list
doc <- as_list(read_xml(xml_doc))
str(doc$DATA, 1)
#> List of 3
#> $ RECORD:List of 4
#> $ RECORD:List of 4
#> $ RECORD:List of 4
library(tidyr)
packageVersion("tidyr")
#> [1] '0.8.3.9000'
# put list into a one column tible
tibble::as_tibble(doc) %>%
# new tidyr function
unnest_wider(DATA) %>%
# unnest same length list cols
unnest(cols = names(.)) %>%
unnest(cols = names(.)) %>%
# convert using readr parser
readr::type_convert()
#> Parsed with column specification:
#> cols(
#> VAR1 = col_character(),
#> VAR2 = col_double(),
#> VAR3 = col_double(),
#> VAR4 = col_logical()
#> )
#> # A tibble: 3 x 4
#> VAR1 VAR2 VAR3 VAR4
#> <chr> <dbl> <dbl> <lgl>
#> 1 string1 1 2.3 TRUE
#> 2 string2 2 3.4 FALSE
#> 3 string3 3 4.5 TRUE
This tidyverse solution looks very promising, thank you! Will keep an eye out for when unnest_wider makes it into release.
(I'm trying to generate code I can share to people with potentially very limited R experience, so minimizing the number of packages that need to be installed, and keeping the packages as likely as possible to "behave" with each other in the future, is a major consideration.)
But both of the responses are super helpful, thanks again!