I used the xml2 package with the following code to get the data frame, but the process was time-consuming. On my laptop, it took about seven hours to read the file.
library(xml2)
library(dplyr)

pg   <- read_xml("sample.xml")
node <- xml_find_all(pg, xpath = "//kf:Series")

datalist <- list()
for (i in seq_along(node)) {
  l  <- length(xml_children(node[[i]]))
  df <- data.frame(matrix(ncol = 4, nrow = l))
  colnames(df) <- c("date", "value", "SERIES_NAME", "UNIT")
  # one row per child node, pulled attribute by attribute
  for (z in seq_len(l)) {
    df$date[z]  <- xml_attrs(xml_child(node[[i]], z))[["DATE"]]
    df$value[z] <- xml_attrs(xml_child(node[[i]], z))[["VALUE"]]
  }
  # these are constant within a series, so set them once per series
  df$SERIES_NAME <- xml_attrs(node[[i]])[["SERIES_NAME"]]
  df$UNIT        <- xml_attrs(node[[i]])[["UNIT"]]
  datalist[[i]]  <- df
}
big_data <- bind_rows(datalist)
Is there any way to convert the file to a data frame faster, either by changing the code or using a different method?
Hi, I had the same problem some time ago. It took me 14 hours to convert a 600 MB XML file, so I invented my own alternative solution, a really brute-force one. The trick was to manually change the extension of the source file from .xml to .txt and then read it into the R environment with readr::read_file(), which gives you one single, very long text string.

In the next steps you use the stringr package to handle the character data, along with the purrr and tidyr packages to manipulate it. Start with str_split() to cut this big string into an initial character vector; in your case the separator (pattern) used in str_split() will be something like "<kf:Series UNIT=", but special characters will have to be escaped. This initial vector then becomes a data frame or tibble containing only one column. From there you gradually extract the further strings of interest and add them to your tibble using mutate() combined with str_split() or str_split_fixed(); map() and unnest(), and perhaps some other functions, are likely to be needed as well.
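To illustrate, here is a rough, untested sketch of those steps. The observation element name ("<kf:Obs") is only my guess, and the attribute names are taken from your code, so adjust both to your actual file:

library(readr)
library(stringr)
library(dplyr)
library(tidyr)

# The extension does not actually matter: read_file() reads any file
# into one single, very long string
txt <- read_file("sample.txt")

# Initial character vector, one element per series; fixed() takes care
# of the escaping, and the first element (everything before the first
# series tag) is dropped
chunks <- str_split(txt, fixed('<kf:Series UNIT='))[[1]][-1]

big_data <- tibble(chunk = chunks) %>%
  mutate(
    UNIT        = str_match(chunk, '^"([^"]*)"')[, 2],
    SERIES_NAME = str_match(chunk, 'SERIES_NAME="([^"]*)"')[, 2],
    # "<kf:Obs" is only a guess at the child element's name --
    # check the real file and adjust
    obs = str_split(chunk, fixed('<kf:Obs'))
  ) %>%
  unnest(obs) %>%
  filter(str_detect(obs, 'DATE="')) %>%
  mutate(
    date  = str_match(obs, 'DATE="([^"]*)"')[, 2],
    value = str_match(obs, 'VALUE="([^"]*)"')[, 2]
  ) %>%
  select(date, value, SERIES_NAME, UNIT)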
This may seem difficult at first glance, but it will take you much less time than using xml2.
Thanks, Jacek, for the response. May I know how long it took to convert your file using the described method?
I thought the xml2 package worked well for reading the file and detecting the nodes. It was the loop to extract the information that took about six hours, so I suspected the loop itself was the problem. Other posts recommended using the apply family instead of a loop, but I could not figure it out for my case.
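Here is roughly what I was trying, though I was not sure it is correct. Since xml_attr() accepts a whole node set, the inner loop over children should not be needed (attribute names as in my code above):

# Untested sketch: one xml_attr() call per attribute pulls the values
# for all children of a series at once
node <- xml_find_all(pg, xpath = "//kf:Series")
datalist <- lapply(node, function(s) {
  obs <- xml_children(s)
  data.frame(
    date        = xml_attr(obs, "DATE"),
    value       = xml_attr(obs, "VALUE"),
    SERIES_NAME = xml_attr(s, "SERIES_NAME"),
    UNIT        = xml_attr(s, "UNIT")
  )
})
big_data <- bind_rows(datalist)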
Well, reading the txt file into R and cutting it into the character vector took me 5-10 minutes each. The further steps take 30 seconds at most, but you have to write proper code for each step first, which may turn out to be time-consuming.
I have to say that working with XML files is an unexpectedly hard task in R, for reasons I cannot explain. It's much easier and quicker with JSON files.
Actually, xml2 and similar packages work quite well, provided that the file is not too large.