Parsing XML/HTML with R. Getting Errors

lilshell43 · February 26, 2021, 5:25pm

Hello, I'm still new to R and I'm not good with XML or HTML. I'm trying to retrieve information from the web through an API and turn it into a data frame. All the information I get back is XML, and for most of the API calls the code below worked well:

library(httr)
library(xml2)
library(dplyr)

#Reformat the function source to make it more readable
#Translate that to a plain POST call without namespacing
getInfoInJsonCont <- POST(url = "https://app.bluefolder.com/api/2.0/contracts/list.aspx", 
                      body = "<request><contractList><listType></listType></contractList></request>",
                      authenticate(user = "TOKEN", password = "x"), 
                      verbose(), 
                      add_headers(), 
                      encode = "json")

#Creating data frame object
ContractDF = data.frame(matrix(nrow = 0, ncol=2))
colnames(ContractDF) = c("contractId", "contractName")

#Parsing and getting the information for a specific node
ContractXML =  content(getInfoInJsonCont) %>% xml2::xml_find_all("//contract")
for(contract in ContractXML){
  contractId = contract %>% xml_find_all(".//contractId") %>% xml_text()
  contractName =  contract %>% xml_find_all(".//contractName") %>% xml_text()
  
  #Using rbind to create the columns for the data frame
  ContractDF = rbind(ContractDF,data.frame(contractId=contractId,
                                             contractName=contractName
 ))
  
}
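
As a side note on the loop above: growing a data frame with rbind inside a for loop works, but xml2's functions are vectorised over node sets, so the same table can usually be built in one step. The following is a minimal sketch, assuming each contract node has at most one contractId and one contractName child. The sample_xml string is made up for illustration and is not real API output.

```r
library(xml2)

# Hypothetical sample that mimics the shape of the contracts response
sample_xml <- "<response status='ok'><contractList>
  <contract><contractId>1</contractId><contractName>Alpha</contractName></contract>
  <contract><contractId>2</contractId><contractName>Beta</contractName></contract>
</contractList></response>"

doc <- read_xml(sample_xml)
contracts <- xml_find_all(doc, "//contract")

# xml_find_first() returns one result per contract node (NA text if the child is missing),
# so the columns stay aligned without a loop
ContractDF <- data.frame(
  contractId   = xml_text(xml_find_first(contracts, "./contractId")),
  contractName = xml_text(xml_find_first(contracts, "./contractName")),
  stringsAsFactors = FALSE
)
ContractDF
```

With a real response, content(getInfoInJsonCont) would take the place of read_xml(sample_xml).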

One API call is giving me errors because the data contains invalid characters that break parsing. After looking through the file, I found these characters in it:

"
'
&#x0D
&#x20
&#x0A
&#x0B
&#x1C
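
Not everything in this list is actually invalid XML. Plain " and ' are allowed in element text, and the XML 1.0 specification only permits the code points #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF. The sketch below checks the listed references against that rule. The helper name is_xml10_char is made up for this example, and the references are written with the trailing semicolon they carry in the actual response.

```r
# TRUE if a code point is allowed by the XML 1.0 "Char" production
# (is_xml10_char is a hypothetical helper, not part of any package)
is_xml10_char <- function(cp) {
  cp %in% c(0x9, 0xA, 0xD) |
    (cp >= 0x20 & cp <= 0xD7FF) |
    (cp >= 0xE000 & cp <= 0xFFFD) |
    (cp >= 0x10000 & cp <= 0x10FFFF)
}

refs <- c("&#x0D;", "&#x20;", "&#x0A;", "&#x0B;", "&#x1C;")

# Pull the hex digits out of each reference and convert them to integers
code_points <- strtoi(sub("^&#x([0-9A-Fa-f]+);$", "\\1", refs), base = 16L)

legal <- is_xml10_char(code_points)
data.frame(ref = refs, code_point = code_points, legal = legal)
```

Under that rule only &#x0B; and &#x1C; are illegal. Carriage return, line feed and space are fine and do not need to be removed.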

Here are the steps I tried, but I'm stuck:

library(httr)
library(xml2)
library(dplyr)
library(XML)
library(plyr)

getInfoInJsonSR <- POST(url = "https://app.bluefolder.com/api/2.0/serviceRequests/list.aspx", 
     body = "<request><listType>full</listType></request>",
     authenticate(user = "TOKEN", password = "x"), 
     verbose(), 
     add_headers(), 
     encode = "json")

#Approach is to get the content as text so I can try to clean the data with gsub.
serviceXML =  content(getInfoInJsonSR, type = "text", encoding = "UTF-8")

#Change working directory to be able to save on network share drive
setwd("\\\\hwo-file\\ExampleLocation")

#Save the data in a txt file format
write.table(serviceXML, file="Sample.txt")

#Formatting and using gsub to get rid of invalid XML characters to successfully parse the data.
# Read a txt file
tx <- readLines("Sample.txt")
tx <- gsub("'", "", tx)
tx <- gsub('"', "", tx)
tx <- gsub("&#x0D", "", tx)
tx <- gsub("&#x20", "", tx)
tx <- gsub("&#x0A", "", tx)
tx <- gsub("&#x0B", "", tx)
tx <- gsub("&#x1C", "", tx)
tx <- gsub("&", "", tx)

#The full tx object is a long list so I try to convert the list into 1 string.
tx2 <- paste( unlist(tx), collapse='')

#Exporting clean file to an XML file format
write.table(tx2,file("Sample2.txt"))

#Parsing the clean XML File
data <- xmlParse(file = "Sample2.txt")

When I try xmlParse, I get the error:

Error: 1: Start tag expected, '<' not found
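
One likely cause of this message is the file round trip rather than the XML itself. write.table() treats the string as a one-column table, so it writes a header line ("x"), a row name ("1") and quotes around the value, and the file no longer begins with '<'. The sketch below compares write.table() with writeLines() on a short made-up payload, using temporary files instead of the network share.

```r
# Hypothetical, shortened payload standing in for the API response text
payload <- "<?xml version=\"1.0\" ?><response status='ok'></response>"

tmp_table <- tempfile(fileext = ".txt")
tmp_lines <- tempfile(fileext = ".txt")

write.table(payload, file = tmp_table)  # adds a header, a row name and quoting
writeLines(payload, tmp_lines)          # writes the string unchanged

first_table <- readLines(tmp_table)[1]
first_lines <- readLines(tmp_lines)[1]

startsWith(first_table, "<")  # FALSE: the first line is the "x" header
startsWith(first_lines, "<")  # TRUE: the XML declaration comes first
```

Removing every ' and " with gsub also strips the quotes from attributes such as status='ok', which makes the document malformed even when the file does start with '<'.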

I know the start tag is there, but I can't get past this error. I want to be able to parse the data successfully and make a data frame. Here is a sample of what I get from the following call, as saved in Sample.txt:
serviceXML = content(getInfoInJsonSR, type = "text", encoding = "UTF-8")

"x"
"1" "<?xml version=\"1.0\" ?><response status='ok'><serviceRequestList><serviceRequest><accountManagerId>11111</accountManagerId><billable>0</billable><billableTotal>0.0000000000</billableTotal><billingStatus>Not Billed</billingStatus><costTotal>0.0000</costTotal><customerContactEmail>example@imperial.nhs.uk</customerContactEmail><customerContactId>2222222</customerContactId><customerContactName>Example Example</customerContactName><customerContactPhone>0044 (0)000 111 2222</customerContactPhone><customerContactPhoneMobile></customerContactPhoneMobile><customerId>444444</customerId><customerLocationCity>London</customerLocationCity><customerLocationCountry>United Kingdom</customerLocationCountry><customerLocationId>9999999</customerLocationId><customerLocationName>Example's Hospital</customerLocationName><customerLocationNotes></customerLocationNotes><customerLocationPostalCode>W2 1NY</customerLocationPostalCode><customerLocationState>Greater London</customerLocationState><customerLocationStreetAddress>Example Street</customerLocationStreetAddress><customerLocationZone></customerLocationZone><customerName>Example  Healthcare (EXAMPLE)</customerName><dateTimeCreated>2010-04-06T09:47:25</dateTimeCreated><dateTimeClosed>2011-05-24T07:32:05.240</dateTimeClosed><description>Example - Ex/CANCELED</description><detailedDescription></detailedDescription><priority>3</priority><priorityLabel>Medium</priorityLabel><serviceManagerId>0</serviceManagerId><serviceRequestId>1007</serviceRequestId><status>Closed</status><timeOpen_hours>9909.7500000</timeOpen_hours><type></type></serviceRequest><serviceRequest><accountManagerId>11111</accountManagerId><billable>0</billable><billableTotal>0.0000000000</billableTotal><billingStatus>Not Billed</billingStatus><costTotal>0.0000</costTotal><customerContactEmail>example.example@gstt.nhs.uk, example2.example2@gstt.nhs.uk</customerContactEmail><customerContactId>5555555</customerContactId><customerContactName>Ex 
Example</customerContactName><customerContactPhone>88888 444444</customerContactPhone><customerContactPhoneMobile>07817 738912</customerContactPhoneMobile><customerId>957056</customerId><customerLocationCity>London</customerLocationCity><customerLocationCountry>United Kingdom</customerLocationCountry><customerLocationId>1372407</customerLocationId><customerLocationName>St Thomas' Hospital</customerLocationName><customerLocationNotes></customerLocationNotes><customerLocationPostalCode>SE1 7EH</customerLocationPostalCode><customerLocationState>Greater London</customerLocationState><customerLocationStreetAddress>Example Bridge Road</customerLocationStreetAddress><customerLocationZone></customerLocationZone><customerName>Examples'  Trust (EXTT)</customerName><dateTimeCreated>2010-06-10T07:37:58</dateTimeCreated><dateTimeClosed>2010-06-10T07:42:40</dateTimeClosed><description>Software  - EXAMPLE - 65463</description><detailedDescription>The example that I have created.&#x0D;
This is an example, I made up the data.&#x0D;
This is another line for the example. &#x0D;
</detailedDescription><priority>3</priority><priorityLabel>Medium</priorityLabel><serviceManagerId>0</serviceManagerId><serviceRequestId>6007</serviceRequestId><status>Closed</status><timeOpen_hours>0.0833000</timeOpen_hours><type>Problem</type></serviceRequest></serviceRequestList></response>"

How can I parse the example above? I tried

data <- htmlTreeParse("Sample.txt")
data

And got the results below:

$file
[1] "Sample.txt"

$version
[1] ""

$children
$children$html
<html>
 <body>
  <p>
   &quot;x&quot;
&quot;1&quot; &quot;
   <?xml version=\"1.0\" ??>
   <response status="ok">
    <servicerequestlist>
     <servicerequest>
      <accountmanagerid>111111</accountmanagerid>
...
...
 </servicerequest>
    </servicerequestlist>
   </response>
   &quot;
  </p>
 </body>
</html>


attr(,"class")
[1] "XMLDocumentContent"
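
For the sample shown earlier, one possible approach is to skip the text file entirely: remove only the character references that XML 1.0 forbids, then hand the string straight to xml2::read_xml(). This is a sketch under assumptions, not a tested fix for the real API. The raw_txt payload is a shortened, made-up version of the response with &#x0B; and &#x1C; added on purpose, and the regular expression only covers hexadecimal references below #x20 (decimal forms such as &#11; would need a second pattern).

```r
library(xml2)

# Hypothetical, shortened payload modelled on the sample in this post
raw_txt <- paste0(
  "<?xml version=\"1.0\" ?><response status='ok'><serviceRequestList>",
  "<serviceRequest><serviceRequestId>1007</serviceRequestId>",
  "<customerLocationName>Example's Hospital</customerLocationName>",
  "<detailedDescription></detailedDescription></serviceRequest>",
  "<serviceRequest><serviceRequestId>6007</serviceRequestId>",
  "<customerLocationName>St Thomas' Hospital</customerLocationName>",
  "<detailedDescription>Line one.&#x0D;\nLine two.&#x0B;&#x1C;</detailedDescription>",
  "</serviceRequest></serviceRequestList></response>"
)

# Matches hex references to control characters below #x20,
# except tab (#x9), line feed (#xA) and carriage return (#xD), which are legal
illegal_ref <- "&#x0*([0-8BbCcEeFf]|1[0-9A-Fa-f]);"
clean_txt <- gsub(illegal_ref, "", raw_txt)

doc <- read_xml(clean_txt)
requests <- xml_find_all(doc, "//serviceRequest")

# One value per serviceRequest node; a missing child yields NA
field <- function(name) xml_text(xml_find_first(requests, paste0("./", name)))

ServiceDF <- data.frame(
  serviceRequestId     = field("serviceRequestId"),
  customerLocationName = field("customerLocationName"),
  detailedDescription  = field("detailedDescription"),
  stringsAsFactors = FALSE
)
ServiceDF
```

With the real response, raw_txt would be the string returned by content(getInfoInJsonSR, as = "text", encoding = "UTF-8"). Apostrophes such as the one in St Thomas' Hospital are left alone because they are valid in element text.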
