Hi, R community! Thanks in advance for considering my enquiry.
I want to write a script that does the following job:
- Input files: 1,115 XML files saved in my directory (tagged for 39 features)
- Operation: count each of the 39 features in each XML file
- Output: a frequency table looking like the one below
So far, I have the script below, using the package ‘xml2’. The script finds a given feature and tells me how many times it appears. For example, in the script below, I wanted to count the frequency of ‘tag1’ in the XML file named ‘1_1.xml’. The ‘tag1’ feature appears within dependency sections in the XML files.
setwd('~/my directory/')
install.packages('xml2')  # only needed once
library(xml2)
text <- read_xml(x = '1_1.xml')
# find dependency sections
dependencies <- xml_find_all(text, './/dependencies')
# find <dep ...> tags within those sections
deps <- xml_find_all(dependencies, './/dep')
# keep only the <dep> tags whose 'type' attribute matches the feature of interest, e.g. 'tag1'
tag1 <- deps[xml_attr(deps, 'type') == 'tag1']
N_tag1 <- length(tag1)
The code above needs improvement in two respects.
First, it does not repeat the job for each of the 39 tags (e.g., ‘tag1’) across the 1,115 XML texts.
So the code needs to be rewritten so that R counts each of the 39 tags in each of the 1,115 XML texts in one go, or at least with far less code than the script above.
I think a for loop might do the job, but I don’t know how to rewrite the script using one.
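For what it is worth, here is a minimal sketch of that idea, assuming the 39 feature names are collected in a character vector (the names in tag_names below are placeholders for your real ones). It wraps the counting logic in a function so that a single call counts every feature in one file:
library(xml2)
# the 39 feature names -- replace the placeholders with your actual tag names
tag_names <- c('tag1', 'tag2', 'tag3')
# count every feature in one XML file; returns a named vector of counts
count_tags <- function(file, tags) {
  text <- read_xml(file)
  deps <- xml_find_all(text, './/dependencies//dep')
  types <- xml_attr(deps, 'type')
  # for each tag, count how many <dep> elements carry that 'type' attribute
  sapply(tags, function(tg) sum(types == tg, na.rm = TRUE))
}
count_tags('1_1.xml', tag_names)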
Second, the frequency of each tag (e.g., ‘tag1’) needs to be assembled into a frequency table as the output. I have no idea which function can do that once the tags have been counted in all the texts.
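One possible way to assemble such a table, assuming the count_tags() function from the sketch above and that all 1,115 files sit in the same directory (the file pattern and output file name below are assumptions), is to loop over the files with sapply() and bind the per-file counts into a data frame with one row per text and one column per tag:
# list all XML files in the directory
files <- list.files(path = '~/my directory/', pattern = '\\.xml$', full.names = TRUE)
# one row per file, one column per tag
counts <- t(sapply(files, count_tags, tags = tag_names))
freq_table <- data.frame(file = basename(files), counts, row.names = NULL)
head(freq_table)
# save the frequency table outside R if needed
write.csv(freq_table, 'tag_frequencies.csv', row.names = FALSE)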
Any suggestions will be much appreciated. Thanks for reading this question!