Learning how to "perform an operation" on each "element" within a data frame

Using the R programming language, I figured out how to import every PDF file (note: these are scanned PDFs) from a folder into R:

#load libraries
library(pdftools)
library(tesseract)

#location of the files
directory <- "C:/Users/Documents/files_i_want"

#obtain the full path of every PDF file in the folder
file.list <- list.files(directory, pattern = "\\.pdf$", full.names = TRUE)
#convert each page of each PDF to a jpeg image
b = lapply(file.list, FUN = function(files) {
    pdf_convert(files, format = "jpeg")
})
#store the file paths in a data frame
a = data.frame(file.list)

As a result, the object "a" looks like this:

> head(a)
                                               file.list
1 C:/Users/Documents/files_i_want/1_a.pdf
2 C:/Users/Documents/files_i_want/a_1.pdf

> dim(a)
[1] 2 1

> str(a)
'data.frame':	2 obs. of  1 variable:
 $ file.list: chr  "C:/Users/Documents/files_i_want/1_a.pdf" "C:/Users/Documents/files_i_want/a_1.pdf"

Now, I am trying to perform the following "operation" on each element within "a":

convert_function <- function(i){
text_i <- tesseract::ocr(a$i)
}

I am new to applying functions in R, and I don't think I am doing this correctly. The goal is to create objects "text_1" and "text_2". Here is my attempt at doing this:

output <- apply(a,1, 
                   FUN=function(x){
                       do.call(
                          
                          convert_function,
                          )
                   }
)

Can someone please show me how to fix this problem?

Thanks

Hi,

I don't see why you first want to convert a character vector to a data frame and then apply something over every element in that data frame (essentially converting it back to a vector :) )

So this step:

a = data.frame(file.list)

Is not needed and you can immediately apply your function over the file.list itself:

file.list = c("file1.pdf", "file2.pdf", "file3.pdf")

lapply(file.list, function(file){
  
  #Dummy function
  setNames(
    paste(sample(c(" ", LETTERS[1:10]), 50, replace = TRUE), collapse = ""), 
    file)
  
  # Your function
  # tesseract::ocr(file)
  
})
#> [[1]]
#>                                            file1.pdf 
#> "FJEDHG GEHFEJECHIICDIIAHGIBJHEE DBDEAH FDCBBDCADBF" 
#> 
#> [[2]]
#>                                            file2.pdf 
#> "AHBB DGA BIJHACCDGEEHIDECDE DIDDAF  FBFCEDHDIAGHDF" 
#> 
#> [[3]]
#>                                            file3.pdf 
#> "FGDAFACG  JIEAHI EAJAJJCEAEGFJADD F FGJDDI  HDFBCF"

Created on 2021-08-01 by the reprex package (v2.0.0)

The lapply function takes every element of a list or vector and passes it to a function, returning the results as a list. In this case, every file name in file.list is passed to the custom function as the argument file.
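
For anyone who wants to keep the data frame anyway, here is a rough sketch of why `a$i` in the original `convert_function` does not work, and how the list returned by lapply can stand in for separate "text_1" and "text_2" objects. Note that `fake_ocr` is a made-up placeholder (not part of tesseract) so the example runs without any PDFs, and the shortened file paths are invented for illustration.

```r
# Hypothetical stand-in for tesseract::ocr() so the sketch runs without
# the tesseract package or real files; swap in tesseract::ocr(path).
fake_ocr <- function(path) paste("text from", basename(path))

# Made-up paths, mirroring the two files in the question
file.list <- c("files_i_want/1_a.pdf", "files_i_want/a_1.pdf")
a <- data.frame(file.list, stringsAsFactors = FALSE)

# a$i looks for a column literally named "i", so it returns NULL
i <- 1
is.null(a$i)

# Index the real column by position instead
a$file.list[i]

# lapply() over the vector returns one result per file, as a list
texts <- lapply(file.list, fake_ocr)

# Name the elements rather than creating separate objects
names(texts) <- paste0("text_", seq_along(texts))
texts$text_1
```

Keeping the results in one named list is usually easier to work with than separate objects, since the whole set can be passed on to another lapply call later.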

Hope this helps,
PJ

Thank you for your answer! I ended up solving this using a built-in "for" loop. Would you like to see it?

Hi again,

You're welcome.
It's always nice to share the answer to your question, even if you came up with it yourself :) It might help someone else in the future...
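
The for-loop solution mentioned above was never posted in this thread, so the following is only a guess at what such a loop might look like, not the original poster's code. As an assumption, `fake_ocr` is a made-up placeholder for `tesseract::ocr` and the file paths are invented, so the sketch runs without any scanned PDFs.

```r
# Hypothetical stand-in for tesseract::ocr(); replace with the real call.
fake_ocr <- function(path) paste("text from", basename(path))

# Made-up paths, mirroring the two files in the question
file.list <- c("files_i_want/1_a.pdf", "files_i_want/a_1.pdf")

# Pre-allocate a list with one slot per file
results <- vector("list", length(file.list))

# Fill each slot in turn; seq_along() is safe even if file.list is empty
for (i in seq_along(file.list)) {
  results[[i]] <- fake_ocr(file.list[i])
}

# Label the results so they can be looked up as results$text_1, etc.
names(results) <- paste0("text_", seq_along(results))
```

This gives the same named list as the lapply approach; the loop is just more explicit about the index.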

PJ
