Using the R programming language, I figured out how to import every PDF file from a folder into R (note: these are scanned PDFs):
#load libraries
library(pdftools)
library(tesseract)
#location of the files
directory <- "C:/Users/Documents/files_i_want"
file.list <- list.files(directory, pattern = "\\.pdf$", full.names = TRUE)
#convert each PDF to JPEG images; pdf_convert() returns the image file names
b = lapply(file.list, FUN = function(files) {
  pdf_convert(files, format = "jpeg")
})
#store the PDF paths in a data frame
a = data.frame(file.list)
As a result, the object "a" looks like this:
> head(a)
                                file.list
1 C:/Users/Documents/files_i_want/1_a.pdf
2 C:/Users/Documents/files_i_want/a_1.pdf
> dim(a)
[1] 2 1
> str(a)
'data.frame': 2 obs. of 1 variable:
$ file.list: chr "C:/Users/Documents/files_i_want/1_a.pdf" "C:/Users/Documents/files_i_want/a_1.pdf"
Now, I am trying to perform the following "operation" on each element within "a":
convert_function <- function(i){
text_i <- tesseract::ocr(a$i)
}
I am new to applying functions in R, and I don't think I am doing this correctly. The goal is to create the objects "text_1" and "text_2". Here is my attempt:
output <- apply(a, 1,
  FUN = function(x) {
    do.call(
      convert_function,
    )
  }
)
Can someone please show me how to fix this problem?
Thanks
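
One possible direction, offered as a minimal sketch rather than a definitive answer: `a$i` looks for a column literally named `i`, so the function never sees the individual file. Looping over the file paths with `lapply` and returning a named list avoids that, and a list with elements `text_1`, `text_2` is usually easier to work with than separate objects. The names `ocr_pdfs`, `convert_fun` and `ocr_fun` are hypothetical helpers introduced here for illustration, and rendering to PNG at 300 dpi is an assumption, not a requirement of either package.

```r
# Sketch: OCR each scanned PDF and collect the text in a named list.
# ocr_pdfs(), convert_fun and ocr_fun are hypothetical names for illustration.
ocr_pdfs <- function(pdf_files,
                     convert_fun = function(f) {
                       # render every page of the PDF to an image file
                       pdftools::pdf_convert(f, format = "png", dpi = 300)
                     },
                     ocr_fun = tesseract::ocr) {
  texts <- lapply(pdf_files, function(f) {
    images <- convert_fun(f)  # one image file name per page
    # run OCR page by page; result is one string per page
    vapply(images, function(img) ocr_fun(img), character(1), USE.NAMES = FALSE)
  })
  # name the elements text_1, text_2, ... in the order of pdf_files
  names(texts) <- paste0("text_", seq_along(pdf_files))
  texts
}

# Usage (assumes the folder from the question exists):
# directory <- "C:/Users/Documents/files_i_want"
# file.list <- list.files(directory, pattern = "\\.pdf$", full.names = TRUE)
# texts <- ocr_pdfs(file.list)
# texts$text_1   # OCR text of the first PDF, one string per page
```

If separate objects are really needed, `list2env(texts, envir = environment())` would create `text_1` and `text_2` in the current environment, though keeping the list is normally simpler.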