Dear community members,
I try to extract data using the package pdftools from the following PDF: https://www.fmh.ch/files/pdf5/stat1996.pdf. I only need page 9 (and 10: but for this one I need to explore precisely the useful data).
I use the explanations found on the following sites: How to extract tabular data from PDFs with R | Datajournalism How-To's, Getting data from pdfs using the pdftools package and https://medium.com/swlh/the-adventure-of-pdf-to-data-frame-in-r-f90609035600.
Here is my code:
library(pdftools)
library(tidyverse)
cantons <- c("ZH", "BE", "LU", "UR", "SZ", "OW", "NW", "GL", "ZG", "FR", "SO", "BS", "BL", "SH", "AR", "AI", "SG", "GR", "AG", "TG", "TI", "VD", "VS", "NE", "GE", "JU", "Total")
med_1996 <- "https://www.fmh.ch/files/pdf5/stat1996.pdf"
dta_med_1996 <- pdf_text(med_1996)[9] %>%
str_split("\n", simplify = TRUE)
dta_med_1996 <- dta_med_1996[-1:-3]
dta_med_1996 <- dta_med_1996[-57]
names_ex = list()
for(i in 1:length(dta_med_1996)) {
words <- dta_med_1996[[i]] %>%
str_squish() %>%
str_extract(".*[:alpha:]+|\\&|\\-")
words_df <- data.frame(words)
names_ex[[i]] <- words_df
NH_names <- dplyr::bind_rows(names_ex)
}
print(NH_names)
numbers_ex = list()
k=1
for(i in 1:length(dta_med_1996)) {
for(j in 1:length(dta_med_1996[[i]])){
numbers <- dta_med_1996[[i]][j] %>%
str_extract("[:digit:]+.*")
numbers_df <- data.frame(numbers)
while(k <= 1000) {
numbers_ex[[k]]<- numbers_df
k <- k+1
break
}
}
NH_numbers <- dplyr::bind_rows(numbers_ex)
}
data_1996 <- cbind(NH_names, NH_numbers)
View(data_1996)
Here is my output (only the 5 first rows):
1 Allgemeinmedizin / Médecine générale 479 417 145 15 45 11 14 16 33 58 98 68 104 36 24 3 163 94 174 91 103 173 97 40 85 12 2598
2 spez. Arbeitsmedizin / spéc. médecine du travail 3 3 3 1 5 3 2 1 1 6 28
3 Total Allgemeinmedizin / total médecine générale 482 420 148 15 45 11 14 16 33 58 99 73 107 36 24 3 163 94 176 92 104 179 97 40 85 12 2626
4 Anästhesiologie / Anesthésiologie 101 91 20 2 3 1 2 2 9 21 7 24 19 2 4 23 12 33 3 21 64 15 13 36 5 533
5 spez. Arbeitsmedizin / spéc. médecine du travail 1 1
As you see, there are lots of blank spaces. What I want is the have NAs instead of blank spaces, but I can't find a suitable solution. At this point the best I can come up with is as follow:
data_1996 <- cbind(NH_names, NH_numbers) %>%
mutate(numbers = str_replace_all(numbers, "\\s", "NA")) %>%
separate(numbers, cantons, sep = "\\NA", extra = "drop")
View(data_1996)
which gives the following output (only the 5 first rows):
1 Allgemeinmedizin / Médecine générale 479 417 145 15 45 11 14 16 33 58 98 68 104
2 spez. Arbeitsmedizin / spéc. médecine du travail 3 3 3
3 Total Allgemeinmedizin / total médecine générale 482 420 148 15 45 11 14 16 33 58 99 73 107
4 Anästhesiologie / Anesthésiologie 101 91 20 2 3 1 2 2 9
5 spez. Arbeitsmedizin / spéc. médecine du travail 1
This piece of code ordered the dataset well (by cantons), but instead of filling the right cells with NAs, it only adds new blank cells. Plus, I lost some important information (for example: the total number of physicians).
Any help would be greatly appreciated!
Many thanks!