Extract data from pdf and NAs

sluyet · April 19, 2021, 9:19am

Dear community members,

I try to extract data using the package pdftools from the following PDF: https://www.fmh.ch/files/pdf5/stat1996.pdf. I only need page 9 (and 10: but for this one I need to explore precisely the useful data).

I use the explanations found on the following sites: How to extract tabular data from PDFs with R | Datajournalism How-To's, Getting data from pdfs using the pdftools package and https://medium.com/swlh/the-adventure-of-pdf-to-data-frame-in-r-f90609035600.

Here is my code:

library(pdftools)
library(tidyverse)

cantons <- c("ZH", "BE", "LU", "UR", "SZ", "OW", "NW", "GL", "ZG", "FR", "SO", "BS", "BL", "SH", "AR", "AI", "SG", "GR", "AG", "TG", "TI", "VD", "VS", "NE", "GE", "JU", "Total")
med_1996 <- "https://www.fmh.ch/files/pdf5/stat1996.pdf"
dta_med_1996 <- pdf_text(med_1996)[9] %>% 
  str_split("\n", simplify = TRUE) 
dta_med_1996 <- dta_med_1996[-1:-3]
dta_med_1996 <- dta_med_1996[-57] 

names_ex = list()
for(i in 1:length(dta_med_1996)) {
  words <- dta_med_1996[[i]] %>% 
    str_squish() %>%
    str_extract(".*[:alpha:]+|\\&|\\-") 
  words_df <- data.frame(words) 
  names_ex[[i]] <- words_df
  NH_names <- dplyr::bind_rows(names_ex)
}
print(NH_names)

numbers_ex = list()
k=1
for(i in 1:length(dta_med_1996)) {
  for(j in 1:length(dta_med_1996[[i]])){
    numbers <- dta_med_1996[[i]][j] %>% 
      str_extract("[:digit:]+.*")
    numbers_df <- data.frame(numbers) 
    while(k <= 1000) {
      numbers_ex[[k]]<- numbers_df 
      k <- k+1
      break
    }
  }
  NH_numbers <- dplyr::bind_rows(numbers_ex) 
}

data_1996 <- cbind(NH_names, NH_numbers)
View(data_1996)

Here is my output (only the 5 first rows):

1                                                 Allgemeinmedizin / Médecine générale 479   417   145  15  45  11  14  16  33  58  98  68 104  36  24  3 163  94 174  91 103 173  97  40  85  12      2598
2                                     spez. Arbeitsmedizin / spéc. médecine du travail 3      3    3                               1   5   3                      2   1   1   6                         28
3                                     Total Allgemeinmedizin / total médecine générale 482   420   148  15  45  11  14  16  33  58  99  73 107  36  24  3 163  94 176  92 104 179  97  40  85  12      2626
4                                                    Anästhesiologie / Anesthésiologie 101     91   20   2   3   1   2   2   9  21   7  24  19   2   4     23  12  33   3  21  64  15  13  36   5       533
5                                     spez. Arbeitsmedizin / spéc. médecine du travail                                                                              1                                      1

As you see, there are lots of blank spaces. What I want is the have NAs instead of blank spaces, but I can't find a suitable solution. At this point the best I can come up with is as follow:

data_1996 <- cbind(NH_names, NH_numbers) %>%
  mutate(numbers = str_replace_all(numbers, "\\s", "NA")) %>%
  separate(numbers, cantons, sep = "\\NA", extra = "drop")
View(data_1996)

which gives the following output (only the 5 first rows):

1                                                 Allgemeinmedizin / Médecine générale 479       417       145    15 45    11    14    16    33    58    98    68 104      
2                                     spez. Arbeitsmedizin / spéc. médecine du travail   3                   3 3                                                         
3                                     Total Allgemeinmedizin / total médecine générale 482       420       148    15 45    11    14    16    33    58    99    73 107      
4                                                    Anästhesiologie / Anesthésiologie 101              91        20 2        3        1        2        2            9
5                                     spez. Arbeitsmedizin / spéc. médecine du travail   1

This piece of code ordered the dataset well (by cantons), but instead of filling the right cells with NAs, it only adds new blank cells. Plus, I lost some important information (for example: the total number of physicians).

Any help would be greatly appreciated!

Many thanks!

Jwvz001 · April 21, 2021, 8:59pm

How about if you replace this structure by read_lines(na = character() ?

JW

sluyet · April 22, 2021, 4:45am

Dear JW,

Thanks so much for your help!
However with your proposition, I obtain exactly the same outputs. I think the problem lies in this piece of code:

data_1996 <- cbind(NH_names, NH_numbers) %>%
  mutate(numbers = str_replace_all(numbers, "\\s", "NA")) %>%
  separate(numbers, cantons, sep = "\\NA", extra = "drop")
View(data_1996)

Any other solution? Once again, any help would be greatly appreciated!

Jwvz001 · April 22, 2021, 9:13am

Hi,

I had a quick look at the site and the table you want to extract. It's not a nice dataframe Whenever I extract pdf-docs, I use the tabulizer::extract_tables(), which at least maintains a tabular structure. It coerces into a list here, because of the first 2 rows. The package has also extract_area()-function which gives more precise control. Have a look. With dplyr::na_if() I was able retrieve the missings as 'NA'. Could you have a look if this brings you any further? I generally try to coerce 'NA' upstream in my workflow.

The first 2 rows should be treated separately , as they are headers.
Hope this helps you further,

JW

library(pdftools)
#> Warning: package 'pdftools' was built under R version 4.0.3
#> Using poppler version 0.73.0
library(tabulizer)
#> Warning: package 'tabulizer' was built under R version 4.0.3
library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 4.0.3
#> Warning: package 'tibble' was built under R version 4.0.4
#> Warning: package 'tidyr' was built under R version 4.0.4
#> Warning: package 'readr' was built under R version 4.0.3
#> Warning: package 'dplyr' was built under R version 4.0.4
#> Warning: package 'forcats' was built under R version 4.0.3

med_1996 <- "https://www.fmh.ch/files/pdf5/stat1996.pdf"

my_dta_med_1996 <-as_tibble(extract_tables(med_1996, pages = 9, encoding = "UTF-8")[[1]]) %>%
                    mutate(across(where(is.character), ~na_if(.,"")))
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
#> Using compatibility `.name_repair`.
(my_dta_med_1996)                   
#> # A tibble: 58 x 28
#>    V1    V2    V3    V4    V5    V6    V7    V8    V9    V10   V11   V12   V13  
#>    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#>  1 5. B~  <NA> <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#>  2 Fach~ "Kan~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#>  3 Allg~ "479" 417   145   15    45    11    14    16    33    58    98    68   
#>  4 spez~ "3"   3     3     <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  1     5    
#>  5 Tota~ "482" 420   148   15    45    11    14    16    33    58    99    73   
#>  6 Anäs~ "101" 91    20    2     3     1     2     2     9     21    7     24   
#>  7 spez~  <NA> <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#>  8 spez~ "22"  12    4     <NA>  1     <NA>  <NA>  <NA>  3     4     4     11   
#>  9 spez~ "1"   1     <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> 10 Tota~ "124" 104   24    2     4     1     2     2     12    25    11    35   
#> # ... with 48 more rows, and 15 more variables: V14 <chr>, V15 <chr>,
#> #   V16 <chr>, V17 <chr>, V18 <chr>, V19 <chr>, V20 <chr>, V21 <chr>,
#> #   V22 <chr>, V23 <chr>, V24 <chr>, V25 <chr>, V26 <chr>, V27 <chr>, V28 <chr>

sluyet · April 23, 2021, 6:30am

Hi,

Manx many thanks for the help! It works just fine.

I know it's an awful dataframe. I asked the organization to provide me with the data (at least in xlsx format), but they did not follow up, the reason being that all the data is available online (but very difficult to work with).

With the following lines of code, I could remove the problematic lines:

dta_med_1996 <- dta_med_1996[-1:-3]
dta_med_1996 <- dta_med_1996[-57]

As your solution is better, I will rely on the tabulizer package.

As an edit, in order to have the headers right, I have integrated the name of the variable "Spécialité" in the list of cantons:

cantons <- c("Spécialité", "ZH", "BE", "LU", "UR", "SZ", "OW", "NW", "GL", "ZG", "FR", "SO", "BS", "BL", "SH", "AR", "AI", "SG", "GR", "AG", "TG", "TI", "VD", "VS", "NE", "GE", "JU", "Total")

Then I drop the first 2 rows (as you mentioned they should be treated separately). So the whole code to extract the data from the pdf:

library(pdftools)
library(tidyverse)
library(tabulizer)

cantons <- c("Spécialités", "ZH", "BE", "LU", "UR", "SZ", "OW", "NW", "GL", "ZG", "FR", "SO", "BS", "BL", "SH", "AR", 
             "AI", "SG", "GR", "AG", "TG", "TI", "VD", "VS", "NE", "GE", "JU", "Total")

med_1996 <- "https://www.fmh.ch/files/pdf5/stat1996.pdf"

dta_med_1996 <- as_tibble(extract_tables(med_1996, pages = 9, encoding = "UTF-8")[[1]]) %>%
  mutate(across(where(is.character), ~na_if(.,""))) 
dta_med_1996 <- dta_med_1996 %>%
  `names<-`(cantons) %>%  #insert new headers
  slice(3:58) #drop the first 2 rows

which gives the following output:

  Spécialités                                                ZH    BE    LU    UR    SZ    OW    NW    GL    ZG    FR    SO    BS    BL    SH    AR    AI    SG    GR    AG    TG    TI    VD    VS    NE    GE    JU    Total
   <chr>                                                      <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
 1 Allgemeinmedizin / Médecine générale                       479   417   145   15    45    11    14    16    33    58    98    68    104   36    24    3     163   94    174   91    103   173   97    40    85    12    2598 
 2 spez. Arbeitsmedizin / spéc. médecine du travail           3     3     3     NA    NA    NA    NA    NA    NA    NA    1     5     3     NA    NA    NA    NA    NA    2     1     1     6     NA    NA    NA    NA    28   
 3 Total Allgemeinmedizin / total médecine générale           482   420   148   15    45    11    14    16    33    58    99    73    107   36    24    3     163   94    176   92    104   179   97    40    85    12    2626 
 4 Anästhesiologie / Anesthésiologie                          101   91    20    2     3     1     2     2     9     21    7     24    19    2     4     NA    23    12    33    3     21    64    15    13    36    5     533

Once again, many thanks for the help and the solution!

Best regards,
SL

Jwvz001 · April 23, 2021, 10:54am

Hi @sluyet,

Glad to be of help! For the convenience of others and later reference, would you tick it as solution?

Good luck with the table.

JW

system · April 30, 2021, 10:55am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.