Extract tables from a PDF

juandmaz · July 10, 2025, 8:40pm

Hi, I have a PDF with several tables and I’d like to extract them and work with them in R.
The problem is that the PDF page I need contains many tables (in fact, I had to crop the image to make it clearly visible), and it's difficult for me to extract them in R.

This is the code I’m using and this is the output.

> f<-(tabulapdf::extract_tables("C:/Users/Juan/Desktop/proyecciones_prov_2010_2040.pdf", pages = 105))
New names:                                                                                                      
• `` -> `...1`
• `` -> `...2`
• `` -> `...4`
• `` -> `...5`
• `` -> `...7`
• `` -> `...8`
• `` -> `...10`

> print(f[[1]], n = Inf)
# A tibble: 49 × 10
   ...1      ...2        `2010`  ...4    ...5        `2011`  ...7    ...8        `2012`  ...10  
   <chr>     <chr>       <chr>   <chr>   <chr>       <chr>   <chr>   <chr>       <chr>   <chr>  
 1 Edad      NA          NA      NA      NA          NA      NA      NA          NA      NA     
 2 NA        Ambos secsos Varones Mujeres Ambos secsos Varones Mujeres Ambos secsos Varones Mujeres
 3 Total     683.513     336.954 346.559 692.379     341.398 350.981 701.252     345.849 355.403
 4 0-4       64.797      33.381  31.416  64.867      33.437  31.430  65.145      33.588  31.557 
 5 5-9       68.705      35.125  33.580  67.970      34.802  33.168  67.173      34.450  32.723 
 6 10-14     71.371      36.261  35.110  71.011      36.140  34.871  70.637      36.006  34.631 
 7 15-19     69.674      35.070  34.604  70.823      35.732  35.091  71.326      36.065  35.261 
 8 20-24     56.769      28.189  28.580  58.275      29.015  29.260  60.356      30.138  30.218 
 9 25-29     53.661      26.443  27.218  53.198      26.179  27.019  52.857      25.996  26.861 
10 30-34     53.401      26.298  27.103  53.951      26.575  27.376  54.059      26.617  27.442 
11 35-39     45.319      22.146  23.173  47.304      23.165  24.139  49.228      24.153  25.075 
12 40-44     37.162      17.942  19.220  38.408      18.576  19.832  39.881      19.329  20.552 
13 45-49     33.457      16.015  17.442  33.916      16.242  17.674  34.431      16.496  17.935 
14 50-54     30.572      14.511  16.061  30.997      14.695  16.302  31.444      14.905  16.539 
15 55-59     27.305      13.061  14.244  27.827      13.258  14.569  28.286      13.416  14.870 
16 60-64     22.371      10.601  11.770  23.126      10.957  12.169  23.868      11.309  12.559 
17 65-69     17.234      8.100   9.134   17.855      8.359   9.496   18.519      8.640   9.879  
18 70-74     12.994      6.011   6.983   13.363      6.153   7.210   13.773      6.314   7.459  
19 75-79     9.320       4.154   5.166   9.601       4.285   5.316   9.872       4.403   5.469  
20 80-84     5.613       2.310   3.303   5.847       2.409   3.438   6.096       2.520   3.576  
21 85-89     2.661       1.001   1.660   2.812       1.048   1.764   2.964       1.096   1.868  
22 90-94     872         269     603     964         306     658     1.055       339     716    
23 95-99     216         54      162     215         51      164     227         54      173    
24 100 y más 39          12      27      49          14      35      55          15      40     
25 NA        NA          2013    NA      NA          2014    NA      NA          2015    NA     
26 Edad      NA          NA      NA      NA          NA      NA      NA          NA      NA     
27 NA        Ambos secsos Varones Mujeres Ambos secsos Varones Mujeres Ambos secsos Varones Mujeres
28 Total     710.121     350.301 359.820 718.971     354.747 364.224 727.780     359.175 368.605
29 0-4       65.579      33.809  31.770  66.140      34.085  32.055  66.721      34.361  32.360 
30 5-9       66.390      34.106  32.284  65.671      33.793  31.878  65.118      33.566  31.552 
31 10-14     70.221      35.845  34.376  69.701      35.630  34.071  69.077      35.359  33.718 
32 15-19     71.347      36.152  35.195  71.110      36.104  35.006  70.768      35.999  34.769 
33 20-24     62.697      31.398  31.299  64.922      32.603  32.319  66.709      33.587  33.122 
34 25-29     52.760      25.957  26.803  53.074      26.146  26.928  53.930      26.629  27.301 
35 30-34     53.852      26.488  27.364  53.459      26.257  27.202  52.989      25.982  27.007 
36 35-39     50.977      25.051  25.926  52.416      25.790  26.626  53.452      26.324  27.128 
37 40-44     41.547      20.181  21.366  43.359      21.111  22.248  45.290      22.099  23.191 
38 45-49     35.051      16.804  18.247  35.841      17.199  18.642  36.839      17.705  19.134 
39 50-54     31.904      15.133  16.771  32.367      15.368  16.999  32.827      15.598  17.229 
40 55-59     28.708      13.556  15.152  29.118      13.698  15.420  29.535      13.863  15.672 
41 60-64     24.580      11.641  12.939  25.237      11.935  13.302  25.828      12.181  13.647 
42 65-69     19.211      8.940   10.271  19.927      9.260   10.667  20.654      9.592   11.062 
43 70-74     14.222      6.494   7.728   14.722      6.694   8.028   15.269      6.920   8.349  
44 75-79     10.153      4.517   5.636   10.445      4.633   5.812   10.764      4.754   6.010  
45 80-84     6.348       2.635   3.713   6.604       2.752   3.852   6.848       2.863   3.985  
46 85-89     3.121       1.148   1.973   3.281       1.203   2.078   3.455       1.267   2.188  
47 90-94     1.146       373     773     1.233       400     833     1.323       428     895    
48 95-99     251         60      191     286         73      213     323         85      238    
49 100 y más 56          13      43      58          13      45      61          13      48

As you can see, a second set of year headers (2013, 2014, 2015) appears in row 25, but R imports it as ordinary rows because it doesn't detect those years as actual columns.
What I need is the 'Total' column and then one column per year, using only the values under 'Mujeres' and not the other two groups. This is my ideal output, which I built by hand and would like to reproduce in R:

df %>%
  head(15)
# A tibble: 15 × 7
   Total   `2010` `2011` `2012` `2013` `2014` `2015`
   <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 0 a 4    31416  31430  31557  31770  32055  32360
 2 5 a 9    33580  33168  32723  32284  31878  31552
 3 10 a 14  35110  34871  34631  34376  34071  33718
 4 15 a 19  34604  35091  35261  35195  35006  34769
 5 20 a 24  28580  29260  30218  31299  32319  33122
 6 25 a 29  27218  27019  26861  26803  26928  27301
 7 30 a 34  27103  27376  27442  27364  27202  27007
 8 35 a 39  23173  24139  25075  25926  26626  27128
 9 40 a 44  19220  19832  20552  21366  22248  23191
10 45 a 49  17442  17674  17935  18247  18642  19134
11 50 a 54  16061  16302  16539  16771  16999  17229
12 55 a 59  14244  14569  14870  15152  15420  15672
13 60 a 64  11770  12169  12559  12939  13302  13647
14 65 a 69   9134   9496   9879  10271  10667  11062
15 70 a 74   6983   7210   7459   7728   8028   8349

PS: I had to write 'secso' because Posit doesn't let me write the correct word.
EDIT: Sorry, I just realized that the output I posted doesn’t clearly show the issue. I’ve edited it now so my difficulties can be better understood.

mduvekot · July 10, 2025, 9:44pm

That should be pretty straightforward to clean up:

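# `df` is presumably the extracted table, i.e. f[[1]] from the first post
df |>
  dplyr::select(`...1`, `2010`, `2011`, `2012`) |>
  utils::tail(-3) |>  # drop the first three (header) rows
  dplyr::rename("Edad" = `...1`) |>
  dplyr::mutate(dplyr::across(-Edad, as.numeric))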

gives:

# A tibble: 7 × 4
  Edad  `2010` `2011` `2012`
  <chr>  <dbl>  <dbl>  <dbl>
1 0-4     33.4   33.4   65.1
2 5-9     35.1   34.8   67.2
3 10-14   36.3   36.1   70.6
4 15-19   35.1   35.7   71.3
5 20-24   28.2   29.0   60.4
6 25-29   26.4   26.2   52.9
7 30-34   26.3   26.6   54.1
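
A caveat on the conversion step: in this PDF the dot is a thousands separator, so as.numeric() reads "33.381" as 33.381 rather than 33381. Below is a minimal base R sketch of one way around it, assuming the table contains no true decimals. The vector `x` is sample input copied by hand from the printed table, and `counts` is just an illustrative name; readr::parse_number() with a locale whose grouping_mark is "." should be an equivalent option.

```r
# Values as they appear in the extracted table: "." separates thousands.
x <- c("31.416", "9.134", "603", "1.055")

# as.numeric() treats the dot as a decimal point, which is not wanted here.
as.numeric(x)

# Removing the dots first recovers the whole-number counts
# (assumption: the table never contains real decimal values).
counts <- as.numeric(gsub(".", "", x, fixed = TRUE))
counts
```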

juandmaz · July 11, 2025, 4:15am

Sorry, I just realized that the output I posted doesn’t clearly show the issue. I’ve edited it now so my difficulties can be better understood.

mduvekot · July 11, 2025, 1:10pm

Then I don't understand what the issue is. Can you try to explain it?

AlexisW · July 11, 2025, 4:11pm

It takes work to make sure it's processed correctly. Here is a quick try that might give you some ideas:

library(tidyverse)

f <- "...1  ...2    2010    ...4    ...5    2011    ...7    ...8    2012    ...10
Edad    NA  NA  NA  NA  NA  NA  NA  NA  NA
NA  Ambos secsos    Varones Mujeres Ambos secsos    Varones Mujeres Ambos secsos    Varones Mujeres
Total   683.513 336.954 346.559 692.379 341.398 350.981 701.252 345.849 355.403
0-4 64.797  33.381  31.416  64.867  33.437  31.430  65.145  33.588  31.557
5-9 68.705  35.125  33.580  67.970  34.802  33.168  67.173  34.450  32.723
10-14   71.371  36.261  35.110  71.011  36.140  34.871  70.637  36.006  34.631
15-19   69.674  35.070  34.604  70.823  35.732  35.091  71.326  36.065  35.261
20-24   56.769  28.189  28.580  58.275  29.015  29.260  60.356  30.138  30.218
25-29   53.661  26.443  27.218  53.198  26.179  27.019  52.857  25.996  26.861
30-34   53.401  26.298  27.103  53.951  26.575  27.376  54.059  26.617  27.442
35-39   45.319  22.146  23.173  47.304  23.165  24.139  49.228  24.153  25.075
40-44   37.162  17.942  19.220  38.408  18.576  19.832  39.881  19.329  20.552
45-49   33.457  16.015  17.442  33.916  16.242  17.674  34.431  16.496  17.935
50-54   30.572  14.511  16.061  30.997  14.695  16.302  31.444  14.905  16.539
55-59   27.305  13.061  14.244  27.827  13.258  14.569  28.286  13.416  14.870
60-64   22.371  10.601  11.770  23.126  10.957  12.169  23.868  11.309  12.559
65-69   17.234  8.100   9.134   17.855  8.359   9.496   18.519  8.640   9.879
70-74   12.994  6.011   6.983   13.363  6.153   7.210   13.773  6.314   7.459
75-79   9.320   4.154   5.166   9.601   4.285   5.316   9.872   4.403   5.469
80-84   5.613   2.310   3.303   5.847   2.409   3.438   6.096   2.520   3.576
85-89   2.661   1.001   1.660   2.812   1.048   1.764   2.964   1.096   1.868
90-94   872 269 603 964 306 658 1.055   339 716
95-99   216 54  162 215 51  164 227 54  173
100 y más   39  12  27  49  14  35  55  15  40
NA  NA  2013    NA  NA  2014    NA  NA  2015    NA
Edad    NA  NA  NA  NA  NA  NA  NA  NA  NA
NA  Ambos secsos    Varones Mujeres Ambos secsos    Varones Mujeres Ambos secsos    Varones Mujeres
Total   710.121 350.301 359.820 718.971 354.747 364.224 727.780 359.175 368.605
0-4 65.579  33.809  31.770  66.140  34.085  32.055  66.721  34.361  32.360
5-9 66.390  34.106  32.284  65.671  33.793  31.878  65.118  33.566  31.552
10-14   70.221  35.845  34.376  69.701  35.630  34.071  69.077  35.359  33.718
15-19   71.347  36.152  35.195  71.110  36.104  35.006  70.768  35.999  34.769
20-24   62.697  31.398  31.299  64.922  32.603  32.319  66.709  33.587  33.122
25-29   52.760  25.957  26.803  53.074  26.146  26.928  53.930  26.629  27.301
30-34   53.852  26.488  27.364  53.459  26.257  27.202  52.989  25.982  27.007
35-39   50.977  25.051  25.926  52.416  25.790  26.626  53.452  26.324  27.128
40-44   41.547  20.181  21.366  43.359  21.111  22.248  45.290  22.099  23.191
45-49   35.051  16.804  18.247  35.841  17.199  18.642  36.839  17.705  19.134
50-54   31.904  15.133  16.771  32.367  15.368  16.999  32.827  15.598  17.229
55-59   28.708  13.556  15.152  29.118  13.698  15.420  29.535  13.863  15.672
60-64   24.580  11.641  12.939  25.237  11.935  13.302  25.828  12.181  13.647
65-69   19.211  8.940   10.271  19.927  9.260   10.667  20.654  9.592   11.062
70-74   14.222  6.494   7.728   14.722  6.694   8.028   15.269  6.920   8.349
75-79   10.153  4.517   5.636   10.445  4.633   5.812   10.764  4.754   6.010
80-84   6.348   2.635   3.713   6.604   2.752   3.852   6.848   2.863   3.985
85-89   3.121   1.148   1.973   3.281   1.203   2.078   3.455   1.267   2.188
90-94   1.146   373 773 1.233   400 833 1.323   428 895
95-99   251 60  191 286 73  213 323 85  238
100 y más   56  13  43  58  13  45  61  13  48" |>
  read.table(text = _,
             header = TRUE,
             check.names = FALSE,
             sep = "\t") |>
  as_tibble()

head(f)
#> # A tibble: 6 × 10
#>   ...1  ...2         `2010`  ...4    ...5        `2011` ...7  ...8  `2012` ...10
#>   <chr> <chr>        <chr>   <chr>   <chr>       <chr>  <chr> <chr> <chr>  <chr>
#> 1 Edad  <NA>         <NA>    <NA>    <NA>        <NA>   <NA>  <NA>  <NA>   <NA> 
#> 2 <NA>  Ambos secsos Varones Mujeres Ambos secs… Varon… Muje… Ambo… Varon… Muje…
#> 3 Total 683.513      336.954 346.559 692.379     341.3… 350.… 701.… 345.8… 355.…
#> 4 0-4   64.797       33.381  31.416  64.867      33.437 31.4… 65.1… 33.588 31.5…
#> 5 5-9   68.705       35.125  33.580  67.970      34.802 33.1… 67.1… 34.450 32.7…
#> 6 10-14 71.371       36.261  35.110  71.011      36.140 34.8… 70.6… 36.006 34.6…
dim(f)
#> [1] 49 10

# put the titles as a row
f1 <- rbind(colnames(f), f |> set_names(seq_len(ncol(f))))

head(f1)
#> # A tibble: 6 × 10
#>   `1`   `2`          `3`     `4`     `5`          `6`    `7`   `8`   `9`   `10` 
#>   <chr> <chr>        <chr>   <chr>   <chr>        <chr>  <chr> <chr> <chr> <chr>
#> 1 ...1  ...2         2010    ...4    ...5         2011   ...7  ...8  2012  ...10
#> 2 Edad  <NA>         <NA>    <NA>    <NA>         <NA>   <NA>  <NA>  <NA>  <NA> 
#> 3 <NA>  Ambos secsos Varones Mujeres Ambos secsos Varon… Muje… Ambo… Varo… Muje…
#> 4 Total 683.513      336.954 346.559 692.379      341.3… 350.… 701.… 345.… 355.…
#> 5 0-4   64.797       33.381  31.416  64.867       33.437 31.4… 65.1… 33.5… 31.5…
#> 6 5-9   68.705       35.125  33.580  67.970       34.802 33.1… 67.1… 34.4… 32.7…
dim(f1)
#> [1] 50 10


# find the subsets, whose second row starts with "Edad"
blocks_second <- which(f1["1"] == "Edad")
blocks_start <- blocks_second - 1L
blocks_end <- c(
  blocks_start[-1] - 1L,
  nrow(f1)
)

blocks_start; blocks_end
#> [1]  1 26
#> [1] 25 50


# split subsets and combine horizontally (in columns)
stopifnot(length(blocks_start) == length(blocks_end))

f2 <- map_dfc(seq_along(blocks_start),
        \(i) f1[blocks_start[[i]]:blocks_end[[i]], ])
#> New names:
#> • `1` -> `1...1`
#> • `2` -> `2...2`
#> • `3` -> `3...3`
#> • `4` -> `4...4`
#> • `5` -> `5...5`
#> • `6` -> `6...6`
#> • `7` -> `7...7`
#> • `8` -> `8...8`
#> • `9` -> `9...9`
#> • `10` -> `10...10`
#> • `1` -> `1...11`
#> • `2` -> `2...12`
#> • `3` -> `3...13`
#> • `4` -> `4...14`
#> • `5` -> `5...15`
#> • `6` -> `6...16`
#> • `7` -> `7...17`
#> • `8` -> `8...18`
#> • `9` -> `9...19`
#> • `10` -> `10...20`

head(f2)
#> # A tibble: 6 × 20
#>   `1...1` `2...2`      `3...3` `4...4` `5...5`   `6...6` `7...7` `8...8` `9...9`
#>   <chr>   <chr>        <chr>   <chr>   <chr>     <chr>   <chr>   <chr>   <chr>  
#> 1 ...1    ...2         2010    ...4    ...5      2011    ...7    ...8    2012   
#> 2 Edad    <NA>         <NA>    <NA>    <NA>      <NA>    <NA>    <NA>    <NA>   
#> 3 <NA>    Ambos secsos Varones Mujeres Ambos se… Varones Mujeres Ambos … Varones
#> 4 Total   683.513      336.954 346.559 692.379   341.398 350.981 701.252 345.849
#> 5 0-4     64.797       33.381  31.416  64.867    33.437  31.430  65.145  33.588 
#> 6 5-9     68.705       35.125  33.580  67.970    34.802  33.168  67.173  34.450 
#> # ℹ 11 more variables: `10...10` <chr>, `1...11` <chr>, `2...12` <chr>,
#> #   `3...13` <chr>, `4...14` <chr>, `5...15` <chr>, `6...16` <chr>,
#> #   `7...17` <chr>, `8...18` <chr>, `9...19` <chr>, `10...20` <chr>
dim(f2)
#> [1] 25 20


# combine title rows

titles <- f2[1:3,]

titles_year <- titles[1,] |>
  as.character() |>
  (\(x) {x[startsWith(x, "...")] <- NA_character_ ; x})() |>
  enframe() |>
  fill(value, .direction = "down") |>
  pull(value) |>
  replace_na("")

titles_rewritten <- paste0(titles_year, "_", titles[3,])

f3 <- f2[-c(1:4),] |>
  set_names(titles_rewritten)

head(f3)
#> # A tibble: 6 × 20
#>   `_NA` `_Ambos secsos` `2010_Varones` `2010_Mujeres` `2010_Ambos secsos`
#>   <chr> <chr>           <chr>          <chr>          <chr>              
#> 1 0-4   64.797          33.381         31.416         64.867             
#> 2 5-9   68.705          35.125         33.580         67.970             
#> 3 10-14 71.371          36.261         35.110         71.011             
#> 4 15-19 69.674          35.070         34.604         70.823             
#> 5 20-24 56.769          28.189         28.580         58.275             
#> 6 25-29 53.661          26.443         27.218         53.198             
#> # ℹ 15 more variables: `2011_Varones` <chr>, `2011_Mujeres` <chr>,
#> #   `2011_Ambos secsos` <chr>, `2012_Varones` <chr>, `2012_Mujeres` <chr>,
#> #   `2012_NA` <chr>, `2012_Ambos secsos` <chr>, `2013_Varones` <chr>,
#> #   `2013_Mujeres` <chr>, `2013_Ambos secsos` <chr>, `2014_Varones` <chr>,
#> #   `2014_Mujeres` <chr>, `2014_Ambos secsos` <chr>, `2015_Varones` <chr>,
#> #   `2015_Mujeres` <chr>
dim(f3)
#> [1] 21 20

Created on 2025-07-11 with reprex v2.1.1
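
Building on the block idea above, here is a hedged base R sketch that goes all the way to the layout requested in the first post: the age column plus one 'Mujeres' column per year. It assumes that each year label sits over the 'Varones' column of its group of three (as in the printed tibble), so the matching 'Mujeres' column is one position to the right, and that dots are thousands separators. `raw` is a small hand-typed stand-in for f[[1]], and `one_block()` is a hypothetical helper, not part of any package.

```r
# Toy stand-in for f[[1]]: two stacked blocks of three years each.
sexes <- c("Ambos secsos", "Varones", "Mujeres")
rows <- list(
  c("Edad", rep(NA, 9)),
  c(NA, rep(sexes, 3)),
  c("Total", "683.513", "336.954", "346.559", "692.379", "341.398",
    "350.981", "701.252", "345.849", "355.403"),
  c("0-4", "64.797", "33.381", "31.416", "64.867", "33.437",
    "31.430", "65.145", "33.588", "31.557"),
  c("90-94", "872", "269", "603", "964", "306", "658", "1.055", "339", "716"),
  c(NA, NA, "2013", NA, NA, "2014", NA, NA, "2015", NA),
  c("Edad", rep(NA, 9)),
  c(NA, rep(sexes, 3)),
  c("Total", "710.121", "350.301", "359.820", "718.971", "354.747",
    "364.224", "727.780", "359.175", "368.605"),
  c("0-4", "65.579", "33.809", "31.770", "66.140", "34.085",
    "32.055", "66.721", "34.361", "32.360"),
  c("90-94", "1.146", "373", "773", "1.233", "400", "833", "1.323", "428", "895")
)
raw <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(raw) <- c("...1", "...2", "2010", "...4", "...5",
                "2011", "...7", "...8", "2012", "...10")

# Put the column names back as row 1 so every block starts with a year row.
m <- unname(rbind(names(raw), as.matrix(raw)))

# Each block starts one row above an "Edad" row and runs to the next start.
start <- which(m[, 1] == "Edad") - 1L
end <- c(start[-1] - 1L, nrow(m))

# Hypothetical helper: turn one block into Edad + one Mujeres column per year.
one_block <- function(b) {
  year_row <- b[1, ]
  sex_row <- b[3, ]
  body <- b[-(1:3), , drop = FALSE]
  year_cols <- grep("^[0-9]{4}$", year_row)  # positions of the year labels
  muj_cols <- year_cols + 1L                 # assumed position of "Mujeres"
  stopifnot(all(sex_row[muj_cols] == "Mujeres"))
  # Strip thousands separators, then convert to numbers (column-major order).
  vals <- matrix(
    as.numeric(gsub(".", "", body[, muj_cols, drop = FALSE], fixed = TRUE)),
    nrow = nrow(body)
  )
  colnames(vals) <- year_row[year_cols]
  data.frame(Edad = body[, 1], vals, check.names = FALSE,
             stringsAsFactors = FALSE)
}

blocks <- lapply(seq_along(start), function(i) {
  one_block(m[start[i]:end[i], , drop = FALSE])
})

# The blocks must list the same age groups in the same order before binding.
stopifnot(all(vapply(blocks, function(b) identical(b$Edad, blocks[[1]]$Edad),
                     logical(1))))
mujeres <- Reduce(function(a, b) cbind(a, b[-1]), blocks)
mujeres
```

With the real data, the toy `raw` would be replaced by as.data.frame(f[[1]]). The 'Total' row is kept here; it can be dropped afterwards with mujeres[mujeres$Edad != "Total", ].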

juandmaz · July 11, 2025, 5:24pm

Yeah, of course. In row 25, new columns appear, but R imports them incorrectly because it doesn't detect them as actual columns.