Hi, I have a PDF with several tables and I’d like to extract them and work with them in R.
The problem is that the PDF page I need contains many tables (in fact, I had to crop the image to make it clearly visible), and it's difficult for me to extract them in R.
This is the code I’m using, followed by its output:
> f <- tabulapdf::extract_tables("C:/Users/Juan/Desktop/proyecciones_prov_2010_2040.pdf", pages = 105)
New names:
• `` -> `...1`
• `` -> `...2`
• `` -> `...4`
• `` -> `...5`
• `` -> `...7`
• `` -> `...8`
• `` -> `...10`
> print(f[[1]], n = Inf)
# A tibble: 49 × 10
...1 ...2 `2010` ...4 ...5 `2011` ...7 ...8 `2012` ...10
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Edad NA NA NA NA NA NA NA NA NA
2 NA Ambos secsos Varones Mujeres Ambos secsos Varones Mujeres Ambos secsos Varones Mujeres
3 Total 683.513 336.954 346.559 692.379 341.398 350.981 701.252 345.849 355.403
4 0-4 64.797 33.381 31.416 64.867 33.437 31.430 65.145 33.588 31.557
5 5-9 68.705 35.125 33.580 67.970 34.802 33.168 67.173 34.450 32.723
6 10-14 71.371 36.261 35.110 71.011 36.140 34.871 70.637 36.006 34.631
7 15-19 69.674 35.070 34.604 70.823 35.732 35.091 71.326 36.065 35.261
8 20-24 56.769 28.189 28.580 58.275 29.015 29.260 60.356 30.138 30.218
9 25-29 53.661 26.443 27.218 53.198 26.179 27.019 52.857 25.996 26.861
10 30-34 53.401 26.298 27.103 53.951 26.575 27.376 54.059 26.617 27.442
11 35-39 45.319 22.146 23.173 47.304 23.165 24.139 49.228 24.153 25.075
12 40-44 37.162 17.942 19.220 38.408 18.576 19.832 39.881 19.329 20.552
13 45-49 33.457 16.015 17.442 33.916 16.242 17.674 34.431 16.496 17.935
14 50-54 30.572 14.511 16.061 30.997 14.695 16.302 31.444 14.905 16.539
15 55-59 27.305 13.061 14.244 27.827 13.258 14.569 28.286 13.416 14.870
16 60-64 22.371 10.601 11.770 23.126 10.957 12.169 23.868 11.309 12.559
17 65-69 17.234 8.100 9.134 17.855 8.359 9.496 18.519 8.640 9.879
18 70-74 12.994 6.011 6.983 13.363 6.153 7.210 13.773 6.314 7.459
19 75-79 9.320 4.154 5.166 9.601 4.285 5.316 9.872 4.403 5.469
20 80-84 5.613 2.310 3.303 5.847 2.409 3.438 6.096 2.520 3.576
21 85-89 2.661 1.001 1.660 2.812 1.048 1.764 2.964 1.096 1.868
22 90-94 872 269 603 964 306 658 1.055 339 716
23 95-99 216 54 162 215 51 164 227 54 173
24 100 y más 39 12 27 49 14 35 55 15 40
25 NA NA 2013 NA NA 2014 NA NA 2015 NA
26 Edad NA NA NA NA NA NA NA NA NA
27 NA Ambos secsos Varones Mujeres Ambos secsos Varones Mujeres Ambos secsos Varones Mujeres
28 Total 710.121 350.301 359.820 718.971 354.747 364.224 727.780 359.175 368.605
29 0-4 65.579 33.809 31.770 66.140 34.085 32.055 66.721 34.361 32.360
30 5-9 66.390 34.106 32.284 65.671 33.793 31.878 65.118 33.566 31.552
31 10-14 70.221 35.845 34.376 69.701 35.630 34.071 69.077 35.359 33.718
32 15-19 71.347 36.152 35.195 71.110 36.104 35.006 70.768 35.999 34.769
33 20-24 62.697 31.398 31.299 64.922 32.603 32.319 66.709 33.587 33.122
34 25-29 52.760 25.957 26.803 53.074 26.146 26.928 53.930 26.629 27.301
35 30-34 53.852 26.488 27.364 53.459 26.257 27.202 52.989 25.982 27.007
36 35-39 50.977 25.051 25.926 52.416 25.790 26.626 53.452 26.324 27.128
37 40-44 41.547 20.181 21.366 43.359 21.111 22.248 45.290 22.099 23.191
38 45-49 35.051 16.804 18.247 35.841 17.199 18.642 36.839 17.705 19.134
39 50-54 31.904 15.133 16.771 32.367 15.368 16.999 32.827 15.598 17.229
40 55-59 28.708 13.556 15.152 29.118 13.698 15.420 29.535 13.863 15.672
41 60-64 24.580 11.641 12.939 25.237 11.935 13.302 25.828 12.181 13.647
42 65-69 19.211 8.940 10.271 19.927 9.260 10.667 20.654 9.592 11.062
43 70-74 14.222 6.494 7.728 14.722 6.694 8.028 15.269 6.920 8.349
44 75-79 10.153 4.517 5.636 10.445 4.633 5.812 10.764 4.754 6.010
45 80-84 6.348 2.635 3.713 6.604 2.752 3.852 6.848 2.863 3.985
46 85-89 3.121 1.148 1.973 3.281 1.203 2.078 3.455 1.267 2.188
47 90-94 1.146 373 773 1.233 400 833 1.323 428 895
48 95-99 251 60 191 286 73 213 323 85 238
49 100 y más 56 13 43 58 13 45 61 13 48
As you can see, at row 25 a second block of years (2013, 2014, 2015) begins, but R imports its headers as ordinary data rows instead of detecting them as new columns, so the second block ends up stacked underneath the first.
What I need is the first column (the age groups, labelled 'Total' in my output below) followed by one column per year containing only the 'Mujeres' values, not the other two series. This is my ideal output, which I built by hand and would like to reproduce in R:
df %>%
head(15)
# A tibble: 15 × 7
Total `2010` `2011` `2012` `2013` `2014` `2015`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 a 4 31416 31430 31557 31770 32055 32360
2 5 a 9 33580 33168 32723 32284 31878 31552
3 10 a 14 35110 34871 34631 34376 34071 33718
4 15 a 19 34604 35091 35261 35195 35006 34769
5 20 a 24 28580 29260 30218 31299 32319 33122
6 25 a 29 27218 27019 26861 26803 26928 27301
7 30 a 34 27103 27376 27442 27364 27202 27007
8 35 a 39 23173 24139 25075 25926 26626 27128
9 40 a 44 19220 19832 20552 21366 22248 23191
10 45 a 49 17442 17674 17935 18247 18642 19134
11 50 a 54 16061 16302 16539 16771 16999 17229
12 55 a 59 14244 14569 14870 15152 15420 15672
13 60 a 64 11770 12169 12559 12939 13302 13647
14 65 a 69 9134 9496 9879 10271 10667 11062
15 70 a 74 6983 7210 7459 7728 8028 8349
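To make the target more concrete, here is a rough base R sketch of the reshaping I have in mind. It is only a sketch under assumptions: the function name `mujeres_by_year` and the toy data frame `tab` are made up for illustration (`tab` imitates the layout of `f[[1]]` using a few rows copied from the output above), and it assumes every block starts with an "Edad" row, has a sub-header row containing "Mujeres", and keeps its years either in the column names (first block) or in the row just above "Edad" (later blocks). I have not checked it against the other pages of the PDF.

```r
# Toy stand-in for f[[1]]: same 10-column layout, a few rows copied from the output above.
labels <- rep(c("Ambos secsos", "Varones", "Mujeres"), 3)
rows <- list(
  c("Edad", rep(NA, 9)),
  c(NA, labels),
  c("Total", "683.513", "336.954", "346.559", "692.379", "341.398", "350.981", "701.252", "345.849", "355.403"),
  c("0-4", "64.797", "33.381", "31.416", "64.867", "33.437", "31.430", "65.145", "33.588", "31.557"),
  c("5-9", "68.705", "35.125", "33.580", "67.970", "34.802", "33.168", "67.173", "34.450", "32.723"),
  c(NA, NA, "2013", NA, NA, "2014", NA, NA, "2015", NA),
  c("Edad", rep(NA, 9)),
  c(NA, labels),
  c("Total", "710.121", "350.301", "359.820", "718.971", "354.747", "364.224", "727.780", "359.175", "368.605"),
  c("0-4", "65.579", "33.809", "31.770", "66.140", "34.085", "32.055", "66.721", "34.361", "32.360"),
  c("5-9", "66.390", "34.106", "32.284", "65.671", "33.793", "31.878", "65.118", "33.566", "31.552")
)
tab <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(tab) <- c("...1", "...2", "2010", "...4", "...5", "2011", "...7", "...8", "2012", "...10")

# Hypothetical helper: split the stacked blocks and keep only the "Mujeres" columns.
mujeres_by_year <- function(tab) {
  tab <- as.data.frame(tab, stringsAsFactors = FALSE)
  starts <- which(tab[[1]] == "Edad")  # each "Edad" row opens a block

  blocks <- lapply(seq_along(starts), function(i) {
    s <- starts[i]
    # A block ends two rows before the next "Edad" (the row in between holds the years).
    last <- if (i < length(starts)) starts[i + 1] - 2 else nrow(tab)
    # Years: column names for a block starting at row 1, otherwise the row above "Edad".
    year_src <- if (s == 1) names(tab) else unlist(tab[s - 1, ], use.names = FALSE)
    years <- grep("^[0-9]{4}$", year_src, value = TRUE)
    muj <- which(unlist(tab[s + 1, ], use.names = FALSE) == "Mujeres")
    body <- tab[(s + 2):last, , drop = FALSE]

    out <- data.frame(Edad = body[[1]], stringsAsFactors = FALSE)
    for (k in seq_along(muj)) {
      # "31.416" uses "." as a thousands separator, so strip it before converting.
      out[[years[k]]] <- as.numeric(gsub(".", "", body[[muj[k]]], fixed = TRUE))
    }
    out
  })

  # Put the blocks side by side, checking first that the age groups line up.
  res <- blocks[[1]]
  for (b in blocks[-1]) {
    stopifnot(identical(b$Edad, res$Edad))
    res <- cbind(res, b[-1])
  }

  res <- res[res$Edad != "Total", , drop = FALSE]  # drop the overall total row
  res$Edad <- sub("-", " a ", res$Edad, fixed = TRUE)  # "0-4" becomes "0 a 4"
  rownames(res) <- NULL
  res
}

res <- mujeres_by_year(tab)
print(res)
```

On the real data the call would presumably be `mujeres_by_year(f[[1]])`. The first column is named `Edad` here rather than `Total`, which is just a naming choice for the sketch.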
PS: I had to write 'secso' because Posit doesn't let me write the correct word.
EDIT: Sorry, I just realized that the output I posted doesn’t clearly show the issue. I’ve edited it now so my difficulties can be better understood.