URL pdf to dataframe

Hi there,

I'm trying to convert a pdf into a dataframe, however because the column titles are being repeated on each page (and there's a note on the final page), I'm finding it difficult to think of an appropriate way of putting it into a dataframe while simultaneously dealing with any dynamic issues.

I've done read it into R as an object in the following method:

library(pdftools)
library(dplyr)
library(tidyverse)

temps <- tempfile(fileext = ".pdf")

download.file("https://www.dmo.gov.uk/dmo_static_reports/Gilt%20Operations.pdf", destfile = temps, mode="wb")

And I'm thinking I would do something like:

ops <- pdf_text(temps) %>%
  readr::read_lines()

Then drop the unnecessary lines explicitly, but I don't think that'll work in the long term.

Grateful for any advice on how to complete this task.

You don't need read_lines(), pdf_text() already gives you a character vector for each page, and since each page has headers and footers you want to process them one by one and group them at the end.

> ops <- pdf_text(temps)
> length(ops)
[1] 43
> nchar(ops[[1]])
[1] 5489
# the first page has more titles
> substr(ops[[1]], 1, 300)
[1] "                                                                                                                                                          Data Date: 24-May-2022\n\n\n\n                                                                      GILT MARKET\n                                      "
# the other pages start directly with the column headers
> substr(ops[[2]], 1, 300)
[1] "Auction Date    Gilt Name                                Indexation Lag Amount Issued    Amount       Average      Bid to Cover    Yield          Yield     Issue Date\n                                                        for Index-linked  (£ million     Sold to      Accepted        Ratio         a"

For now let's leave the first page aside, you need to write code that can process a single page. For example:

> onepage <- readr::read_fwf(ops[[2]],
+                            skip = 8)
Rows: 36 Columns: 14                                                                                               
-- Column specification --------------------------------------------------------

chr (7): X1, X2, X3, X6, X11, X12, X13
dbl (6): X4, X5, X8, X9, X10, X14

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
> onepage
# A tibble: 36 x 14
   X1    X2    X3       X4    X5 X6       X7    X8    X9   X10 X11   X12   X13  
   <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
 1 06-O~ 0½% ~ NA       NA    NA NA    3125      0  96.8  2.21 0.94~ 0.3   07-O~
 2 05-O~ 0¼% ~ NA       NA    NA NA    3000      0  99.2  2.51 0.49~ 0.3   06-O~
 3 05-O~ 1 1/~ NA       NA    NA NA    2288.     0  96.4  2.08 1.35~ 0.4   06-O~
 4 28-S~ 1¼% ~ NA       NA    NA NA    2000      0  98.0  2.05 1.33~ 1.1   29-S~
 5 22-S~ 0 1/~ Gilt   2056     3 mont~  350      0 232.   2.12 -2.2~ N/A   23-S~
 6 15-S~ 0¼% ~ NA       NA    NA NA    2858.     0  95.3  2.52 0.74~ 0.2   16-S~
 7 14-S~ 0 3/~ NA       NA    NA NA    3000      0  99.7  2.67 0.42~ 0.2   15-S~
 8 08-S~ 0 1/~ Gilt   2031     3 mont~ 1011.     0 137.   2.33 -3.0~ N/A   09-S~
 9 07-S~ 0¼% ~ NA       NA    NA NA    3250      0 100.   2.81 0.25~ 0.2   08-S~
10 07-S~ 1 5/~ NA       NA    NA NA    1250      0 129.   2.23 0.89~ 0.1   08-S~
# ... with 26 more rows, and 1 more variable: X14 <dbl>

> tail(onepage)
# A tibble: 6 x 14
  X1     X2    X3       X4    X5 X6       X7    X8    X9   X10 X11   X12   X13  
  <chr>  <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
1 02-Ju~ 0 7/~ NA       NA    NA NA    2500      0  89.9  2.4  1.35~ 0.2   03-J~
2 19-Ma~ 0 5/~ NA       NA    NA NA    3094.     0  91.7  2.53 1.26~ 0.2   20-M~
3 18-Ma~ 0 1/~ NA       NA    NA NA    4062.     0  99.8  2.6  0.18~ 0.1   19-M~
4 NA     NA    NA       NA    NA NA      NA     NA  NA   NA    NA    NA    NA   
5 NA     NA    NA       NA    NA NA      NA     NA  NA   NA    NA    NA    NA   
6 NA     NA    NA       NA    NA NA      NA     NA  NA   NA    NA    NA    Page~
# ... with 1 more variable: X14 <dbl>
# Oops, that last row should be removed
> onepage |>
+   slice(-n())
# A tibble: 35 x 14
   X1    X2    X3       X4    X5 X6       X7    X8    X9   X10 X11   X12   X13  
   <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
 1 06-O~ 0½% ~ NA       NA    NA NA    3125      0  96.8  2.21 0.94~ 0.3   07-O~
 2 05-O~ 0¼% ~ NA       NA    NA NA    3000      0  99.2  2.51 0.49~ 0.3   06-O~
 3 05-O~ 1 1/~ NA       NA    NA NA    2288.     0  96.4  2.08 1.35~ 0.4   06-O~
 4 28-S~ 1¼% ~ NA       NA    NA NA    2000      0  98.0  2.05 1.33~ 1.1   29-S~
 5 22-S~ 0 1/~ Gilt   2056     3 mont~  350      0 232.   2.12 -2.2~ N/A   23-S~
 6 15-S~ 0¼% ~ NA       NA    NA NA    2858.     0  95.3  2.52 0.74~ 0.2   16-S~
 7 14-S~ 0 3/~ NA       NA    NA NA    3000      0  99.7  2.67 0.42~ 0.2   15-S~
 8 08-S~ 0 1/~ Gilt   2031     3 mont~ 1011.     0 137.   2.33 -3.0~ N/A   09-S~
 9 07-S~ 0¼% ~ NA       NA    NA NA    3250      0 100.   2.81 0.25~ 0.2   08-S~
10 07-S~ 1 5/~ NA       NA    NA NA    1250      0 129.   2.23 0.89~ 0.1   08-S~
# ... with 25 more rows, and 1 more variable: X14 <dbl>

You will need to do a lot more work on this to specify the column names and type etc. When you have a satisfying solution, you can put it in a function and run it on all the pages (except the first one):

process_page <- function(string_one_page){
  onepage <- readr::read_fwf(string_one_page,
                             skip = 8)
  
  stop_for_problems(onepage)
  onepage
}

all_other_pages <- map_dfr(ops[-1], process_page)

Here since I'm using map_dfr(), everything will be combined automatically (if the columns do match). If you prefer, you can also use map() to get a list, then do some more processing before binding the rows.

Separately, do the same for page 1 (you will need to change skip), finally put all these separate dataframes together with bind_rows().

A difficulty here is that you need to double-check the data a lot to make sure that you don't have a subtle mistake (spoiler alert: you likely will, in such a situation it's impossible to get everything right on the first try). When you think you're finished, also make sure to use hist() or table() on each column to detect incorrect values.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.