You don't need read_lines()
, pdf_text()
already gives you a character vector for each page, and since each page has headers and footers you want to process them one by one and group them at the end.
> ops <- pdf_text(temps)
> length(ops)
[1] 43
> nchar(ops[[1]])
[1] 5489
# the first page has more titles
> substr(ops[[1]], 1, 300)
[1] " Data Date: 24-May-2022\n\n\n\n GILT MARKET\n "
# the other pages start directly with the column headers
> substr(ops[[2]], 1, 300)
[1] "Auction Date Gilt Name Indexation Lag Amount Issued Amount Average Bid to Cover Yield Yield Issue Date\n for Index-linked (£ million Sold to Accepted Ratio a"
For now let's leave the first page aside, you need to write code that can process a single page. For example:
> onepage <- readr::read_fwf(ops[[2]],
+ skip = 8)
Rows: 36 Columns: 14
-- Column specification --------------------------------------------------------
chr (7): X1, X2, X3, X6, X11, X12, X13
dbl (6): X4, X5, X8, X9, X10, X14
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
> onepage
# A tibble: 36 x 14
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
<chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
1 06-O~ 0½% ~ NA NA NA NA 3125 0 96.8 2.21 0.94~ 0.3 07-O~
2 05-O~ 0¼% ~ NA NA NA NA 3000 0 99.2 2.51 0.49~ 0.3 06-O~
3 05-O~ 1 1/~ NA NA NA NA 2288. 0 96.4 2.08 1.35~ 0.4 06-O~
4 28-S~ 1¼% ~ NA NA NA NA 2000 0 98.0 2.05 1.33~ 1.1 29-S~
5 22-S~ 0 1/~ Gilt 2056 3 mont~ 350 0 232. 2.12 -2.2~ N/A 23-S~
6 15-S~ 0¼% ~ NA NA NA NA 2858. 0 95.3 2.52 0.74~ 0.2 16-S~
7 14-S~ 0 3/~ NA NA NA NA 3000 0 99.7 2.67 0.42~ 0.2 15-S~
8 08-S~ 0 1/~ Gilt 2031 3 mont~ 1011. 0 137. 2.33 -3.0~ N/A 09-S~
9 07-S~ 0¼% ~ NA NA NA NA 3250 0 100. 2.81 0.25~ 0.2 08-S~
10 07-S~ 1 5/~ NA NA NA NA 1250 0 129. 2.23 0.89~ 0.1 08-S~
# ... with 26 more rows, and 1 more variable: X14 <dbl>
> tail(onepage)
# A tibble: 6 x 14
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
<chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
1 02-Ju~ 0 7/~ NA NA NA NA 2500 0 89.9 2.4 1.35~ 0.2 03-J~
2 19-Ma~ 0 5/~ NA NA NA NA 3094. 0 91.7 2.53 1.26~ 0.2 20-M~
3 18-Ma~ 0 1/~ NA NA NA NA 4062. 0 99.8 2.6 0.18~ 0.1 19-M~
4 NA NA NA NA NA NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA NA NA NA NA Page~
# ... with 1 more variable: X14 <dbl>
# Oops, that last row should be removed
> onepage |>
+ slice(-n())
# A tibble: 35 x 14
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
<chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
1 06-O~ 0½% ~ NA NA NA NA 3125 0 96.8 2.21 0.94~ 0.3 07-O~
2 05-O~ 0¼% ~ NA NA NA NA 3000 0 99.2 2.51 0.49~ 0.3 06-O~
3 05-O~ 1 1/~ NA NA NA NA 2288. 0 96.4 2.08 1.35~ 0.4 06-O~
4 28-S~ 1¼% ~ NA NA NA NA 2000 0 98.0 2.05 1.33~ 1.1 29-S~
5 22-S~ 0 1/~ Gilt 2056 3 mont~ 350 0 232. 2.12 -2.2~ N/A 23-S~
6 15-S~ 0¼% ~ NA NA NA NA 2858. 0 95.3 2.52 0.74~ 0.2 16-S~
7 14-S~ 0 3/~ NA NA NA NA 3000 0 99.7 2.67 0.42~ 0.2 15-S~
8 08-S~ 0 1/~ Gilt 2031 3 mont~ 1011. 0 137. 2.33 -3.0~ N/A 09-S~
9 07-S~ 0¼% ~ NA NA NA NA 3250 0 100. 2.81 0.25~ 0.2 08-S~
10 07-S~ 1 5/~ NA NA NA NA 1250 0 129. 2.23 0.89~ 0.1 08-S~
# ... with 25 more rows, and 1 more variable: X14 <dbl>
You will need to do a lot more work on this to specify the column names and type etc. When you have a satisfying solution, you can put it in a function and run it on all the pages (except the first one):
process_page <- function(string_one_page){
onepage <- readr::read_fwf(string_one_page,
skip = 8)
stop_for_problems(onepage)
onepage
}
all_other_pages <- map_dfr(ops[-1], process_page)
Here since I'm using map_dfr()
, everything will be combined automatically (if the columns do match). If you prefer, you can also use map()
to get a list, then do some more processing before binding the rows.
Separately, do the same for page 1 (you will need to change skip
), finally put all these separate dataframes together with bind_rows()
.
A difficulty here is that you need to double-check the data a lot to make sure that you don't have a subtle mistake (spoiler alert: you likely will, in such a situation it's impossible to get everything right on the first try). When you think you're finished, also make sure to use hist()
or table()
on each column to detect incorrect values.