Extracting Data from PDF file with uneven whitespaces and data

bricey16 · July 14, 2024, 10:01pm

Hi, I am attempting to extract election results data from a pdf document (because some county governments refuse to embrace things like spreadsheets in the year 2024).

The PDF is the one downloaded by the code below.

I have already done some of the work to extract the data and get it to a place where I might be able to separate the strings and re-create the table from the PDF.

This is the initial code to get the data in a manipulatable format:

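library(pdftools)   # pdf_text()
library(tidyverse)  # stringr, tibble, dplyr, tidyr

# Download the PDF in binary mode so it is not corrupted on Windows
download.file("https://www.bergencountyclerk.gov/_Content/pdf/ElectionResult/District%20Canvass%206-14-24.pdf", "ak4nj_Bergen_24USSEN_Primary_PrecinctResults.pdf", mode = "wb")

# pdf_text() returns one string per page
Bergen_pdf <- pdf_text("ak4nj_Bergen_24USSEN_Primary_PrecinctResults.pdf")

# Only lines that mention one of these vote types are data rows
vote_types <- "early voting|election day|mail-in|provisional|total"

temp <- Bergen_pdf |>
  str_split('\n') |>   # one character vector of lines per page
  head() |>            # keeps only the first six pages; remove to process every page
  unlist() |>
  str_to_lower() |>
  str_trim() |>
  as_tibble() |>
  mutate(x = value) |>
  select(x) |>
  filter(str_detect(x, vote_types))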

From here, I've been trying to use separate_wider_regex() to separate the strings and re-create the table, largely using whitespace as the separator. The code looks like this:

temp |>
  separate_wider_regex(x,
                       patterns = c(
                         precinct = "\\w+\\s\\w*\\s?\\d+",
                         "\\s+",
                         vote_type = vote_types,
                         "\\s+",
                         regvoters = "\\d+",
                         "\\s+",
                         total_votes = "\\d+",
                         "\\s+",
                         turnout_percent = "\\d+\\.\\d+",
                         "%\\s+",
                         tb_potus = "\\d+",
                         "\\s+",
                         jb_potus = "\\d+",
                         "\\s+",
                         uc_potus = "\\d+",
                         "\\s+",
                         ak_votes = "\\d+",
                         "\\s+",
                         lh_votes = "\\d+",
                         "\\s+",
                         pcm_votes = "\\d+",
                         "\\s+"
                       ),
                       too_few = "debug")

This code does create the structure I want, but it has a big problem. If you look down the "DEM - TERRISA BUKOVINAC" column in the PDF, you'll notice that some rows have a number, but in districts where the candidate received 0 votes there is just whitespace. My code treats that whitespace as a separator to skip, so it pulls the vote total from the Biden column forward into the Bukovinac column, because those are the first digits after the whitespace.

I have tried a bunch of different combinations of regular expressions to fix this, but I can't figure out how to coerce the whitespace in the Bukovinac column to 0 or NA while leaving all of the data after that column in its right place.

Thanks in advance for any help!

Ajackson · July 15, 2024, 1:14pm

Perhaps try read_fwf() to read the data in as fixed width?
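
A minimal sketch of the idea, assuming the readr package is installed. The rows, column names, and widths (16, 14, 6, 6) are made up for illustration and are not taken from the PDF; the real widths have to be worked out from the actual text. Because each column is read by character position rather than by "the next run of digits", a blank field stays in its own column and comes through as NA, which can then be set to 0.

```r
library(readr)

# Hypothetical fixed-width rows: precinct (16 chars), vote type (14 chars),
# then two right-aligned vote columns (6 chars each). The second row has a
# blank first vote column, like the blank cells in the PDF.
lines <- sprintf(
  "%-16s%-14s%6s%6s",
  c("allendale 1", "allendale 1", "allendale 2"),
  c("early voting", "election day", "mail-in"),
  c("12", "", "3"),
  c("45", "80", "27")
)

# read_fwf() slices every line at the same character positions.
out <- read_fwf(
  I(paste(lines, collapse = "\n")),   # I() marks the string as literal data
  col_positions = fwf_widths(
    c(16, 14, 6, 6),
    c("precinct", "vote_type", "tb_potus", "jb_potus")
  ),
  col_types = "ccii"                  # two character, two integer columns
)

# Blank fields are read as NA; turn them into 0 votes.
out$tb_potus[is.na(out$tb_potus)] <- 0L

print(out)
```

With the real data, the same call could be pointed at the lines from pdf_text(), provided the leading whitespace has not been trimmed in a way that shifts the columns.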

bricey16 · July 16, 2024, 5:52am

I had to find the widths for each column by trial and error, but it did eventually work. Thanks!
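
One possible way to cut down on the trial and error, sketched in base R on made-up rows (not the actual PDF text): pad every line to the same length, then look for character positions that are blank in every row. Runs of such positions are the gutters between columns, and the positions in between give candidate start and end values. This is only a heuristic: a space that sits at the same position in every row (for example inside a precinct name) would look like a gutter too, so the result still needs a visual check.

```r
# Hypothetical fixed-width rows; the blank cell in row 2 mimics the PDF.
lines <- sprintf(
  "%-12s%-14s%6s%6s",
  c("allendale", "alpine", "bergenfield"),
  c("early voting", "election day", "mail-in"),
  c("12", "", "3"),
  c("45", "80", "27")
)

# Pad lines on the right so they all have the same number of characters.
padded <- formatC(lines, width = max(nchar(lines)), flag = "-")

# One row per line, one column per character position.
chars <- do.call(rbind, strsplit(padded, ""))

# A position is "used" if at least one row has a non-space character there.
used <- colSums(chars != " ") > 0

# A column starts where a used position follows an unused one (or the line start),
# and ends where a used position is followed by an unused one (or the line end).
starts <- which(used & !c(FALSE, head(used, -1)))
ends   <- which(used & !c(tail(used, -1), FALSE))

print(data.frame(start = starts, end = ends))
```

The resulting start and end positions could then be passed to readr::fwf_positions() instead of guessing widths by hand.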

Ajackson · July 16, 2024, 7:55pm

Awesome! My distant past programming with 80 character punch cards comes through.
