Extracting data from a PDF file with uneven whitespace and blank cells

Hi, I am attempting to extract election results data from a PDF document (because some county governments refuse to embrace things like spreadsheets in the year 2024).

The pdf is linked here.

I have already done some of the work to extract the data and get it to a place where I might be able to separate the strings and re-create the table from the PDF.

This is the initial code to get the data in a manipulatable format:

library(pdftools)   # pdf_text()
library(tidyverse)  # stringr, tibble, dplyr, tidyr

download.file("https://www.bergencountyclerk.gov/_Content/pdf/ElectionResult/District%20Canvass%206-14-24.pdf", "ak4nj_Bergen_24USSEN_Primary_PrecinctResults.pdf", mode = "wb")
Bergen_pdf <- pdf_text("ak4nj_Bergen_24USSEN_Primary_PrecinctResults.pdf")

# Row labels that identify a data line in the table
vote_types <- "early voting|election day|mail-in|provisional|total"

temp <- Bergen_pdf |> 
  str_split("\n") |>     # one character vector of lines per page
  head() |>              # first six pages only
  unlist() |> 
  str_to_lower() |> 
  str_trim() |> 
  as_tibble() |>
  rename(x = value) |> 
  filter(str_detect(x, vote_types))

From here, I've been trying to use separate_wider_regex() to separate the strings and re-create the table, largely using whitespace as the separator. The code looks like this:

temp |> 
  separate_wider_regex(x,
                       patterns = c(
                         precinct = "\\w+\\s\\w*\\s?\\d+",
                         "\\s+",
                         vote_type = vote_types,
                         "\\s+",
                         regvoters = "\\d+",
                         "\\s+",
                         total_votes = "\\d+",
                         "\\s+",
                         turnout_percent = "\\d+\\.\\d+",
                         "%\\s+",
                         tb_potus = "\\d+",
                         "\\s+",
                         jb_potus = "\\d+",
                         "\\s+",
                         uc_potus = "\\d+",
                         "\\s+",
                         ak_votes = "\\d+",
                         "\\s+",
                         lh_votes = "\\d+",
                         "\\s+",
                         pcm_votes = "\\d+",
                         "\\s+"
                       ),
                       too_few = "debug")

This code does produce the structure I want, but it has a big problem. If you look down the "DEM - TERRISA BUKOVINAC" column in the PDF, some rows have a number, but in districts where the candidate received 0 votes the cell is just whitespace. My pattern treats that whitespace as a separator to skip, so the vote total from the Biden column gets pulled forward into the Bukovinac column, because those are the first digits after the whitespace.

I have tried a bunch of different combinations of regular expressions to fix this, but I can't figure out how to coerce the blank cells in the Bukovinac column to 0 or NA while leaving all of the data after that column in its right place.

Thanks in advance for any help!

Perhaps try read_fwf - read the data in as fixed width?
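
To make the suggestion concrete, here is a minimal sketch of what reading by fixed widths could look like. The rows, the column widths (15, 15, 6, 6) and the temporary file are all made up for illustration and are not taken from the Bergen PDF. It also assumes the lines still have their original leading whitespace, since trimming them first can shift the columns out of alignment.

```r
library(readr)
library(dplyr)
library(tidyr)

# Made-up rows (not copied from the PDF), built with sprintf() so every
# column has a known width: 15, 15, 6 and 6 characters.
lines <- sprintf(
  "%-15s%-15s%6s%6s",
  c("allendale 1", "allendale 1", "allendale 2"),
  c("election day", "mail-in", "election day"),
  c("12", "", "3"),      # blank cell where the candidate got no votes
  c("140", "35", "98")
)

# Write the lines to a temporary file so read_fwf() can read them by path
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)

parsed <- read_fwf(
  tmp,
  col_positions = fwf_widths(
    c(15, 15, 6, 6),
    c("precinct", "vote_type", "tb_potus", "jb_potus")
  ),
  col_types = "ccii"
) |>
  # A blank fixed-width field is read as NA; turn it into 0
  mutate(tb_potus = replace_na(tb_potus, 0L))

parsed
```

Because each column is cut by character position rather than by "next run of digits", a blank cell stays in its own column and the values to its right do not move.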

I had to find the widths for each column by trial and error, but it did eventually work. Thanks!
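
For anyone else working out the widths by hand, one way to shorten the trial and error is to print a character ruler above a few lines and then check a candidate slice with str_sub(). This is only a sketch on made-up rows. The positions 31 to 36 belong to this toy layout, not to the real PDF.

```r
library(stringr)

# Made-up rows with known widths of 15, 15, 6 and 6 characters
lines <- sprintf(
  "%-15s%-15s%6s%6s",
  c("allendale 1", "allendale 1", "allendale 2"),
  c("election day", "mail-in", "election day"),
  c("12", "", "3"),
  c("140", "35", "98")
)

# Print a repeating 1234567890 ruler above the lines to count positions
width <- max(str_length(lines))
ruler <- str_sub(strrep("1234567890", ceiling(width / 10)), 1, width)
cat(ruler, lines, sep = "\n")

# Check a candidate column: characters 31 to 36 should hold the third field
candidate <- str_trim(str_sub(lines, 31, 36))
candidate
```

If the slice cuts a number in half or picks up part of a neighbouring column, adjust the start and end and try again before committing the widths to fwf_widths().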

Awesome! My distant past programming with 80 character punch cards comes through.

