Hi, I am attempting to extract election results data from a PDF document (because some county governments refuse to embrace things like spreadsheets in the year 2024).
The PDF is the one downloaded by the URL in the code below.
I have already done some of the work to extract the data and get it to a place where I might be able to separate the strings and re-create the table from the PDF.
This is the initial code to get the data in a manipulatable format:
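library(pdftools)   # pdf_text()
library(tidyverse)  # stringr, tibble, dplyr, tidyr

# Download the canvass PDF and read it in as one string per page
download.file(
  "https://www.bergencountyclerk.gov/_Content/pdf/ElectionResult/District%20Canvass%206-14-24.pdf",
  "ak4nj_Bergen_24USSEN_Primary_PrecinctResults.pdf",
  mode = "wb"
)
Bergen_pdf <- pdf_text("ak4nj_Bergen_24USSEN_Primary_PrecinctResults.pdf")

# Row labels that mark a line of vote counts
vote_types <- "early voting|election day|mail-in|provisional|total"

# Split the first six pages into lines, normalise case and padding,
# and keep only the lines that contain one of the vote types
temp <- Bergen_pdf |>
  str_split("\n") |>
  head() |>
  unlist() |>
  str_to_lower() |>
  str_trim() |>
  as_tibble() |>
  rename(x = value) |>
  filter(str_detect(x, vote_types))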
From here, I've been trying to use separate_wider_regex() to separate the strings and re-create the table, largely using whitespace as the separator. The code looks like this:
temp |>
  separate_wider_regex(
    x,
    patterns = c(
      precinct = "\\w+\\s\\w*\\s?\\d+",
      "\\s+",
      vote_type = vote_types,
      "\\s+",
      regvoters = "\\d+",
      "\\s+",
      total_votes = "\\d+",
      "\\s+",
      turnout_percent = "\\d+\\.\\d+",
      "%\\s+",
      tb_potus = "\\d+",
      "\\s+",
      jb_potus = "\\d+",
      "\\s+",
      uc_potus = "\\d+",
      "\\s+",
      ak_votes = "\\d+",
      "\\s+",
      lh_votes = "\\d+",
      "\\s+",
      pcm_votes = "\\d+",
      "\\s+"
    ),
    too_few = "debug"
  )
This code does end up creating the structure I want, but it creates a big issue. If you look down the "DEM - TERRISA BUKOVINAC" column in the PDF, some rows contain a number, but in districts where the candidate received 0 votes the cell is just whitespace. That is a problem because my pattern treats that whitespace as a separator to skip, so the digits from the Biden column get pulled forward into the Bukovinac column, since they are the first digits after the whitespace.
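To make the problem concrete, here is a minimal sketch with two made-up lines. They are not taken from the PDF, and the column names cand_a and cand_b are hypothetical. I am assuming a blank cell comes through from pdf_text() as a run of spaces, which is what it looks like in my data.

```r
library(tidyr)
library(tibble)

# Two invented rows: in the second one the first candidate's cell is blank,
# so the values really belong to the second and third columns.
toy <- tibble(x = c(
  "total   12   34   56",
  "total        34   56"
))

shifted <- toy |>
  separate_wider_regex(
    x,
    patterns = c(
      vote_type = "total",
      "\\s+",            # swallows the blank cell along with the separator
      cand_a = "\\d+",
      "\\s+",
      cand_b = "\\d+",
      ".*"               # ignore whatever is left on the line
    )
  )

shifted
```

In the second row cand_a comes out as 34 and cand_b as 56, when what I want is an empty (0 or NA) cand_a and 34 in cand_b.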
I have tried a bunch of different combinations of regular expressions to fix this, but I can't figure out how to coerce the blank cells in the Bukovinac column to 0 or NA while leaving all of the data after that column in its right place.
Thanks in advance for any help!