I've got some voter data that I need to marry to some shapefile data. The voter data does not have the unique IDs in the shapefile data but I can create it from a couple of variables in the voter data, then join them. 'Can' is theoretical here as I've spent over a day on this and haven't solved it.
Here's a mini reprex:
df <- data.frame(precinct = c("TOWN OF BLACK WOLF WARDS 1-3", "TOWN OF WOLF RIVER WARDS 1-2A", "TOWN OF WOLF RIVER WARDS 2B-2C", "CITY OF TOMAH WARDS 1-5A,6", "CITY OF TOMAH WARDS 5B,24"),
jurisdiction_fips = c(5513908000, 5513988475, 5513988475, 5508180075, 5508180075))
I need to take the alphanumeric strings at the end of the 'precinct' name and append them to the 'jurisdiction_fips' padded to 4-digits with zeroes.
So the output should look like (only using the last two lines of the df here for brevity):
My cleverest attempt was pull off the alphanumeric stings via regex, change the dashes to colons, then use eval(parse(text = 'characters stuff here')) but I can't separate/mutate that into any additional columns. Help, please.
There's only the first element covered by the eval(). strsplit(x,"-") does the second element and strsplit(x,",") will get the third and fifth list element (with appropriate adjustment of the delimiter).
The remaining case is trickier. "1-5A,6" has to be split on , and on - separately. The second element can go on to be padded, leaving 1,5A. The second element can go on to be padded after copying out the numeric portion 5, converting it to numeric and subtracting 1, leaving 4 and convert that back to character. paste0("1",":","4") and proceed as you did before with eval. Then pad everything out to the same width as you've done and paste0() it to the GEOID as you've done.
Then circle back to write the regex to classify the input into cases to apply the logic above.
The problem is a great illustration of divide-and-conquer of splitting everything down to its minimal representation and applying successive small operations.
Thanks for the reply. Still having problems. I guess the first relates to your opening sentence: "It's not easy work when everything looks like a special case." I agree. The big-picture problem for me is that this is just a small sample of my data. The actual dataset I'm using has over 11K rows and probably a hundred 'special cases'
and easily get to something where I have all of the 'suffixes' pulled off into their own column but eval(parse(text= 'blah')) only works on the first row item. So I'm still stuck where a brute-force line-by-line approach is a solution, but not one that's practical given the dataset.
So looking for something like this:
#This Df...
sampleDf = data.frame(Freq = c("Town A","Town B", "City A","Town C"),
seq = c("A","B A B B B","A B B A","A B A A B A B"))
# Does what I'm after with this
sampleDf <- sampleDf %>%
mutate(n = row_number()) %>%
separate_rows(seq, sep = ' ') %>%
group_by(n, Freq) %>%
mutate(n2 = row_number()) %>%
spread(n2, seq)
This works on that example because all of the elements in the 'seq' column have a length of 1. I can't seem to get anything analogous to work in my example. Everything I do to split the geoAppendix strings leaves me elements of length 3, 4, etc.
So, yes, it's feasible to hardwire three distinct patterns, but for several hundred not. It might be worthwhile to try splitting off the column on the last space and classifying the patterns to see if they reflect some discernible set of encoding schemes. For example 5A and 5a could be treated similarily with toupper(). Of course if you had an uncompressed list of precincts you'd be using that.
If there are still too darn many, the next resort would be to machine learning, which we can try to puzzle out if the attempt to reduce the number of variations to a management number by harmonizing simple differences in punctuation and capitalization, etc.
I'm more pessimistic after taking a sample of the MIT Election Database results for 2020. Two-thirds as straightforward but the last category looks to be a killer
# Load required libraries
library(stringr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# Read the CSV file
precincts <- read.csv("https://gist.githubusercontent.com/technocrat/6ce362aef49ff0a99b188faadd68d28b/raw/3fd015c370676f2aa0459288cb77df0bea3b8fff/precincts.csv", stringsAsFactors = FALSE)
# Function to classify precincts
classify_precinct <- function(precinct) {
has_Numeral <- str_detect(precinct, "\\d")
has_letter <- str_detect(precinct, "[a-zA-Z]")
has_punctuation <- str_detect(precinct, "[ ~[:punct:]]")
if (has_Numeral & has_letter & has_punctuation) {
return("Numeral, Letter, Punctuation")
} else if (has_Numeral & has_letter) {
return("Numeral, Letter")
} else if (has_Numeral & has_punctuation) {
return("Numeral, Punctuation")
} else if (has_letter & has_punctuation) {
return("Letter, Punctuation")
} else if (has_Numeral) {
return("Numeral")
} else if (has_letter) {
return("Letter")
} else if (has_punctuation) {
return("Punctuation")
} else {
return("Other")
}
}
# Apply the function to classify precincts
precincts$Category <- sapply(precincts$precinct, classify_precinct)
precincts |> group_by(Category) |> count()
#> # A tibble: 6 × 2
#> # Groups: Category [6]
#> Category n
#> <chr> <int>
#> 1 Letter 21
#> 2 Letter, Punctuation 27
#> 3 Numeral 74
#> 4 Numeral, Letter 18
#> 5 Numeral, Letter, Punctuation 171
#> 6 Numeral, Punctuation 173
precincts |> filter(Category == 'Numeral') |>
select(precinct) |> head()
#> precinct
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
#> 6 7
precincts |> filter(Category == 'Letter') |>
select(precinct) |> head()
#> precinct
#> 1 AITKIN
#> 2 BOOKMAN
#> 3 BOONDOX
#> 4 BOSWELL
#> 5 COURTHOUSE
#> 6 FELSENTHAL
precincts |> filter(Category == 'Numeral, Letter') |>
select(precinct) |> head()
#> precinct
#> 1 4MSIC
#> 2 AV51
#> 3 DRP031
#> 4 F183
#> 5 FP04
#> 6 G03
precincts |> filter(Category == 'Numeral, Punctuation') |>
select(precinct) |> head()
#> precinct
#> 1 45.1
#> 2 8200031.0
#> 3 8700032.0
#> 4 00 15
#> 5 00 18
#> 6 00 192
precincts |> filter(Category == 'Numeral, Letter, Punctuation') |>
select(precinct)
#> precinct
#> 1 (B21) CITY OF BETTENDORF
#> 2 #10 PORTER PARK
#> 3 00 11A
#> 4 0018 CLARKSTOWN 19 (WARD 4)
#> 5 01 101 B
#> 6 0237 FOOTHILLS
#> 7 0250 COLONIE ED 43
#> 8 0313 NEW SCOTLAND ED 8
#> 9 05 01B
#> 10 05-1 LEOMA
#> 11 0526 RANCHO MIRAGE
#> 12 06013DANV104_002011007016
#> 13 06013PINL105_002005009015
#> 14 06013RICH150_002011009015
#> 15 060371500027E_003032022048
#> 16 060371850056C_003040032058
#> 17 060372000053A_003040024051
#> 18 060373750009B_001027025041
#> 19 060374400024B_003027025041
#> 20 060376050001A_001029018039
#> 21 060377750040A_003028026050
#> 22 060379001440A_003033026050
#> 23 060379002686A_003029018046
#> 24 060379002890A_003037030054
#> 25 060379990331A_003032022048
#> 26 060379990859A_003043030064
#> 27 06071ONT1406_001035020052
#> 28 06071UNI0279_001008020047
#> 29 1, WARD 2
#> 30 10, WARD 2
#> 31 102 - TIP TOP
#> 32 107_107 LAUREL MANOR REC
#> 33 110 DECLO
#> 34 13_COUNTY FLOATING
#> 35 1B LATHAM
#> 36 2_PRE 2 HAVANA PUBLIC LIBRARY
#> 37 208_208 - CLUB CONTINENTAL
#> 38 208_PCT 208
#> 39 216 CP2
#> 40 24-AVCB
#> 41 3, WARD 1
#> 42 4 NORTH ABERDEEN
#> 43 556_PRECINCT 556.0
#> 44 6_LEE CITY HALL
#> 45 74_PRECINCT 74
#> 46 779_PRECINCT 779.0
#> 47 8-D VESTA
#> 48 857_PRECINCT 857.0
#> 49 9_9 - PORT CHARLOTTE BEACH PARK
#> 50 9-1E
#> 51 ADAMS 3
#> 52 ANDOVER_WDR-3-1
#> 53 BLOOMINGDALE 055
#> 54 BROOKHAVEN 249
#> 55 BROOKHAVEN 256
#> 56 BROOKHAVEN 267
#> 57 BURTON 1C
#> 58 CANTON 9
#> 59 CASEYVILLE 11
#> 60 CAUSEWAY BRANCH 1
#> 61 CEDAR RAPIDS 23
#> 62 CITY OF APPLETON WARD 13
#> 63 CITY OF GREEN BAY WARD 10
#> 64 CITY OF JANESVILLE WARD 3
#> 65 CITY OF JANESVILLE WARD 9
#> 66 CITY OF MILWAUKEE WARD 16
#> 67 CITY OF MILWAUKEE WARD 243
#> 68 CITY OF NEW ROCHELLE - 4
#> 69 CITY OF WAUSAU WARD 32
#> 70 CLAY 08
#> 71 CLIFTY - 3700
#> 72 COLS 83-E
#> 73 CONC 001
#> 74 COVENTRY 0605
#> 75 DEKALB 03
#> 76 DIST. 4, CARPENTER PRECINCT
#> 77 DISTRICT 11-STRONG SCHOOL
#> 78 DULUTH P-32
#> 79 EAGLE MOUNTAIN 09
#> 80 EDINA P-12
#> 81 EVERETT 3
#> 82 FAIR OAKS 04
#> 83 FARMINGTON 08:UBC-
#> 84 FRANKLIN GROVE 2
#> 85 GREECE 3
#> 86 GREECE 68
#> 87 GURDON WARD 5
#> 88 HAMB 021
#> 89 HARLEM 2
#> 90 HARRISON - 4450
#> 91 HOISINGTON WARD 3
#> 92 ISLIP 119
#> 93 JAMAICA 1
#> 94 KAYSVILLE 22:I-C-
#> 95 LANC 010
#> 96 LAYTON 38:I-N-
#> 97 LEG. DIST. 26 12
#> 98 LEHI 23
#> 99 LERAY 7
#> 100 LORRAINE 1
#> 101 MANTENO 2
#> 102 MARLBORO_WDH-5
#> 103 MASON CITY W-4 P-1
#> 104 MIDDLETOWN 1902
#> 105 MINNEAPOLIS W-3 P-03
#> 106 MT. PLEASANT 20
#> 107 MT. STERLING 1
#> 108 NAN_MID - 014
#> 109 NOBLE 09
#> 110 NORTH LOGAN 06
#> 111 NORTH SMITHFIELD 2502
#> 112 OMAK #10
#> 113 ONEONTA CITY WARD 6
#> 114 PALM TREE - D3
#> 115 PATOKA 11
#> 116 PATTON 7
#> 117 PCT 260
#> 118 PCT 423
#> 119 PCT 593
#> 120 PER 044
#> 121 PERINTON 30
#> 122 PERU 7
#> 123 PLATTEKILL 5
#> 124 POMFRET 8
#> 125 PRECINCT 0005
#> 126 PRECINCT 0031
#> 127 PRECINCT 12
#> 128 PRECINCT 25
#> 129 PRECINCT 27
#> 130 PRECINCT 317
#> 131 PRECINCT 40
#> 132 PRECINCT 49
#> 133 PRECINCT 6
#> 134 PRECINCT 93
#> 135 PRECINCT NO. 12
#> 136 PRECINCT TOLEDO 21H
#> 137 PROVIDENCE 2817
#> 138 RAY 1
#> 139 RENO-VERDI 1040
#> 140 RIVER HEIGHTS 02
#> 141 RIVERHEAD 6
#> 142 ROCHESTER W6 P5
#> 143 SANDY CREEK 3
#> 144 ST. PAUL W-1 P-04
#> 145 STAMPS WARD 1, PCT 2 (OUT)
#> 146 SUBURBAN 2
#> 147 TOWN OF CATSKILL 14 LD 1
#> 148 TOWN OF DARLINGTON WARDS 1-3
#> 149 TOWN OF DURHAM 2 LD 9
#> 150 TOWN OF FIFIELD WARD 2
#> 151 TOWN OF KNOX WARD 1
#> 152 TOWN OF LINCOLN 2
#> 153 TOWN OF POUGHKEEPSIE W 6 ED 2
#> 154 TOWN OF ROCKBRIDGE WARDS 1-3
#> 155 TOWN OF RYE - 16
#> 156 TOWN OF SHARON WARDS 1-3
#> 157 TOWN OF WAPPINGER W 1 ED 4
#> 158 TROY 19
#> 159 URBANDALE 10
#> 160 VERGENNES_ADD-3
#> 161 VILLAGE OF GENOA CITY WARD 5
#> 162 VILLAGE OF MONTFORT WARD 1
#> 163 W2 - P626
#> 164 WARD 07 PRECINCT 25
#> 165 WARD 09 PRECINCT 01
#> 166 WARD 11
#> 167 WARD 11 PRECINCT 12
#> 168 WARD 4 PRECINCT 02
#> 169 WEST BOUNTIFUL 03:I-S-
#> 170 WILLIAMSON 03
#> 171 YORK 018
Sorry for the delayed reply. I had some work projects get moved up with some holiday travel in the middle. Hope to get back into this data and look at your reply by next week. I think the MIT election data and the Harvard files are the same datasets (and I maybe pulled from the latter; will have to check to make sure).
Thanks for the time and effort and let me see how much further along I get before replying in full.