Append elements of an alphanumeric string to generate new unique IDs

I've got some voter data that I need to marry to some shapefile data. The voter data does not have the unique IDs that are in the shapefile data, but I can create them from a couple of variables in the voter data and then join the two. 'Can' is theoretical here, as I've spent over a day on this and haven't solved it.

Here's a mini reprex:

df <- data.frame(precinct = c("TOWN OF BLACK WOLF WARDS 1-3", "TOWN OF WOLF RIVER WARDS 1-2A", "TOWN OF WOLF RIVER WARDS 2B-2C", "CITY OF TOMAH WARDS 1-5A,6", "CITY OF TOMAH WARDS 5B,24"),
                 jurisdiction_fips = c(5513908000, 5513988475, 5513988475, 5508180075, 5508180075))

I need to take the alphanumeric strings at the end of the 'precinct' name and append them to the 'jurisdiction_fips', padded to 4 digits with zeroes.

So the output should look like (only using the last two lines of the df here for brevity):

df_out <- data.frame(precinct = rep("CITY OF TOMAH WARDS", 8), 
                     GEOID = c("55081800750001", "55081800750002", "55081800750003", "55081800750004", "5508180075005A", "55081800750006", "5508180075005B", "55081800750024"))

My cleverest attempt was to pull off the alphanumeric strings via regex, change the dashes to colons, then use eval(parse(text = 'character stuff here')), but I can't separate/mutate that into any additional columns. Help, please.

It's not easy work when everything looks like a special case. We have to deal with

tails
[[1]]
[1] "1-3"

[[2]]
[1] "1-2A"

[[3]]
[1] "2B-2C"

[[4]]
[1] "1-5A,6"

[[5]]
[1] "5B,24"

Only the first element is covered by the eval(). strsplit(x, "-") handles the second element, and strsplit() with the appropriate delimiter ("-" or ",") will get the third and fifth list elements.

The remaining case is trickier. "1-5A,6" has to be split on "," and on "-" separately. The piece after the comma ("6") can go straight on to be padded, leaving "1-5A". After splitting that on "-", the second element ("5A") can also go on to be padded, once you've copied out its numeric portion 5, converted it to numeric, and subtracted 1, leaving 4, which you convert back to character. Then paste0("1", ":", "4") and proceed with eval() as before. Finally, pad everything out to the same width and paste0() it onto the GEOID as you've done.

Then circle back to write the regex to classify the input into cases to apply the logic above.

The problem is a great illustration of divide-and-conquer: splitting everything down to its minimal representation and applying successive small operations.
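
Putting those pieces together, here's a minimal sketch of the expansion step; expand_wards() is a hypothetical helper, and it only knows about the five patterns in the sample df, not the full dataset:

# A sketch of the split-and-expand logic described above
expand_wards <- function(tail) {
  parts <- unlist(strsplit(tail, ","))           # "1-5A,6" -> "1-5A", "6"
  out <- character(0)
  for (p in parts) {
    if (!grepl("-", p)) {                        # single ward: "6", "24", "5B"
      out <- c(out, p)
      next
    }
    ends <- unlist(strsplit(p, "-"))             # "1-5A" -> "1", "5A"
    lo_suffix <- gsub("[0-9]", "", ends[1])
    hi_suffix <- gsub("[0-9]", "", ends[2])
    if (lo_suffix != "") {                       # "2B-2C": keep both ends as is
      out <- c(out, ends)
    } else if (hi_suffix == "") {                # "1-3": a plain numeric range
      out <- c(out, as.character(seq(as.integer(ends[1]), as.integer(ends[2]))))
    } else {                                     # "1-5A": run 1:(5 - 1), then "5A"
      hi <- as.integer(gsub("[^0-9]", "", ends[2]))
      out <- c(out, as.character(seq(as.integer(ends[1]), hi - 1)), ends[2])
    }
  }
  out
}

expand_wards("1-5A,6")  # "1" "2" "3" "4" "5A" "6"
expand_wards("2B-2C")   # "2B" "2C"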

technocrat:

Thanks for the reply. Still having problems. I guess the first relates to your opening sentence: "It's not easy work when everything looks like a special case." I agree. The big-picture problem for me is that this is just a small sample of my data; the actual dataset I'm using has over 11K rows and probably a hundred 'special cases'.

So with the df above I can do something like:

# pull the trailing ward spec off into its own column
df <- df %>% mutate(geoAppendix = str_extract(precinct, "(\\d.*?)$"))
# swap dashes for colons so that e.g. "1-3" becomes the range expression "1:3"
df$geoAppendix <- gsub("-", ":", df$geoAppendix)

and easily get to something where I have all of the 'suffixes' pulled off into their own column, but eval(parse(text = 'blah')) only works on the first row item. So I'm still stuck where a brute-force, line-by-line approach is a solution, but not one that's practical given the dataset.
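
A partial workaround is to evaluate each row's expression separately and unnest the result, but that only covers the purely numeric ranges like "1:3"; anything with a letter suffix still fails to parse (a sketch, using the geoAppendix column from above):

library(dplyr)
library(tidyr)
library(stringr)

# evaluate each range on its own row, then unnest; note the filter --
# eval(parse()) still chokes on suffixes like "1:2A", so only clean numeric
# ranges get expanded here
df %>%
  filter(str_detect(geoAppendix, "^[0-9]+:[0-9]+$")) %>%
  rowwise() %>%
  mutate(ward = list(eval(parse(text = geoAppendix)))) %>%
  ungroup() %>%
  unnest(ward)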

So looking for something like this:

#This Df...
sampleDf <- data.frame(Freq = c("Town A", "Town B", "City A", "Town C"),
                       seq = c("A", "B A B B B", "A B B A", "A B A A B A B"))

# Does what I'm after with this 
sampleDf <- sampleDf %>% 
  mutate(n = row_number()) %>% 
  separate_rows(seq, sep = ' ') %>% 
  group_by(n, Freq) %>% 
  mutate(n2 = row_number()) %>% 
  spread(n2, seq)

This works on that example because all of the elements in the 'seq' column have a length of 1. I can't seem to get anything analogous to work in my example. Everything I do to split the geoAppendix strings leaves me elements of length 3, 4, etc.

So, yes, it's feasible to hardwire three distinct patterns, but not several hundred. It might be worthwhile to try splitting the column on the last space and classifying the patterns to see if they reflect some discernible set of encoding schemes. For example, 5A and 5a could be treated similarly with toupper(). Of course, if you had an uncompressed list of precincts you'd be using that.

If there are still too darn many, the next resort would be machine learning, which we can try to puzzle out if the attempt to reduce the number of variations to a manageable number by harmonizing simple differences in punctuation, capitalization, etc. falls short.
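
A minimal sketch of that harmonization step (normalize_precinct() is just an illustrative name, and the exact replacements are assumptions):

# collapse trivial variants before classifying
library(stringr)

normalize_precinct <- function(x) {
  x |>
    toupper() |>                        # "5a" and "5A" collapse to one token
    str_squish() |>                     # trim and collapse stray whitespace
    str_replace_all("\\s*-\\s*", "-")   # "05 - 1" and "05-1" collapse to one token
}

normalize_precinct(c("Town of Knox ward 1 ", "TOWN  OF KNOX WARD 1"))
# both return "TOWN OF KNOX WARD 1"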

I'm more pessimistic after taking a sample of the MIT Election Database results for 2020. About two-thirds are straightforward, but the last category looks to be a killer.

# Load required libraries
library(stringr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Read the CSV file
precincts <- read.csv("https://gist.githubusercontent.com/technocrat/6ce362aef49ff0a99b188faadd68d28b/raw/3fd015c370676f2aa0459288cb77df0bea3b8fff/precincts.csv", stringsAsFactors = FALSE)

# Function to classify precincts
classify_precinct <- function(precinct) {
  has_Numeral <- str_detect(precinct, "\\d")
  has_letter <- str_detect(precinct, "[a-zA-Z]")
  has_punctuation <- str_detect(precinct, "[ ~[:punct:]]")  # spaces and tildes count as punctuation here
  
  if (has_Numeral & has_letter & has_punctuation) {
    return("Numeral, Letter, Punctuation")
  } else if (has_Numeral & has_letter) {
    return("Numeral, Letter")
  } else if (has_Numeral & has_punctuation) {
    return("Numeral, Punctuation")
  } else if (has_letter & has_punctuation) {
    return("Letter, Punctuation")
  } else if (has_Numeral) {
    return("Numeral")
  } else if (has_letter) {
    return("Letter")
  } else if (has_punctuation) {
    return("Punctuation")
  } else {
    return("Other")
  }
}

# Apply the function to classify precincts
precincts$Category <- sapply(precincts$precinct, classify_precinct)

precincts |> group_by(Category) |> count()
#> # A tibble: 6 × 2
#> # Groups:   Category [6]
#>   Category                         n
#>   <chr>                        <int>
#> 1 Letter                          21
#> 2 Letter, Punctuation             27
#> 3 Numeral                         74
#> 4 Numeral, Letter                 18
#> 5 Numeral, Letter, Punctuation   171
#> 6 Numeral, Punctuation           173

precincts |> filter(Category == 'Numeral') |> 
             select(precinct) |> head()
#>   precinct
#> 1        1
#> 2        2
#> 3        3
#> 4        4
#> 5        5
#> 6        7

precincts |> filter(Category == 'Letter') |> 
             select(precinct) |> head()
#>     precinct
#> 1     AITKIN
#> 2    BOOKMAN
#> 3    BOONDOX
#> 4    BOSWELL
#> 5 COURTHOUSE
#> 6 FELSENTHAL

precincts |> filter(Category == 'Numeral, Letter') |> 
             select(precinct) |> head()
#>   precinct
#> 1    4MSIC
#> 2     AV51
#> 3   DRP031
#> 4     F183
#> 5     FP04
#> 6      G03

precincts |> filter(Category == 'Numeral, Punctuation') |> 
             select(precinct) |> head()
#>    precinct
#> 1      45.1
#> 2 8200031.0
#> 3 8700032.0
#> 4     00 15
#> 5     00 18
#> 6    00 192

precincts |> filter(Category == 'Numeral, Letter, Punctuation') |> 
             select(precinct)
#>                            precinct
#> 1          (B21) CITY OF BETTENDORF
#> 2                   #10 PORTER PARK
#> 3                            00 11A
#> 4       0018 CLARKSTOWN 19 (WARD 4)
#> 5                          01 101 B
#> 6                    0237 FOOTHILLS
#> 7                0250 COLONIE ED 43
#> 8            0313 NEW SCOTLAND ED 8
#> 9                            05 01B
#> 10                       05-1 LEOMA
#> 11               0526 RANCHO MIRAGE
#> 12        06013DANV104_002011007016
#> 13        06013PINL105_002005009015
#> 14        06013RICH150_002011009015
#> 15       060371500027E_003032022048
#> 16       060371850056C_003040032058
#> 17       060372000053A_003040024051
#> 18       060373750009B_001027025041
#> 19       060374400024B_003027025041
#> 20       060376050001A_001029018039
#> 21       060377750040A_003028026050
#> 22       060379001440A_003033026050
#> 23       060379002686A_003029018046
#> 24       060379002890A_003037030054
#> 25       060379990331A_003032022048
#> 26       060379990859A_003043030064
#> 27        06071ONT1406_001035020052
#> 28        06071UNI0279_001008020047
#> 29                        1, WARD 2
#> 30                       10, WARD 2
#> 31                    102 - TIP TOP
#> 32         107_107 LAUREL MANOR REC
#> 33                        110 DECLO
#> 34               13_COUNTY FLOATING
#> 35                        1B LATHAM
#> 36    2_PRE 2 HAVANA PUBLIC LIBRARY
#> 37       208_208 - CLUB CONTINENTAL
#> 38                      208_PCT 208
#> 39                          216 CP2
#> 40                          24-AVCB
#> 41                        3, WARD 1
#> 42                 4 NORTH ABERDEEN
#> 43               556_PRECINCT 556.0
#> 44                  6_LEE CITY HALL
#> 45                   74_PRECINCT 74
#> 46               779_PRECINCT 779.0
#> 47                        8-D VESTA
#> 48               857_PRECINCT 857.0
#> 49  9_9 - PORT CHARLOTTE BEACH PARK
#> 50                             9-1E
#> 51                          ADAMS 3
#> 52                  ANDOVER_WDR-3-1
#> 53                BLOOMINGDALE  055
#> 54                   BROOKHAVEN 249
#> 55                   BROOKHAVEN 256
#> 56                   BROOKHAVEN 267
#> 57                        BURTON 1C
#> 58                        CANTON  9
#> 59                    CASEYVILLE 11
#> 60                CAUSEWAY BRANCH 1
#> 61                  CEDAR RAPIDS 23
#> 62         CITY OF APPLETON WARD 13
#> 63        CITY OF GREEN BAY WARD 10
#> 64        CITY OF JANESVILLE WARD 3
#> 65        CITY OF JANESVILLE WARD 9
#> 66        CITY OF MILWAUKEE WARD 16
#> 67       CITY OF MILWAUKEE WARD 243
#> 68         CITY OF NEW ROCHELLE - 4
#> 69           CITY OF WAUSAU WARD 32
#> 70                          CLAY 08
#> 71                    CLIFTY - 3700
#> 72                        COLS 83-E
#> 73                         CONC 001
#> 74                    COVENTRY 0605
#> 75                        DEKALB 03
#> 76      DIST. 4, CARPENTER PRECINCT
#> 77        DISTRICT 11-STRONG SCHOOL
#> 78                      DULUTH P-32
#> 79                EAGLE MOUNTAIN 09
#> 80                       EDINA P-12
#> 81                        EVERETT 3
#> 82                     FAIR OAKS 04
#> 83               FARMINGTON 08:UBC-
#> 84                 FRANKLIN GROVE 2
#> 85                         GREECE 3
#> 86                        GREECE 68
#> 87                    GURDON WARD 5
#> 88                         HAMB 021
#> 89                         HARLEM 2
#> 90                  HARRISON - 4450
#> 91                HOISINGTON WARD 3
#> 92                        ISLIP 119
#> 93                        JAMAICA 1
#> 94                KAYSVILLE 22:I-C-
#> 95                         LANC 010
#> 96                   LAYTON 38:I-N-
#> 97                 LEG. DIST. 26 12
#> 98                          LEHI 23
#> 99                          LERAY 7
#> 100                      LORRAINE 1
#> 101                       MANTENO 2
#> 102                  MARLBORO_WDH-5
#> 103              MASON CITY W-4 P-1
#> 104                 MIDDLETOWN 1902
#> 105            MINNEAPOLIS W-3 P-03
#> 106                 MT. PLEASANT 20
#> 107                  MT. STERLING 1
#> 108                   NAN_MID - 014
#> 109                        NOBLE 09
#> 110                  NORTH LOGAN 06
#> 111           NORTH SMITHFIELD 2502
#> 112                        OMAK #10
#> 113             ONEONTA CITY WARD 6
#> 114                  PALM TREE - D3
#> 115                       PATOKA 11
#> 116                        PATTON 7
#> 117                         PCT 260
#> 118                         PCT 423
#> 119                         PCT 593
#> 120                         PER 044
#> 121                     PERINTON 30
#> 122                          PERU 7
#> 123                    PLATTEKILL 5
#> 124                       POMFRET 8
#> 125                   PRECINCT 0005
#> 126                   PRECINCT 0031
#> 127                     PRECINCT 12
#> 128                     PRECINCT 25
#> 129                     PRECINCT 27
#> 130                    PRECINCT 317
#> 131                     PRECINCT 40
#> 132                     PRECINCT 49
#> 133                      PRECINCT 6
#> 134                     PRECINCT 93
#> 135                 PRECINCT NO. 12
#> 136             PRECINCT TOLEDO 21H
#> 137                 PROVIDENCE 2817
#> 138                           RAY 1
#> 139                 RENO-VERDI 1040
#> 140                RIVER HEIGHTS 02
#> 141                     RIVERHEAD 6
#> 142                 ROCHESTER W6 P5
#> 143                   SANDY CREEK 3
#> 144               ST. PAUL W-1 P-04
#> 145      STAMPS WARD 1, PCT 2 (OUT)
#> 146                      SUBURBAN 2
#> 147        TOWN OF CATSKILL 14 LD 1
#> 148    TOWN OF DARLINGTON WARDS 1-3
#> 149           TOWN OF DURHAM 2 LD 9
#> 150          TOWN OF FIFIELD WARD 2
#> 151             TOWN OF KNOX WARD 1
#> 152               TOWN OF LINCOLN 2
#> 153   TOWN OF POUGHKEEPSIE W 6 ED 2
#> 154    TOWN OF ROCKBRIDGE WARDS 1-3
#> 155                TOWN OF RYE - 16
#> 156        TOWN OF SHARON WARDS 1-3
#> 157      TOWN OF WAPPINGER W 1 ED 4
#> 158                         TROY 19
#> 159                    URBANDALE 10
#> 160                 VERGENNES_ADD-3
#> 161    VILLAGE OF GENOA CITY WARD 5
#> 162      VILLAGE OF MONTFORT WARD 1
#> 163                       W2 - P626
#> 164             WARD 07 PRECINCT 25
#> 165             WARD 09 PRECINCT 01
#> 166                         WARD 11
#> 167             WARD 11 PRECINCT 12
#> 168              WARD 4 PRECINCT 02
#> 169          WEST BOUNTIFUL 03:I-S-
#> 170                   WILLIAMSON 03
#> 171                       YORK  018

Created on 2023-11-18 with reprex v2.0.2

Sorry for the delayed reply. I had some work projects get moved up, with some holiday travel in the middle. I hope to get back into this data and look at your reply by next week. I think the MIT election data and the Harvard files are the same datasets (and I may have pulled from the latter; I'll have to check to make sure).

Thanks for the time and effort and let me see how much further along I get before replying in full.
