I am trying to build a database of 9-digit zip codes from the 5-digit zip code and "Plus4Low" and "Plus4High." For example, zip code 37401 has Plus4Low of 0102, and Plus4High of 0151. This means that there are unique 9-digit zip codes 37401-0102, 37401-0103,...,37401-0151. So I need to end up with 50 copies of the original record with a "Plus4" field populated with 0102, 0103,...0151. There are thousands of such zip codes with varying numbers of copies that need to be made in order to create the complete 9-digit database. What is an efficient way to do this?
There may be a more clever way to do this but here's my thought on how I would do it with a small example:
library(tidyverse)
datinit <- tibble (zip=c(37401, 37402),
Plus4Low=c(0102, 0150),
Plus4High=c(0105, 152))
datinit %>%
pivot_longer(-zip, names_to="TypeSuff", values_to="Suff") %>%
group_by(zip) %>% #this is to sequences within zip
complete(Suff=full_seq(Suff, 1)) %>% #make +4 sequence within zip
ungroup() %>% #remove group so we don't accidentally have it later
mutate(ZipPlus4=str_c(zip, str_pad(Suff, 4, "left", "0"), sep="-")) #make zip+4
#> # A tibble: 7 x 4
#> zip Suff TypeSuff ZipPlus4
#> <dbl> <dbl> <chr> <chr>
#> 1 37401 102 Plus4Low 37401-0102
#> 2 37401 103 <NA> 37401-0103
#> 3 37401 104 <NA> 37401-0104
#> 4 37401 105 Plus4High 37401-0105
#> 5 37402 150 Plus4Low 37402-0150
#> 6 37402 151 <NA> 37402-0151
#> 7 37402 152 Plus4High 37402-0152
Created on 2020-06-15 by the reprex package (v0.3.0)
Thank you, this looks very promising. Where I'm struggling now is how to apply this to a dataset that has thousands of zips, and additional variables. In other words, the way datainit is defined in the example tells the code that follows to run the sequence from Plus4Low to Plus4High. I'm sure there is some simple way to tell the code to sequence from Plus4Low to Plus4High without typing all the vectors by hand, but I'm not having any luck finding it.
It might be helpful to see what your data looks like. You won't need to manually type out the codes. In R, enter dput(head(datasetname))
for whateveryour dataset name is. Then paste the output into this message board inside a code chunk by entering it in a code chunk (FAQ: How to format your code).
Does this help?
ZipCode Plus4Low Plus4High CenLat CenLon
37401 0102 0151 35.04580 -85.30653
38231 0401 0520 36.20229 -88.42088
38231 0521 0580 36.20229 -88.42088
37056 0001 0018 0.00000 0.00000
37056 0021 0028 0.00000 0.00000
37056 0031 0048 0.00000 0.00000
Not really, I can't easily copy/paste that into R. Please use dput(head(data)) and paste that into hear. That will look like this (using mtcars as an example):
structure(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1), cyl = c(6,
6, 4, 6, 8, 6), disp = c(160, 160, 108, 258, 360, 225), hp = c(110,
110, 93, 110, 175, 105), drat = c(3.9, 3.9, 3.85, 3.08, 3.15,
2.76), wt = c(2.62, 2.875, 2.32, 3.215, 3.44, 3.46), qsec = c(16.46,
17.02, 18.61, 19.44, 17.02, 20.22), vs = c(0, 0, 1, 1, 0, 1),
am = c(1, 1, 1, 0, 0, 0), gear = c(4, 4, 4, 3, 3, 3), carb = c(4,
4, 1, 1, 2, 1)), row.names = c("Mazda RX4", "Mazda RX4 Wag",
"Datsun 710", "Hornet 4 Drive", "Hornet Sportabout", "Valiant"
), class = "data.frame")
dput returns a list of values that is more than 65,000 characters long. I don't think it is even all fitting in the Console, but when I try to take what it does give me and paste it here, there is an error saying the character limit has been exceeded.
I also tried keeping just the first 10 rows of data and running dput on that, and I get the same huge list of values...
@eclymer I think you did not specify head in dput(head(data))
.
With the 6 rows you did display (assuming the fields you show is the entire contents of each row) your dput
output should only have a limited number of characters.
Apart from that I think you can use the method of @StatSteph without much trouble. Just try it on the first few rows testdata = head(yourdata)
I ran the line below on a subset of the data with only 10 rows in it, and continue to get 65,000+ characters. I may be missing something simple, but I don't know offhand what it is.
dput(head(RZD_multi_10))
The 6 rows displayed above were produced by running head(RZD_multi)
, and do contain all the columns in the data.
Maybe select just these columns first - I thought that's all the columns you had.
Not sure what you mean. These are all the columns in the file.
ZipCode Plus4Low Plus4High CenLat CenLon
37401 0102 0151 35.04580 -85.30653
38231 0401 0520 36.20229 -88.42088
38231 0521 0580 36.20229 -88.42088
37056 0001 0018 0.00000 0.00000
37056 0021 0028 0.00000 0.00000
37056 0031 0048 0.00000 0.00000
Maybe you have factor data so that even with the first few rows you take with you all the contents (definitions) of the factor. Is the result of
str(yourdata)
manageable?
It is:
> str(RZD_multi) 'data.frame': 9488 obs. of 5 variables: $ ZipCode : int 37401 38231 38231 37056 37056 37056 37056 37056 37056 37056 ... $ Plus4Low : Factor w/ 10041 levels "0001","0002",..: 103 405 526 1 21 31 51 61 81 91 ... $ Plus4High: Factor w/ 10041 levels "0001","0002",..: 152 525 585 18 28 48 58 74 88 105 ... $ CenLat : num 35 36.2 36.2 0 0 ... $ CenLon : num -85.3 -88.4 -88.4 0 0 ...
>
Not sure if previous post came through in the best format. Here is another try.
Version:1.0 StartHTML:0000000107 EndHTML:0000001386 StartFragment:0000000127 EndFragment:0000001368
'data.frame': 9488 obs. of 5 variables: $ ZipCode : int 37401 38231 38231 37056 37056 37056 37056 37056 37056 37056 ... $ Plus4Low : Factor w/ 10041 levels "0001","0002",..: 103 405 526 1 21 31 51 61 81 91 ... $ Plus4High: Factor w/ 10041 levels "0001","0002",..: 152 525 585 18 28 48 58 74 88 105 ... $ CenLat : num 35 36.2 36.2 0 0 ... $ CenLon : num -85.3 -88.4 -88.4 0 0 ...
recommend you take the head of your data, drop unused factor levels and dput that
note the n= param of head that lets you control the size of the head to keep
library(tidyverse)
head(RZD_multi_10,n=10) %>% mutate_if(is.factor,forcats::fct_drop) %>% dput()
OK, here is what the code recommended by nirgrahamuk returns:
Version:1.0 StartHTML:0000000107 EndHTML:0000001659 StartFragment:0000000127 EndFragment:0000001641
structure(list(ZipCode = c(37401L, 38231L, 38231L, 37056L, 37056L, 37056L, 37056L, 37056L, 37056L, 37056L), Plus4Low = structure(c(8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L), .Label = c("0001", "0021", "0031", "0051", "0061", "0081", "0091", "0102", "0401", "0521" ), class = "factor"), Plus4High = structure(c(8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L), .Label = c("0018", "0028", "0048", "0058", "0074", "0088", "0104", "0151", "0520", "0580"), class = "factor"), CenLat = c(35.045799, 36.202288, 36.202288, 0, 0, 0, 0, 0, 0, 0), CenLon = c(-85.306528, -88.420878, -88.420878, 0, 0, 0, 0, 0, 0, 0)), class = "data.frame", row.names = c(NA, -10L))
library(tidyverse)
(datinit <- structure(list(ZipCode = c(37401L, 38231L, 38231L, 37056L, 37056L, 37056L, 37056L, 37056L,
37056L, 37056L), Plus4Low = structure(c(8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L), .Label = c("0001",
"0021", "0031", "0051", "0061", "0081", "0091", "0102", "0401", "0521" ), class = "factor"), Plus4High =
structure(c(8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L), .Label = c("0018", "0028", "0048", "0058", "0074",
"0088", "0104", "0151", "0520", "0580"), class = "factor"), CenLat = c(35.045799, 36.202288, 36.202288,
0, 0, 0, 0, 0, 0, 0), CenLon = c(-85.306528, -88.420878, -88.420878, 0, 0, 0, 0, 0, 0, 0)), class =
"data.frame", row.names = c(NA, -10L)) %>% as_tibble())
datinit %>%
mutate_at(vars(starts_with("Plus")),
~as.integer(as.character(.))) %>%
rowwise() %>%
mutate(slist = list(Plus4Low:Plus4High)) %>%
unnest(slist) %>%
mutate(ZipPlus4=str_c(ZipCode, str_pad(slist, 4, "left", "0"), sep="-"))
I assume you read in your data using read.csv
and reads strings as factors which takes up more room and isn't needed. I'd suggest either using read.csv(filepath, stringsAsFactors=FALSE)
or use readr::read_csv(filepath)
which does it by default.
This seems better:
Version:1.0 StartHTML:0000000107 EndHTML:0000001388 StartFragment:0000000127 EndFragment:0000001370
structure(list(ZipCode = c(37401L, 38231L, 38231L, 37056L, 37056L, 37056L), Plus4Low = c("0102", "0401", "0521", "0001", "0021", "0031"), Plus4High = c("0151", "0520", "0580", "0018", "0028", "0048"), CenLat = c(35.045799, 36.202288, 36.202288, 0, 0, 0), CenLon = c(-85.306528, -88.420878, -88.420878, 0, 0, 0)), row.names = c(143L, 217L, 218L, 910L, 911L, 912L), class = "data.frame")
��
Great, so only a slight modification from @nigrahamuk's solution. You no longer need to convert from factor to character to integer but only character to integer. You will use your actual data rather than a small subset as here. No need to type the data as you thought.
library(tidyverse)
(datinit <- structure(list(ZipCode = c(37401L, 38231L, 38231L, 37056L, 37056L, 37056L), Plus4Low = c("0102", "0401", "0521", "0001", "0021", "0031"), Plus4High = c("0151", "0520", "0580", "0018", "0028", "0048"), CenLat = c(35.045799, 36.202288, 36.202288, 0, 0, 0), CenLon = c(-85.306528, -88.420878, -88.420878, 0, 0, 0)), row.names = c(143L, 217L, 218L, 910L, 911L, 912L), class = "data.frame"))
#> ZipCode Plus4Low Plus4High CenLat CenLon
#> 143 37401 0102 0151 35.04580 -85.30653
#> 217 38231 0401 0520 36.20229 -88.42088
#> 218 38231 0521 0580 36.20229 -88.42088
#> 910 37056 0001 0018 0.00000 0.00000
#> 911 37056 0021 0028 0.00000 0.00000
#> 912 37056 0031 0048 0.00000 0.00000
datinit %>%
mutate_at(vars(starts_with("Plus")),
~as.integer(.)) %>%
rowwise() %>%
mutate(slist = list(Plus4Low:Plus4High)) %>%
unnest(slist) %>%
mutate(ZipPlus4=str_c(ZipCode, str_pad(slist, 4, "left", "0"), sep="-"))
#> # A tibble: 274 x 7
#> ZipCode Plus4Low Plus4High CenLat CenLon slist ZipPlus4
#> <int> <int> <int> <dbl> <dbl> <int> <chr>
#> 1 37401 102 151 35.0 -85.3 102 37401-0102
#> 2 37401 102 151 35.0 -85.3 103 37401-0103
#> 3 37401 102 151 35.0 -85.3 104 37401-0104
#> 4 37401 102 151 35.0 -85.3 105 37401-0105
#> 5 37401 102 151 35.0 -85.3 106 37401-0106
#> 6 37401 102 151 35.0 -85.3 107 37401-0107
#> 7 37401 102 151 35.0 -85.3 108 37401-0108
#> 8 37401 102 151 35.0 -85.3 109 37401-0109
#> 9 37401 102 151 35.0 -85.3 110 37401-0110
#> 10 37401 102 151 35.0 -85.3 111 37401-0111
#> # ... with 264 more rows
Created on 2020-06-18 by the reprex package (v0.3.0)