create multiple data records from a single record

eclymer · June 15, 2020, 9:34pm

I am trying to build a database of 9-digit zip codes from the 5-digit zip code and "Plus4Low" and "Plus4High." For example, zip code 37401 has Plus4Low of 0102, and Plus4High of 0151. This means that there are unique 9-digit zip codes 37401-0102, 37401-0103,...,37401-0151. So I need to end up with 50 copies of the original record with a "Plus4" field populated with 0102, 0103,...0151. There are thousands of such zip codes with varying numbers of copies that need to be made in order to create the complete 9-digit database. What is an efficient way to do this?

StatSteph · June 16, 2020, 1:08am

There may be a more clever way to do this but here's my thought on how I would do it with a small example:

library(tidyverse)

datinit <- tibble (zip=c(37401, 37402), 
                   Plus4Low=c(0102, 0150), 
                   Plus4High=c(0105, 152))

datinit %>%
  pivot_longer(-zip, names_to="TypeSuff", values_to="Suff") %>% 
  group_by(zip) %>% #this is to sequences within zip
  complete(Suff=full_seq(Suff, 1)) %>% #make +4 sequence within zip
  ungroup() %>% #remove group so we don't accidentally have it later
  mutate(ZipPlus4=str_c(zip, str_pad(Suff, 4, "left", "0"), sep="-")) #make zip+4 
#> # A tibble: 7 x 4
#>     zip  Suff TypeSuff  ZipPlus4  
#>   <dbl> <dbl> <chr>     <chr>     
#> 1 37401   102 Plus4Low  37401-0102
#> 2 37401   103 <NA>      37401-0103
#> 3 37401   104 <NA>      37401-0104
#> 4 37401   105 Plus4High 37401-0105
#> 5 37402   150 Plus4Low  37402-0150
#> 6 37402   151 <NA>      37402-0151
#> 7 37402   152 Plus4High 37402-0152

^{Created on 2020-06-15 by the reprex package (v0.3.0)}

eclymer · June 16, 2020, 2:14pm

Thank you, this looks very promising. Where I'm struggling now is how to apply this to a dataset that has thousands of zips, and additional variables. In other words, the way datainit is defined in the example tells the code that follows to run the sequence from Plus4Low to Plus4High. I'm sure there is some simple way to tell the code to sequence from Plus4Low to Plus4High without typing all the vectors by hand, but I'm not having any luck finding it.

StatSteph · June 16, 2020, 3:17pm

It might be helpful to see what your data looks like. You won't need to manually type out the codes. In R, enter dput(head(datasetname)) for whateveryour dataset name is. Then paste the output into this message board inside a code chunk by entering it in a code chunk (FAQ: How to format your code).

eclymer · June 16, 2020, 3:55pm

Does this help?

ZipCode Plus4Low Plus4High CenLat CenLon
37401 0102 0151 35.04580 -85.30653 
38231 0401 0520 36.20229 -88.42088 
38231 0521 0580 36.20229 -88.42088
37056 0001 0018 0.00000 0.00000 
37056 0021 0028 0.00000 0.00000 
37056 0031 0048 0.00000 0.00000

StatSteph · June 16, 2020, 5:34pm

Not really, I can't easily copy/paste that into R. Please use dput(head(data)) and paste that into hear. That will look like this (using mtcars as an example):

structure(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1), cyl = c(6, 
6, 4, 6, 8, 6), disp = c(160, 160, 108, 258, 360, 225), hp = c(110, 
110, 93, 110, 175, 105), drat = c(3.9, 3.9, 3.85, 3.08, 3.15, 
2.76), wt = c(2.62, 2.875, 2.32, 3.215, 3.44, 3.46), qsec = c(16.46, 
17.02, 18.61, 19.44, 17.02, 20.22), vs = c(0, 0, 1, 1, 0, 1), 
    am = c(1, 1, 1, 0, 0, 0), gear = c(4, 4, 4, 3, 3, 3), carb = c(4, 
    4, 1, 1, 2, 1)), row.names = c("Mazda RX4", "Mazda RX4 Wag", 
"Datsun 710", "Hornet 4 Drive", "Hornet Sportabout", "Valiant"
), class = "data.frame")

eclymer · June 16, 2020, 6:35pm

dput returns a list of values that is more than 65,000 characters long. I don't think it is even all fitting in the Console, but when I try to take what it does give me and paste it here, there is an error saying the character limit has been exceeded.

I also tried keeping just the first 10 rows of data and running dput on that, and I get the same huge list of values...

HanOostdijk · June 16, 2020, 7:51pm

@eclymer I think you did not specify head in dput(head(data)).
With the 6 rows you did display (assuming the fields you show is the entire contents of each row) your dput output should only have a limited number of characters.
Apart from that I think you can use the method of @StatSteph without much trouble. Just try it on the first few rows testdata = head(yourdata)

eclymer · June 16, 2020, 8:26pm

I ran the line below on a subset of the data with only 10 rows in it, and continue to get 65,000+ characters. I may be missing something simple, but I don't know offhand what it is.

dput(head(RZD_multi_10))

The 6 rows displayed above were produced by running head(RZD_multi), and do contain all the columns in the data.

StatSteph · June 16, 2020, 8:32pm

Maybe select just these columns first - I thought that's all the columns you had.

eclymer · June 16, 2020, 10:29pm

Not sure what you mean. These are all the columns in the file.

ZipCode Plus4Low Plus4High CenLat CenLon
37401 0102 0151 35.04580 -85.30653 
38231 0401 0520 36.20229 -88.42088 
38231 0521 0580 36.20229 -88.42088
37056 0001 0018 0.00000 0.00000 
37056 0021 0028 0.00000 0.00000 
37056 0031 0048 0.00000 0.00000

HanOostdijk · June 17, 2020, 12:15am

Maybe you have factor data so that even with the first few rows you take with you all the contents (definitions) of the factor. Is the result of

str(yourdata)

manageable?

eclymer · June 17, 2020, 12:11pm

It is:

> str(RZD_multi) 'data.frame': 9488 obs. of 5 variables: $ ZipCode : int 37401 38231 38231 37056 37056 37056 37056 37056 37056 37056 ... $ Plus4Low : Factor w/ 10041 levels "0001","0002",..: 103 405 526 1 21 31 51 61 81 91 ... $ Plus4High: Factor w/ 10041 levels "0001","0002",..: 152 525 585 18 28 48 58 74 88 105 ... $ CenLat : num 35 36.2 36.2 0 0 ... $ CenLon : num -85.3 -88.4 -88.4 0 0 ...

>

eclymer · June 17, 2020, 12:13pm

Not sure if previous post came through in the best format. Here is another try.

Version:1.0 StartHTML:0000000107 EndHTML:0000001386 StartFragment:0000000127 EndFragment:0000001368

'data.frame': 9488 obs. of 5 variables: $ ZipCode : int 37401 38231 38231 37056 37056 37056 37056 37056 37056 37056 ... $ Plus4Low : Factor w/ 10041 levels "0001","0002",..: 103 405 526 1 21 31 51 61 81 91 ... $ Plus4High: Factor w/ 10041 levels "0001","0002",..: 152 525 585 18 28 48 58 74 88 105 ... $ CenLat : num 35 36.2 36.2 0 0 ... $ CenLon : num -85.3 -88.4 -88.4 0 0 ...

nirgrahamuk · June 17, 2020, 12:38pm

recommend you take the head of your data, drop unused factor levels and dput that
note the n= param of head that lets you control the size of the head to keep

library(tidyverse)
head(RZD_multi_10,n=10) %>%  mutate_if(is.factor,forcats::fct_drop) %>% dput()

eclymer · June 17, 2020, 1:01pm

OK, here is what the code recommended by nirgrahamuk returns:

Version:1.0 StartHTML:0000000107 EndHTML:0000001659 StartFragment:0000000127 EndFragment:0000001641

structure(list(ZipCode = c(37401L, 38231L, 38231L, 37056L, 37056L, 37056L, 37056L, 37056L, 37056L, 37056L), Plus4Low = structure(c(8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L), .Label = c("0001", "0021", "0031", "0051", "0061", "0081", "0091", "0102", "0401", "0521" ), class = "factor"), Plus4High = structure(c(8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L), .Label = c("0018", "0028", "0048", "0058", "0074", "0088", "0104", "0151", "0520", "0580"), class = "factor"), CenLat = c(35.045799, 36.202288, 36.202288, 0, 0, 0, 0, 0, 0, 0), CenLon = c(-85.306528, -88.420878, -88.420878, 0, 0, 0, 0, 0, 0, 0)), class = "data.frame", row.names = c(NA, -10L))

nirgrahamuk · June 17, 2020, 1:36pm

library(tidyverse)

(datinit <- structure(list(ZipCode = c(37401L, 38231L, 38231L, 37056L, 37056L, 37056L, 37056L, 37056L, 
37056L, 37056L), Plus4Low = structure(c(8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L), .Label = c("0001", 
"0021", "0031", "0051", "0061", "0081", "0091", "0102", "0401", "0521" ), class = "factor"), Plus4High = 
structure(c(8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L), .Label = c("0018", "0028", "0048", "0058", "0074", 
"0088", "0104", "0151", "0520", "0580"), class = "factor"), CenLat = c(35.045799, 36.202288, 36.202288, 
0, 0, 0, 0, 0, 0, 0), CenLon = c(-85.306528, -88.420878, -88.420878, 0, 0, 0, 0, 0, 0, 0)), class = 
"data.frame", row.names = c(NA, -10L)) %>% as_tibble())

datinit %>% 
mutate_at(vars(starts_with("Plus")),
~as.integer(as.character(.))) %>% 
rowwise() %>%
  mutate(slist = list(Plus4Low:Plus4High)) %>% 
unnest(slist) %>% 
  mutate(ZipPlus4=str_c(ZipCode, str_pad(slist, 4, "left", "0"), sep="-"))

StatSteph · June 17, 2020, 2:19pm

I assume you read in your data using read.csv and reads strings as factors which takes up more room and isn't needed. I'd suggest either using read.csv(filepath, stringsAsFactors=FALSE) or use readr::read_csv(filepath) which does it by default.

eclymer · June 18, 2020, 6:12pm

This seems better:

Version:1.0 StartHTML:0000000107 EndHTML:0000001388 StartFragment:0000000127 EndFragment:0000001370

structure(list(ZipCode = c(37401L, 38231L, 38231L, 37056L, 37056L, 37056L), Plus4Low = c("0102", "0401", "0521", "0001", "0021", "0031"), Plus4High = c("0151", "0520", "0580", "0018", "0028", "0048"), CenLat = c(35.045799, 36.202288, 36.202288, 0, 0, 0), CenLon = c(-85.306528, -88.420878, -88.420878, 0, 0, 0)), row.names = c(143L, 217L, 218L, 910L, 911L, 912L), class = "data.frame")

��

StatSteph · June 18, 2020, 9:52pm

Great, so only a slight modification from @nigrahamuk's solution. You no longer need to convert from factor to character to integer but only character to integer. You will use your actual data rather than a small subset as here. No need to type the data as you thought.

library(tidyverse)

(datinit <- structure(list(ZipCode = c(37401L, 38231L, 38231L, 37056L, 37056L, 37056L), Plus4Low = c("0102", "0401", "0521", "0001", "0021", "0031"), Plus4High = c("0151", "0520", "0580", "0018", "0028", "0048"), CenLat = c(35.045799, 36.202288, 36.202288, 0, 0, 0), CenLon = c(-85.306528, -88.420878, -88.420878, 0, 0, 0)), row.names = c(143L, 217L, 218L, 910L, 911L, 912L), class = "data.frame"))
#>     ZipCode Plus4Low Plus4High   CenLat    CenLon
#> 143   37401     0102      0151 35.04580 -85.30653
#> 217   38231     0401      0520 36.20229 -88.42088
#> 218   38231     0521      0580 36.20229 -88.42088
#> 910   37056     0001      0018  0.00000   0.00000
#> 911   37056     0021      0028  0.00000   0.00000
#> 912   37056     0031      0048  0.00000   0.00000

datinit %>% 
  mutate_at(vars(starts_with("Plus")),
            ~as.integer(.)) %>% 
  rowwise() %>%
  mutate(slist = list(Plus4Low:Plus4High)) %>% 
  unnest(slist) %>% 
  mutate(ZipPlus4=str_c(ZipCode, str_pad(slist, 4, "left", "0"), sep="-"))
#> # A tibble: 274 x 7
#>    ZipCode Plus4Low Plus4High CenLat CenLon slist ZipPlus4  
#>      <int>    <int>     <int>  <dbl>  <dbl> <int> <chr>     
#>  1   37401      102       151   35.0  -85.3   102 37401-0102
#>  2   37401      102       151   35.0  -85.3   103 37401-0103
#>  3   37401      102       151   35.0  -85.3   104 37401-0104
#>  4   37401      102       151   35.0  -85.3   105 37401-0105
#>  5   37401      102       151   35.0  -85.3   106 37401-0106
#>  6   37401      102       151   35.0  -85.3   107 37401-0107
#>  7   37401      102       151   35.0  -85.3   108 37401-0108
#>  8   37401      102       151   35.0  -85.3   109 37401-0109
#>  9   37401      102       151   35.0  -85.3   110 37401-0110
#> 10   37401      102       151   35.0  -85.3   111 37401-0111
#> # ... with 264 more rows

^{Created on 2020-06-18 by the reprex package (v0.3.0)}