Replace dataframe values with a reference data frame?

Megan_Huber · May 26, 2020, 5:50am

I'm trying to add a column with the populations of each state in the US to data from the covdata package.

#remotes::install_github("kjhealy/covdata")
library(covdata)

# the csv has state names in one column (state_ab) and populations in a second column (pop_2019)
pops_ref <- read_csv("state_pops.csv") %>%
  mutate(pop_2019 = as.character(pop_2019))

covus %>%
  group_by(state, date) %>%
  pivot_wider(names_from = measure, values_from = count) %>%
  ungroup() %>%
  mutate(state_pop = str_replace_all(state, pops_ref$state_ab, pops_ref$pop_2019))

I get the error of "longer object length is not a multiple of shorter object length" and the state_pop column shows identical values to the state column.

I know I could write a brutal case_when function, but since there are 50 states I would prefer to not do that. Is there any way to do this easier? I spent a few hours googling around for some simple way to do a definition or recode or replace and I have come up with nothing. What am I missing?

nirgrahamuk · May 26, 2020, 6:03am

I recommend you read about dplyr joins.

Megan_Huber · May 26, 2020, 6:11am

whoops! Of course, as soon as I finally wrote this question up and gave a last-ditch google effort, I found the solution.
the plyr function mapvalues is exactly what I needed.

mutate(state_pop = mapvalues(state_pop, populs$state_abbs, populs$pop_2019))

... should I delete the question? mark it solved?

Megan_Huber · May 26, 2020, 6:13am

I tried using a variety of joins, but frankly they scare me and I thought there had to be a better solution than joining a 50-row data frame to a ~81000 row data frame.

nirgrahamuk · May 26, 2020, 6:55am

By all means mark your solution solved if you consider it so.

nirgrahamuk · May 26, 2020, 8:21am

Heres an example that compares the two approaches, it may help with join fear ?

#remotes::install_github("kjhealy/covdata")
library(covdata)


library(tidyverse)
#invent population integers associated with states
populs <- enframe(unique(covus$state),
                  name="pop_2019",
                  value="state_abbs") 

library(plyr)

library(microbenchmark)

microbenchmark(
  with_plyr = with_plyr<-covus %>%
  group_by(state, date) %>%
  pivot_wider(names_from = measure, values_from = count) %>%
  ungroup() %>%
  mutate(pop_2019 = as.integer(mapvalues(state, populs$state_abbs, populs$pop_2019)))
,
with_join = with_join<-covus %>%
  group_by(state, date) %>%
  pivot_wider(names_from = measure, values_from = count) %>%
  ungroup() %>% left_join(populs,by=c("state"="state_abbs"))
, times=100L
)
Unit: milliseconds
      expr     min       lq     mean   median       uq     max neval cld
 with_plyr 33.4725 35.34005 38.44344 36.51270 39.01455 98.3017   100   a
 with_join 32.0752 35.13425 37.51821 36.26925 38.54900 56.8271   100   a

all_equal(with_plyr,with_join)
[1] TRUE

martin.R · May 26, 2020, 9:46am

In the longer term you will be better off using dplyr than plyr functions.

This is a summary of the join functions:

And the particular function definitions:
https://dplyr.tidyverse.org/reference/join.html

The majority of use cases will involve left_join() or inner_join().

system · June 2, 2020, 9:46am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.