Best way to work with very large (1.5 million record) dataframe using dplyr (or alternative if not possible)

helanthiusfan · December 7, 2020, 7:28pm

Hello! I've been working with IPUMS ACS microdata for just my state and while the files are large, I have no issues processing them with my R scripts that utilize dplyr (mutate, rename, filter, etc.) However, I have found I need to do an additional analysis using the national ACS data and the resulting dataframe once imported into Rstudio is 1.5 million records and 65 variables. Obviously, I'm receiving "Error: cannot allocate vector of size XX.X Mb" when I try to run my script. I've included the relevant portion of my script below along with the library call statement (which includes some libs not used in this portion of script). What would be the best way to alter this script to get around the RAM limitations? I've tried closing all other programs and letting this run for hours and it doesn't seem to make any progress and eventually just times out. Thank you for any advice you can give! I don't really want to sample the data as ACS microdata is already a sample of the larger survey data and I cannot get more RAM and I do not have access to AWS/cloud storage.

suppressPackageStartupMessages({
  library(ipumsr) # import ACS microdata from IPUMS.org
  library(acs) # useful functions for working w/ ACS data
  library(tidyverse) # many useful syntax and QOL functions
  library(labelled) # labels variables/values
  library(janitor) # cleans up column names
  library(stringr) # for working with strings
  library(knitr) # for RMarkdown to HTML
  library(rmarkdown) # to generate Rmd final document
  library(flexdashboard) # to generate Rmd dashboard
  library(sf) # simple features for mapping
  library(leaflet) # creates interactive maps
  library(testit) # to run assertions to verify data
  library(tigris) # import TIGRIS shapefile data
  library(htmlwidgets) # export leaflet maps to HTML
})
# generate grouping variables, filters, geo-joining variable
acs_df_sa <- acs_df_sa %>%
  # remove institutional inmates, 6090 records
  # new total: 290829 records
  filter(relate != 13) %>%
  # create age grouping
  mutate(agegroup = case_when(age >= 0 & age < 19 ~ "0 - 18",
                              age >= 19 & age < 35 ~ "19 - 34",
                              age >= 35 & age < 65 ~ "35 - 64",
                              age >= 65 ~ "65 +")) %>%
  # create 2nd age grouping
  mutate(agegrp2 = case_when(age >= 0 & age < 19 ~ "0 - 18",
                             age >= 19 & age < 35 ~ "19 - 34",
                             age >= 35 & age < 45 ~ "35 - 44",
                             age >= 45 & age < 55 ~ "45 - 54",
                             age >= 55 & age < 65 ~ "55 - 64",
                             age >= 65 ~ "65+")) %>% 
  # FPL groupings for ease of mapping
  mutate(pov_group = case_when(poverty >= 0 & poverty < 133 ~ "Less than 133% of FPL",
                               poverty >= 133 & poverty < 138 ~ "133-138% of FPL",
                               poverty >= 139 & poverty <= 150 ~ "139-150% of FPL",
                               poverty >= 151 & poverty <= 200 ~ "151-200% of FPL",
                               poverty >= 201 & poverty <= 250 ~ "201-250% of FPL",
                               poverty >= 251 & poverty <= 299 ~ "251-299% of FPL",
                               poverty >= 300 & poverty <= 400 ~ "300-400% of FPL",
                               poverty > 400 ~ "More than 400% FPL")) %>%
  # create race/ethnicity mappings
  mutate(race = case_when(race==1 ~ "White",
                          race==2 ~ "Black",
                          race==3 ~ "Native",
                          race==4 ~ "Chinese",
                          race==5 ~ "Japanese",
                          race==6 ~ "Asian",
                          race==7 ~ "Other",
                          race==8 ~ "Biracial",
                          race==9 ~ "Multiracial")) %>%
  mutate(race = ifelse(hispan %in% c(1:4),"Hispanic",race)) %>%
  # create a GEOID variable to join to shape file for PUMA 2010
  mutate(puma = as.numeric(puma)) %>%
  mutate(puma = sprintf("%05d", puma)) %>%
  mutate(GEOID10 = paste(statefip, puma, sep = "")) %>%
  # rename citizen parent/spouse variables for readability
  rename(citmom = citizen_mom) %>%
  rename(citmom2 = citizen_mom2) %>%
  rename(citpop = citizen_pop) %>%
  rename(citpop2 = citizen_pop2) %>%
  rename(citsp = citizen_sp)

nirgrahamuk · December 8, 2020, 5:07pm

The ipumsr package you import from provides functions to work on large data in a chunked fashion.
read about it by typing the following command

vignette(package="ipumsr",topic = "ipums-bigdata")

look for option 3

system · December 15, 2020, 5:07pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.