Hello! I've been working with IPUMS ACS microdata for just my state and while the files are large, I have no issues processing them with my R scripts that utilize dplyr (mutate, rename, filter, etc.) However, I have found I need to do an additional analysis using the national ACS data and the resulting dataframe once imported into Rstudio is 1.5 million records and 65 variables. Obviously, I'm receiving "Error: cannot allocate vector of size XX.X Mb" when I try to run my script. I've included the relevant portion of my script below along with the library call statement (which includes some libs not used in this portion of script). What would be the best way to alter this script to get around the RAM limitations? I've tried closing all other programs and letting this run for hours and it doesn't seem to make any progress and eventually just times out. Thank you for any advice you can give! I don't really want to sample the data as ACS microdata is already a sample of the larger survey data and I cannot get more RAM and I do not have access to AWS/cloud storage.
suppressPackageStartupMessages({
library(ipumsr) # import ACS microdata from IPUMS.org
library(acs) # useful functions for working w/ ACS data
library(tidyverse) # many useful syntax and QOL functions
library(labelled) # labels variables/values
library(janitor) # cleans up column names
library(stringr) # for working with strings
library(knitr) # for RMarkdown to HTML
library(rmarkdown) # to generate Rmd final document
library(flexdashboard) # to generate Rmd dashboard
library(sf) # simple features for mapping
library(leaflet) # creates interactive maps
library(testit) # to run assertions to verify data
library(tigris) # import TIGRIS shapefile data
library(htmlwidgets) # export leaflet maps to HTML
})
# generate grouping variables, filters, geo-joining variable
acs_df_sa <- acs_df_sa %>%
# remove institutional inmates, 6090 records
# new total: 290829 records
filter(relate != 13) %>%
# create age grouping
mutate(agegroup = case_when(age >= 0 & age < 19 ~ "0 - 18",
age >= 19 & age < 35 ~ "19 - 34",
age >= 35 & age < 65 ~ "35 - 64",
age >= 65 ~ "65 +")) %>%
# create 2nd age grouping
mutate(agegrp2 = case_when(age >= 0 & age < 19 ~ "0 - 18",
age >= 19 & age < 35 ~ "19 - 34",
age >= 35 & age < 45 ~ "35 - 44",
age >= 45 & age < 55 ~ "45 - 54",
age >= 55 & age < 65 ~ "55 - 64",
age >= 65 ~ "65+")) %>%
# FPL groupings for ease of mapping
mutate(pov_group = case_when(poverty >= 0 & poverty < 133 ~ "Less than 133% of FPL",
poverty >= 133 & poverty < 138 ~ "133-138% of FPL",
poverty >= 139 & poverty <= 150 ~ "139-150% of FPL",
poverty >= 151 & poverty <= 200 ~ "151-200% of FPL",
poverty >= 201 & poverty <= 250 ~ "201-250% of FPL",
poverty >= 251 & poverty <= 299 ~ "251-299% of FPL",
poverty >= 300 & poverty <= 400 ~ "300-400% of FPL",
poverty > 400 ~ "More than 400% FPL")) %>%
# create race/ethnicity mappings
mutate(race = case_when(race==1 ~ "White",
race==2 ~ "Black",
race==3 ~ "Native",
race==4 ~ "Chinese",
race==5 ~ "Japanese",
race==6 ~ "Asian",
race==7 ~ "Other",
race==8 ~ "Biracial",
race==9 ~ "Multiracial")) %>%
mutate(race = ifelse(hispan %in% c(1:4),"Hispanic",race)) %>%
# create a GEOID variable to join to shape file for PUMA 2010
mutate(puma = as.numeric(puma)) %>%
mutate(puma = sprintf("%05d", puma)) %>%
mutate(GEOID10 = paste(statefip, puma, sep = "")) %>%
# rename citizen parent/spouse variables for readability
rename(citmom = citizen_mom) %>%
rename(citmom2 = citizen_mom2) %>%
rename(citpop = citizen_pop) %>%
rename(citpop2 = citizen_pop2) %>%
rename(citsp = citizen_sp)