Large csv Census files in tidyverse

jgendrinal · October 6, 2017, 7:55am

I'm trying to put csv files and bind them together to create a large dataset of census data per year. Unfortunately, there comes a point where R can no longer allocate the data (up to 124Mb if I'm not mistaken). So it decides to just stop reading and binding files up to 2010. I've tried using gc() over and over, but it still does not seem to work

R version: Microsoft R Open 3.4.1
Rstudio version: Version 1.0.153
Windows version: Windows 10 x64 bit
CPU: Intel i3-4005U CPU 1.70GHz
RAM: 4GB

Here is the code.

suppressPackageStartupMessages({
  library(dplyr)
  library(tidyr)
  library(purrr)
  library(readr)
  library(stringr)
  library(ffbase)})
# Import and filter only household head cases ----
#Standardize column specifications for each 
col_spec90 <- cols(PROVINCE = col_character(), 
                   P2_OVERSEAS_CONT = col_character(), 
                   P3_RELATIONSHIP = col_character(), 
                   P5_AGE = col_integer(), 
                   P6_SEX = col_character(), 
                   P7_MARITAL_STAT = col_character(), 
                   P8_RELIGION = col_character(), 
                   P12_DISABILITY = col_character(), 
                   P17_RES_5YRS_AGO = col_character(), 
                   P22_EDUCATION = col_character())
frame90 <- tibble()
for (i in list.files("data/CPH1990")){
file <- read_csv(str_c("data/CPH1990/", i), col_types = col_spec90) %>% 
    filter(P3_RELATIONSHIP == "01")
  frame90 <- bind_rows(frame90, file)
  rm(file)
  gc()}
rm(col_spec90, i)
gc()
col_spec00 <- cols(PROVINCE = col_character(),
                   P2H_RELHEAD = col_character(),
                   P6_AGE_LAST = col_integer(),
                   P7_SEX = col_character(),
                   P8H_OVERSEAS = col_character(),
                   P9_MSTATUS = col_character(),
                   P10_RELIGION = col_character(),
                   P13_DISABILITY = col_character(),
                   P22_EDUCATION = col_character(),
                   P23H_RES_5YR_AGO = col_character())
frame00 <- tibble()
for (i in list.files("data/CPH2000")){
  file <- read_csv(str_c("data/CPH2000/", i), col_types = col_spec00) %>% 
    filter(P2H_RELHEAD == "01")
  frame00 <- bind_rows(frame00, file)
  rm(file)
  gc()}
rm(col_spec00, i)
gc()
col_spec10 <- cols(PRV = col_character(), 
                   P2 = col_character(), 
                   P3 = col_character(), 
                   P5 = col_integer(), 
                   P7 = col_character(), 
                   P8 = col_character(), 
                   P12 = col_character(), 
                   P14 = col_character(), 
                   P16R = col_character(), 
                   P19 = col_character())
frame10 <- tibble()
for (i in list.files("data/CPH2010")){
  file <- read_csv(str_c("data/CPH2010/", i), col_types = col_spec10) %>% 
    filter(P2 == "01")
  frame10 <- bind_rows(frame10, file)
  rm(file)
  gc()}
rm(col_spec10, i)
gc()
save.image("out/frame.rda")
gc()
write_csv(frame90, "out/frame90.csv")
rm(frame90)
gc()
write_csv(frame00, "out/frame00.csv")
rm(frame00)
gc()
write_csv(frame10, "out/frame10.csv")
gc()

dlsweet · October 6, 2017, 12:10pm

Is the error you're receiving the "Error: OutOfMemoryError (Java): GC overhead limit exceeded" error? The only time I've ever encountered this issues was when I was trying to load ~100 years of NHL scores for an ELO Based Model that I've been working on.

In order to fix it, I used these commands:

memory.limit(size=70000)
options(java.parameters = "-Xmx6g")

I'm not 100% sure what exactly they do, so if someone else could explain I'd be very grateful. However, it did allow my project to load all of the requisite files.

Hope this helps, good luck!

nick · October 6, 2017, 2:29pm

This is likely your problem. How large are the CSV files you're trying to load, combined? You may not be able to fit all of them into active memory, in which case you would need to process them in smaller chunks, find an on-disk solution, or use R on a computer with more memory available.

While everyone has different constraints, I don't think 4 GB is enough memory to use R for data sets of any real size (with apologies to those that did real work in R in the pre-3.0 days).

Also, gc() does not need to be run manually unless you need R to return memory to the OS, as mentioned in Hadley's Advanced R book:
http://adv-r.had.co.nz/memory.html
Otherwise, R will automatically free up space when more is needed.

martin.R · October 6, 2017, 2:35pm

I had that precise problem when I first started with R. I was on a Win 7 32 bit PC, so was restricted to ca. usable 3.5GB.

The longer term solution was to get a new PC, but that took months in a large company with a restricted portfolio of hardware.

The shorter term solution was to use data.table for its memory efficiency. In particular rbindlist might help.

ConnorKirk · October 6, 2017, 2:38pm

I'm not able to completely answer your question, but I can point you to some help that might aid you understand your problem more. Hopefully it's enough to get you started (or until someone more wise comes along).

Here's a link to the R documentation on Memory.size. The max memory limit varies according to the type of OS and R used (i.e. 32bit v 64 bit). Either way it should be more than 124mb. You can check your memory limit with memory.size(max = TRUE)

Advanced-R contains a good chapter on memory in R. It may be able to give you more information for find the problem.

Good luck!

tpall · October 6, 2017, 3:33pm

I am not sure if this solves your memory issue, but since you are already loading "purrr" you may want to import these csv files into data_frame (please see mockup code below) and then merge tables with bind_rows().
Your current for loop approach seems too complicated.

library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Warning: package 'dplyr' was built under R version 3.4.2
#> Conflicts with tidy packages ----------------------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats
dirs <- c("data/CPH1990", "data/CPH2000", "data/CPH2010")
## Create data_frame with filenames
files <- data_frame(dir = dirs) %>% 
  mutate(files = map(dir, list.files, full.names = TRUE)) %>% 
  unnest
## Import files with read_csv or data.table::fread()
dataset <- files %>% 
  mutate(table = map(files, read_csv))

cderv · October 7, 2017, 6:17am

Could you use a database instead of writing your binded files in csv at the end ?
I think you could use SQLlite to create a database that you fill up which each csv you read.
You will have at the end a table in the database that you can request easily with dplyr after.

See this post on Memory if not already read by you, there are some good advices

thoughtfulnz · October 7, 2017, 7:03am

If it is stalling at 124 due to memory limits (and it is going to hinge on how much free RAM there is) my question would be do you need to read in all the data, or can you pick only some columns, or rows, or otherwise summarise the data when reading it in. Because if you actually only need a fraction of the data to address a particular question, then work with that fraction to achieve your goal.

selective_read <- function(x){
y <- read.csv(x, extra, settings)
z <- y %>% do %>% stuff %>% to %>%
  process %>% data
return(z)
}
dt <- bind_rows(lapply(vector_of_file_paths, selective_read))

jgendrinal · October 7, 2017, 10:37am

I'll be running a model on them combined. They are census data from different provinces and I want to get them together to have a national frame.

thoughtfulnz · October 7, 2017, 10:50am

Just looking at the order you are doing things, your are reading the three directories in, holding them all in memory and writing the three out. It should, in a blunt way, require 1/3 of the memory if you tackle the files one at a time and dispose of the memory before tackling the next.

Also keep in mind that write.csv has an append option, so you can add to an aggregate file without holding it all in memory.

edgararuiz · October 18, 2017, 3:49pm

Another option would be to use sparklyr locally in your computer to "map" the files, but not bring all of its contents into memory, just the fields or segments you need to model at that time. That will also allow you to run SQL commands over the combined data. Since all of the files have the same layout, and they are in the same folder, you wouldn't even need to do a bind, because the folder itself will be interpreted as a large table by Spark. Here's an example of how to do that: https://github.com/rstudio/webinars/blob/master/42-Introduction%20to%20sparklyr/sparklyr-webinar1.Rmd

pgensler · October 18, 2017, 6:18pm

Can you please provide a dput of the file: "data/CPH1990","data/CPH2000",etc.. or even provide the first few lines of the files you are trying to read in to R, based on the code snippets above?If you have the link to the raw file where it is hosted?(that would be even better). It's very hard to recreate the issue you are experiencing without having any file or object even to use a starting point.

limacina · October 18, 2017, 6:52pm

From what other users have mentioned, it seems like you are trying to load more data than your computer's memory can handle. Other than working with relational databases e.g. SQLite, one other option (if you want to stay within the tidy verse) is to load your data and scripts onto a cloud-based solution. I've run R Studio in Amazon Web Services in the past following these steps. Again, there are probably more clever ways to do this, but cloud computing can be a nice way to deal with the "you need a bigger computer" issue than actually buying a large computer for one specific project.