Vroom with multiple files, different number of columns

dobrowski · February 13, 2020, 12:11am

I'm trying to import multiple files with vroom rather than read_delim in a for loop. The files have nearly, but not exactly, identical structures: one of them has one fewer column than the others. Is there a way to still import all the files and just have NAs for the missing column's data?

Here are the files I want to import: https://www.cde.ca.gov/ds/sd/sd/filessd.asp

library(vroom)

setwd("data")
files <- fs::dir_ls(glob = "susp*txt")

susp_vroom <- vroom(files)

Error: Files must all have 22 columns:

File 7 has 21 columns

joels · February 13, 2020, 12:31am

The error occurs because vroom appears to assume, by default, that every file in a vector of files has the same number of columns. To avoid it, try map or map_df from purrr, which read each file in separately. The self-contained example below saves three versions of the built-in mtcars data frame, each with a different number of columns, reproduces the error you're getting, and then shows the map and map_df approaches.

library(tidyverse)
library(vroom)

# Write 3 files with varying number of columns
write_csv(mtcars[,-9], "f1.csv")
write_csv(mtcars[,-c(7:8)], "f2.csv")
write_csv(mtcars, "f3.csv")

# Get vector of file names
f = fs::dir_ls(glob="f*csv")
f
#> f1.csv f2.csv f3.csv

d1 = vroom(f)
#> Error: Files must all have 10 columns:
#> * File 2 has 9 columns

# Read each file separately into a list of 3 data frames
d2 = map(f, ~vroom(.x))
# [output messages deleted for brevity]

# Read each file and stack into a single data frame
d3 = map_df(f, ~vroom(.x))
# [output messages deleted for brevity]

Created on 2020-02-12 by the reprex package (v0.3.0)
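To confirm that stacking really does leave NAs where a file lacks a column, and to keep track of which file each row came from, here is a minimal sketch. It swaps map_df for map() followed by dplyr::bind_rows() with an id column. The temporary directory, the file names, and the `source_file` column name are assumptions made up for illustration, not part of the original files.

```r
library(vroom)
library(purrr)
library(dplyr)

# Hypothetical demo files in a temporary directory
dir <- tempfile("vroom_demo_")
dir.create(dir)
write.csv(mtcars[, -9], file.path(dir, "f1.csv"), row.names = FALSE)        # no "am"
write.csv(mtcars[, -c(7:8)], file.path(dir, "f2.csv"), row.names = FALSE)   # no "qsec", "vs"
write.csv(mtcars, file.path(dir, "f3.csv"), row.names = FALSE)              # all columns

# Name the vector so that bind_rows() can label rows by file
files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
files <- set_names(files, basename(files))

# Read each file separately, then stack; absent columns become NA
tables <- map(files, ~ suppressMessages(vroom(.x, delim = ",")))
stacked <- bind_rows(tables, .id = "source_file")

# Count of NAs per column, to see which columns were filled in
colSums(is.na(stacked))
```

The id column makes it easy to check afterwards which file was missing which column, for example with `stacked %>% group_by(source_file) %>% summarise(across(everything(), ~ sum(is.na(.x))))`.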

mara · February 13, 2020, 3:00pm

Yes, this is because vroom binds the files into one data frame, so it needs them all to have the same columns.
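If it helps to find out which columns differ before binding, one possible check is to read only the header of each file and compare the names. This is a rough sketch using base R; the demo files and the `missing_cols` name are hypothetical, and the real files may need a different delimiter.

```r
# Hypothetical demo files: one lacks the "am" column
dir <- tempfile("header_demo_")
dir.create(dir)
write.csv(mtcars[, -9], file.path(dir, "f1.csv"), row.names = FALSE)
write.csv(mtcars, file.path(dir, "f2.csv"), row.names = FALSE)

files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)

# Read a single row from each file just to get its column names
headers <- lapply(files, function(p) names(read.csv(p, nrows = 1)))

# Every column name seen in any file
all_cols <- Reduce(union, headers)

# For each file, the columns it lacks relative to the full set
missing_cols <- lapply(headers, function(h) setdiff(all_cols, h))
names(missing_cols) <- basename(files)
missing_cols
```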

dobrowski · February 13, 2020, 4:11pm

Thanks, that works well. From the vignette I thought vroom handled this natively, but your solution using purrr is simple and elegant.

joels · February 15, 2020, 8:13am

dplyr::bind_rows just stacks all the input data frames, regardless of whether their column names match. The result is a stacked data frame with as many columns as there are unique column names in the inputs, with NA wherever an input lacked a column. I didn't realize vroom worked differently (or even that vroom automatically reads in and stacks a vector of files) until I worked on this question.
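A tiny illustration of that bind_rows behavior, with two made-up tibbles (`a` and `b` are hypothetical names) that share only one column:

```r
library(dplyr)

a <- tibble(x = 1:2, y = c("a", "b"))   # has x and y
b <- tibble(x = 3L, z = TRUE)           # has x and z

# Columns are matched by name; the result has x, y and z
stacked_ab <- bind_rows(a, b)
stacked_ab
```

Here `y` is NA for the row that came from `b`, and `z` is NA for the rows that came from `a`.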
