Dynamically specify column types with readr::read_csv

RobLBaker · September 25, 2024, 2:07am

I'd like to use the col_types parameter in read_csv to dynamically specify column types based on known column types (say from a separate metadata file).

Why doesn't the approach outlined below work and is there any easier/better way to do it?

I've been using the following approach:

1. Get a list of the column types from a .csv using spec_csv().
2. Test whether the column types/formatting match the external source (metadata).
3. When mismatches occur, update the spec_csv() output to match the external source (e.g. metadata).
4. Use the updated spec_csv() output as the col_types argument when calling read_csv().

This seems to work well for dates and numerics, but not for factors.

For example:

library(readr)

# generate testing .csv file:
readr::write_csv(iris, "iris.csv")

# get col specs from csv
spec <- spec_csv("iris.csv")

# edit/update the spec_csv output:
# (for this example it's easy to do it by hand, but imagine there are hundreds or thousands of columns that need to be specified)
class(spec$cols$Species ) <- "col_factor"
spec$cols$Species$ordered <- FALSE
spec$cols$Species$include_na <- FALSE
factors <- c("virginica", "setosa", "versicolor")
spec$cols$Species$levels <- factors 

# use the updated spec_csv output to specify column types:
# this does not generate any errors, but also does not change the column type to factor:
test <- readr::read_csv("iris.csv", col_types = spec)
is.factor(test$Species) # FALSE

# this generates an error:
test <- readr::read_csv("iris.csv", col_types = list(spec))
# Error: Some `col_types` are not S3 collector objects: 1

# generates the same error as above... at this point, it's just trial and error on my end, which is why I'm posting here:
test <- readr::read_csv("iris.csv", col_types = cols(spec))

# I also tried converting spec into a col_spec. As I suspected, it also failed:
spec2 <- as.col_spec(spec)
test <- readr::read_csv("iris.csv", col_types = spec2)

What does this error mean? How do I resolve it? Is there a better way to dynamically specify column types when using read_csv?

craig.parylo · September 25, 2024, 9:55am

Hi @RobLBaker,

Here I update the specification for the Species variable using a call to readr::col_factor(), which seems to work.

I don't understand why your approach doesn't work, though, as the spec objects appear similar.

library(tidyverse)
library(here)
# set data and guess specification
readr::write_csv(iris, here('data', 'iris.csv'))
spec <- readr::spec_csv(here('data', 'iris.csv'))

# update specification for 'Species' field following review of metadata
spec$cols$Species <- col_factor(c("setosa", "versicolor", "virginica"), include_na = FALSE, ordered = FALSE)

# read data using specification
test <- readr::read_csv(here('data', "iris.csv"), col_types = spec)

# test
is.factor(test$Species) # TRUE
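
One likely reason the hand-edited spec fails, offered as a sketch rather than a confirmed diagnosis: the objects returned by readr's col_*() functions carry two S3 classes, and assigning class(...) <- "col_factor" replaces both with a single name that readr does not define. The edited element then no longer inherits from "collector", so read_csv() appears not to treat it as a factor specification. The same check would explain the "not S3 collector objects" error for list(spec), since that list's first element is a whole col_spec rather than a collector. The names good and bad below are only for illustration.

```r
library(readr)

# A collector built by readr carries two classes.
good <- col_factor(levels = c("setosa", "versicolor", "virginica"))
class(good)
# "collector_factor" "collector"

# Overwriting the class by hand, as in the question, leaves one class
# that readr does not define.
bad <- col_character()
class(bad) <- "col_factor"

inherits(good, "collector") # TRUE
inherits(bad, "collector")  # FALSE
```

Replacing the whole element with the result of col_factor(), as above, avoids having to know these class names at all.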

RobLBaker · September 25, 2024, 1:06pm

Thanks @craig.parylo! That does indeed work and looks adaptable to my use-case.

On a slightly broader scale, is this generally how people approach the problem of dynamically specifying column types, or is there a better way to do it?

craig.parylo · September 25, 2024, 2:22pm

You're welcome, @RobLBaker.

The data I tend to work with is ad hoc and text-based, so my preference is to read every column as character during the load and correct the types afterward. For example, everything in a csv file is read as text, even a field that represents a date, and I then use a mutate to cast each column to the correct type.
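
A minimal sketch of that read-as-text workflow, using the iris file from earlier in the thread. The object names raw and clean are just for illustration, and the choice of parse_double() and parse_factor() is one option among several (as.numeric() or factor() would also do).

```r
library(readr)
library(dplyr)

write_csv(iris, "iris.csv")

# Read every column as character so nothing is guessed at load time.
raw <- read_csv("iris.csv", col_types = cols(.default = col_character()))

# Cast each column to its intended type afterwards.
clean <- raw |>
  mutate(
    across(c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width), parse_double),
    Species = parse_factor(Species, levels = c("setosa", "versicolor", "virginica"))
  )
```

The trade-off is that type problems surface in the mutate step rather than at read time, which can make them easier to inspect column by column.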

I can't fault the approach you're using, however. If you know up front what the data types should be, then the column specification is the right tool to ensure consistent results.
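
For many columns, the spec can be updated in a loop instead of by hand. The sketch below assumes the metadata is available as a data frame with one row per column to override. The metadata layout, the factor_levels list and the make_collector() helper are all hypothetical and would need adapting to the real metadata file.

```r
library(readr)

write_csv(iris, "iris.csv")

# Hypothetical metadata: one row per column whose type should be overridden.
metadata <- data.frame(
  column = c("Sepal.Length", "Species"),
  type = c("double", "factor"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup of factor levels, keyed by column name.
factor_levels <- list(Species = c("setosa", "versicolor", "virginica"))

# Map a type name from the metadata to a readr collector.
make_collector <- function(column, type) {
  switch(type,
    double = col_double(),
    integer = col_integer(),
    character = col_character(),
    date = col_date(),
    factor = col_factor(levels = factor_levels[[column]]),
    stop("Unsupported type: ", type)
  )
}

# Start from the guessed spec, then replace whole collectors by name.
spec <- spec_csv("iris.csv")
for (i in seq_len(nrow(metadata))) {
  spec$cols[[metadata$column[i]]] <-
    make_collector(metadata$column[i], metadata$type[i])
}

test <- read_csv("iris.csv", col_types = spec)
is.factor(test$Species)
```

Columns not listed in the metadata keep whatever spec_csv() guessed for them.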
