Hi,
I have a folder of CSV files; each file has around 3K rows/observations and 600 variables.
I've seen various topics on the subject and came up with my own "solution": create an auxiliary function that reads a CSV and adds the file name as a column, so I can later extract the year, which is part of each file name.
library(tidyverse)
library(fs)
### List CSV files in directory
csv_files <- fs::dir_ls("ACS_DP02_data")
# Aux function to read a csv and append its file name as a column:
read_plus <- function(flnm) {
  read_csv(flnm) %>%
    mutate(filename = flnm)
}
# Create df from csv files
my_df1 <- csv_files %>% map_df(read_plus)
Initially it seemed to work, as seen below:
It looked right at first, but then I noticed the row name on observation 1. I looked at the tail of the data frame (theoretically the last CSV file in the folder), compared it against that CSV, and they did not match. I also noticed that Id2 came up 9 times, indicating that the tables had been appended one after another into a single data frame, but the header/label rows had been pulled in as observations as well.
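One way to check this (a rough sketch using the my_df1 from above; "Id2" is just the header label I spotted, and filter_all() is one way to scan every column for it):
# Count the rows where any column holds the literal header label "Id2" --
# these should be the embedded header rows from each appended file.
my_df1 %>%
  filter_all(any_vars(. == "Id2")) %>%
  nrow()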
So I played around with loading single files and noticed that I should skip the first row; with that, the data was imported correctly for individual files. But after adding skip = 1 to read_csv(), the 6th file produced the error below, basically refusing to combine char with dbl columns.
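To see where the type mismatch comes from, something like this can compare the column types that read_csv() guesses for two of the files (a sketch; the file names are placeholders for two of my actual files):
# Read two files individually (names are placeholders) and list the
# columns whose guessed type differs between them -- these are the
# chr vs dbl conflicts that make map_df() fail when binding rows.
a <- read_csv("ACS_DP02_data/file_a.csv", skip = 1)
b <- read_csv("ACS_DP02_data/file_b.csv", skip = 1)
types_a <- map_chr(a, ~ class(.x)[1])
types_b <- map_chr(b, ~ class(.x)[1])
shared  <- intersect(names(a), names(b))
shared[types_a[shared] != types_b[shared]]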
Loading a single file using read_csv() with skip = 1 would load the data correctly, as seen below:
I therefore concluded I had to check whether the variables were character and convert them to numeric, which led me to the code below:
library(tidyverse)
library(fs)
### List CSV files in directory
csv_files <- fs::dir_ls("ACS_DP02_data")
# Aux function to read a csv, coerce character columns to numeric,
# and append the file name as a column:
read_plus <- function(flnm) {
  read_csv(flnm) %>%
    mutate_if(is.character, as.numeric) %>%
    mutate(filename = flnm)
}
# Create df from csv files
my_df1 <- csv_files %>% map_df(read_plus)
It seemed to have worked, but when I checked the data, everything that had been character was now NA, like below:
Could anyone shed some light on what's going on? I suspect my mutate_if() is not doing what I think it is... I've been troubleshooting this for quite some time now and I'm frustrated.
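A tiny, self-contained example of what I suspect is happening (not my real data, just a made-up column):
# as.numeric() on a string that isn't a number silently returns NA
# (with a "NAs introduced by coercion" warning), so this coercion
# wipes out every genuinely textual column.
tibble(x = c("some label", "1234")) %>%
  mutate_if(is.character, as.numeric)
# x becomes: NA, 1234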
Thank you in advance for your time and patience.