read_csv not respecting col_select, can't read problems()

arrows · September 9, 2021, 1:18am

Hi! I'm pretty new to R. Enjoying it immensely.

I'm reading in all CSV files from a directory using map_dfr to apply read_csv over a list of filenames. The CSV files have a varying number of columns. I only want to import columns 1:7 and discard columns 8 onwards where they exist. All files have a column 8, and some files have text that is parsed as columns 9, 10, etc. when I look at the files in Excel. I don't care about any of these columns. Notably, these extra columns don't have headers.

My code is:

df_csv <- map_dfr(
  csvpaths, 
  read_csv, 
  skip_empty_rows = TRUE,
  col_names = TRUE,
  col_select = 1:7,
  col_types = cols_only(
    Project = col_character(),
    Date = col_character(),
    Employee = col_character(),
    Role = col_character(),
    Rate = col_double(),
    Hours = col_double(),
    Amount = col_double()
  ),
  .id = "Source"
)

What's happening:

The files are being read in correctly, in that the final df includes the correct information AFAIK.
In my RStudio console, read_csv is now outputting what looks like a time: 0s. Why? I think this started showing up after my last package update.
In my latest test of 15 files, I received 8 warnings (the 0s time is printed after the start of the warning message; I'm unsure why):

Warning messages:                                0s
1: One or more parsing issues, see `problems()` for details 
2: One or more parsing issues, see `problems()` for details 
3: One or more parsing issues, see `problems()` for details 
4: One or more parsing issues, see `problems()` for details 
5: One or more parsing issues, see `problems()` for details 
6: One or more parsing issues, see `problems()` for details 
7: One or more parsing issues, see `problems()` for details 
8: One or more parsing issues, see `problems()` for details

I'm pretty sure that the warnings relate to instances where a CSV file has more than 8 columns. But!
I can't get any output from problems() so I can't tell what is happening.

> problems()
>

It doesn't seem to matter if I restrict the map_dfr call to just a single filename; I still can't view any output from problems.

read_csv IS respecting the argument to select only columns 1:7 in the read, but it ISN'T stopping errors from being created from files which have more than 7 columns, which I thought was the purpose of using col_select in the first place.

How can I get these warnings sorted? I previously wrapped this in suppressWarnings but realised it was masking some other real parsing errors I needed to fix, which I've now done.

Hayward · September 9, 2021, 6:20am

For the parsing issues, it kind of sounds like read_csv is encountering some items that don't fit the column type.

If this type casting is the issue, you can set everything to 'col_character()' to see if you still get the warnings. Alternatively, you could not define the columns, but rather set the argument "guess_max = " to something quite large. Then, if R finds a letter in a number column, it'll just assume the whole column should be a character column instead.
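
A minimal sketch of those two options, assuming a small made-up file (the temp path and its contents are illustrative only, not taken from the original files):

```r
library(readr)

# Hypothetical sample file, written to a temp path for illustration only.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "Project,Hours",
  "A,1.5",
  "B,n/a"
), path)

# Option 1: read every column as character, so no value can fail a type cast.
as_text <- read_csv(path, col_types = cols(.default = col_character()))

# Option 2: give no col_types and let readr guess,
# allowing it to look at up to 10000 rows before deciding on each type.
guessed <- read_csv(path, guess_max = 10000)
```

If the warnings disappear under option 1, a type mismatch is the likely cause; if they persist, the issue is probably somewhere else, such as the row structure.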

I find it helpful to read in each CSV to a list before binding them, so I can inspect each. That would look something like:

library(readr)
library(purrr)

# Read each file into its own tibble so each one can be inspected before binding
df_csv <- csvpaths %>% map(
  ~ read_csv(., 
             skip_empty_rows = TRUE,
             col_names = TRUE,
             col_select = 1:7,
             col_types = cols_only(
               Project = col_character(),
               Date = col_character(),
               Employee = col_character(),
               Role = col_character(),
               Rate = col_double(),
               Hours = col_double(),
               Amount = col_double()
               )))

Then you can check the column types of each one like:

df_csv %>% map(str)

If they look right, then you can bind them:

df_csv <- df_csv %>% bind_rows(.id = "Source")
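
For reference, here is a self-contained sketch of the whole read, inspect, bind workflow. The two temp files and the reduced column set are made up for illustration, and naming the list by file name is my own addition so that the Source column shows file names rather than positions:

```r
library(readr)
library(purrr)
library(dplyr)

# Hypothetical sample data in a temp directory.
dir <- tempfile()
dir.create(dir)
writeLines(c("Project,Hours", "A,1.5"), file.path(dir, "one.csv"))
writeLines(c("Project,Hours", "B,2"), file.path(dir, "two.csv"))
csvpaths <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)

# Name the paths so the names carry through map() and into bind_rows(.id = ).
names(csvpaths) <- basename(csvpaths)

# Step 1: read each file into its own tibble.
tables <- map(
  csvpaths,
  ~ read_csv(.x, col_types = cols(
    Project = col_character(),
    Hours = col_double()
  ))
)

# Step 2: inspect the column classes of every tibble before binding.
col_classes <- map(tables, function(tbl) {
  vapply(tbl, function(col) class(col)[1], character(1))
})

# Step 3: bind once the types look right.
combined <- bind_rows(tables, .id = "Source")
```

Keeping the unbound list around (tables here) also leaves each file's tibble available for later inspection.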

arrows · September 13, 2021, 11:54am

Thanks, @Hayward, for responding and for your suggestions!

I've tried the inspection tactic you suggested. All of the files have columns being parsed correctly, in that they each have 7 columns read, and those columns have the correct types. Casting all to 'col_character()' didn't make the parsing warnings go away.

I've now mucked around with many other ways to do this:
My directory has 232 .csv files with sizes ranging from 4-211 rows.

Using a for loop, I've confirmed that each individual file parses correctly with the column type mapping and selection from my original post. They all do.
I've tried structuring the map slightly differently, piping the list of filenames to map_dfr(~read_csv(.,[args])), and I still get the parsing errors and the note about warnings(), which is empty.
I've tried using sapply to apply the read_csv function to the list of filenames, a version of what you've suggested here that is simpler for me to understand. I've used a very simple syntax here:
sapply(csvpaths, read_csv, col_names = TRUE, col_types = cols(.default = col_character()))
This works, with no parsing errors. I note that the output is a list of tibbles with varying numbers of rows, and 8 columns. ONE FILE has 7 columns! All have the right column names and are cast to character types, so they should bind correctly, but...
taking the output of this sapply and piping it to bind_rows throws the same warning about parsing errors, and the warnings reference problems(), which is still empty.

I'm going crazy over this, since I want to receive any parse errors that are meaningful, but getting more than 50 of them every time, and getting the "correct" output in the end, is maddening.

Hayward · September 14, 2021, 12:37am

I hadn't encountered problems() before, so I went and took a look at the readr documentation.

Looks like 'read_csv' calls parsing functions of the form 'parse_*()'. If those parsing functions encounter an issue, they populate an attribute that you can access with problems(). However, I don't know where the problems attribute gets stored as you loop through with map.

It may help to try the function 'stop_for_problems()'. I think it would be called as follows below, but I didn't test it to make sure.

df_csv <- csvpaths %>% map(
    ~ stop_for_problems(read_csv(., 
               skip_empty_rows = TRUE,
               col_names = TRUE,
               col_select = 1:7,
               col_types = cols_only(
                   Project = col_character(),
                   Date = col_character(),
                   Employee = col_character(),
                   Role = col_character(),
                   Rate = col_double(),
                   Hours = col_double(),
                   Amount = col_double()
               ))))

It's hard to troubleshoot without a dataset, so I can't find what works and give you an exact answer. However, I'm guessing that you may be able to access a problems attribute with 'problems()' if you pass in the dataframe where the mapping fails. Say it fails on the 10th csv path. Then maybe this would report out problems()?

problems(df_csv[[10]])

If that doesn't work and you can't tell which dataframe has parsing failures, one last thought might be to try reporting on everything from your original read-in-dataframes (meaning prior to attempting to incorporate the function stop_for_problems()):

df_csv %>% map(problems)
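
A rough sketch of that last idea, using two made-up temp files (one clean, one with a non-numeric value in a numeric column). The file names and the files_with_problems helper variable are hypothetical, and I haven't run this against the original data:

```r
library(readr)
library(purrr)

# Hypothetical sample data: one clean file, one with a value that is not a number.
dir <- tempfile()
dir.create(dir)
writeLines(c("Project,Hours", "A,1.5"), file.path(dir, "good.csv"))
writeLines(c("Project,Hours", "B,abc"), file.path(dir, "bad.csv"))
csvpaths <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
names(csvpaths) <- basename(csvpaths)

# Keep the per-file tibbles; do not bind them yet.
# The bad file is expected to trigger the usual parsing warning here.
tables <- map(
  csvpaths,
  ~ read_csv(.x, col_types = cols(
    Project = col_character(),
    Hours = col_double()
  ))
)

# Ask each tibble for its own parsing problems,
# then keep only the names of files that have at least one.
probs <- map(tables, problems)
files_with_problems <- names(keep(probs, ~ nrow(.x) > 0))
```

My understanding is that the problems are attached to each tibble that read_csv returns, which would explain why a bare problems() call after map_dfr or bind_rows comes back empty. Treat that as an assumption to verify rather than a confirmed cause.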

(Side note: the '~' in my code above just means I'm making a simple function but am too lazy to create a name for it and write out 'function(){ }'. Also, the '.' means 'put here whatever data gets piped in from map'. In this case it's a path.)
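
To make that side note concrete, here is a small sketch showing that the '~' shorthand and a written-out function give the same result. The paths vector is made up, and toupper just stands in for read_csv so the example runs without any files:

```r
library(purrr)

# Hypothetical inputs; toupper() stands in for read_csv().
paths <- c("a.csv", "b.csv")

# Shorthand: '~' builds an anonymous function and '.' is its argument.
with_formula <- map(paths, ~ toupper(.))

# The same thing written out in full.
with_function <- map(paths, function(path) toupper(path))

same_result <- identical(with_formula, with_function)
```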
