Using col_types with readr: how to manage inconsistent column content

imkidd57 · May 15, 2019, 1:04pm

Dear R experts
I am reading in a .csv file with readr, where the data is mostly dates and times in separate columns. The 'date' columns convert fine via col_types and col_date("%d/%m/%Y"), but the values in the 'time' columns are not consistently 4 digits, so col_time("%H%M") does not work properly: some entries have no leading zero (e.g. "931" representing 09:31).

I know that either the stringr function str_pad() or the base R function sprintf() can be used to pad the time column digits out to four and that, from that point, col_time("%H%M") will correctly convert the values to times. However, I'm struggling to put these two things together in the readr process.

I have tried:

- nesting both functions within the readr process
- a pipe to direct the padded 4-digit output of str_pad() to col_time()
- padding the column with str_pad() before defining all the columns with col_types

...with no overall success.

I would be very grateful for any insight or suggestions on how to achieve this.
Many thanks

mara · May 15, 2019, 1:17pm

Is there a reason to do it all in the initial import? You can use the readr parsers to manipulate colspec even after doing the import.

imkidd57 · May 15, 2019, 1:36pm

Hi Mara - many thanks for your quick reply

I think I was trying to be efficient, but I guess you are right since my current 'efficiency score' is zero!

To be honest I didn't think of using readr elements again after importing. Is it possible to use col_types() outside of a readr process?

mara · May 15, 2019, 1:53pm

In effect, yes, but the specifics are a little different (though the underlying functions are the same) when you're not actively reading it in (you'll use parse_time(), for example, instead of col_time()). From the readr docs re Column parsers:

Column parsers define how a single column is parsed, or parse a single vector. Each parser comes in two forms: parse_xxx() which is used to parse vectors that already exist in R and col_xxx() which is used to parse vectors as they are loaded by a read_xxx() function.
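
As a rough illustration of the parse_xxx() form applied to this problem, here is a minimal sketch that pads the time strings and then parses them as a plain vector. The sample values in raw_times are made up for the example, and it assumes the times are held as character strings.

```r
library(readr)
library(stringr)

# Hypothetical time strings, some missing their leading zero
raw_times <- c("931", "1245", "5", "2359")

# Pad on the left with zeros so every entry is four characters wide
padded <- str_pad(raw_times, width = 4, side = "left", pad = "0")

# parse_time() is the vector counterpart of col_time()
times <- parse_time(padded, format = "%H%M")
```

Padding first matters because "%H" reads two digits for the hour, so an unpadded "931" would be misread rather than treated as 09:31.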

imkidd57 · May 15, 2019, 2:48pm

Ah - I also hadn't appreciated that parse_xxx() will work on columns.
Managed to get it working in a two-stage process after the readr import, and the times are now converting correctly.
Many thanks indeed for your help!
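
For anyone finding this thread later, a two-stage approach along those lines might look roughly like the sketch below. This is not the exact code used above: the column names date and time and the inline CSV text are assumptions made for the example, and dplyr is used here only for convenience.

```r
library(readr)
library(dplyr)
library(stringr)

# Made-up CSV content standing in for the real file
csv_text <- "date,time\n15/05/2019,931\n16/05/2019,1405\n"

# Stage 1: parse the dates on import, but keep the time column as text.
# I() marks the string as literal data (readr 2.0 and later).
df <- read_csv(
  I(csv_text),
  col_types = cols(
    date = col_date(format = "%d/%m/%Y"),
    time = col_character()
  )
)

# Stage 2: pad to four digits, then convert with the vector parser
df <- df %>%
  mutate(time = parse_time(str_pad(time, 4, pad = "0"), format = "%H%M"))
```

Reading the time column as character avoids it being guessed as a number, which keeps the padding step straightforward.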
