How to specify to R that I want to import certain columns with a particular class?

juandmaz · December 17, 2021, 8:47am

I have a problem with R and I don't know how to solve it. I am working with a database of 47228 observations and 68 columns of an excel sheet. The case is that I have 7 columns in which in the excel 47719 observations of 47728 (99.99%) appear as NA and the other cases as dates, and surprisingly when I import the base with the read_excel function of the readxl package, R import it as 100% as NA. In other words, in those 9 observations that are dates, R doesn't show the reliable content.

The problem is that I need to be able to analyze the database including that data and I cannot, because it does not appear.

As it is a problem that arises with the import, I am trying to explain to R that those 7 columns process them as dates, but I can't do it. The read_excel function has an argument which is col_types but I have to fill it with a vector that specifies the type of the 68 columns, and what I am looking for is to simply specify those 8 columns. Does anyone know how to do it? I saw that there is a function called read_table from the readr package that allows on import to specify the class of only the variables that interest me, with the argument cols_only =, but for some reason when I execute that function R aborts the session with a warning "R encountered a fatal error."

Does anyone know what this could be? Greetings and thank you very much. Sorry for my english.

technocrat · December 17, 2021, 9:12am

See the {readr} package—in conjunction with the {lubridate} package it should bring in your data in usable form if you first export it from Excel to a csv file.

The readr::read_csv() function does a reasonably good job at guessing the appropriate column type. I find that dates in Excel are pretty hopeless, and I expect them to be brought in as strings to be parsed with lubridate::mdy(VARNAME) for date strings such as 17/12/2021, for example.

If you need more specific help, see the FAQ: How to do a minimal reproducible example reprex for beginners.

juandmaz · December 17, 2021, 11:11am

Hello, thank you for the answer. The problem is that when I use the readr package, R aborts the session with a warning "R encountered a fatal error". So, until I find how to solve that problem I can't use that package.

technocrat · December 17, 2021, 10:29pm

times 68 fields is not a very big object only 3-10MB unless the machine has less than 4GB, which is about the minimum to do very much at all.

Can you load the libraries and everything else you have in session before the attempted read and post the output from

SessionInfo()

juandmaz · December 18, 2021, 8:33am

Hello, thanks for the answer. I found out why I get that error, it is because the read_table function can only read .text or .csv files and I was trying to read an .xls file.
I also found the solution I was looking for, in the read_excel function I have to add the argument guess_max = Inf, because R by default only reads the first 1000 rows to deduce the class of the column, then adding that argument R reads all the rows and puts the reliable information.

system · January 8, 2022, 8:34am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.