Cannot read a txt file correctly with readr

Andrea · January 24, 2018, 4:18pm

I've been trying to read a txt file with the readr functions, without success. The file is from NASA, with space-separated records, and I guess what's tripping up read_table / read_table2 is the fact that there are two trailing spaces at the end of each row. It looks like I can't upload the file here. Anyway, the file is named train_FD001.txt and you can download a zip archive, containing various txt files including train_FD001.txt, from here:

https://ti.arc.nasa.gov/c/6/

A couple of sample lines:

1 1 -0.0007 -0.0004 100.0 518.67 641.82 1589.70 1400.60 14.62 21.61 554.36 2388.06 9046.19 1.30 47.47 521.66 2388.02 8138.62 8.4195 0.03 392 2388 100.00 39.06 23.4190  
1 2 0.0019 -0.0003 100.0 518.67 642.15 1591.82 1403.14 14.62 21.61 553.75 2388.04 9044.07 1.30 47.49 522.28 2388.07 8131.49 8.4318 0.03 392 2388 100.00 39.00 23.4236  
1 3 -0.0043 0.0003 100.0 518.67 642.35 1587.99 1404.20 14.62 21.61 554.26 2388.08 9052.94 1.30 47.27 522.42 2388.03 8133.23 8.4178 0.03 390 2388 100.00 38.95 23.3442  
1 4 0.0007 0.0000 100.0 518.67 642.35 1582.79 1401.87 14.62 21.61 554.45 2388.11 9049.48 1.30 47.13 522.86 2388.08 8133.83 8.3682 0.03 392 2388 100.00 38.88 23.3739  
1 5 -0.0019 -0.0002 100.0 518.67 642.37 1582.85 1406.22 14.62 21.61 554.00 2388.06 9055.15 1.30 47.28 522.19 2388.04 8133.80 8.4294 0.03 393 2388 100.00 38.90 23.4044

If I try read_table, I get:

> training_set <- read_table("train_FD001.txt", col_names = FALSE)
Parsed with column specification:
cols(
  X1 = col_integer(),
  X2 = col_character()
)
> glimpse(training_set)
Observations: 20,631
Variables: 2
$ X1 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ X2 <chr> "1 -0.0007 -0.0004 100.0 518.67 641.82 1589.70 1400.60 14.62 21.61 554.36 2...

which is not what I want. If I use read_table2, which should handle consecutive spaces/tabs, I get:

> training_set <- read_table2("train_FD001.txt", col_names = FALSE)
Parsed with column specification:
cols(
  .default = col_double(),
  X1 = col_integer(),
  X2 = col_integer(),
  X22 = col_integer(),
  X23 = col_integer(),
  X27 = col_character()
)
See spec(...) for full column specifications.
> glimpse(training_set)
Observations: 20,631
Variables: 27
$ X1  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ X2  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,...
$ X3  <dbl> -0.0007, 0.0019, -0.0043, 0.0007, -0.0019, -0.0043, 0.0010, -0.0034, 0.000...
$ X4  <dbl> -4e-04, -3e-04, 3e-04, 0e+00, -2e-04, -1e-04, 1e-04, 3e-04, 1e-04, 1e-04, ...
$ X5  <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,...
$ X6  <dbl> 518.67, 518.67, 518.67, 518.67, 518.67, 518.67, 518.67, 518.67, 518.67, 51...
$ X7  <dbl> 641.82, 642.15, 642.35, 642.35, 642.37, 642.10, 642.48, 642.56, 642.12, 64...
$ X8  <dbl> 1589.70, 1591.82, 1587.99, 1582.79, 1582.85, 1584.47, 1592.32, 1582.96, 15...
$ X9  <dbl> 1400.60, 1403.14, 1404.20, 1401.87, 1406.22, 1398.37, 1397.77, 1400.97, 13...
$ X10 <dbl> 14.62, 14.62, 14.62, 14.62, 14.62, 14.62, 14.62, 14.62, 14.62, 14.62, 14.6...
$ X11 <dbl> 21.61, 21.61, 21.61, 21.61, 21.61, 21.61, 21.61, 21.61, 21.61, 21.61, 21.6...
$ X12 <dbl> 554.36, 553.75, 554.26, 554.45, 554.00, 554.67, 554.34, 553.85, 553.69, 55...
$ X13 <dbl> 2388.06, 2388.04, 2388.08, 2388.11, 2388.06, 2388.02, 2388.02, 2388.00, 23...
$ X14 <dbl> 9046.19, 9044.07, 9052.94, 9049.48, 9055.15, 9049.68, 9059.13, 9040.80, 90...
$ X15 <dbl> 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3,...
$ X16 <dbl> 47.47, 47.49, 47.27, 47.13, 47.28, 47.16, 47.36, 47.24, 47.29, 47.03, 47.1...
$ X17 <dbl> 521.66, 522.28, 522.42, 522.86, 522.19, 521.68, 522.32, 522.47, 521.79, 52...
$ X18 <dbl> 2388.02, 2388.07, 2388.03, 2388.08, 2388.04, 2388.03, 2388.03, 2388.03, 23...
$ X19 <dbl> 8138.62, 8131.49, 8133.23, 8133.83, 8133.80, 8132.85, 8132.32, 8131.07, 81...
$ X20 <dbl> 8.4195, 8.4318, 8.4178, 8.3682, 8.4294, 8.4108, 8.3974, 8.4076, 8.3728, 8....
$ X21 <dbl> 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0....
$ X22 <int> 392, 392, 390, 392, 393, 391, 392, 391, 392, 393, 392, 391, 393, 393, 391,...
$ X23 <int> 2388, 2388, 2388, 2388, 2388, 2388, 2388, 2388, 2388, 2388, 2388, 2388, 23...
$ X24 <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,...
$ X25 <dbl> 39.06, 39.00, 38.95, 38.88, 38.90, 38.98, 39.10, 38.97, 39.05, 38.95, 38.9...
$ X26 <dbl> 23.4190, 23.4236, 23.3442, 23.3739, 23.4044, 23.3669, 23.3774, 23.3106, 23...
$ X27 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
>

Much better, since it correctly and automatically detected the integer columns and understood that the rest are numeric, but still wrong: the data has only 26 columns, so there should be no column X27.

Instead, base R read.table works nicely on the first try:

> training_set <- read.table("train_FD001.txt", header = FALSE)
> glimpse(training_set)
Observations: 20,631
Variables: 26
$ V1  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ V2  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,...
$ V3  <dbl> -0.0007, 0.0019, -0.0043, 0.0007, -0.0019, -0.0043, 0.0010, -0.0034, 0.000...
$ V4  <dbl> -4e-04, -3e-04, 3e-04, 0e+00, -2e-04, -1e-04, 1e-04, 3e-04, 1e-04, 1e-04, ...
$ V5  <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,...
$ V6  <dbl> 518.67, 518.67, 518.67, 518.67, 518.67, 518.67, 518.67, 518.67, 518.67, 51...
$ V7  <dbl> 641.82, 642.15, 642.35, 642.35, 642.37, 642.10, 642.48, 642.56, 642.12, 64...
$ V8  <dbl> 1589.70, 1591.82, 1587.99, 1582.79, 1582.85, 1584.47, 1592.32, 1582.96, 15...
$ V9  <dbl> 1400.60, 1403.14, 1404.20, 1401.87, 1406.22, 1398.37, 1397.77, 1400.97, 13...
$ V10 <dbl> 14.62, 14.62, 14.62, 14.62, 14.62, 14.62, 14.62, 14.62, 14.62, 14.62, 14.6...
$ V11 <dbl> 21.61, 21.61, 21.61, 21.61, 21.61, 21.61, 21.61, 21.61, 21.61, 21.61, 21.6...
$ V12 <dbl> 554.36, 553.75, 554.26, 554.45, 554.00, 554.67, 554.34, 553.85, 553.69, 55...
$ V13 <dbl> 2388.06, 2388.04, 2388.08, 2388.11, 2388.06, 2388.02, 2388.02, 2388.00, 23...
$ V14 <dbl> 9046.19, 9044.07, 9052.94, 9049.48, 9055.15, 9049.68, 9059.13, 9040.80, 90...
$ V15 <dbl> 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3,...
$ V16 <dbl> 47.47, 47.49, 47.27, 47.13, 47.28, 47.16, 47.36, 47.24, 47.29, 47.03, 47.1...
$ V17 <dbl> 521.66, 522.28, 522.42, 522.86, 522.19, 521.68, 522.32, 522.47, 521.79, 52...
$ V18 <dbl> 2388.02, 2388.07, 2388.03, 2388.08, 2388.04, 2388.03, 2388.03, 2388.03, 23...
$ V19 <dbl> 8138.62, 8131.49, 8133.23, 8133.83, 8133.80, 8132.85, 8132.32, 8131.07, 81...
$ V20 <dbl> 8.4195, 8.4318, 8.4178, 8.3682, 8.4294, 8.4108, 8.3974, 8.4076, 8.3728, 8....
$ V21 <dbl> 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0....
$ V22 <int> 392, 392, 390, 392, 393, 391, 392, 391, 392, 393, 392, 391, 393, 393, 391,...
$ V23 <int> 2388, 2388, 2388, 2388, 2388, 2388, 2388, 2388, 2388, 2388, 2388, 2388, 23...
$ V24 <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,...
$ V25 <dbl> 39.06, 39.00, 38.95, 38.88, 38.90, 38.98, 39.10, 38.97, 39.05, 38.95, 38.9...
$ V26 <dbl> 23.4190, 23.4236, 23.3442, 23.3739, 23.4044, 23.3669, 23.3774, 23.3106, 23...

Is there a way to make the readr functions work with this file? I'd rather not specify the column type explicitly for each column: not only do I have many columns, but different files can also have different numbers of columns, so having to specify column types for each file separately is error-prone and unwieldy. At this point, it seems easier and faster to just use base R read.table. Anyway, a solution based on a cols() specification could be fine, if you can manage to keep it simple, using for example .default.

danr · January 24, 2018, 5:17pm

This seems to work. It would be helpful if, rather than saying what you are not looking for, you cobbled together a smaller example that shows both the input and the output you want. It's pretty easy to guess it from what you have, but it would make it easier for the rest of us to help you.

path <- "/Volumes/VideoPhoto/rcommunity/readnasadata/CMAPSSData/train_FD001.txt"
x <- read.delim(path, sep= "", header = FALSE)
x[1:10,]
#>    V1 V2      V3     V4  V5     V6     V7      V8      V9   V10   V11
#> 1   1  1 -0.0007 -4e-04 100 518.67 641.82 1589.70 1400.60 14.62 21.61
#> 2   1  2  0.0019 -3e-04 100 518.67 642.15 1591.82 1403.14 14.62 21.61
#> 3   1  3 -0.0043  3e-04 100 518.67 642.35 1587.99 1404.20 14.62 21.61
#> 4   1  4  0.0007  0e+00 100 518.67 642.35 1582.79 1401.87 14.62 21.61
#> 5   1  5 -0.0019 -2e-04 100 518.67 642.37 1582.85 1406.22 14.62 21.61
#> 6   1  6 -0.0043 -1e-04 100 518.67 642.10 1584.47 1398.37 14.62 21.61
#> 7   1  7  0.0010  1e-04 100 518.67 642.48 1592.32 1397.77 14.62 21.61
#> 8   1  8 -0.0034  3e-04 100 518.67 642.56 1582.96 1400.97 14.62 21.61
#> 9   1  9  0.0008  1e-04 100 518.67 642.12 1590.98 1394.80 14.62 21.61
#> 10  1 10 -0.0033  1e-04 100 518.67 641.71 1591.24 1400.46 14.62 21.61
#>       V12     V13     V14 V15   V16    V17     V18     V19    V20  V21 V22
#> 1  554.36 2388.06 9046.19 1.3 47.47 521.66 2388.02 8138.62 8.4195 0.03 392
#> 2  553.75 2388.04 9044.07 1.3 47.49 522.28 2388.07 8131.49 8.4318 0.03 392
#> 3  554.26 2388.08 9052.94 1.3 47.27 522.42 2388.03 8133.23 8.4178 0.03 390
#> 4  554.45 2388.11 9049.48 1.3 47.13 522.86 2388.08 8133.83 8.3682 0.03 392
#> 5  554.00 2388.06 9055.15 1.3 47.28 522.19 2388.04 8133.80 8.4294 0.03 393
#> 6  554.67 2388.02 9049.68 1.3 47.16 521.68 2388.03 8132.85 8.4108 0.03 391
#> 7  554.34 2388.02 9059.13 1.3 47.36 522.32 2388.03 8132.32 8.3974 0.03 392
#> 8  553.85 2388.00 9040.80 1.3 47.24 522.47 2388.03 8131.07 8.4076 0.03 391
#> 9  553.69 2388.05 9046.46 1.3 47.29 521.79 2388.05 8125.69 8.3728 0.03 392
#> 10 553.59 2388.05 9051.70 1.3 47.03 521.79 2388.06 8129.38 8.4286 0.03 393
#>     V23 V24   V25     V26
#> 1  2388 100 39.06 23.4190
#> 2  2388 100 39.00 23.4236
#> 3  2388 100 38.95 23.3442
#> 4  2388 100 38.88 23.3739
#> 5  2388 100 38.90 23.4044
#> 6  2388 100 38.98 23.3669
#> 7  2388 100 39.10 23.3774
#> 8  2388 100 38.97 23.3106
#> 9  2388 100 39.05 23.4066
#> 10 2388 100 38.95 23.4694
str(x)
#> 'data.frame':    20631 obs. of  26 variables:
#>  $ V1 : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ V2 : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ V3 : num  -0.0007 0.0019 -0.0043 0.0007 -0.0019 -0.0043 0.001 -0.0034 0.0008 -0.0033 ...
#>  $ V4 : num  -4e-04 -3e-04 3e-04 0e+00 -2e-04 -1e-04 1e-04 3e-04 1e-04 1e-04 ...
#>  $ V5 : num  100 100 100 100 100 100 100 100 100 100 ...
#>  $ V6 : num  519 519 519 519 519 ...
#>  $ V7 : num  642 642 642 642 642 ...
#>  $ V8 : num  1590 1592 1588 1583 1583 ...
#>  $ V9 : num  1401 1403 1404 1402 1406 ...
#>  $ V10: num  14.6 14.6 14.6 14.6 14.6 ...
#>  $ V11: num  21.6 21.6 21.6 21.6 21.6 ...
#>  $ V12: num  554 554 554 554 554 ...
#>  $ V13: num  2388 2388 2388 2388 2388 ...
#>  $ V14: num  9046 9044 9053 9049 9055 ...
#>  $ V15: num  1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 ...
#>  $ V16: num  47.5 47.5 47.3 47.1 47.3 ...
#>  $ V17: num  522 522 522 523 522 ...
#>  $ V18: num  2388 2388 2388 2388 2388 ...
#>  $ V19: num  8139 8131 8133 8134 8134 ...
#>  $ V20: num  8.42 8.43 8.42 8.37 8.43 ...
#>  $ V21: num  0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.03 ...
#>  $ V22: int  392 392 390 392 393 391 392 391 392 393 ...
#>  $ V23: int  2388 2388 2388 2388 2388 2388 2388 2388 2388 2388 ...
#>  $ V24: num  100 100 100 100 100 100 100 100 100 100 ...
#>  $ V25: num  39.1 39 39 38.9 38.9 ...
#>  $ V26: num  23.4 23.4 23.3 23.4 23.4 ...

Andrea · January 24, 2018, 5:28pm

Hi @danr thanks for your answer! Sorry, I'm not sure I understood you:

You mean that, rather than saying what I don't want, I should post an example with input, and desired output, right? But I did that: the input is train_FD001.txt, and the output is that of read.table("train_FD001.txt", header = FALSE). I even included that output in my first post.

Coming to your solution, yes, that's the output I'm looking for, but I was looking for a readr solution...if I need to stick to base R, read.table(path, header = FALSE) works fine. As a matter of fact, I think read.table(path, header = FALSE) (my solution) is just a shortcut for read.delim(path, sep= "", header = FALSE) (your solution). At least, output looks the same.
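A quick way to check that equivalence on a small sample is sketched below. The two sample rows and the temporary file are assumptions made up for illustration (shortened from the real data, with the two trailing spaces kept). Note that read.delim is a wrapper around read.table with different defaults for quote, fill and comment.char, so the two calls agree on this data but are not guaranteed to agree on every file.

```r
# Hypothetical two-row sample with two trailing spaces per line, as in the post.
sample_lines <- c(
  "1 1 -0.0007 -0.0004 100.0 518.67  ",
  "1 2 0.0019 -0.0003 100.0 518.67  "
)
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

via_table <- read.table(tmp, header = FALSE)
via_delim <- read.delim(tmp, sep = "", header = FALSE)

# Compare the two results column by column.
all.equal(via_table, via_delim)

# Defaults that still differ between the two functions.
as.list(formals(read.table))[c("quote", "fill", "comment.char")]
as.list(formals(read.delim))[c("quote", "fill", "comment.char")]
```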

danr · January 24, 2018, 6:27pm

Good that you ended up with something that does what you want.

What I meant was an example based on 5 lines or so from train_FD001.txt, plus a hand-built 5-row table showing your desired result. It just makes it easier to see what you are looking for.

BTW read.table and read.delim are both part of the utils package, not base. Also, readr imports tibble, which imports utils, so if you are using readr::read_table you are using utils anyhow.

Andrea · January 24, 2018, 6:44pm

Sorry, my bad, I thought utils was part of base R. Do you think there's a way to get my desired result using functions from readr as well? Or should I stick to utils?

danr · January 24, 2018, 6:54pm

I think I misunderstood what you meant by base. I thought you meant just functions from the base package, for example base::list.

If you type library(help = "base") you will see that base includes the utils package. It may not be possible to run base without utils, I'm not sure.

Whether you use read.table or read.delim, you won't be loading anything extra into your R session, so I don't see it making any difference. read.table(path, header = FALSE) is less typing, so I'd use that.
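One way to check where these functions live is sketched below. It uses only functions that ship with R; the variable names are made up for illustration. The namespace lookup shows which package defines a function, and the priority listing shows which packages are distributed with R itself.

```r
# Namespace in which each function is defined.
read_table_home <- environmentName(environment(utils::read.table))
paste_home <- environmentName(environment(base::paste))
read_table_home
paste_home

# Packages with priority "base" are distributed with R itself.
base_priority <- rownames(installed.packages(priority = "base"))
"utils" %in% base_priority
```

So read.table is defined in utils rather than in the base package, while utils is still one of the packages commonly meant by "base R".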

jimhester · January 24, 2018, 7:13pm

You can use a column type of "_" to drop the empty column

library(readr)
read_table2("CMAPSSData/train_FD001.txt", col_names = FALSE, col_types = cols(X27 = "_"))
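For files where the number of columns is not known in advance, one possible variation is to drop every column that is entirely NA after reading. This is only a sketch: drop_empty_cols is a hypothetical helper, not part of readr, and the small data frame stands in for the parsed file. It would also discard a genuine data column that happens to be missing in every row.

```r
# Hypothetical helper: keep only columns with at least one non-NA value.
drop_empty_cols <- function(df) {
  keep <- vapply(df, function(col) !all(is.na(col)), logical(1))
  df[, keep, drop = FALSE]
}

# Stand-in for a parsed file with a spurious, all-NA trailing column.
parsed <- data.frame(
  X1 = c(1L, 1L),
  X2 = c(-0.0007, 0.0019),
  X3 = c(NA_character_, NA_character_)
)
cleaned <- drop_empty_cols(parsed)
names(cleaned)
```

Under the same assumption, the helper could be applied to the result of read_table2(path, col_names = FALSE) in place of naming X27 explicitly.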

Andrea · January 25, 2018, 11:14am

Thanks a bunch, @jimhester! Do you think I should also open an issue in the readr repo, or would that not be appropriate?

jimhester · January 25, 2018, 1:49pm

No, this is the intended behavior in this case.

kenbutler · April 21, 2018, 1:37pm

My take is that these are data values separated by a single space each time, so that read_delim(filename," ") is the way to go. Then you don't have to worry about re-formatting the data, non-aligned columns or anything like that.

My understanding about read_table is that the columns should be aligned with each other. It seems to me that read_table is a good deal less flexible than the old read.table, but I would welcome having my understanding improved on this.

Andrea · April 21, 2018, 5:04pm

@kenbutler thanks for your suggestion! I tested it, but it gives worse results than my original attempt (read_table2 without a col_types argument): using read_delim I get two NA columns at the end of my dataframe, while using read_table2 without col_types I get only one. @jimhester's answer stands as the accepted one.
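The two extra columns are consistent with each trailing space being counted as its own delimiter. The sketch below illustrates the same rule with base R's read.table rather than readr, as an analogy only; the sample row is made up for illustration.

```r
line <- "1 2 3.5  "  # hypothetical row ending in two spaces

# sep = " ": every single space is a delimiter, so the two trailing
# spaces produce two extra empty fields.
strict <- read.table(text = line, sep = " ", header = FALSE)

# sep = "": runs of whitespace count as one delimiter and trailing
# whitespace is ignored.
loose <- read.table(text = line, sep = "", header = FALSE)

ncol(strict)
ncol(loose)
```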

RuReady · May 23, 2018, 8:15pm

Can you give data.table a try?

library(data.table)
training_set <- fread("train_FD001.txt")

Andrea · May 24, 2018, 7:58am

Hi, @RuReady,

thanks for your suggestion! I've already solved it with @jimhester's suggestion, and I don't have time to run other tests, unfortunately.

Also, frankly, I don't like the data.table API: I find its only advantage over the tidyverse is speed and scale, and as you can see from the data, neither of them is a concern in my case. However, you have the data set, so if you'd like to run that test, by all means please go ahead.