I've been trying to read a txt file with the readr functions, without success. The file is from NASA, with space-separated records, and I guess what's tripping up read_table / read_table2 is the two trailing spaces at the end of each row. It looks like I can't upload the file here. Anyway, the file is named train_FD001.txt, and you can download a zip archive containing it, along with various other txt files, from here:
A few sample lines:
1 1 -0.0007 -0.0004 100.0 518.67 641.82 1589.70 1400.60 14.62 21.61 554.36 2388.06 9046.19 1.30 47.47 521.66 2388.02 8138.62 8.4195 0.03 392 2388 100.00 39.06 23.4190
1 2 0.0019 -0.0003 100.0 518.67 642.15 1591.82 1403.14 14.62 21.61 553.75 2388.04 9044.07 1.30 47.49 522.28 2388.07 8131.49 8.4318 0.03 392 2388 100.00 39.00 23.4236
1 3 -0.0043 0.0003 100.0 518.67 642.35 1587.99 1404.20 14.62 21.61 554.26 2388.08 9052.94 1.30 47.27 522.42 2388.03 8133.23 8.4178 0.03 390 2388 100.00 38.95 23.3442
1 4 0.0007 0.0000 100.0 518.67 642.35 1582.79 1401.87 14.62 21.61 554.45 2388.11 9049.48 1.30 47.13 522.86 2388.08 8133.83 8.3682 0.03 392 2388 100.00 38.88 23.3739
1 5 -0.0019 -0.0002 100.0 518.67 642.37 1582.85 1406.22 14.62 21.61 554.00 2388.06 9055.15 1.30 47.28 522.19 2388.04 8133.80 8.4294 0.03 393 2388 100.00 38.90 23.4044
If I try read_table, I get:
> training_set <- read_table("train_FD001.txt", col_names = FALSE)
Parsed with column specification:
cols(
X1 = col_integer(),
X2 = col_character()
)
> glimpse(training_set)
Observations: 20,631
Variables: 2
$ X1 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ X2 <chr> "1 -0.0007 -0.0004 100.0 518.67 641.82 1589.70 1400.60 14.62 21.61 554.36 2...
which is not what I want. If I use read_table2, which should handle consecutive spaces/tabs, I get:
> training_set <- read_table2("train_FD001.txt", col_names = FALSE)
Parsed with column specification:
cols(
.default = col_double(),
X1 = col_integer(),
X2 = col_integer(),
X22 = col_integer(),
X23 = col_integer(),
X27 = col_character()
)
See spec(...) for full column specifications.
> glimpse(training_set)
Observations: 20,631
Variables: 27
$ X1 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ X2 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,...
$ X3 <dbl> -0.0007, 0.0019, -0.0043, 0.0007, -0.0019, -0.0043, 0.0010, -0.0034, 0.000...
$ X4 <dbl> -4e-04, -3e-04, 3e-04, 0e+00, -2e-04, -1e-04, 1e-04, 3e-04, 1e-04, 1e-04, ...
$ X5 <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,...
$ X6 <dbl> 518.67, 518.67, 518.67, 518.67, 518.67, 518.67, 518.67, 518.67, 518.67, 51...
$ X7 <dbl> 641.82, 642.15, 642.35, 642.35, 642.37, 642.10, 642.48, 642.56, 642.12, 64...
$ X8 <dbl> 1589.70, 1591.82, 1587.99, 1582.79, 1582.85, 1584.47, 1592.32, 1582.96, 15...
$ X9 <dbl> 1400.60, 1403.14, 1404.20, 1401.87, 1406.22, 1398.37, 1397.77, 1400.97, 13...
$ X10 <dbl> 14.62, 14.62, 14.62, 14.62, 14.62, 14.62, 14.62, 14.62, 14.62, 14.62, 14.6...
$ X11 <dbl> 21.61, 21.61, 21.61, 21.61, 21.61, 21.61, 21.61, 21.61, 21.61, 21.61, 21.6...
$ X12 <dbl> 554.36, 553.75, 554.26, 554.45, 554.00, 554.67, 554.34, 553.85, 553.69, 55...
$ X13 <dbl> 2388.06, 2388.04, 2388.08, 2388.11, 2388.06, 2388.02, 2388.02, 2388.00, 23...
$ X14 <dbl> 9046.19, 9044.07, 9052.94, 9049.48, 9055.15, 9049.68, 9059.13, 9040.80, 90...
$ X15 <dbl> 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3,...
$ X16 <dbl> 47.47, 47.49, 47.27, 47.13, 47.28, 47.16, 47.36, 47.24, 47.29, 47.03, 47.1...
$ X17 <dbl> 521.66, 522.28, 522.42, 522.86, 522.19, 521.68, 522.32, 522.47, 521.79, 52...
$ X18 <dbl> 2388.02, 2388.07, 2388.03, 2388.08, 2388.04, 2388.03, 2388.03, 2388.03, 23...
$ X19 <dbl> 8138.62, 8131.49, 8133.23, 8133.83, 8133.80, 8132.85, 8132.32, 8131.07, 81...
$ X20 <dbl> 8.4195, 8.4318, 8.4178, 8.3682, 8.4294, 8.4108, 8.3974, 8.4076, 8.3728, 8....
$ X21 <dbl> 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0....
$ X22 <int> 392, 392, 390, 392, 393, 391, 392, 391, 392, 393, 392, 391, 393, 393, 391,...
$ X23 <int> 2388, 2388, 2388, 2388, 2388, 2388, 2388, 2388, 2388, 2388, 2388, 2388, 23...
$ X24 <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,...
$ X25 <dbl> 39.06, 39.00, 38.95, 38.88, 38.90, 38.98, 39.10, 38.97, 39.05, 38.95, 38.9...
$ X26 <dbl> 23.4190, 23.4236, 23.3442, 23.3739, 23.4044, 23.3669, 23.3774, 23.3106, 23...
$ X27 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
This is much better, since it automatically detected the integer columns and understood that the rest are numeric, but it's still wrong: the file has only 26 columns, so the all-NA column X27 shouldn't exist.
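One possible workaround, sketched here rather than tested on the real file: since the spurious column is entirely empty, it can be dropped after reading, without naming any column types. The sample rows, the temporary file, and the helper is_empty_column are stand-ins invented for this illustration. Note also that in readr 2.0 and later read_table2 is deprecated in favour of read_table, so the call may emit a deprecation warning.

```r
library(readr)

# Hypothetical stand-in for train_FD001.txt: two shortened rows, each ending
# in two trailing spaces, written to a temporary file.
sample_rows <- c(
  "1 1 -0.0007 -0.0004 100.0 518.67  ",
  "1 2 0.0019 -0.0003 100.0 518.67  "
)
path <- tempfile(fileext = ".txt")
writeLines(sample_rows, path)

# Hypothetical helper: TRUE when a column holds nothing but NA or "".
is_empty_column <- function(x) all(is.na(x) | x == "")

# Let readr guess the column types as usual ...
raw <- read_table2(path, col_names = FALSE)

# ... then keep only the columns that actually contain data.
empty <- vapply(raw, is_empty_column, logical(1), USE.NAMES = FALSE)
training_set <- raw[, !empty]
```

Because the filter looks at column contents rather than position, the same call should also cope with files that have a different number of columns.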
By contrast, base R read.table works nicely on the first try:
> training_set <- read.table("train_FD001.txt", header = FALSE)
> glimpse(training_set)
Observations: 20,631
Variables: 26
$ V1 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ V2 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,...
$ V3 <dbl> -0.0007, 0.0019, -0.0043, 0.0007, -0.0019, -0.0043, 0.0010, -0.0034, 0.000...
$ V4 <dbl> -4e-04, -3e-04, 3e-04, 0e+00, -2e-04, -1e-04, 1e-04, 3e-04, 1e-04, 1e-04, ...
$ V5 <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,...
$ V6 <dbl> 518.67, 518.67, 518.67, 518.67, 518.67, 518.67, 518.67, 518.67, 518.67, 51...
$ V7 <dbl> 641.82, 642.15, 642.35, 642.35, 642.37, 642.10, 642.48, 642.56, 642.12, 64...
$ V8 <dbl> 1589.70, 1591.82, 1587.99, 1582.79, 1582.85, 1584.47, 1592.32, 1582.96, 15...
$ V9 <dbl> 1400.60, 1403.14, 1404.20, 1401.87, 1406.22, 1398.37, 1397.77, 1400.97, 13...
$ V10 <dbl> 14.62, 14.62, 14.62, 14.62, 14.62, 14.62, 14.62, 14.62, 14.62, 14.62, 14.6...
$ V11 <dbl> 21.61, 21.61, 21.61, 21.61, 21.61, 21.61, 21.61, 21.61, 21.61, 21.61, 21.6...
$ V12 <dbl> 554.36, 553.75, 554.26, 554.45, 554.00, 554.67, 554.34, 553.85, 553.69, 55...
$ V13 <dbl> 2388.06, 2388.04, 2388.08, 2388.11, 2388.06, 2388.02, 2388.02, 2388.00, 23...
$ V14 <dbl> 9046.19, 9044.07, 9052.94, 9049.48, 9055.15, 9049.68, 9059.13, 9040.80, 90...
$ V15 <dbl> 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3,...
$ V16 <dbl> 47.47, 47.49, 47.27, 47.13, 47.28, 47.16, 47.36, 47.24, 47.29, 47.03, 47.1...
$ V17 <dbl> 521.66, 522.28, 522.42, 522.86, 522.19, 521.68, 522.32, 522.47, 521.79, 52...
$ V18 <dbl> 2388.02, 2388.07, 2388.03, 2388.08, 2388.04, 2388.03, 2388.03, 2388.03, 23...
$ V19 <dbl> 8138.62, 8131.49, 8133.23, 8133.83, 8133.80, 8132.85, 8132.32, 8131.07, 81...
$ V20 <dbl> 8.4195, 8.4318, 8.4178, 8.3682, 8.4294, 8.4108, 8.3974, 8.4076, 8.3728, 8....
$ V21 <dbl> 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0....
$ V22 <int> 392, 392, 390, 392, 393, 391, 392, 391, 392, 393, 392, 391, 393, 393, 391,...
$ V23 <int> 2388, 2388, 2388, 2388, 2388, 2388, 2388, 2388, 2388, 2388, 2388, 2388, 23...
$ V24 <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,...
$ V25 <dbl> 39.06, 39.00, 38.95, 38.88, 38.90, 38.98, 39.10, 38.97, 39.05, 38.95, 38.9...
$ V26 <dbl> 23.4190, 23.4236, 23.3442, 23.3739, 23.4044, 23.3669, 23.3774, 23.3106, 23...
Is there a way to make the readr functions work with this file? I'd rather not specify the column type explicitly for each column: not only do I have many columns, but different files can also have different numbers of columns, so specifying column types for each file separately is error-prone and unwieldy. At this point, it seems easier and faster to just use base R read.table. That said, a solution based on a cols() specification would be fine if it can be kept simple, for example by using .default.
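If staying in base R for the parsing step is acceptable, a minimal sketch is to wrap read.table and convert the result to a tibble, so the rest of a tidyverse pipeline still works. The helper name read_space_separated and the temporary sample file are assumptions made for illustration, not part of readr or of the NASA archive.

```r
library(tibble)

# Hypothetical helper: parse with base R, which ignores the trailing spaces,
# then hand back a tibble. Columns keep read.table's names (V1, V2, ...).
read_space_separated <- function(path) {
  as_tibble(read.table(path, header = FALSE))
}

# Stand-in for one of the real files: two shortened rows, each ending in
# two trailing spaces.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "1 1 -0.0007 -0.0004 100.0 518.67  ",
  "1 2 0.0019 -0.0003 100.0 518.67  "
), path)

training_set <- read_space_separated(path)
```

Since read.table guesses types per column, nothing here depends on how many columns a given file has.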