How do I read the first columns of a CSV and ignore the rest?

Here are some examples of how you could read that type of file:

library(readr)

# Input data with extra unwanted bits.
extra_blanks <- "foo,bar,baz,,,\naaa,100,ccc,,,\naaa,200.0,ccc,,,\n"
extra_garbage <- "foo,bar,baz,xtra,xtra\naaa,100,ccc,xtra,xtra\naaa,200,ccc,xtra,xtra\n"

# using data.table
data.table::fread(extra_blanks, select = 1:3)
#>    foo bar baz
#> 1: aaa 100 ccc
#> 2: aaa 200 ccc
data.table::fread(extra_garbage, select = 1:3)
#>    foo bar baz
#> 1: aaa 100 ccc
#> 2: aaa 200 ccc

# using vroom
vroom::vroom(extra_blanks, col_select = 1:3)
#> Rows: 2
#> Columns: 3
#> Delimiter: ","
#> chr [2]: foo, baz
#> dbl [1]: bar
#> 
#> Use `spec()` to retrieve the guessed column specification
#> Pass a specification to the `col_types` argument to quiet this message
#> # A tibble: 2 x 3
#>   foo     bar baz  
#>   <chr> <dbl> <chr>
#> 1 aaa     100 ccc  
#> 2 aaa     200 ccc
vroom::vroom(extra_garbage, col_select = 1:3)
#> Rows: 2
#> Columns: 3
#> Delimiter: ","
#> chr [2]: foo, baz
#> dbl [1]: bar
#> 
#> Use `spec()` to retrieve the guessed column specification
#> Pass a specification to the `col_types` argument to quiet this message
#> # A tibble: 2 x 3
#>   foo     bar baz  
#>   <chr> <dbl> <chr>
#> 1 aaa     100 ccc  
#> 2 aaa     200 ccc

# using readr
col_spec <- cols_only(foo = col_character(), 
                      bar = col_integer(), 
                      baz = col_character())

readr::read_csv(extra_blanks, col_types = col_spec)
#> Warning: Missing column names filled in: 'X4' [4], 'X5' [5], 'X6' [6]
#> Warning: 1 parsing failure.
#> row col               expected actual         file
#>   2 bar no trailing characters     .0 literal data
#> # A tibble: 2 x 3
#>   foo     bar baz  
#>   <chr> <int> <chr>
#> 1 aaa     100 ccc  
#> 2 aaa      NA ccc
readr::read_csv(extra_garbage, col_types = col_spec)
#> Warning: Duplicated column names deduplicated: 'xtra' => 'xtra_1' [5]
#> # A tibble: 2 x 3
#>   foo     bar baz  
#>   <chr> <int> <chr>
#> 1 aaa     100 ccc  
#> 2 aaa     200 ccc

Created on 2020-05-12 by the reprex package (v0.3.0)

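I believe readr guesses the column names before it drops the columns that are not in the col spec provided. My guess is that it needs to know the names in order to apply the spec, which would explain why you still get warnings about missing or duplicated names for columns you are not keeping.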
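If those name warnings are a nuisance, one possible workaround is to skip the header line and supply the column names yourself, so readr has nothing to guess. This is only a sketch: the `drop1` to `drop3` names are placeholders I made up, and I use `col_double()` for `bar` so that `200.0` parses instead of failing as it does with `col_integer()`.

```r
library(readr)

extra_blanks <- "foo,bar,baz,,,\naaa,100,ccc,,,\naaa,200.0,ccc,,,\n"

# Name all six columns up front; drop1..drop3 are made-up placeholder names.
all_names <- c("foo", "bar", "baz", "drop1", "drop2", "drop3")

# cols_only() keeps just the columns listed here.
col_spec <- cols_only(
  foo = col_character(),
  bar = col_double(),
  baz = col_character()
)

# skip = 1 skips the original header line, since col_names replaces it.
# I() marks the string as literal data rather than a file path.
first_three <- read_csv(
  I(extra_blanks),
  col_names = all_names,
  col_types = col_spec,
  skip = 1
)

first_three
```

The downside is that you need to know how many columns the file has in advance, which `fread(select = )` and `vroom(col_select = )` do not require.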

This seems to be a rather old issue.

Maybe vroom's col_select argument will make its way into readr; we'll see.
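For what it's worth, later readr releases (2.0.0 and up, as far as I know) do accept a `col_select` argument in `read_csv()`. A minimal sketch, assuming such a version is installed:

```r
library(readr)

extra_garbage <- "foo,bar,baz,xtra,xtra\naaa,100,ccc,xtra,xtra\naaa,200,ccc,xtra,xtra\n"

# Assumes readr >= 2.0.0, where read_csv() has col_select and show_col_types.
# I() marks the string as literal data rather than a file path.
first_three <- read_csv(
  I(extra_garbage),
  col_select = 1:3,         # keep only the first three columns, by position
  show_col_types = FALSE    # silence the column specification message
)

names(first_three)
```

readr may still print a message about repairing the duplicated `xtra` names, since it reads the full header before selecting.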

Hope it helps!
