Sparklyr's error

Would changing the mode parameter help you out?

For instance, for a bad CSV file:

# One-column CSV: a header ("bad"), three valid integers, and one non-integer row
writeLines(c("bad", 1, 2, 3, "broken"), "bad.csv")

Three modes can help troubleshoot parsing issues:

  • PERMISSIVE: NULLs are inserted for missing tokens.
  • DROPMALFORMED: Drops lines which are malformed.
  • FAILFAST: Aborts if it encounters any malformed line.

Which can be used as follows:

spark_read_csv(
  sc,
  "bad",
  "bad.csv",
  columns = list(foo = "integer"),
  infer_schema = FALSE,
  options = list(mode = "DROPMALFORMED"))
# Source:   table<bad> [?? x 1]
# Database: spark_connection
    foo
  <int>
1     1
2     2
3     3
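
FAILFAST is the strictest of the three modes. Here is a minimal sketch of how it might be wrapped so that a malformed file does not stop the whole script, assuming the same sc connection and bad.csv as above. read_failfast is a hypothetical helper name, not part of sparklyr, and the error text Spark reports varies by version.

```r
# Hypothetical helper (not part of sparklyr): read a CSV in FAILFAST mode
# and return NULL instead of stopping when the read fails.
read_failfast <- function(sc, path, name = "strict") {
  tryCatch(
    sparklyr::spark_read_csv(
      sc,
      name,
      path,
      columns = list(foo = "integer"),
      infer_schema = FALSE,
      options = list(mode = "FAILFAST")
    ),
    error = function(e) {
      # Any failure of the read lands here, including a malformed line
      message("Read failed: ", conditionMessage(e))
      NULL
    }
  )
}

# Usage, assuming an open connection `sc`:
# strict_tbl <- read_failfast(sc, "bad.csv")
```

Since spark_read_csv() caches the table by default (memory = TRUE), the failure should surface at read time rather than at a later query.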

In Spark 2.X, there is also a hidden column, _corrupt_record, that can be used to capture the malformed records:

spark_read_csv(
  sc,
  "decimals",
  "bad.csv",
  columns = list(foo = "integer", "_corrupt_record" = "character"),
  infer_schema = FALSE,
  options = list(mode = "PERMISSIVE")
)
# Source:   table<decimals> [?? x 2]
# Database: spark_connection
    foo `_corrupt_record`
  <int> <chr>            
1     1 NA               
2     2 NA               
3     3 NA               
4    NA broken
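
Once the malformed rows are tagged this way, they can be pulled out for inspection. The following is a sketch only, assuming dplyr is installed; corrupt_rows is a hypothetical helper name, and it is meant to be applied to the table returned by the spark_read_csv() call above.

```r
# Hypothetical helper: keep only the rows Spark flagged as malformed,
# i.e. those with a non-missing `_corrupt_record`.
corrupt_rows <- function(tbl) {
  dplyr::filter(tbl, !is.na(`_corrupt_record`))
}

# Usage, assuming `decimals_tbl` holds the result of the read above:
# corrupt_rows(decimals_tbl)
```

The same filter negated (is.na instead of !is.na) would keep only the rows that parsed cleanly.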