Would changing the mode parameter help you out? For instance, consider a bad CSV file:
writeLines(c("bad", 1, 2, 3, "broken"), "bad.csv")
There are a few modes that can help troubleshoot parsing issues:

- PERMISSIVE: NULLs are inserted for missing tokens.
- DROPMALFORMED: Drops lines that are malformed.
- FAILFAST: Aborts if it encounters any malformed line (sketched after the first example below).
These can be used as follows:
spark_read_csv(
  sc,
  "bad",
  "bad.csv",
  columns = list(foo = "integer"),
  infer_schema = FALSE,
  options = list(mode = "DROPMALFORMED")
)
# Source: table<bad> [?? x 1]
# Database: spark_connection
    foo
  <int>
1     1
2     2
3     3
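FAILFAST, by contrast, surfaces the problem immediately instead of silently dropping rows. A minimal sketch, assuming the same sc connection and bad.csv file as above; the read should abort with an error as soon as the malformed line is encountered:

# with FAILFAST, Spark raises an error on the first malformed record
spark_read_csv(
  sc,
  "bad",
  "bad.csv",
  columns = list(foo = "integer"),
  infer_schema = FALSE,
  options = list(mode = "FAILFAST")
)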
In Spark 2.x, there is also a secret column, _corrupt_record, that can be used to output those incorrect records:
spark_read_csv(
  sc,
  "decimals",
  "bad.csv",
  columns = list(foo = "integer", "_corrupt_record" = "character"),
  infer_schema = FALSE,
  options = list(mode = "PERMISSIVE")
)
# Source: table<decimals> [?? x 2]
# Database: spark_connection
    foo `_corrupt_record`
  <int> <chr>
1     1 NA
2     2 NA
3     3 NA
4    NA broken
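Once the corrupt records are loaded alongside the valid ones, you can isolate them for inspection with dplyr. A minimal sketch, assuming dplyr is attached and the decimals table registered as above:

library(dplyr)

# keep only the rows that failed to parse
tbl(sc, "decimals") %>%
  filter(!is.na(`_corrupt_record`))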