My own partial understanding.
As you pointed out, readr::type_convert() does call readr:::type_convert_col() under the hood, and this function is defined in a compiled language.
So we can turn to the source of this C++ function. Here is a simplified version:
    [[cpp11::register]] cpp11::sexp type_convert_col(
        const cpp11::strings& x,
        const cpp11::list& spec,
        const cpp11::list& locale_,
        int col,
        const std::vector<std::string>& na,
        bool trim_ws) {
      // ... (setup elided, including building `locale` from `locale_`)
      CollectorPtr collector = Collector::create(spec, &locale);
      for (int i = 0; i < x.size(); ++i) {
        // ... (`string` and `begin` refer to the i-th element of x; details elided)
        t = Token(begin, begin + Rf_length(string), i - 1, col - 1, false);
        collector->setValue(i, t);
      }
      // ... (rest elided)
    }
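One thing puzzled me at first: based on the order of the arguments, locale_ and col look inverted compared to the R code, yet it works. C++ itself has no named arguments, so my best guess is that the matching happens on the R side: the R wrapper that cpp11 generates appears to use the same argument names, and R matches named arguments (locale_ = ...) first, then fills the remaining ones by position. That would explain why the call fails if you remove locale_ = unless you also correct the order.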
The part of interest now is the setValue() call for a given i (the position of the string within the vector) and t (a Token as defined here).
setValue() is defined within the Collector class. When the Collector is first created, it is assigned a subclass:
    CollectorPtr Collector::create(const cpp11::list& spec, LocaleInfo* pLocale) {
      std::string subclass(cpp11::as_cpp<cpp11::strings>(spec.attr("class"))[0]);
      if (subclass == "collector_date") {
        SEXP format_ = spec["format"];
        // Fall back to the locale's date format when no format is given.
        std::string format = (Rf_isNull(format_)) != 0U
            ? pLocale->dateFormat_
            : cpp11::as_cpp<std::string>(format_);
        return CollectorPtr(new CollectorDate(pLocale, format));
      }
      // ... (other collector subclasses elided)
    }
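To make the dispatch idea concrete, here is a minimal standalone sketch of the same pattern. It is not readr's code: ToyCollector, ToyCollectorDate, ToyCollectorDouble and create_collector() are hypothetical names, and a plain string stands in for the R spec list and the locale. An empty format plays the role of a NULL format in R.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-ins for readr's collector hierarchy (not the real classes).
struct ToyCollector {
  virtual ~ToyCollector() = default;
  virtual std::string kind() const = 0;
};

struct ToyCollectorDate : ToyCollector {
  std::string format;
  explicit ToyCollectorDate(std::string f) : format(std::move(f)) {}
  std::string kind() const override { return "date"; }
};

struct ToyCollectorDouble : ToyCollector {
  std::string kind() const override { return "double"; }
};

// Pick the subclass from the spec's class name. An empty `format` means
// "no format given": fall back to the locale's default date format.
std::unique_ptr<ToyCollector> create_collector(
    const std::string& subclass,
    const std::string& format,
    const std::string& locale_date_format) {
  if (subclass == "collector_date") {
    const std::string& chosen = format.empty() ? locale_date_format : format;
    return std::unique_ptr<ToyCollector>(new ToyCollectorDate(chosen));
  }
  if (subclass == "collector_double") {
    return std::unique_ptr<ToyCollector>(new ToyCollectorDouble());
  }
  throw std::invalid_argument("unknown collector subclass: " + subclass);
}
```

Calling create_collector("collector_date", "", "%Y-%m-%d") yields a date collector that uses the locale's format, which mirrors the Rf_isNull() branch in the snippet above.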
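And we find our warning message here:
    void CollectorDate::setValue(int i, const Token& t) {
      // ... (setup elided; std_string is the token's text, built in the elided part)
      bool res =
          (format_.empty()) ? parser_.parseLocaleDate() : parser_.parse(format_);
      if (!res) {
        // The location in the warning comes from the token, not from i.
        warn(t.row(), t.col(), "date like " + format_, std_string);
      }
    }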
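The key point of that snippet is that the warning's location is read from the token rather than from the index i. A rough standalone sketch of this, under my own assumptions: ToyToken, ToyWarning, ToyDateCollector and looks_like_iso_date() are hypothetical, and the "parser" only checks the shape DDDD-DD-DD, which is far cruder than readr's real date parser.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token: the raw text plus the position it was tagged with.
struct ToyToken {
  std::string text;
  int row;
  int col;
};

struct ToyWarning {
  int row;
  int col;
  std::string expected;
  std::string actual;
};

// Toy check: accepts only ten characters shaped like DDDD-DD-DD.
inline bool looks_like_iso_date(const std::string& s) {
  if (s.size() != 10) return false;
  for (std::size_t k = 0; k < s.size(); ++k) {
    if (k == 4 || k == 7) {
      if (s[k] != '-') return false;
    } else if (!std::isdigit(static_cast<unsigned char>(s[k]))) {
      return false;
    }
  }
  return true;
}

struct ToyDateCollector {
  std::vector<std::string> values;   // parsed values ("" stands in for NA)
  std::vector<ToyWarning> warnings;  // one entry per failed parse

  explicit ToyDateCollector(std::size_t n) : values(n) {}

  // Store the i-th value; on failure, record the token's own row and col.
  void setValue(std::size_t i, const ToyToken& t) {
    if (looks_like_iso_date(t.text)) {
      values[i] = t.text;
    } else {
      warnings.push_back({t.row, t.col, "date like YYYY-MM-DD", t.text});
    }
  }
};
```

Whatever row and col the token was created with is exactly what ends up in the warning, so any offset applied at token creation shows up in the message.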
We can find some useful context here:
- A token is an iterator that points to a single value in source. A token also contains metadata about the location of the value (e.g. the row and col, needed for informative error messages).
- A tokeniser converts a stream of characters from a source into a stream of tokens.
- Field collectors take a stream of tokens, parsing each token and storing it in an R vector. There is one collector for each column type: CollectorLogical, CollectorInteger, CollectorDouble etc. On the R side, these are represented by col_logical(), col_integer(), col_double() etc. Collector::create() dynamically creates a Collector subclass from an R list.
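The three roles described in that list can be sketched in a few lines. This is only an illustration of the token / tokeniser / collector split, not readr's implementation: SketchToken, tokenize() and SketchIntCollector are hypothetical names, the source is a plain comma-separated string, and a trailing empty field is simply dropped.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// A token points into the source (begin/end offsets) and remembers where it
// came from (row and col).
struct SketchToken {
  std::size_t begin;
  std::size_t end;
  int row;
  int col;
};

// Tokeniser: turn "1,2\n3,x\n" into a stream of tokens, tracking row and col.
std::vector<SketchToken> tokenize(const std::string& source, char delim = ',') {
  std::vector<SketchToken> tokens;
  std::size_t start = 0;
  int row = 0, col = 0;
  for (std::size_t pos = 0; pos <= source.size(); ++pos) {
    bool at_end = (pos == source.size());
    if (at_end || source[pos] == delim || source[pos] == '\n') {
      if (!(at_end && start == pos)) {  // skip the empty tail after a final newline
        tokens.push_back({start, pos, row, col});
      }
      if (!at_end && source[pos] == '\n') {
        ++row;
        col = 0;
      } else {
        ++col;
      }
      start = pos + 1;
    }
  }
  return tokens;
}

// Field collector: parse the tokens of one column into integers.
struct SketchIntCollector {
  std::vector<int> values;
  std::vector<bool> is_na;  // true where the field could not be parsed

  void collect(const std::string& source,
               const std::vector<SketchToken>& tokens, int col) {
    for (const SketchToken& t : tokens) {
      if (t.col != col) continue;
      std::string field = source.substr(t.begin, t.end - t.begin);
      char* parsed_end = nullptr;
      long v = std::strtol(field.c_str(), &parsed_end, 10);
      bool ok = !field.empty() && *parsed_end == '\0';
      values.push_back(ok ? static_cast<int>(v) : 0);
      is_na.push_back(!ok);
    }
  }
};
```

In this sketch the tokeniser is the only place that knows about rows and columns; the collector just consumes tokens, which is why location metadata has to travel inside the token.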
So I think that explains much of it. The error message cites t.row() and t.col(), which are the row and column stored in the Token. These are set to i-1 and col-1 when the Token is created, where i is the position of the element within the vector, and col is the column passed as an argument from R (the which(TRUE)[1] part).
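The arithmetic can be isolated in a tiny sketch. token_positions() and col_from_r are hypothetical names of mine; the function only reproduces the i - 1 and col - 1 expressions from the simplified loop, with i starting at 0 as in that loop. How readr later turns the stored values into the printed "[row, col]" is not part of the simplified code, so it is left out here.

```cpp
#include <utility>
#include <vector>

// For a column of n strings, return the (row, col) pair stored in each token,
// following the simplified loop: i runs from 0, and each token is tagged with
// i - 1 and col_from_r - 1. `col_from_r` is the 1-based column number passed
// in from R.
std::vector<std::pair<int, int>> token_positions(int n, int col_from_r) {
  std::vector<std::pair<int, int>> positions;
  for (int i = 0; i < n; ++i) {
    positions.push_back(std::make_pair(i - 1, col_from_r - 1));
  }
  return positions;
}
```

Changing col_from_r only shifts the second member of each pair, while the first member depends only on the position of the element in the vector.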
You can check that directly by playing with these arguments:
    readr:::type_convert_col(char_cols[[1]], specs$cols[[1]], readr::locale(), 1,
                             na = c("", "NA"), trim_ws = TRUE)
    readr:::type_convert_col(char_cols[[1]], specs$cols[[1]], readr::locale(), 2,
                             na = c("", "NA"), trim_ws = TRUE)
    readr:::type_convert_col(char_cols[[1]], specs$cols[[1]], readr::locale(), 3,
                             na = c("", "NA"), trim_ws = TRUE)
This indeed changes the value of col in the error message.
And if you use:
    char_cols <- tibble::tibble(
      x = c("2022-02-02", "potota", "pititi")
    )
you get error messages for "[1,1]" and "[2,1]" but not "[0,1]".
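Finally, to come back to the question in your title: the positions in the warning message are derived from i-1 and col-1. col is still 1-indexed when it arrives from R, and the source code explicitly subtracts 1 from both values just before storing them in the Token. Note that, in the simplified loop above, i itself starts at 0, which is consistent with the reported rows ("[1,1]" and "[2,1]") being one less than the positions of the offending elements in the R vector.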