Does R sometimes index from 0?

rkb965 · September 21, 2023, 5:09pm

Apologies for the clickbait title but I am perplexed and intrigued by this warning message that seems to refer to an index of [0,1] in a data frame. I think the underlying code is C, and I am out of ideas for how to understand where [0,1] came from and would love help understanding!

Thanks!

char_cols <- tibble::tibble(
  x = "potato"
)
col_types = readr::as.col_spec(tibble::tibble(
  x = as.Date("2021-01-01") + 0:10)
)

guesses <- list(
  x = "character"
)

specs <- readr:::col_spec_standardise(col_types = col_types, 
                                      col_names = "x", 
                                      guessed_types = guesses)

readr:::type_convert_col(char_cols[[1]], specs$cols[[1]], which(TRUE)[1], 
                         locale_ = readr::locale(), na = c("", "NA"), trim_ws = TRUE)
#> Warning: [0, 1]: expected date like , but got 'potato'
#> [1] NA

readr:::type_convert_col
#> function (x, spec, locale_, col, na, trim_ws) 
#> {
#>     .Call(`_readr_type_convert_col`, x, spec, locale_, col, na, 
#>         trim_ws)
#> }
#> <bytecode: 0x559784b0f4b8>
#> <environment: namespace:readr>

Created on 2023-09-21 with reprex v2.0.2

AlexisW · September 21, 2023, 6:35pm

My own partial understanding.

As you pointed out, readr::type_convert() calls readr:::type_convert_col() under the hood, and that function is implemented in compiled code.

So we can turn to the source of this C++ function. Here is a simplified version:

[[cpp11::register]] cpp11::sexp type_convert_col(
    const cpp11::strings& x,
    const cpp11::list& spec,
    const cpp11::list& locale_,
    int col,
    const std::vector<std::string>& na,
    bool trim_ws) {
  // ... (setup omitted in this simplified version)
  CollectorPtr collector = Collector::create(spec, &locale);

  for (int i = 0; i < x.size(); ++i) {
    // ... (per-element setup omitted)
    // The token stores its position as (i - 1, col - 1).
    t = Token(begin, begin + Rf_length(string), i - 1, col - 1, false);
    collector->setValue(i, t);
  }
  // ... (rest omitted)
}

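One thing that looked like a mystery to me at first: in your call, the column index (which(TRUE)[1]) comes before locale_, whereas both the R wrapper and the C++ signature have locale_ before col, yet it works. This is R's argument matching rather than anything in C++: named arguments are matched first, and the remaining unnamed arguments fill the remaining formals in order. Because locale_ is passed by name, the third unnamed argument lands in col. If you remove locale_ =, the call fails unless you correct the order.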

The part of interest now is the setValue() for a given i (the number of the string within the vector) and t (a Token as defined here).

The setValue is defined within the Collector class. When the Collector is first created, it is assigned a subclass:

CollectorPtr Collector::create(const cpp11::list& spec, LocaleInfo* pLocale) {
  std::string subclass(cpp11::as_cpp<cpp11::strings>(spec.attr("class"))[0]);

  if (subclass == "collector_date") {
    SEXP format_ = spec["format"];
    std::string format = (Rf_isNull(format_)) != 0U
                             ? pLocale->dateFormat_
                             : cpp11::as_cpp<std::string>(format_);
    return CollectorPtr(new CollectorDate(pLocale, format));
  }
}

We find our warning message in CollectorDate::setValue():

void CollectorDate::setValue(int i, const Token& t) {
  bool res =
      (format_.empty()) ? parser_.parseLocaleDate() : parser_.parse(format_);
  if (!res) {
    warn(t.row(), t.col(), "date like " + format_, std_string);
  }
}

We can find some useful context here:

A token is an iterator that points to a single value in source. A token
also contains metadata about the location of the value (e.g. the row and col,
needed for informative error message)

A tokeniser converts a stream of characters from a source into a stream of
tokens.

Field collectors take a stream of tokens, parsing each token and storing
it in an R vector.

There is one collector for each column type: CollectorLogical,
CollectorInteger, CollectorDouble etc. On the R side, these are
represented by col_logical(), col_integer(), col_double() etc.
Collector::create() dynamically creates a Collector subclass from an
R list.
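
The last point in the quoted comments, creating a Collector subclass from a class name, is essentially a factory. Here is a minimal self-contained C++ sketch of that idea. The class names (CollectorSketch and friends) and create_collector() are hypothetical and not readr's; an empty string stands in for a NULL format, and "collector_integer" is assumed to be the class name used for integer columns.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical base class; readr's real Collector has a richer interface.
class CollectorSketch {
public:
  virtual ~CollectorSketch() = default;
  virtual std::string name() const = 0;
};

// Date collector that remembers the format it was created with.
class DateCollectorSketch : public CollectorSketch {
public:
  explicit DateCollectorSketch(std::string format)
      : format_(std::move(format)) {}
  std::string name() const override { return "date(" + format_ + ")"; }

private:
  std::string format_;
};

class IntegerCollectorSketch : public CollectorSketch {
public:
  std::string name() const override { return "integer"; }
};

// Picks a subclass from the class name carried by the column spec.
// ASSUMPTION: an empty `format` plays the role of a NULL format in the spec,
// in which case the locale's default date format is used instead.
std::unique_ptr<CollectorSketch> create_collector(
    const std::string& subclass,
    const std::string& format,
    const std::string& locale_default) {
  if (subclass == "collector_date") {
    return std::make_unique<DateCollectorSketch>(
        format.empty() ? locale_default : format);
  }
  if (subclass == "collector_integer") {
    return std::make_unique<IntegerCollectorSketch>();
  }
  throw std::invalid_argument("Unsupported column type: " + subclass);
}
```

Calling create_collector("collector_date", "", some_default) returns a date collector that uses the default format, which mirrors the Rf_isNull() branch in the excerpt above.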

So I think that explains much of it. The warning message cites t.row() and t.col(), the row and column stored in the Token. These are set to i - 1 and col - 1 when the Token is created, where i is the loop index over the elements of the vector and col is the column number passed as an argument from R (which(TRUE)[1], i.e. 1, in your example).

You can check this directly by varying these arguments:

readr:::type_convert_col(char_cols[[1]], specs$cols[[1]], readr::locale(), 1,
                         na = c("", "NA"), trim_ws = TRUE)
readr:::type_convert_col(char_cols[[1]], specs$cols[[1]], readr::locale(), 2,
                         na = c("", "NA"), trim_ws = TRUE)
readr:::type_convert_col(char_cols[[1]], specs$cols[[1]], readr::locale(), 3,
                         na = c("", "NA"), trim_ws = TRUE)

This does change the column value in the warning message.

And if you use:

char_cols <- tibble::tibble(
  x = c("2022-02-02", "potota", "pititi"),
)

you get warnings for "[1, 1]" and "[2, 1]" but not for "[0, 1]", because the first element now parses as a valid date.

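Finally, to come back to the question in your title: the numbers in the warning do not mean that R indexes from 0. col is the 1-based column number passed from R, while i is the C++ loop counter, which starts at 0. The source subtracts 1 from both when it builds the Token, and the warning ends up displaying the loop counter as the row (0 for the first element) and the original 1-based value as the column.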
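
To make the arithmetic concrete, here is a small self-contained C++ sketch of the bookkeeping. TokenSketch, format_position() and positions() are hypothetical names, not readr's. The offsets i - 1 and col - 1 come from the simplified source above; the + 1 applied when formatting is an assumption, inferred only from the observed output, since that step does not appear in the excerpts.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for readr's Token: it only stores a position.
struct TokenSketch {
  int row;
  int col;
};

// ASSUMPTION: the formatting step adds 1 to both stored values. This is
// inferred from the observed "[0, 1]" output, not copied from readr.
std::string format_position(const TokenSketch& t) {
  return "[" + std::to_string(t.row + 1) + ", " +
         std::to_string(t.col + 1) + "]";
}

// Mirrors the loop in the simplified type_convert_col(): i starts at 0,
// while col arrives from R as a 1-based column number.
std::vector<std::string> positions(int n_values, int col) {
  assert(col >= 1);  // R passes a 1-based column number
  std::vector<std::string> out;
  for (int i = 0; i < n_values; ++i) {
    TokenSketch t{i - 1, col - 1};  // same offsets as the Token(...) call
    out.push_back(format_position(t));
  }
  return out;
}
```

Under these assumptions, positions(1, 1) gives "[0, 1]" for the single "potato" value, and positions(3, 1) gives "[0, 1]", "[1, 1]" and "[2, 1]", matching the positions seen in the warnings above.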

technocrat · September 22, 2023, 12:13am

FWIW: C/C++ and relatives are indeed 0-indexed. Because pointers.
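
A tiny sketch of the "because pointers" part: in C and C++, base[i] is defined as *(base + i), so the first element sits at offset 0 from the start of the array. The helper name element_at() is made up for illustration.

```cpp
#include <cstddef>

// Returns the element i positions past the start of the array.
// base[i] means *(base + i), so index 0 is the element at the base address.
int element_at(const int* base, std::size_t i) {
  return *(base + i);
}
```

For an array int values[] = {10, 20, 30}, element_at(values, 0) is the first element and element_at(values, 2) is the same as values[2].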
