Package data needs to be interacted with before being accessible

wvictor14 · January 23, 2023, 12:34am

Looking for some clarity on some behaviour related to data in R packages that I don't understand.

I have written some tests for an R package I'm developing. The tests rely on some data that ships with the package. The data was added with usethis::use_data(), so LazyData: true is set in DESCRIPTION and the corresponding .rda files are present under data/.

From my understanding, lazy data loading means that when my package is loaded (e.g. via library() or devtools::load_all()), I should be able to access my data as soon as I refer to it. However, what I'm finding is that my data needs to be interacted with before I can use it. Let me demonstrate:

Here is an example of a test that tests the extract_gene function:

test_that("extract_gene works", {
  metadata <- extract_gene(metadata = metadata, expr = counts, genes = 'FOXP3')
  expect_true('FOXP3' %in% colnames(metadata))
})

where metadata and counts are data objects exported by my package. In the test environment my package is loaded, so I expect to be able to refer to them the way I've written. But the tests fail, with error messages indicating that the data is missing.

However, if I add a line above my function that interacts with the data, the test will pass:

test_that("extract_gene works", {
  dim(counts) # does not print
  metadata <- extract_gene(metadata = metadata, expr = counts, genes = 'FOXP3')
  expect_true('FOXP3' %in% colnames(metadata))
})

This runs without error. Notably, though, the dim(counts) call does not print anything. It doesn't produce an error, but it seems clear to me that it doesn't actually run. I think it's because the data is "invisible" until interacted with once.

I don't understand this behaviour at all. But for now my workaround is to add some "filler" calls to my data in every test so that the rest of my tests can "see" the data.
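
A side note on the missing output, as a minimal sketch using the built-in mtcars dataset rather than the package's counts object: R only auto-prints values at the top level of the console, so a bare dim(counts) inside a test_that() block would not be expected to print anything even when it is evaluated. The run_block() helper below is hypothetical and only stands in for a test body.

```r
# Hypothetical stand-in for the body of a test_that() block.
run_block <- function() {
  dim(mtcars)         # evaluated, but the value is discarded and never printed
  print(dim(mtcars))  # an explicit print() is needed to see the value
  invisible(NULL)
}

# Capture everything the block writes to the console.
out <- capture.output(run_block())
out
```

Only the explicit print() call contributes to out, so silence alone does not show that the first line was skipped.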

technocrat · January 23, 2023, 3:31am

I'm not a package developer, so grain of salt, etc.

I've noticed that some datasets, like mtcars, are always available just by invoking them, as in

mtcars

Others seem to depend on what was done during packaging.

library(titanic)
# primed
head(titanic_test)
#>   PassengerId Pclass                                         Name    Sex  Age
#> 1         892      3                             Kelly, Mr. James   male 34.5
#> 2         893      3             Wilkes, Mrs. James (Ellen Needs) female 47.0
#> 3         894      2                    Myles, Mr. Thomas Francis   male 62.0
#> 4         895      3                             Wirz, Mr. Albert   male 27.0
#> 5         896      3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0
#> 6         897      3                   Svensson, Mr. Johan Cervin   male 14.0
#>   SibSp Parch  Ticket    Fare Cabin Embarked
#> 1     0     0  330911  7.8292              Q
#> 2     1     0  363272  7.0000              S
#> 3     0     0  240276  9.6875              Q
#> 4     0     0  315154  8.6625              S
#> 5     1     1 3101298 12.2875              S
#> 6     0     0    7538  9.2250              S

library(cluster.datasets)
# unprimed
head(acidosis.patients)
#> Error in head(acidosis.patients): object 'acidosis.patients' not found
data("acidosis.patients") |> head()
#> [1] "acidosis.patients"

Created on 2023-01-22 with reprex v2.0.2

So, there's a difference in packaging that creates the difference in when they come into namespace. That seems to be explained by

The data subdirectory is for data files, either to be made available via lazy-loading or for loading using data() . (The choice is made by the ‘LazyData’ field in the DESCRIPTION file: the default is not to do so.) It should not be used for other data files needed by the package, and the convention has grown up to use directory inst/extdata for such files.

Writing R Extensions.
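
One detail in the reprex above: data() loads the dataset as a side effect and returns only its name, which is why piping it into head() prints "acidosis.patients" rather than rows. A minimal sketch of the same pattern, using mtcars from the built-in datasets package so it runs without extra installs (the environment name env is just for illustration):

```r
# Load a dataset into a dedicated environment instead of the global workspace.
env <- new.env()
loaded <- data("mtcars", package = "datasets", envir = env)

loaded                                           # the dataset's name, not its contents
exists("mtcars", envir = env, inherits = FALSE)  # the object now lives in env
head(env$mtcars)                                 # the data itself
```

For a package without lazy data, the equivalent would be to call data() first and then use the object by name on the next line.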

wvictor14 · January 24, 2023, 6:20am

Hm, yes, that behaviour seems consistent with the description of LazyData: true. In titanic, LazyData is set to true, whereas in cluster.datasets it is not.

In my package, LazyData is set to true.

nirgrahamuk · January 24, 2023, 12:00pm

You mention that both metadata and counts are objects provided lazily by your package.
You find that you need some interaction with counts to make extract_gene (your function) work,
but to be clear: metadata doesn't need the same?
The fact that extract_gene takes counts under a parameter you have named 'expr' makes me suspect that you are using it for some sort of metaprogramming, whereas metadata is just being used as a plain object in the standard ways. Probably there is some issue with the evaluation of expr given the lazy loading. This might make you decide to continue to load metadata lazily but counts eagerly, or else to experiment, as you are doing, with forcing evaluation of counts...

just thoughts.

wvictor14 · January 26, 2023, 6:05pm

Thank you for the thoughts.

I figured it out. Sorry I can't share this package for others to see.

Yes @nirgrahamuk, the behaviour is exclusive to the counts object and does not affect metadata. That was the key bit of info. The counts data is an object of class "dgCMatrix" (sparse matrix), which requires the package "Matrix". Even though I have Matrix under Imports, and I can see that Matrix is being loaded when my package is loaded, this causes this weird behaviour where the object needs to be interacted with twice.

I don't fully understand why this occurs, but my solution (which is really a workaround), was to convert the sparse matrix into a normal matrix. Luckily my data is small enough that this wasn't an issue.
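
A minimal sketch of that workaround, assuming the Matrix package is installed. The toy matrix and its gene and cell names are made up for illustration and are not the package's real data; the final use_data() call is commented out because it only makes sense inside a package project.

```r
library(Matrix)

# Toy sparse count matrix (hypothetical values), genes in rows, cells in columns.
counts_sparse <- sparseMatrix(
  i = c(1, 2, 2),
  j = c(1, 1, 3),
  x = c(5, 2, 7),
  dims = c(2, 3),
  dimnames = list(c("FOXP3", "CD4"), c("cell1", "cell2", "cell3"))
)
class(counts_sparse)  # S4 class defined by the Matrix package

# Convert to a base R matrix, which has no dependency on Matrix.
counts <- as.matrix(counts_sparse)
is.matrix(counts)

# Re-save the dense version as package data (run inside the package project):
# usethis::use_data(counts, overwrite = TRUE)
```

The dense copy stores every zero explicitly, so this only suits data small enough to fit comfortably in memory.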

At the moment I don't have time to investigate further, but may update this thread in the future when I eventually learn why...

Thanks both!
