I have a question about strategies for distributing a package whose data will exceed the size allowed by CRAN (the package contains a large amount of spatial data).
So far the only approach I have thought of is a pure data package on GitHub that is either a dependency or installed as part of the installation instructions. However, remotes::install_github() times out, even though I can download the repo's zip file from the browser without any issues.
Does anyone have strategies to deal with this type of situation?
When I was dealing with exactly the same use case - the package in question is RCzechia, which is on CRAN - I ended up storing the offending data remotely on AWS and downloading it via a generic downloader function.
Issues I had to address were:
graceful failure when internet resources are unavailable - a non-negotiable CRAN requirement; and since CRAN's servers are not that powerful and are heavily loaded, they often time out
caching the downloaded files on user machines - I ended up caching in tempdir() (i.e. once per session), but others have opted for more permanent caching (I believe {tigris} uses permanent caching). Again, this has implications with regard to CRAN policy
download methods - I work on Linux, so curl comes naturally to me, but a lot of users live on Windows, which is a world apart; there were some inconsistencies in download methods (see the sketch after this list)
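To make the above points concrete, here is a minimal sketch of such a generic downloader, assuming a hypothetical remote URL and file layout; the function name and URL are illustrative, not the actual RCzechia internals:

```r
# Sketch of a downloader with tempdir() caching and a graceful fail when the
# remote resource is unreachable. URL and names are placeholders.
download_data <- function(file, base_url = "https://example.com/data/") {
  cached <- file.path(tempdir(), file)          # cache once per R session
  if (file.exists(cached)) return(readRDS(cached))

  tmp <- tempfile(fileext = ".rds")
  res <- tryCatch(
    utils::download.file(paste0(base_url, file), tmp, mode = "wb", quiet = TRUE),
    error   = function(e) NA_integer_,
    warning = function(w) NA_integer_
  )
  if (is.na(res) || res != 0) {
    message("Data source is unavailable; please try again later.")
    return(invisible(NULL))                     # graceful fail, no error on CRAN
  }
  file.copy(tmp, cached)
  readRDS(cached)
}
```

Using mode = "wb" avoids the text/binary mismatch that bit me on Windows, and wrapping the call in tryCatch() keeps the package from erroring out during CRAN checks when the download fails.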
For #rnaturalearth I made 3 packages, 2 on CRAN and 1 not: rnaturalearth has the methods and small example data, rnaturalearthdata has medium-resolution data, and rnaturalearthhires has the high-resolution data and is hosted by @rOpenSci because it is too big for CRAN.
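From the user's side the split looks roughly like this sketch, assuming the hires package is installed from rOpenSci's r-universe repository (check the package README for the current location):

```r
# The two smaller packages come from CRAN; the large hires data package is
# installed from a non-CRAN repository (URL assumed here).
install.packages(c("rnaturalearth", "rnaturalearthdata"))
install.packages("rnaturalearthhires",
                 repos = "https://ropensci.r-universe.dev")

library(rnaturalearth)
countries <- ne_countries(scale = "large")  # "large" pulls from rnaturalearthhires
```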
Regarding permanent caching, it is explained in the Persistent config and data for R packages post on the R-hub blog: you can use the rappdirs package, or, if your package depends on R >= 4.0, tools::R_user_dir().
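A minimal sketch of that pattern, with "mypackage" and the download URL as placeholders:

```r
# Per-package permanent cache directory. tools::R_user_dir() needs R >= 4.0;
# rappdirs::user_cache_dir() is the alternative for older R versions.
cache_dir <- tools::R_user_dir("mypackage", which = "cache")
# cache_dir <- rappdirs::user_cache_dir("mypackage")   # pre-R 4.0 alternative
dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)

cached_file <- file.path(cache_dir, "large_dataset.rds")
if (!file.exists(cached_file)) {
  # Download once and reuse across sessions. Per CRAN policy, ask or at least
  # inform the user before writing outside tempdir().
  utils::download.file("https://example.com/large_dataset.rds",
                       cached_file, mode = "wb")
}
dat <- readRDS(cached_file)
```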