Analysis package naming, or can package name differ from .Rproj name?

jalsalam · June 6, 2018, 2:17pm

I am attempting to explore different approaches for packing up reproducible R analyses. For example, I am reading:

"Packaging data analytical work reproducibly using R (and friends)" from the PeerJ collection Practical Data Science for Stats

and the design docs that came out of the research compendia group at the 2017 rOpenSci unconf.

So one major recommendation is to put your analysis inside an R package. But R packages have fairly strict naming rules:

There are three formal requirements: the name can only consist of letters, numbers and periods, i.e., .; it must start with a letter; and it cannot end with a period. Unfortunately, this means you can’t use either hyphens or underscores, i.e., - or _, in your package name. I recommend against using periods in package names because it has confusing connotations (i.e., file extension or S3 method).

In addition, usethis::create_package and usethis::use_description error if used within a local folder/.Rproj file that does not follow this naming scheme.
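As a rough illustration of the naming rules quoted above, here is a minimal base-R sketch. Both helper names (`is_valid_package_name`, `to_package_name`) are hypothetical and not part of usethis or any other package; the regular expression encodes the three quoted rules plus R's requirement that a package name have at least two characters.

```r
# Hypothetical helper: TRUE when `name` uses only ASCII letters, digits and
# periods, starts with a letter, does not end with a period, and has at
# least two characters.
is_valid_package_name <- function(name) {
  grepl("^[A-Za-z][A-Za-z0-9.]*[A-Za-z0-9]$", name)
}

# Hypothetical helper: squash a presentation-style repo name into a candidate
# package name by dropping everything except ASCII letters and digits.
# The result can still be invalid, e.g. if the repo name starts with a digit.
to_package_name <- function(repo_name) {
  tolower(gsub("[^A-Za-z0-9]", "", repo_name))
}

repo <- "My-Analysis-Of-Specific-Phenomena-In-Specific-Place"
is_valid_package_name(repo)                   # FALSE: hyphens are not allowed
is_valid_package_name(to_package_name(repo))  # TRUE, but hard to read
```

The squashed name passes the check, which shows the trade-off discussed in this thread: it is valid but much less readable than the hyphenated repo name.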

For highly re-usable packages that are destined for CRAN, the space of short names without word-delimiters seems sufficient, but I think analysis packages are likely to chafe against these requirements. For presentation purposes, I want to name my GitHub repo something like My-Analysis-Of-Specific-Phenomena-In-Specific-Place. I could just take out all the hyphens, but that results in very ugly names.

So my question:

Can the package name in DESCRIPTION differ from the .Rproj/local folder/GitHub repo name?

If so, I assume the best place to make the cut is still to have .Rproj == local == GitHub != package name, but I suppose there might be another way to do it. Will problems come up with package tooling such as devtools/testthat/roxygen if I do this?

jalsalam · June 6, 2018, 3:14pm

As a follow-up, to partially answer my own question: based on in-the-wild observation, it seems that YES, the package name can differ from the repo name. In particular, two canonical examples of compendia mentioned here have this property:

I am still wondering if there are pitfalls I should be aware of with this sort of name difference.

jennybryan · June 6, 2018, 4:41pm

Technically, you can have different names. But from a human point of view, it ends up being very confusing. I think you will be a much happier person if you can find a way to make all the names coincide (directory name, project name, package name, git/github repo name).

jalsalam · June 6, 2018, 8:45pm

I figured that would be the answer. Do you have any specific things that you have seen go wrong?

I honestly wasn't cherry-picking when I came across the examples from Carl Boettiger and Ben Marwick. Dashes are very popular in GitHub repo naming. I've seen it said that to make a project into a package all you need to do is add a DESCRIPTION file, but if you also need to rename your project and repo, that seems like at least a little bit of a barrier.

jennybryan · June 6, 2018, 8:58pm

No, really just the confusion and friction around finding the thing locally and on GitHub. I always seemed to search or hope for autocomplete on the wrong name. I don't have the hands-on experience with "package as data analysis receptacle" that @benmarwick and @cboettig do, though, so they are better people to ask.

benmarwick · June 6, 2018, 10:11pm

This is a good question, and one I've mostly avoided because there was no obvious (to me) best option, and I wasn't really sure what to do. I struggled over:

1. Having a CRAN-compliant name that is shared by the GitHub repo, the project, and the pkg, but might be inscrutable and confusing to others because it's so short and lacking punctuation, or
2. Having a longer, more natural and reader-friendly name for the GitHub repo and the RStudio project, but a different, CRAN-compliant name for the pkg (like the one cited above), and hoping that doesn't cause confusion because of the non-matching names, like @jennybryan noted above

There don't seem to be any technical reasons to prefer one approach over the other, so it's mostly a matter of style, and where to emphasize user-friendliness, I think.

Reflecting on my more recent work, it seems that I've settled pretty much on the first option, harmonizing the repo-proj-pkg names to be the same, for example https://github.com/benmarwick/ktc11, https://github.com/benmarwick/mjbnaturepaper, & https://github.com/benmarwick/datacitation

I think Jenny's observation about confusion and friction is the same factor that led me to settle on harmonizing the names in my recent work. The cognitive burden of getting back into a project after some time away is real! Now I let the README do the work of deciphering to the reader the inscrutable short repo-proj-pkg names. Thanks to the GitHub UI these are prominent and I think most people are used to looking at the top of README for making sense of a repo.

cboettig · June 7, 2018, 12:02am

Good question. My short answer is that the package name for all my compendia is simply compendium so I don't have to come up with a unique CRAN-compliant name. Here's my reasoning:

I often create compendium-style projects that don't have any functions in an R/ directory, and consist only of .Rmd files and a DESCRIPTION; e.g. https://github.com/cboettig/noise-phenomena . In such cases, the DESCRIPTION serves only as a place to keep basic metadata and manage dependencies; i.e. you (and Travis CI) can devtools::install() the compendium as a way of installing the dependencies (including use of Remotes from GitHub, as in that example), but you would never load the package with a library() call (since it doesn't provide functions anyway). As such, the package name is rather moot -- you'll see that example simply says Package: compendium, just like my template: https://github.com/cboettig/compendium/ .
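To make the "DESCRIPTION as metadata only" idea concrete, here is a minimal base-R sketch that writes such a file into a hyphenated project directory and reads the name back. The directory name, title, version, and dependency list are all illustrative assumptions, not copied from the repositories linked above.

```r
# Project directory named like a hyphenated GitHub repo (hypothetical name).
project_dir <- file.path(tempdir(), "noise-phenomena-demo")
dir.create(project_dir, showWarnings = FALSE)

# Minimal metadata; every field value here is illustrative.
desc <- data.frame(
  Package = "compendium",
  Title   = "Example Analysis Compendium",
  Version = "0.0.1",
  Imports = "knitr, rmarkdown",
  stringsAsFactors = FALSE
)
desc_path <- file.path(project_dir, "DESCRIPTION")
write.dcf(desc, file = desc_path)

# The package name comes from the Package field, not from the folder name.
pkg_name <- unname(read.dcf(desc_path, fields = "Package")[1, 1])
dir_name <- basename(project_dir)
pkg_name == dir_name  # FALSE: the two names are independent
```

From a directory like this, the `devtools::install()` step described above would pick up the dependencies listed under `Imports`, while the folder and repo keep their readable hyphenated name.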

Maybe this is just lazy and having a bunch of compendia all called compendium is a bad idea, but as they don't contain any functions or NAMESPACE anyway, there are no actual namespace collisions. So I try to follow this model for things that are strictly a 'compendium' (no functions in R/, not intended to pass R CMD check). When I do need custom functions, I tend to put those separately in a proper package that has tests, some docs, and can pass checks (and thus has to have a decent package name). This also helps me separate things that are 'just compendia' from 'proper packages', i.e. it's okay if future work depends on/imports an actual package, but I'd rather not have compendia importing functions from a previous compendium.

Not sure if that made any sense or is ill advised, but would appreciate thoughts either way!

jalsalam · June 7, 2018, 1:09pm

Interesting! Your approach is quite different from what I was imagining. My main interest in moving towards analysis-as-package is so that I can better use testing and function documentation.

I have tried putting the 'analysis' in one project and the functions in a package in another project, but during development I have found that awkward: I constantly have to keep two instances of RStudio open (one for the analysis, one for the package functions), and for any change to a function I have to rebuild/install in the package instance, then restart the R session and reload the package in the analysis instance. Too slow -- much better to be able to Ctrl-Shift-L.

It seems like the cleanest thing will be for me to start thinking about package naming rules when I am starting a new analysis, as Ben and Jenny suggest.

cboettig · June 7, 2018, 5:47pm

@jalsalam Yes, that is an excellent point. I agree that switching windows like that feels suboptimal and I also try and avoid it.

I have found that early in an exploratory analysis (i.e. when you are making lots of changes to both the "package functions" and the "analysis"), having complete documentation and tests for "package functions" is overkill, as the analysis part leads me to continually refactor those functions to take different arguments etc., and thus I would need to rewrite the documentation and tests as well. Rather, I let these functions just live at the top of my .Rmd notebook or as an R script in my notebook dir while they are really in flux. As functions get to the point where they are more stable -- i.e. I no longer feel the need to have an editor open to both the function definition and the analysis -- then I feel it is time to move the function into a package with tests and documentation. Sometimes this is indeed the current repo (at which time I'll need to come up with a package name), and so the repo gains a name, an R/, and is "promoted" from "Compendium" to "Package". Other times, I start a fresh package in a fresh repo (which makes it easier to align GitHub name with package name of course).

Before I started splitting R/ functions into separate packages, I'd often have multiple package-style compendia with divergent versions of the same R functions, which became difficult to maintain (particularly as I increasingly will work on multiple compendia at the same time which all share some common functions). Not trying to say "my way is right / better" than any alternative, just wanted to share the context that led me to this pattern. I think it basically comes down to the complexity of the analysis:

1. Your analysis needs no custom functions, .Rmd alone is sufficient -> no need for R/, NAMESPACE, or a package name; you just have a 'compendium' but not a full 'package'.
2. Your analysis needs a few custom functions with documentation and testing -> use "analysis-as-package".
3. You have written extensive functions with careful documentation that you will reuse across multiple analyses -> put those in a separate package; individual analyses can live in a 'compendium'.

Clearly this can be iterated upon as well.