I usually use git to track changes to my projects, but I tend to .gitignore output figures and datafiles, since they can make the git repo very large.
However, it would be useful to know whether the output figures (or files) have changed, and I was wondering what the best way to do this is. I was thinking about generating and saving a hash of each output.
Does something like this seem like a good approach?
library(digest) # hash functions
library(purrr)
library(ggplot2)
# make a thing
set.seed(1)
thing <- data.frame(x = runif(10), y = runif(10)) %>%
  ggplot() +
  geom_point(aes(x = x, y = y))
# hash the thing: one hash per named component of the ggplot object
hash <- purrr::map_chr(thing, digest, algo = "xxhash32")
# check if the thing has changed from last time
thingfile <- "thing.png"
hashfile <- "thing_hash.rds"
if (!file.exists(hashfile)) {
  ggsave(thingfile, thing)
  saveRDS(hash, hashfile)
} else {
  existing <- readRDS(hashfile)
  if (isTRUE(all.equal(hash, existing))) {
    print(paste(hashfile, "unchanged"))
  } else {
    print(paste(hashfile, "changed!"))
    ggsave(thingfile, thing)
    saveRDS(hash, hashfile)
  }
}
Unfortunately, this GitHub issue seems to suggest this approach won't work very well.
The ggplot object includes a link to the R environment, since it sometimes has to access objects that are only resolved at print time. This is stored in the plot_env slot; however, it captures the whole environment, not just the parts the plot needs. So changes to the environment can change the hash even when the plot itself has not changed.
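You can take the env slot out of the comparison by blanking its entry in the hash vector (using [[ rather than $, since $<- on a character vector coerces it to a list):
hash[["plot_env"]] <- ""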
but this could mask changes that affect the plot.
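As a rough sketch of that idea, the component hashes can be computed with plot_env skipped up front, and the comparison can report which components differ. The helper names hash_components and changed_components are hypothetical, and the sketch assumes the object can be treated as a named list (as the map_chr call above does); this may not hold for every ggplot2 version.

```r
library(digest)

# Hypothetical helper: hash every named component of a list-like object,
# skipping the ones listed in `skip` (plot_env by default).
hash_components <- function(obj, skip = "plot_env", algo = "xxhash32") {
  keep <- setdiff(names(obj), skip)
  vapply(obj[keep], digest, character(1), algo = algo)
}

# Hypothetical helper: names of components whose hash was added,
# removed, or changed between two runs.
changed_components <- function(new, old) {
  all_names <- union(names(new), names(old))
  is_diff <- vapply(all_names, function(n) {
    !(n %in% names(new)) || !(n %in% names(old)) || new[[n]] != old[[n]]
  }, logical(1))
  all_names[is_diff]
}

# Stand-in objects: a plain list is used here instead of a real plot.
a <- list(data = data.frame(x = 1:3), plot_env = new.env())
b <- list(data = data.frame(x = 1:3), plot_env = new.env())
assign("junk", 1, envir = b$plot_env)   # environment differs, data does not
d <- list(data = data.frame(x = 4:6), plot_env = new.env())

changed_components(hash_components(a), hash_components(b))  # no components
changed_components(hash_components(a), hash_components(d))  # "data"
```

The caveat above still applies: anything the plot only looks up in its environment at print time is invisible to these hashes.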