Version control with Google Drive

Brett-Johnson · January 9, 2018, 12:23am

I've experimented with using Google Drive and GitHub with my team (a small ecological research team) for version control and collaboration. I've found that both have their uses, and I'm keen to share how I've been doing it so that I can hear how others are doing things and whether I'm on the right track.

I initially started off committing everything I worked on to GitHub in different subfolders of the same repo. All of my internal analyses that aren't meant for a public report or peer-reviewed paper went into different folders in the same general 'internal' private repo. This worked all right when it was just me using the repo. But when I brought a co-worker into the mix, we soon realized what a pain it actually is to try to collaborate on GitHub on a day-to-day basis. We were spending a lot of time messing around with merge conflicts and all sorts of other unintuitive issues. We felt GitHub was cumbersome for day-to-day internal analysis collaboration.

So now I would like to move back to simply using Google Drive for internal analyses. Google Drive is great for version control (especially now that you can 'name versions' in Google Drive, similar to a GitHub commit). I sometimes rely on the revision history of Google Drive to actually roll back a script, because it's way more intuitive than doing that in Git. Not to mention that every time you save your script, it gets an unnamed version in Google Drive, so the chances of not losing your work are actually greater using Google Drive. Google Drive allows you to share all the files and data you need, and using the here package we shouldn't have to worry about working directories.

I think GitHub is useful for presenting analysis in an open science context for public communication artifacts, whether that's a paper, a poster, a presentation, or a dashboard. Using the fork and pull method, external collaborators or reviewers can fork your repository, make changes, and send you a pull request. In all likelihood that won't happen much. The biggest benefits of putting my analysis into public repos on GitHub are that it adds an additional level of peer review, it shows there's nothing to hide in my analytical methods, and other people can build on my work.

So in summary:

For internal analysis: We use Google Drive combined with RStudio projects and the here package, and load data from within our Google Drive folder. Doing this we can achieve easier collaboration and maximal portability of scripts.
For public analysis: We use GitHub for analyses related to science communication artifacts. Scripts there load data from citable data packages or our internal database via an API.

Does anyone else use a similar workflow? Are there any disadvantages to this that I may not see?

Thanks!

jennybryan · January 9, 2018, 6:33am

Looks like you've formed a very thoughtful approach based on experience! I agree that GitHub and Google Drive both have their uses.

I do have one observation about your frustrating experience with GitHub for collaboration: I think you picked an unusual model for a group like yours. What you used is called a monorepo. Tech giants like Google and Facebook do this, but they've got specific reasons and a lot of custom tooling for it!

I think most civilians like us should have one repo per logical project. One day you might want to experiment again with GitHub for well-defined, internal, collaborative projects. It's a very popular approach for good reason. But, again, use whatever works!

hughparsonage · January 9, 2018, 1:37pm

I don't see merge conflicts as a pain: I see them as the reason to use git. In fact I would only consider working outside GitHub if I didn't experience merge conflicts. What you may be overlooking is the importance of source control.

What I mean by source control (as distinct from version control) is some assurance about your source code. In particular, the ability to monitor any change, who made the change, and why -- both now and throughout the history of your project. Right now I can go to any line in any GitHub project and work out what commit brought it there. Obligatory xkcd, but even in practice the commit message and blame history can be quite informative as to whether to change the existing code.
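
As a rough illustration of tracing a line back to the commit that introduced it, here is a minimal sketch in a throwaway repository. The file name, author identity, and commit messages are made up for the example; only `git blame` and `git log` are the point.

```shell
#!/bin/sh
set -e

# Hypothetical throwaway repository, created only for this illustration.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.org"

# First commit: the initial version of a made-up script.
printf 'x <- 1\ny <- 2\n' > analysis.R
git add analysis.R
git commit -q -m "Add initial analysis script"

# Second commit: change only the second line.
printf 'x <- 1\ny <- 20\n' > analysis.R
git commit -q -am "Correct y after re-checking field notes"

# Who last changed line 2, and in which commit?
git blame -L 2,2 analysis.R

# Look up the message of the commit that line 2 came from.
blame_sha=$(git blame -L 2,2 --porcelain analysis.R | head -n 1 | cut -d ' ' -f 1)
blame_subject=$(git log -1 --format=%s "$blame_sha")
echo "$blame_subject"
```

On GitHub the same information is available through the "Blame" view of a file, so the command line is not required to get it.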

The fact that you and your partner were getting merge conflicts is a sign you made different decisions on the same parts of your project. That git alerts you to that conflict is far preferable to just keeping the last change without either person knowing about the alternative (unless the change is trivial, though genuinely trivial changes are in my experience rarer than superficially trivial changes).

It's possible your team should be using pull requests more or, as Jenny alluded to, dividing logically distinct projects into separate repositories.

jennybryan · January 9, 2018, 5:04pm

I've also noticed that many Git newcomers sort of half embrace Git: they start using it but also keep relying heavily on Word, Excel, PDF, etc. And these binary files are utterly beyond the automagic git merge machinery, so they have lots of merge conflicts. They end up with the worst of both worlds. So sometimes people are having lots of nuisance merge conflicts for these reasons, and resolving them constantly is a drag and not actually productive. The real solution is to redesign the workflow and use, say, Google Drive for things that still need to be Docs, Sheets, etc.

tiernan · January 10, 2018, 12:06am

My approach is pretty similar:

Use GitHub for storing/versioning script files (one repo per project)
Use Google Drive for storing data (original, interim, and final products) and communication documents (.Rmd's, pdfs, etc.)

It isn't the cleanest setup but it has worked for me so far. I use the googledrive package to access my Drive account as a "poor man's database".

I haven't had many project collaborators, but I do work on three different machines (work, personal, and cloud) so it's helpful to have access to the scripts (via git pull) and data (via googledrive::drive_download() and drive_upload()) from anywhere.

Brett-Johnson · January 10, 2018, 11:40pm

Thank you for the thoughtful reply. I've always kind of struggled with the one repo per logical project idea. The organization I work for has many different research programs, one of which I manage. So from the organization's perspective my research program is one logical project, so we get one repo.

Obviously we have many different projects within our program, however. We might be writing several papers that each require a specific data analysis, and meanwhile I may put together some slides using .Rmd and ioslides. I want the code for all these to be public. Are those all what you would consider separate logical projects?

And then there's the question of what to do with all those random bits of code you write, for example, to produce a sub-sample list of samples to send off for analysis. You want to record how you came to that decision somewhere (in the form of a script) but it doesn't really belong to a project. Perhaps this is where simply putting these files in Google Drive in a 'sub-sample lists' folder is sufficient.

I understand now that merge conflicts are the reason for using git. My understanding is certainly still developing, and what I think I was experiencing wasn't precisely merge conflicts per se. Rather, whenever I did a pull, I would get an error message because he or I had removed some files from the giant monorepo, or changed files that weren't text files, as you mention. Also, we were experiencing many working directory issues. Luckily my computer wasn't burned down, and I should be safe now that I've heard of the here package.

I think there is some utopian hybrid of using Google Drive for certain things (Sheets, Docs, Slides, random R scripts) and GitHub for others (logical projects, collaborative projects, peer-reviewed analyses). Getting closer to the ultimate workflow, but certainly still developing...

tiernan · January 11, 2018, 12:21am

The example you provided sounds like it could be wrapped up in a function (maybe?) and documented in a package instead of a standalone script. That's an approach that some people have found successful when building modules for shiny applications and it might have some value for you as well.

jennybryan · January 11, 2018, 5:04am

Yep. I frequently have one repo per talk.

Sounds like "pull, then push" might be a good mantra. Yeah, this doesn't sound like merge conflicts (which is great news) but something simpler. There are changes on master that you don't have. So you need to pull and merge them before you can push. It's a good habit in general.
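
To make the "pull, then push" habit concrete, here is a small sketch that can be run locally. Everything in it is hypothetical: a bare repository stands in for GitHub, and the two clones, user names, and file names are invented for the example. It assumes the two people edited different files, so the pull merges cleanly.

```shell
#!/bin/sh
set -e

# A bare repository stands in for the shared GitHub remote (assumption).
work=$(mktemp -d)
cd "$work"
git init -q --bare remote.git

# First collaborator creates the initial commit and publishes it.
git clone -q remote.git alice 2>/dev/null
cd alice
git config user.name "Alice"
git config user.email "alice@example.org"
branch=$(git symbolic-ref --short HEAD)   # "master" or "main", depending on git config
echo "library(here)" > clean.R
git add clean.R
git commit -q -m "Add cleaning script"
git push -q origin "$branch"
cd ..

# Second collaborator clones, adds a file, and pushes.
git clone -q remote.git bob
cd bob
git config user.name "Bob"
git config user.email "bob@example.org"
echo "plot(1:10)" > figures.R
git add figures.R
git commit -q -m "Add figure script"
git push -q origin "$branch"
cd ../alice

# Meanwhile the first collaborator commits without having pulled.
echo "summary(cars)" > model.R
git add model.R
git commit -q -m "Add model script"

# Pushing now is rejected: the remote has a commit this clone lacks.
if git push -q origin "$branch" 2>/dev/null; then
  first_push="accepted"
else
  first_push="rejected"
fi
echo "push before pull: $first_push"

# Pull (fetch + merge) first, then push again.
git pull -q --no-rebase --no-edit origin "$branch"
git push -q origin "$branch"
second_push="accepted"
echo "push after pull: $second_push"
```

The rejected push is not a merge conflict. Git is only refusing to overwrite history it has not seen locally, and the pull resolves that without any manual editing here.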

Another paper that might help you think about dividing work into units is "Good Enough Practices in Scientific Computing" (Wilson et al., 2017), specifically the Project Organization section:

As a rule of thumb, divide work into projects based on the overlap in data and code files. If 2 research efforts share no data or code, they will probably be easiest to manage independently. If they share more than half of their data and code, they are probably best managed together, while if you are building tools that are used in several projects, the common code should probably be in a project of its own. Projects do often require their own organizational model, but below are general recommendations on how you can structure data, code, analysis outputs, and other files. The important concept is that it is useful to organize the project by the types of files and that consistency helps you effectively find and use things later.

hao_ye · January 12, 2018, 10:11pm

I think one option for the "multi-project, one-repo" setup is to have not only internal folders but also separate branches. Collaborators can still work together to contribute to a single branch, and then when the work is ready to share or release, you can use Pull Requests to merge the changes into master.
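
A minimal local sketch of that idea follows. The branch name `paper-a`, the folder layout, and the file contents are hypothetical, and the final `git merge` is only a stand-in for merging a Pull Request on GitHub.

```shell
#!/bin/sh
set -e

# Hypothetical program-wide repository, created only for this illustration.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.org"

echo "# Research program" > README.md
git add README.md
git commit -q -m "Start program repository"
main_branch=$(git symbolic-ref --short HEAD)   # "master" or "main", depending on git config

# Work on one project happens on its own branch, in its own folder.
git checkout -q -b paper-a
mkdir paper-a
echo "fit <- lm(y ~ x, data = d)" > paper-a/analysis.R
git add paper-a/analysis.R
git commit -q -m "Add analysis for paper A"

# Back on the default branch, the project's files are not visible yet.
git checkout -q "$main_branch"
if [ -e paper-a/analysis.R ]; then
  before_merge="present"
else
  before_merge="absent"
fi
echo "before merge: $before_merge"

# Stand-in for merging a Pull Request once the work is ready to share.
git merge -q --no-ff -m "Merge branch paper-a" paper-a
ls paper-a
```

Note that while `paper-a` is unmerged, its files only exist in the working directory when that branch is checked out, which is the trade-off to weigh if two projects need to be open side by side.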

jennybryan · January 12, 2018, 10:24pm

Good point! Branches are useful for some things and frustrating for others. If you often want to see/access two things at the same time, it's wildly frustrating for them to be in different branches. But if your use case passes that test, then yeah, I agree this is another nice option.

spncrfx · January 12, 2018, 10:42pm

This is a great discussion. The move from Drive to GitHub has really made me think about project design and setup, and I find that my folders are much cleaner in GitHub repositories because of that thought. However, I'm wondering how people deal with the snippets of code mentioned above that form along the way of a project, but that aren't really part of the published result. I know I have a bunch of scripts sitting in my folders that are not currently being tracked by GitHub. Is the answer just to track all code produced, or is there a cleaner way to do it?

Andrea · January 12, 2018, 10:56pm

That's an option. Or, you could just send Word, Excel and the lot to the grinder, and use R Markdown, csv files, etc. which can all be automagically merged.