This is my first topic. I would like to ask how to proceed with reproducible research. I took a look at the video tutorial about using Git and RStudio, since I want to make my research reproducible.
However, some time ago I had problems committing large data files to Git. I also think they might not need to be committed, since they are not expected to change, and if I do change something, I do it with code that will be in the Git repo anyway...
I would like to know the best way to handle large data files with Git, and what the recommended workflow is for reproducible research.
There is no good way to handle big files in Git. I would go even further and say that you should never commit anything large into Git, since it bloats the repository and slows everything down.
Unfortunately, as far as I know, there is no "good" solution for that, just a LOT of different ones - https://github.com/EthicalML/awesome-production-machine-learning#model-and-data-versioning. I haven't used all of them personally, but they all seem to add a bit of friction to the process. Of the ones I know, DVC looks to be built around the idea of "git for data", but your mileage may vary.
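If you want a feel for it, here is a minimal sketch of the basic DVC workflow as I understand it; the file name data/raw.csv and the remote path are just placeholders:

# initialize DVC inside an existing Git repository
dvc init
# track the large file; DVC writes a small data/raw.csv.dvc pointer file
# and adds the real file to .gitignore
dvc add data/raw.csv
# commit the pointer and .gitignore, not the data itself
git add data/raw.csv.dvc .gitignore
git commit -m "Track raw data with DVC"
# point DVC at some shared storage (S3, SSH, a shared folder, ...) and push the data
dvc remote add -d storage /path/to/shared/storage
dvc push

A collaborator would then run git clone followed by dvc pull to fetch the data.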
The ultimate goal is reproducible research: somebody else can easily take your project and recreate your results.
Git is a very useful tool for doing this with code. But it's not the only tool, and sometimes it's the wrong tool. Like Mishabalyasin says, Git is the wrong tool for large data files. It stores compressed copies of files, with an additional copy each time the file changes.
You got it. If somebody can follow and reproduce your work by downloading your Git repo and a separate data file, then you've succeeded. Using version-controlled scripts to munge the data is a great practice.
If you've already committed the file and want to wipe it from your Git repo's history, you can remove every instance of it following the directions in the Pro Git guide: Removing a File from Every Commit. Note that you should make a back-up of your repo before trying this, in case something goes wrong.
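For reference, that chapter's recipe is based on git filter-branch; a rough sketch, with my-repo and data/big_file.csv as placeholders for your repo and the file you want to purge, looks like this:

# work on a copy, so back up the whole repository first
cp -r my-repo my-repo-backup
cd my-repo
# rewrite every commit, removing the file from the index wherever it appears
git filter-branch --index-filter \
  'git rm --cached --ignore-unmatch data/big_file.csv' \
  --prune-empty -- --all
# afterwards you have to force-push the rewritten history
git push origin --force --all

Anyone else who has cloned the repo will need to re-clone it after the rewrite.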
(Which doesn't even get into the rOpenSci packages that help with accessing data from repositories like FigShare or DataONE.)
You might get some interesting answers if you posted your situation (in as much detail as possible) over on the rOpenSci community discussion site. If you try that, be sure to drop a link here so others with similar questions can follow in your footsteps!
I've used Git LFS to version large files with Git and GitHub. The basic idea is that instead of versioning the big data files themselves, it versions a small plain-text pointer file that contains a hash. The large file is uploaded to a server and indexed by that hash. Then, when a user runs a Git command like git clone or git checkout, only the version of the data file that corresponds to the hash is downloaded.
Unfortunately, parts of the process can be rough:
To download the actual data file (and not just the placeholder with the hash), the user has to have Git LFS installed and configured on their machine.
GitHub has strict limits on the amount of storage and bandwidth when using Git LFS on their servers. Thus, if you have many large files, you have to pay for additional storage and bandwidth.
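For reference, the setup looks roughly like this (the *.csv pattern and the file name are just examples):

# one-time setup of Git LFS on your machine
git lfs install
# tell LFS which files to manage; this records the pattern in .gitattributes
git lfs track "*.csv"
git add .gitattributes
# matching files are now stored in Git as small pointer files
git add data/big_file.csv
git commit -m "Track large CSV files with Git LFS"
git push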
But I agree with your assessment: if your large data files never change, there is no need to version them. You can provide a script that downloads the big data files, so the setup to reproduce the analysis could look something like:
git clone https://github.com/username/repository.git
cd repository/
Rscript download.R
Rscript run.R
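For what it's worth, download.R can be as simple as a base-R download.file() call; the URL and file name below are just placeholders:

# download.R: fetch the raw data if it is not already present
data_dir <- "data"
data_file <- file.path(data_dir, "big_file.csv")

if (!dir.exists(data_dir)) dir.create(data_dir)

if (!file.exists(data_file)) {
  download.file(
    url      = "https://example.com/path/to/big_file.csv",  # placeholder URL
    destfile = data_file,
    mode     = "wb"
  )
}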