Project management with Google Drive, R, github, and here()

I tend to structure a new project as:
/All_projects/New_project/Code/Github_project_name/Github_project_name.rproj
/All_projects/New_project/Code/Github_project_name/R/scripts.R
/All_projects/New_project/Data/lots of data not synced with github
/All_projects/New_project/Outputs/outputs not synced with github

This allows team members to sync files with my corporate google drive account, and avoids loads of data and other misc crap being synced through github (space issues, format nagging, etc). So far, so good.

One minor wrinkle is here(). I want to use it, I want to love it, but I can't get into it. The docs suggest you can create a .here file and then use here() twice to override the top level directory (i.e. /New_project rather than /New_project/Code/Github_project_name/) but there doesn't seem to be a clear way of doing that, and the docs suggest here() will look upwards until it's satisfied. Since it's already in its R project folder, it's already satisfied.

  1. Can I overrule this and use here() better, so that I can then save files to Outputs, import data from Data, etc, without starting each script by specifying loadloc, the location of the root on my system (/home/me/Documents/etc/etc/All_projects/New_project) and having everyone else make their own line for this, before we can all then use file.path(loadloc, "Data") or whatever?

  2. Is there a better project organisational structure folks can recommend? Is there accepted best practice in this?

Thanks y'all!

This isn't a great answer because I can't really reprex the file structure, but I'll try to paste it here. If you make New_Project an R project as well, you can call here::i_am() to have here::here() point to this location.

> here::i_am("code/github_repo/R/dir_structure.R")
here() starts at /Users/.../Projects/new_project
> fs::dir_tree("../")
../
└── new_project
    ├── code
    │   └── github_repo
    │       ├── R
    │       │   └── dir_structure.R
    │       ├── data -> ../../data #symlink, see comment below
    │       ├── github_repo.Rproj
    │       └── output -> ../../output #symlink, see comment below
    ├── data
    ├── new_project.Rproj
    └── output

# You could create a shortcut to your code like this so you can just use here(R)
> R <- here::here("code", "github_repo", "R")
> here::here(R)
[1] "/Users/.../Projects/new_project/code/github_repo/R"

Another option, which I use in practice when we have data we cannot share publicly within the repo, is to create symlinks. This keeps a lot of the benefits of having your .Rproj file within the repo where you're executing code. I just add symlinks to the data and output folders within the project folder. Then your references from here::here() remain purely relative to the repo .Rproj location. If you don't want to include those symlinks in your repo, just add them to your .gitignore.
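To make the symlink idea concrete, here is a minimal sketch in base R. It builds a throwaway copy of the tree above in a temporary directory (the folder and file names are placeholders, and my_data.csv is invented for the demo) and then links data and output into the repo. It assumes a Unix-like system; file.symlink() may need extra privileges on Windows.

```r
# Throwaway copy of the layout, built in a temp directory (names are placeholders).
project <- tempfile("new_project_")
repo <- file.path(project, "code", "github_repo")
dir.create(file.path(repo, "R"), recursive = TRUE)
dir.create(file.path(project, "data"))
dir.create(file.path(project, "output"))

# A small made-up data file that lives outside the repo.
writeLines(c("id,value", "1,10"), file.path(project, "data", "my_data.csv"))

# Create relative symlinks inside the repo: data -> ../../data, output -> ../../output.
# The link targets are stored as written, so they resolve relative to the repo folder.
old_wd <- setwd(repo)
linked <- file.symlink(
  from = file.path("..", "..", c("data", "output")),
  to = c("data", "output")
)
setwd(old_wd)

# The data is now reachable through the repo, even though it is stored outside it.
my_data <- read.csv(file.path(repo, "data", "my_data.csv"))
link_target <- Sys.readlink(file.path(repo, "data"))
```

With the links in place, a call such as here::here("data", "my_data.csv") made from the repo project should resolve through the symlink, which is the point of this option.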

Thanks so much for the reply Eric. Also nice work on the file structure code, I was envisioning the same and realised I didn't know where those tree structure characters were!

Ok so solution 1: make new_project an R project as well: if one opens this in RStudio & syncs with git, won't it automatically include and aim to sync the whole subfolder structure i.e. data, output, etc, but also the subordinate github_repo and github_repo.Rproj?

Unless I'm misunderstanding, and dir_structure.R isn't just an arbitrary name, but is a technical document format which explains the directory structure to here?

Either way, I'm confused as to how one should use this structure given the 2 R projects... maybe this is due to me expecting an R project also to be a github sync, which isn't necessarily true I suppose.

Option 2: symlinks: I use these liberally on my home/main linux machine for various things, but haven't tried in shared, cross-OS applications. If I create github_repo/data and make a symlink to ../../data, I'd expect that to work fine for me, but I wonder if it would work for others? Presumably the symlink pointer file would be synced via git... and they're simple old-school files... so maybe it just works?

Option 3: your .gitignore comment got me thinking: would a neat solution possibly be:

└── new_project (github_repo)
    ├── R
    ├── data
    ├── new_project.Rproj
    └── output

Then just add /data, /output, and any others, to .gitignore?
I suppose they wouldn't get automatically synced that way though, due to either Synology or gDrive (or both) rules which block syncing of .git folders, probably wisely to avoid breaking git. But this may be overridable...

Thanks for your time & brain. Sorry I'm not getting the dir_structure.R / 2 projects concept immediately!

Sorry about the dir_structure.R file. That was actually just the script holding the code I used to create that tree and demonstrate using here::i_am().

I think option 1 is probably the least efficient and most confusing for other users. From an RStudio perspective, the version control tab shows up when the directory holding the project is initialized as a version-controlled repository, e.g. a Git repository, which may or may not have remotes pointing to GitHub. So if you open the new_project .Rproj pointer, that won't have the Git tab for pushing changes. You'd have to open the github_repo.Rproj pointer to do that. The benefit of having the top-level new_project.Rproj pointer is that when you call here::i_am("code/github_repo/R/dir_structure.R"), as I was doing in that dir_structure.R file, it sets here::here() to point to the new_project level, so you can call here::here("data", "my_data.csv"), for example. It was basically a workaround to make here work for you in your current setup.
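The i_am() workaround can be tried out safely in a throwaway directory. The sketch below assumes the here package is installed. It recreates the two-project layout in a temp folder (all names are placeholders, my_data.csv is invented for the demo), works from inside the repo as you would with github_repo.Rproj open, and then checks where here::here() points.

```r
# Throwaway copy of the two-project layout (names are placeholders).
project <- tempfile("new_project_")
repo <- file.path(project, "code", "github_repo")
dir.create(file.path(repo, "R"), recursive = TRUE)
dir.create(file.path(project, "data"))
file.create(file.path(project, "new_project.Rproj"))
file.create(file.path(repo, "github_repo.Rproj"))
file.create(file.path(repo, "R", "dir_structure.R"))
writeLines(c("id,value", "1,10"), file.path(project, "data", "my_data.csv"))

# Work from inside the repo, as if github_repo.Rproj were the open project.
old_wd <- setwd(repo)

# Declare where the script sits relative to the root you want.
# here then searches upward from the working directory for a folder
# in which this relative path exists, and uses that folder as the root.
here::i_am("code/github_repo/R/dir_structure.R")

# Paths are now built from the new_project level, not from the repo.
data_path <- here::here("data", "my_data.csv")
my_data <- read.csv(data_path)

setwd(old_wd)
```

In a real script the i_am() call would go at the top, before any here::here() calls, so every collaborator gets the same root without hard-coding a machine-specific path.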

You bring up a good point with option 2 in that I'm not sure how that would behave across different OS platforms.

I think the easiest and most straightforward would be to migrate to option 3 and just ignore the data and output directories in your .gitignore. The only downside, which the symlink option alleviates, is that sharing data across projects becomes more difficult. However, if that doesn't come up for you, it's probably the most compact way to keep your projects organized in your cloud storage system without exposing any data in the GitHub remote.
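For option 3, the ignore entries can be added by hand in any editor, but a small base-R sketch may be handy if you set up projects often. add_to_gitignore() is a hypothetical helper written for this example, not a function from any package, and the folder names are placeholders. A leading slash anchors the pattern to the repo root and a trailing slash restricts it to directories.

```r
# Hypothetical helper: append entries to a repo's .gitignore,
# skipping any entry that is already listed.
add_to_gitignore <- function(repo_dir, entries) {
  path <- file.path(repo_dir, ".gitignore")
  existing <- if (file.exists(path)) readLines(path, warn = FALSE) else character()
  new_entries <- setdiff(entries, existing)
  if (length(new_entries) > 0) {
    writeLines(c(existing, new_entries), path)
  }
  invisible(new_entries)
}

# Demo in a throwaway folder standing in for the repo.
repo_dir <- tempfile("new_project_")
dir.create(repo_dir)

add_to_gitignore(repo_dir, c("/data/", "/output/"))
add_to_gitignore(repo_dir, c("/data/", "/docs/"))  # "/data/" is not duplicated

ignored <- readLines(file.path(repo_dir, ".gitignore"))
```

Note that .gitignore only affects untracked files. Anything already committed stays tracked until it is removed from the index.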

Ok so my own notes for future, and then report on how well it goes (!):

  1. Tidy up folder structure locally. In my case I like to prepend project folders with YYYY-MM so they're in order of when they started, in this case /2024-01_SharksFishCoral-FrenchPoly/.
  2. Create a new blank github repository with a similar name, in this case FIU-SharkFishCoral-FrenchPoly
  3. Git clone the repository into the local folder
  4. Drag all folders into the repository folder, i.e. /2024-01_SharksFishCoral-FrenchPoly/FIU-SharkFishCoral-FrenchPoly/R, and /Data, etc.
  5. Edit /2024-01_SharksFishCoral-FrenchPoly/FIU-SharkFishCoral-FrenchPoly/.gitignore. Under # User-specific files, add (files and) folders to not sync, in this case Docs/ , Nat_resources/ , NFF_data/ , Presentations/
  6. (save, close), git commit push w/ message in RStudio
  7. Check non-git syncing. Synology cloud sync says it's synced. gDrive ditto. Likely need to confirm this with other users.
  8. Check here() works: was:
setwd("/etc/2024-01_SharksFishCoral-FrenchPoly/NFF_data/")
site.order.df <- data.frame(read.csv("site_order_df.csv", header = TRUE, as.is = TRUE))

is now:

site.order.df <- data.frame(read.csv(here("NFF_data", "site_order_df.csv"), header = TRUE, as.is = TRUE))
  9. change read & save calls throughout existing scripts, remove setwd
  10. (scientific non) PROFIT
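For the step about removing setwd from existing scripts, a rough base-R sketch like the one below can list the calls that still need attention. find_setwd_calls() is a hypothetical helper written for this example, and the two demo scripts are invented. The pattern is deliberately simple: it skips lines where a # appears before setwd, so it can miss unusual cases.

```r
# Hypothetical helper: list lines in .R scripts that still call setwd().
find_setwd_calls <- function(script_dir) {
  scripts <- list.files(script_dir, pattern = "\\.[Rr]$",
                        recursive = TRUE, full.names = TRUE)
  hits <- lapply(scripts, function(script) {
    lines <- readLines(script, warn = FALSE)
    # Match setwd( only when no # precedes it on the line.
    idx <- grep("^[^#]*\\bsetwd\\s*\\(", lines, perl = TRUE)
    data.frame(file = rep(script, length(idx)),
               line = idx,
               code = trimws(lines[idx]),
               stringsAsFactors = FALSE)
  })
  # Returns NULL if the folder contains no scripts at all.
  do.call(rbind, hits)
}

# Demo with two invented scripts in a throwaway folder.
script_dir <- tempfile("R_")
dir.create(script_dir)
writeLines(
  c('setwd("/etc/2024-01_SharksFishCoral-FrenchPoly/NFF_data/")',
    'site.order.df <- read.csv("site_order_df.csv")'),
  file.path(script_dir, "old_script.R")
)
writeLines(
  c('# setwd("somewhere")', "x <- 1"),
  file.path(script_dir, "clean_script.R")
)

hits <- find_setwd_calls(script_dir)
```

Each row of hits gives a file, a line number, and the offending code, which makes it easier to swap the hard-coded path for a here() call one script at a time.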

Folks will also need:
A. To download RaiDrive for Windows, MountainDuck for Mac.
{TO CONTINUE}
B. Install & set it up: mirror to local drive. Then in browser, 'shared with me' folder, find target subfolder, right click, organise, create shortcut. Put that in your drive. It will then be synced locally, and any file changes by any party will be synced for everyone.
C.