I am an end-user doing a little research for the engineering team -- so my descriptions of our system are going to be very handwavy.
**System Details**
We are getting a brand new system and brand new Posit Team licenses up and running on a single compute server (Linux, no Slurm, no Kubernetes) that will be used by a small handful (but rapidly growing number) of mostly new-to-R analysts.
**Issue**
In our old (free) RStudio Server environment, our system was generally well sized for the users, but we sometimes ran into instances where someone inadvertently used all cores (due to an underlying detectCores() or availableCores() call within functions -- not intentionally) and disrupted the environment for everyone. We would like to avoid or reduce this with the tools we have.
We thought this would be possible with Workbench but it doesn't seem to be.
What do people recommend, without (yet?) adopting slurm or kubernetes?
We don't want to overengineer the problem -- currently, we don't have many users (few cores in use at a given time), and on occasion, someone has a heavy computational project where, with a little communication, we're OK with them making use of a substantial chunk of the environment for a period.
But it has certainly happened, and it is certainly a pain for all, and we would like to ensure that it happens very infrequently (we do not need to ensure that it can never happen).
Some thoughts:
I would love it if we could set an environment variable on the system for some reasonable number to be returned by detectCores()/availableCores(). This could be set to a reasonable number for everyone, and in the instances where someone has permission to use a substantial part of the server, they could override it. I really, really wanted this to exist, but as far as I can tell, it only does for {parallelly}. We could set the R_PARALLELLY_AVAILABLECORES_FALLBACK variable, which would take care of all packages that use {parallelly} for parallelizing, but that still leaves us vulnerable to anything that uses parallel::detectCores(). (Any idea what percentage of modern packages use each? I have no clue.)
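For the record, here is roughly how that fallback might be set site-wide (the file path and the value 4 are assumptions that depend on your R installation; note the variable is only a *fallback*, consulted by {parallelly} when nothing else determines the core count, and users with permission can still override it in their own ~/.Renviron):

```
# /etc/R/Renviron.site -- path assumes a default system R install; adjust as needed
# Fallback value reported by parallelly::availableCores() when no other limit applies
R_PARALLELLY_AVAILABLECORES_FALLBACK=4
```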
Most of our users are new-to-R and learning tidyverse. I want to believe this won't happen very often for our users (though history says it has happened plenty often to be problematic for those of us who rely on the R environment.) What packages use detectCores()? Is it possible to scrape CRAN for those packages and do a little active monitoring of people downloading those packages from Package Manager? Send people a warning message if they download {foreach} or {doParallel}?
Uninstall {parallel}. Create new {parallel} that incorporates the code from Attachment #2479 for bug #17641 and host that on Package Manager. (This sounds both horrifying and so tempting.)
This is a great question. I think if this is a concern long-term, Slurm or K8S are your best way to ensure user-level isolation.
One other option is to use the user + group limit settings in Posit Workbench. These can be set individually for users or groups of users as needed. See the Administration Guide, "User and Group Profiles".
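As a rough sketch of what that looks like (the group name and values here are made up; see the Administration Guide for the authoritative key names and defaults):

```
# /etc/rstudio/profiles
[*]
max-memory-mb = 8192

[@powerusers]
max-memory-mb = 65536
```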
Thanks, Alex. I think our team does know that slurm or k8s are likely in our future, but we're slowly testing out options and are trying to figure out the best way forward with what we have.
The User/Group Profiles don't really seem to fit the use case of dynamically changing needs for an individual user, unless I am missing something?
eg, I personally spend easily 10x as much time using a single node than multiple nodes, in this environment, but when I want to parallelize over 8 or 16 cores, I very much appreciate the ability to do so without going through any admin approval or modification of permissions.
Given that k8s and slurm are off the table at the moment as we are taking things one step at a time, would you recommend the Linux cgroup option?
Is there Posit documentation for cgroups with Workbench? I see some cgroup-related issues in the rstudio/rstudio GitHub repo -- none of them are concerning to my very untrained eye, but if there is anything our engineering team should know about cgroups with Workbench, I'd love to pass that on.
The user and group profiles that my colleague mentions are a good starting point but mostly useful for limiting maximum memory available for each user.
The missing feature there is limiting the CPU power available to each user. You neither want to create a new {parallel} package, nor should you rely solely on the {parallelly} environment variable mentioned above. We have some other Posit Workbench customers who add additional cgroup CPU limits for exactly this purpose.
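A sketch of what that cgroup setup could look like with libcgroup (cgroups v1; the group name and file contents are assumptions, and cgrulesengd must be running for the rules to be applied):

```
# /etc/cgconfig.conf -- define a CPU-limited cgroup
group rstudio-users {
    cpu {
        cpu.cfs_period_us = 100000;   # 100 ms scheduling period (kernel default)
        cpu.cfs_quota_us  = 300000;   # 300 ms of CPU time per period ~= 3 cores
    }
}

# /etc/cgrules.conf -- classify processes of everyone in the rstudio-users Linux group
@rstudio-users    cpu    rstudio-users/
```

As written this is a single shared quota; if your libcgroup version supports templates (%u in cgrules.conf), each user can instead be given their own per-user quota.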
This essentially ensures that users in the rstudio group get at most a share equivalent to 3 cores.
Technical explanation: The Linux kernel scheduler keeps track of how much CPU time each process receives in a given interval, defined by cpu.cfs_period_us (100 ms by default). If a user who is part of the rstudio group is allowed a maximum of 300 ms of CPU time every 100 ms, that user can effectively use 300/100 = 3 cores at most, no matter how many threads or processes they launch.
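The arithmetic above as a quick sanity check (the values mirror the example rather than being read from a live system):

```shell
# Effective core cap = quota / period (cgroup v1 CFS bandwidth control)
quota_us=300000    # cpu.cfs_quota_us: CPU time allowed per period
period_us=100000   # cpu.cfs_period_us: scheduling period (100 ms kernel default)
echo "effective cores: $((quota_us / period_us))"
# prints: effective cores: 3
```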
The above can of course be tailored for multiple groups (power users, "normal" users), and it will considerably reduce the risk of your server falling over until you are ready to embrace the Slurm or Kubernetes story. Those systems ensure that every user gets dedicated resources.
Indeed, this is a Linux-specific thing. Also, this approach unfortunately does not work with Workbench profiles for the Local Launcher (yet) and hence needs to be handled separately.
There is some stuff in the works to that effect but it's not clear when this will be readily available.