To the RStudio Cloud Community:
As I am sure many of you have noticed, we have recently been experiencing a performance degradation on RStudio Cloud. These performance issues have affected both creating new projects and opening existing ones. We’ve observed unacceptably long wait times as well as a high failure rate when completing these operations. This undoubtedly makes for a very painful experience when trying to use RStudio Cloud.
First, I want to acknowledge this failure, and to apologize for the inconvenience it has caused. I also want you to know that we are taking all available steps to remedy the issue. We truly appreciate your patience as we work through these problems. To that end, we’ve taken some emergency measures to mitigate the problem. These temporary measures should restore stability while we continue to investigate the root cause of the problem.
In the interest of transparency, I thought it might be useful to explain the situation and the steps we are taking to find a permanent solution. As you may know, RStudio Cloud runs your projects using containers. If you’re not familiar with containers, the simplest explanation is that a container is a type of virtual server that isolates workloads and resource utilization, while having less overhead than a traditional virtual server. Using containers allows us to run many projects at once while maintaining strict isolation between them with minimal overhead. Each container has a number of storage volumes that are used to store project data, including source files as well as installed packages. When a new project is created, we provision a new container and new volumes for that project. To run our service as economically as possible, these containers are suspended when they are not in use. When a project is opened, its container is resumed from its sleeping state and its volumes are re-attached. The errors and delays you might have observed are related to this process of resuming or provisioning these containers.
Specifically, the problem we have observed lies within the mechanism that is used to attach these volumes to a container when it is starting. This is handled by a third-party component that our system uses to help us orchestrate the large number of containers that we manage. The issue that we’ve discovered is that when many projects are being suspended simultaneously, containers that are attempting to start experience a drastic delay attaching their volumes. We believe this is a previously unidentified bug within this component. Consequently, we are working with our infrastructure provider to investigate the problem with the hope of finding a solution. This work is ongoing, and we are hopeful that there will be a resolution soon. Unfortunately, we do not yet have a time frame for when that might occur.
In the meantime, we have taken several actions to address this issue: 1) As mentioned above, we took the emergency step of rolling back our orchestration system to an older version. This should bring an immediate performance improvement to newly created projects while we research a long-term solution. 2) We are working on changes to the algorithm that we use to suspend idle containers. We believe the changes to the suspension algorithm will alleviate the problem in the vast majority of cases.
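Purely as an illustration of the general idea, and not the actual suspension algorithm, here is a minimal Python sketch of one way to avoid suspending many containers at the same moment: cap how many idle containers are suspended in each cycle. Everything in it is hypothetical, including the names `Container`, `pick_for_suspension`, and `run_cycle`, the idle threshold, and the per-cycle cap.

```python
from dataclasses import dataclass


@dataclass
class Container:
    """Hypothetical record for one project's container."""
    project_id: str
    idle_seconds: float
    suspended: bool = False


def pick_for_suspension(containers, idle_threshold, max_per_cycle):
    """Return at most max_per_cycle idle containers, longest idle first."""
    idle = [
        c for c in containers
        if not c.suspended and c.idle_seconds >= idle_threshold
    ]
    idle.sort(key=lambda c: c.idle_seconds, reverse=True)
    return idle[:max_per_cycle]


def run_cycle(containers, idle_threshold=900.0, max_per_cycle=2):
    """Suspend one capped batch and return the affected project ids."""
    batch = pick_for_suspension(containers, idle_threshold, max_per_cycle)
    for c in batch:
        # Stand-in for a real suspend call to the orchestration system.
        c.suspended = True
    return [c.project_id for c in batch]


if __name__ == "__main__":
    fleet = [
        Container("a", 1000.0),
        Container("b", 2000.0),
        Container("c", 100.0),   # below the threshold, stays running
        Container("d", 3000.0),
    ]
    print(run_cycle(fleet))  # first capped batch
    print(run_cycle(fleet))  # remaining idle containers
```

With a cap like this, a burst of idle projects is spread over several cycles instead of being suspended all at once, which is the situation described above as triggering the volume attachment delays.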
I would like to reiterate that we apologize for the inconvenience this has caused. Our number one goal is to provide a reliable service that you can count on. I know that for some of you we have failed to meet that goal, but please know that we are committed to resolving these issues as quickly as we can. Thank you for your continued support.
-Andy
Lead Engineer, RStudio Cloud