I love Shiny, but I do run into problems since I don't really understand what I am doing. Hopefully, someone will assist me here and forgive my lack of proper background. Here's the issue:
We provide an internal Shiny app with a maximum of about 100 concurrent users. The app itself is fairly simple, but it fetches data from different databases.
This is the problem:
Multiple users have reported performance issues (slow load or even no load at all), but not all users are affected. I have not yet been able to identify the cause of these issues or to reproduce them. I hope this post will help me better understand how Shiny Server works and whether the performance issues could be linked to it, or to the way we have set it up.
- multiple identical Shiny Server instances run as pods in OpenShift
- according to OpenShift's built-in metrics, neither CPU nor RAM is near its limit for any pod
- users are distributed evenly across the individual pods by a load balancer
- each Shiny Server runs one worker (a restriction of the open-source version, I believe; see the config sketch after this list)
- each worker accepts multiple HTTP and WebSocket connections
- logged maximum: 28 WebSocket, 5 HTTP, 9 pending connections for a single Shiny Server
- according to the logs, all Shiny Servers have pending connections most of the time (even outside working hours), e.g.: "Worker #99999999999 releasing http port. 0 HTTP, 2 WebSocket, 4 pending."
- other unsettling log entries that keep recurring: "pending session timer expired" and "HTTP client error (undefined): read ECONNRESET"
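For reference, here is a minimal sketch of where such limits live in open-source Shiny Server, assuming the documented `shiny-server.conf` directives. The values shown are the documented defaults, not our production settings, and the exact scope of each directive may differ:

```
# shiny-server.conf -- illustrative sketch, values are the documented defaults

http_keepalive_timeout 45;     # drop idle HTTP keepalive connections after 45 s
sockjs_heartbeat_delay 25;     # heartbeat that keeps WebSocket/SockJS sessions alive
sockjs_disconnect_delay 5;     # grace period before a dropped client's session is closed

server {
  listen 3838;

  location / {
    site_dir /srv/shiny-server;
    log_dir  /var/log/shiny-server;

    # Open-source scheduler: one R worker per app; once this many
    # concurrent connections are in use, new ones are rejected with
    # an HTTP 503 (the default cap is 100)
    simple_scheduler 100;

    app_init_timeout 60;       # seconds an app may take to start
    app_idle_timeout 5;        # seconds an idle R process is kept alive
  }
}
```

If the `simple_scheduler` cap were being hit, rejected clients should be receiving 503 responses, which might surface in the load balancer's logs rather than in Shiny Server's own.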
Some things I would like to know:
- Is there a restriction on the number of WebSocket or HTTP connections per Shiny Server (open source)?
- If so, what are the limits, and can I change them via parameters?
- It seems to me that I have dangling pending connections. Could these be the problem, and what causes them?
- Any other hints?
Would be great to get some help on this!
Thanks
Alex
I've had similar issues when scaling Shiny apps (and the same problem with other frameworks such as Streamlit). I'm not working on any big Shiny project, but I can share my recent experience with Streamlit; our approach might help you.
My team manages infrastructure for other companies, and we recently helped a company scale a Streamlit app that had the same issues you outlined (the use case was similar: ask for user input, download data from a database, process it, and display some tables and charts to the user). Some users would experience failed WebSocket connections past a certain number of concurrent users. When we monitored resource usage, we noticed memory leaks: RAM usage would keep increasing even after the app finished processing the data. A single user could therefore clog resources even when they were only looking at the final results and not triggering any heavy processing (so it's kind of surprising that you're not seeing any RAM peaks in your metrics).
We resolved this by spinning up a job queue that does the heavy processing. Upon user input submission, the data-processing job is triggered and executed on a separate machine. Jobs are queued to guarantee a limit on the number of concurrent jobs. Once a job finishes, it sends the result back to the app, which displays the tables and charts. The app becomes essentially a frontend that consumes a small amount of resources, and the memory leaks are eliminated since the worker process is killed once it finishes.
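In outline, the pattern looked like the sketch below. This is illustrative only, assuming Redis plus the `rq` library; the names (`process_data`, the queue name, the host) are made up and not our actual code:

```python
# Requires a running Redis instance and the `rq` package on both the
# app machine and the worker machine.
from redis import Redis
from rq import Queue

queue = Queue("heavy-processing", connection=Redis(host="redis"))


def process_data(params: dict) -> dict:
    """Placeholder for the expensive fetch-and-process step.

    It runs in a worker on a separate machine, and its memory is
    reclaimed when the job process exits, which is what eliminated
    the leaks for us.
    """
    ...


# In the app: enqueue on user submission instead of computing inline.
job = queue.enqueue(process_data, {"user_id": 42}, job_timeout=600)

# On each rerun/poll: display results once the job has finished.
if job.is_finished:
    tables_and_charts = job.result  # small, pre-computed payload
```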
The only caveat is that we had to re-implement parts of the application.
I think our problem is slightly different, since there is hardly any R computation involved: our app basically shows one row from a very simple DB query. Also, none of our pods has reached its RAM limit, and while one user is handled perfectly, another user at the same time is not handled at all.
My guess is rather that there is a problem establishing concurrent HTTP or WebSocket connections past a certain number. Were you able to determine any restrictions on this in your use case?