I have a plumber API that takes unexpectedly long to run, and I have trouble understanding what the reason for that could be. Let me give you some context:
The API doesn't do any fancy calculations and runs locally in a couple of seconds for JSON files of 4-5 MB (including importing the data). Our expectation was that it would run similarly fast when deployed as a plumber API on RStudio Connect, but to our surprise it's quite the contrary.
Local API - timing of the most important code pieces
```
Starting server to listen on port 7992
Running the swagger UI at http://127.0.0.1:7992/__swagger__/
Working on it...
Time difference of 3.83 secs   <- loading json
Raw file ready to rock...
887
Time difference of 0.81 secs   <- data prep
Time difference of 0.10 secs   <- variables calculation
```
Other than that, there are no other operations that take time. As you can see, the same code runs for over 20 seconds on RSC vs. nearly 5 seconds locally. How come there's such a huge difference in timings between RSC and running locally? Is there any way to speed that up?
Interesting issue. I wonder if it is specific to your API or a more general problem.
I don't have any specific advice regarding plumber, but it would be helpful to know:
- Do other plumber APIs have a similar performance issue for you, local vs. RSC?
- Do you have any similar issues with Shiny apps run locally vs. on RSC?

I had similar issues with a Shiny app once, and it turned out to be an issue on my network with how long a specific JavaScript library was taking to load, which we diagnosed using the Chrome developer tools.
Not really, actually. It's a standard way to retrieve a JSON request in an API, and all our APIs start with retrieving the body that way. I haven't noticed similar behaviour elsewhere, and especially nothing so apparent.
No, there's no difference.
With regards to your last point - wouldn't the detailed breakdown of timings prove that file transfer etc. takes a minor share of the entire processing time, and thus that it can't be an issue related to the network?
If there is no issue with any other plumber APIs on RSC, nor with Shiny apps on RSC, that to me suggests it is specific to this one plumber API - is that a reasonable assumption?
If so, can you profile the underlying R code for the API running directly on RSC vs. locally to see if there are any differences? profvis is really useful for this type of analysis: https://rstudio.github.io/profvis/
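For instance, something along these lines, run both locally and in a session on the RSC host (just a sketch - `run_endpoint_logic()` is a placeholder for whatever your endpoint actually does):

```r
library(profvis)
library(jsonlite)

# Profile the same work the endpoint performs, outside of plumber,
# so local and RSC flame graphs can be compared directly.
profvis({
  body <- jsonlite::read_json("test.json")  # placeholder for the real request body
  result <- run_endpoint_logic(body)        # placeholder for the endpoint's code
})
```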
It's a reasonable and logical assumption; however, on the other hand, there's nothing in the results of the API that supports that statement. I've also timed it using profvis and with simple time differences (as presented above), and neither suggests there could be such an issue. That's why I'm a little bit clueless about what the reason could be in this case.
Sorry to hear that - I am also at a loss as to what might be happening. I would suggest reaching out to RStudio support directly to help diagnose the problem.
@konradino based on the files you shared with me, this is a network issue. Given that the body of the response contains a large (>75 MB) JSON payload, network latency is what's causing the discrepancy between local performance and RStudio Connect performance.
Consider the following script that uses curl to make a request to both a local endpoint and a remote endpoint hosted on RStudio Connect:
```bash
#!/bin/bash
echo 'url,time_namelookup,time_connect,time_appconnect,time_pretransfer,time_redirect,time_starttransfer,time_total,size_upload,speed_upload,size_download,speed_download' > timings.csv
for url in '<local-endpoint>' '<rstudio-connect-endpoint>'
do
  for i in {1..10}
  do
    curl -d "@test.json" -w "%{url_effective},%{time_namelookup},%{time_connect},%{time_appconnect},%{time_pretransfer},%{time_redirect},%{time_starttransfer},%{time_total},%{size_upload},%{speed_upload},%{size_download},%{speed_download}\\n" -o /dev/null -X POST ${url} >> timings.csv
  done
done
```
This script creates a .csv file containing timing details for each request. Analyzing the results reveals that nearly all of the time discrepancy between the two APIs can be accounted for by download and upload speed.
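The `download_time` column shown below isn't reported by curl directly; it can be derived from timings.csv as size divided by speed, for example with a quick R sketch like this:

```r
# Derive approximate download time from the fields curl writes to timings.csv
timings <- read.csv("timings.csv")
timings$download_time <- timings$size_download / timings$speed_download
timings[, c("url", "size_download", "speed_download", "time_total", "download_time")]
```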
| url | size_download (bytes) | speed_download (bytes/sec) | time_total (sec) | download_time (sec) |
|---|---|---|---|---|
| Local Endpoint | 82540139 | 27285996 | 3.025613 | 3.025000 |
| Local Endpoint | 82540139 | 27467600 | 3.005494 | 3.005000 |
| Local Endpoint | 82540139 | 27349283 | 3.018092 | 3.018000 |
| Local Endpoint | 82540139 | 30401524 | 2.715642 | 2.715000 |
| Local Endpoint | 82540139 | 26921115 | 3.066180 | 3.066000 |
| Local Endpoint | 82540139 | 27458462 | 3.006494 | 3.006000 |
| Local Endpoint | 82540139 | 27313083 | 3.022872 | 3.022000 |
| Local Endpoint | 82540139 | 27331171 | 3.020641 | 3.020000 |
| Local Endpoint | 82540139 | 28141881 | 2.933028 | 2.933000 |
| Local Endpoint | 82540139 | 27550113 | 2.996060 | 2.996000 |
| RSC Endpoint | 82540139 | 6719868 | 12.283238 | 12.283000 |
| RSC Endpoint | 82540139 | 6846963 | 12.055490 | 12.055000 |
| RSC Endpoint | 82540139 | 9036581 | 9.134867 | 9.134001 |
| RSC Endpoint | 82540139 | 8575598 | 9.625306 | 9.625001 |
| RSC Endpoint | 82540139 | 7688881 | 10.735571 | 10.735000 |
| RSC Endpoint | 82540139 | 9607745 | 8.591900 | 8.591000 |
| RSC Endpoint | 82540139 | 9847308 | 8.382065 | 8.382000 |
| RSC Endpoint | 82540139 | 9207958 | 8.964228 | 8.964000 |
| RSC Endpoint | 82540139 | 9427771 | 8.755820 | 8.755000 |
| RSC Endpoint | 82540139 | 9588770 | 8.608930 | 8.608001 |
In order to improve performance, these are a couple of options that come to mind:

**Use a different serializer.** Depending on the downstream consumer of this API, you could serialize into something more compressed than JSON in order to cut down on the size of the response.

**Create a paginated response.** Instead of returning everything to the client at once, allow the endpoint to only return a few records at a time, and provide a mechanism for requesting subsequent records.
Thank you very much for your detailed answer and for taking the time to investigate! I had completely missed the fact that a 70+ MB response is generated from an only 61 KB request - that's obviously a lot of redundancy we can cut down on in order to speed things up. I checked the results on a real example and the factor is not that large, but a real output often ends up at 5-10 MB or bigger.
I believe the original design was to preserve as much information as possible about what's happening within the API, which includes saving all information from the request and adding further to it in the response. I guess it's a good idea for debugging, but does it add a lot of load to processing the entire request? In general, should I try to keep the response as light as possible?
Apart from that, is there anything else I could do within the API to speed it up? I had the impression that it's quite streamlined already.
A couple more questions:

> Use a different serializer
> Depending on the downstream consumer of this API, you could serialize into something more compressed than JSON in order to cut down on the size of the response.

What would be a better choice in this case?

> Create a paginated response
> Instead of returning everything to the client at once, allow the endpoint to only return a few records at a time, and provide a mechanism for requesting subsequent records.

Not quite sure what you mean. Could you refer me to some resources?
The API itself seems streamlined and is responsive, so I don't think there's much else that can be done to improve performance there.
In regards to a different serializer, this really depends on the downstream consumers of the API. If those downstream consumers are other R processes, you could serialize to an .rds file instead of raw JSON. Otherwise, you might look into compressing the output using something like gzip. Either of these approaches would require you to create your own serializer. The following example illustrates how you might do this using gzip:
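Here is a minimal sketch of such a serializer, using plumber's contentType serializer annotation and a temporary .json.gz file (the endpoint path and payload are placeholders):

```r
library(plumber)
library(jsonlite)

#* Return the payload as gzip-compressed JSON
#* @serializer contentType list(type = "application/gzip")
#* @post /data
function(req) {
  payload <- iris  # placeholder for the real response object
  json <- jsonlite::toJSON(payload)

  # Write the JSON through a gzip connection, then return the raw bytes
  tmp <- tempfile(fileext = ".json.gz")
  con <- gzfile(tmp, open = "wb")
  writeLines(json, con)
  close(con)
  readBin(tmp, what = "raw", n = file.info(tmp)$size)
}
```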
Using this type of serialization results in the following:
| url | size_download (bytes) | speed_download (bytes/sec) | time_total (sec) | download_time (sec) |
|---|---|---|---|---|
| Local Endpoint | 321289 | 101545 | 3.164260 | 3.164006 |
| Local Endpoint | 321289 | 100090 | 3.210613 | 3.210001 |
| Local Endpoint | 321289 | 100027 | 3.212998 | 3.212023 |
| Local Endpoint | 321289 | 95281 | 3.372121 | 3.372015 |
| Local Endpoint | 321289 | 98253 | 3.270463 | 3.270017 |
| Local Endpoint | 321289 | 100465 | 3.198819 | 3.198019 |
| Local Endpoint | 321289 | 100152 | 3.208099 | 3.208014 |
| Local Endpoint | 321289 | 98223 | 3.271237 | 3.271016 |
| Local Endpoint | 321289 | 103076 | 3.117654 | 3.117011 |
| Local Endpoint | 321289 | 102419 | 3.137675 | 3.137006 |
| RSC Endpoint | 321289 | 49307 | 6.516929 | 6.516093 |
| RSC Endpoint | 321289 | 67441 | 4.764885 | 4.764001 |
| RSC Endpoint | 321289 | 69243 | 4.640754 | 4.640021 |
| RSC Endpoint | 321289 | 67257 | 4.777718 | 4.777034 |
| RSC Endpoint | 321289 | 71604 | 4.487497 | 4.487026 |
| RSC Endpoint | 321289 | 74944 | 4.287704 | 4.287054 |
| RSC Endpoint | 321289 | 59531 | 5.397581 | 5.397003 |
| RSC Endpoint | 321289 | 67201 | 4.781622 | 4.781015 |
| RSC Endpoint | 321289 | 66782 | 4.811138 | 4.811012 |
| RSC Endpoint | 321289 | 69830 | 4.601685 | 4.601017 |
Note that in this case the client must be able to handle the .gz file, since it is no longer just vanilla JSON.
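On the R side, for instance, a client could handle it along these lines (a sketch assuming httr; the endpoint URL and file names are placeholders):

```r
library(httr)
library(jsonlite)

# POST the request and write the gzip-compressed response body to disk
resp <- POST("<rstudio-connect-endpoint>", body = upload_file("test.json"))
tmp <- tempfile(fileext = ".json.gz")
writeBin(content(resp, as = "raw"), tmp)

# Decompress and parse the JSON (readLines on a gzfile() connection decompresses)
result <- jsonlite::fromJSON(paste(readLines(gzfile(tmp)), collapse = "\n"))
```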
Pagination isn't something I've implemented in plumber, but it should be possible. The idea is that you return only the first n records for a request, and include some reference in the response that can be used to collect the next n records. This is most useful if you anticipate that not every request requires the full response immediately; if every request does require the full response, then this approach isn't beneficial. A rough sketch of what this might look like follows.
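(All names here are illustrative, with `mtcars` standing in for the full result set:)

```r
library(plumber)

all_records <- mtcars  # placeholder for the full result set

#* Return one page of records, plus the offset of the next page (if any)
#* @param offset:int Index of the first record to return (0-based)
#* @param n:int Number of records per page
#* @get /records
function(offset = 0, n = 10) {
  offset <- as.integer(offset)
  n <- as.integer(n)
  idx <- seq_len(nrow(all_records))
  page <- all_records[idx > offset & idx <= offset + n, , drop = FALSE]
  list(
    records = page,
    # the client passes this back as ?offset=... to fetch the next page
    next_offset = if (offset + n < nrow(all_records)) offset + n else NULL
  )
}
```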
@Blair09M - would it also be possible to increase the speed of the upload by using custom serializers?
The reason I'm asking is the following: my total API timing is 12 seconds, but when I time its individual elements they sum up to 5 seconds. The output (response) is very small (54 KB) and the code inside runs really fast. So most of the time in the API is really consumed at the beginning:
converting the request into something R can digest really blocks my API, but I'm not sure if I can speed it up here. Are there options for passing data in a compressed way as well?
```
simpleError in decompressFile.default(filename = filename, ..., ext = ext, FUN = FUN): No such file:
```
If that's possible, it would also decrease the overall time - the request JSON was 7 MB, and its upload alone took the missing 7 seconds of the overall processing time. If we could also nail that, it would be awesome!
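Something like this is what I have in mind on the receiving end (just a sketch, assuming plumber >= 1.0 where `req$bodyRaw` exposes the raw request bytes; the endpoint path is illustrative):

```r
library(plumber)
library(jsonlite)

#* Accept a gzip-compressed JSON request body
#* @post /compressed
function(req) {
  # req$bodyRaw holds the raw bytes of the request body (plumber >= 1.0),
  # e.g. sent by the client with: curl --data-binary @test.json.gz <url>
  json <- memDecompress(req$bodyRaw, type = "gzip", asChar = TRUE)
  data <- jsonlite::fromJSON(json)
  list(n_records = length(data))  # placeholder for the real processing
}
```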
Also, I ended up experimenting with the gzip solution you posted, but something doesn't seem to be working for me. The response for a file.json.gz that I'm getting is:
{"code":30,"error":"We couldn't log you in with the provided credentials. Please ask your RStudio Connect administrator for assistance.","payload":null}
The only difference from the previous JSON call (except for a different endpoint) is that I add: