Stressing the Plumber

We've started using plumber in our production environments. I finally got around to stress testing a smaller VM to see the results:

Box Specs:
Ubuntu 16.04 (on Azure)
4 vCPUs
64 GB RAM

API Setup:
8 Docker containers behind nginx as the load balancer (I can include the settings for anyone interested, but I'll need to clean them up a bit; I did not enable caching)
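Each container runs a plumber process along these lines (a hypothetical sketch; the actual Dockerfile and settings aren't included here):

# Hypothetical per-container entrypoint; the real settings are not shown.
library(plumber)
pr <- plumb("plumber.R")
pr$run(host = "0.0.0.0", port = 8000)  # nginx load-balances across the 8 containers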

This API usually responds in 0.2s under normal load.

Siege Results:

 siege -c100 -b -i -f url.txt -lsiegee.log
** SIEGE 3.0.8
** Preparing 100 concurrent users for battle.
The server is now under siege...
^C
Lifting the server siege...      done.

Transactions:                    723 hits
Availability:                  98.23 %
Elapsed time:                  32.95 secs
Data transferred:               3.53 MB
Response time:                  3.40 secs
Transaction rate:              21.94 trans/sec
Throughput:                     0.11 MB/sec
Concurrency:                   74.60
Successful transactions:         723
Failed transactions:              13
Longest transaction:            4.57
Shortest transaction:           0.27
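As a quick sanity check, the availability figure appears to be just successful transactions over successful plus failed:

# Sanity check on the availability figure reported above
723 / (723 + 13)  # ~0.9823, i.e. the 98.23 % siege reports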

During the siege I get these errors:

[alert] socket: 1987888896 select timed out: Connection timed out
[alert] socket: -1895835904 select timed out: Connection timed out
[alert] socket: -1929406720 select timed out: Connection timed out
[alert] socket: 1786464000 select timed out: Connection timed out
[alert] socket: 2055030528 select timed out: Connection timed out
[alert] socket: -2105653504 select timed out: Connection timed out
[alert] socket: 2080208640 select timed out: Connection timed out
[alert] socket: 1761285888 select timed out: Connection timed out
[alert] socket: -1879050496 select timed out: Connection timed out
[alert] socket: -1753159936 select timed out: Connection timed out
[alert] socket: 1870391040 select timed out: Connection timed out
[alert] socket: 2105386752 select timed out: Connection timed out
[alert] socket: -1962977536 select timed out: Connection timed out

(I cannot reproduce these errors under normal load)

Any guesses what those socket errors might be?

Example of a log during that error:

App=nginx,localtime=[08/Aug/2018:17:02:31 +0000],remoteaddr="XXXXXX",request="GET XXXX8 HTTP/1.1",worker_status=-,worker_addr="127.0.0.1:9301",upstream_response_time=-,request_time=3.550;round_trip_time=885

The hyphen stands out in the upstream response time.
According to Configuring Logging | NGINX Documentation

In case of internal error while connecting to an upstream or when a reply is taken from the cache, the variable contains “-” (hyphen)

I don't have any caching configured, so it looks like an error connecting to the upstream box.

[Chart: upstream response time and request time during the siege]

Additional Notes

When I check my container logs I see this: <simpleError in Ops.data.frame(REDACTED): ‘+’ only defined for equally-sized data frames>
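That simpleError is ordinary base-R behavior rather than anything plumber-specific; adding two data frames of different sizes produces exactly that message, which suggests the endpoint code is being handed inputs of unexpected length:

# Reproduces the same base-R error seen in the container log:
# Ops.data.frame refuses arithmetic on data frames of different sizes.
data.frame(a = 1:3) + data.frame(a = 1:2)
#> Error in Ops.data.frame(...) : '+' only defined for equally-sized data frames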

Any guesses what limit I'm running into here, or where I could look for more informative logs? Is this plumber? Or maybe the Docker socket?

EDIT:
I know there's a Stack Exchange answer suggesting this could be an issue with a particular version of siege; I will build their dev version and check. But I doubt it: for one, I've seen this behavior in our production application under heavy load as well, and for two, it's an update from six months ago.


Have you tried tweaking the number of Docker containers? Since you have 4 cores, you could try 3 containers (leaving 1 core for nginx) and see if that improves your results. I'm not sure how much contention you'd need to start seeing timeouts in a siege run, but it does look like you have more containers than cores, and that could cause contention issues.

That's the first thing I'd tweak. You might also try sticking with siege's default of 25 concurrent users for its benchmark mode. The amount of contention on the client side could also lead to weird results.

Good luck, let us know how it goes!


Heya Scott!

Thanks a lot for the response!

I tried both, the problem went away, and I enthusiastically came here to post that it was definitely the CPUs or the local siege client... and then I could no longer reproduce the error with my original settings either, which suggests a stateful component (socket usage, maybe?). I'm trying to get back to it now and will let you know if and when I do.

If it's CPU-related on the server (and the machine is intentionally under-specced, so I wouldn't be surprised), what can I do to mitigate it? The APIs involved are often idle except for flurries of usage. If I need a CPU for every container, I suspect I'll be running a very large machine and not doing much with it most of the time.

Additional Thoughts
In the above case, the API fails without a clear error at the load balancer. Is that just the nature of this kind of problem?

I.e., the worker status in the log above is also "-", while in other cases it would contain an HTTP error code. Is it possible for plumber to return that error code to the load balancer?
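When the plumber process does manage to respond, it can set an explicit status via res$status, and nginx will log that; the hyphen seems to appear only when the upstream never answers at all. A hypothetical sketch (endpoint and parameter names are made up):

# Hypothetical endpoint returning an explicit HTTP status for bad input,
# so nginx logs a 400 instead of "-" -- this only works while the process is alive.
#* @get /predict
function(req, res, x = NA) {
  x <- suppressWarnings(as.numeric(x))
  if (is.na(x)) {
    res$status <- 400
    return(list(error = "x must be numeric"))
  }
  list(result = x * 2)
}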

Reproduced it with only 3 API containers and only 10 concurrent users.

It took a fair amount of testing to reproduce it again; is it possible I'm getting buildup somewhere?

In order to do it, I needed to use a URL file with a good number of bad requests (502 errors, 500 errors, you name it); successful requests don't seem to cause it.

So I used something containing mostly bad requests and ended up with this:

siege aborted due to excessive socket failure; you
can change the failure threshold in $HOME/.siegerc

I updated my failure threshold to 15000 and then was able to reproduce the error.

And after that, the error seemed easier to reproduce:

siege -c10 -b -i -furl_for_siege.txt -lsiege.log
** SIEGE 3.0.8
** Preparing 10 concurrent users for battle.
The server is now under siege...
[Wed, 2018-08-08 20:47:27] HTTP/1.1 200   0.38 secs:   41802 bytes ==> GET  http://xxxxx
[Wed, 2018-08-08 20:47:27] HTTP/1.1 200   0.47 secs:   41802 bytes ==> GET  http://xxxxx
[alert] socket: 1342109440 select timed out: Connection timed out
[alert] socket: 1325324032 select timed out: Connection timed out
[alert] socket: 1316931328 select timed out: Connection timed out
[alert] socket: 1333716736 select timed out: Connection timed out
[alert] socket: 1300145920 select timed out: Connection timed out
[alert] socket: 1350502144 select timed out: Connection timed out
[alert] socket: 1291753216 select timed out: Connection timed out
[alert] socket: 1283360512 select timed out: Connection timed out
[alert] socket: 1308538624 select timed out: Connection timed out
[alert] socket: 1358894848 select timed out: Connection timed out

Are bad requests using up sockets?


Hmm... maybe the next step is to rule out contention in the Docker networking layer? Can you proxy three R processes running the same API on the same container? (Would that defeat your purpose here?)
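Something like this, perhaps (a sketch assuming the callr package; the ports and file name are placeholders):

# Launch three plumber processes on separate ports in background R sessions
# inside one container (or directly on the host), then point the nginx
# upstream pool at all three.
library(callr)
ports <- c(9401, 9402, 9403)
procs <- lapply(ports, function(p) {
  r_bg(function(port) {
    plumber::plumb("plumber.R")$run(host = "127.0.0.1", port = port)
  }, args = list(port = p))
})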

Went one further and reproduced it with no Docker at all. It's a segfault in R.

I used just nginx on the host machine, pointing at a single plumber process (also running on the host) as the upstream on port 9400.

The Docker container log had managed to hide this one:

R Session on Remote

> library(plumber);srv <- plumb("plumber.R");srv$run(port = 9400)
Starting server to listen on port 9400

 *** caught segfault ***
address 0x11000088, cause 'memory not mapped'

URL Generation
I built a large random testbed of requests:

library(httr)

# Base URL of the endpoint under test
url <- parse_url("http://<IP>/endpoint/function")

# Build 1000 URLs with randomized query parameters
urllist <- replicate(1000, {
  url$query <- list(model_version = "1.0.0",
                    x = sample(-15:15, 1),
                    y = sample(-120:105, 1),
                    z = sample(30:55, 1),
                    w = sample(-120:105, 1))
  build_url(url)
}, simplify = TRUE)

write(urllist, "urls_for_siege.txt")
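Since bad requests seemed necessary to trigger this, the file can also be salted with deliberately bad URLs; the exact bad requests I used aren't shown here, so the following is just a sketch (paths and parameters are made up):

# Hypothetical bad requests mixed in with the generated URLs:
# unknown paths (404s) and malformed parameters (errors from the endpoint).
bad_urls <- c(
  sprintf("http://<IP>/endpoint/no_such_function?junk=%d", sample(1e6, 200)),
  rep("http://<IP>/endpoint/function?model_version=1.0.0&x=not_a_number", 200)
)
write(c(urllist, bad_urls), "urls_for_siege.txt")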

NGINX.CONF

## Pool definition:
upstream endpoint_pool {
    least_conn;
    server localhost:9400 weight=1;
}

## Location definition:
location /endpoint/ {
    access_log /var/log/nginx/apps/endpoint.log upstreamlog;
    rewrite ^/endpoint/(.*)$ /$1 break;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection $connection_upgrade;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_pass http://endpoint_pool;
    proxy_read_timeout 5m;
    proxy_next_upstream error timeout http_500;
}

Traceback

Traceback:
 1: FUN(X[[i]], ...)
 2: vapply(expr[-1], identical, logical(1), quote(.))
 3: is_first(rhs)
 4: split_chain(match.call(), env = env)
<5-11 Redacted>
12: do.call(private$func, args, envir = private$envir)
13: (function (...) {    args <- getRelevantArgs(list(...), plumberExpression = private$expr)    hookEnv <- new.env()    private$runHooks("preexec", c(list(data = hookEnv), list(...)))    val <- do.call(private$func, args, envir = private$envir)    val <- private$runHooks("postexec", c(list(data = hookEnv,         value = val), list(...)))    val})(res = <environment>, req = <environment>, model_version = "1.0.0",     x = "1", y = "-40", z = "43",     w = "-103")
14: do.call(h$exec, req$args)
15: doTryCatch(return(expr), name, parentenv, handler)
16: tryCatchOne(expr, names, parentenv, handlers[[1L]])
17: tryCatchList(expr, classes, parentenv, handlers)
18: tryCatch({    options(warn = 1)    h <- getHandle("__first__")    if (!is.null(h)) {        if (!is.null(h$serializer)) {            res$serializer <- h$serializer        }        req$args <- c(h$getPathParams(path), req$args)        return(do.call(h$exec, req$args))    }    if (length(private$filts) > 0) {        for (i in 1:length(private$filts)) {            fi <- private$filts[[i]]            h <- getHandle(fi$name)            if (!is.null(h)) {                if (!is.null(h$serializer)) {                  res$serializer <- h$serializer                }                req$args <- c(h$getPathParams(path), req$args)                return(do.call(h$exec, req$args))            }            .globals$forwarded <- FALSE            fres <- do.call(fi$exec, req$args)            if (!.globals$forwarded) {                if (!is.null(fi$serializer)) {                  res$serializer <- fi$serializer                }                return(fres)            }        }    }    h <- getHandle("__no-preempt__")    if (!is.null(h)) {        if (!is.null(h$serializer)) {            res$serializer <- h$serializer        }        req$args <- c(h$getPathParams(path), req$args)        return(do.call(h$exec, req$args))    }    for (mountPath in names(private$mnts)) {        if (nchar(path) >= nchar(mountPath) && substr(path, 0,             nchar(mountPath)) == mountPath) {            req$PATH_INFO <- substr(req$PATH_INFO, nchar(mountPath),                 nchar(req$PATH_INFO))            return(private$mnts[[mountPath]]$route(req, res))        }    }    val <- private$notFoundHandler(req = req, res = res)    return(val)}, error = function(e) {    val <- private$errorHandler(req, res, e)    return(val)}, finally = options(warn = oldWarn))
19: self$route(req, res)
20: self$serve(req, res)
21: func(req)
22: compute()
23: doTryCatch(return(expr), name, parentenv, handler)
24: tryCatchOne(expr, names, parentenv, handlers[[1L]])
25: tryCatchList(expr, classes, parentenv, handlers)
26: tryCatch(compute(), error = function(e) compute_error <<- e)
27: rookCall(.app$call, req, req$.bodyData, seek(req$.bodyData))
28: (function (req, cpp_callback) {    resp <- rookCall(.app$call, req, req$.bodyData, seek(req$.bodyData))    clean_up <- function() {        if (!is.null(req$.bodyData)) {            close(req$.bodyData)        }        req$.bodyData <- NULL    }    if (is.promise(resp)) {        resp <- resp %...>% invokeCppCallback(., cpp_callback)        finally(resp, clean_up)    }    else {        on.exit(clean_up())        invokeCppCallback(resp, cpp_callback)    }    invisible()})(<environment>, <pointer: 0x7f9828048140>)
29: evalq((function (req, cpp_callback) {    resp <- rookCall(.app$call, req, req$.bodyData, seek(req$.bodyData))    clean_up <- function() {        if (!is.null(req$.bodyData)) {            close(req$.bodyData)        }        req$.bodyData <- NULL    }    if (is.promise(resp)) {        resp <- resp %...>% invokeCppCallback(., cpp_callback)        finally(resp, clean_up)    }    else {        on.exit(clean_up())        invokeCppCallback(resp, cpp_callback)    }    invisible()})(<environment>, <pointer: 0x7f9828048140>), <environment>)
30: evalq((function (req, cpp_callback) {    resp <- rookCall(.app$call, req, req$.bodyData, seek(req$.bodyData))    clean_up <- function() {        if (!is.null(req$.bodyData)) {            close(req$.bodyData)        }        req$.bodyData <- NULL    }    if (is.promise(resp)) {        resp <- resp %...>% invokeCppCallback(., cpp_callback)        finally(resp, clean_up)    }    else {        on.exit(clean_up())        invokeCppCallback(resp, cpp_callback)    }    invisible()})(<environment>, <pointer: 0x7f9828048140>), <environment>)
31: doTryCatch(return(expr), name, parentenv, handler)
32: tryCatchOne(expr, names, parentenv, handlers[[1L]])
33: tryCatchList(expr, names[-nh], parentenv, handlers[-nh])
34: doTryCatch(return(expr), name, parentenv, handler)
35: tryCatchOne(tryCatchList(expr, names[-nh], parentenv, handlers[-nh]),     names[nh], parentenv, handlers[[nh]])
36: tryCatchList(expr, classes, parentenv, handlers)
37: tryCatch(evalq((function (req, cpp_callback) {    resp <- rookCall(.app$call, req, req$.bodyData, seek(req$.bodyData))    clean_up <- function() {        if (!is.null(req$.bodyData)) {            close(req$.bodyData)        }        req$.bodyData <- NULL    }    if (is.promise(resp)) {        resp <- resp %...>% invokeCppCallback(., cpp_callback)        finally(resp, clean_up)    }    else {        on.exit(clean_up())        invokeCppCallback(resp, cpp_callback)    }    invisible()})(<environment>, <pointer: 0x7f9828048140>), <environment>),     error = function (x)     x, interrupt = function (x)     x)
38: .Call("_later_execCallbacks", PACKAGE = "later", timeoutSecs)
39: execCallbacks(timeoutSecs)
40: run_now(check_time)
41: service(0)
42: httpuv::runServer(host, port, self)
43: srv$run(port = 9400)
An irrecoverable exception occurred. R is aborting now ...
Segmentation fault (core dumped)

So the segfault here is definitely concerning. But it looks like what you are calling bad requests is very dependent on your Plumber API. Is that correct? Is there a reason that the parameters you have selected are inherently bad?

Your API parses query parameters, I presume?

So, I reproduced it without actually engaging the endpoint (using 404 errors and successes in sequence) and posted it as an issue on trestletech/plumber, though I think it's actually R 3.4.4 and httpuv related.

I've done a fair amount of additional debugging since, and swapped in a simple plumber file that reproduces it.
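Something along these lines (a hypothetical sketch; not the actual reproduction file):

# plumber.R -- minimal hypothetical endpoint that just echoes its query parameters
#* @get /endpoint/function
function(model_version = NA, x = NA, y = NA, z = NA, w = NA) {
  list(model_version = model_version, x = x, y = y, z = z, w = w)
}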

The error does not appear in R 3.5, though unfortunately I'm on Ubuntu and need to wait for CRAN to upgrade (I know there's a PPA, but I want to wait for the stable CRAN release).

My current theory is that something bad is happening when the client abandons the pipe under those circumstances.

EDIT:

Looking through the R 3.5 release notes:

https://cran.r-project.org/doc/manuals/r-release/NEWS.html
It looks like they changed a lot of how R handles buffered connections.

So maybe a buffer overflow? That would be rough though.


Resolved. It was a problem with signal handling in httpuv; httpuv has released a commit to fix it. Special thanks to @atheriel and @wch.
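For anyone hitting this before the updated httpuv reaches CRAN, installing the development version should pick up the fix (a sketch; assumes the commit is on the package's master branch):

# Install the development version of httpuv containing the signal-handling fix
install.packages("remotes")
remotes::install_github("rstudio/httpuv")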
