Dynamic Nginx Router... in Go!
We needed a specialized load balancer at Nitro. After some study, Mihai Todor and I built a solution that leverages Nginx, the Redis protocol, and a Go-based request router where Nginx does all the heavy lifting and the router carries no traffic itself. This solution has worked great in production for the last year. Here’s what we did and why we did it.
Why?
The new service we were building would sit behind a pool of load balancers and was going to do some expensive calculations, and therefore some local caching. To optimize for the cache, we wanted to send requests for the same resources to the same host whenever one was available.
There are a number of off-the-shelf ways to solve this problem. A non-exhaustive list of possibilities includes:
- Using cookies to maintain session stickiness
- Using a header to do the same
- Stickiness based on source IP
- HTTP redirects to the correct instance
This service will be hit several times per page load, so HTTP redirects are not viable for performance reasons. The rest of those solutions work well if all the inbound requests pass through the same load balancer. If, on the other hand, your frontend is a pool of load balancers, you need to either share state between them or implement more sophisticated routing logic. We weren’t interested in taking on the design changes needed to share state between load balancers, so we opted for more sophisticated routing logic for this service.
Our Architecture
It probably helps in understanding our motivation to know a bit about our architecture.
We have a pool of frontend load balancers, and instances of the service are deployed on Mesos, so they may come and go depending on scale and resource availability. Getting a list of hosts and ports into the load balancer is not an issue; that’s already core to our platform.
Because everything is running on Mesos, and we have a simple way to define and deploy services, adding any new service is a trivial task.
On top of Mesos, we run gossip-based Sidecar everywhere to manage service discovery. Our frontend load balancers are Lyft’s Envoy backed by Sidecar’s Envoy integration. For most services that is enough. The Envoy hosts run on dedicated instances but the services all move between hosts as needed, directed by Mesos and the Singularity scheduler.
The Mesos nodes for the service under consideration here would have disks for local caching.
Design
Looking at the problem, we decided we really wanted a consistent hash ring. Nodes could come and go as needed, and only the requests being served by those nodes would be re-routed; all the remaining nodes would continue to serve any open sessions. We could easily back the consistent hash ring with data from Sidecar (you could substitute Mesos or K8s here). Sidecar health-checks nodes, so we could rely on any node it reports as alive being available.
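To make that concrete, here’s a minimal, self-contained sketch of a consistent hash ring in Go. This is a toy illustration, not Ringman itself: each node is hashed onto the ring at several virtual points, a key belongs to the first node at or after its hash, and removing a node only remaps the keys that node was serving.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"sort"
)

const vnodes = 100 // virtual points per node, to smooth the distribution

// Ring is a toy consistent hash ring mapping keys to "host:port" nodes.
type Ring struct {
	points map[uint32]string // hash point -> node
	sorted []uint32          // sorted hash points
}

func New(nodes []string) *Ring {
	r := &Ring{points: make(map[uint32]string)}
	for _, n := range nodes {
		for i := 0; i < vnodes; i++ {
			h := crc32.ChecksumIEEE([]byte(fmt.Sprintf("%s-%d", n, i)))
			r.points[h] = n
			r.sorted = append(r.sorted, h)
		}
	}
	sort.Slice(r.sorted, func(i, j int) bool { return r.sorted[i] < r.sorted[j] })
	return r
}

// GetNode returns the node responsible for key: the first ring point
// at or after the key's hash, wrapping around past the end.
func (r *Ring) GetNode(key string) string {
	h := crc32.ChecksumIEEE([]byte(key))
	i := sort.Search(len(r.sorted), func(i int) bool { return r.sorted[i] >= h })
	if i == len(r.sorted) {
		i = 0
	}
	return r.points[r.sorted[i]]
}

func main() {
	ring := New([]string{"10.10.10.5:23453", "10.10.10.7:31002"})
	// Requests for the same resource always land on the same node.
	fmt.Println(ring.GetNode("/documents/abc123"))
}
```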
We needed to then somehow bolt the consistent hash to something that could direct traffic to the right node. It would need to receive each request, identify the resource in question, and then pass the request to the exact instance of the service that was prepped to handle that resource.
Of course, the resource identification is easily handled by a URL and any load balancer can take those apart to handle simple routing. So we just needed to tie that to the consistent hash and we’d have a solution.
You could do this in Lua in Nginx, and possibly in HAProxy with Lua as well. But no one at Nitro is a Lua expert, and libraries implementing the pieces we needed were not obviously available. Ideally the routing logic would be in Go, which is already a critical language in our stack and well supported.
Nginx has a rich ecosystem, though, and a little thinking outside the box turned up a couple of interesting Nginx plugins. The first of these is the nginx-eval-module by Valery Kholodkov. It allows you to make a call from Nginx to an endpoint and then evaluate the result into an Nginx variable. Among other possible uses, the significance for us is that it lets you dynamically decide which endpoint should receive a proxy_pass. That’s exactly what we wanted: you make a call from Nginx to somewhere, you get a result, and then you make a routing decision based on that value.
You could implement the recipient of that request as an HTTP service that returns nothing but a string containing the hostname and port of the destination service endpoint. That service would maintain the consistent hash and then tell Nginx where to route the traffic for each request. But making a separate HTTP request, even if it were always contained on the same node, is a bit heavy. The entire expected body of the reply would be something like the string 10.10.10.5:23453. With HTTP, we’d be passing headers in both directions that would vastly exceed the size of the response.
So I started to look at the other protocols Nginx can speak. The Memcache protocol and the Redis protocol are both supported; of those, Redis has the better support from a Go service. So that was where we turned.
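For a sense of the size difference, the entire Redis protocol exchange for one lookup is a few dozen bytes. Shown with explicit \r\n delimiters, a request and reply look something like this (the URL and endpoint are made up):

```
*2\r\n$3\r\nGET\r\n$14\r\n/documents/abc\r\n    <- request: GET /documents/abc
$16\r\n10.10.10.5:23453\r\n                    <- reply: the target endpoint
```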
There are two Redis modules for Nginx. One of them is suitable for use with the nginx-eval-module. The best Go library for serving the Redis protocol is Redeo. It implements a really simple handler mechanism, much like the stdlib http package: any Redis protocol command will invoke a handler function, and they are really simple to write. Alas, it only supports a newer Redis protocol than the Nginx plugin could handle. So I dusted off my C skills and patched the Nginx plugin to use the newest Redis protocol encoding.
So the solution we ended up with is:
The call comes in from the Internet, hits an Envoy node, then an Nginx node. The Nginx node (1) asks the router where to send it, and then (2) Nginx passes the request to the endpoint.
Implementation
We built a library in Go to manage our consistent hash ring, backed either by Sidecar or by Hashicorp’s Memberlist library. We called that library Ringman. We then bolted that library into a service that serves Redis protocol requests via Redeo.
Only two Redis commands are required: GET and SELECT. We chose to implement a few more for debugging purposes, including INFO, which can reply with any server state you’d like. Of the two required commands, we can safely ignore SELECT, which selects the Redis DB to use for any subsequent calls; we just accept it and do nothing. GET, which does all the work, was easy to implement: Nginx passes the URL it received, and we return the endpoint from the hash ring.
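The whole function to serve the Ringman endpoint over Redis with Redeo ends up looking something like the sketch below, written against Redeo’s current (v2) handler API. The `Ring` interface stands in for Ringman’s hash ring manager; treat the names as illustrative rather than our exact production code.

```go
package main

import (
	"net"

	"github.com/bsm/redeo"
	"github.com/bsm/redeo/resp"
)

// Ring is the one piece of the hash ring we need here: given a key
// (the request URL), return the "host:port" that should serve it.
type Ring interface {
	GetNode(key string) (string, error)
}

func serveRing(ring Ring, addr string) error {
	srv := redeo.NewServer(nil)

	// GET does all the work: hash the URL, return the target endpoint.
	srv.HandleFunc("get", func(w resp.ResponseWriter, c *resp.Command) {
		if c.ArgN() != 1 {
			w.AppendError("ERR wrong number of arguments for 'get'")
			return
		}
		node, err := ring.GetNode(c.Arg(0).String())
		if err != nil {
			w.AppendError("ERR " + err.Error())
			return
		}
		w.AppendBulkString(node)
	})

	// SELECT is required by the Nginx plugin but meaningless to us:
	// accept it and do nothing.
	srv.HandleFunc("select", func(w resp.ResponseWriter, _ *resp.Command) {
		w.AppendInlineString("OK")
	})

	lis, err := net.Listen("tcp", addr)
	if err != nil {
		return err
	}
	defer lis.Close()
	return srv.Serve(lis)
}
```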
That is called by Nginx using the following config:
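Our production config has more to it, but the core is a sketch like the following, combining nginx-eval-module with the Redis module. The port and variable names here are placeholders:

```nginx
server {
    listen 8080;

    location / {
        # Ask the Go router (speaking Redis protocol on localhost)
        # which instance owns this URL. The reply, e.g.
        # "10.10.10.5:23453", is captured into $target.
        eval $target {
            set $redis_key $request_uri;
            redis_pass 127.0.0.1:6379;
        }

        # Proxy the original request to the instance the router chose.
        proxy_pass http://$target;
    }
}
```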
We deploy Nginx and the router in containers and they run on the same hosts so we have a very low call overhead between them.
We build Nginx like this:
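It’s a stock source build with the two modules compiled in. Something along these lines, where the Nginx version and module checkout paths are placeholders and the Redis module is the patched fork mentioned above:

```bash
curl -LO https://nginx.org/download/nginx-1.13.5.tar.gz
tar xzf nginx-1.13.5.tar.gz
cd nginx-1.13.5

# Compile both plugins into the binary.
./configure \
    --prefix=/usr/local/nginx \
    --add-module=/build/nginx-eval-module \
    --add-module=/build/ngx_http_redis

make && make install
```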
Performance
We’ve tested the performance of this extensively, and in our environment we see average response times of about 0.2-0.3ms for a round trip from Nginx to the Go router over Redis protocol. Since the median response time of the upstream service is about 70ms, this is a negligible delay.
A more complex Nginx config could do more sophisticated error handling. Reliability after a year in service has been extremely good, and performance has been consistent.
Wrap-Up
If you have a similar need, you can re-use most of the components; just follow the links above to the actual source code. If you are interested in adding support for K8s or Mesos directly to Ringman, that would be welcome.
This solution started out sounding a bit like a hack and in the end has been a great addition to our infrastructure. Hopefully it helps someone else solve a similar problem.