Seesaw: scalable and robust load balancing

Friday, January 29, 2016

Like all good projects, this one started out because we had an itch to scratch…

As Site Reliability Engineers who manage corporate infrastructure at Google, we deal with a large number of internally used services that need to be load balanced for scalability and reliability. In 2012, two different platforms were used to provide load balancing, each of which presented its own set of management and stability challenges. To alleviate these issues, our team set about looking for a replacement load balancing platform.

After evaluating a number of platforms, including existing open source projects, we were unable to find one that met all of our needs, so we decided to develop a robust and scalable load balancing platform ourselves. The requirements were not exactly complex: we needed the ability to handle traffic for unicast and anycast VIPs, perform load balancing with NAT and DSR (also known as DR), and perform adequate health checks against the backends. Above all, we wanted a platform that allowed for ease of management, including automated deployment of configuration changes.
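To make the health check requirement concrete, here is a minimal sketch in Go of checking backends over TCP. It is an illustration under stated assumptions, not Seesaw's actual implementation: the names `Checker`, `tcpChecker`, and `healthyBackends` are hypothetical, and a backend is simply assumed to be healthy if a TCP connection to it succeeds within a timeout.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// Checker reports whether the backend at addr is healthy.
// (Hypothetical type, introduced for this sketch only.)
type Checker func(addr string) bool

// tcpChecker returns a Checker that treats a backend as healthy if a TCP
// connection can be established within timeout.
func tcpChecker(timeout time.Duration) Checker {
	return func(addr string) bool {
		conn, err := net.DialTimeout("tcp", addr, timeout)
		if err != nil {
			return false
		}
		conn.Close()
		return true
	}
}

// healthyBackends returns the subset of addrs that pass check, preserving order.
func healthyBackends(addrs []string, check Checker) []string {
	var healthy []string
	for _, addr := range addrs {
		if check(addr) {
			healthy = append(healthy, addr)
		}
	}
	return healthy
}

func main() {
	// A local listener stands in for a live backend.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer ln.Close()

	// Hypothetical backend list: the listener above plus a port that is
	// assumed to have nothing listening on it.
	backends := []string{ln.Addr().String(), "127.0.0.1:1"}
	fmt.Println("healthy:", healthyBackends(backends, tcpChecker(500*time.Millisecond)))
}
```

Keeping the check behind a function type means other probes (HTTP, DNS, and so on) could be swapped in without changing the code that filters backends.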

One of the two existing platforms was built upon Linux Virtual Server (LVS), which provided the necessary load balancing at the network level. This was known to work successfully, and we opted to retain it for the new platform. Several design decisions were made early on in the project. The first was to use the Go programming language, since it provided an incredibly powerful way to implement concurrency (goroutines and channels), along with easy interprocess communication (net/rpc). The second was to implement a modular multi-process architecture. The third was to simply abort and terminate a process if it ended up in an unknown state, which would ideally allow for failover and/or self-recovery.
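The first and third decisions can be sketched together in a few lines of Go. This is only an illustration, not code from Seesaw: the `State` and `Event` types, their values, and the `transition` function are all hypothetical. A goroutine delivers events over a channel, and the process terminates as soon as it sees a state and event combination it does not understand, leaving recovery to a restart or a failover peer.

```go
package main

import (
	"fmt"
	"log"
)

// State and Event are hypothetical types for this sketch only.
type State int

const (
	Backup State = iota
	Master
)

type Event int

const (
	PeerDown Event = iota
	PeerUp
)

// transition returns the next state, or ok=false if the combination of
// state and event is not one the process knows how to handle.
func transition(s State, e Event) (next State, ok bool) {
	switch {
	case s == Backup && e == PeerDown:
		return Master, true // take over when the peer disappears
	case s == Backup && e == PeerUp:
		return Backup, true
	case s == Master && (e == PeerDown || e == PeerUp):
		return Master, true
	}
	return s, false
}

func main() {
	events := make(chan Event)

	// A goroutine stands in for a component that reports peer status.
	go func() {
		defer close(events)
		for _, e := range []Event{PeerUp, PeerDown} {
			events <- e
		}
	}()

	state := Backup
	for e := range events {
		next, ok := transition(state, e)
		if !ok {
			// Fail fast: exit rather than continue in an unknown state.
			log.Fatalf("unknown state/event combination (%d, %d), exiting", state, e)
		}
		state = next
	}
	fmt.Printf("final state: %d\n", state)
}
```

Exiting on an unrecognised combination keeps the error handling small: the process never has to reason about a state it was not designed for.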

After a period of concentrated development effort, we completed and successfully deployed Seesaw v2 as a replacement for both existing platforms. Overall it allowed us to increase service availability and reduce management overhead. We're pleased to be able to make this platform available to the rest of the world and hope that other enterprises are able to benefit from this project. You can find the code at https://github.com/google/seesaw.

By Joel Sing, Google Site Reliability Engineer