Building adaptive systems

Every production service has targets it needs to hit. These targets are often measured in successful requests per second or 99th-percentile latency. For a service to be considered resilient, it should keep trying to meet these targets even when confronted with overload or failures in the rest of the system. The tools that engineers have typically employed to stop cascading failures, such as circuit breakers, are a poor fit for building services that can adapt to an ever-changing production system. What we’d like instead is for our services to protect themselves, protect each other, and react to failures without operator intervention. In this talk, we’ll look at ways to build systems that can adapt to changes in latency, spikes in traffic, and systemic failures. To get there, we’ll discuss some basic queueing theory and congestion control algorithms, and how we can take advantage of these concepts in our systems.
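
To give a flavor of how a congestion control idea can carry over to a service, here is a minimal sketch of an adaptive concurrency limit. The abstract does not name a specific algorithm, so the choice of additive-increase/multiplicative-decrease (AIMD) is an assumption, as are the class name, method names, and default values, all of which are hypothetical. The sketch is single-threaded and illustrative only.

```python
class AIMDLimiter:
    """Hypothetical AIMD concurrency limit (illustrative, not thread-safe)."""

    def __init__(self, initial_limit=10, min_limit=1, max_limit=100,
                 latency_target_s=0.1, backoff=0.5):
        # All defaults are assumed values chosen for illustration.
        self.limit = initial_limit
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.latency_target_s = latency_target_s
        self.backoff = backoff
        self.in_flight = 0

    def try_acquire(self):
        """Admit a request if under the current limit; otherwise shed it."""
        if self.in_flight >= self.limit:
            return False
        self.in_flight += 1
        return True

    def release(self, latency_s, ok=True):
        """Record a finished request and adjust the limit."""
        self.in_flight -= 1
        if not ok or latency_s > self.latency_target_s:
            # Multiplicative decrease: back off quickly on failure or slowness.
            self.limit = max(self.min_limit, int(self.limit * self.backoff))
        else:
            # Additive increase: probe for more capacity one slot at a time.
            self.limit = min(self.max_limit, self.limit + 1)
```

A caller would wrap each request in `try_acquire()` and `release(latency_s, ok)`, rejecting the request immediately when `try_acquire()` returns `False`. The limit then drifts upward while requests are fast and drops sharply when they fail or slow down, with no operator involved.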


Demonstrate the benefits of adaptive concurrency limits in production


This talk should appeal to anyone running medium- to high-scale systems that need to adapt to traffic spikes or failures elsewhere in the system.