Solving Database Contention in a High-Volume, Kubernetes-Managed Service

Our service, serving over 50 million users daily, runs in Kubernetes with deployments of up to 79 pods. Each pod manages 20 sessions, and every session handles a user-configured cluster. The issue emerged during redeploys: with 79 pods (1580 sessions) restarting and all attempting to claim a cluster every minute, we faced severe database contention in our CockroachDB setup. This contention led to latencies far beyond our acceptable range.

The Heart of the Problem: Database Contention

In the context of CockroachDB, a distributed SQL database, contention occurs when multiple transactions compete for the same data or resources. This is particularly relevant during high-traffic scenarios, such as the startup process of our Kubernetes-managed service.

During our service's startup phase, each of the 79 pods attempts to manage 20 sessions, leading to a total of 1580 sessions concurrently trying to claim a user-configured cluster. The contention is exacerbated by the fact that each session is attempting to claim the same cluster, and thus the same database row. The result is a large number of transactions attempting to read and write the same row, which builds up a queue of transactions waiting to access it. This queue is known as a lock wait queue, and it was at the heart of our contention problem.
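The queueing behavior described above can be modeled in a few lines: if many "transactions" all write the same "row", they serialize behind a single row lock, exactly like conflicting transactions queueing on a contended key. This is a toy simulation, not CockroachDB's actual locking machinery:

```python
import threading

# Toy model of a lock wait queue: 100 "sessions" all try to write the same
# "row", so they block on one lock and proceed strictly one at a time.
row_lock = threading.Lock()
row = {"claimed_by": None, "writes": 0}
wait_order = []  # the order in which sessions got through the "queue"

def claim_cluster(session_id):
    with row_lock:  # every session blocks here: this is the wait queue
        wait_order.append(session_id)
        row["claimed_by"] = session_id
        row["writes"] += 1

threads = [threading.Thread(target=claim_cluster, args=(i,)) for i in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(row["writes"])  # 100: every session ran, but only one at a time
```

Every write succeeds eventually, but total time grows with queue length, which is why 1580 simultaneous claimants produced multi-second latencies.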

Contention alone isn't necessarily bad; it is a natural part of any database system. When contention is high, however, it can lead to increased latency and even deadlocks. In our case, contention was so severe that it drove latencies up to 30 seconds, far beyond our acceptable range. As this occurred on a weekend and I was the only engineer on call, I set out to find a solution.

Strategic Solution: A Two-Pronged Approach

To tackle this problem, I employed a two-pronged strategy focusing on modifying the deployment and then the code.

1. Modifying the Deployment

The first step was to slow the pod rollout. Instead of launching up to 25 pods simultaneously, which exacerbated the load on our database, I reduced the rolling-update surge to a more manageable 5 pods at a time (maxSurge=5).
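In a Deployment manifest this is a one-line change to the rolling-update strategy. A sketch (the names and replica count here are illustrative, not our actual manifest):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: session-manager    # illustrative name
spec:
  replicas: 79
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 5          # at most 5 new pods start at once (down from 25)
      maxUnavailable: 0    # keep full capacity during the rollout
```

This doesn't fix the contention itself; it just caps how many new pods (and therefore new sessions) can pile onto the database at once.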

2. Code-Level Solution: Redis to the Rescue

The next step involved a deeper dive into our codebase. The real issue was the concurrent database queries from multiple sessions. To address this, I turned to Redis distributed locks to ensure that only one session could claim a cluster at a time. This approach effectively serialized the startup process: each session had to wait for the previous session to finish claiming a cluster before it could claim one itself.
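The core of a Redis distributed lock is the `SET key token NX PX ttl` pattern: acquire only if the key is absent, attach a unique token, and let a TTL reclaim the lock if the holder dies. The sketch below implements that logic against an in-memory dict standing in for Redis so it is runnable here; in production the same two operations go through a Redis client, and the key name is purely illustrative:

```python
import time
import uuid

store = {}  # key -> (token, expiry_ts); in-memory stand-in for Redis


def acquire(key, ttl_ms):
    """SET key token NX PX ttl: succeed only if the key is absent or expired."""
    now = time.monotonic()
    entry = store.get(key)
    if entry is not None and entry[1] > now:
        return None  # another session holds the lock
    token = uuid.uuid4().hex
    store[key] = (token, now + ttl_ms / 1000)
    return token


def release(key, token):
    """Delete the key only if we still own it (compare-and-delete)."""
    entry = store.get(key)
    if entry is not None and entry[0] == token:
        del store[key]
        return True
    return False


# One session claims the cluster; a second is forced to wait its turn.
t1 = acquire("lock:cluster:42", ttl_ms=5000)
t2 = acquire("lock:cluster:42", ttl_ms=5000)
print(t1 is not None, t2)  # True None
release("lock:cluster:42", t1)
print(acquire("lock:cluster:42", ttl_ms=5000) is not None)  # True
```

The unique token matters: it prevents a session whose lock has expired from deleting a lock that has since been acquired by someone else.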

This not only alleviated the database contention but also sped up the startup process: rather than contention causing serious latencies (up to 5s), startup was now limited only by the time it took to actually claim a cluster, which was fast (under 100ms).

Post-Mortem: Additional Changes

Building on the success of the Redis locks, I also implemented a checkpoint-based startup system. Before a session was unsettled during a shutdown, it was checked against its checkpoint for any modifications, and unmodified sessions skipped the write entirely. This added layer of verification meant fewer unnecessary queries to the database, further streamlining our operations.
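One way to sketch the checkpoint idea (the class and field names here are hypothetical, not our actual code): fingerprint the session's state when it is claimed, and at shutdown compare the current state against that fingerprint so that untouched sessions never hit the database.

```python
import hashlib
import json


def fingerprint(state):
    # Stable hash of the session state (keys sorted for determinism).
    return hashlib.sha256(json.dumps(state, sort_keys=True).encode()).hexdigest()


class Session:
    def __init__(self, state):
        self.state = state
        self.checkpoint = fingerprint(state)  # taken at claim/startup time

    def shutdown_write_needed(self):
        # Only query the database if state diverged from the checkpoint.
        return fingerprint(self.state) != self.checkpoint


s = Session({"cluster": "c-42", "status": "idle"})
print(s.shutdown_write_needed())  # False: untouched session, skip the write
s.state["status"] = "busy"
print(s.shutdown_write_needed())  # True: modified, must persist
```

Across 1580 sessions, most of which are unmodified during a routine redeploy, skipping these writes removes a large fraction of the shutdown-time query load.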

Reflections

As the only engineer available, tackling this issue threw me into unfamiliar territory. I had no option but to dive in and give it my all. This challenge pushed me to rapidly expand my technical understanding, demanding quick research, swift code changes, and a relentless focus on maintaining uninterrupted service. Performing these tasks under pressure, in a large-scale, distributed system, was a true test of agility and adaptability. This experience reinforced the reality of software engineering: sometimes, you have to tackle challenges head-on, even when they fall outside your comfort zone.

Vocabulary

  1. Kubernetes (K8s): A system for automating the deployment and management of applications in a containerized environment. It organizes clusters of machines to efficiently manage and scale applications.

  2. Pod: The basic operational unit in Kubernetes. Each pod encapsulates one or more applications, along with all necessary resources and dependencies, in a single operational environment.

  3. Session: Refers to a temporary interactive information exchange between two systems. In our service, it means an active worker within a pod that manages one user-configured cluster.

  4. CockroachDB: A distributed SQL database designed for horizontal scalability and resilience, capable of efficiently distributing data across multiple servers and ensuring high availability.

  5. Database Contention: Occurs when multiple operations compete for access to the same data in a database, potentially leading to performance bottlenecks due to these conflicts.

  6. Distributed SQL Database: A database system that spreads its data across multiple servers, enhancing data availability and scalability by distributing workload and storage requirements across a network of nodes.

  7. Lock Wait Queue: A system in database management where transactions wait in line to access certain data currently in use by another transaction. This helps manage data consistency but can cause delays if the queue is long.

  8. Redis: An in-memory data structure store used as a database, cache, or message broker, known for its speed in managing data that requires quick and efficient access.

  9. Atomic/Distributed Locks: Mechanisms ensuring that a piece of code or a data structure can only be accessed by one process at a time, preventing conflicts in scenarios where multiple processes might try to modify the same data simultaneously.