From May 25 to May 27, Bitbucket Cloud engineers were gradually rolling out new behavior that improves security. The new behavior changed the traffic profile of an authentication service in a way that was not detected by auto-scaling logic.
In this state, the usual spikes in traffic at the top of the hour overwhelmed the service on two occasions: May 25 21:55 - May 26 00:24 and May 27 07:00 - 08:47 (all times UTC). Git operations that were able to authenticate – around 70% of requests – completed successfully but were excessively slow.
Engineers ultimately resolved the issue by scaling up the number of servers and running more processes on each server.
For a small fraction of requests, the change added slow API calls from one internal legacy service to a second, newer service. The added load from these calls was effectively hidden from our existing auto-scaling logic that monitored CPU, as waiting for an API response would block a legacy service worker while using almost zero CPU time.
In this configuration, the service became unstable: if too many workers got stuck in slow API calls at the same time, CPU utilization would go down while the incoming connection pool would fill up. Low CPU utilization meant that no new instances were added, while a full connection pool caused internal healthchecks to fail, leading to instance churn and reducing available capacity even further.
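The dynamics described above can be sketched with a small steady-state model. This is an illustrative sketch only, not the service's actual code: the `Instance` class, the `steady_state` and `cpu_autoscaler_wants_more` functions, and every number (worker count, core count, request rate, CPU time per request, slow-call fraction and duration, scaling threshold) are hypothetical values chosen to make the effect visible. It assumes synchronous workers that are held for the full duration of a blocking API call.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    workers: int  # synchronous worker processes per instance (hypothetical)
    cores: int    # CPU cores per instance (hypothetical)


def steady_state(inst, offered_rps, cpu_s, slow_fraction, slow_wait_s):
    """Return (cpu_utilization, worker_occupancy, served_rps) for one instance.

    cpu_s:         CPU seconds consumed per request
    slow_fraction: share of requests that make the slow API call
    slow_wait_s:   seconds a worker sits blocked on that call (near-zero CPU)
    """
    # Average time a worker is held per request: CPU work plus blocked time.
    avg_hold_s = cpu_s + slow_fraction * slow_wait_s
    # The worker pool caps throughput regardless of how idle the CPU is.
    max_rps = inst.workers / avg_hold_s
    served_rps = min(offered_rps, max_rps)
    # Only served requests burn CPU, so CPU falls once workers saturate.
    cpu_util = served_rps * cpu_s / inst.cores
    worker_occ = min(1.0, offered_rps * avg_hold_s / inst.workers)
    return cpu_util, worker_occ, served_rps


def cpu_autoscaler_wants_more(cpu_util, threshold=0.6):
    # A scaler that watches CPU alone (threshold is an assumed value).
    return cpu_util > threshold


inst = Instance(workers=16, cores=4)

# Before the rollout: no slow API calls.
base_cpu, base_occ, base_rps = steady_state(inst, 100, 0.02, 0.0, 0.0)

# After the rollout: 5% of requests block for 3 seconds on the newer service.
slow_cpu, slow_occ, slow_rps = steady_state(inst, 100, 0.02, 0.05, 3.0)

for label, cpu, occ, rps in [
    ("before", base_cpu, base_occ, base_rps),
    ("after", slow_cpu, slow_occ, slow_rps),
]:
    print(f"{label}: cpu={cpu:.2f} workers_busy={occ:.2f} served_rps={rps:.1f} "
          f"scale_up={cpu_autoscaler_wants_more(cpu)}")
```

Under these assumed numbers, the worker pool becomes fully occupied after the change while CPU utilization falls slightly, so a scaler keyed on CPU alone never asks for more instances. A signal based on worker occupancy or queue depth would show the saturation that CPU hides.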
Just after 07:00 UTC on May 27, during one of Bitbucket Cloud's periodic top-of-hour traffic peaks, too many workers got stuck in slow API calls, causing a partial outage. The specific legacy service was soon identified as the cause, but the low CPU utilization and lack of application-level errors were puzzling.
Redeploying initially appeared to solve the problem, supporting the theory that a transient error had put the service in a bad state, but the next top-of-hour spike triggered the same problem. Engineers finally resolved the issue by manually scaling up the number of instances, but it took a while to figure out why that worked.