On June 30, 2021, between 5:05 PM and 5:15 PM UTC, Bitbucket customers experienced errors and increased latency when browsing the site or using the Bitbucket APIs. The event was triggered by a bug in auto-scaling configuration that left the services backing Bitbucket's web frontend under-provisioned. The incident was detected within 4 minutes by monitoring and was quickly mitigated by increasing capacity, which restored Atlassian systems to a known good state. The total time to resolution was about 14 minutes.
The impact window was between 5:05 PM and 5:15 PM UTC. Customers would have experienced errors or increased latency when interacting with dashboards, pull requests, commits, or Pipelines. The impact was limited to the website and API services and did not extend to Git over the command line (git over HTTPS or SSH). Average response times for the website and API services were elevated by 300%.
The issue was caused by a change to the auto-scaling configuration of the services that back Bitbucket's website and APIs. As a result, a scheduled scaling event aggressively reduced capacity for these services, which left them under-provisioned.
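To illustrate the failure mode (the actual Bitbucket configuration is not public, and every name and number below is hypothetical), a scheduled scale-down action of roughly this shape can pin a fleet below what sustained traffic requires:

```python
# Hypothetical sketch of a scheduled scale-down action; names and values
# are invented for illustration, not Bitbucket's real configuration.
scheduled_action = {
    "name": "evening-scale-down",
    "schedule": "cron(0 17 * * ? *)",  # fires at 5:00 PM UTC each day
    "min_capacity": 4,                 # floor applied by the scheduled event
    "max_capacity": 8,                 # ceiling: auto-scaling cannot exceed this
}

def is_under_provisioned(required_nodes: int, action: dict) -> bool:
    """True if demand needs more nodes than the action's ceiling allows."""
    return required_nodes > action["max_capacity"]

# A demand of 12 nodes cannot be met under an 8-node cap.
print(is_under_provisioned(12, scheduled_action))
```

When the cap is set below peak demand, no amount of reactive auto-scaling can recover; the only mitigation is to raise the configured capacity, which is what resolved this incident.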
The Bitbucket team analyzed traffic patterns to determine the ideal scaling configuration, but we failed to select a node count that could handle Bitbucket website throughput at all times of day. This specific issue was not identified earlier because our automated pre-production testing did not catch it.
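The sizing mistake described above can be sketched as follows: choosing node counts from average throughput rather than peak throughput leaves the fleet short during the busiest hours. All numbers here are hypothetical:

```python
# Minimal sketch of sizing to average vs. peak traffic; every figure is invented.
import math

def nodes_needed(requests_per_sec: float, per_node_rps: float, headroom: float = 1.2) -> int:
    """Nodes required to serve a load, with a safety-headroom multiplier."""
    return math.ceil(requests_per_sec * headroom / per_node_rps)

hourly_rps = [800, 950, 1200, 2400, 2100, 900]  # hypothetical traffic samples over a day
per_node_rps = 200                              # hypothetical per-node capacity

avg_sized = nodes_needed(sum(hourly_rps) / len(hourly_rps), per_node_rps)
peak_sized = nodes_needed(max(hourly_rps), per_node_rps)

print(avg_sized, peak_sized)  # prints "9 15": average-based sizing misses the peak hour
```

A configuration sized for the average (9 nodes here) is under-provisioned by a wide margin during the peak hour, which needs 15.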
We are prioritizing improvement actions to avoid repeating this type of incident.
Furthermore, although we deploy our changes progressively to limit broad impact, in this case our automated testing did not prevent the incident. To minimize the impact of future breaking changes to our environments, we will implement additional scaling test coverage.
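One form such a scaling test could take (sketched under the assumption that scaling schedules and observed peak traffic are available as plain data; the names here are hypothetical, not a description of Atlassian's actual tooling) is a pre-production check that flags any scheduled action whose capacity floor cannot serve recorded peak load:

```python
# Hypothetical pre-production guard: reject scaling schedules whose capacity
# floor is below what observed peak traffic requires. Names are invented.
import math

def validate_schedule(actions: list[dict], peak_rps: float, per_node_rps: float) -> list[str]:
    """Return the names of scheduled actions that cannot serve peak load."""
    required = math.ceil(peak_rps / per_node_rps)
    return [a["name"] for a in actions if a["min_capacity"] < required]

actions = [
    {"name": "morning-scale-up", "min_capacity": 16},
    {"name": "evening-scale-down", "min_capacity": 4},  # too aggressive for peak traffic
]

# With a 2400 rps peak and 200 rps per node, 12 nodes are required,
# so the evening action is flagged before it ever reaches production.
print(validate_schedule(actions, peak_rps=2400, per_node_rps=200))
```

Running a check like this in CI against recent traffic data would catch an aggressive scale-down before deployment rather than in production.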
We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve our platform’s performance and availability.
Thanks,
Atlassian Customer Support