On June 30, 2021, between 5:05 PM and 5:15 PM UTC, Bitbucket customers experienced errors and increased latency when browsing the site or using the Bitbucket APIs. The event was triggered by a bug in auto-scaling configuration that left the services backing Bitbucket's web frontend under-provisioned. The incident was detected within 4 minutes by monitoring and was quickly mitigated by increasing capacity, which restored Atlassian systems to a known good state. The total time to resolution was about 14 minutes.
The impact window was between 5:05 PM and 5:15 PM UTC. Customers would have experienced errors or increased latency when interacting with dashboards, pull requests, commits, or Pipelines. The impact was limited to the website and API services and did not extend to Git over the command line (git over HTTPS or SSH). Average response times for the website and API services were elevated by 300%.
The issue was caused by a change to the auto-scaling configuration of the services that back Bitbucket's website and APIs. As a result, a scheduled scaling event aggressively reduced capacity for these services, which left them under-provisioned.
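To illustrate the failure mode (the actual Bitbucket configuration is not public, and every name and number below is hypothetical), a scheduled scale-down action of roughly this shape can pin a fleet below what sustained traffic requires:

```python
# Hypothetical sketch of a scheduled scale-down action; names and values
# are invented for illustration, not Bitbucket's real configuration.
scheduled_action = {
    "name": "evening-scale-down",
    "schedule": "cron(0 17 * * ? *)",  # fires at 5:00 PM UTC each day
    "min_capacity": 4,                 # floor applied by the scheduled event
    "max_capacity": 8,                 # ceiling: auto-scaling cannot exceed this
}

def is_under_provisioned(required_nodes: int, action: dict) -> bool:
    """True if demand needs more nodes than the action's ceiling allows."""
    return required_nodes > action["max_capacity"]

# A demand of 12 nodes cannot be met under an 8-node cap.
print(is_under_provisioned(12, scheduled_action))
```

When the cap is set below peak demand, no amount of reactive auto-scaling can recover; the only mitigation is to raise the configured capacity, which is what resolved this incident.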
The Bitbucket team analyzed traffic patterns to determine the ideal scaling configuration, but we failed to select a node count that could handle Bitbucket website throughput at all times of day. This specific issue was not identified earlier because our automated pre-production testing did not catch it.
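The sizing mistake described above can be sketched as follows: choosing node counts from average throughput rather than peak throughput leaves the fleet short during the busiest hours. All numbers here are hypothetical:

```python
# Minimal sketch of sizing to average vs. peak traffic; every figure is invented.
import math

def nodes_needed(requests_per_sec: float, per_node_rps: float, headroom: float = 1.2) -> int:
    """Nodes required to serve a load, with a safety-headroom multiplier."""
    return math.ceil(requests_per_sec * headroom / per_node_rps)

hourly_rps = [800, 950, 1200, 2400, 2100, 900]  # hypothetical traffic samples over a day
per_node_rps = 200                              # hypothetical per-node capacity

avg_sized = nodes_needed(sum(hourly_rps) / len(hourly_rps), per_node_rps)
peak_sized = nodes_needed(max(hourly_rps), per_node_rps)

print(avg_sized, peak_sized)  # prints "9 15": average-based sizing misses the peak hour
```

A configuration sized for the average (9 nodes here) is under-provisioned by a wide margin during the peak hour, which needs 15.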
We are prioritizing improvement actions to avoid repeating this type of incident.
Furthermore, although we deploy our changes progressively to limit broad impact, in this case our automated testing did not prevent the incident. To minimize the impact of future breaking changes to our environments, we will implement additional scaling test coverage.
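One form such a scaling test could take (sketched under the assumption that scaling schedules and observed peak traffic are available as plain data; the names here are hypothetical, not a description of Atlassian's actual tooling) is a pre-production check that flags any scheduled action whose capacity floor cannot serve recorded peak load:

```python
# Hypothetical pre-production guard: reject scaling schedules whose capacity
# floor is below what observed peak traffic requires. Names are invented.
import math

def validate_schedule(actions: list[dict], peak_rps: float, per_node_rps: float) -> list[str]:
    """Return the names of scheduled actions that cannot serve peak load."""
    required = math.ceil(peak_rps / per_node_rps)
    return [a["name"] for a in actions if a["min_capacity"] < required]

actions = [
    {"name": "morning-scale-up", "min_capacity": 16},
    {"name": "evening-scale-down", "min_capacity": 4},  # too aggressive for peak traffic
]

# With a 2400 rps peak and 200 rps per node, 12 nodes are required,
# so the evening action is flagged before it ever reaches production.
print(validate_schedule(actions, peak_rps=2400, per_node_rps=200))
```

Running a check like this in CI against recent traffic data would catch an aggressive scale-down before deployment rather than in production.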
We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve our platform’s performance and availability.
Thanks,
Atlassian Customer Support