Bitbucket website and API outage
Incident Report for Atlassian Bitbucket
Postmortem

SUMMARY

On Oct 4th, 2022, between 7:17 AM and 8:12 AM UTC, Atlassian customers using Bitbucket Cloud were unable to access the website. The event was triggered by a configuration change that hit a bug in the version of Envoy proxy we use for cluster management. Envoy is a high-performance, small-footprint edge and service proxy that also acts as a load balancer. The incident was detected within 14 minutes by automated monitoring, and on-call engineers were immediately paged. It was resolved by reverting the Envoy configuration that exposed the bug, which mitigated the customer impact and restored the website to a healthy state. The total time to resolution was about 55 minutes.

IMPACT

The overall impact was between 7:17 AM and 8:12 AM UTC. The incident affected the Bitbucket Cloud website, preventing users from accessing it. The Bitbucket Cloud API, Git over SSH, and Git over HTTPS were not impacted.

ROOT CAUSE

The issue was caused by a bug in Envoy proxy where it does not handle hostnames in the Redis cluster command. We had deployed an Envoy configuration for a new ElastiCache Redis cluster that was planned for use by other microservices in the near future. The configuration change did not affect existing Bitbucket website instances, so they continued running normally after the deployment. Instead, the bug was exposed on new instances created by a scale-up event: those instances failed to start because they depend on Envoy being up and running, so the website could not scale. Unable to keep up with the increase in traffic to the Bitbucket Cloud website, the existing instances hit high CPU utilization, failed health checks, and eventually crashed.
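For illustration, a Redis-mode cluster of this kind is declared in Envoy's configuration roughly as follows. This is a hypothetical sketch, not our actual configuration: the cluster name, endpoint hostname, and port are assumptions. The relevant detail is that the endpoint address is a DNS hostname rather than an IP address, which is the shape of configuration that exercised the affected Redis code path:

```yaml
# Hypothetical Envoy cluster definition for a Redis (cluster mode) backend.
# Names and addresses below are placeholders for illustration only.
clusters:
  - name: new_redis_cluster
    connect_timeout: 1s
    cluster_type:
      name: envoy.clusters.redis   # Envoy's Redis cluster discovery extension
    lb_policy: CLUSTER_PROVIDED    # topology is provided by the Redis cluster itself
    load_assignment:
      cluster_name: new_redis_cluster
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    # A DNS hostname (not an IP), as ElastiCache endpoints are
                    address: example-cache.abc123.use1.cache.amazonaws.com
                    port_value: 6379
```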

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity. While we have a number of testing and preventative processes in place (including post-deployment verification tests across all environments), this specific issue wasn’t identified because the Envoy bug manifested only on newly provisioned Envoy instances within the microservice; existing instances of the microservice were not affected.

We are prioritizing the following improvement actions to avoid repeating this type of incident:

  • Redeploy the Bitbucket Cloud service whenever Envoy configuration changes are deployed, and run post-deployment verification tests in our pre-production environments, to prevent this incident from recurring. This also ensures our post-deployment checks run against the new configuration immediately after deployment, without waiting for the website to scale onto it.
  • Upgrade Envoy proxy to the most recent version, which includes the fix for the bug in the Redis filter.
  • Implement faster rollbacks for the Envoy control plane service that pushes configuration to Envoy proxy, to reduce our time to recovery (TTR) in the event this occurs again.
  • Improve the speed of alerting and monitoring for detecting microservice degradation, to reduce our time to detection (TTD) in the event this occurs again.
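The core of a post-deployment verification test like the one described in the first action item is a poll-until-healthy loop against the freshly deployed instances. The sketch below is hypothetical and not our actual tooling; the function name and the stub health check are assumptions for illustration:

```python
import time

def wait_for_healthy(check, timeout_s=60.0, interval_s=2.0):
    """Poll a health check until it passes or the timeout elapses.

    `check` is a zero-argument callable returning True when the service
    (e.g. a redeployed instance running against a new Envoy config)
    reports healthy. Returns True on success, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval_s)
    return False

# Example: a stub that becomes healthy on the third poll, standing in
# for an HTTP probe of a health endpoint.
_polls = {"n": 0}
def stub_check():
    _polls["n"] += 1
    return _polls["n"] >= 3

print(wait_for_healthy(stub_check, timeout_s=10.0, interval_s=0.01))  # True
```

Running such a check immediately after every Envoy configuration push, rather than waiting for a scale-up event, is what surfaces start-up failures like this one in pre-production.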

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Oct 13, 2022 - 01:35 UTC

Resolved
This incident has been resolved.
Posted Oct 04, 2022 - 08:39 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Oct 04, 2022 - 08:16 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Oct 04, 2022 - 08:14 UTC
Investigating
We are currently investigating this issue.
Posted Oct 04, 2022 - 07:36 UTC
This incident affected: Website and API.