On October 4, 2022, between 7:17 AM and 8:12 AM UTC, Atlassian customers using Bitbucket Cloud were unable to access the website. The event was triggered by a configuration change that exposed a bug in the version of Envoy proxy we use for cluster management. Envoy is a high-performance, small-footprint edge and service proxy that also acts as a load balancer. The incident was detected within 14 minutes by automated monitoring, and on-call engineers were paged immediately. We resolved the incident by reverting the Envoy configuration change that exposed the bug, which mitigated the customer impact and returned the website to a healthy state. The total time to resolution was about 55 minutes.
The impact lasted from 7:17 AM to 8:12 AM UTC. During that window, users could not access the Bitbucket Cloud website. The Bitbucket Cloud API, Git over SSH, and Git over HTTPS were not impacted by this incident.
The issue was caused by a bug in Envoy proxy: it does not handle hostnames in the Redis cluster command. We had deployed Envoy configuration for a new ElastiCache Redis cluster that other microservices were planned to use in the near future. Existing Bitbucket website instances were not affected after the configuration deployment. Instead, the bug was exposed when the new configuration was applied to new instances created by a scale-up event. Those new instances failed to start because they depend on Envoy proxy being up and running, so the website could not scale. As traffic to the Bitbucket Cloud website increased, the existing instances could not keep up: they hit high CPU utilization, failed health checks, and eventually crashed.
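To illustrate the kind of input involved, the minimal sketch below separates Redis node addresses that are IP literals from those that are hostnames, the case the Envoy bug did not handle. It is not part of our tooling. The function names and the sample addresses are hypothetical, and the sketch assumes node addresses are available as (host, port) pairs.

```python
import ipaddress


def is_ip_literal(host: str) -> bool:
    """Return True if host is an IPv4 or IPv6 literal, False if it is a hostname."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def hostname_endpoints(endpoints):
    """Return the (host, port) pairs whose host is a hostname, not an IP literal."""
    return [(host, port) for host, port in endpoints if not is_ip_literal(host)]


# Hypothetical node addresses, as a cluster topology reply might list them.
nodes = [
    ("10.0.1.15", 6379),
    ("redis-node-1.example.internal", 6379),
    ("2001:db8::1", 6379),
]

flagged = hostname_endpoints(nodes)
for host, port in flagged:
    print(f"hostname endpoint: {host}:{port}")
```

A check of this shape could run before a proxy configuration is rolled out, so that hostname-based endpoints are reviewed against the proxy version in use.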
We know that outages impact your productivity. While we have a number of testing and preventative processes in place, including post-deployment verification tests across all environments, this specific issue was not identified because the Envoy proxy bug manifested only on newly provisioned instances of Envoy proxy running as part of the microservice. Existing instances of the microservice were not affected by the bug.
We are prioritizing the following improvement actions to avoid repeating this type of incident:
We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.
Thanks,
Atlassian Customer Support