On July 13, 2021 between 1:18 PM and 2:28 PM UTC, customers on Atlassian’s Cloud Platform using Bitbucket began to experience latency browsing the site as well as timeout errors. The event was triggered by an increase in traffic and a misconfigured auto-scaling policy that resulted in the website service being under provisioned. The incident was detected within 1 minute by automated monitoring and mitigated by manually scaling up the website service which mitigated the problem. The total time to resolution was 1 hour & 10 minutes.
The overall impact was between 1:18 PM and 2:28 PM UTC affecting the Bitbucket Cloud website. Average website response times increased 3x, and customers began to experience latency and errors browsing the site. In some cases customers would have seen a generic 502 HTTP error message when attempting to load a webpage. Git services over HTTPS and SSH were not impacted.
The issue was caused by misconfigured autoscaling policy and website service health check settings. As a result, website nodes were marked as unhealthy prematurely, which reduced capacity available to handle traffic.
We know that outages are impactful to your productivity. While we have a number of testing and preventative processes in place, this specific issue wasn’t identified earlier because the existing autoscaling configuration had been working under various workloads for an extended period of time.
We are prioritizing the following improvement actions to avoid repeating this type of incident:
We apologize to those customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.
Thanks,
Atlassian Customer Support