Partial outage
Incident Report for Atlassian Bitbucket
Postmortem

SUMMARY

On July 13, 2021, between 1:18 PM and 2:28 PM UTC, customers on Atlassian’s Cloud Platform using Bitbucket experienced latency and timeout errors while browsing the site. The event was triggered by an increase in traffic combined with a misconfigured autoscaling policy, which left the website service under-provisioned. The incident was detected within 1 minute by automated monitoring and mitigated by manually scaling up the website service. The total time to resolution was 1 hour and 10 minutes.

IMPACT

The impact lasted from 1:18 PM to 2:28 PM UTC and affected the Bitbucket Cloud website. Average website response times increased 3x, and customers experienced latency and errors while browsing the site. In some cases, customers saw a generic HTTP 502 error message when attempting to load a webpage. Git services over HTTPS and SSH were not impacted.

ROOT CAUSE

The issue was caused by a misconfigured autoscaling policy and misconfigured health check settings for the website service. As a result, website nodes were marked as unhealthy prematurely, which reduced the capacity available to handle traffic.
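
To illustrate the failure mode described above, here is a minimal toy model in Python. It is a sketch under stated assumptions, not Bitbucket's actual configuration: the function name, the linear latency model, and every number used (node count, request rate, timeouts) are hypothetical. It shows how a health check timeout set too close to normal latency can remove nodes under load, which pushes more traffic onto the remaining nodes and removes even more capacity.

```python
def simulate_health_checks(nodes, total_rps, node_capacity_rps,
                           base_latency_ms, timeout_ms, rounds):
    """Return the in-service node count after each health check round.

    Toy model (an assumption, not a real load balancer): latency grows
    linearly once the pool is over capacity, and each round in which
    latency exceeds the health check timeout pulls one node from rotation.
    """
    history = []
    for _ in range(rounds):
        if nodes > 0:
            utilization = total_rps / (nodes * node_capacity_rps)
            latency_ms = base_latency_ms * max(1.0, utilization)
            if latency_ms > timeout_ms:
                nodes -= 1  # slow node is marked unhealthy and removed
        history.append(nodes)
    return history


# Hypothetical numbers: 10 nodes, 100 requests/s each, 1200 requests/s offered.
strict = simulate_health_checks(10, 1200, 100, 200, 220, rounds=3)
lenient = simulate_health_checks(10, 1200, 100, 200, 1000, rounds=3)
print(strict)   # [9, 8, 7]   tight timeout: capacity shrinks every round
print(lenient)  # [10, 10, 10] generous timeout: nodes stay in service
```

In this model the strict threshold turns a 20% overload into a shrinking pool, while the lenient threshold keeps all nodes serving (slowly) until more capacity can be added.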

REMEDIAL ACTION PLAN & NEXT STEPS

We know that outages impact your productivity. While we have a number of testing and preventative processes in place, this specific issue wasn’t identified earlier because the existing autoscaling configuration had worked under a variety of workloads for an extended period of time.

We are prioritizing the following improvement actions to avoid repeating this type of incident:

  • Tuning the Bitbucket website's autoscaling policy to enable it to scale up capacity more aggressively during increased traffic
  • Increasing the number of available connections to our website's reverse-proxy server so it doesn't become a bottleneck
  • Refining our website servers' health check thresholds to prevent capacity being inadvertently removed during periods of high traffic
  • Auditing autoscaling configuration across all Bitbucket web services
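
As a rough illustration of what scaling up "more aggressively" can mean, the following Python sketch computes a desired node count from current utilization. It is a simplified, hypothetical target-tracking calculation, not the policy Bitbucket uses: the function name, parameters, and all values are assumptions made for the example. Raising the per-evaluation step limit (or lowering the target utilization) lets capacity catch up with a traffic spike in fewer evaluation cycles.

```python
import math


def desired_capacity(current_nodes, utilization_pct, target_pct,
                     max_step, max_nodes):
    """Return the node count a scale-up-only policy would request.

    Hypothetical sketch: size the pool so utilization returns to the
    target, but add at most `max_step` nodes per evaluation and never
    exceed `max_nodes`. Scale-down is deliberately out of scope.
    """
    wanted = math.ceil(current_nodes * utilization_pct / target_pct)
    step_limited = min(wanted, current_nodes + max_step)
    return max(current_nodes, min(step_limited, max_nodes))


# Hypothetical spike: 10 nodes at 90% utilization, target of 60%.
conservative = desired_capacity(10, 90, 60, max_step=2, max_nodes=40)
aggressive = desired_capacity(10, 90, 60, max_step=10, max_nodes=40)
print(conservative)  # 12: needs several evaluations to reach 15 nodes
print(aggressive)    # 15: reaches the required capacity in one evaluation
```

The trade-off in a sketch like this is cost versus recovery time: a larger step limit may briefly over-provision, but it shortens the window in which the service runs above its target utilization.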

We apologize to those customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Jul 20, 2021 - 16:53 UTC

Resolved
This incident has been resolved.
Posted Jul 13, 2021 - 16:07 UTC
Monitoring
After scaling up website services, we have recovered and the Bitbucket website is operational again. We will continue to monitor performance and provide an update in the next hour.
Posted Jul 13, 2021 - 15:11 UTC
Identified
After initial investigation, we narrowed down the customer impact to the website. Git services over SSH or HTTPS were not impacted. We have scaled up website services and appear to be recovering. We are continuing to monitor performance and will provide another update within the hour.
Posted Jul 13, 2021 - 14:54 UTC
Investigating
We are investigating reports of intermittent errors for some of our Atlassian Bitbucket Cloud customers. We will provide more details once we identify the root cause.
Posted Jul 13, 2021 - 13:30 UTC
This incident affected: Website, API, Git via SSH, Authentication and user management, Git via HTTPS, Webhooks, Pipelines, and Signup.