Issues with Pull Requests
Incident Report for Atlassian Bitbucket
Postmortem

SUMMARY

On May 11, 2021, between 06:19 and 09:34 UTC, customers of Bitbucket Cloud were unable to perform various tasks, including creating and merging pull requests. The event was triggered by the deployment of incorrect configuration to Bitbucket’s services responsible for handling background tasks. The changes included incorrect URLs for some internal resources, which impacted all customers using the website and API services. The incident was detected within 20 minutes by automated verification tests and mitigated by routing production traffic to another set of infrastructure and then fixing the problematic configuration, which returned our systems to a known good state. The total time to resolution was 3 hours and 15 minutes.
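
For illustration only (the actual service and configuration are internal to Bitbucket), the failure mode resembled a background-task worker reading internal service URLs from its configuration, where a single incorrect URL causes every task that depends on that service to fail. The keys, endpoints, and function below are hypothetical:

    import json
    import urllib.request

    # Hypothetical example of the kind of configuration a background-task worker loads.
    # An incorrect internal hostname here breaks every dependent task at dispatch time.
    WORKER_CONFIG = {
        "task_queue_url": "https://internal-queue.example.net/v1/enqueue",  # assumed key and URL
        "search_indexer_url": "https://indexer.example.net/v1/index",       # assumed key and URL
    }

    def dispatch_task(name: str, payload: dict) -> int:
        """Send a background task to the configured queue endpoint."""
        req = urllib.request.Request(
            WORKER_CONFIG["task_queue_url"],
            data=json.dumps({"task": name, "payload": payload}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        # With a wrong URL this raises a connection error instead of enqueueing the task.
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.status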

IMPACT

The incident caused service disruption to all customers using Bitbucket’s website and API services attempting to perform operations that require background processing. These operations include creating repositories and pull requests, merging or declining pull requests, and triggering Pipelines builds, as well as indexing code for search and powering notifications.

Within approximately 40 minutes, the Bitbucket engineering team mitigated the impact by redirecting traffic to separate infrastructure with a working configuration. At this point, around 07:00 UTC, users were once again able to perform all operations, but website and API performance became severely degraded. Traffic was finally routed back to Bitbucket’s primary infrastructure, with its configuration fixed, by 09:34 UTC, at which point performance was restored to normal levels.

ROOT CAUSE

The issue was caused by a change to the configuration file for Bitbucket’s service responsible for executing asynchronous tasks. Typically, changes like these are detected by post-deployment verification checks that run automatically against a staging environment before a change is promoted to production. In this case, the problematic configuration was not applied to the existing staging environment but to a new environment intended to replace it. Because the post-deployment checks were not configured to validate this new environment alongside the current one as part of the release process, the broken configuration was not detected until after it had been applied in production.
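
As an illustrative sketch only (not Bitbucket’s actual tooling), a post-deployment verification check of this kind probes the newly deployed service and fails the release if any probe fails; the staging URL and endpoint paths below are assumptions:

    import sys
    import urllib.request

    # Hypothetical staging base URL; the real checks run inside the release pipeline.
    STAGING_BASE_URL = "https://staging.example.net"

    # Endpoints the asynchronous-task service is expected to expose (assumed paths).
    CHECKS = ["/healthcheck", "/internal/task-queue/status"]

    def verify(base_url: str) -> bool:
        """Return True only if every check endpoint responds with HTTP 200."""
        for path in CHECKS:
            try:
                with urllib.request.urlopen(base_url + path, timeout=10) as resp:
                    if resp.status != 200:
                        print(f"FAIL {path}: HTTP {resp.status}")
                        return False
            except OSError as exc:
                # A misconfigured internal URL typically surfaces here as a connection error.
                print(f"FAIL {path}: {exc}")
                return False
            print(f"OK   {path}")
        return True

    if __name__ == "__main__":
        # A non-zero exit code blocks promotion of the release to production.
        sys.exit(0 if verify(STAGING_BASE_URL) else 1)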

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity. While we have a number of testing and preventative processes in place, this specific issue was not identified in advance because our teams were migrating to new infrastructure while porting the existing verification and validation tooling to target that infrastructure in parallel.

In response to this incident, we are prioritizing the following actions to prevent this type of incident from recurring:

  • We are updating our release pipeline to trigger deployments to both the old and new staging environments, validating both old and new configurations, as a prerequisite for production deployments.
  • In addition, moving forward, post-deployment verification checks will be executed against both the old and new staging environments until the old staging environment is fully decommissioned (see the sketch below).
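
As a minimal sketch of the second point (reusing the hypothetical verify() check from the ROOT CAUSE section and assumed environment URLs), the same verification runs against both staging environments and blocks the production deployment if either fails:

    import sys

    from verify_deploy import verify  # hypothetical module holding the verify() sketch above

    # Hypothetical URLs for the old and new staging environments.
    STAGING_ENVIRONMENTS = {
        "staging-old": "https://staging-old.example.net",
        "staging-new": "https://staging-new.example.net",
    }

    def verify_all() -> bool:
        """Run post-deployment checks against every staging environment."""
        all_passed = True
        for name, base_url in STAGING_ENVIRONMENTS.items():
            passed = verify(base_url)
            print(f"{name}: {'passed' if passed else 'failed'}")
            all_passed = all_passed and passed
        return all_passed

    if __name__ == "__main__":
        # Production deployment proceeds only when both environments pass.
        sys.exit(0 if verify_all() else 1)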

We apologize to customers who were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted May 19, 2021 - 14:50 UTC

Resolved
This incident has been resolved.
Posted May 11, 2021 - 10:35 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted May 11, 2021 - 09:43 UTC
Update
We are continuing to investigate this issue.
Posted May 11, 2021 - 08:45 UTC
Update
We are still investigating this issue.
Posted May 11, 2021 - 07:19 UTC
Investigating
We are investigating cases of degraded performance with Pull Requests for some of our Atlassian Bitbucket Cloud customers. We will provide more details within the next hour.
Posted May 11, 2021 - 06:42 UTC
This incident affected: Website, API, Webhooks, and Pipelines.