On September 25, 2022, between 01:46 PM and 09:19 PM UTC, some Atlassian customers using Bitbucket Cloud were unable to access their repositories. The event was triggered when our storage vendor experienced an outage at their data center. The outage was caused by a firmware upgrade that failed to apply correctly on a subset of their storage clusters. Our on-call SRE team detected the incident within 14 minutes and escalated it to our storage vendor, who began restoring the failed nodes to bring their storage services back online. The total time to resolution was 7 hours and 33 minutes.
The impact on Bitbucket Cloud lasted from 01:46 PM to 09:19 PM UTC. The incident disrupted service for some of our users: affected customers were unable to access repositories via the CLI or browse the Bitbucket Cloud website.
The issue was caused by a firmware update that our storage vendor applied to their storage nodes; a few nodes failed to update and went down over a span of several hours. As a result, affected Bitbucket Cloud customers could not access their repositories and received HTTP 504 errors.
The root cause of the incident was the firmware update process, which did not properly update and restart all storage nodes.
We know that outages are impactful to your productivity. While we have a number of testing and preventative processes in place, this specific failure with the firmware upgrade process wasn't detected prior to deployment.
We are prioritizing the following improvement actions to avoid repeating this type of incident:
We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.
Thanks,
Atlassian Customer Support