Atlassian Cloud products and applications, the customer experiences they provide, their supporting services, and Atlassian’s own internal tooling all rely on compute and data workloads deployed in Amazon Web Services (AWS) Virtual Private Clouds (VPCs).
Critical to the function of these applications is the network communication between them, as well as to the Internet and AWS-managed services. This communication depends on DNS resolution, which applications perform by querying a DNS server to translate service names into network addresses they can connect to.
To facilitate DNS resolution across multiple accounts and VPCs, Atlassian operates centralized EC2 (Elastic Compute Cloud) DNS servers in each AWS region we serve customers from.
Atlassian is currently migrating from this centralized infrastructure to a more resilient distributed solution that uses the AWS-managed VPC resolver in a Shared VPC architecture. However, a number of internal applications and services still rely on the EC2-based solution.
The EC2-based DNS servers utilize security groups, which are subject to connection tracking allowance limits instrumented by AWS.
On November 22, 2022, between 19:39 and 21:28 UTC, some Jira, Confluence, and Bitbucket customers experienced varying levels of degradation across our products and services, including partner apps and integrations. The incident was detected within nine minutes by an automated monitoring system and resolved within one hour and 49 minutes.
The event was triggered when Atlassian’s DNS infrastructure in the AWS us-east-1 region encountered a network connection tracking allowance limit. Although the fault was localized to one region, the impact was global: it affected customers whose data resided in us-east-1, as well as Atlassian products that have unique dependencies on that region.
The issue was initially mitigated by scaling up the DNS infrastructure, which increased the connection tracking allowance limit and returned Atlassian products to a healthy state. The underlying issue has since been resolved by a configuration change and cannot reoccur.
The incident was not caused by an attack and it was not a security issue. Atlassian customer data remains secure.
We deeply value the trust placed in Atlassian and apologize to customers who were affected by the event.
The impact occurred on November 22, 2022, between 19:39 and 21:28 UTC, for a total of one hour and 49 minutes. Some Jira, Confluence, and Bitbucket customers experienced service degradation ranging from intermittent access to a complete outage. Customers experienced error pages or very slow responses in applications and browsers, as well as 5xx response codes from our APIs.
The incident occurred as connections through Atlassian’s DNS infrastructure reached a new daily peak due to steady traffic growth. As a result, the infrastructure encountered a limit on the number of simultaneous network connections tracked by an AWS EC2 security group. Once this limit was reached, DNS packets were dropped, so services were unable to resolve network addresses, which resulted in application failures. Services retried their DNS queries upon receiving a SERVFAIL response or query timeout, which created even more connections and compounded the problem.
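The compounding effect of retries can be illustrated with a small toy model. This is only a sketch under simplifying assumptions, not a reconstruction of Atlassian's traffic: time advances in discrete ticks, every query opens one tracked connection, drops are spread evenly across all attempts, and the function name, parameters, and numbers are all hypothetical.

```python
def simulate_retry_storm(new_per_tick, limit, max_retries, ticks):
    """Toy model of DNS retries against a fixed connection tracking limit.

    new_per_tick: fresh queries arriving each tick (hypothetical value)
    limit:        tracked connections allowed per tick (hypothetical value)
    max_retries:  how many times a dropped query is retried before giving up
    Returns a list of (offered_connections, drop_fraction), one per tick.
    """
    # pending[k] holds queries waiting to make their k-th retry (k >= 1).
    pending = [0.0] * (max_retries + 1)
    history = []
    for _ in range(ticks):
        # Attempt 0 is fresh traffic; attempts 1..max_retries are retries.
        attempts = [float(new_per_tick)] + pending[1:]
        offered = sum(attempts)
        # Anything above the limit is dropped, spread evenly over attempts.
        drop_fraction = max(0.0, 1.0 - limit / offered) if offered > 0 else 0.0
        next_pending = [0.0] * (max_retries + 1)
        for k, count in enumerate(attempts):
            if k < max_retries:
                # Dropped queries come back as retries on the next tick.
                next_pending[k + 1] = count * drop_fraction
        pending = next_pending
        history.append((offered, drop_fraction))
    return history


if __name__ == "__main__":
    for tick, (offered, dropped) in enumerate(simulate_retry_storm(110, 100, 2, 6)):
        print(f"tick {tick}: offered={offered:.1f} drop_fraction={dropped:.3f}")
```

In this model, fresh traffic that sits only slightly above the limit produces an offered load that keeps climbing above the fresh-traffic rate, because each dropped query returns as an additional connection on the next tick. Traffic below the limit produces no drops and no retries.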
During the incident, Atlassian’s DNS infrastructure in the AWS us-east-1 region was unable to service up to 90% of DNS resolution requests.
We were not aware of how close our infrastructure was to the security group connection tracking limit because utilization of this allowance is not currently observable.
Troubleshooting took an extended period of time because EC2 network allowance packet drops (as a result of encountering this limit) were not actively monitored by Atlassian.
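On Linux instances that use the AWS Elastic Network Adapter (ENA) driver, packet drops caused by network allowances are exposed as counters in the output of `ethtool -S <interface>`, including `conntrack_allowance_exceeded`. The following is a minimal sketch of how such counters could be watched. It is not a description of Atlassian's monitoring; the interface name `eth0`, the helper names, and the sample values are assumptions made for illustration.

```python
import subprocess

ALLOWANCE_SUFFIX = "_allowance_exceeded"

# Illustrative sample in the "name: value" layout printed by `ethtool -S`.
# The numbers are made up for this example.
SAMPLE_OUTPUT = """NIC statistics:
     tx_timeout: 0
     bw_in_allowance_exceeded: 0
     bw_out_allowance_exceeded: 0
     pps_allowance_exceeded: 3
     conntrack_allowance_exceeded: 1200
"""


def parse_allowance_counters(ethtool_output):
    """Return {counter_name: value} for every *_allowance_exceeded counter."""
    counters = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.strip().partition(":")
        if sep and name.endswith(ALLOWANCE_SUFFIX):
            counters[name] = int(value)
    return counters


def counters_that_increased(previous, current):
    """Return the growth of each counter that rose between two samples."""
    return {
        name: current[name] - previous.get(name, 0)
        for name in current
        if current[name] > previous.get(name, 0)
    }


def read_ethtool_stats(interface="eth0"):
    """Run `ethtool -S` on the given interface (name is an assumption)."""
    result = subprocess.run(
        ["ethtool", "-S", interface],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

A monitoring loop built on this sketch would call `read_ethtool_stats` periodically, parse each sample, and alert when `counters_that_increased` reports growth in `conntrack_allowance_exceeded`. The counters are cumulative, so the difference between samples matters more than the absolute value.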
Atlassian acknowledges that outages like this one impact the productivity and business of our customers. Since this incident:
Again, we apologize to those customers whose services were impacted during this incident.
Atlassian Customer Support