A single point of failure triggered the Amazon outage affecting millions

October 24, 2025

A DNS manager in a single region of Amazon’s sprawling network touched off a 16-hour debacle.

The outage that hit Amazon Web Services and took out vital services worldwide was the result of a single failure that cascaded from system to system within Amazon’s sprawling network, according to a post-mortem from company engineers.

The series of failures lasted for 15 hours and 32 minutes, Amazon said. Network intelligence company Ookla said its Downdetector service received more than 17 million reports of disrupted services offered by 3,500 organizations. The three countries where the most reports originated were the US, the UK, and Germany. Snapchat, AWS, and Roblox were the most reported services affected. Ookla said the event was “among the largest internet outages on record for Downdetector.”

It’s always DNS

Amazon said the root cause of the outage was a bug in the software running the DynamoDB DNS management system. The system monitors the stability of load balancers by, among other things, periodically creating new DNS configurations for endpoints within the AWS network. The bug was a race condition, an error that makes a process dependent on the timing or sequence of events that are variable and outside the developers’ control. The result can be unexpected behavior and potentially harmful failures.

In this case, the race condition resided in the DNS Enactor, a DynamoDB component that constantly updates domain lookup tables in individual AWS endpoints to optimize load balancing as conditions change. As the enactor operated, it “experienced unusually high delays needing to retry its update on several of the DNS endpoints.” While the enactor was playing catch-up, a second DynamoDB component, the DNS Planner, continued to generate new plans. Then, a separate DNS Enactor began to implement them.

The timing of these two enactors triggered the race condition, which ended up taking out DynamoDB’s entire regional endpoint. As Amazon engineers explained:

When the second Enactor (applying the newest plan) completed its endpoint updates, it then invoked the plan clean-up process, which identifies plans that are significantly older than the one it just applied and deletes them. At the same time that this clean-up process was invoked, the first Enactor (which had been unusually delayed) applied its much older plan to the regional DDB endpoint, overwriting the newer plan. The check that was made at the start of the plan application process, which ensures that the plan is newer than the previously applied plan, was stale by this time due to the unusually high delays in Enactor processing. Therefore, this did not prevent the older plan from overwriting the newer plan. The second Enactor’s clean-up process then deleted this older plan because it was many generations older than the plan it had just applied. As this plan was deleted, all IP addresses for the regional endpoint were immediately removed. Additionally, because the active plan was deleted, the system was left in an inconsistent state that prevented subsequent plan updates from being applied by any DNS Enactors. This situation ultimately required manual operator intervention to correct.
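The sequence the engineers describe is a classic check-then-act race: a freshness check passes, time elapses, and the write lands after the state it was checked against has changed. The following is a minimal, deterministic sketch of that ordering in Python. It is not Amazon's code. The `Endpoint` class, the generation numbers, the `keep` threshold, and the `192.0.2.x` addresses are all assumptions made for illustration, and the guarded variant is one common mitigation, not the fix Amazon has announced.

```python
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    """Toy model of one regional DNS endpoint (hypothetical, not AWS code)."""
    applied: int                               # generation of the plan in effect
    plans: dict = field(default_factory=dict)  # generation -> list of IP addresses

    def ips(self):
        # A plan that has been deleted resolves to no addresses at all.
        return self.plans.get(self.applied, [])


def clean_up(ep, newest, keep=2):
    """Delete every stored plan more than `keep` generations older than `newest`."""
    for gen in [g for g in ep.plans if g < newest - keep]:
        del ep.plans[gen]


def apply_if_newer(ep, gen):
    """Guarded write: re-check freshness at the moment of applying."""
    if gen > ep.applied:
        ep.applied = gen
        return True
    return False


def run(guarded):
    ep = Endpoint(applied=1, plans={g: [f"192.0.2.{g}"] for g in range(1, 11)})
    slow_check_passed = 2 > ep.applied   # slow Enactor checks plan 2, then stalls
    apply_if_newer(ep, 10)               # fast Enactor applies the newest plan
    if guarded:
        apply_if_newer(ep, 2)            # re-check fails, so plan 10 stays active
    elif slow_check_passed:
        ep.applied = 2                   # stale check: old plan overwrites new one
    clean_up(ep, newest=10)              # fast Enactor deletes plans 1 through 7
    return ep
```

Calling `run(guarded=False)` leaves the endpoint pointing at plan 2, which the clean-up has just deleted, so `ips()` returns an empty list. Calling `run(guarded=True)` keeps plan 10 in place. In a genuinely concurrent system the re-check and the write would also have to happen atomically, for example under a lock or as a conditional write, which this single-threaded sketch does not model.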

The failure caused systems that relied on DynamoDB’s US-East-1 regional endpoint to experience errors that prevented them from connecting. Both customer traffic and internal AWS services were affected.

The damage resulting from the DynamoDB failure then put a strain on Amazon’s EC2 services located in the US-East-1 region. The strain persisted even after DynamoDB was restored, as EC2 in this region worked through a “significant backlog of network state propagations needed to be processed.” The engineers went on to say: “While new EC2 instances could be launched successfully, they would not have the necessary network connectivity due to the delays in network state propagation.”

In turn, the delay in network state propagations spilled over to a network load balancer that AWS services rely on for stability. As a result, AWS customers experienced connection errors from the US-East-1 region. AWS network functions affected included the creation and modification of Redshift clusters, Lambda invocations, and Fargate task launches such as Managed Workflows for Apache Airflow, Outposts lifecycle operations, and the AWS Support Center.

For the time being, Amazon has disabled the DynamoDB DNS Planner and the DNS Enactor automation worldwide while it works to fix the race condition and add protections to prevent the application of incorrect DNS plans. Engineers are also making changes to EC2 and its network load balancer.

A cautionary tale

Ookla outlined a contributing factor not mentioned by Amazon: a concentration of customers who route their connectivity through the US-East-1 endpoint and an inability to route around the region. Ookla explained:

The affected US‑EAST‑1 is AWS’s oldest and most heavily used hub. Regional concentration means even global apps often anchor identity, state or metadata flows there. When a regional dependency fails as was the case in this event, impacts propagate worldwide because many “global” stacks route through Virginia at some point.

Modern apps chain together managed services like storage, queues, and serverless functions. If DNS cannot reliably resolve a critical endpoint (for example, the DynamoDB API involved here), errors cascade through upstream APIs and cause visible failures in apps users do not associate with AWS. That is precisely what Downdetector recorded across Snapchat, Roblox, Signal, Ring, HMRC, and others.
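The cascade Ookla describes begins when a client cannot resolve one hostname and has nowhere else to go. As a rough sketch of what routing around a region can look like at the client, the Python below tries an ordered list of endpoints and uses the first one that resolves. The function name and the `.example` hostnames are hypothetical, and the resolver is passed in as a parameter so the behavior can be exercised without touching the network.

```python
import socket


def resolve_first_available(hostnames, resolver=socket.getaddrinfo, port=443):
    """Return (hostname, addresses) for the first hostname that resolves.

    Hypothetical helper: tries each candidate in order and skips any
    whose DNS lookup fails or returns no records.
    """
    for host in hostnames:
        try:
            infos = resolver(host, port)
        except OSError:          # socket.gaierror is a subclass of OSError
            continue
        # Each result is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the IP address.
        addresses = sorted({info[4][0] for info in infos})
        if addresses:
            return host, addresses
    raise RuntimeError("no candidate endpoint could be resolved")


def fake_resolver(host, port):
    """Stand-in resolver that simulates an empty DNS record for one region."""
    if host == "db.us-east-1.example":
        raise socket.gaierror("no address associated with hostname")
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", ("192.0.2.10", port))]


chosen = resolve_first_available(
    ["db.us-east-1.example", "db.us-west-2.example"],
    resolver=fake_resolver,
)
```

Here `chosen` names the second region because the first lookup fails. A fallback like this only helps if the second region can actually serve the request, meaning the data and any identity or metadata dependencies are replicated there as well.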

The event serves as a cautionary tale for all cloud services: More important than preventing race conditions and similar bugs is eliminating single points of failure in network design.

“The way forward,” Ookla said, “is not zero failure but contained failure, achieved through multi-region designs, dependency diversity, and disciplined incident readiness, with regulatory oversight that moves toward treating the cloud as systemic components of national and economic resilience.”

 
