A single DNS race condition brought AWS to its knees
October 23, 2025
Amazon has published a detailed postmortem explaining how a critical fault in DynamoDB’s DNS management system cascaded into a day-long outage that disrupted major websites and services across multiple brands – with damage estimates potentially reaching hundreds of billions of dollars.
The incident began at 11:48 PM PDT on October 19 (6:48 AM UTC on October 20), when customers reported increased DynamoDB API error rates in the Northern Virginia (US-EAST-1) Region. The root cause was a race condition in DynamoDB’s automated DNS management system that left an empty DNS record for the service’s regional endpoint.
The DNS management system comprises two independent components (for availability reasons): a DNS Planner that monitors load balancer health and creates DNS plans, and a DNS Enactor that applies changes via Amazon Route 53.
Amazon’s postmortem says the error rate was triggered by “a latent defect” within the service’s automated DNS management system.
The race condition occurred when one DNS Enactor experienced “unusually high delays” while the DNS Planner continued generating new plans. A second DNS Enactor began applying the newer plans and executed a clean-up process just as the first, delayed Enactor completed its run. The clean-up deleted the older plan as stale, immediately removing all IP addresses for the regional endpoint and leaving the system in an inconsistent state that blocked further automated updates by any DNS Enactor.
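To make the sequencing concrete, here is a minimal Python sketch of that interleaving. It is an illustration only – the names (Plan, DnsState, apply_plan, clean_up) and the generation numbers are invented, not AWS's actual code – but it shows how a delayed Enactor overwriting a newer plan, followed by a clean-up that treats the overwritten plan as stale, can leave an endpoint with no addresses at all.

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    generation: int      # monotonically increasing plan number from the Planner
    ips: list[str]       # endpoint IP addresses this plan would publish

@dataclass
class DnsState:
    active: Plan | None = None                              # plan currently serving the endpoint
    stored: dict[int, Plan] = field(default_factory=dict)   # older plans kept around

def apply_plan(state: DnsState, plan: Plan) -> None:
    # An Enactor publishes whatever plan it is holding; a sufficiently delayed
    # Enactor can therefore overwrite a newer plan with an older one.
    state.active = plan
    state.stored[plan.generation] = plan

def clean_up(state: DnsState, newest_applied: int, keep: int = 1) -> None:
    # The clean-up assumes anything much older than the newest applied plan is stale.
    for gen in list(state.stored):
        if gen < newest_applied - keep:
            del state.stored[gen]
            if state.active and state.active.generation == gen:
                # The "stale" plan was actually the active one, so deleting it
                # leaves the endpoint with no IP addresses at all.
                state.active = None

state = DnsState()
old_plan = Plan(generation=1, ips=["10.0.0.1", "10.0.0.2"])
new_plan = Plan(generation=5, ips=["10.0.0.3", "10.0.0.4"])

apply_plan(state, new_plan)          # the fast Enactor applies the latest plan
apply_plan(state, old_plan)          # the delayed Enactor finally finishes, clobbering it
clean_up(state, newest_applied=5)    # clean-up deletes "stale" plans, including the active one

print(state.active)                  # None – the regional endpoint's DNS record is now empty
```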
Until operators intervened manually, anything resolving the DynamoDB endpoint – customer traffic and internal AWS services alike – experienced DNS failures. This impacted EC2 instance launches and network configuration, the postmortem says.
The DropletWorkflow Manager (DWFM), which maintains leases for physical servers hosting EC2 instances, depends on DynamoDB. When DNS failures caused DWFM state checks to fail, droplets – the EC2 servers – couldn’t establish new leases for instance state changes.
After DynamoDB recovered at 2:25 AM PDT (9:25 AM UTC), DWFM attempted to re-establish leases across the entire EC2 fleet. At that scale the process took so long that leases began timing out before it could finish, tipping DWFM into “congestive collapse” that required manual intervention and wasn’t resolved until 5:28 AM PDT (12:28 PM UTC).
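A rough way to see why the lease backlog never drained on its own: if the fleet needing fresh leases is larger than the renewal rate multiplied by the lease timeout, leases expire and rejoin the queue as fast as they are cleared. The Python model below is purely illustrative – the fleet size, renewal rate, and timeout are invented numbers, not AWS's – but it reproduces the stuck-backlog behaviour.

```python
from collections import deque

def leases_still_pending(fleet_size: int, renewals_per_sec: int,
                         lease_timeout_sec: int, sim_seconds: int) -> int:
    pending = fleet_size          # every droplet needs a fresh lease after recovery
    live = deque()                # (renewal_time, count) batches of currently live leases
    for t in range(sim_seconds):
        # renew as many leases as capacity allows this second
        done = min(renewals_per_sec, pending)
        pending -= done
        if done:
            live.append((t, done))
        # leases renewed lease_timeout_sec ago expire and rejoin the backlog
        while live and t - live[0][0] >= lease_timeout_sec:
            pending += live.popleft()[1]
    return pending

# When fleet_size exceeds renewals_per_sec * lease_timeout_sec, the backlog never drains:
print(leases_still_pending(fleet_size=500_000, renewals_per_sec=1_000,
                           lease_timeout_sec=300, sim_seconds=3_600))
```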
Next, Network Manager began propagating a huge backlog of delayed network configurations, causing newly launched EC2 instances to experience network configuration delays.
These network propagation delays affected the Network Load Balancer (NLB) service. NLB’s health checking subsystem removed new EC2 instances that failed health checks due to network delays, only to restore them when subsequent checks succeeded.
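The resulting flapping looks something like the toy model below – again a hypothetical sketch, with an invented success probability standing in for delayed network propagation rather than the real NLB health-checking code.

```python
import random

def health_check(network_ready_probability: float) -> bool:
    # While network configuration is still propagating, checks pass only intermittently.
    return random.random() < network_ready_probability

def run_checks(rounds: int, network_ready_probability: float) -> list[str]:
    in_service = True
    events = []
    for n in range(rounds):
        healthy = health_check(network_ready_probability)
        if in_service and not healthy:
            in_service = False
            events.append(f"check {n}: target removed from service")
        elif not in_service and healthy:
            in_service = True
            events.append(f"check {n}: target restored to service")
    return events

random.seed(1)
for event in run_checks(rounds=20, network_ready_probability=0.5):
    print(event)
```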
With EC2 instance launches impaired, dependent services including Lambda, Elastic Container Service (ECS), Elastic Kubernetes Service (EKS), and Fargate all experienced issues.
AWS has disabled the DynamoDB DNS Planner and DNS Enactor automation worldwide until safeguards are in place to prevent the race condition from recurring.
In its apology, Amazon stated: “As we continue to work through the details of this event across all AWS services, we will look for additional ways to avoid impact from a similar event in the future, and how to further reduce time to recovery.”
The prolonged outage disrupted websites and services throughout the day, including government services. Some estimates suggest the resulting chaos and damage may yet reach hundreds of billions of dollars. ®