AWS outage: Myths vs reality

October 27, 2025

Column AWS put out a hefty analysis of its October 20 outage, and it’s apparently written in a continuing stream of consciousness before the Red Bull wore off and the author passed out after 36 straight hours of writing.

I’m serious here. It’s to the point where if I included a paragraph half as long as some of these, El Reg’s editor would hit me with his belt. Go read the thing but be sure to pack a lunch – it’s rather dense.

I want to call out a few points that I’ve seen folks making. Some of these are driven by simple ignorance, others by a form of malicious salesmanship or a desire for attention. I too desire attention, but you’ve gotta earn it, so let’s see what we can do here.

No, this was not caused by AI

First, “automated systems” are how everything of even small scale is run in technology. Once upon a time, a valid question was “how many servers can a sysadmin handle.” Today the answer is either “all of them,” or else you’re doing it wrong.

“When thing X happens, do thing Y” is how computers basically work, and it’s about as much “AI” as it is a breakfast cereal. In fact, you’ll note that the incident analysis is the first thing that Amazon has put out in years that didn’t even mention Generative AI, which is a veritable breath of fresh air.
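
For the avoidance of doubt, here is roughly what “when thing X happens, do thing Y” looks like in practice. This is a minimal sketch, not anything AWS runs: the rule names, the metric keys (`disk_used_pct`, `healthy`), the threshold, and the action strings are all made up for illustration.

```python
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    name: str                          # hypothetical label for the rule
    condition: Callable[[dict], bool]  # "when thing X happens"
    action: str                        # "do thing Y"


# Assumed example rules; the threshold and action names are illustrative only.
RULES = [
    Rule("disk_full", lambda m: m["disk_used_pct"] >= 90, "rotate_logs"),
    Rule("host_down", lambda m: not m["healthy"], "replace_host"),
]


def decide(metrics: dict) -> list[str]:
    """Return the actions whose conditions match the current metrics."""
    return [rule.action for rule in RULES if rule.condition(metrics)]


actions = decide({"disk_used_pct": 95, "healthy": True})
```

No model, no training data, no inference: a list of conditions and a list of things to do about them.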

Multi-cloud is for the rubes

“We can’t handle this type of outage so we’re going to expand to multiple cloud providers” is the rallying cry of optimistic fools.

A single AWS region is a single point of failure. With a lot of dedicated work, you can add another cloud provider or AWS region or datacenter into the mix until finally, at tremendous effort and expense, you have added a second single point of failure. Good work; now you’re subject to your existing issues with us-east-1, but you’ve added in your workload having problems whenever Azure hiccups too.
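
The arithmetic behind that “second single point of failure” can be sketched in a few lines. The availability figures below are invented round numbers, not measurements of any provider, and the math assumes the two providers fail independently, which is generous.

```python
def serial_availability(*parts: float) -> float:
    """Availability when the workload needs every dependency up at once."""
    result = 1.0
    for p in parts:
        result *= p
    return result


def parallel_availability(*parts: float) -> float:
    """Availability when any single dependency is enough to keep serving."""
    all_down = 1.0
    for p in parts:
        all_down *= 1.0 - p
    return 1.0 - all_down


# Assumed, illustrative figures; failures are treated as independent.
provider_a = 0.999
provider_b = 0.999

both_required = serial_availability(provider_a, provider_b)      # roughly 0.998
either_suffices = parallel_availability(provider_a, provider_b)  # roughly 0.999999
```

Under these assumptions the second number is the one people are paying for. They only get it if failover between providers actually works; if the workload quietly needs both, they get the first, which is worse than staying put.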

Along the way you’ve added incongruent building blocks that require you to either lowest-common-denominator your way toward building them at parity, or else you’re running an amalgamation of nightmares that might work in theory (“theory” being the name of my staging environment) but is unlikely to work in production.

The long tail is quite short

If you’ve ever fixed a family member’s computer, you’re well acquainted with the “last person to touch it owns it” effect. If you helped your cousin fix their Wi-Fi network two years ago, their printer breaking this week is clearly because of something you did.

So, to be clear, no, this issue did not have security implications for your account, no, it’s not why your app isn’t working this week, and no, DynamoDB did not break into your house while you were passed out intoxicated and drunk-text all your exes.

There were availability issues in us-east-1 last Monday, and some of your vendors were still remediating it after the official incident was over, but that was measured in hours instead of days. If you’re still having issues a week later, open a ticket with the relevant parties, because it’s not this incident anymore. My apologies if this eviscerates a handy excuse you were hoping to use to coast through the holiday season.

Coulda woulda shoulda

Any jackass can look at a painstakingly built analysis like this and say what AWS should have done instead, and indeed in comment threads around the internet many jackasses have done precisely this. Everything at scale is complex, and things that are blatantly obvious in hindsight (“maybe we shouldn’t let systems automatically remove some extremely critical records, regardless of what they think they’re accomplishing”) are only obvious because a vanishingly rare event occurred.

Those types of checks flourish everywhere, but the trouble with enumerating things that shouldn’t happen is that the list is literally infinite; it’s a reactive mindset and you’ll forever spend your days playing Whac-A-Mole if that’s how you choose to approach it.

It’s always DNS

Well, no kidding; DNS is effectively the phone book that translates names to numbers and back again for the entire internet. Without it, an awful lot of systems have no clue where other systems live, and a computer without network access is increasingly an expensive space heater. How many phone numbers do you know by heart if your mobile eats your contact list?

Widespread issues trend toward DNS because it is just that important to how the entirety of modern computing works today. I joke about it being a database, but it’s absolutely a key-value store. Even various service meesh (that’s how I pluralize “mesh” and I refuse to change it for the likes of you) are jumped-up DNS servers with delusions of grandeur.
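
To make the key-value point concrete, here is a toy resolver. It is a sketch with invented records, not how a real DNS server or anything at AWS is built: the hostnames live under example.com, the addresses come from the 192.0.2.0/24 documentation range, and `resolve`, `reverse_lookup`, and `remove_record` are hypothetical helpers.

```python
from typing import Optional

# Invented records for illustration only.
FORWARD = {
    "api.example.com": "192.0.2.10",
    "db.example.com": "192.0.2.20",
}
# "And back again": the reverse map from address to name.
REVERSE = {addr: name for name, addr in FORWARD.items()}


def _normalize(name: str) -> str:
    """Lowercase the name and drop a trailing dot."""
    return name.lower().rstrip(".")


def resolve(name: str) -> Optional[str]:
    """Look up the address for a name, or None if there is no record."""
    return FORWARD.get(_normalize(name))


def reverse_lookup(addr: str) -> Optional[str]:
    """Look up the name for an address, or None if there is no record."""
    return REVERSE.get(addr)


def remove_record(name: str) -> None:
    """Delete a record; every caller asking for it now gets nothing back."""
    addr = FORWARD.pop(_normalize(name), None)
    if addr is not None:
        REVERSE.pop(addr, None)


before = resolve("db.example.com")
remove_record("db.example.com")
after = resolve("db.example.com")
```

The database is perfectly healthy in this picture; it is the key that has gone missing, and everything that depended on that lookup is now staring at the wall.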

So yes, “it’s always DNS” is a half-step away from “this outage is caused by computers.” This will not be the last time DNS breaks something important and expensive, like a hyperactive golden retriever in a physical therapy clinic.

This isn’t common

“us-east-1 is a tire fire of sadness and regret” used to resonate; it doesn’t really hit the same way anymore. You’d be hard-pressed to identify any environment with more robust uptime than AWS has collectively. As of this writing, they have never had a global outage, due to their (sometimes infuriating) regional isolation. Google has a shared control plane that has led to several rolling global outages, and Microsoft is convinced that “uptime” should be two words, probably in someone else’s account.

The world is more interconnected today than it ever has been, while that interconnection has become increasingly centralized. AWS is orders of magnitude better at systems reliability than you or I are, but because their outage blast radius is “a significant part of the global economy” their blips are felt far more heavily, and all at the same time. They employ some of the best engineers on the planet to think about these problems at a scale that few of us can really contextualize, and the sheer fact that an outage like this is as newsworthy as it clearly has become is a testament to their excellent work.

But it’s still fun to poke the $2.5 trillion company in the eye with a sharp stick every now and then. ®
