Q&A: Why did the internet ‘cloud’ burst, and will it happen again?

November 5, 2025

On Oct. 20, a routine update at Amazon Web Services, one of the world’s largest providers of virtual computer services, exposed an existing software flaw within the systems and temporarily broke much of the internet, disrupting consumer access to banks, social media, shopping sites and even dating apps.

Services were down for the better part of a day, with some sites taking longer to recover.

To find out how and why the outage occurred, and what users can do to avoid future outages, UVA Today reached out to Neal Magee, the faculty director of systems architecture and an associate professor at the University of Virginia School of Data Science.

Q. What is the cloud, and why is it there?

A. The word “cloud” can mean different things to different people. For my own work, which relates to cloud services, the public cloud means providers like AWS, Google, Microsoft and others. They provide computing, storage and other services on demand for a fee.

So instead of buying a $40,000 server, waiting for it to be built and delivered, and then installing it, I can spin up a server in AWS within seconds with no long-term commitment. I could run it for five minutes or five months.

What’s great about that model is that you trade capital investments in infrastructure for ongoing, operational expenses of what you use. And you’re not stuck with servers the way you created them, either. You can resize to have more or less memory, CPU, GPU or storage. You can create computing resources that only run for five minutes every day, and you don’t pay for anything more than that.

AWS was created partly out of the need Jeff Bezos had to keep Amazon.com up and running when it experienced heavy user traffic. Back in the old days, Amazon sold mostly books, and Christmas shopping would cripple the site in November and December. So, they set up massive data centers around the globe in “regions,” with thousands of large, robust physical servers.

Q. Why did an outage at one company crash the internet for so many businesses, services and people?

A. From the start, AWS taught a series of design principles for how to build in the cloud, and one of the most fundamental is that you should always expect failure. Sometimes the power goes out. Sometimes your internet drops out or your air conditioning stops working. Those same things can affect data centers and the services that run in them.

It appears the event a couple of weeks ago involved a very low-level service called DNS, the Domain Name System, which translates names into network addresses. Many other AWS services rely on it, and the consumer services built on top of them rely on it in turn. It failed, causing a “cascading outage,” where one system fails and then other systems that depend upon that first system fail as well.
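To make the idea of a cascading outage concrete, here is a minimal Python sketch. The service names and the dependency map are hypothetical and do not describe AWS's real architecture; the sketch only assumes that a service goes down when anything it depends on goes down.

```python
from collections import deque

# Hypothetical dependency map: each service lists what it directly depends on.
DEPENDS_ON = {
    "dns": [],
    "database": ["dns"],
    "load_balancer": ["dns"],
    "bank_app": ["database", "load_balancer"],
    "dating_app": ["database"],
    "static_site": [],
}


def cascading_failures(depends_on, failed):
    """Return the set of services that go down when `failed` goes down."""
    # Invert the map so we can ask "who depends on this service?"
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(service)

    # Breadth-first walk outward from the failed service.
    down = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents[current]:
            if service not in down:
                down.add(service)
                queue.append(service)
    return down


print(sorted(cascading_failures(DEPENDS_ON, "dns")))
```

In this toy map, a failure in "dns" takes down every service except "static_site", while a failure in "static_site" affects nothing else. The lower a service sits in the stack, the wider the damage when it fails.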

Q. How can companies prevent or lessen the impact of such outages?

A. The way to mitigate this is to build your solutions anticipating failure at any level. AWS has the concept of regions, and each region is made up of sub-regions, or “availability zones.” US-East-1 is Amazon’s Eastern region; it’s the oldest and largest region, with six availability zones in it. Each availability zone is made up of more than one distinct data center, so you’re talking about a massive array of computing infrastructure.
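One way to picture "anticipating failure at any level" is a client that tries several copies of a service in turn. The Python sketch below is illustrative only: the endpoint labels, the `healthy` and `unreachable` stand-in functions, and `call_with_failover` are all hypothetical, and no real AWS API is called.

```python
class AllEndpointsFailed(Exception):
    """Raised when every endpoint in the list has been tried and failed."""


def call_with_failover(endpoints, request):
    """Try each (name, handler) pair in order; return the first success."""
    errors = []
    for name, handler in endpoints:
        try:
            return name, handler(request)
        except ConnectionError as exc:
            # Record the failure and move on to the next zone or region.
            errors.append(f"{name}: {exc}")
    raise AllEndpointsFailed("; ".join(errors))


# Hypothetical stand-ins for real network calls.
def healthy(request):
    return f"ok: {request}"


def unreachable(request):
    raise ConnectionError("endpoint unreachable")


# Two zones in one region are down; a copy in a second region still works.
ENDPOINTS = [
    ("us-east-1 / zone a", unreachable),
    ("us-east-1 / zone b", unreachable),
    ("second-region / zone a", healthy),
]

print(call_with_failover(ENDPOINTS, "GET /balance"))
```

The design choice here is simply ordering: nearby zones are tried first, and a copy in another region is the fallback. A real deployment would also need the data behind each endpoint to be kept in sync, which this sketch does not attempt to show.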
