AWS’ 15-Hour Outage: 5 Big AI, DNS, EC2 And Data Center Keys To Know

From the cause and resolution of AWS’ massive outage to whether AI played a role, here are five big things you need to know about the AWS global outage that affected millions.

AWS’ massive 15-hour-long global outage Monday hit millions of people and businesses, affecting everything from payment services and financial trading applications to social media websites and commercial software.

“Things like this could happen to any public cloud provider, any private cloud provider—AWS, Microsoft, ourselves included with our own cloud platform,” said Robert Keblusek, chief innovation and technology officer at Downers Grove, Ill.-based Sentinel Technologies, a security firm and AWS partner.

“This is technology. It can be affected by human error. It can be affected by equipment failure. No matter how many safeguards you put in place, these things can happen,” Keblusek said.

[Related: Cloud Outages Will Increase ‘More And More’ Due To AI Usage After AWS Outage Rocks Over 1,000 Companies, Says Tech CEO]

The root cause of AWS’ outage was a Domain Name System (DNS) error that prevented applications from resolving the correct network address for DynamoDB. AWS’ DynamoDB is a cloud database service that stores user information and other critical data.
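In practice, a failure like this shows up before any database call is made: the client simply cannot translate the service’s hostname into an IP address. Here is a minimal Python sketch of that first step, using DynamoDB’s public regional endpoint for US-East-1 (the sketch is illustrative, not part of AWS’ tooling):

```python
import socket

# DynamoDB's public regional endpoint in US-East-1. Every API call begins
# with resolving this hostname to an IP address via DNS.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

try:
    results = socket.getaddrinfo(ENDPOINT, 443, proto=socket.IPPROTO_TCP)
    for *_, sockaddr in results:
        print(f"{ENDPOINT} -> {sockaddr[0]}")
except socket.gaierror as err:
    # During the outage, resolution failed at this stage, so applications
    # could not reach DynamoDB regardless of the database's own health.
    print(f"DNS resolution failed: {err}")
```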

More than 4 million users reported issues related to the AWS outage on Downdetector, according to the site’s owner, Ookla. The outage reportedly affected more than 1,000 companies.

By 6:01 p.m. ET Monday, AWS said all of its services had returned to normal, although some would continue working through backlogs for the next several hours.

CRN breaks down the exact timeline of the AWS outage, what went wrong, details about the AWS data center site where the error originated, and whether an influx of AI technology could have played a role in the issue.

“AWS handled this very professionally,” Keblusek said.

Here’s what every AWS partner, customer and user needs to know about AWS’ recent cloud outage.


Was This A Cyberattack?

AWS has said the issue was not caused by a cyberattack or related to security.

The root cause of the outage was the technical DNS issue.

“We identified the trigger of the event as DNS resolution issues for the regional DynamoDB service endpoints,” said Amazon via its AWS Health Dashboard.

The DNS issue triggered a cascading effect across AWS’ vast services portfolio, which millions of users rely on every day through websites, digital products and applications.

“There’s no sign that this AWS outage was caused by a cyberattack,” said Robert Jardin, chief digital officer at cybersecurity firm NymVPN, in a statement to CRN. “These issues can happen when systems become overloaded or a key part of the network goes down, and because so many websites and apps rely on AWS, the impact spreads quickly.”

The cloud giant has not reported any evidence of external interference.

However, AWS has not said exactly what triggered the initial DNS issue.

What And Where Is AWS’ US-East-1 Site?

The AWS issue that caused ripple effects worldwide began inside the company’s US-East-1 data center site.

AWS’ US-East-1, located in Northern Virginia, is the company’s oldest and largest site for its web services.

Northern Virginia is widely known as the world’s central data center hub: the area is home to a large number of data centers from various vendors, thanks to its historical and strategic location near Washington, D.C.

The US-East-1 data center site suffered outages in 2020 and 2021, but those incidents were not nearly as impactful as this week’s outage.

A DNS error inside a data center can stem from a maintenance issue, a server failure or human error.

“At the end of the day, a DNS issue can happen to anyone,” said Keblusek. “It’s happened before and very likely to happen again. … These outages that we’re looking at are few and far between. AWS runs really reliable infrastructure.”

Could AI Be A Hidden Culprit?

Amazon is one of the largest spenders on AI in the world, pouring tens of billions of dollars each year into new AI infrastructure, AI-focused data centers and AI services.

Bob Venero, CEO of Fort Lauderdale, Fla.-based Future Tech Enterprise, said outages will increase going forward as AI usage grows among users and businesses.

“There are going to be more and more of them,” Venero said. “They are just going to continue to increase, especially as we see more AI capabilities being introduced into the enterprise.”

Sentinel Technologies’ CTO Keblusek said, “AI is accelerating changes in cloud infrastructure.

“As AI infrastructures get built out faster and then more and more load gets put upon them, certainly that could potentially increase the possibilities of things like this happening,” Keblusek said. “There’s a rapid amount of change going on in those data centers right now, and there’s a massive amount of workloads and requests going to those data centers right now.”

However, Keblusek said the issue may simply have been human error.

“Could AI be the cause? Could it be the additional traffic? Could it be human error? I don’t have that answer,” he said.

AWS said the company “will share a detailed AWS post-event summary” on the exact cause of the outage in the near future.


AWS’ 15-Hour Outage Timeline Explained

The AWS issue started at 2:49 a.m. ET on Oct. 20.

By 6:01 p.m. ET, AWS said all AWS services had returned to normal, although some services would still be working through issues for the next few hours.

This means the outage lasted roughly 15 hours on Monday.

“Between 11:49 PM PDT on October 19 and 2:24 AM PDT on October 20, we experienced increased error rates and latencies for AWS Services in the US-EAST-1 Region. Additionally, services or features that rely on US-EAST-1 endpoints such as IAM and DynamoDB Global Tables also experienced issues during this time,” said Amazon via its AWS Health Dashboard.

In the early morning hours of Oct. 20, AWS resolved the DynamoDB DNS issue.

However, issues then began occurring in the “internal subsystem of EC2 that is responsible for launching EC2 instances due to its dependency on DynamoDB,” AWS said.

In addition, Network Load Balancer health checks also “became impaired, resulting in network connectivity issues in multiple services” including Lambda, DynamoDB and CloudWatch.
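AWS has not published client-side guidance tied to this event, but the failure mode it describes, elevated error rates and latencies rather than a clean outage, is the case that client timeout and retry settings exist to absorb. A hedged sketch of a defensively configured boto3 DynamoDB client follows; the timeout values are illustrative choices, not AWS recommendations:

```python
import boto3
from botocore.config import Config

# During a partial impairment, calls tend to fail slowly rather than cleanly.
# Bounded timeouts plus adaptive retries keep a client from piling requests
# onto an already-struggling endpoint.
defensive = Config(
    region_name="us-east-1",
    connect_timeout=5,   # seconds; illustrative, not an AWS recommendation
    read_timeout=10,
    retries={"max_attempts": 10, "mode": "adaptive"},
)

dynamodb = boto3.client("dynamodb", config=defensive)
```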

“By 3:01 p.m. PT, all AWS services returned to normal operations,” AWS said.

Some services such as AWS Config, Redshift, and Connect would continue to have a backlog of messages that AWS would finish processing over the next few hours, according to the AWS Health Dashboard.

Former AWS Exec Weighs In; Amazon Stock Rises

Corey Beck is a former AWS Industries senior solutions architect who left AWS this year to become director of cloud technologies for DataStrike.

“Networking is certainly a foundational component of AWS services. When it stumbles in a region like US-East-1, the effects go way beyond; it ripples through EC2, S3, DynamoDB, RDS, and pretty much every service that depends on them,” said Beck in an email to CRN.

“You have to design with failure in mind because it’s going to happen,” he said. “Resilient systems aren’t about avoiding failure, they’re about making sure your customers barely notice when it does.”

He said just moving workloads to the cloud isn’t enough.

“Real resilience takes planning, multi-region design, regular testing and a mindset that assumes things will break,” Beck said. “That’s what separates a minor hiccup from a full-blown outage.”
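As one illustration of the multi-region mindset Beck describes, a read path can fall through to a second region when the primary fails. A minimal sketch, assuming a hypothetical DynamoDB Global Table named “orders” replicated in both regions (table name, key schema and region choices are all assumptions for illustration):

```python
import boto3
from botocore.exceptions import BotoCoreError, ClientError

# Hypothetical table replicated via DynamoDB Global Tables, so either
# region holds a full copy of the data.
TABLE = "orders"
REGIONS = ["us-east-1", "us-west-2"]  # primary first, fallback second

def get_item(key):
    """Read from the first region that answers; fall through on failure."""
    for region in REGIONS:
        client = boto3.client("dynamodb", region_name=region)
        try:
            resp = client.get_item(TableName=TABLE, Key=key)
            return resp.get("Item")
        except (BotoCoreError, ClientError):
            continue  # regional trouble: try the next replica
    return None

item = get_item({"order_id": {"S": "12345"}})  # hypothetical key schema
```

Reads are the easy half; concurrent writes under Global Tables are reconciled on a last-writer-wins basis, which is one reason regular failover testing matters as much as the configuration itself.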

Notably, the broader market doesn’t seem to think Amazon will be hurt by the global outage.

Interestingly, Amazon stock (AMZN) wasn’t affected on Monday.

In fact, Amazon’s stock was up about 2 percent Tuesday, trading at around $221 per share.