
AWS has apologized for a massive outage last week that affected thousands of third-party online services and dozens of AWS services for hours. The outage stemmed from a capacity increase on Amazon’s Kinesis server fleet.
“We want to apologize for the impact this event caused for our customers,” said AWS. “We know how critical this service, and the other AWS services that were impacted, are to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.”
On Nov. 25, the public cloud titan added capacity to its front-end fleet of Kinesis servers without checking whether the operating system’s configuration allowed for it. The result was a significant outage, and Kinesis was not fully restored for approximately 17 hours. AWS services including API Gateway, Amplify, AppStream 2.0, Athena, CloudTrail, CloudWatch, Cognito, DynamoDB, EventBridge, IoT Services, Lambda, Lex, Managed Blockchain, S3, SageMaker and WorkSpaces were impacted.
Ethan Simmons, a managing partner for Pinnacle Technology Partners Inc., an AWS managed service provider with an impressive life sciences customer base, said none of his customers were impacted by the outage. A big reason for that is Pinnacle’s popular PeakPlus suite of secure managed and monitored AWS services and its adherence to AWS’ well-architected review standards.
Simmons, a 28-year IT veteran, said the outage highlights the need for well-architected AWS environments.
“It doesn’t matter whether it is on-premise or in the cloud, outages are going to happen, you always have to design for it,” said Simmons. “If you blindly think everything is going to function okay you are making a big mistake. You need to plan for it and have a partner that can help you architect the solution correctly. IT is complex and, if anything, it is getting more complex. Companies need a partner that can help them architect their environment.”
Key to a well-architected AWS environment is taking advantage of all of the robust AWS redundant services, said Simmons. “When we bring on net new customers, we use AWS’ well-architected framework to make sure they have from day one, the right high availability and redundancy in place as part of the design.”
Amazon Kinesis is used by developers to capture data and video streams in order to process them through AWS’ machine learning platforms. After AWS added capacity to the Kinesis servers in the early morning hours on Nov. 25, the front-end fleet began to exceed the maximum number of threads allowed by its operating system configuration, according to a recent AWS blog post. This knocked Kinesis offline in AWS’ US-EAST-1 region, and the failure spread to the AWS services that depend on it.
“The new capacity had caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration,” said AWS. “As this limit was being exceeded, cache construction was failing to complete and front-end servers were ending up with useless shard-maps that left them unable to route requests to back-end clusters.”
For the fleet’s thousands of front-end servers to communicate with one another, each server maintains “threads” to the other servers in the fleet. When servers are added, it can take hours for the existing servers to create these threads and recognize the new members. Once the number of threads exceeded the limit set in the OS configuration, the servers could no longer route requests to Kinesis back-end clusters.
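The arithmetic behind this failure mode can be sketched in a few lines of Python. The sketch assumes, purely for illustration, that each front-end server keeps one thread per peer plus a fixed baseline for other work. The function names, the baseline and every number below are hypothetical and are not taken from AWS.

```python
def threads_per_server(fleet_size: int, baseline_threads: int = 0) -> int:
    """Threads one server needs: one per peer, plus a hypothetical fixed baseline."""
    if fleet_size < 1:
        raise ValueError("fleet_size must be at least 1")
    return (fleet_size - 1) + baseline_threads


def capacity_add_is_safe(current_size: int, added: int,
                         os_thread_limit: int, baseline_threads: int = 0) -> bool:
    """Return True if every server stays at or under the OS thread limit
    after `added` servers join a fleet of `current_size` servers."""
    new_size = current_size + added
    return threads_per_server(new_size, baseline_threads) <= os_thread_limit


if __name__ == "__main__":
    # Illustrative numbers only; the real fleet size and limit are not public here.
    print(capacity_add_is_safe(9_000, 500, 10_000, baseline_threads=400))
    print(capacity_add_is_safe(9_000, 1_000, 10_000, baseline_threads=400))
```

Because the per-server thread count grows with fleet size, a check like this would have to run before capacity is added, not after. The same function also shows why a smaller fleet of larger servers leaves more headroom under the same limit.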
AWS fixed the issue by restarting the Kinesis front-end fleet. The restart took several hours because “we can only add servers at a rate of a few hundred per hour,” said AWS.
AWS is already making several changes to prevent a similar outage, including moving to servers with more CPU and memory, which reduces both the total number of servers and the number of threads each server needs to communicate across the fleet. “This will provide significant headroom in thread count used as the total threads each server must maintain is directly proportional to the number of servers in the fleet. Having fewer servers means that each server maintains fewer threads,” AWS said.
Additionally, AWS is adding “fine-grained” alarming for thread consumption in the service and moving several large services, such as CloudWatch, to a separate front-end fleet. The company is also working on a larger project to isolate failures in one service so they don’t affect other services.
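As a rough illustration of what alarming on thread consumption could look like, the Python sketch below classifies a thread count against a limit. It is not AWS’ implementation: the function name, the thresholds and the limit are all assumptions chosen for the example.

```python
import threading


def thread_usage_level(active: int, limit: int,
                       warn_at: float = 0.7, critical_at: float = 0.9) -> str:
    """Classify thread consumption as 'ok', 'warning' or 'critical'.

    `warn_at` and `critical_at` are hypothetical fractions of the limit.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    usage = active / limit
    if usage >= critical_at:
        return "critical"
    if usage >= warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Hypothetical limit; a real service would read it from the OS configuration.
    assumed_limit = 1_000
    print(thread_usage_level(threading.active_count(), assumed_limit))
```

Alerting at a fraction of the limit, instead of at the limit itself, gives operators time to react before servers start failing.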
Steve Burke contributed to this article.