Massive AWS outage was caused by adding new servers to Kinesis
Red-faced Amazon says it will apply lessons learned to improve the reliability of its services
Amazon Web Services (AWS) has revealed the root cause of last week's massive outage, which impacted thousands of online sites and services, including Amazon's own.
According to the company, the outage was not caused by a memory issue in its network. Rather, it was triggered by the addition of new servers to the Amazon Kinesis real-time data processing service.
Adding new capacity caused all servers in the Kinesis system to exceed the maximum number of 'threads' allowed by an operating system (OS) configuration.
To communicate, each server in the Kinesis front-end fleet creates operating system threads for every other server in that fleet.
The front-end fleet already comprises "thousands of servers", according to AWS, so when the new machines were added, the thread count on every server exceeded the limit set by the OS configuration.
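AWS has not published the exact figures, but the failure mode is straightforward to illustrate: because each front-end server keeps a thread for every peer, per-server thread usage grows linearly with fleet size, so even a modest capacity addition can push every server past a fixed OS limit at the same time. The sketch below uses made-up fleet sizes and a made-up limit purely to show the arithmetic.

```python
# Illustrative only: the fleet sizes and OS thread limit below are invented,
# not AWS's real numbers.

def peer_threads_per_server(fleet_size: int) -> int:
    """Each front-end server keeps one thread per other server in the fleet."""
    return fleet_size - 1

OS_THREAD_LIMIT = 4096   # hypothetical per-server thread cap
BASE_FLEET = 4000        # "thousands of servers"
ADDED_CAPACITY = 200     # the newly added servers

for size in (BASE_FLEET, BASE_FLEET + ADDED_CAPACITY):
    threads = peer_threads_per_server(size)
    status = "OK" if threads <= OS_THREAD_LIMIT else "EXCEEDS OS LIMIT"
    print(f"fleet={size:5d}  peer threads per server={threads:5d}  {status}")

# Because every server crosses the limit at the same time, the whole
# front-end fleet degrades together rather than one machine at a time.
```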
The breach of the thread limit triggered a cascade of failures that eventually took down thousands of websites and services, including those of major companies such as Adobe, Flickr, Roku, Twilio and Autodesk.
AWS's own services were also affected, including ACM, Amplify Console, AppStream2, AppSync, Athena, Batch, CodeArtifact, CodeGuru Profiler, CodeGuru Reviewer, CloudFormation, CloudMap, CloudTrail, Connect, Comprehend, DynamoDB, Elastic Beanstalk, EventBridge, GuardDuty, IoT Services, Lambda, LEX, Macie, Managed Blockchain, Marketplace, MediaLive, MediaConvert, Personalize, RDS Performance Insights, Rekognition, SageMaker and Workspaces.
The multi-hour outage affected the US-East-1 region, according to the company.
The problem was fixed by restarting the entire Kinesis service, a process that took several hours to complete.
Amazon has apologised for the outage and said it would apply lessons learned to further improve the reliability of its services.
In the short term, the company plans to move to servers with more powerful CPUs and more memory, reducing the total number of servers and therefore the number of threads each server must maintain across the fleet.
It is also testing an increase to the thread count limits in the OS configuration, which AWS believes will provide an additional safety margin by allowing more threads per server.
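The company has not said which OS setting is involved; on a typical Linux host, the per-process and system-wide thread limits can be inspected (and, within the hard limit, raised) along the following lines. This is a sketch assuming a Linux environment, not a description of AWS's actual configuration.

```python
# Sketch: inspecting Linux thread-related limits from Python.
# Assumes a Linux host; the specific limit AWS adjusted is not stated.
import resource

# Per-user limit on processes/threads (equivalent to `ulimit -u`).
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"RLIMIT_NPROC: soft={soft} hard={hard}")

# System-wide ceiling on the total number of threads.
with open("/proc/sys/kernel/threads-max") as f:
    print("kernel.threads-max:", f.read().strip())

# Raising the soft limit up to the hard limit gives extra headroom,
# the kind of "additional safety margin" AWS describes.
resource.setrlimit(resource.RLIMIT_NPROC, (hard, hard))
```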
The company also plans to introduce a number of other changes to "radically improve the cold-start time for the front-end fleet".
"We are moving the front-end server cache to a dedicated fleet. We will also move a few large AWS services, like CloudWatch, to a separate, partitioned front-end fleet," AWS said.
"In the medium term, we will greatly accelerate the cellularisation of the front-end fleet to match what we've done with the back-end."