AWS blames 'network congestion' for this month's second outage
It was the second outage of the month, affecting many sites and services
Amazon's cloud computing service AWS has said that network congestion between parts of the AWS Backbone and a subset of Internet Service Providers was to blame for the outage that took down many popular sites and services, including Twitch, Netflix, and Hulu, for about an hour and a half on Wednesday, 15 December.
AWS stated on its status dashboard that many customers experienced elevated network packet loss between 7:14 AM PST (3:14 PM GMT) and 7:59 AM PST (3:59 PM GMT), which impacted connectivity to a subset of Internet destinations.
"The issue was caused by network congestion between parts of the AWS Backbone and a subset of internet service p[roviders, which was triggered by AWS traffic engineering, executed in response to congestion outside of our network," the company said.
This pushed more traffic than expected onto parts of the AWS Backbone, impairing connectivity to a subset of Internet destinations.
The issue affected the company's US-WEST-1 region in Northern California and US-WEST-2 region in Oregon, but was eventually resolved by AWS engineers.
"Traffic within AWS Regions, between AWS Regions, and to other destinations on the Internet was not impacted," the company added.
While the problems were brief this time, unlike the massive outage earlier this month, they once again affected major services such as Netflix, DoorDash, Starbucks, and Snapchat, as well as Amazon-owned companies including Twitch and Ring.
Twitch said on Wednesday that several issues were affecting its services and that it was working to resolve them. About an hour later, it said the issues had been fully resolved.
The brief outage on Wednesday came about a week after a massive outage on 7 December that knocked out a large number of websites, apps and streaming platforms worldwide.
That outage affected internal tools at the company as well, including the Flex and AtoZ apps that are used by warehouse and delivery workers, making it impossible for them to scan packages or access delivery routes.
In a brief summary of the incident, AWS explained that the problem began after "an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behaviour from a large number of clients inside the internal network."
That, in turn, created a huge spike in connection activity that overwhelmed the networking devices between the main AWS network and the internal network, delaying communication between the two. The resulting rise in latency and errors for services communicating across the networks prompted even more connection attempts and retries, compounding the congestion.
The issue even hampered the company's ability to see exactly what was going wrong with the system.

According to Amazon, engineers in the operations team were unable to use the real-time monitoring systems and internal controls they typically rely on.
Amazon says it expects to release a new version of its Service Health Dashboard early next year that will make it easier to understand service impact.
The firm also plans to release a new support system architecture that will actively run across multiple AWS regions, enabling AWS to communicate with its customers without delay.