AWS S3 outage blamed on employee's typo
Firm says it will make "several changes" to prevent recurrence
AMAZON HAS REVEALED that a simple typo was behind the massive Amazon Web Services (AWS) outage that downed websites earlier this week.
The outage struck on Wednesday, and caused widespread headaches, affecting - to name but a few - Adobe's cloud services, Amazon's Twitch, Docker, GitHub, iFixit, Kickstarter, Slack and Yahoo Mail. The downtime also floored Is It Down Right Now, and left Amazon unable to update its own AWS status dashboard.
A number of Internet of Things (IoT) devices were also hit by the outage, with users complaining that they were unable to switch off their lights and even ovens.
Amazon at the time blamed the fault on issues with its widely used S3 storage service in the major US-East-1 region of data centres in Northern Virginia.
The firm has since elaborated on the cause of the outage, and has revealed that nothing more than an employee typo was to blame.
"The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorised S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process," the firm said in a blog post.
"Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended."
AWS noted that it's making "several changes" as a result, including steps that would prevent an incorrect input from triggering such problems in the future.
"While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly.
"We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level."
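To give a rough idea of what such a safeguard might look like, the Python sketch below refuses any removal that would take a fleet below its minimum and splits an accepted request into small batches. Amazon has not published its tooling, so the function name, parameters and numbers here are all hypothetical and purely illustrative.

```python
class CapacityError(ValueError):
    """Raised when a removal request would breach the minimum capacity."""


def plan_removal(current, requested, minimum, max_per_step):
    """Return the batch sizes for a removal, or refuse an unsafe request.

    All names and thresholds are hypothetical; this is not Amazon's tool.
    """
    if requested < 0 or max_per_step <= 0:
        raise ValueError("requested must be >= 0 and max_per_step must be > 0")

    # Safeguard: never let the fleet drop below its minimum required capacity.
    if current - requested < minimum:
        raise CapacityError(
            f"removing {requested} of {current} servers would leave fewer "
            f"than the required minimum of {minimum}"
        )

    # Rate limit: remove capacity in small batches rather than all at once.
    batches = []
    remaining = requested
    while remaining > 0:
        step = min(max_per_step, remaining)
        batches.append(step)
        remaining -= step
    return batches


if __name__ == "__main__":
    print(plan_removal(current=100, requested=5, minimum=80, max_per_step=2))
    try:
        # A mistyped input asking for far more servers than intended.
        plan_removal(current=100, requested=50, minimum=80, max_per_step=2)
    except CapacityError as err:
        print("refused:", err)
```

In this sketch a mistyped input that asks for 50 servers instead of five is rejected outright, while the intended request is carried out in batches of no more than two.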