Microsoft Azure failure explained: configuration changes sent systems into infinite loop
Changes to storage configuration crashed the system and also took down monitoring dashboards
On Wednesday, beginning at 00:52 GMT, Microsoft's cloud computing platform Azure was hit by a fault that took many third-party sites offline and affected a total of 20 services, including Microsoft's Office 365 online apps, virtual machines, backup, machine learning and HDInsight, the company's cloud-based Hadoop distribution. The disruption lasted for 11 hours and was felt in the US, Europe and parts of Asia.
The cause of this disruption has now emerged. Writing on a Microsoft blog, Jason Zander, CVP of the Microsoft Azure team, says the culprit was a change to Azure's configuration that was intended to improve the performance of blob storage (the storage of unstructured data), but which instead threw the system into an infinite loop.
"During the rollout we discovered an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected during flighting [pre-implementation testing]," writes Zander.
"The net result was an inability for the front ends to take on further traffic, which in turn caused other services built on top to experience issues."
The problem was compounded by the fact that the Service Health Dashboard and the Azure Management Portal, through which customers and Microsoft check the health of their services, both depend on Azure Storage, meaning that neither reflected the true picture of what was occurring.
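The circular dependency is easy to picture: if the dashboard's own health data lives in the storage service it reports on, an outage of that service leaves the dashboard with nothing fresher than the last record it managed to read. The sketch below is purely illustrative and assumes nothing about how the real dashboard is built:

```python
"""Illustrative sketch only: a hypothetical status page that keeps its own
health records in the storage service it reports on. Names are invented;
this is not how the Azure Service Health Dashboard is implemented."""

LAST_KNOWN_GOOD = {"storage": "healthy"}

def read_health_records(storage_available):
    # The dashboard's only data source is the storage service itself.
    if not storage_available:
        raise ConnectionError("storage front ends unreachable")
    return {"storage": "healthy"}

def render_dashboard(storage_available):
    try:
        return read_health_records(storage_available)
    except ConnectionError:
        # With no independent data path, the page falls back to stale data,
        # so customers keep seeing "healthy" during the outage.
        return LAST_KNOWN_GOOD

print(render_dashboard(storage_available=True))   # -> {'storage': 'healthy'}
print(render_dashboard(storage_available=False))  # -> {'storage': 'healthy'} (stale)
```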
Once the error had been detected, Microsoft says, the changes were rolled back. However, the storage front ends had to be restarted to return the system to its previous state, resulting in continued disruption.
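The restart requirement follows from how long-running processes typically consume configuration: a worker that reads its settings once at startup and is then stuck in a loop never sees the rolled-back values until it is restarted. Again, the names below are invented for illustration:

```python
"""Illustrative sketch only: invented names showing why rolling back a
configuration does not, by itself, recover a worker that is already stuck -
the worker only rereads its settings when it restarts."""

import itertools

def run_front_end(config_store, max_spins=5):
    config = dict(config_store["current"])  # settings are read once, at startup
    for spin in itertools.count():
        if not config.get("bad_flag"):
            return "serving traffic"
        if spin >= max_spins:               # cap so the demo terminates; the real
            return "still stuck"            # process would spin until restarted
        # A rollback changes config_store["current"], but this loop never rereads it.

config_store = {"current": {"bad_flag": True}}
print(run_front_end(config_store))             # -> still stuck
config_store["current"] = {"bad_flag": False}  # roll the change back...
print(run_front_end(config_store))             # ...then restart the front end -> serving traffic
```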
Zander insists that "services are generally back online", although "a limited subset of customers are still experiencing intermittent issues", which he says Microsoft's engineers are working to rectify.
Writing on the Petri IT Knowledgebase, Microsoft MVP Aidan Finn was critical of Microsoft's response, in particular how the issue was communicated to customers.
"At my place of work, we had calls from customers that didn't understand what was happening. Virtual machines had crashed and wouldn't start. They hadn't heard from Microsoft. Microsoft's Azure Status site was claiming that things were just dandy during the first three hours of the issue; it was affected by the issue too! Microsoft did tweet about it and used other social media ... but that's hardly alerting, is it?" Finn said.
A previous Azure outage in July 2012 was also blamed on a configuration error.