How Blizzard autoscaled Overwatch on OpenStack
Cloud engineers at Blizzard reduced the game's footprint by 40 per cent
Blizzard Entertainment is the games development studio behind online multiplayer titles such as World of Warcraft and Diablo. These are deployed on a private cloud infrastructure across 11 global data centres, using virtual gaming servers running on OpenStack. Blizzard has been using OpenStack since 2012.
At the Open Infrastructure Summit in Denver, Colorado, cloud engineers Duc Truong and Jude Cross explained how they had improved the performance of the game Overwatch and reduced the footprint of their virtual machines (VMs) by autoscaling using Senlin, a clustering service for OpenStack clouds.
Reducing the VM footprint
Running popular multiplayer games like Overwatch - which has 40 million registered users - in the cloud presents some unique challenges when it comes to optimising performance and resources. First, load fluctuates markedly depending on when most users are online. In addition, some players are active for just a few minutes while hardcore gamers play for many hours at a time. This means VMs cannot easily be taken down for fear of booting players out of a game. The result is that infrastructure tends to be overprovisioned.
Secondly, the duo explained, autoscaling (automatically increasing or reducing the number of VMs according to demand) traditionally relies on metrics such as CPU and RAM to decide when to scale up or down. But the dynamics of online gaming mean that game servers hold all available RAM in a VM in reserve, and CPU activity spikes with the number of gamers connected, so these factors are not good guides.
Autoscaling is therefore a challenge, but it was a capability that Truong and Cross's bosses wanted nevertheless, to make better use of their resources and to provide a better service to app developers.
"In a private cloud we have a finite capacity, and without autoscaling that capacity gets divided up among the different games," Truong explained. "Each game is allocated a number of VMs to be able to handle the peak traffic for that game, including tournaments, but during non-peak times the load is spread over the VMs, and so each VM is underutilised."
With autoscaling, the load gets packed more tightly into existing VMs, leaving more spare capacity in the system that can be used for other things, he added.
There are other advantages too. With autoscaling, the lifetime of a VM can be a lot shorter, which in turn leads to more efficient operations and fewer bugs.
However, Senlin out of the box was not well suited to the peculiarities of autoscaling game servers, particularly when scaling down, given that a player could be active in a VM for several hours. Truong and Cross had to make a number of changes, the first of which was to introduce lifecycle hooks into Senlin, to make sure nodes have been fully drained before they are deleted.
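In Senlin, lifecycle hooks are expressed through the deletion policy: from version 1.1 of that policy, a hook can publish a pre-deletion message to a Zaqar queue and hold off deleting the node until the hook is completed or times out. The sketch below shows roughly what such a policy might look like; the queue name, timeout, cluster and policy names are illustrative assumptions rather than Blizzard's actual configuration, and the calls assume the openstacksdk clustering proxy.

```python
# Sketch: a Senlin deletion policy with a Zaqar lifecycle hook, so a node is
# announced on a queue and given time to drain before it is deleted.
# Queue name, timeout and cluster/policy names are illustrative assumptions.
import openstack

deletion_policy_spec = {
    "type": "senlin.policy.deletion",
    "version": "1.1",
    "properties": {
        "hooks": {
            "type": "zaqar",                 # publish the pre-deletion notice to Zaqar
            "timeout": 3600,                 # seconds allowed for the node to drain
            "params": {"queue": "overwatch-drain"},   # hypothetical queue name
        },
    },
}

conn = openstack.connect(cloud="private-cloud")      # assumes a clouds.yaml entry
policy = conn.clustering.create_policy(
    name="drain-before-delete", spec=deletion_policy_spec)
conn.clustering.attach_policy_to_cluster("overwatch-servers", policy.id)
```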
Because CPU and RAM are unreliable metrics, they also had to devise a way of letting each game calculate when to scale up or down autonomously, based mainly on the number of current users.
Creating the autoscaler
The implementation has three main components: Senlin, which runs in a Docker container in the control plane; Zaqar, the OpenStack messaging service, which runs in another container in the control plane and provides the lifecycle hooks for Senlin; and a custom autoscaling service, which runs in a VM and also integrates with the AWS public cloud for when further capacity is needed.
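In outline, when Senlin wants to remove a node it posts a lifecycle message to the Zaqar queue; the custom service picks the message up, waits for the game server to empty, then tells Senlin the lifecycle is complete so the deletion can go ahead. The sketch below captures only that sequence; the helper functions and message field names are hypothetical stand-ins, since the talk did not go into the exact calls.

```python
# Sketch of the drain-before-delete sequence. Helper functions and message
# field names are hypothetical stand-ins; only the overall flow is from the talk.
import time


def active_player_count(node_id: str) -> int:
    """Hypothetical lookup of how many players are still on this game server."""
    raise NotImplementedError


def complete_lifecycle(cluster_id: str, lifecycle_token: str) -> None:
    """Hypothetical wrapper around Senlin's 'complete lifecycle' cluster action."""
    raise NotImplementedError


def handle_lifecycle_message(body: dict) -> None:
    """Process one pre-deletion hook message that Senlin published via Zaqar."""
    node_id = body["node_id"]                       # assumed message field
    token = body["lifecycle_action_token"]          # assumed message field

    while active_player_count(node_id) > 0:         # wait for the node to drain
        time.sleep(30)

    complete_lifecycle(body["cluster_id"], token)   # node is now safe to delete
```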
The custom autoscaler tells Senlin to add VMs when the load reaches 60 per cent of the peak, and to delete nodes (after draining them) when it drops to 40 per cent. It also pushes metrics and state to a time series database - which was another area that caused problems.
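The scaling decision itself is simple hysteresis around each game's known peak: scale out above 60 per cent, drain and scale in below 40 per cent. A minimal sketch of that rule follows, with the thresholds taken from the talk and everything else (function and metric names) assumed for illustration.

```python
# Minimal sketch of the autoscaler's hysteresis rule. The 60%/40% thresholds
# come from the talk; the function and metric names are assumptions.
SCALE_OUT_THRESHOLD = 0.60   # add VMs once load reaches 60% of peak
SCALE_IN_THRESHOLD = 0.40    # drain and delete VMs once load drops to 40% of peak


def desired_action(current_players: int, peak_players: int) -> str:
    """Decide what the autoscaler should ask Senlin to do for one game."""
    load = current_players / peak_players
    if load >= SCALE_OUT_THRESHOLD:
        return "scale_out"
    if load <= SCALE_IN_THRESHOLD:
        return "scale_in"     # nodes are drained via the lifecycle hook first
    return "no_change"
```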
The existing setup had to be optimised to handle heavy traffic and to cope with simultaneous scaling requests, which caused the database to lock. The locking issue was addressed with changes to the API, while the number of database calls was drastically reduced.
"We updated the DB model and removed unnecessary DB calls - we reduced some of those calls by 1,000 per cent," said Cross. "That might sound like hyperbole, but it's not."
The action and event tables in Senlin also grew quickly, causing a performance bottleneck, until a routine was introduced to purge older records every seven days.
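One straightforward way to implement such a purge is a scheduled job that deletes old rows directly from the Senlin database. The sketch below assumes MySQL, a seven-day cut-off, and the table and column names shown, none of which are confirmed by the talk.

```python
# Rough sketch of a scheduled purge of old Senlin action/event records.
# Database URL, table names, column name and the seven-day cut-off are
# assumptions for illustration, not details given in the talk.
import sqlalchemy as sa

engine = sa.create_engine("mysql+pymysql://senlin:secret@db-host/senlin")  # placeholder DSN

with engine.begin() as conn:
    for table in ("action", "event"):
        conn.execute(sa.text(
            f"DELETE FROM {table} WHERE created_at < NOW() - INTERVAL 7 DAY"))
```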
Another hurdle was matching the asynchronous actions in Senlin with the synchronous failure checks made by the autoscaler, which caused conflicts. A tweak to the Senlin API allowed it to ignore conflicting requests.
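From the autoscaler's side, this kind of conflict typically shows up as an HTTP 409 when a scaling request lands while a previous Senlin action is still running. Below is a generic sketch of tolerating that, assuming Senlin's v1 cluster-actions endpoint; it is not Blizzard's actual code.

```python
# Generic sketch: issue a scale-out request and treat HTTP 409 (a conflicting
# Senlin action already in flight) as "skip this cycle" rather than an error.
# The endpoint shape follows the Senlin v1 API; payload details are illustrative.
import requests


def request_scale_out(senlin_url: str, auth_token: str, cluster_id: str, count: int = 1) -> bool:
    """Return True if Senlin accepted the action, False if it conflicted."""
    resp = requests.post(
        f"{senlin_url}/v1/clusters/{cluster_id}/actions",
        headers={"X-Auth-Token": auth_token},
        json={"scale_out": {"count": count}},
    )
    if resp.status_code == 409:      # another action is still running on the cluster
        return False                 # drop the request instead of piling up conflicts
    resp.raise_for_status()
    return True
```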
As a result of all these efforts, Overwatch now runs much more efficiently, said Truong, adding that he and his colleagues are now very active in developing the Senlin project upstream.
"We reduced the VM footprint used by Overwatch by 40 per cent, and that freed up capacity we were able to redistribute to other games and use for other workloads in our private cloud," he said.