Betfair's eight-year quest to perfect DevOps
Betfair's director of tech explains the ups and downs of DevOps, culminating in a £30m investment in OpenStack and SDN
Online gambling giant Betfair, which recently merged with Paddy Power in a £5bn deal, has been working on creating the perfect DevOps approach for the past eight years.
Stephen Lowe, director of technology in the infrastructure division of Betfair, was speaking at Cloud Expo Europe this week, where he explained that eight years ago the company had what was deemed a 'traditional' model.
"We had Devs who wanted to change and use continuous delivery, and we just started building microservices - we were trying to break up our monolithic architecture. And we had Ops people who were focused on stability, uptime and reliability," he said.
When microservices were introduced, Betfair was growing exponentially: recruiting 40 to 50 developers a month at its peak, and producing a lot of software, leaving Ops with a huge backlog of software to deploy.
"They were used to deploying a single 'monolith', so deploying one app into our production estate every month. Then all of a sudden we have guys who are producing hundreds of different microservices and changing them quickly. These guys were getting 50 to 60 deployments a week and they were all different," Lowe explained.
On top of this, the company's developers were spread across different countries, and the teams hadn't really standardised how they built those apps or wanted them to be deployed.
"Our Ops said ‘hold on, this is crazy, there is no standardisation'. And our Devs team said ‘if you're telling us what to do you're slowing us down and we can't beat the deadlines'...
"So, as a lot of people did at the time - and there is a certain amount of regret we have about this now - we built a DevOps team that was aimed at bridging the gap," said Lowe.
The DevOps team was made up of some scripting Ops personnel, as well as developers. The team could deploy apps but couldn't create VMs or change network configuration - those things remained in the hands of the Ops team.
Lowe said that the team made "a good start in optimising throughput", and Betfair decided to standardise the DevOps team.
"They were a small team, and we started using [DevOps tools] such as Chef, and we built a common container for all of our apps. We were now up to about 150 microservices at this point, so we had 150 separate deployments in our estate.
"Now that DevOps were there we got a bit more speed, but the real problem was the Ops guys felt someone was encroaching on their remit, so they wanted to limit what they could do and what access they had. Meanwhile, Devs were saying that this is great but we need more from [the DevOps team]," he explained.
At this point the Devs team was still growing - it was recruiting 20 developers every month, but the DevOps team remained at about 20 people.
"So [the DevOps] team was funnelling all of this through and it was better but not great, and we realised that we can't have a team that keeps scaling at this rate," said Lowe.
What Betfair wanted was for Devs to take all responsibility, so it embedded the DevOps team into its Devs team again, something which Lowe suggested "may also have been a mistake in retrospect".
The aim was to assign members of the DevOps team to separate Developer teams, so that they could pass on their knowledge.
But what actually happened was that those members of the DevOps team adopted the ways of working of the Devs teams they joined, and the standardised approach that the DevOps team had initially instilled was effectively lost; continuous delivery once again differed between the Romanian, UK, Dublin and Portuguese teams.
"It was a good experiment, but our DevOps team hated it. They didn't have peers with the same skillset so there was push-back from them," Lowe said.
Betfair could have simply gone back to the old way of working, but instead it decided to give Devs even more responsibility. "So we took our DevOps team back out and made them the orchestrators; they were defining the technology, building some tools and standardising things. Meanwhile, we put our Devs on call," Lowe said.
He suggested that putting Devs "on call" was the "biggest cultural change and shock" in the model up until that point.
"We said: ‘You'll build software, then if it breaks at midnight you'll get a phone call and you'll have to come and fix it'. All of a sudden, operational concerns became a lot more important to them because they didn't want to be called at 2am or on a Sunday afternoon," he said. This meant the Devs team had limited access to some operations, but they couldn't build their own hypervisors. The tooling was dealt with by the DevOps team.
Lowe explained that as Devs were responsible for their own software, the Ops team "could smile a bit because they could phone them if it broke". The Devs team, on the other hand, weren't as enthused, but they got used to the new way of working.
More importantly, the quality of software dramatically improved and the number of software issues went down as a direct result of this added accountability.
At this point, Betfair was up to about 200 microservices and its monoliths had pretty much all disappeared. However, there remained some complaints.
Devs could deploy software, but had no control over hypervisors' retention or network bandwidth, among other things, while Ops had built a three-tier architecture that wasn't really suited to microservices.
So Lowe and his team reacted by bringing in automation.
"We've now gone into a world where we've built new infrastructure with a huge £30m investment, starting from scratch. It's based on OpenStack and called i2," he said.
"Frankly, all of those tools we couldn't give Devs control over, such as network configuration, virtual IP addresses, load balancing and hypervisor control, we're now building a whole set of automation to give them that control.
"It's based on an OpenStack cloud, with a lot of systems on RedHat, a software-defined network (SDN) of Nuage, and it's pretty slick," he added.
Developers can change the code easily, and they don't need to worry about knowing how to configure a Cisco router or a Citrix load balancer, Lowe said.
The new environment went live on Monday, and so far, Lowe hasn't had any complaints.
The future
Lowe explained that throughout the company's DevOps journey, it had brought Ops to Devs, but was now in the early stages of bringing Devs into Ops instead.
"Now all of our Ops teams are starting to get Devs, so they are there to build tools to make Devs' lives easier but still have control and accountability of the network and storage," he said.
Accountability is one of the key trends that Lowe and his team are pushing for - whether it is in Devs, DevOps or Ops.
And what Betfair is now doing is mixing teams with various layers of Devs, Ops and automation.
"We're effectively mixing them all in. It'll be very rare where you'll find a team that is all Ops or all Devs," he said.
Lowe is also encouraging people to cross-skill but some people aren't keen, such as those who have spent years getting Cisco certifications. Recruiting for this cross-skillset is a "big problem", he said.
"We do take a lot of graduates into our estate now. We train them, we can hire people to get into our teams. Some of them will be a majority of one thing or another, so you probably do need some Cisco skills, but we'd ask you to do other things as well," he added.
Computing asked Lowe what the merger with Paddy Power would mean for its DevOps team. "The goal is still the same. I'm lucky that Paul Cutter [CTO of Betfair] is still there so we're still on this same agenda.
"It's not as alien as we thought from the outside. We thought that Paddy Power was on a similar journey and when we met with their [IT team] perhaps they were a bit behind and maybe needed more time. However, when we told them our vision they were on board straight away," he said.