Big Data & IoT Summit 2017: Skyscanner aims for half-second daily downtime
When building a data platform, getting it right first time isn't necessarily the goal, suggests Skyscanner engineer Michael Pollet
Michael Pollet, a software engineer at airline price comparison website Skyscanner, was brave enough to take the stage and talk about what the company had learned from a year of - in his words - "building the wrong data processing system".
Skyscanner now has more than 800 staff, 40 of whom work in the data team. This is a necessity when you consider the amount of data that the website handles, said Pollet: 50 million unique monthly visitors and, daily, 12 million live searches, more than 100 million requests for partner (airline) data, and five billion prices returned. The site deals with 155,000 messages every second: 6.4 billion every day.
Skyscanner aims for 'five nines' reliability: 99.999 per cent data integrity. That target permits the loss of at most 64,000 messages a day: equivalent to just 0.5 seconds of downtime.
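The figures quoted above can be sanity-checked with a few lines of arithmetic (a sketch using the rates reported in the talk):

```python
# Sanity check on the 'five nines' loss budget quoted in the talk.
DAILY_MESSAGES = 6_400_000_000        # 6.4 billion messages a day
PEAK_RATE = 155_000                   # messages per second
INTEGRITY_TARGET = 0.99999            # 'five nines'

# At 99.999% integrity, the daily loss budget is 0.001% of traffic.
loss_budget = DAILY_MESSAGES * (1 - INTEGRITY_TARGET)
print(round(loss_budget))             # 64000 messages a day

# At the peak message rate, that budget is well under a second of
# traffic -- the 'half a second of downtime' equivalence in the headline.
seconds_of_traffic = loss_budget / PEAK_RATE
print(round(seconds_of_traffic, 2))   # roughly 0.41 s
```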
In its quest, Skyscanner began to build a data platform that would take in partners' data and output it in a usable format. Before building a platform, said Pollet, you need to know why you are building it, what data you are sending, and whether you favour integrity or latency when things go wrong. Without knowing the answers to these questions, you won't know when you've finished.
Skyscanner's starting direction was to have a unified log of real-time data; a long-term archive; and a structure for its output data (the necessity for a structure returns to a point that was raised several times throughout the day: the democratisation of data, making it easier to access and use for people who aren't data scientists). Pollet recommended starting small and building out when you know what you want the system to do.
The first iteration of the system was very simple: data came from the partner and was sent through a fast streaming layer (using Kafka and Samza), then to the serving layer; this consisted of the archive, metrics and ELK logs. It was soon found that Kafka was causing problems, and data was taking up to a month to make its way through the system, which was hardly ideal.
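The shape of that first iteration can be sketched in miniature (the names and structure below are an illustration, not Skyscanner's code; in production the streaming layer was Kafka plus Samza, modelled here as a simple pass-through):

```python
from collections import defaultdict

def run_pipeline(messages):
    """Toy model of the first iteration: producer -> streaming layer ->
    serving layer (archive, metrics, logs)."""
    serving = defaultdict(list)
    for msg in messages:                 # streaming layer: one message at a time
        serving["archive"].append(msg)               # long-term archive
        serving["metrics"].append(len(msg))          # e.g. a message-size metric
        serving["logs"].append(f"received {msg!r}")  # ELK-style log line
    return serving

out = run_pipeline(["price:120", "price:98"])
print(len(out["archive"]))  # 2 messages archived
```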
The addition of an SDK made it "much easier" to find out what people wanted to use Skyscanner's data for. However, regular SDK releases meant that data producers (the partners) were left to find bugs and had to update frequently, or risk being left behind.
Skyscanner found that the system's reliability was deteriorating due to SDK bugs, and maintenance was growing faster than features. The company solved this with an HTTP interface, simplifying the SDK to focus on reliability.
This removed the updating burden from producers, but meant that (again) data was sometimes not arriving in the archive - this time because the interface had to buffer messages, which could eventually be dropped or lost.
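That failure mode is easy to reproduce in miniature (a sketch with invented sizes, not Skyscanner's implementation): a fixed-size buffer in front of a stalled downstream system silently sheds messages once it fills up.

```python
from collections import deque

class BufferedForwarder:
    """Toy HTTP-interface buffer: holds messages until the downstream
    system drains them, and drops new messages once the buffer is full."""
    def __init__(self, capacity):
        self.buffer = deque()
        self.capacity = capacity
        self.dropped = 0

    def accept(self, msg):
        if len(self.buffer) >= self.capacity:
            self.dropped += 1   # silent loss: this message never reaches the archive
        else:
            self.buffer.append(msg)

fwd = BufferedForwarder(capacity=3)
for i in range(5):              # downstream has stalled; nothing drains
    fwd.accept(f"msg-{i}")
print(fwd.dropped)  # 2 of the 5 messages are lost
```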
Being able to scale down is just as important as scaling up - Michael Pollet, Skyscanner
The final version of the system has two separate sections: one 'reliable' batching layer and the original fast streaming layer. Data is split before the HTTP interface by a data router, which determines the layer it should be sent to. A diagram of Twitter's own processing system showed a very similar arrangement to that used by Skyscanner - "We probably arrived at the best solution through trial and error," said Pollet.
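The split amounts to a routing decision made ahead of the two layers. As a sketch (the routing criterion here, a latency-sensitivity flag, is an invented stand-in for whatever rules Skyscanner actually applies):

```python
def route(message):
    """Toy data router: latency-sensitive messages go to the fast
    streaming layer, everything else to the reliable batching layer."""
    if message.get("latency_sensitive"):
        return "streaming"
    return "batching"

print(route({"type": "live_search", "latency_sensitive": True}))  # streaming
print(route({"type": "archive_backfill"}))                        # batching
```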
Skyscanner's data system naturally evolved into a similar design to that used by Twitter
At the end of the process, Skyscanner says that it has reached the elusive 'five nines', with scalability into the bargain. Not just scaling up - Pollet told us that scaling down is just as important. "When it's 5am and you're not getting many data requests, we could only require two or three servers, not nine. That's a big cost saving."
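The scaling argument boils down to simple capacity arithmetic (the per-server throughput figure below is invented for illustration; only the two-or-three-versus-nine contrast comes from the talk):

```python
import math

def servers_needed(requests_per_second, per_server_capacity=20_000):
    """Minimum number of servers for a given load, with a floor of one."""
    return max(1, math.ceil(requests_per_second / per_server_capacity))

print(servers_needed(170_000))  # peak traffic: 9 servers
print(servers_needed(50_000))   # quiet 5am period: 3 servers
```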
Despite the ups and downs of building the system, Skyscanner would probably not skip straight to its final solution if going through the process again. "We used an evolving architecture that delivered a usable system early on," said Pollet. "By doing it that way, producers were able to send data and talk to us about issues they encountered so that we could fix them in the next iteration."
There was time for only one question at the end: "How do you manage your data retention policies when amassing big data?"
Pollet admitted that his answer was a bit of a "cop-out". Skyscanner processes a huge amount of information, but it is all valuable and useful.
Some of it is kept for less time, but mostly it's retained forever and, luckily, storage is pretty cheap these days - so the policy amounts to "throw it at the wall and see what sticks".