How Centrica is using Hadoop, Spark and NoSQL as part of its big data strategy
Centrica deploys tools from Hortonworks, Couchbase, Datastax and CA Technologies in a bid to improve customer satisfaction
Centrica, the utility whose brands include British Gas and Hive, has deployed a selection of big-data tools in a bid to improve customer satisfaction and business processes.
Daljit Rehal, director of strategic systems at Centrica, told Computing that about four years ago, the company was thinking about how it would deal with big data in the coming years, despite not having finished getting what he called the "old 'small data' solutions" working properly.
At this point, the cost of upgrading the company's traditional legacy platforms was "prohibitively high", so the firm started looking at more cost-effective ways to invest its money, as well as starting to tackle some of the big-data challenges that were coming its way.
"Big data was arriving and we knew it was going to be quite critical to us because, as a company, we were rolling out smart metering, the Internet of Things (IoT) and connected homes," said Rehal.
He added that while the traditional systems that Centrica was using were good for managing structured data, the company had to be prepared to manage unstructured data too - and join the two together.
When it asked one of its traditional suppliers for this functionality, Centrica was told that an upgrade of this sort would cost several million pounds.
"The challenge was to do it more effectively, so we created a Hadoop data lake with 250 nodes for less than £1m. Not only can it store more data [than the legacy system], it has got more functioning nodes and it's a lot faster - above all it can be used to join up data from structured and unstructured sources," Rehal explained.
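The join Rehal describes can be sketched in miniature. The snippet below is purely illustrative (the record shapes, field names and data are invented, and a real lake would do this at scale in Hadoop): structured billing-style rows are enriched with features extracted from unstructured call-centre notes.

```python
# Illustrative sketch: joining structured records (e.g. billing rows)
# with features derived from unstructured text (e.g. call notes).
# All names and data here are invented for illustration.

structured = [
    {"customer_id": 1, "tariff": "standard", "monthly_kwh": 310},
    {"customer_id": 2, "tariff": "economy7", "monthly_kwh": 540},
]

# Unstructured side: free-text call-centre notes, keyed by customer.
call_notes = {
    1: "Customer complained about boiler noise last week.",
    2: "Routine tariff query, no issues raised.",
}

def extract_features(note: str) -> dict:
    """Derive a simple structured feature from free text."""
    return {"recent_complaint": "complain" in note.lower()}

def join_sources(rows, notes):
    """Enrich each structured row with text-derived features."""
    enriched = []
    for row in rows:
        features = extract_features(notes.get(row["customer_id"], ""))
        enriched.append({**row, **features})
    return enriched

enriched = join_sources(structured, call_notes)
print(enriched[0]["recent_complaint"])  # customer 1 flagged as a complainer
```

The point is not the toy feature extraction but the shape of the operation: once free text is reduced to structured features, it joins to the rest of the data on an ordinary key.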
For its Hadoop data lake, Centrica picked Hortonworks over the likes of Cloudera and MapR.
"We looked at everybody, and at that time we looked at which of these distributions had strong alliances with our strategic partners, such as SAP and HP. We also looked at which of these felt to us like being more committed to the spirit of the open-source nature of what we were doing, as opposed to getting us into a proprietary situation," said Rehal.
Centrica also factored in cost and the availability of skills, and built an internal team skilled in technologies such as the Hadoop big-data platform, the Sqoop data-transfer tool and the Pig large-scale data-processing system.
Rehal added that the company had the skillset to go completely open source if it wanted to, but as an enterprise organisation the team felt it needed a supported distribution. "We opted for Hortonworks and it has been going well so far," said Rehal.
Putting together a big data architecture
Above the Hadoop data lake is a caching layer provided by NoSQL vendor Couchbase, and above this is an API gateway provided by CA Technologies.
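The layered architecture behaves like a classic cache-aside pattern: a request first checks the fast cache and only falls back to the slower lake on a miss. A minimal sketch, with the Couchbase cache and Hadoop lake mocked as plain dicts (the stores, keys and records are invented for illustration):

```python
# Cache-aside sketch of the layered architecture: an API call first
# checks a fast cache (Couchbase in Centrica's stack) and falls back
# to the slower data lake (Hadoop) on a miss, populating the cache
# for next time. Both stores are mocked as dicts for illustration.

lake = {"cust-42": {"name": "A. Smith", "recent_complaint": True}}  # slow store
cache = {}  # fast store, initially empty

def get_customer(customer_id: str) -> dict:
    """Return a customer record, preferring the cache."""
    if customer_id in cache:          # cache hit: no lake query needed
        return cache[customer_id]
    record = lake[customer_id]        # cache miss: read from the lake
    cache[customer_id] = record       # populate the cache for later calls
    return record

first = get_customer("cust-42")   # miss: served from the lake
second = get_customer("cust-42")  # hit: served from the cache
assert first == second
```

In the real stack the API gateway in front of this layer adds routing, authentication and rate limiting, but the read path follows the same hit-or-fall-back logic.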
Explaining how the new systems work, Rehal gave a typical use case of when someone calls an engineer to get their broken boiler fixed. Previously, the engineer would only have the information of the whereabouts of the boiler and the task they had to perform.
Now, he said, the data lake provides the engineer with additional information, such as whether the customer has complained recently, or whether he or she requires any further assistance.
"That data comes from the Hadoop data lake, but it is exposed via a RESTful API, which has a Couchbase caching layer. We're at the point where several apps can be developed to make the most of this, but we haven't done this yet," he said.
"If we wanted to publish an SDK to allow our internal users, and also maybe someday external users, to develop apps then this would be the architecture to do that. So we're putting all the pieces in place, and in the case of some of our mobile apps for our own field forces we've already created those," he added.
This has meant that the company could move on from focusing on batch processing and analytics to making real-time decisions.
Spark is another big-data tool the company is increasingly using.
"We use Spark with Datastax's Cassandra product for our connected homes and we are using Spark as part of the Hadoop data lake. We've [also] got use cases for machine learning, artificial intelligence, forecasting or optimisation problems for which we're using Spark. We haven't gone operationally live with everything just yet, but it's coming in the next few quarters," said Rehal.
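Spark itself is too heavyweight for a short snippet, but the shape of a forecasting workload can be sketched in plain Python. The moving-average forecast and the meter readings below are invented for illustration; at Centrica's scale, a job of this kind is what Spark would distribute across the cluster:

```python
# Naive moving-average forecast over smart-meter readings.
# The data and window size are invented; in production a job like
# this would run distributed under Spark, not in a single process.

def moving_average_forecast(readings: list, window: int = 3) -> float:
    """Forecast the next reading as the mean of the last `window` values."""
    recent = readings[-window:]
    return sum(recent) / len(recent)

daily_kwh = [10.2, 11.0, 9.8, 10.5, 10.9, 11.3]
print(round(moving_average_forecast(daily_kwh), 2))  # mean of the last 3 readings
```

A real optimisation or machine-learning use case would swap the toy forecast for a fitted model, but the pipeline structure - read readings, compute per-customer, write predictions - stays the same.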
Computing's recent in-depth research of more than 500 IT professionals found that Spark had a fast-growing userbase in the UK. When Computing asked which big-data processing platform respondents expected their company to be using as its primary tool in 18 months' time, Hadoop led among those planning to process big data (59 per cent), followed by Spark (17 per cent).
Rehal can't see Spark replacing Hadoop as the primary big data solution.
"I don't think so. For me, it serves as a good solution on top of your data lake for real-time use cases: when you've got streaming switched on and the data is readily available in HBase, then using in-memory solutions on top of that is a brilliant combination and it works fantastically," he said.
"To use Spark as the primary lake itself, I haven't given it that much thought because I'm still in the process of [using Hadoop]. No doubt there'll be some things that work straight away purely in Spark, and others that work indirectly through Spark or a combination of Hadoop and Spark.
"One of the things you have to be careful about with big data projects is not changing your mind every two weeks," he added.
Skilling up
At Computing's Big Data & Analytics Summit earlier this month, Demeter Sztanko, a data architect at global dating site Badoo, suggested that companies implementing Hadoop should ensure they have enough technical expertise in-house, including at least two people with Hadoop experience, so that everything goes as smoothly as possible.
But Centrica's Rehal doesn't completely agree.
"I think if there's one skillset you need above anything else it's Java development; you need people who can write software in languages like Java and Python, and if you are in an IT-based organisation and don't have those skills then you're in the wrong organisation," he said.
Rehal said the company found plenty of people with those skills and that it didn't take long for them to adapt to coding in Hadoop. "The real problem for us was the knowledge of Hadoop and understanding things like configuration and compatibility, and which version on which product works on which platform," he said.
People who were more accustomed to using SQL required some hand-holding, he said. "They're experienced people and they didn't know how to do procedural coding or functional programming. For them it was harder and we had to get some training for them," he said.
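The gap Rehal describes can be illustrated with a toy example: an aggregation that a SQL developer writes declaratively has to be expressed procedurally or functionally in code. The data and field names below are invented for illustration:

```python
# The same aggregation, two ways. A SQL developer would write:
#   SELECT region, SUM(kwh) FROM readings GROUP BY region;
# In procedural and functional Python, the grouping is explicit.

from collections import defaultdict
from functools import reduce

readings = [
    ("north", 120.0), ("south", 80.0),
    ("north", 95.0), ("south", 110.0),
]

# Procedural version: an explicit loop with an accumulator.
totals = defaultdict(float)
for region, kwh in readings:
    totals[region] += kwh

# Functional version: reduce folds the rows into the same mapping.
def accumulate(acc, row):
    region, kwh = row
    acc[region] = acc.get(region, 0.0) + kwh
    return acc

functional_totals = reduce(accumulate, readings, {})
assert dict(totals) == functional_totals  # both yield the same result
```

The functional style matters here because Spark's core APIs (map, filter, reduce over distributed collections) follow exactly this pattern rather than SQL's declarative one.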
But while the skills and tools are important, the main value of this strategy comes from the data itself, and Centrica has had to ensure that its data is properly governed and of the right quality.
"We've had to create our own data quality and monitoring teams, processes and tools; some of those tools are things we've innovated ourselves. Our approach to data quality and governance is that it runs through the whole enterprise and has crowd-sourced accountability," Rehal stated.
"There are only a few people in the organisation with data monitoring skills, so we've created our own platform that pretty much holds people accountable for defining data, defining rules about the data and defining how to fix data - it's not just one person," he said.
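Centrica's platform is built in-house and its details aren't public, but the core idea - declaratively defined rules, each with a named owner accountable for defining, checking and fixing the data - might be sketched like this (the rule names, fields and owners are all invented):

```python
# Minimal sketch of crowd-sourced data-quality rules: each rule names
# an accountable owner, a check, and a fix. Everything here is
# invented for illustration; Centrica's actual platform is in-house.

rules = [
    {
        "name": "postcode_present",
        "owner": "billing-team",
        "check": lambda rec: bool(rec.get("postcode")),
        "fix": lambda rec: {**rec, "postcode": "UNKNOWN"},
    },
    {
        "name": "kwh_non_negative",
        "owner": "metering-team",
        "check": lambda rec: rec.get("monthly_kwh", 0) >= 0,
        "fix": lambda rec: {**rec, "monthly_kwh": 0},
    },
]

def apply_rules(record: dict):
    """Run every rule; fix failures and attribute each to its owner."""
    failures = []
    for rule in rules:
        if not rule["check"](record):
            failures.append(f"{rule['name']} (owner: {rule['owner']})")
            record = rule["fix"](record)
    return record, failures

fixed, report = apply_rules({"postcode": "", "monthly_kwh": -5})
print(report)  # both rules fail and are attributed to their owners
```

Attaching an owner to every rule is what makes the accountability "crowd-sourced": no single data-monitoring specialist is the bottleneck, because each team answers for the rules on its own fields.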