Where next for Hadoop? An interview with co-creator Doug Cutting

Cloudera's chief architect talks about what features he'd like to see, the spat with Hortonworks and what Intel's investment has meant

Hadoop grew out of work Doug Cutting and Mike Cafarella began in 2005 to enable an open-source search project called Nutch to run across lots of machines; Cutting took the project to Yahoo, where it was spun out under the Hadoop name. Now chief architect at Hadoop distributor Cloudera, Cutting says his invention is going from strength to strength.

"We're still seeing pretty much 100 per cent growth year-on-year and Cloudera has been doubling in both revenues and customers for quite a time," he says, explaining that the only real factor preventing Hadoop from growing even faster is the lack of available skills and experience, both within Hadoop distributors like Cloudera and out in the wider world.

"We're still in the early stages," he says. "We're growing about as fast as we can responsibly. You need to have a certain number of people with some mastery of it to expand sensibly, and if you are doubling every year then half the people are still in their first year of experience.

"So how long can we're sustain that? I don't know. We're a long way from saturation but for most customers we're still a very small portion of their IT so we have a long way to go before we see that growth slow. It's a good market to be in."

Growth remains strongest in the areas where Hadoop first took off, in web companies like Yahoo. But Cutting says uptake is rapid across all sectors, and that in areas such as financial services, which is probably the second most enthusiastic adopter, best practices are emerging that make it easier for firms to get started.

"In financial services it's risk and fraud. We've seen enough folks go through that so when a bank comes along saying 'we've seen X bank use Hadoop to assess their risk' we can say 'here's how they've done it in general terms, here's the set of tools, and here's how they've put them together'."

Hadoop is already much more versatile and user-friendly than it was in the early days and innovations such as Yarn, Impala and Spark as well as a hardening of the platform's security have all made it more "enterprise ready" too, as Dee Mitra of Centrica British Gas explained recently.

"Where isn't it going to go?"

As more use cases emerge, and as the ecosystem continues to evolve, Hadoop will undoubtedly become a larger part of the IT infrastructure in firms with a lot of data to process. So where will it go next?

"You almost want to ask where isn't it going to go," says Cutting. "There's a lot going on. I'm keen to see better support for transactions, systems where you are updating values in real-time and are still able to perform analytics on the same data."

What does that require?

"It's not really clear. There are a few traditional ways out there, OLTP-type approaches. There is a project from HP [Trafodion] that is being brought to Apache which is an open-source OLTP engine that lives on top of Hbase [The NoSQL database included in Cloudera's Hadoop distribution]. That will satisfy certain applications," Cutting says.

Another area is ad-hoc analytics.

"HBase is not particularly good for analytics as compared to using something like Impala over Parquet files, so we will see systems that give you the sort of query performance that you see from Impala over Parquet at the same time as you do random access incremental updates in the way you can with HBase."

Asked whether some IT people are so bowled over by the number and choice of big data tools that they neglect to think how they will use them, Cutting agrees that this can be the case, but says that as use cases grow this issue will diminish.

"It's in an early stage of maturity so that's not unexpected, but I think over time people are going to think about the functionality you've got in the distribution. You could have a SQL engine for analytics queries. You've got a NoSQL engine for reporting queries," he says.

So are companies like Cloudera, with backing from the likes of Intel (see below) and vast marketing budgets, distracting the market from the bigger picture?

"There is confusion but I think it's mostly because people are new to it and do not have much experience," Cutting says.

He adds: "It is also that the technology is still young. It isn't packaged as well as it could be. We are trying to improve that, make a simple experience. Training is a big part of our business, training operations and professional services, and we see those as enablers to the real long-term business, which is folks consuming the platform and deploying applications on it."

So what's the aspect of Hadoop that people find most difficult?

"I think it's mostly just the novelty. People are used to having a siloed system that can do one thing, so you have to get your problem to conform to that. In Hadoop you need to do more work upfront to decide what it is that you want to do and which tools you are going to deploy to do it... I think the flexibility makes it harder for folks to find their way around."

"If Hortonworks had called ODP something else I'd be far less bothered"

And so to the family feud between the main Hadoop distributors Cloudera, Hortonworks and MapR, each of which has taken its own distinct path to supporting the technology. The Hortonworks distribution comprises only Apache Software Foundation tools, while Cloudera and MapR have proprietary add-ons, such as Cloudera's Impala and Cloudera Manager and MapR's MFS file system.

Recently, Hortonworks launched the Open Data Platform (ODP), along with Pivotal and a number of other collaborators, including IBM.

The ODP defines a "standard" Hadoop kernel, currently based on Apache Hadoop 2.6, comprising the HDFS file system, the MapReduce processing framework, the Yarn resource manager and Ambari for provisioning, monitoring and managing Hadoop clusters. ODP members will ensure that all these elements are upgraded at the same time, in a controlled way, which they say will enable software vendors to certify against a single version of core Hadoop rather than the many versions on the market now.

Speaking to Computing last month, Hortonworks president Herb Cunitz insisted that the aim of the ODP is standardisation rather than freezing out his firm's rivals, and insisted that they had been invited to join the project. But those rivals see things rather differently.

"It's a vendor coalition," says Cutting. "That's fine. The question is what problem are they attacking? I have yet to see it articulated in the way that I find compelling."

Could the ODP be designed to head off a problem that might emerge in the future, to prevent Hadoop from fracturing into many incompatible versions?

"Maybe, but it feels more like they're trying to fork it. The Apache Software Foundation is a place where we collaborate very successfully and have a decision-making process for us to all work together, so we certainly don't see that. We don't see a future need for that."

Cutting continues: "We don't see fragmentation. Supposedly the problem is that releases of the different distributions are creating problems for people building applications. We don't hear that. We all work together in Apache on releases and we all make sure that changes are done back-compatibly, so that if people aren't perfectly synchronised then you can still have applications that run across all the distributions without any trouble."

Cutting also takes issue with the name "Open Data Platform" and what it implies.

"If they'd called it something else I'd be far less bothered," he says. "They claim that it is the standard when they're not incorporating the majority of the players. They called themselves open. We were invited to join but under terms that it was clear we would never accept. For example, it binds you to using Ambari [the Apache alternative to Cloudera Manager]. We are not going to use Ambari. There were other aspects of the membership proposal that were completely unappealing."

Cutting continues: "Mostly I think this mantle of open and standard is deceptive. It is neither open in that everybody's really invited on equal terms to play, nor is it a standard. It's a minority of people out there."

So why would IBM support it?

"I've no idea. IBM is a complicated organisation. It's hard to guess whether this one part of IBM wanted to and others didn't. It's a many-headed beast."

Asked whether the tussle over the ODP had anything to do with his stepping down from the board of the Apache Foundation in March (he relinquished his chairmanship a year earlier) Cutting says that no, the timing was a coincidence.

"I had been on the board the six years, three of those as chair, and I felt I should turn over the reins to other folks.... I think it would be healthier if more the old-timers stepped aside and let new blood on to the board."

What is not a coincidence, however, is Cloudera's increased support for Apache.

"At the same time that the ODP was announced, in fact the very same day, Cloudera announced that it was increasing its sponsorship of Apache. That was not intended to be a coincidence. So we double down on Apache as the place that we see is the right one for collaboration, and not the ODP."

"The Intel deal helps us remain an independent company"

This time last year Intel bought an 18 per cent stake in Cloudera for an unprecedented (for a big data firm) $740m. What difference has this made?

"On the engineering front we've been able to collaborate with them on a lot of security features, getting encryption into HDFS so that all the data can be encrypted at rest, a feature that a lot of companies require for PCI compliance," Cutting says, adding that Intel had also helped Cloudera expand into areas such as China where the chip-maker already had users of its own (now defunct) Hadoop distribution.

"It also helps us to remain an independent company, to not be acquired. The terms of the deal are structured so we become harder to acquire."

Do those terms also restrict Cloudera from certifying other chip-makers' platforms?

"If there's a particular feature that we optimise then we have to do it at the same time or later than with Intel. But in practice Intel has 90-per-cent-plus of the data centre processors so they are not interested primarily in growing market share. They want this technology to work well on their hardware. That's the way that data centres will grow and that's the way Intel will grow."

"You'd be surprised how much is still happening on premises"

The big cloud providers such as Amazon and Microsoft are betting that those data centres will predominantly be theirs, but Cutting says that most Hadoop implementations are currently hosted by the users themselves.

"You'd be surprised how much is still happening on premises. It's historically been on premises and it remains on premises. We've always had some in the cloud and it's a growing portion, but it doesn't look like its going to become the majority for the next couple of years."

He adds: "For people doing large deployments on clusters the cloud isn't always economical and if you've got other on-premises systems that need to interact going back and forth between the cloud and on premises is going to be slow and expensive. So most folks are still deploying on premises.

"We're pretty much agnostic as to on-premises versus cloud, but so far the majority remain on premises."

"My favourite application? That's like asking me to pick my favourite child"

So what's the question Cutting least likes answering about Hadoop?

"That would be what my favourite application is. It's like asking what my favourite movie is. I like a lot of films, I don't like to pick favourites. Or it's like picking your favourite child: I like them all in different ways."