Interview with Kirk Dunn of big data firm Cloudera
Cloudera COO explains why the age of big data requires a new way of thinking
In a speech to the Royal Society on 9 November, Chancellor George Osborne insisted that big data analysis will be one of the cornerstones of the UK economy in the years to come.
Claiming that the UK is the world leader in the collection, tabulation and provision of data, maintaining extensive datasets in areas such as healthcare, demographics, environmental change and food, the Chancellor said: "The next generation of scientific discovery will be data-driven discovery, as previously unrecognised patterns are discovered by analysing massive data sets."
"Business will invest more as they see [the government] invest more in computational infrastructure to capture and analyse data flows released by the open data revolution."
One company that will have noted these words with interest is Californian software firm Cloudera, which has been in discussions recently with the UK government over how best to process these massive data sets.
Cloudera's CDH, now on version 4, is the firm's open-source Apache Hadoop distribution. It includes components that make Hadoop more user- and business-friendly, such as the column-oriented distributed database HBase and Hive, which allows SQL-type queries, and other elements that provide workflow, security and integration with other systems.
As well as being free to download from the web, CDH is bundled into Oracle's high-end Big Data Appliance (BDA), demonstrating Cloudera's aim to cover all the bases when it comes to potential customers. To this end, too, the firm is directing much effort into its training and certification programme.
While some might view Hadoop as experimental and thus something of a risk, COO Kirk Dunn insists that the barriers to entry are much lower than many assume, and that anyone with SQL and basic Java skills could - and should - make a start with CDH.
"We're a next-generation data management platform," says Dunn. "CDH is very lightweight and inexpensive to start, yet the return is customer intimacy."
Dunn believes that if technical people start thinking in terms of data rather than IT, the business case for giving Hadoop a try becomes much easier to make.
"There's not a company on this Earth that wouldn't want the sort of relationship with their customers that Facebook has. Where are you going to learn more about your customers? From how you think about IT or how you think about data? The answer is obvious," he says.
"If you're unwilling to try new things, then innovators in your field are going to run past you."
Mix and match
The promise held out by MapReduce technologies like Hadoop is that by treating all data as organic and tipping it wholesale into a big mixing pot, relationships will emerge that would never be arrived at were the data to be artificially forced into a fixed schema.
Traditional databases use an approach called schema-on-write: a schema must be created before any data can be loaded. Hadoop, on the other hand, is schema-on-read. Data is simply dropped into the file store, and columns are created by a probabilistic interpretation of queries on the data.
Moreover, the more data you have and the more it is queried, the better it works. However, Dunn admits that explaining this can be difficult.
"You could see the quizzical look on his face," Dunn says of one customer, an experienced data warehouse operative.
"He said 'When I make a schema I can call a column PostalCode. Then when I drop the data in, I know that what is in that column will be the postal code. You have no columns. How do you know what the postal code is?' Great question.
"It's the statistical nature of data," Dunn explains. "MapReduce works out what the relationships between the data are. The system will work out [by its position in relation to other characters] which set of characters is the postal code. After enough swings at the plate, enough iterations, it just becomes obvious. That's why you don't have to create the categorisation. The data will do it for you."
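To make the postal code example concrete, here is a minimal sketch in Python rather than on a real Hadoop cluster. It is not Cloudera's or Hadoop's actual mechanism: the sample records, the function names and the simplified UK postcode pattern are all assumptions made for illustration. The raw lines are stored with no declared columns; a map step flags every field that has the shape of a postcode, a reduce step counts the flags per field position, and the position with the most hits is treated as the postal code at read time.

```python
import re
from collections import Counter

# Hypothetical raw records, stored exactly as they arrived:
# no header row and no declared columns.
RAW_RECORDS = [
    "Alice Smith,12 High Street,London,SW1A 1AA",
    "Bob Jones,4 Mill Lane,Leeds,LS1 4AP",
    "Carol White,M1 Garage,Manchester,M1 1AE",
    "Dan Brown,99 Park Road,Bristol,BS1 5TR",
]

# Simplified shape of a UK postcode (an assumption, not a full validator).
POSTCODE_SHAPE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? \d[A-Z]{2}$")


def map_record(line):
    """Map step: emit (field_position, 1) for each field shaped like a postcode."""
    for position, field in enumerate(line.split(",")):
        if POSTCODE_SHAPE.match(field.strip()):
            yield position, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per field position."""
    totals = Counter()
    for position, count in pairs:
        totals[position] += count
    return totals


def infer_postcode_position(lines):
    """Return the field position that most often looks like a postcode, or None."""
    pairs = (pair for line in lines for pair in map_record(line))
    totals = reduce_counts(pairs)
    if not totals:
        return None
    return totals.most_common(1)[0][0]


def read_postcodes(lines):
    """Apply the inferred 'column' only when the data is read."""
    position = infer_postcode_position(lines)
    if position is None:
        return []
    values = []
    for line in lines:
        fields = line.split(",")
        if position < len(fields):  # skip records too short to hold that field
            values.append(fields[position].strip())
    return values


if __name__ == "__main__":
    print(infer_postcode_position(RAW_RECORDS))
    print(read_postcodes(RAW_RECORDS))
```

With only four records the inference is trivial, but the design point matches Dunn's: nothing about the stored data changes, and the more records are counted, the more confidently one position stands out.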
In the "old data world", as Dunn labels traditional analytics, you had to know the answer you were looking for before you asked the question. In the new data world, you want the data to show you what questions to ask.
"The best example is the Google spellchecker. Google knows all the permutations, all the ways you can mis-spell a word. A word mis-spelled in a certain sentence could have one of many meanings. But Google knows which one you actually mean, because they've seen the word in the context of the sentence millions of times, not just in the context of the mis-spelling. They haven't simply categorised the word; they've allowed the word to be categorised in the context of the sentence. Allowing the experience to form the question."
Cloud roots
The name Cloudera gives a clue to the company's origins. The original idea was to deploy Hadoop in the cloud. Indeed, Dunn sees opportunities in forging further partnerships with public cloud providers like Rackspace, running CDH as a cloud application that customers can use to process data held in that environment.
However, Dunn concedes that analytics in the cloud has not taken off as quickly as predicted. For now, most analytics is performed in a datacentre, which is why Cloudera has forged partnerships with the likes of Teradata, Netezza and Oracle. There, Dunn says, the firm's solution adds an extra string to these vendors' bows, rather than competing with their data warehousing and BI offerings.
"Hadoop was created to deal cost-effectively with a lot of randomly generated data at a high volume and variety. This was a new problem requiring a new technology. Cloudera will be a billion-dollar company before we threaten the traditional data warehouse business because it's not about trying to take a slice of an existing pie, but to make the pie dramatically bigger."
As the Chancellor pointed out, this pie is turning out to be a very large one indeed.