The IoT will drive the next phase of Hadoop's growth, says co-creator Doug Cutting
Hadoop's co-creator explains Hadoop's role in a more connected future
Hadoop's co-creator, Doug Cutting, believes that the Internet of Things (IoT) and cloud computing will form the basis of the next phase of growth for the 10-year-old big data platform, which so far has seen most uptake in the finance, internet and telecoms sectors. Indeed, he pointed out, it's already happening.
"Caterpillar collects data from all of its machines," he said. "Tesla is able to gather more information than anyone else in the self-driving business. They're collecting information on actual road conditions, because they have cars sending all the data back. And Airbus is loading all their sensor data from planes into Hadoop, to understand and optimise their processes."
While there has been much focus on driverless vehicles, a quiet revolution has been going on in all cars, he pointed out.
"Almost every car these days has a cellular modem in it. I recently heard that more than 50 per cent of new cellular devices are not phones but other things that are connected."
Cutting, who is chief architect at Cloudera, outlined the most important elements in the Hadoop ecosystem for the IoT.
"Lots of components are very relevant, things like Flume and Kafka, helping events flow in and streaming with Spark," he said, also name-checking Apache Kudu, a columnar storage layer that Cloudera has recently incorporated into its distribution.
"What Kudu lets you do is update things in real-time. It's possible to do these things using HDFS but it's much more convenient to use Kudu if you're trying to model the current state of the world."
Cutting has said previously that he is not religious about Hadoop, and that "if people stop using MapReduce and HDFS we'll let them disappear."
With its ability to store and process huge amounts of data cheaply, Hadoop is often deployed as a "data lake", a central repository into which disparate data from multiple sources can be dumped before being processed and queried. But in IoT use cases there is a need to exchange data rapidly between the central store and edge devices, such as connected cars, which is where the newer tools come into their own.
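The difference between dumping events into an append-only store and keeping an updatable picture of "the current state of the world" can be sketched in a few lines of Python. This is an illustration only: it uses no Kudu, HDFS or Kafka API, and the `Reading` tuple, the `UpsertTable` class and the sample vehicle IDs are all hypothetical names invented for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical sensor reading: (vehicle_id, timestamp, speed_kph)
Reading = Tuple[str, int, float]


def current_state_from_log(log: List[Reading]) -> Dict[str, Reading]:
    """Append-only style: scan the whole log, keep the newest reading per vehicle."""
    state: Dict[str, Reading] = {}
    for reading in log:
        vehicle_id, timestamp, _ = reading
        if vehicle_id not in state or timestamp >= state[vehicle_id][1]:
            state[vehicle_id] = reading
    return state


class UpsertTable:
    """Updatable style: one row per vehicle, overwritten in place as readings arrive."""

    def __init__(self) -> None:
        self.rows: Dict[str, Reading] = {}

    def upsert(self, reading: Reading) -> None:
        vehicle_id, timestamp, _ = reading
        existing = self.rows.get(vehicle_id)
        # Insert a new row, or replace the old one if this reading is at least as recent.
        if existing is None or timestamp >= existing[1]:
            self.rows[vehicle_id] = reading


# Made-up sample data for illustration.
readings: List[Reading] = [
    ("car-1", 100, 62.0),
    ("car-2", 101, 48.5),
    ("car-1", 105, 57.5),
]

log: List[Reading] = []
table = UpsertTable()
for r in readings:
    log.append(r)    # data-lake style: every event is kept
    table.upsert(r)  # current-state style: only the latest row per key is kept
```

Both approaches arrive at the same answer, but the log has to be rescanned to get it, whereas the keyed table can be read directly. That is the convenience Cutting describes, in miniature.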
"From Cloudera's perspective we don't want to get in a turf war in defending Hadoop against other projects, rather we are interested in finding the suite of technologies that serve our customers," he said. "Hadoop's performance fulfils some valuable roles there but as new things come along we are going to aggressively adopt the new projects. Kudu gives some valuable new functionality, as do Kafka and Spark."
Rival Hadoop vendors Hortonworks and MapR also see a huge potential market in the IoT, and both have cited the connected car in recent interviews and promotional literature. Hortonworks has combined the Apache NiFi data flow system with Kafka and Storm to create Hortonworks DataFlow (HDF), a platform for collecting, transporting and analysing data from a multitude of sources.
Meanwhile, MapR Streams is a publish-and-subscribe event framework that is integrated into the MapR Converged Data Platform and designed to replicate event data across disparate clusters.
While the majority of Hadoop deployments are still on-premises, cloud deployments are growing twice as fast.
"We are spending a lot of time on making our offerings work well in the cloud," Cutting said. "We're trying to provide really powerful high-level tools to make the lives of those delivering this tech a lot easier."