What's new in Hadoop?

The distributions are diverging, as the latest additions make clear

Over the past 10 years Hadoop has grown from a distributed file system plus MapReduce to an ecosystem of around 30 components, most, but not all, open-source Apache Software Foundation (ASF) projects. These are mixed and matched by the main Hadoop distributors Hortonworks (which is pure Apache open source), MapR and Cloudera (which have some proprietary elements) to meet specific needs and also to differentiate themselves from their rivals.

As Hadoop has become more widely deployed as a general-purpose big data storage and analytics platform, ease of use, security and data governance have come to the fore as priorities. During the recent Hadoop Summit in Dublin, organised by Hortonworks, Computing chatted to all three distributors about their approach to these issues and the latest developments, and also to some third parties and analysts to find out what these developments say about the future of Hadoop.

Security and compliance

As Hadoop is increasingly adopted for mainstream tasks by different departments each with its own data and applications, security and audit capabilities have risen steadily up the priorities list. It is important to know where data originated, how it has been processed and who has access to it.

Last year, Hortonworks did a lot of work integrating Apache Ranger into its ecosystem, meaning that Spark, Hive and HBase are now all protected within the same framework. This year it has merged Apache Atlas with Ranger to add data governance capabilities, enabling data on Hadoop to be tracked and audited and providing role-based or location-based access controls.

"Atlas is really super," said Ciaran Dynes, VP products at data integration company Talend. "It has profile tags built in and that allows you to do geo-encoding to say here's a piece of data that can only be read in this jurisdiction, or you can say only this job role can access it."

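As a rough illustration of the tag-based controls described above, the following Python sketch checks a reader's role and jurisdiction against tags attached to a dataset. It is not the Atlas or Ranger API; the DataAsset, User and can_read names, and the sample tags, are hypothetical and exist only to show the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataAsset:
    """A dataset plus the governance tags attached to it (hypothetical model)."""
    name: str
    allowed_jurisdictions: frozenset = frozenset()  # empty set means unrestricted
    allowed_roles: frozenset = frozenset()          # empty set means unrestricted


@dataclass(frozen=True)
class User:
    """A reader described by job role and the jurisdiction they work in."""
    name: str
    role: str
    jurisdiction: str


def can_read(user: User, asset: DataAsset) -> bool:
    """Return True only if every tag restriction on the asset is satisfied."""
    if asset.allowed_jurisdictions and user.jurisdiction not in asset.allowed_jurisdictions:
        return False
    if asset.allowed_roles and user.role not in asset.allowed_roles:
        return False
    return True


# Sample data, invented for illustration only.
customers_eu = DataAsset(
    "customers_eu",
    allowed_jurisdictions=frozenset({"IE", "DE"}),
    allowed_roles=frozenset({"analyst"}),
)
public_stats = DataAsset("public_stats")  # no tags, so anyone may read it

alice = User("alice", role="analyst", jurisdiction="IE")
bob = User("bob", role="analyst", jurisdiction="US")
carol = User("carol", role="engineer", jurisdiction="DE")

if __name__ == "__main__":
    for person in (alice, bob, carol):
        print(person.name, can_read(person, customers_eu), can_read(person, public_stats))
```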
Beefing up security, this time in a predictive sense, was behind an announcement Hortonworks made at the summit, where it introduced Metron, a framework for analysing data as it is ingested.

"A lot of threat detection is retrospective," explained Andy Leaver, VP international operations at Hortonworks. "Metron is all about predicting threats and failures and shutting things down before it happens."

"Metron is the first of Hortonworks' turnkey solutions," said Mike Merritt-Holmes of Big Data Partnership. "They are starting to roll out solutions that were built in tandem with partners and customers. They've built a whole pipeline so you can ingest data and apply machine learning models to identify potential security issues. It will be interesting to see how it is taken up as it will drive their ambitions to create other turnkey solutions."

Cloudera's Metron equivalent is Open Network Insight (ONI), a project to which Intel is a major committer. VP of professional services Alex Bartfeld told Computing that ONI is "a full stack security analytics solution more specifically focused on security". He pointed to "a growing portfolio of cyber security ISVs and GSIs" that are adopting ONI, including Accenture, Cloudwick and Securonix.

For governance duties Cloudera's CDH distribution turns to Apache Sentry, which graduated from incubation in 2016 and, like Atlas, provides role-based access control to data and metadata. RecordService performs a similar function with Apache Spark and Cloudera Impala. Cloudera Navigator, a proprietary tool that dates back to 2013, takes care of the data audit and lineage tasks.

All the main vendors are moving toward a platform approach (more about that later), with MapR arguably the earliest proponent of this strategy. Jack Norris, SVP data and applications, said this is also behind the firm's approach to security.

"From a platform perspective, MapR houses permissions where they belong, with the data, not separately," he said. "We recommend using edge (client) nodes and firewalls. This approach offers secure access to clusters; we've never had a customer question this approach."

He criticised Knox, the perimeter security gateway used by Hortonworks, as "a bolt-on solution, which introduces another level of (unnecessary) complexity and yet another component to manage".

Where Hortonworks has chosen Metron and Cloudera has ONI, MapR uses Quick Start for real-time security logging. Again, Norris chose to emphasise MapR's converged platform approach, saying that it "eliminates the processing delays and supports real-time, low latency applications" such as Quick Start.

Commenting on the different approaches, Matt Aslett, research director, data management and analytics at 451 Research, said: "I expect security information and event management (SIEM) to be a focus area for all Hadoop-related vendors going forward and it is not an area where they seem prepared to work together."

Data in motion

Since purchasing its creator Onyara, Hortonworks has taken the Apache NiFi data flow system and merged it with the messaging system Kafka and the streaming data processing engine Storm to create HDF (Hortonworks Dataflow), a platform for collecting, transporting and analysing data in motion from a multitude of sources.

"It's a rich visual interface to allow non-data scientists to easily take data from multiple sources and land it in the data lake," said Leaver, who mentioned its potential in the context of a smart city where there are potentially millions of edge devices producing data. "It's about having a two-way conversation from the data lake back out to those edge devices."

The delegates we spoke to generally saw this as a positive step that would have immediate benefits. Some said it could threaten the business model of third-party integrators like Talend. However, Dynes did not seem overly worried.

"It's a very heterogeneous world. We integrate with databases, we integrate with Hadoop, we integrate with data warehouses, and cloud. All these things come with metadata and services. If you're all-in on Hadoop and you're married to Hortonworks or Cloudera then good luck, but most organisations aren't like that."

Meanwhile, Cloudera has initiated the Apache Kudu project, a storage system for tables of structured data that will sit somewhere between HDFS and HBase, according to Bartfeld, who also mentioned the IoT. "It's a new data layer that we think will run the important applications of the future like the connected car, where you have high velocity data coming in and you want to analyse it in real time."

MapR's technical evangelist, Tug Grall, spoke about MapR Streams, his company's data streaming technology, also in the context of connected cars. "You can have a small MapR cluster in a car, and you can automatically push data from your HQ to all the cars using MapR Streams," he said.

"MapR provides all of the capabilities of NiFi, Spark Streaming and Kafka, but has converged them into a single platform," said Norris. "It allows organisations to develop more comprehensive applications that can respond in real time because of that convergence."

Norris and Bartfeld both emphasised the options for customers to select other data ingestion options, either from third-party ISVs or from within the Apache stable, with tools such as Flume, Spark Streaming and Sqoop.

"We see no overlap with Kudu, Spark Streaming and Kafka and third-party ISVs in the ETL or ELT space, they are all certified partners that build significant value on top of CDH," said Bartfeld.

Speaking of what all this means for ISVs, Aslett said: "Cloudera's focus for real-time ingestion seems to be based on Spark Streaming and Kafka. Talend, Pentaho and others all offer additional value in terms of ingestion and integration expertise and managing pipelines and will continue to partner with Hortonworks, Cloudera and MapR. However, increased differentiation by these vendors will arguably make life harder for ISVs."

Visualisation

Hortonworks promoted Apache Zeppelin as its visualisation tool of choice, helping business users understand the results of queries and also making life easier for analysts. It is an interactive web-based interface that can act as a front end to Spark and Hive and can run programs and queries in many different languages.

"You could have a Bash script followed by a Python script and then a Hive query. It allows you to look at each step and see the results, so when you are prototyping, or if you're a data scientist, you can go through a pipeline and reiterate at each step, which allows you to quickly get to the end goal," explained Merritt-Holmes.

"Zeppelin is not a GUI for Spark; it is a data scientist notebook," Norris said. "MapR customers today are using things like Hue (iPython) and Jupyter to provide general-purpose notebook support. We don't see a reason to dictate a single approach for the broad data science community that comes from different backgrounds and prefers different tools."

Cloudera's Bartfeld gave a very similar response.

"It is an open choice for the consumer to choose what notebook they would prefer to use, Jupyter, Zeppelin or Sense.io, among others. The problem is more than just shipping a GUI for Spark, or even a notebook tool as they all have their limitations, for example Zeppelin only works with Spark and Hive."

The acquisition of Sense.io (another data scientist notebook) in March by Cloudera is a sign that despite Bartfeld's protestations, the company is keen to carve out its own visualisation niche too. As for MapR, 451 Research's Aslett believes it may eventually throw its weight behind Zeppelin too.

"Cloudera recently acquired Sense.io, which is aimed at data scientists, and also offers the Ibis project. It is also a big supporter of Hue, which offers a notebook UI for Spark. While MapR does not include Zeppelin in CDP they have promoted it via blogs so it may just be a matter of time."

Cloud

The name Cloudera reflects the founders' original intention to offer Hadoop in the cloud. "Back in 2008 it wasn't really happening," said Bartfeld. "But now it is. It's still only about 10 or 20 per cent of customers doing it that way but it's a number that's growing twice as fast as on premise."

Indeed, everyone we spoke to at Hadoop distributors and ISVs said that cloud is really driving change. All the main Hadoop vendors have deals with the big cloud providers, Amazon, Microsoft and Google, which offer Hadoop as a service.

Cloudera Director has been available since 2014 - a long time in the fast-moving world of Hadoop - and is that company's tool for deploying CDH in the cloud. It integrates with Cloudera Manager for managing and monitoring production workloads.

A newer kid on the block for deploying Hadoop to the cloud is Hortonworks Cloudbreak, the result of another acquisition, this time of SequenceIQ. Ambari, a part of Hortonworks' core platform, is used to manage cloud-based clusters.

So what about MapR?

"MapR Control System certainly has some capabilities for managing cloud deployments. I haven't seen them making a lot of noise about this however," said Aslett.

"The MapR installer provides a generic deployment mechanism for both on-premise and any cloud provider," offered Norris. "Beyond that, MapR works with the major cloud providers like Amazon and Microsoft to offer cloud-optimised deployment scripts that are integrated into their cloud marketplaces."

Deploying Hadoop clusters in the cloud allows for easier integration with other services running on the same cloud platform, a point made by Talend's Dynes.

"If Hadoop is to become the dominant force that we want it to be it really needs to become a marketplace, a platform for value added services to be added on top."

The distributions diverge

What do the recent changes in Hadoop mean for the divergence of the distributions? It has been argued that there is not enough room for three Hadoop vendors working in broadly the same space.

Earlier Hadoop efforts by Intel, WANdisco and Teradata all fell by the wayside and Pivotal's distribution has now standardised on Hortonworks HDP, that vendor having just certified HAWQ, Pivotal's enterprise SQL-on-Hadoop analytics engine, which is now an Apache project.

That relationship brought about the Open Data Platform (ODP), a controversial project launched last year by Pivotal, Hortonworks, IBM and others to standardise on certain core Hadoop elements - not all of which are shared by Cloudera and MapR. Cloudera's Doug Cutting said the ODP felt like an attempt to fork Hadoop.

However, both Cloudera and MapR have proprietary elements and it seems all three distributors are attempting to block out the competition by adopting a platform approach, making each distribution progressively less compatible with its rivals - although all continue to pledge allegiance to the Apache Software Foundation.

The Hadoop ecosystem contains a large number of elements. 451 Research's Aslett has looked at the three main independent Hadoop vendors (and also at IBM's) to see how many elements they have in common.

Hortonworks HDP has 11 Apache projects that are not found in Cloudera's CDH, which in turn has six projects that are not included in HDP. While HDP is pure Apache open-source, Cloudera has some proprietary elements such as Director, Manager and Impala, and MapR has always deployed its own proprietary file system MapRFS.

"Overall, while there remains a core group of foundational Hadoop projects that are offered by multiple distributions, the major Hadoop distributions are increasingly differentiated in terms of the Hadoop projects included, as well as additional functionality offered on top or alongside," Aslett said.

"Sometimes this means that one vendor is simply earlier to adopt a specific project than the others, but in some areas - particularly security and management - it is because they are backing competing approaches."

With the distributors consolidating around their own core platforms (MapR's latest is called the Converged Data Platform), enterprises need to be aware of what projects are supported by which vendors (and which aren't) and to plan accordingly.

"It will become increasingly difficult for users to move from one to the other. It also means that ISVs are likely to have to work harder to certify their products with multiple distributions to ensure they are compatible with the diverging functionality," Aslett said.

"Cloudera and MapR in particular will continue to offer proprietary complementary products/services and Hortonworks will continue to work on partnerships with proprietary vendors."

Hadoop, which has become the de facto big data platform of choice in part because of its ability to break down data silos, may be splitting into new silos of its own.