Hadoop exemplifies the promises and the pitfalls of open source - here's why
Fast-moving and flexible, Hadoop has emerged to meet the data processing challenges of today, but it's not without challenges of its own
Visitors to the Hadoop Summit in Dublin last week were confronted with a bewildering array of unfamiliar "animals" - some new, some re-homed, others acquired, with still more lurking in the hothouse incubators.
The "animals in the zoo" metaphor is a favourite of pundits describing the Hadoop ecosystem and it's one that's happily taken on board by developers who almost all seem to favour animal logos. (Incidentally, the Flink squirrel seemed almost as ubiquitous as the Hadoop elephant this year.)
It may have started out as a file system (HDFS) plus MapReduce, but the Hadoop zoo has grown enormously, and the sheer number of species and subspecies on show can be overwhelming. Open source lends itself to cross-fertilisation, and software equivalents of the liger (lion-tiger cross) and the zedonk (zebra-donkey) abound. To stretch the metaphor a bit further, the hybrid vigour of the cross-breed let loose in an open, unfenced environment allows successful open source products to spread very quickly across the landscape, certainly much faster than proprietary equivalents.
A good example is Apache Spark, which sadly doesn't have an animal logo but which in two short years has become such a dominant force in the Hadoop ecosystem (and beyond it) that third-party vendors like Talend and IBM have rebuilt their integration and ETL [extract, transform and load] suites on top of it. But so fast are things moving that even Spark is now described as old hat by the hipsters in the Flink T-shirts.
Summit organiser Hortonworks is a case in point, becoming the first technology start-up to hit revenues of $100m within four years of founding - which isn't to say it's actually making any money. No one said that would be easy, and it's not. Without the luxury of proprietary IP to fall back on, the capacity to raise awareness is one of the few real differentiators available to the rival distributions as they battle for supremacy, and that means splashing the cash.
"We have to spend a hell of a lot on marketing," sighed Tim Hall, VP product management. Indeed, Hortonworks returned to investors for a further tranche of funding earlier this year in part to keep the publicity pump primed.
The same applies to privately held competitors Cloudera and MapR, even though those distributions contain a handful of proprietary elements. The bottom line is that it's hard to have a healthy bottom line, profits-wise, when you're open source. Nevertheless, Hortonworks president Herb Cunitz insists his company is on track to become profitable.
"We'll be cashflow positive by the end of 2016," he promised, adding: "Open source always does well in a down market".
The global economy may be in the doldrums, but the use cases for the 11-year-old invention of Doug Cutting and Mike Cafarella have never been more numerous, given the general drift towards cloud and digital business and the billions of new devices poised to come online with the nascent Internet of Things (IoT). Openness is a strength because it fosters cross-compatibility, which in turn helps to break down data silos. Data needs to be able to flow to where it's needed, and bottlenecks must be removed.
Suck it up
There was much talk about the importance of simplifying the process of data ingestion, i.e. the sucking of data from data warehouses or other silos into the Hadoop "data lake". Inevitably, there are almost as many open source tools for doing this as there are fish in the sea or birds in the air. Hortonworks' acquisition of Onyara and subsequent rebadging of Apache NiFi (a tool originally open-sourced by the NSA) as Hortonworks DataFlow bumps that number up by one, and it's an addition that most people we spoke to thought would be significant.
Why did the NSA open source that tool in the first place? No one we asked was quite sure, but generally it's about cost savings and utility versus ownership, explained Mike Merritt-Holmes of Big Data Partners.
"It's often about cutting costs, but also if you want to speed up the development you'll open source it. Do I care if other people can use it? Am I making money from it? If the answer to those two things is no, then get it out there," he said.
Open but ugly
Common to the newer data flow tools (MapR's is called Streams, while Cloudera's is Kudu) is a desire to simplify operations so that non-technical staff can use them and business users can understand the output. Historically, user interfaces have been a weak point in much open source software - look how slick an Apple or Windows desktop is compared with Linux, even now. There are exceptions of course, but open source enterprise software frequently ships with default GUIs that are poorly designed and unintuitive (or even non-existent). This is something that Hadoop vendors and the open source community in general are now starting to address through tools such as Apache Zeppelin, a visualisation notebook that can sit on top of a variety of back ends.
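For a flavour of what that looks like, here is a hypothetical Zeppelin note paragraph, assuming a Zeppelin build recent enough to pre-bind the Spark session as `spark` and the Zeppelin context as `z` (the dataset path is a placeholder). The point is that `z.show` hands the result straight to Zeppelin's built-in table and chart widgets, so no separate front end is needed.

```scala
%spark
// Aggregate a landed dataset (placeholder path) by day
val orders = spark.read.parquet("hdfs:///lake/raw/sales/orders")
val byDay = orders.groupBy("order_date").count().orderBy("order_date")

// z is Zeppelin's context object; z.show renders the DataFrame with the
// notebook's interactive table/bar/line chart widgets
z.show(byDay)
```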
Security and governance were other recurring themes at the Hadoop Summit. In classic open source fashion, Hadoop was co-opted from its original task within Yahoo's search machinery, where data security and access control were not core necessities, to become a general-purpose big data framework in the enterprise - where they absolutely are.
Hortonworks has beefed up its security credentials by integrating the Apache Ranger security system with other elements in the ecosystem, and recently paired it with the Apache Atlas governance software. At competitor MapR's booth, technical evangelist Tug Grall claimed that security and governance are baked into his firm's integrated Converged Data Platform: "On the file system, the streams and the tables, you can say who is able to do what - security is not a product, it's a feature".
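To give a flavour of the Ranger side of that, the sketch below creates a simple HDFS access policy through Ranger's public REST API (a POST to /service/public/v2/api/policy on the admin server, which listens on port 6080 by default). The host, credentials, service name, path and user are all placeholders, and error handling is omitted; treat it as an illustration of the policy model rather than production code.

```scala
// Hypothetical sketch: grant a user read access to a lake directory via
// Apache Ranger's public REST API. All names below are placeholders.
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Base64

object RangerPolicySketch {
  def main(args: Array[String]): Unit = {
    // A minimal HDFS policy: user "analyst" may read /lake/raw recursively
    val policy =
      """{
        |  "service": "cluster_hadoop",
        |  "name": "lake-raw-read-only",
        |  "resources": { "path": { "values": ["/lake/raw"], "isRecursive": true } },
        |  "policyItems": [ {
        |    "users": ["analyst"],
        |    "accesses": [ { "type": "read", "isAllowed": true } ]
        |  } ]
        |}""".stripMargin

    // POST the policy to the Ranger admin server (placeholder host/credentials)
    val url = new URL("http://ranger.example.com:6080/service/public/v2/api/policy")
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    val auth = Base64.getEncoder.encodeToString("admin:admin".getBytes(UTF_8))
    conn.setRequestProperty("Authorization", s"Basic $auth")
    conn.setDoOutput(true)
    conn.getOutputStream.write(policy.getBytes(UTF_8))
    println(s"Ranger responded with HTTP ${conn.getResponseCode}")
    conn.disconnect()
  }
}
```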
Meanwhile, over at the Cloudera stand, VP services EMEA Alex Bartfeld was keen to talk about Open Network Insight, a new collaboration with Intel that uses machine learning to analyse network traffic for suspicious behaviour. "We've always worked with government, telco and financial services, who are the biggest consumers of cyber security tools in place," he said, pushing Cloudera's security credentials.
Away from Hadoop, a series of recent bugs in critical open source software has shaken open sourcers out of their complacent reverie. In the aftermath of the discovery of the Heartbleed vulnerability, it emerged that OpenSSL was maintained largely by just two developers. Belatedly, the Linux Foundation (which is supported by most of the largest technology vendors) has decided to act, creating the Core Infrastructure Initiative to identify and fix problems with critical applications.
Everything changes
The Hadoop zoo, with its ever-evolving menagerie of somewhat scruffy and oddly named animals, epitomises open source in many ways, as mentioned above. Some components are stable and mature; others are barely out of the incubator. The large number of developers and the relative newness of projects such as Spark mean that they are changing very quickly. Indeed, Hortonworks is now opting for a two-stage release schedule, shortening the cycle for its processing and analytics components while retaining longer timescales for its core system.
Understanding of the benefits of open source software among businesses and the public sector has come on in leaps and bounds in recent years, and as a consequence every major vendor (even Microsoft) now sings its praises. It can no longer be ignored. Used intelligently, it can save money and increase agility.
Open source is everywhere. It is right at the cutting edge of innovation and it is also the basis of the new standard platforms, such as HDFS and OpenStack, where stability and predictability are key. It's hard to imagine the emergence of a successful new closed source programming language or proprietary operating system these days, and in many businesses an "open-source-first" strategy is becoming the norm, with in-house developers even spending some of their work time contributing to open source projects.
But understanding its benefits does not equate to understanding the open source model, with its confusing array of options. Free does not mean easy, and "many eyeballs" do not necessarily mean reliable or secure. Having a huge number of alternatives to choose from does not make the job any easier, and nor is there any guarantee that they are any good - which is where the distributions help by picking winners, though possibly at the expense of creating new silos of their own as they diverge.
We'll be covering the new and emerging tools in the Hadoop ecosystem in the near future.