How to hack Hadoop (and how to prevent others doing it to you)
All you need is Zenmap and Terminal on a default distro
Hacking Hadoop is a surprisingly simple process - possible with freely downloadable software - because the open source data analytics framework tends to ship with no security features enabled.
As Teradata and MapR demonstrated at the Teradata Partners 2015 conference in Anaheim, California today, hacking Hadoop can be achieved in mere minutes - a salutary warning to IT managers to add essential security mechanisms to their distributions.
"How well does Hadoop security work out of the box?" asked Teradata senior security architect Chris McVey.
"Appropriate users [are meant to have] access only to data that belongs to them. But in actuality, we found that any user can have access to any of that data."
McVey explained that between 2006 and 2009, before Hadoop had officially launched, any user with their own data cluster could impersonate another by "trivially bypassing" security, adding that since then little has changed.
"From 2010 and on, up to the present day, by default most distros across the board are shipped with no security enabled," said McVey.
"Many security controls are bolted on - it wasn't until the last four or five years that security was a concern."
McVey opened a session of the Zenmap security scanner and, pointing it at a single IP address, immediately discovered all the usual ports commonly associated with Hadoop. "If it's this easy running it on one node, it's trivial to run it on an entire subnet," he observed.
Beyond the open ports, within a couple of minutes Zenmap also turned up hosts with .hwx extensions that were easily identified as Hadoop clusters.
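Zenmap is a graphical front end for the nmap scanner, so the reconnaissance McVey showed boils down to a couple of scan commands. A rough sketch of that kind of scan - the addresses are placeholders and the port list is simply a set commonly used by Hadoop 2.x services, not the exact scan from the demo:

    # Check a single node for RPC and web ports that Hadoop services typically listen on
    nmap -p 8020,8088,14000,50070,50075 192.0.2.10

    # The same port list swept across a whole /24 subnet
    nmap -p 8020,8088,14000,50070,50075 192.0.2.0/24

Any host answering on several of those ports is, as the demo showed, a fairly safe bet to be running Hadoop.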
Next, Hadoop distribution firm MapR's senior architect, Peyman Mohajerian, took over the demo.
Logging in through macOS's Unix-based Terminal client as "mruser", Mohajerian issued a "curl" HTTP-based data transfer command. The server returned no access rights to those particular files, but when he switched to "httpfs" in his command line - which accesses an API to communicate with the Hadoop backend, and is used, Mohajerian observed, in Cloudera among other Hadoop distributions - he saw better results.
Mohajerian supplied "mapr" as the username in this command even though he was logged in as "mruser", but the httpfs-based request meant the system trusted him anyway. He was then able to access files his own account had no rights to.
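The weakness here is that, without Kerberos, Hadoop's HttpFS and WebHDFS REST endpoints run in "simple" authentication mode, where the caller merely states who they are via a user.name query parameter. A hedged sketch of the idea, with placeholder hostnames and file paths rather than the exact commands from the demo:

    # Logged in locally as "mruser", but claiming to be "mapr" on an unsecured HttpFS endpoint
    curl "http://httpfs-host:14000/webhdfs/v1/user/mapr/private.txt?op=OPEN&user.name=mapr"

With no authentication layer to contradict the claim, the cluster returns the file as if the request really had come from "mapr".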
As solutions, McVey stressed using software to block port scanning, and making sure each action a user takes within Hadoop has to be individually authenticated.
"Put clusters behind secure zones", said McVey.
"One way you can do that within Hortonworks is with Samsung Knox - it will redirect your queries to the various systems on your behalf. We suggest putting that in a demilitarised zone, and another layer beyond that."
This means only one port can be used to even get into the gateway Knox has created. McVey acknowledged, when asked by an audience member, that other distros apart from Hortonworks may not support this method, however.
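In practice a Knox deployment means all REST traffic to the cluster funnels through a single TLS port on the gateway, where it can be authenticated before being proxied on to the back-end services. A sketch of what the earlier file request might look like through Knox - the gateway hostname, topology name and credentials are all placeholders:

    # The only route in is the gateway's TLS port; the caller must authenticate first
    curl -k -u mruser:secret "https://knox-gateway:8443/gateway/default/webhdfs/v1/user/mruser/private.txt?op=OPEN"

The underlying Hadoop ports never need to be reachable from outside the secure zone.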
The second method involves using the Kerberos protocol - effectively "ticketing" each access to a server node.
"What does it mean to say I've Kerberosised the cluster?" asked McVey.
"If I want to communicate with other services, I need to have Kerberos in order to do that. It also authenticates one service with all the others. So, for example, if Hive (a Hadoop subproject) needs to talk to HDFS (Hadoop Distributed File System - a distro that runs on commodity hardware), they also need to authenticate with each other.
"A Kerberos ticket will be required to access every single service," McVey confirmed.
An interesting session, showing both how simple it is to undermine Hadoop's stock security - or lack thereof - and how straightforward it should be, in theory, to better protect your distro.