Towards one database to rule them all
Databases have become silos and this needs to change, says graph guru Marko Rodriguez
Every few weeks, or so it seems, a new database is added to an ever-growing and increasingly diverse list.
In addition to the familiar relational databases there are analytical, transactional, in-memory, NoSQL, NewSQL, time series, event streaming, graph, distributed, wide-column, key-value and document stores. Some databases straddle several of these categories, while others are finely tuned to serve a small set of use cases or a specific domain.
The explosion in database technologies, which began about ten years ago with the advent of NoSQL, has been accompanied by a rise in the number of query languages and interfaces, each specific to a certain class of database.
Processing engines are proliferating too, particularly given the popularity of real-time streaming as a way of moving data rapidly in and out of datastores, filtering, querying and aggregating data and managing low-level operations.
So, developers and data engineers are spoiled for choice - which is not necessarily a good thing.
In short, the at-scale data processing environment has become an archipelago of small islands of functionality, navigating between which is not for the faint-hearted, or at least not for those inexperienced in integration. In an environment characterised by rapid change, it's easy to see how this cornucopia of choice can also be a recipe for complexity.
But when complexity rears its many ugly heads, those dedicated to simplification and unity are not far behind.
Beyond the database
Graph database aficionados will be familiar with the name Marko Rodriguez. Rodriguez was project co-founder, along with Stephen Mallette, of Apache TinkerPop, a graph framework that unites transactional and analytical processing, and the Gremlin graph query language that runs on it.
"Apache TinkerPop took a decade of my life," he said. "Those years will prove to be my formative years, with all future work from here on simply grasping at a deeper, more pure realisation of what we discovered."
Having moved on from TinkerPop (more about that below), Rodriguez is now turning his attention to the wider problem of how to integrate these separate data ecosystem islands in a way that is seamless for the end user and sufficiently flexible to accommodate new developments as they emerge.
The project, called mm-ADT (multi model Abstract Data Type), is a virtual machine, much like the Java virtual machine (JVM) but for cluster computing, which connects any compliant database, query language or processing engine to any other.
For example, you might want to query a graph using SQL in real time, or traverse a document store using Gremlin as a batch process, both of which would be very tricky at present. mm-ADT aims to make all this possible without requiring any tinkering by data engineers. It should ‘just work’ because the dependencies and underlying complexity have been abstracted away.
"I believe that the future data space will look beyond the monolithic data system rooted at a database core," said Rodriguez. "Instead, the future may be more about naturally composing the diversity of piecemeal distributed technologies — languages, processing engines, storage systems."
Some might see mm-ADT as a Frankenstein's monster (Rodriguez's preferred term is ‘synthetic data system’), but the aim is to make the stitches invisible. Synthetic data systems will have far-reaching implications for data engineers, and indeed the wider world, Rodriguez believes, because they will allow for the easy creation of new structures and novel ways of interrogating data.
In Pareto Principle terms, databases tend to miss out on edge use cases (the 20 per cent), a trade-off for being optimised for the majority (the 80 per cent). Even multi-model databases have a number of blind spots because they are restricted to working with a limited range of datatypes, yet it's within these niches that innovation frequently occurs. mm-ADT will break down these walls, enabling users to effectively create custom datatypes on the fly, Rodriguez claims.
"mm-ADT is about modelling one domain within another domain," he said. "Users define ‘morphisms’ between, let's say, a graph composed of vertices and edges to a group of people related by various social contexts.
"This is what a schema is all about but mm-ADT takes it further. How can a graph be embedded in a wide-column store? How can a wide-column store be embedded in a key-value store? How can a key-value store be embedded in a distributed, in-memory index system?
"All the while, under the hood, the VM maintains these mappings allowing the user to think completely in terms of people and their social links."
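To make the idea of embedding one structure in another concrete, here is a minimal Python sketch of a graph stored inside a flat key-value store. It is an illustration only and not mm-ADT code: the function names `embed_graph` and `neighbours`, the `source/label/target` key layout and the sample people are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# An edge is (source, label, target). Hypothetical layout, not an mm-ADT type.
Edge = Tuple[str, str, str]


def embed_graph(edges: List[Edge]) -> Dict[str, str]:
    """Embed a graph in a flat key-value store.

    Each edge becomes one key of the form "source/label/target" with an
    empty value. Assumes names never contain "/".
    """
    return {f"{src}/{label}/{dst}": "" for src, label, dst in edges}


def neighbours(store: Dict[str, str], person: str, label: str) -> List[str]:
    """Answer a graph-style question using only key-prefix lookups."""
    prefix = f"{person}/{label}/"
    return sorted(key[len(prefix):] for key in store if key.startswith(prefix))


edges = [
    ("alice", "knows", "bob"),
    ("alice", "knows", "carol"),
    ("bob", "knows", "alice"),
]
store = embed_graph(edges)
print(neighbours(store, "alice", "knows"))
```

The caller asks about people and who they know, while the store underneath only ever sees string keys. That is the kind of mapping Rodriguez describes the VM maintaining on the user's behalf.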
OK so how does it work?
mm-ADT translates between underlying structures and allows users to create custom data types and values. Almost every data structure can be represented using mm-ADT, Rodriguez said.
"What is a key-value store? A stream of 2-tuples. What is a wide-column store? Multiple streams of 2-tuples with the second component having more structure. What is a graph database? A stream of vertex tuples with projections to streams of incoming and outgoing edge tuples.
"mm-ADT is tuples composed of streams that are being fed from various heterogeneous data sources in the cluster. That's pretty much all there is to it."
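Rodriguez's three descriptions can be sketched with plain Python generators. This is a rough illustration under stated assumptions, not mmlang or any mm-ADT API: every function name and all the sample data below are invented for the example, and a single wide-column stream stands in for the "multiple streams" he mentions.

```python
from typing import Dict, Iterator, List, Tuple

Edge = Tuple[str, str, str]  # (source id, label, target id); hypothetical shape

# Hypothetical sample data.
EDGES: List[Edge] = [("user:1", "knows", "user:2")]


def key_value_stream() -> Iterator[Tuple[str, str]]:
    """A key-value store viewed as a stream of 2-tuples."""
    yield ("user:1", "alice")
    yield ("user:2", "bob")


def wide_column_stream() -> Iterator[Tuple[str, Dict[str, str]]]:
    """A wide-column store: 2-tuples whose second component has structure."""
    yield ("user:1", {"name": "alice", "city": "paris"})
    yield ("user:2", {"name": "bob", "city": "oslo"})


def out_edges(vertex_id: str) -> Iterator[Edge]:
    """Lazy stream of edges leaving a vertex."""
    return (edge for edge in EDGES if edge[0] == vertex_id)


def in_edges(vertex_id: str) -> Iterator[Edge]:
    """Lazy stream of edges arriving at a vertex."""
    return (edge for edge in EDGES if edge[2] == vertex_id)


def vertex_stream() -> Iterator[Tuple[str, Iterator[Edge], Iterator[Edge]]]:
    """A graph: vertex tuples with projections to edge streams."""
    for vertex_id in ("user:1", "user:2"):
        yield (vertex_id, out_edges(vertex_id), in_edges(vertex_id))


for vertex_id, outgoing, incoming in vertex_stream():
    print(vertex_id, list(outgoing), list(incoming))
```

In each case the consumer sees only tuples arriving on a stream. Where those tuples physically live is left to whatever feeds the generator.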
Well not quite all.
mm-ADT features a low-level assembly language called mmlang, and the plan is for higher-level languages to compile to mmlang bytecode so that developers can continue to use their favourite tools to query and manipulate data. This is becoming an increasingly important part of the picture, Rodriguez said.
"The beauty of mm-ADT, expanding on our work at Apache TinkerPop, is that one language exists regardless of the underlying storage systems and processing engines that serve to ground the computation in the physical world, i.e. that serve to compute."
The mm-ADT VM is analogous to the JVM, but rather than enabling portability between devices and operating systems, it promotes agnosticism between database operations in distributed clusters.
"I hope we can create the first viable distributed programming language virtual machine akin to the JVM but where the ‘disks’ are database nodes and the ‘processors’ are the numerous CPUs across a vast array of machines," said Rodriguez.
"If users can program naturally with little concern for how data is moving across the cluster or how threads are being managed, then we will have an interesting product, indeed."
Predatory practices
If mm-ADT is a glimpse into the future of data processing, the chosen licensing model, the restrictive AGPLv3, may indicate how strongly developers feel the need to protect their creations from the big cloud providers (see also MongoDB, Redis Labs and Confluent).
Rodriguez experienced this conflict first-hand while trying to raise funds to take Apache TinkerPop to the next stage.
"I contacted Amazon - a heavy user of TinkerPop both in their database product and internal for shipping - to ask for some funding to do TP4. They said: ‘Build it first, and then we can talk about paying you.’ IBM had a similar response. I asked for a meagre amount from each, where the sum total was significantly under what a senior software engineer makes in the industry. I was devastated that the years I put into TinkerPop that has allowed them to make millions of dollars didn't make them grateful (or at least ‘respectfully’ declining).
"It was an intense realisation for me. I decided to start a fresh new project that is AGPL to protect people from the major cloud vendors' predatory practices. That project is mm-ADT."