Graph databases - why so hard?
Graph databases are shaped the way we think, so why can't people get their heads around them?
Graph databases should be simple and intuitive. After all, we naturally think of things in terms of their relationships with other things. A seven-year-old child might draw one dot representing herself, one dot representing a teacher and a line representing a relationship (‘is my teacher'). This is a simple graph, easy to understand with not a row, column or table in sight.
But even their biggest fans admit that people have a hard time getting their heads around graph databases. Indeed, this lack of easy accessibility was a recurring theme at the Connected Data London event last week. Computing spoke to some of the participants to find out why.
A diverse family
A subset of the NoSQL family, graph databases (sometimes used synonymously with knowledge graphs, a term popularised by Google) have recently begun to emerge from their academic niche due to a confluence of trends including big data, IoT, machine learning, social media and cloud computing. They excel where the relationships between data points are as important as the data itself. Among their more prominent use cases and users are KYC and fraud detection in banks, customer 360 in marketing, network analysis in IT, criminal investigations, investigative journalism, audit and data provenance in compliance, machine learning and recommendation engines, and search engines. And of course Facebook and Twitter are basically graphs.
Graph databases go back a couple of decades, but there are various sub-species, each having taken a distinct fork during their evolution. Some, like Allegro, GraphDB and MarkLogic, are rooted in semantic web technologies such as RDF and triple stores and are universalist in approach, with each node (subject or object) and each edge (also called a predicate) having its own unique identifier (URI).
Then there are property graphs (also known as labelled property graphs, or LPGs), which allow nodes and edges to be labelled with metadata. If Alice met Bob, the time and location where they met could be added as properties of the edge ‘met', and additional information about Alice and Bob could be attached to their nodes. With RDF-based graphs, each such annotation would need to be expressed as separate statements. Examples of LPGs include Neo4J, TigerGraph and OrientDB.
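To make the contrast concrete, here is a rough sketch of the same ‘Alice met Bob' fact in each model, using the rdflib Python library for the RDF side and a Cypher statement for the property-graph side. All the URIs, labels and property names are invented for illustration.

```python
# A minimal sketch, not production code: the same fact in both models.
from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/")  # invented namespace

# RDF / triple store: the meeting becomes an intermediate node, because a
# plain triple ('alice met bob') has nowhere to hang the time and place.
g = Graph()
meeting = EX.meeting1
g.add((EX.alice, EX.attended, meeting))
g.add((EX.bob, EX.attended, meeting))
g.add((meeting, EX.location, Literal("London")))
g.add((meeting, EX.date, Literal("2019-11-07")))

# Labelled property graph (Cypher): the same details sit directly on the edge.
cypher = """
MERGE (a:Person {name: 'Alice'})
MERGE (b:Person {name: 'Bob'})
MERGE (a)-[:MET {location: 'London', date: '2019-11-07'}]->(b)
"""
```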
Some graph databases run on a single machine, while others are designed to be distributed across a cluster of machines. An early example of the latter was TitanDB by Aurelius. That company was acquired by DataStax in 2015, and elements of Titan have found their way into the company's DSE Graph offering, which runs on Apache Cassandra.
There are graph databases best suited to computationally heavy statistical data crunching as back ends for machine learning models, and others that are adept at producing real-time answers to search queries. Then there are those designed for analytics and those created with operational processing in mind. And the different branches of the graph family come with their own tools. RDF-based graph databases tend to use SPARQL or GraphQL for querying; Neo4J and its derivatives use Cypher, while in the Apache world you have Gremlin and the TinkerPop framework. Other databases have their own native query languages.
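As a rough illustration of how the dialects differ, here is the same simple question, ‘who has Alice met?', sketched in SPARQL, Cypher and Gremlin. The schema, labels and prefixes are invented, and real deployments will vary.

```python
# The same question phrased for each family of tools; names are illustrative.

# SPARQL (RDF / triple stores): a graph pattern over triples.
sparql = """
PREFIX ex: <http://example.org/>
SELECT ?name WHERE {
  ?alice ex:name "Alice" ;
         ex:met  ?other .
  ?other ex:name ?name .
}
"""

# Cypher (Neo4J and derivatives): an ASCII-art pattern match.
cypher = """
MATCH (:Person {name: 'Alice'})-[:MET]->(other:Person)
RETURN other.name
"""

# Gremlin (Apache TinkerPop): the query is expressed as a traversal.
gremlin = "g.V().has('Person', 'name', 'Alice').out('MET').values('name')"
```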
Is that a graph problem?
Adding to an already confusing picture (at least for those of us not already immersed in these tools) is the dividing line between a ‘graph problem' and a ‘relational problem', which is not always crystal clear.
"The practical answer is if I'm coming from a relational database as soon as my nested correlated sub-queries start becoming too slow, or if I'm doing a lot of nesting of joins, or I have recursive joins and they just aren't performing anymore. I have a graph problem," said senior director of product management at DataStax, Jonathan Lacefield, adding that most graph problems are search problems.
However, treat graph as the hammer to every problem's nail and you'll soon hit performance problems from the other direction, said Sabah Zdanowska, product manager - corporate data at AML compliance vendor ComplyAdvantage. In particular, poorly designed queries that traverse more of the graph than necessary can take much longer to return results than an SQL equivalent. "In terms of the speeds that you would get in a relational database, in certain scenarios it can be much, much slower."
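The sort of query shape she describes might look like the hypothetical Cypher below: the first version traverses far more of the graph than the question requires, while the second anchors the start node and caps the path length.

```python
# Unanchored and unbounded: visits every 'MET' chain in the whole graph.
slow = """
MATCH (a:Person)-[:MET*]->(b:Person)
RETURN a.name, b.name
"""

# Anchored and bounded: starts at one node and stops after three hops,
# keeping the traversal local to the part of the graph that matters.
fast = """
MATCH (:Person {name: 'Alice'})-[:MET*1..3]->(b:Person)
RETURN DISTINCT b.name
"""
```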
As ever with databases, it's a matter of picking the right tool for the job. Graph databases are great for analysing networks, uncovering hidden relationships and certain pattern-matching queries, and not so great where the linkages between subjects are less important. But what of the RDF versus property graph divide?
Ruben Verborgh is professor of decentralised web technology at Ghent University, Belgium, and also technology advocate for inrupt, a startup within the Solid ecosystem.
For him, a graph problem is where closed objects are too limited to approximate the complexity of the system you are trying to model. "The world is graph-shaped, it's not object-shaped", he said, adding that this requires a new way of thinking on the part of developers schooled in object-oriented programming; not that it's harder, just different.
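One way to picture that distinction, with entirely invented names: nested objects force every relationship into a single ownership hierarchy, whereas a graph keeps shared and cyclic relationships first-class.

```python
# Object-shaped: each record owns its relationships, so shared or cyclic
# links (Bob works for two companies, Alice and Bob know each other) end up
# duplicated or flattened into foreign-key-style references.
objects = {
    "acme":   {"employees": [{"name": "Alice"}, {"name": "Bob"}]},
    "globex": {"employees": [{"name": "Bob"}]},   # Bob appears twice
}

# Graph-shaped: entities appear once and every relationship is its own edge,
# so many-to-many and cyclic links need no special treatment.
nodes = {"alice", "bob", "acme", "globex"}
edges = [
    ("alice", "WORKS_FOR", "acme"),
    ("bob",   "WORKS_FOR", "acme"),
    ("bob",   "WORKS_FOR", "globex"),
    ("alice", "KNOWS",     "bob"),
    ("bob",   "KNOWS",     "alice"),
]
```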
A self-described ‘RDF guy', Verborgh concedes that property graphs may be more intuitive for ‘normal people', since annotation means they can be more compact. However, because definitions are not universal, a central body is required to agree on the semantics, and therefore they are best suited to projects where the scope is relatively narrow or the data sits in one database. On the other hand, where queries span a large number of sources, as with a decentralised system like Tim Berners-Lee's Solid project, global definitions are required.
"With property graphs, the same query can mean different things on different databases," Verborgh explained. "Solid is a good example of where the other approach is better because it's not a big database, it's a big number of very small databases, or datasets actually. In that situation, you want to have a universal meaning for everything."
Joining the dots
RDF is not complex in itself, Verborgh said, but in attempting to represent the complexity of the world it can certainly seem so.
"The ultimate goal is to traverse everything in the world, but you don't have to put it together in one place. In that sense, RDF is tackling a more complex problem than property graphs, so that's important to know."
Nevertheless, there is a W3C project called Easier RDF with the goal of making the ecosystem ‘easy enough for average developers (middle 33 per cent of ability)'. Another, called RDF*, aims to bridge the gap between RDF and property graphs by allowing statements themselves to be annotated, much as edges are in a property graph.
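As a hedged illustration of what RDF* adds, the snippet below annotates the statement ‘Alice met Bob' directly, much as a property graph would annotate the edge. The Turtle-star syntax follows the RDF-star community drafts rather than any one database's dialect, and the names are invented.

```python
# RDF*: a whole statement, wrapped in << >>, becomes the subject of further
# statements, so edge-level metadata no longer needs an intermediate node.
turtle_star = """
@prefix ex: <http://example.org/> .
<< ex:alice ex:met ex:bob >> ex:location "London" ;
                             ex:date "2019-11-07" .
"""

# The property-graph equivalent: the same metadata set on an existing edge.
cypher = """
MATCH (:Person {name: 'Alice'})-[m:MET]->(:Person {name: 'Bob'})
SET m.location = 'London', m.date = '2019-11-07'
"""
```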
So, property graphs are a simpler place to start? Not necessarily, says DataStax's Lacefield, explaining that the landscape is characterised by silos and that there are unification efforts underway here too.
"Within property graphs there are two or three main APIs, so that's Gremlin and Cypher and some others, and there's an effort to come up with a unified language," he said. "But really, every time you're talking about something that's not SQL there's always a learning curve there. I think it's just the age of the graph database market."
As someone who advises banks, insurers and other businesses, Sabah Zdanowska sees the issue more from the UX point of view. From a usability standpoint, most graph databases are ‘still way off', she notes in a blog post. With some it's hard to import data, others are confusing to operate, and so on.
So graph databases can be challenging, and the vendors partly have themselves to blame for making them seem more difficult than they actually are. Every vendor loves to wheel out a slide that looks like numerous plates of multicoloured spaghetti joined together by yet more spaghetti strands, she said. The intent is to show how well their tool can analyse the interconnectedness of everything, but for a marketing analyst who just wants to know whether aggregated Alices and Bobs share the same taste in restaurants, or an executive who wants to know whether a particular metric is rising or falling, this is simply off-putting.
"When I see those sort of visualisations I think you are not doing our peer group any favours," Zdanowska said, adding that vendors need to spend much more time thinking about and testing the user experience.
In conclusion, then, graph databases are neither new nor particularly complicated, although, as Zdanowska says, some of them do require experienced developers plus a lot of handholding to obtain the desired results. But it is still a relatively immature market, and there are plenty of competing standards that have yet to be brought together in an easy-to-use way. However, there are graph databases that claim to be equally at home with analytical and operational workloads, and others, such as Stardog, that claim to span the divide between property graphs and the semantic web. As most offer free trials, the only real way to see if they are suitable is to pick a couple and give them a spin.