Covering more bases: DataStax and MongoDB move to spread their influence in the enterprise
New storage engine for MongoDB and addition of graph database Titan to DataStax reveal multi-model trend in NoSQL
Two of the leading NoSQL database providers today announced developments that are intended to broaden the capabilities of their respective offerings.
DataStax, distributor of the Apache Cassandra NoSQL database, has announced its acquisition of the graph database firm Aurelius for an undisclosed amount. Aurelius is a small company of eight engineers that is behind Titan, a graph database that has already found a place in Cisco's big data stack.
DataStax will be offering Titan combined with its commercial DataStax Enterprise (DSE) package, which includes Apache Cassandra, search and analytics. The new package is to be known as DSE Graph and will only be available to paying customers. Work will start on integrating Titan into DSE immediately, with an announcement about its general availability expected later this year.
Graph databases are a relatively new addition to the ever-growing range of solutions available for data management, and they are particularly useful for analysing highly connected data. In terms of querying data via joins between tables (or documents), graph databases lie at the opposite end of the spectrum from key-value and document-oriented NoSQL stores. In those databases, queries typically involve very few joins or none at all, while applications built on standard SQL databases, such as CRM, typically involve a few tens of joins at most. Graph databases, by contrast, allow efficient querying across many thousands of joins, making them ideal for interrogating highly connected data such as a social media dataset.
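To make the join comparison concrete, the following is a minimal Python sketch (the follower data and the reachable_within helper are invented for illustration; a real graph database would use its own query language rather than a hand-written traversal). Each hop of the walk stands in for what would be another self-join on a relational follows table.

    from collections import deque

    # Hypothetical follower data as an adjacency list: user -> users they follow.
    follows = {
        "alice": {"bob", "carol"},
        "bob": {"carol", "dave"},
        "carol": {"dave"},
        "dave": {"alice"},
    }

    def reachable_within(graph, start, max_hops):
        # Breadth-first walk: each hop here is the work a relational store
        # would do with another self-join on a follows(follower, followee) table.
        seen = {start}
        frontier = deque([(start, 0)])
        while frontier:
            user, depth = frontier.popleft()
            if depth == max_hops:
                continue
            for neighbour in graph.get(user, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    frontier.append((neighbour, depth + 1))
        seen.discard(start)
        return seen

    print(reachable_within(follows, "alice", 2))  # {'bob', 'carol', 'dave'}

A two-hop query like this is trivial either way; the graph approach only pays off as the number of hops, and hence the number of joins a relational engine would have to perform, climbs into the hundreds or thousands.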
However, graph databases are difficult to scale across clusters of servers, as Matthias Broecheler, managing partner at Aurelius, explained after today's announcement.
"The [statistical] distribution of graph data is very highly skewed. Let's imagine you want to build Twitter's primary storage engine. There are a lot of users that have 20 or so followers, then you have Justin Bieber and Barack Obama with millions of followers ... having to accommodate this kind of skew has been a huge problem because most data storage technologies centre around normally distributed data," he said.
Another issue is that for efficient querying you want strongly connected components (e.g. Mr Bieber and his followers) to be close together on the same machine rather than randomly distributed across the cluster.
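A toy sketch, with invented numbers scaled down from the Twitter example, shows the tension: placing edges by hashing the followee keeps each user's adjacency list on one machine, which is good for locality, but a single celebrity-sized account drags one partition badly out of balance.

    import random

    random.seed(0)
    edges = []                                   # (followee, follower) pairs
    for user in range(1, 1000):                  # ordinary users: a handful of followers each
        for follower in random.sample(range(1000, 5000), k=random.randint(1, 20)):
            edges.append((user, follower))
    for follower in range(1000, 5000):           # the celebrity account: thousands of followers
        edges.append((0, follower))

    # Naive placement: hash each edge by its followee so a vertex's adjacency
    # list stays together on one machine.
    machines = 4
    load = [0] * machines
    for followee, _ in edges:
        load[hash(followee) % machines] += 1

    print(load)  # one partition ends up carrying the celebrity's entire edge list

Spreading the edges uniformly instead would balance the load but scatter each user's followers across the cluster, which is exactly the locality problem described above.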
Some of these problems can be overcome by running Titan on top of a distributed NoSQL platform. Broecheler said that Cassandra has proved itself easier to deploy practically at scale than its competitors.
"You can actually play these things out by adding machines relatively painlessly and you can run cross data centre, you can run it East Coast to West Coast, Tokyo and Europe and it works," he said. "That ultimately is what we are interested in: building a massive graph distributed over the entire planet in a way that from a user's perspective is just adding more machines to it."
For DataStax, the acquisition allows it to expand its use-case footprint as well as offer an alternative approach for existing customers.
"The use cases our customers have for graph are very consistent with the use cases that they're already using DSE for," said Martin van Ryswyk, EVP engineering, citing fraud detection, recommendation engines, network impact analysis, logistics and network and device management.
"The second part we're hearing pretty loudly is that they want a multi-model platform."
Rather than ingesting data into Cassandra and creating a graph in batch mode, he said, customers want to be able to do this in real-time. While Titan can currently work with Cassandra, it is not sufficiently integrated to allow such real-time performance yet, and "tightly coupling" Titan, Cassandra and the search capabilities of DSE is what the DSE Graph offering will be all about, van Ryswyk said.
Write to us, says MongoDB
DataStax competitor MongoDB announced today that it is to offer the WiredTiger storage engine as an option with its forthcoming 3.0 release. MongoDB acquired the specialist storage vendor of the same name in December last year, again for an undisclosed amount.
Users of version 3.0 of the NoSQL database, which will be made generally available early in March, will be able to select from three storage engines: WiredTiger; an updated version of Mongo's standard MMAPV1 storage engine; and an experimental in-memory engine.
Adding the pluggable WiredTiger engine as an option will, the company said, strengthen the capabilities of the database under write-heavy workloads such as messaging, log analysis and Internet of Things (IoT) applications, while the in-memory option will speed up certain tasks where the data can be held in RAM. As with DataStax and Titan, offering a choice of storage engines moves MongoDB further along the road to multi-model functionality.
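The engine is selected when the mongod process is started (version 3.0 adds a --storageEngine startup option for the purpose), and a running server reports which one it is using. A minimal check with the pymongo driver might look like this; the connection string is illustrative and assumes a local 3.0 server.

    from pymongo import MongoClient

    # Assumes a MongoDB 3.0+ server listening locally; serverStatus reports
    # the pluggable storage engine the mongod was started with.
    client = MongoClient("mongodb://localhost:27017/")
    status = client.admin.command("serverStatus")
    print(status.get("storageEngine", {}).get("name", "unknown"))  # e.g. "wiredTiger" or "mmapv1"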
Improving write performance is important for the company. MongoDB has long been popular with developers - the firm reports 10,000 downloads of its software every day - but some users have reported performance problems as deployments scale up, especially under such write-intensive workloads, as Computing reported last year.
Kelly Stirman, director of products at MongoDB, acknowledged that this has been a chink in the database's armour.
"There has been a reputation that MongoDB is hard to scale. It's incredibly easy to get started building applications on a laptop but when you want to use a 100 computers then it's been more complicated than we would like," he said, inferring that the 3.0 release will rectify the issue.
"This release is all about scalability and performance at scale. The performance for writes will be as good as any other solution out there," Stirman said, claiming that for typical applications users can look forward to a five- to seven-times increase in throughput rates.
The company is also introducing new compression algorithms, which Stirman said will see a saving in the storage footprint of between 50 and 80 per cent for typical deployments, depending on the data.
"Data is growing faster than storage is growing cheaper," Stirman said.
Coinciding with the release will be Ops Manager, a new on-premises version of the cloud-based MongoDB Management Service (MMS). The application is designed to simplify the operational management of MongoDB deployments at any scale, covering backup and point-in-time recovery, upgrades, configuration changes and integration with common third-party operational tools.
While the new storage engines and compression libraries will be available to all users of the open source database, Ops Manager is restricted to MongoDB's paying customers. The same is true of many of the new security features that are essential for enterprise use.
"With 3.0 we've added comprehensive auditing capabilities," Stirman said. "We've been working with banks and the federal government to build to write security features that can make MongoDB comply with demands of these applications. That's an area where the commercial product is different from the open-source product. Pretty much all the things we asked people to pay for are in the area of security and management."
A multi-model future
Commenting on the news, Matt Aslett, research director for data platforms and analytics at 451 Research, said:
"Each of these announcements could be considered significant in its own right. In combination, however, they indicate a new stage in the evolution of NoSQL and a clear signal that the future of NoSQL will be driven by database products that support multiple data models."
Aslett continued: "Multi-model momentum may have been growing for years but the fact that the commercial providers behind the two most popular NoSQL databases have detailed their plans to go multi-model confirms that the multi-model approach is the future of NoSQL."