YouGov explains the secret of scalability with MongoDB
'Scale up before you scale out', advises executive technical director Jason R. Coombs
The polling and research agency YouGov, which collects consumer and political data in Europe, the US and Asia, was an early adopter of NoSQL database MongoDB.
Executive technical director Jason R. Coombs was a driving force behind that decision. YouGov was expanding globally and needed to replace its home-built key-value store with something better suited to distributed systems.
"I found myself espousing sharding before I truly understood what it was," he says. "I had been thinking about how to solve this non-relational data problem for years, then in 2010 when I saw MongoDB, the light bulb came on. I knew we had to use it."
The firm now operates two data centres in the US, two in Europe and "one global node that spans all of them for our bot", with MongoDB underpinning many of the company's operations.
Coombs explains: "Instead of having separate instances of MongoDB in each place we have one global cluster. Even though it looks to our applications like one uniform database management system in fact we have 20 different databases across production systems."
This makes things much simpler for the firm's developers because they don't need to worry about where the data is located. It also makes for lower latency as the data can be kept close to where it will be queried.
"Our database managers are putting the data into regionally relevant instances or shards," Coombs says.
YouGov uses the MongoDB Enterprise Advanced package, which includes support, and Cloud Manager to run, automate and back-up the deployment. The system underpins YouGov's solutions such as BrandIndex, the brand perception tool, and consumer behaviour tracker, Pulse. In all, it now supports dozens of applications within YouGov.
Because the data input into the system is typically a single document (i.e. a completed survey questionnaire), document-based sharding is a more convenient way to distribute and manage the data than using the relational model, which splits the data into multiple tables.
But what about what some see as the Achilles' heel of many NoSQL databases: that they do not allow for truly ACID transactions? To allow reads and writes in all partitions strong consistency must be sacrificed.
"It hasn't impacted us. We have panelists that come in and we need to assemble a page for them to view and write their responses to the database, but as far as consistency goes, it's a non-issue because we're able to read and write the entire document and that's where the user's entire experience is contained. That's the way it's designed," Coombs says, adding:
"It would be an issue if we were to try to treat MongoDB like a relational database, but that's something I'd advise against."
Bigger and better
The reason that MongoDB was selected in the first place was to get over some of the scalability problems inherent in relational databases. (YouGov experimented with Microsoft SQL Server before picking MongoDB.)
"I would say MongoDB's primary value is in performance. One of our use cases was with our IRC [Internet Relay Chat] bot that we were running using SQLite. We ran into scaling problems when we had 100,000 IRC logs. I spent two days rewriting it to work with MongoDB instead, put the bot on MongoDB and ever since then it's continued to run and scale without us even having to touch it. The performance was immediately noticeable."
YouGov may not have had a problem with scalability, but this has certainly been an issue with the database over the years, one acknowledged by MongoDB itself. A year or so ago, the firm purchased the WiredTiger storage engine to address the issue of performance at large scale. Coombs says that YouGov only recently upgraded to MongoDB 3.0, which features WiredTiger, but it managed to compress its data by 70 per cent using the storage engine - although in practice it still uses its legacy storage-layer compression to achieve the same result. However, he says, another benefit brought by WiredTiger is most welcome: collection-level locking.
"Something that troubled me with MongoDB was database-level locking. Any write on a particular shard would prevent any other write across the databases. Occasionally, if we had a bug or a surge of requests that surge would impact across all of our applications instead of just the one. Now with collection-level locking you could be writing to a particular collection and it doesn't prevent writes to other collections concurrently."
Coombs shares some advice to those looking to scale out using MongoDB: "Scale up before you scale out".
"There is a burden to partitioning your data even if MongoDB can abstract most of it, so use SSDs to maximise the hardware on a single server as much as you can and then scale out when you have to. That's the best bit of advice I've got around MongoDB management."