As part of Total ReCal we’ve been taking a look at the so-called NoSQL approach to databases. I gave a quick overview of NoSQL and why we were looking at it in a previous blog post, so I’m going to skip all the gory details of what NoSQL actually is (and why we’re using it), and leap straight into the discussion of whether it’s any good, whether it’s ready for prime-time, and whether it’s ready for the HE sector to actually use in production.
Is it any good?
In a word, yes. In slightly more words, yes, but only if you use it in the right place. NoSQL is excellent at providing fast, direct access to massive sets of unstructured data. By ‘fast’ I mean ‘thousandths of a second’, and by ‘massive’ I mean ‘billions of items’. On the other hand, if you’re after rock-solid data integrity and the ability to perform functions like JOIN queries then you’re out of luck and you should stick to an RDBMS. The two approaches aren’t competing, but offer complementary functionality. A corkscrew and a bottle opener both let you into your drink, but it’ll be amazingly awkward to open your beer with a corkscrew.
Which approach to use is, in some cases, very clear-cut: NoSQL is not going to be the best way to provide a central student management system, since it lacks the relational ability which is needed to easily organise, group and query this data. On the other hand, it is the best way to quickly traverse large data sets such as a library inventory without needing to involve tens of tables.
In some cases though the line will be blurred, and the best thing to do would be to run two systems concurrently. Facebook, for example, uses NoSQL to power some aspects of searching but an RDBMS to actually store the bulk of the data. Our Jerome un-project uses a NoSQL layer to provide rapid searching and querying, but the actual definitive source of information is a Sybase database and the two sources are regularly synchronised.
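To make the ‘two systems, regularly synchronised’ idea a bit more concrete, here is a minimal sketch of a one-way sync in Python. It is not how Jerome actually does it: the sync function, the ‘id’ field and the use of a plain dict as a stand-in for the NoSQL layer are all assumptions made purely for illustration. The only point is that the RDBMS stays the definitive source and the search layer is simply overwritten to match it.

```python
def sync(source_rows, search_layer):
    """One-way sync: make search_layer mirror source_rows, keyed on 'id'.

    source_rows  -- rows read from the definitive database (hypothetical shape)
    search_layer -- dict standing in for the fast NoSQL copy
    """
    source_ids = set()
    for row in source_rows:
        source_ids.add(row["id"])
        # Insert new records and overwrite any that have changed.
        search_layer[row["id"]] = dict(row)

    # Anything no longer in the definitive source is removed from the copy.
    for stale_id in set(search_layer) - source_ids:
        del search_layer[stale_id]

    return search_layer


# Hypothetical data: record 1 has changed, record 3 has been deleted upstream.
definitive = [{"id": 1, "title": "Item A"}, {"id": 2, "title": "Item B"}]
search_copy = {1: {"id": 1, "title": "Old title"}, 3: {"id": 3, "title": "Item C"}}
sync(definitive, search_copy)
```

Because the copy can always be rebuilt from the definitive source, it doesn’t matter much if the fast layer is less strict about integrity.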
To summarise, NoSQL is very good at what it does, but if you try to use it for the wrong thing then, whilst it won’t be impossible, it will at the very least be more awkward.
Is it ready for prime-time?
In a word, yes. There are no slightly more words for this. NoSQL databases aren’t just a plaything for bored application developers; they are a serious way of doing things. They power bits of Google (BigTable), are supported by Amazon (Dynamo), do the heavy lifting for Digg voting and Facebook’s inbox search (Cassandra), run FourSquare (MongoDB) and sit behind countless other bits and pieces across the world. Twitter uses a variety of NoSQL systems to tie things together, including its own (now open) FlockDB. In short, to say NoSQL isn’t ready for prime-time is nothing short of blindness.
However, there is a caveat. NoSQL is still relatively new to the field of databases, and whilst it has an enormous number of developers and an even bigger community, it is nowhere near as mature as the world of SQL and RDBMS, which has remained relatively unchanged for longer than I’ve been alive. The lack of any clear standard in the NoSQL world (the “No” gives it away) means that the various systems do things in a myriad of different ways. MongoDB is a document store (it’ll blindly accept any data as long as the structure is valid), whereas something like memcache is a key/value store (one key = one value). Whilst it’s possible to map between these, it’s a lot harder than moving between two relational databases.
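The difference between the two models is easier to see in code. What follows is a rough sketch in plain Python rather than real MongoDB or memcache calls: the find and kv_find_by_author helpers, the key names and the sample records are all made up for illustration. A key/value store only knows ‘this key, that blob’, whereas a document store can look inside each record.

```python
import json

# Key/value store: one opaque value per key. The store cannot look inside it,
# so each record is serialised to a string before being stored.
kv_store = {
    "book:1": json.dumps({"title": "Example Book A", "author": "Author One"}),
    "book:2": json.dumps({"title": "Example Book B", "author": "Author Two", "year": 2009}),
}

# Document store: each record is a structured document. Documents in the same
# collection do not have to share the same fields ('year' is optional here).
doc_store = [
    {"_id": 1, "title": "Example Book A", "author": "Author One"},
    {"_id": 2, "title": "Example Book B", "author": "Author Two", "year": 2009},
]


def find(collection, **criteria):
    """Return the documents whose fields match every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


def kv_find_by_author(store, author):
    """With key/value data the application must fetch and decode every value itself."""
    decoded = (json.loads(value) for value in store.values())
    return [record for record in decoded if record.get("author") == author]


by_key = json.loads(kv_store["book:1"])            # direct lookup by key
by_field = find(doc_store, author="Author Two")    # query on a field
```

Moving data from the document store into the key/value store means deciding on a key scheme and giving up field queries, which is one reason why mapping between NoSQL systems takes more thought than moving between two relational databases.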
There’s also an issue in that NoSQL database systems are somewhat of a moving target, and are still being developed. In many ways this is a good thing, since it means problems are spotted and resolved faster, and rapid iterations mean that new features appear very quickly. Conversely, if you’re forced into a ‘build it, test it, leave it alone forever’ mentality then you simply won’t be able to keep up. Our choice of database (MongoDB) only gained its sharding functionality in the latest version, and its resiliency to failure (whilst already good) is still being tweaked and improved.
Is NoSQL ready for HE?
I’m going to have to turn this question around. I’ve already established that NoSQL as a concept is more than ready for production usage, but I have doubts that the HE sector is ready to move away from RDBMS any time in the next few years. Our own database team at Lincoln get visibly nervous at the notion of us using MySQL for some things instead of MS-SQL, let alone completely changing the approach to data storage.
I suspect that a lot of this quite rightly derives from the need of the more ‘central’ systems to provide solid data integrity. As I mentioned above, I really wouldn’t want to try running a student management system on a NoSQL server. The trouble is that this mindset seems to extend even to new services which are fundamentally different to anything implemented before, and it’s very difficult to undo (in some cases) 20+ years of experience with RDBMS and replace it with unstructured, untyped data storage.
There’s also a problem in that HE in general doesn’t have a need for NoSQL except from an academic, theoretical standpoint. Universities in the past haven’t built systems which try to amalgamate masses of data into one place in an unstructured format, and even the biggest of university libraries won’t be handling a billion inventory items.
What we’re trying with Total ReCal could be done in an RDBMS with a bit of tweaking to the design (we’d need to replace one ‘events’ collection with plenty of different tables to handle different event types and event-to-user mappings), and to be completely honest, although it would suffer slightly in the performance stakes, it would still do reasonably well. At this point, however, we’re off into uncharted territory, since once we start adding masses of events (every timetable item, assessment due date and book due date for every student and staff member) we really don’t know how well NoSQL will perform against an RDBMS. Obviously we’ll keep you posted.
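As a rough illustration of what ‘one events collection’ means in practice, here is a small Python sketch. It is not the real Total ReCal schema: the field names, user IDs, dates and the events_for helper are all invented for the example. The thing to notice is that timetable items, assessment deadlines and book due dates sit side by side, each carrying only the fields it needs along with its own list of users.

```python
from datetime import date

# One collection, three differently shaped kinds of event (all values hypothetical).
events = [
    {"type": "timetable", "users": ["s001", "s002"], "date": date(2010, 10, 4),
     "room": "Room 1", "module": "MOD101"},
    {"type": "assessment_due", "users": ["s001"], "date": date(2010, 10, 8),
     "module": "MOD101", "weighting": 40},
    {"type": "book_due", "users": ["s002"], "date": date(2010, 10, 6),
     "barcode": "0000001"},
]


def events_for(user_id, collection):
    """Return every event involving user_id, whatever its type, in date order."""
    matching = [event for event in collection if user_id in event["users"]]
    return sorted(matching, key=lambda event: event["date"])


calendar = events_for("s002", events)
```

In a relational design each event type would most likely get its own table, plus a mapping table linking events to users, and the same calendar would be assembled by querying all of them and merging the results.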
So, what’s the verdict?
We like NoSQL, and it’s the right thing to use in a lot of places. If you’re trying to perform lots of operations on a lot of data very quickly, it’s a clear winner over an RDBMS. I can see it being used more and more in HE as universities try to move towards SOA as a way of doing things, purely out of the necessity for various services to play nicely together without slowing down the entire business process. For the moment, however, it’s not going to suddenly start replacing key services.
I’ll be following up shortly with a blog post looking at the various NoSQL methodologies and servers available, and explaining exactly how we arrived at using MongoDB over other solutions along with a handy flowchart to help you decide if NoSQL is right for you (and if so what you should look at using).