To NoSQL or not to NoSQL?

As part of Total ReCal we’ve been taking a look at the so-called NoSQL approach to databases. I gave a quick overview of NoSQL and why we were looking at it in a previous blog post, so I’m going to skip the gory details of what NoSQL actually is (and why we’re using it) and leap straight into the discussion of whether it’s any good, whether it’s ready for prime time, and whether the HE sector can actually use it in production.

Is it any good?

In a word, yes. In slightly more words: yes, but only if you use it in the right place. NoSQL is excellent at providing fast, direct access to massive sets of unstructured data. By ‘fast’ I mean ‘thousandths of a second’, and by ‘massive’ I mean ‘billions of items’. On the other hand, if you’re after rock-solid data integrity and the ability to perform operations like JOINs then you’re out of luck and should stick to an RDBMS. The two approaches aren’t competing; they complement each other. A corkscrew and a bottle opener will both get you into your drink, but it’s amazingly awkward to open a beer with a corkscrew.
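To make that concrete, here’s a minimal sketch of the difference, assuming a MongoDB document store alongside MySQL, via the PHP Mongo extension and PDO. The database, collection and column names are entirely hypothetical, not anything from the actual project:

    <?php
    // Document store: a single key-based lookup returns the whole denormalised
    // event, already in the shape we want it.
    $mongo = new Mongo('mongodb://localhost');
    $event = $mongo->recal->events->findOne(array('event_id' => 12345));

    // RDBMS: the same data has to be stitched back together with a JOIN, but
    // in return you get transactions, constraints and referential integrity.
    $pdo  = new PDO('mysql:host=localhost;dbname=recal', 'user', 'pass');
    $stmt = $pdo->prepare(
        'SELECT e.title, e.starts_at, e.ends_at, r.name AS room
         FROM events e
         JOIN rooms r ON r.id = e.room_id
         WHERE e.id = ?'
    );
    $stmt->execute(array(12345));
    $event = $stmt->fetch(PDO::FETCH_ASSOC);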

Why NoSQL?

After looking at the initial brief for Total ReCal, we realised that we would need to build a new data storage layer to handle the time/space information which drives the project. There are many reasons for this, both technical and political, but the key one is that, since we’re effectively running an abstraction and amalgamation service, we want to interface directly with our own copy of the data. Here’s why.

Speed is often considered a luxury when dealing with large data sets; in larger institutions especially, it’s common to think nothing of waiting a few minutes for a report to finish building or an operation to finish processing. We wanted to offer something you could happily hit with 20-30 queries a second over an API. This is particularly relevant given our larger Nucleus un-project to expose public (and some private) data over APIs to allow mashups. In short, we don’t want to wait even half a second whilst another service fetches the data we’re after, and we especially don’t want to waste more time parsing that data into a useful format.
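As a rough sketch of what that looks like in practice (hypothetical names throughout, and again assuming the PHP Mongo extension), an endpoint becomes little more than a key-based lookup and a json_encode, with no joins or reformatting in between:

    <?php
    // Events are stored in the shape we serve them, so a request is one lookup
    // plus serialisation; '_id' is projected out to keep the JSON clean.
    $mongo  = new Mongo('mongodb://localhost');
    $cursor = $mongo->nucleus->events->find(
        array('user_id' => (int) $_GET['user_id']),
        array('_id' => 0)
    )->limit(100);

    header('Content-Type: application/json');
    echo json_encode(iterator_to_array($cursor, false));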

We looked at several possibilities for how to store the data. An obvious one is a traditional RDBMS (Relational Database Management System) such as PostgreSQL or MS-SQL. In this instance we would most likely have used MySQL, since it fits smoothly into the almost universally supported LAMP (Linux, Apache, MySQL, PHP) stack available on our key development server. Alex and I are both well versed in using MySQL as a database and interfacing with it from PHP, so had we opted for an RDBMS it would have been the obvious choice, despite the rest of the University standardising on MS-SQL.
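For comparison, the normalised schema on the MySQL side would have looked something like this. Again, the tables and columns are purely illustrative, and InnoDB is assumed so the foreign key is actually enforced:

    <?php
    // A rough sketch of the relational alternative: time/space data split
    // across normalised tables, with integrity enforced by the database.
    $pdo = new PDO('mysql:host=localhost;dbname=recal', 'user', 'pass');
    $pdo->exec('
        CREATE TABLE rooms (
            id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
            name VARCHAR(100) NOT NULL
        ) ENGINE=InnoDB');
    $pdo->exec('
        CREATE TABLE events (
            id        INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
            title     VARCHAR(255) NOT NULL,
            starts_at DATETIME     NOT NULL,
            ends_at   DATETIME     NOT NULL,
            room_id   INT UNSIGNED NOT NULL,
            FOREIGN KEY (room_id) REFERENCES rooms (id)
        ) ENGINE=InnoDB');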
