After looking at the initial brief for Total ReCal, we realised that it would be necessary to build a new data storage layer to handle the time/space information which drives the project. There are many reasons for this both technical and political, but the key reason is that since we are running what is effectively an abstraction and amalgamation service we want to be able to interface directly with our own copy of the data; here’s why.
Speed is often considered a luxury when dealing with large data sets, and especially in larger institutions it’s common to think nothing of waiting a few minutes for a report to finish building or for your operation to finish processing, but we wanted to offer something where you could happily hit it with 20-30 queries a second over an API. This is particularly relevant given our larger Nucleus un-project to expose public (and some private) data over APIs to allow mashups. In short, we don’t want to have to wait for even half a second whilst another service gets the data we’re after, and we especially don’t want to have to waste more time parsing the data into a useful format.
We looked at several possibilities for how to store the data. An obvious one to take a look at is a traditional RDBMS1 such as PostgreSQL or MS-SQL. In this instance we would most likely have been using MySQL, since it fits smoothly into the almost universally supported LAMP2 stack which is available on our key development server. Alex and myself are both well-versed in using MySQL as a database and interfacing with it using PHP, so should we have opted for an RDBMS it would be the obvious choice despite the rest of the University standardising on MS-SQL.
However, despite RDBMS being a more traditional and defined way of storing data (it’s been around since the early days of databases) and MySQL powering things like Facebook, YouTube, Wikipedia and even parts of Google there is a new way of doing things. NoSQL is a movement which promotes the use of loosely defined data types and non-relational storage methods with an emphasis on speed and resilience under stress. NoSQL has made big advances for its relatively tender age, with the concepts already used to drive Facebook’s inbox search, Digg, Reddit and more.
There are quite a few NoSQL style databases, including BigTable (from Google, powering big projects like Google Maps and Gmail) and Dynamo (from Amazon, which drives AWS3). These two are proprietary, but open-source offerings include Cassandra from the Apache Software Foundation and MongoDB by 10gen amongst many others. We’ve already started using Mongo to power Nucleus and Jerome and we’re particularly taken by its smooth scalability (including in-built sharding) and document storage (as opposed to key/value pairs).
Since we’re mostly reading from the database rather than writing to it, and doing very little in the way of manipulation, it makes a lot of sense to take the NoSQL tack as opposed to something emphasising data robustness (although despite Mongo defaulting to providing a BASE4 set of characteristics, atomic operations are available on a per-operation basis at a slight cost of speed).
Mongo also offers a huge bonus in that it is a document database, meaning that unlike a traditional RDBMS table we can store whatever we feel like within a database entry. This becomes particularly relevant when you take a look at the massive variation in the amount and type of information stored with an event or a location. A timetable event will have information associated which covers not only start and end times and location, but potentially also the class being taught, the students, lecturer, further reading, the location of slides on Blackboard and more. By contrast a deadline has no location (although some might want to specify places you can submit) or end time, but may want to include metadata on if the deadline is original, an extension, a resubmission or if it’s been moved. Locations similarly vary, with a lecture theatre potentially including everything from how many people a room seats through to how finely you can control the lighting whereas the location information for Student Support is basically limited to building, floor and opening times.
Obviously to represent this in a traditional table would either take hundreds of columns to capture all the available data or building our own key/value storage system on the top. By contrast in a document database we can just build the information as an array (or nested arrays) and save it, the DBMS will deal with optimisation and storage for us. We gain huge benefits in speed and performance, as well as the ability to perform queries against the data far more easily without resorting to complex JOIN statements.
So, in summary, we’ve opted for MongoDB – a document storage NoSQL database – on the grounds of improved speed (albeit at a potential loss of consistency in high demand situations) and flexibility in data storage (at a loss of any kind of database enforced referential integrity).
Total ReCal’s database will primarily be used to store the associations between events, their sources, and the relevant users or groups. The events and locations themselves will be stored in Nucleus and accessed exclusively via APIs, although their development should be considered part of Total ReCal with the side effect that we gain nicely defined event and location APIs.