How we make things faster.

Today we’ve been playing around with the connection between our timetable parser and Nucleus, trying to work out why parsing and inserting the data was projected to take 19 days to finish.

This was a problem of many, many parts. First up was Alex’s code, which was performing an update to the event on Nucleus for each one of the 1.76 million lines associating students with events. Great fun, since Total ReCal communicates with Nucleus over HTTP and our poor Apache server was melting. This was solved by using an intermediate table into which we could dump the 1.76 million lines (along with some extra data we’d generated, such as event IDs) and then read them back out again in the right order to make the inserts tidier. This reduced the number of calls to about 46,500, a little under 3% of the original workload.
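To illustrate the approach, here’s a rough PHP sketch of the intermediate-table trick. It isn’t the actual Total ReCal code: the table, the columns and the update_nucleus_event() helper are all invented for the example.

```php
<?php
// Sketch only. $parsedLines is assumed to hold the ~1.76 million
// ['event_id' => ..., 'student_id' => ...] rows produced by the parser.

$db = new PDO('mysql:host=localhost;dbname=totalrecal', 'user', 'pass');

// 1. Dump every parsed line into a staging table instead of making an
//    HTTP call to Nucleus for each one.
$db->exec('CREATE TABLE IF NOT EXISTS staging_attendance (
    event_id VARCHAR(32) NOT NULL,
    student_id VARCHAR(16) NOT NULL,
    INDEX (event_id)
)');

$insert = $db->prepare('INSERT INTO staging_attendance (event_id, student_id) VALUES (?, ?)');
foreach ($parsedLines as $line) {
    $insert->execute(array($line['event_id'], $line['student_id']));
}

// 2. Read the lines back out grouped by event, so each event needs only one
//    update call to Nucleus (~46,500 calls instead of ~1.76 million).
$rows = $db->query('SELECT event_id, student_id FROM staging_attendance ORDER BY event_id');
$students = array();
$current = null;
foreach ($rows as $row) {
    if ($current !== null && $row['event_id'] !== $current) {
        update_nucleus_event($current, $students); // hypothetical HTTP call to Nucleus
        $students = array();
    }
    $current = $row['event_id'];
    $students[] = $row['student_id'];
}
if ($current !== null) {
    update_nucleus_event($current, $students);
}
```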

Next, we ran into an interesting problem inserting the events. The whole thing would go really quite fast until we’d inserted around 48 events, at which point it would drop to one insertion a second. Solving this involved sticking a few benchmark timers in our code to work out where the delay was happening. After much probing we discovered that the unique ID generation code I’d created couldn’t cope with the volume of queries: since it was time based, it kept running out of available ID numbers and having to loop until it found a free one, taking around a second per line. Changing this to use PHP’s uniqid() function solved that little flaw by making the identifier a bit longer, meaning that the chance of a collision is now really, really small.
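A minimal sketch of the change, with the caveat that the “before” generator is only a guess at the pattern described above, and the use of uniqid()’s more-entropy flag is an illustrative choice rather than a statement of what shipped:

```php
<?php
// Before (illustrative): a timestamp plus a tiny random suffix gives a very
// small ID space per second, so once it fills up the loop spins until the
// clock ticks over -- roughly a second per line.
function old_style_id(array &$used) {
    do {
        $id = date('YmdHis') . mt_rand(0, 99);
    } while (isset($used[$id]));
    $used[$id] = true;
    return $id;
}

// After: uniqid() produces a longer, microsecond-based identifier (here with
// the extra-entropy flag), so collisions are vanishingly unlikely and no
// retry loop is needed.
$eventId = uniqid('event_', true);
```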

At the moment we’re running at about 33 inserts a second, meaning the complete inserting and updating of our entire timetable (at least the centrally managed one; the AAD faculty are off in their own little world) is done in a little over 20 minutes. We’ve had to turn off a couple of security checks, but even with these enabled the time does little more than double, and we’re currently not making use of any kind of caching on those checks (so we can get it back down again). There are also lots of other optimisations left to do.
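When we do bring caching in, it could be as simple as memoising each check result for the duration of the run. A rough PHP sketch, where check_permission() and its arguments are invented for illustration rather than being the real check:

```php
<?php
// Cache the result of an expensive security check (HTTP round trip, database
// lookup, etc.) so repeated checks during a long import are effectively free.
function cached_permission_check($token, $resource) {
    static $cache = array();
    $key = $token . '|' . $resource;

    if (!isset($cache[$key])) {
        $cache[$key] = check_permission($token, $resource); // hypothetical expensive check
    }
    return $cache[$key];
}
```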

A bit of quick number crunching reveals to me that we’re now running the process in a mere 0.08% of our original 19 days. Not bad.
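(For reference: 19 days is roughly 27,360 minutes, so a run in the low twenties of minutes is indeed around 0.08% of the original estimate.)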

To NoSQL or not to NoSQL?

As part of Total ReCal we’ve been taking a look at the so-called NoSQL approach to databases. I gave a quick overview of NoSQL and why we were looking at it in a previous blog post, so I’m going to skip the gory details of what NoSQL actually is (and why we’re using it) and leap straight into a discussion of whether it’s any good, whether it’s ready for prime time, and whether the HE sector can actually use it in production.

Is it any good?

In a word, yes. In slightly more words, yes, but only if you use it in the right place. NoSQL is excellent at providing fast, direct access to massive sets of unstructured data. By ‘fast’ I mean ‘thousandths of a second’, and by ‘massive’ I mean ‘billions of items’. On the other hand, if you’re after rock-solid data integrity and the ability to perform functions like JOIN queries then you’re out of luck and you should stick to an RDBMS. The two approaches aren’t competing, but offer complementary functionality. A corkscrew and a bottle opener both let you into your drink, but it’ll be amazingly awkward to open your beer with a corkscrew.
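To make the corkscrew/bottle opener point concrete, here’s a rough sketch of the two access styles in PHP. The table, collection and field names are invented, and the document-store half assumes MongoDB via the PECL mongo extension; it illustrates the shape of each query rather than any production code.

```php
<?php
$eventId = 'ABC123'; // invented identifier

// RDBMS: integrity and JOINs, with the answer assembled at query time.
$pdo = new PDO('mysql:host=localhost;dbname=timetable', 'user', 'pass');
$stmt = $pdo->prepare(
    'SELECT e.title, e.starts_at, r.name AS room
     FROM events e
     JOIN rooms r ON r.id = e.room_id
     WHERE e.id = ?'
);
$stmt->execute(array($eventId));
$event = $stmt->fetch(PDO::FETCH_ASSOC);

// Document store: one denormalised document fetched straight off a key.
// No JOIN is available (or needed), because the room details live inside
// the event document itself.
$mongo = new Mongo();
$event = $mongo->timetable->events->findOne(array('_id' => $eventId));
```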

Why NoSQL?

After looking at the initial brief for Total ReCal, we realised that it would be necessary to build a new data storage layer to handle the time/space information which drives the project. There are many reasons for this, both technical and political, but the key one is that since we are running what is effectively an abstraction and amalgamation service, we want to be able to interface directly with our own copy of the data. Here’s why.

Speed is often considered a luxury when dealing with large data sets; especially in larger institutions it’s common to think nothing of waiting a few minutes for a report to finish building or for an operation to finish processing. We wanted to offer something you could happily hit with 20–30 queries a second over an API. This is particularly relevant given our larger Nucleus un-project to expose public (and some private) data over APIs to allow mashups. In short, we don’t want to wait even half a second whilst another service fetches the data we’re after, and we especially don’t want to waste more time parsing that data into a useful format.

We looked at several possibilities for how to store the data. An obvious one to take a look at is a traditional RDBMS (Relational Database Management System) such as PostgreSQL or MS-SQL. In this instance we would most likely have been using MySQL, since it fits smoothly into the almost universally supported LAMP (Linux, Apache, MySQL, PHP) stack which is available on our key development server. Alex and I are both well versed in using MySQL as a database and interfacing with it through PHP, so had we opted for an RDBMS it would have been the obvious choice despite the rest of the University standardising on MS-SQL.
