Posted on October 14th, 2010 by Alex Bilbie
Now that we’ve got live data being produced from Blackboard and CEMIS, we can start writing scheduled jobs to insert this data into the Total ReCal database. In the case of CEMIS, however, we’re having a few problems.
Every day a CSV file is created of all of the timetable events for each student. This file is (currently) 157 MB in size and has approximately 1.7 million rows. In his last post, Nick explained that we have now developed an events API for our Nucleus metadata service, which is going to be the repository for all of this time/space data. Currently we’re able to parse each row in this CSV file and insert it into Nucleus over the API at about 0.9s per row. This isn’t good enough. As my tweet the other day showed, we need to significantly speed this up.
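As a rough check on why that rate is a problem, the figures above can be multiplied out. This is a back-of-the-envelope sketch in Python (not the PHP the import script uses), and it assumes the rows are sent strictly one after another at the quoted rate:

```python
# Figures quoted in the post
ROWS = 1_700_000          # approximate rows in the daily CSV file
SECONDS_PER_ROW = 0.9     # current parse-and-insert time per row

# Assumption: rows are processed sequentially with no overlap
total_seconds = ROWS * SECONDS_PER_ROW
total_days = total_seconds / 86_400  # seconds in a day

print(f"{total_seconds:,.0f} seconds, about {total_days:.1f} days")
```

A file that is regenerated every day but takes more than two weeks to import can never catch up.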
At the moment we’re simply streaming data out of the CSV file line by line (using PHP’s fgets function) and then sending it to Nucleus over cURL. We have two main problems. The first is that the CSV file is generated one student at a time, so ideally it needs to be re-ordered to group rows by unique event ID. That would reduce the number of calls to Nucleus, because we could send each event once with all of its associated students. The second is that parsing and building large arrays results in high memory usage, and our server currently has only 2 GB of memory to play with. We’ve capped PHP at 1 GB for now, but that is going to affect Apache and the other processes running on the server. We don’t want to just stick more memory into the machine, because that wouldn’t encourage us to fine-tune our code, so that isn’t an option at the moment.
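The re-ordering idea can be sketched as follows. This is an illustration in Python rather than our PHP script, and the column names `student_id` and `event_id` are assumptions made for the example, not the real CEMIS export format:

```python
import csv
import io
from collections import defaultdict


def group_students_by_event(csv_file):
    """Stream rows one at a time and collect student IDs under each event ID."""
    students_by_event = defaultdict(list)
    for row in csv.DictReader(csv_file):
        # Keep only the two IDs, not the whole row, to limit memory use
        students_by_event[row["event_id"]].append(row["student_id"])
    return dict(students_by_event)


# Tiny stand-in for the real file, ordered one student at a time
sample = io.StringIO(
    "student_id,event_id\n"
    "s1,e100\n"
    "s1,e200\n"
    "s2,e100\n"
)

grouped = group_students_by_event(sample)
print(grouped)
```

Three rows become two events, so only two calls would be needed instead of three. The mapping still has to fit in memory, so whether this works within a 1 GB cap for 1.7 million rows would need measuring.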
Over the next few days we’re going to explore a number of options, including altering the current script to send batched data using asynchronous cURL requests, and then re-writing that script in a lower-level language. The second is going to take a bit of time, as one of us learns a new language. Both should hopefully result in a significant decrease in execution time.
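To give a feel for the batching option, here is a minimal sketch using Python's asyncio in place of asynchronous cURL. The `send_batch` function is a hypothetical stand-in that makes no network call, and the batch size and concurrency limit are made-up values, not settings we have tested:

```python
import asyncio
from itertools import islice


def batches(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


async def send_batch(batch):
    # Hypothetical placeholder for one HTTP request carrying a whole batch
    await asyncio.sleep(0)
    return len(batch)


async def send_all(events, batch_size=500, max_in_flight=4):
    """Send every batch, with at most `max_in_flight` requests at once."""
    limit = asyncio.Semaphore(max_in_flight)

    async def guarded(batch):
        async with limit:
            return await send_batch(batch)

    results = await asyncio.gather(
        *(guarded(b) for b in batches(events, batch_size))
    )
    return sum(results)


sent = asyncio.run(send_all(range(1234), batch_size=500))
print(sent)
```

With these numbers, 1,234 events go out in three requests rather than 1,234. Note that this sketch builds every batch before sending, so a version for the full file would need to feed batches in gradually to keep memory use down.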
I’ll write another post soon that explains our final solution.
Posted on October 13th, 2010 by Nick Jackson
We’ve explained what Mongo and NoSQL are, and why we’re using them. Now it’s the turn of the actual data access and manipulation methods, something we’ve termed Nucleus.
Nucleus is part of a bigger plan which Alex and I have been looking at around using SOA principles for data storage at Lincoln, in short building a central repository for just about anything around events, locations, people and other such ‘core’ data. We’re attempting to force any viewing or manipulation of those data sets through central, defined, secured and controlled routes more commonly known as Application Programming Interfaces, or APIs.
In the past it would be common for there to be custom code sitting between services, responsible for moving data around. Often this code would talk directly to the underlying databases and provide little in the way of sanity checking. Following the ancient principle of “Garbage In, Garbage Out”, it wouldn’t be unheard of for a service to fail and the data synchronisation script to duly fill an important database with error messages, stray code snippets and other such invalid nonsense. The applications which relied on this data would continue as though nothing was wrong, trying to read it and then crashing in a huge ball of flames. Inevitably this led to administrators having to manually pick through a database to put everything back in its place.
Posted on September 17th, 2010 by Nick Jackson
As part of Total ReCal we’ve been taking a look at the so-called NoSQL approach to databases. I gave a quick overview of NoSQL and why we were looking at it in a previous blog post, so I’m going to skip all the gory details of what NoSQL actually is (and why we’re using it), and leap straight into the discussion of whether it’s any good, whether it’s ready for prime-time, and whether it’s ready for the HE sector to actually use in production.
Is it any good?
In a word, yes. In slightly more words, yes, but only if you use it in the right place. NoSQL is excellent at providing fast, direct access to massive sets of unstructured data. By ‘fast’ I mean ‘thousandths of a second’, and by ‘massive’ I mean ‘billions of items’. On the other hand, if you’re after rock-solid data integrity and the ability to perform functions like JOIN queries then you’re out of luck and you should stick to an RDBMS. The two approaches aren’t competing, but offer complementary functionality. A corkscrew and a bottle opener both let you into your drink, but it’ll be amazingly awkward to open your beer with a corkscrew.
Posted on August 25th, 2010 by Nick Jackson
It’s all been a bit quiet on the Total ReCal front for the past week or so, but not because we’ve been quietly doing nothing. Instead we’ve been quietly working on the supporting systems which let Total ReCal do its thing without needing to handle every single aspect of time/space management, user authentication and who knows what else.
The first thing we’ve got mostly complete is our new authentication system, built around the OAuth 2.0 specification (version 10). For those of you unfamiliar with OAuth, it’s a way of providing systems with authorisation to perform an action without actually giving them a user’s credentials, much as modern luxury cars come with a ‘valet key’ which might provide a valet with limited driving range, limited top speed and no ability to open the boot. In the case of the University we’ve come up with a service whereby a user (in this case a student or staff member) issues authorisation for a service to access or modify data stored within the University on their behalf.
Taking Total ReCal as the example, the user would issue a key which allows Total ReCal to read their timetable, assessments data and library data (from which it can extract various events such as lectures, hand-in dates and book due dates). What it doesn’t give is permission to read personal details, to book rooms under that person’s authority, to renew library books or indeed anything else which requires a specific permission. In addition to this, Total ReCal never sees the user’s authentication information – it simply doesn’t need to, because the key it’s been given by the user is authority enough to do what it needs.
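The idea of a key that carries only specific permissions can be pictured with a small sketch. This is purely illustrative Python: the `AccessKey` class and the permission names such as `timetable:read` are invented for the example and are not taken from our service or from the OAuth specification:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessKey:
    """Hypothetical key issued by a user to one application."""
    user: str
    scopes: frozenset  # the permissions the user agreed to, and nothing more


def is_allowed(key, required_scope):
    """An action is permitted only if its scope was explicitly granted."""
    return required_scope in key.scopes


# The key holds no password or other credentials, only granted permissions
key = AccessKey(
    user="student-1",
    scopes=frozenset({"timetable:read", "assessments:read", "library:read"}),
)

print(is_allowed(key, "timetable:read"))  # granted by the user
print(is_allowed(key, "library:renew"))   # never granted, so refused
```

Anything not listed on the key is refused by default, which is the property that matters here.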
We need OAuth for a variety of reasons. First of all, we were getting bored of having to write a whole new authentication system for every single application, and this makes our lives much easier. Secondly, and more relevantly, we want Total ReCal to be a demonstration of the Service Oriented Architecture way, showing that it’s possible to make use of small, focussed services which we bolt together as we need, rather than monolithic applications which do everything but don’t play nicely with other monolithic applications trying to do everything. Authentication is a key example of this since it’s something common to almost every application. Thirdly, we want to be able to explore more ways of giving the user control, and this is one of them. By relying on the OAuth authorisation route, users are given crystal clear information on what Total ReCal is, what it does, and how it intends to use their information. It’s then up to the user whether they want to use Total ReCal or not, and they can revoke the permission at any time. In future we hope to see lots more applications take this route, not necessarily just from within the University but also from outside.