This isn’t your grandmother’s API permissions control layer…

I’m guessing your grandmother probably didn’t have an API permissions control layer, but if she did this wouldn’t be it.

This post is mostly about Nucleus, our name for the storage layer which drives the Total ReCal components. The only way to communicate with Nucleus is over our RESTful API. This comes as something of a shock to some people who believe that the way to move data around is a batch script with direct database access, but I digress…

What I’m going to try to do here is summarise just how epically confusing our permissions handling system for Nucleus is, mostly for the benefit of Alex and myself who (over the next week or so) will be trying to implement this layer without breaking anything important. It’s really, really essential that we get this done before we start promoting the service, for a few simple reasons:

  • Data security is important, and we don’t want anybody being able to read everything without permission.
  • Data security is important, and we don’t want anybody being able to write all over the place without permission.
  • Changing this kind of thing on a live service is like trying to change the engine block on a Formula 1 car whilst it’s racing.
  • We need to be able to guarantee the system can hold up to DoS attacks or runaway processes hammering the APIs.
  • People are already asking for access to this data for important things, like their final year projects.

So, where to go from here? Let’s take a look at everything which will be going on in the finished version.

Server Rate Limiting

Even before the Nucleus code kicks in, the server is tuned to avoid being overloaded by any single IP address or hostname. Using a combination of the OS firewall and the web server configuration, overall request rates and bandwidth usage are kept below thresholds which ensure the server is never overloaded. Due to the RESTful nature of the API (in which each request must represent a complete transaction) we have no requirement for server affinity, so if the load gets too heavy we can easily scale horizontally using pretty much any load balancer.

To keep the pipes clear for our ‘essential’ services we do maintain a whitelist of IPs which have higher (but still not uncapped) limits.
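For illustration only, this is roughly the kind of per-IP throttling that happens at the web server layer. I’m not saying which modules or thresholds we actually use; the snippet below uses Apache’s mod_evasive purely as an example, with made-up numbers:

```apache
# Illustrative Apache config only: our real module choice and thresholds differ.
<IfModule mod_evasive20.c>
    DOSPageCount      10     # max requests for the same URI per interval...
    DOSPageInterval   1      # ...with an interval of 1 second
    DOSSiteCount      100    # max total requests per IP per interval
    DOSSiteInterval   1
    DOSBlockingPeriod 60     # offending IPs get 403s for the next 60 seconds
    DOSWhitelist      127.0.0.1  # whitelisted IPs bypass these checks
                                 # (our real whitelist still has caps)
</IfModule>
```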

Key Based Access

The only way to access any data in Nucleus is with an access token issued by our OAuth system. These come in two flavours: a user token (which grants permission on behalf of a specific user), or an autonomous token (which is issued at an application level, and is ‘anonymous’). The very first thing that happens with any request is that the token it presents is validated. No token, no access. Invalid token, no access. Revoked token, no access. To keep things nice and fast we store the token lookup table in memory with a cache of a few minutes, since most requests occur in ‘bursts’.
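As a minimal sketch of how that first check works (the function, table and cache names here are illustrative, not Nucleus’ actual code):

```php
<?php
// Sketch only: validate the access token on every request, consulting an
// in-memory cache first so bursts of requests don't hammer the database.
function validate_token($token, Memcached $cache, PDO $db)
{
    if ($token === null || $token === '') {
        return false; // No token, no access.
    }

    $cached = $cache->get('token_' . $token);
    if ($cached !== false) {
        return $cached === 'valid'; // Cache hit: skip the database entirely.
    }

    $stmt = $db->prepare('SELECT revoked FROM oauth_tokens WHERE token = ?');
    $stmt->execute(array($token));
    $row = $stmt->fetch(PDO::FETCH_ASSOC);

    // Invalid token, no access. Revoked token, no access.
    $valid = ($row !== false && !$row['revoked']);

    $cache->set('token_' . $token, $valid ? 'valid' : 'invalid', 300); // ~5 min
    return $valid;
}
```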


Moving Forward

Over the past week we’ve worked tirelessly to perfect our timetable import code, and we’ve now got a system that is working with real data. A select few students have now been given access to iCal feeds for both their timetables and their Blackboard assignments, and the Library is hoping to have their Talis Keystone system in place very soon, meaning we can start producing feeds of people’s book return dates.

Our next big job is to move away from bulk imports of data and instead start developing code that will go through and validate and verify events. This could mean looking for changes in the times of events, or verifying that the right students are seeing the right events (if a student changes course, for example). With these changes logged, we can then tackle one of the top requests students have of the University: being better informed of changes to their timetables.
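A rough sketch of what that verification pass might look like (the schema and the helper functions are assumptions at this stage, not written code):

```php
<?php
// Sketch only: compare each freshly parsed event against the stored copy
// and log anything that differs, so changes can later be reported.
foreach ($parsedEvents as $event) {
    $stmt = $db->prepare('SELECT start, end, room FROM events WHERE id = ?');
    $stmt->execute(array($event['id']));
    $stored = $stmt->fetch(PDO::FETCH_ASSOC);

    if ($stored === false) {
        insert_event($db, $event);          // brand new event (hypothetical helper)
    } elseif ($stored['start'] != $event['start']
           || $stored['room']  != $event['room']) {
        log_change($db, $stored, $event);   // time or room changed: record it
        update_event($db, $event);
    }
}
```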

The main timetables are produced by the Registry department; however, Registry isn’t informed if a lecturer is ill on a particular day, and in any case timetables currently aren’t updated until the following morning. We’re therefore planning to develop a tool for faculty offices to use so that they can make individual amendments to timetables when rooms need changing or lectures have been cancelled, meaning students can be informed sooner.

The logging of these changes will be important for Blackboard too. Certain schools and faculties like the idea of personalised assignment calendars, but their internal policies don’t allow staff to set deadlines inside Blackboard, because deadlines may be changed by lecturers without senior staff being informed. This is why the Computing School, for example, releases a huge Excel spreadsheet of deadlines: it means only two people have access to change them. We don’t want to end up building each department its own tool for managing assignment deadlines; we’d prefer everyone used Blackboard. With the ability to log changes to events, we could delay the update of a deadline in the student calendars by 24 or 48 hours, giving senior staff a period in which to change it back to the original date or leave it (i.e. approve the change).
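One way that approval window could work, sketched under an assumed schema (none of this is built yet):

```php
<?php
// Sketch only: log deadline changes instead of applying them immediately.
$holdHours = 48;

// When the import spots a changed deadline, record it with a timestamp.
$log = $db->prepare(
    'INSERT INTO deadline_changes (event_id, old_deadline, new_deadline, seen_at)
     VALUES (?, ?, ?, NOW())'
);
$log->execute(array($eventId, $oldDeadline, $newDeadline));

// When building student calendars, only apply changes which have either been
// explicitly approved or have survived unreverted past the hold window.
$apply = $db->prepare(
    'SELECT event_id, new_deadline
     FROM deadline_changes
     WHERE approved = 1
        OR seen_at < DATE_SUB(NOW(), INTERVAL ? HOUR)'
);
$apply->execute(array($holdHours));
```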

Our plan over the next few weeks is to perfect our API for querying events, give more students access to their iCal feeds and also start developing the front-end calendar application.

How We Make Things Faster

Today we’ve been playing around with our timetable parser to Nucleus connection, trying to work out why parsing and inserting was projected to take 19 days to finish.

This was a problem of many, many parts. First up was Alex’s code, which was performing an update to the event on Nucleus for each one of the 1.76 million lines associating students with events. Great fun, since Total ReCal communicates with Nucleus over HTTP, and our poor Apache server was melting. This was solved by using an intermediate table into which we could dump the 1.76 million lines (along with some extra data we’d generated, such as event IDs) and then read them back out again in the right order to make the inserts tidier. This reduced the number of calls to about 46,500, a mere 2.6% of the original figure.
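The shape of the fix looks roughly like this (the table name is illustrative, and nucleus_put() is a stand-in for our actual HTTP client code):

```php
<?php
// Sketch only: read the staging table back grouped by event, so each event
// needs a single HTTP call carrying its full list of students.
$rows = $db->query(
    'SELECT event_id, GROUP_CONCAT(student_id) AS students
     FROM import_staging
     GROUP BY event_id'
); // (mind group_concat_max_len for events with many students)

foreach ($rows as $row) {
    // ~46,500 PUTs to Nucleus instead of 1.76 million.
    nucleus_put('/events/' . $row['event_id'], array(
        'students' => explode(',', $row['students']),
    ));
}
```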

Next, we ran into an interesting problem inserting the events. The whole thing would go really quite fast until we’d inserted around 48 events, at which point it would drop to one insertion a second. Solving this involved sticking a few benchmark timers in our code to work out where the delay was happening, and after much probing we discovered that the unique ID generation code I’d created couldn’t cope with the volume of queries: since it was time-based it was running out of available ID numbers, and had to keep running through its loop until it found a free one, taking around a second per line. Changing this to use PHP’s uniqid() function solved that little flaw by making the identifier a bit longer, meaning that the chance of a collision is now really, really small.
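The gist of the change (the prefix shown is illustrative):

```php
<?php
// uniqid() derives its value from the current time in microseconds, and the
// second argument appends further entropy, so the identifier is longer and
// collisions are vanishingly unlikely even at a high insert rate.
$eventId = uniqid('event_', true);
// e.g. "event_4cd3a1f2b41c8.527340192"
```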

At the moment we’re running at about 33 inserts a second, meaning the complete inserting and updating of our entire timetable (at least the centrally managed one; the AAD faculty are off in their own little world) is done in a little over 20 minutes. We’ve had to turn off a couple of security checks, but even with these enabled the time does little more than double, and we’re currently not making use of any kind of caching on those checks (so we can get it back down again). There are also lots of other optimisations left to do.

A bit of quick number crunching reveals that we’re now running the process in a mere 0.08% of our original 19 days (19 days is roughly 27,360 minutes, of which a little over 20 minutes is about 0.08%). Not bad.

What We’ve Been Up To

It’s all been a bit quiet on the Total ReCal front for the past week or so, but not because we’ve been quietly doing nothing. Instead we’ve been quietly working on the supporting systems which let Total ReCal do its thing without needing to handle every single aspect of time/space management, user authentication and who knows what else.

The first thing we’ve got mostly complete is our new authentication system, built around the OAuth 2.0 specification (draft 10). For those of you unfamiliar with OAuth, it’s a way of providing systems with authorisation to perform an action without actually giving them a user’s credentials, much as modern luxury cars come with a ‘valet key’ which might provide a valet with limited driving range, limited top speed and no ability to open the boot. In the case of the University, we’ve come up with a service whereby a user (in this case a student or staff member) issues authorisation for a service to access or modify data stored within the University on their behalf.

Taking Total ReCal as the example, the user would issue a key which allows Total ReCal to read their timetable, assessment data and library data (from which it can extract various events such as lectures, hand-in dates and book due dates). What it doesn’t give is permission to read personal details, to book rooms under that person’s authority, to renew library books or indeed anything else which requires a specific permission. In addition, Total ReCal never sees the user’s authentication information – it simply doesn’t need to, because the key it’s been given by the user is authority enough to do what it needs.
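Purely as an illustration (the URL and endpoint below are invented, not our real API), a Total ReCal request for a user’s timetable might look like this, with the token standing in for credentials. Draft 10 of OAuth 2.0 sends the token in an ‘OAuth’ Authorization header:

```php
<?php
// Sketch only: an authorised API call using the user-issued key.
$ch = curl_init('https://api.example.lincoln.ac.uk/timetable/me');
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    'Authorization: OAuth ' . $accessToken, // the 'valet key'
));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$timetable = json_decode(curl_exec($ch), true);
curl_close($ch);
```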

We need OAuth for a variety of reasons. First of all, we were getting bored of having to write a whole new authentication system for every single application, and this makes our lives much easier. Secondly, and more relevantly, we want Total ReCal to be a demonstration of the Service Oriented Architecture way, showing that it’s possible to make use of small, focussed services which we bolt together as we need, rather than monolithic applications which do everything but don’t play nicely with other monolithic applications trying to do everything. Authentication is a key example of this, since it’s something common to almost every application. Thirdly, we want to explore more ways of giving the user control, and this is one of them. By relying on the OAuth authorisation route, users are given crystal clear information on what Total ReCal is, what it does, and how it intends to use their information. It’s then up to the user whether they want to use Total ReCal or not, and they can revoke the permission at any time. In future we hope to see lots more applications take this route, not necessarily just from within the University but also from outside.


The Total ReCal Plugins

A specific problem that the university faces is the aggregation, integration and publishing of ‘space-time data’; that is, data relating to the use of space (e.g. room bookings, geo-spatial location data) and time (e.g. timetables, event schedules, library book returns).

This project will address this problem by developing plugins for existing university systems that expose useful data which can then be aggregated into new web-based services. One of these web-based services will be a new calendaring system for students (initially, hopefully staff later).

All students’ calendars will comprise three core layers: academic timetable, assignment deadlines and book return dates. We will create plugins for the three data management systems the University uses for these: Blackboard, SirsiDynix Horizon (the HiP library portal), and our in-house developed timetable system.

Because we will have developed a standard for storing space-time data from these systems, we are also going to create a number of plugins so that other systems can add to the datastore. These include WordPress and, provided the University has moved to the 2007 version in time, Microsoft SharePoint.

Detailed here are our initial ideas as to how we intend to develop plugins for the systems to access their data.

Blackboard

One of the big motivations behind this project is that, as students, there is no easy way of finding out hand-in deadlines for assignments, being informed if the deadlines change, and seeing the deadlines marked on a calendar alongside our academic timetables (so that we can realise that we’ve got one week, not two, until that deadline!). For example, at the moment the media faculty releases an Excel spreadsheet that mixes deadlines for every module for every year group, which isn’t very useful when I’m trying to work out what has changed after a deadline is updated.

By September all faculties will be using Blackboard for detailing assignments. Many already are, and some have been for several years. When creating an assignment, there is an optional field that the academic can fill in to specify the deadline. Unfortunately, less than 10% of assignments created on Blackboard during the last academic year had anything in this field. Another problem we have is that a number of schools and faculties are making use of the Turn It In service (via a Blackboard plugin) and we have yet to investigate how Turn It In stores the data in Blackboard.

As we understand it (and we will have this verified), the license the University has with Blackboard allows us to develop on top of the Blackboard API and also to access the underlying database (which is MS-SQL based). As neither Nick nor I are particularly well versed in Java, and the API doesn’t seem to give us access to the information we need, we believe the route we should go down is to access the data straight from the database.

Therefore we will create a script, executed on a cron job, that checks for new assignments in the Blackboard database and verifies the date and time of existing assignments. Additionally, we will try to ensure that academics use the deadline field when creating assignments.
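A first sketch of that cron script, with the big caveat that we haven’t yet verified the Blackboard schema (every table and column name below is a guess, and nucleus_put() is a stand-in for our API client):

```php
<?php
// Sketch only: find new or changed assignments and push deadlines to Nucleus.
$db = new PDO('dblib:host=blackboard-db;dbname=BBLEARN', $user, $pass);

$stmt = $db->prepare(
    'SELECT assignment_id, course_id, title, due_date
     FROM assignments
     WHERE modified_date > ?'
);
$stmt->execute(array($lastRunTimestamp));

foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $assignment) {
    if ($assignment['due_date'] === null) {
        continue; // the optional deadline field was left empty
    }
    nucleus_put('/events/blackboard/' . $assignment['assignment_id'], $assignment);
}
```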

Horizon

Through work that we’re doing on our Jerome “un-project” we have a head start on accessing data from Horizon. The University has invested in Talis Keystone, which integrates with Horizon and exposes the data over a friendly REST/SOAP web service. Using the APIs we’re developing for Jerome, we intend to access book return dates for individuals and publish these as one of the Total ReCal layers.
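Something along these lines, though the Keystone endpoint and response shape shown here are assumptions until the Jerome wrappers are finished:

```php
<?php
// Sketch only: turn a student's current loans into library-layer events.
$xml = simplexml_load_file(
    'https://keystone.example.lincoln.ac.uk/loans?borrower=' . urlencode($studentId)
);

foreach ($xml->loan as $loan) {
    // Each due date becomes an event on the Total ReCal library layer
    // (nucleus_put() is a stand-in for our API client).
    nucleus_put('/events/library/' . (string) $loan->itemId, array(
        'title' => (string) $loan->title,
        'due'   => (string) $loan->dueDate,
    ));
}
```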

Academic Timetables

Back in November 2009 I was incredibly bored one night and I hacked around with our student timetables to create subscribable iCalendar feeds. The script works by screen-scraping our timetables (here is mine) and then interpreting the JavaScript on the page to produce an array of events which can then be turned into ics format.

Our timetable system was written in-house many years ago, so we’ve got a lot of control over the output. For the time being we’re not going to completely replace the HTML version of the timetables, but will instead add a new script that generates the ics feeds alongside the timetable renders (this happens on a cron job at 3am every morning).
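For a feel of what that script does, here is a minimal version of the array-to-ics step (the event field names are assumptions; the real scraper’s output differs):

```php
<?php
// Sketch only: serialise an array of scraped events as an iCalendar feed.
function events_to_ics(array $events)
{
    $ics  = "BEGIN:VCALENDAR\r\nVERSION:2.0\r\n";
    $ics .= "PRODID:-//University of Lincoln//Timetable//EN\r\n";

    foreach ($events as $event) {
        $ics .= "BEGIN:VEVENT\r\n";
        $ics .= 'UID:' . $event['id'] . "\r\n";
        $ics .= 'DTSTART:' . gmdate('Ymd\THis\Z', $event['start']) . "\r\n";
        $ics .= 'DTEND:' . gmdate('Ymd\THis\Z', $event['end']) . "\r\n";
        $ics .= 'SUMMARY:' . $event['title'] . "\r\n";
        $ics .= 'LOCATION:' . $event['room'] . "\r\n";
        $ics .= "END:VEVENT\r\n";
    }

    return $ics . "END:VCALENDAR\r\n";
}
```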

WordPress and others

A side project of mine has been developing a system that can add location awareness to our online services. When you visit one of these services your IP address is sent to this system and matched against a list of IP ranges for the University’s wireless and wired networks. The response, if you are on campus, is the building you’re in, the campus you’re on and whether you’re on a wired or wireless connection. If you’re not on campus then it will list your closest campus and roughly where in the world you are (using the MaxMind database).
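The matching itself is simple enough; a minimal sketch (the ranges and building names below are invented):

```php
<?php
// Sketch only: match an IP against known campus network ranges.
$networks = array(
    array('from' => '10.10.0.0', 'to' => '10.10.255.255',
          'building' => 'Main Building', 'campus' => 'Brayford', 'type' => 'wired'),
    // ...one entry per wired/wireless range on each campus
);

function locate($ip, array $networks)
{
    $addr = ip2long($ip); // numeric form makes range comparison trivial
    foreach ($networks as $net) {
        if ($addr >= ip2long($net['from']) && $addr <= ip2long($net['to'])) {
            return $net; // on campus: building, campus and connection type
        }
    }
    return null; // off campus: fall back to the MaxMind GeoIP lookup
}
```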

We will develop a WordPress plugin that will query this system when someone creates a blog post on our blogs.lincoln.ac.uk platform and then push this information to Total ReCal. A hypothetical mashup we could then build with this data is a heat-map of blog posts tagged “research” overlaid on Google Maps, so we can see where the most research blogging is going on at the University of Lincoln.
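The plugin itself should only need to hook the publish action; a sketch, with both service URLs invented for illustration:

```php
<?php
// Sketch only: on publish, look up the author's location and push it on.
add_action('publish_post', 'totalrecal_push_location', 10, 2);

function totalrecal_push_location($post_id, $post)
{
    // Ask the location service where this request is coming from.
    $location = wp_remote_retrieve_body(wp_remote_get(
        'https://location.example.lincoln.ac.uk/lookup?ip=' . $_SERVER['REMOTE_ADDR']
    ));

    // Hand the post and its location over to Total ReCal.
    wp_remote_post('https://nucleus.example.lincoln.ac.uk/events', array(
        'body' => array(
            'post_id'  => $post_id,
            'title'    => $post->post_title,
            'time'     => $post->post_date_gmt,
            'location' => $location,
        ),
    ));
}
```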

When we know the situation with SharePoint we can plan for potential plugins for it too.