We’ve got several live datasets in our data store, including feeds from the air quality sensors, the car park gates, and the availability of bikes on the Next Bike stands around the city. Everyone loves a live, regularly updating dataset. They let you build dashboards and other visualisations that can help people make immediate decisions. But, as we’ve found over the last year or so, live data offers its own challenges.
While I’m writing this, and as you can see in the picture above, the Next Bike stand on Moorland Road is damaged. Rack number five has red tape around it, which means you can’t leave a bike there. The data set is currently showing the maximum capacity of the stand as seven, when in fact it’s six. This is a problem if you’re looking for somewhere to leave your rented bike.
At the end of last year a section of Avon Street car park, covering 414 parking spaces, was closed due to vandalism. The data set still reported the full capacity, so the parking apps were showing incorrect data during this period.
The carbon monoxide sensor on Windsor Bridge occasionally seems to glitch and show recordings of 10B ppm, suggesting the bridge is embedded in a solid block of carbon.
And sometimes the real-time information screens at the local bus stops are missing a bus.
Live and real-time information is based on sensors and equipment that can be accidentally or deliberately broken. When the data is available to only a few people, perhaps those involved in looking after the car park or reporting on air quality, then the impacts are likely to be quite small. There are likely to be existing processes to handle these issues, such as issuing a press release, closing off affected areas and, increasingly, reporting via social media.
But when that information is published as open data and made available through live feeds and applications, the impacts of poor or incorrect data might be larger. This is a side effect of opening up live and real-time information that many data publishers may not have considered.
How do you handle this type of issue? We’d suggest there are several ways:
- As developers you need to be aware that real-world data is often messy and broken. Be prepared to deal with weird outliers. We can’t simply remove all suspect data from the store because it might actually be useful. For example, what if we want to report on sensor reliability?
- As data publishers you should produce documentation that describes what quality control processes you carry out. Here’s an example for our parking data. This can help build confidence in the data and set expectations.
- Data publishers and developers need to keep one another informed about significant problems or changes. Whether it’s via social media or tools like Slack, it’s important for the publishers and consumers of data to share information so that any urgent issues can be addressed.
- Data publishers might also want to support historical analysis by adding annotations to their data to explain outliers and add context. One way to do this is to keep a separate data set that records significant events, allowing someone to do some quality control or cleaning at a later date.
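To make the first and last of these suggestions a bit more concrete, here is a rough Python sketch of flagging suspect readings rather than deleting them, and attaching notes from a separate events data set. Everything in it is an assumption for illustration: the `Reading` and `Event` shapes, the sensor identifier, the dates and the note text are made up, and it is not how our data store actually works. The only hard fact it relies on is that a parts-per-million value can't exceed one million; a real check would probably use a much tighter, sensor-specific limit.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Reading:
    sensor_id: str       # hypothetical identifier, not a real feed name
    taken_at: datetime
    value: float         # concentration in ppm


@dataclass
class Event:
    sensor_id: str
    start: datetime
    end: datetime
    note: str            # free-text explanation kept by the publisher


# Physical upper bound: "parts per million" cannot exceed one million.
MAX_PPM = 1_000_000


def annotate(reading, events):
    """Keep the reading, but flag it if implausible and attach any event notes."""
    suspect = not (0 <= reading.value <= MAX_PPM)
    notes = [
        e.note
        for e in events
        if e.sensor_id == reading.sensor_id and e.start <= reading.taken_at <= e.end
    ]
    return {"reading": reading, "suspect": suspect, "notes": notes}


# Made-up sample data, for illustration only.
events = [
    Event("example-co-sensor", datetime(2000, 1, 10), datetime(2000, 1, 12),
          "Sensor fault reported"),
]
readings = [
    Reading("example-co-sensor", datetime(2000, 1, 11, 9, 0), 10_000_000_000.0),
    Reading("example-co-sensor", datetime(2000, 1, 20, 9, 0), 0.4),
]
annotated = [annotate(r, events) for r in readings]
```

Because nothing is thrown away, the flagged readings are still there if someone later wants to report on sensor reliability, and the notes give them the context to decide what to clean.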
At Bath: Hacked we’re doing the first three of these already. But, as we move into having nearly a year’s worth of historical data for some of our live datasets and an increasing number of mobile applications reporting on parking data, we’ll need to think about annotating some of our data.
What approaches are you taking to deal with the live data you’re publishing or consuming?