After a few months break from Bath: Hacked after the event in March, we recently gathered to begin planning the two new events coming this year: the ‘Air Quality’ hack and ‘Bath: Hacked 2.1’ coming on Saturday 20th September and Saturday 1st November respectively.
We learnt a heck of a lot from planning the previous hack and it also helped us acquire a few more keen developers interested in data and shaping the direction of future events.
Enter the Datastore
The feedback we immediately received was that we lacked a proper datastore – somewhere we could put information where it was accessible, clean and ready to be used. Enter Socrata – a US base company specialising in open datastores which kindly gave us an account to play with. Two chaps from Socrata even attended our opening night in which we unveiled the platform at a micro-event on 7th August at the Guild. Since then I’ve had a good chance to play with the datastore and wanted to write a short piece about my experience with it.
The first thing you should know is that I know nothing/very little about datastores, JSON, APIs or getting them all working together beyond managing the databases that accompany WordPress installations in my daily role. However, I wanted to take a more active role on the Bath: Hacked committee beyond just running the website and so decided to dive in, head-first, into playing with the datastore and seeing if I could create something useful myself. This was partially so that I could use it at the next hack but mostly because I felt I could add value beyond what I had currently been doing.
My first interest was crime data. Don’t ask me why, but it just seemed like a glaringly big hole in the collection we already had and knowing that Police.uk had made the data accessible, I chose to begin with that. Going to data.police.uk told me that everything from December 2010 to June 2014 was currently available with monthly updates due to deliver the rest. After a brief read of the Socrata Knowledge Base, they suggested downloading and importing the data as a CSV manually before attempting an automated upload using their Publisher API. A short wait later and a bit of command line magic, I had a collection of CSV files of all crime data for the ‘Avon & Somerset Constabulary’, merged into one giant CSV. I then created a dataset in Socrata (note: you need to be a member of the datastore in order to do this), filled out the metadata (the info that explains what the data is, what license it has, etc.) as best I understood at the time and began uploading the CSV. The finished result was a rather crude table layout of lots of useful and useless information that I could then create a map view of to show the rough locations of the crimes. Socrata makes this part very easy so I was immediately quite please with my work, even though anyone with basic computer skills could accomplish the same.
My next task was to start getting the imported data into something that was more useful. The CSVs that Police.uk provide go through an anonymisation process prior to being made available so a couple of the columns (crime ID, outcome, etc.) came blank. The location of the crimes came split in two cells, one for latitude and one for longitude and I had to tell Socrata which cells to use. Apparently there were better ways to do this so thought this part could be improved. The main issue however, was the lack of a unique identifier for each row which I knew would be key in keeping the data up-to-date without overwriting rows by accident. The crime ID would have made a great candidate but it wasn’t available through the CSV downloads. It was however, available via the Police UK API. This was my next challenge.
I opened my giant CSV, removed all data apart from the column headers that I wanted, saved it and re-uploaded it to to my dataset to remove all of the data and give myself a blank canvas to start from. With the help of Tom Fletcher (not on the committee but an attendee of the first Bath: Hacked and subsequent micro-events) we sat down and put together a basic script using PHP that would grab all the data from Police.uk via their API and publish it to our Socrata dataset using their Publisher API. Socrata again, make this easy by providing a handful of simple libraries to put, post and get data from a dataset using a selection of popular languages.
The hardest part was trying to get the data from Police.uk. Although the API was easy to use, it didn’t allow direct access to crimes in the ‘Avon & Somerset Constabulary’ like the CSV downloads did and so we had to open Google Earth, draw a polygon that was as accurate as we had the time/effort for around BANES, export the result as a KML, open it in a text editor and pull out the points of the area. After formatting this into a useable URL and a little poking around trying to work out where our datastore and then correct dataset was, we managed to ‘put’ a single row of test data up. Hurrah! The Socrata documentation was a little lacking for this part but it’s been reported and hopefully, they’ll update their docs soon.
As the Police.uk API doesn’t require authentication, connecting to it and grabbing the data was very straight-forward. We then grabbed what we wanted, set our script to add it to the column IDs that we had already set-up in Socrata using the blank CSV and performed an initial data ‘put’. A few minutes later and a ‘Done!’ message in our browser, we checked the dataset and low and behold, all the crime data was there and in the correct columns – Crime ID, Month, Crime Category, Location (correctly formatted) and Street Name.
At this stage, there were two things that needed doing. First we needed to set the ‘API endpoint’ to have a unique identifier. In a nutshell, this allows you to choose which column should make each row unique to the dataset. This allows future updates to avoid overwriting your data and instead, will append it to the dataset. We set this to Crime ID. We then set the month to be in the correct (British) format of DD/MM/YYYY and we were golden.
A quick change of ‘put’ to ‘post’ in our script meant that we could let Socrata work out what was new and what was old data when new updates were published to the Police.uk API (which although says updates should be monthly, they do in-fact seem to be quarterly). This script is set to run monthly via a cron at my host to grab any new data and publish it to the dataset.
You can fork/copy/download our BANES crime data script over on GitHub.
The BANES crime dataset is available to view here: https://data.bathhacked.org/Government-and-Society/-Live-BANES-Street-Level-Crime/e46f-mhfs
Bicycle Shop Data
Hot-off-the-heels of this and armed with fresh knowledge, I chose to tweak the script to grab something that I was after – Bicycle Shop Locations in BANES. I began with the idea that I would start and curate a list of these but laziness and the thought of having to manually keep it up-to-date got the better of me, so I headed for Google Maps instead. I signed-in, registered an application in the Google Developer Console and switched on the APIs for Google Places and Google Maps Engine. This then gave me access to JSON output for map searches and a little playing with the URL gave me a list of ‘Bike Shops’ in ‘Bath & North East Somerset’ (an area Google Maps actually recognises). Grabbing just the bits I was after, I was able to publish them to a newly created dataset which I had made moments before and that was it!
The only problem i’ve noticed with this is that some locations are listed twice in Google Places and so this is reflected in the dataset. For example, some locations are obviously registered as a business address and some are listed as their actual location as Google has probably scraped this from elsewhere. These then look like the shop in question has two locations, when they don’t. I haven’t decided what to do with this yet. There may be a better source for the information but for the purposes of my learning and a quick and dirty grab for the next Bath: Hacked event, I thought this was acceptable.
My BANES bicycle shop script is also available to fork/copy/download over on GitHub.
The BANES bicycle shops dataset is available to view here: https://data.bathhacked.org/Transport/-Live-List-of-Bicycle-Shops-in-BANES/2nwi-wwk2
I encourage anyone, no matter what level you are, to go and play with the datasets on the Bath: Hacked datastore. You don’t have to be a techie to be able to create map, chart, calendar or graph views that can make greater sense of some of the vast amounts of information that we have been given access to and if you’re technically minded or have an affinity with numbers or patterns, your input could be incredibly valuable.
Now, go play!