Sometimes Open Data Just Needs That Human Touch

I had a great conversation with Chris Taggart (@countculture), the CEO of OpenCorporates, after his keynote at APIDays Berlin / APIDays Europe 2015 last week, about how they get the data they need to make the change they want to see in how we do business globally. We technologists get excited about the potential around the ingestion of large data files published by companies and governments, and the scraping of data where we can. All of this is definitely part of the open data lifecycle, but oftentimes the whole open data thing comes down to the human variable.

Almost every open data project I've embarked on has had a portion of the process that involves me rolling up my sleeves and cleaning up and validating data--there is just no way around it. Of course I do everything I can to automate the harvesting, mounting, processing, and cleaning up of data, but in the end most of the data I come across is just messy. This is just the way things are, and while I'd love to not have to get my hands dirty, it is always a requirement to achieve my goal.

This is why it made me happy to talk through some of OpenCorporates' processes with Chris, and learn that they have pretty similar approaches. They can definitely automate the extraction of files and large amounts of data, but when it comes to normalizing it, they often have to depend on breaking the data into small batches, publishing them to Google Spreadsheets, and relying on humans to do the detail work--yes, humans still have a purpose when it comes to open data!
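
To give you a rough idea of what that batching half of the workflow might look like, here is a quick Python sketch. To be clear, this is my own guess at the pattern, not OpenCorporates' actual code--the field names, batch size, and filenames are all made up for illustration:

```python
import csv

# Hypothetical harvested records -- stand-ins for whatever a scraping
# or extraction pass actually produced. Field names are made up.
records = [
    {"company_name": "Acme Widgets Ltd", "company_number": "0123456",
     "source_url": "http://example.com/filing/1"},
    {"company_name": "Globex GmbH", "company_number": "7654321",
     "source_url": "http://example.com/filing/2"},
]

BATCH_SIZE = 50  # small enough for one person to review in a sitting


def write_batches(records, batch_size=BATCH_SIZE):
    """Split harvested records into small CSV files, one per batch,
    ready to upload to Google Spreadsheets for human review."""
    for i in range(0, len(records), batch_size):
        batch = records[i:i + batch_size]
        filename = f"batch_{i // batch_size + 1:04d}.csv"
        with open(filename, "w", newline="") as f:
            fieldnames = list(batch[0].keys()) + ["validated"]
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            for row in batch:
                # Leave a blank "validated" column for the reviewer
                # to fill in as they check each row against the source.
                writer.writerow({**row, "validated": ""})


write_batches(records)
```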

OpenCorporates tries to do as much of the heavy lifting as they can, but they still rely on individuals, with as minimal an amount of training as possible, to look at PDFs, HTML docs, and other files, enter specific values into associated columns, and validate data that has been harvested--these Google Spreadsheets are then imported into the main system that drives the site and the OpenCorporates API.
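
The other half of the loop is pulling the human-reviewed data back in. Here is a minimal sketch, assuming the reviewed spreadsheets get exported back out as CSV files and that reviewers mark a "validated" column--again, my illustration of the pattern, not their implementation:

```python
import csv
import glob


def load_validated_rows(pattern="batch_*.csv"):
    """Read back the reviewed CSV exports and keep only the rows a
    human marked as validated, ready for import into the main system."""
    validated = []
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                # Accept a few common ways a reviewer might say "yes".
                if row.get("validated", "").strip().lower() in ("y", "yes", "true"):
                    validated.append(row)
    return validated


rows = load_validated_rows()
print(f"{len(rows)} validated rows ready for import")
```

The nice thing about this shape is that the hard part--judgment about messy source documents--stays with a person, while the boring parts on either side stay automated.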

I want to point this out for two reasons: 1) opening up data is hard work, and there is only so much that we can rely on machines to do, and 2) this is a model that I feel we can all apply across other aspects of open data, not just the opening up of corporate data. I'd like to see this applied to critical areas of city, county, and state government, and specifically the area of policing. I have ideas on how to federate this approach, something I'll share more about when I have the time.

I just wanted to talk about the reinforcement I got from Chris, someone with a wealth of experience in the area, that compute will ultimately only get us so far, and we will need legions of open data enthusiasts who can help make business and government a little more transparent. Are you with me? Let me know if you need an open data project--I have endless amounts.