Web Harvesting to API with Import.io
I had a demo today of a new data extraction service called Import.io. The service lets you harvest, or scrape, data from websites and then output it in machine-readable formats like JSON. This is very similar to Needlebase, a popular scraping tool that was acquired and then shut down by Google early in 2012. I’d say, though, that Import.io represents a simpler, yet at the same time more sophisticated, approach to harvesting and publishing web data than Needlebase.
Using Import.io you can target the web pages where the content you wish to harvest resides, define the rows of data, then label them and associate them with columns in the table where the system will ultimately put your data. From there you extract the data, complete with the querying, filtering, pagination and other aspects of browsing the web you will need to get at all the data you desire.
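To make the rows-and-columns idea concrete, here is a minimal sketch in Python. It is not Import.io's code or API, just my own illustration using the standard library. The sample page, the column labels and the names RowExtractor and extract_rows are all made up for the example, and it assumes the data sits in a plain HTML table.

```python
import json
from html.parser import HTMLParser


class RowExtractor(HTMLParser):
    """Collects the text of each <td> cell, grouped by <tr> row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:   # skip rows with no <td> cells, such as header rows
                self.rows.append(self._row)
            self._row = None


def extract_rows(html, columns):
    """Label each extracted row with the given column names."""
    parser = RowExtractor()
    parser.feed(html)
    parser.close()
    return [dict(zip(columns, row)) for row in parser.rows]


# Made-up sample page standing in for a real website.
SAMPLE_PAGE = """
<table>
  <tr><th>Team</th><th>Wins</th></tr>
  <tr><td>Lions</td><td>10</td></tr>
  <tr><td>Bears</td><td>7</td></tr>
</table>
"""

records = extract_rows(SAMPLE_PAGE, ["team", "wins"])
print(json.dumps(records, indent=2))
```

Real pages are messier than this, with pagination, nested markup and filtering to deal with, which is exactly the work a service like Import.io takes off your hands.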
After defining the data that will be extracted and how it will be stored, you can stop and use the data as is, or you can set up a more ongoing, real-time connection with the data you are harvesting. Using Import.io connectors you can pull the data regularly, identify when it changes, merge data from multiple sources and remix it as needed.
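I don't know how Import.io connectors do this internally, but the two ideas of noticing when harvested data changes and merging rows from multiple sources can be sketched roughly like this in Python. The function names, the sample rows and the choice of a SHA-256 fingerprint are my own assumptions for illustration.

```python
import hashlib
import json


def fingerprint(rows):
    """Stable hash of a list of row dicts, used to notice when a source changes."""
    canonical = json.dumps(rows, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def has_changed(previous_rows, current_rows):
    """True when the latest pull differs from the one before it."""
    return fingerprint(previous_rows) != fingerprint(current_rows)


def merge_sources(key, *sources):
    """Merge rows from several sources on a shared key; later sources win on conflicts."""
    merged = {}
    for rows in sources:
        for row in rows:
            merged.setdefault(row[key], {}).update(row)
    return list(merged.values())


# Made-up pulls from two hypothetical sources.
yesterday = [{"team": "Lions", "wins": "10"}]
today = [{"team": "Lions", "wins": "11"}]
cities = [{"team": "Lions", "city": "Detroit"}]

changed = has_changed(yesterday, today)
combined = merge_sources("team", today, cities)
print(changed, combined)
```

Hashing the whole pull is the simplest way to detect a change. A real connector would probably want to know which rows changed, not just that something did.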
Put The Data To Work
Using Import.io you can immediately extract the data you need and get to work, or establish an ongoing connection with your sources of data. You can use that connection via the Import.io web app, or manage and access it via the Import.io API, giving you full control over your web harvesting connections and the resulting data.
When getting to work using Import.io, you have the option to build your own connectors or explore a marketplace of existing data connectors, tailored to pull from some common sources like the Guardian or ESPN. The Import.io marketplace of connectors is a huge opportunity for data consumers, as well as data scraping junkies (like me), to put their talents to use building unique and desirable data harvesting scripts.
I’ve written about database to API services like EmergentOne and SlashDB. I would put Import.io into the Harvest to API or ScrAPI category: it allows you to deploy APIs and machine-readable datasets from any publicly available data, even if you aren’t a programmer.
I think ScrAPI services and tools will play an important role in the API economy. While data will almost always originate from a database, you often can’t navigate existing IT bottlenecks to properly connect and deploy an API from that data source. Sometimes problem owners will have to circumvent existing IT infrastructure and harvest the data where it is published on the open web, taking it upon themselves to generate the API or machine-readable formats needed for the last mile of mobile and big data apps that will ultimately consume and depend on this data.