Posted on 03-16-2013
Audrey came to me last night and said she had a project that she wanted to tackle, using the CrunchBase API. She wanted to pull a list of education startups that were founded in 2010-2012, showing their investments, CEO, founders and other related company information.
A couple weeks ago, I helped Audrey download a PHP Twitter bot, reverse engineer and make it work for an objective she had around Tweeting random responses to a certain type of tweet. I figured she was ready to see what it took to hack on an API and get the research data she needed.
Audrey wanted to pull a list of startups that were educated related. We started with the /search endpoint, using the query keyword “education”. The query parameter appears to search the title, tags and overview of each Crunchbase entry, pulling way more than what we needed, 5700 in total. Each search returned 10 startups, with a handful of fields for each company, not nearly enough information.
To prove we could get what we needed, we took the first startup returned, grabbed the name and used it to perform a company /entity search, returning a wealth of information. We had what we were looking for including date founded, amount raised, CEO, founders as well as description, logos, addresses and other vital company information.
Back to the /search endpoint. Each search only returned 10 results, when there were actually 5,700. I looked at the documentation and the only parameter listed, was “query”. No way to control how many results returned, paginate or otherwise. So I start guessing, appending page=, and max=, maxrows=, anything to get more results. Eventually I was able to change the page, so I was able to write logic to loop through each page, getting at all the results required.
For each returned result from the /search endpoint, I would take the company name and perform a company search on the /entity endpoint. However each call to the endpoint would yield a 301, redirecting me to another URL. So the search for:
Get a 301 returned, with the following redirect URL:
So for each /entity endpoint I would have to make two separate cURL calls to get a single record for a company. An approach that is guarantee to get CrunchBase into the API billionaires club (as chatty APIs do), or at the very least generate a nice bill with their API management provider.
Next, I needed to create a table to store the data in, I looped through the JSON object and spit out a list of fields, so I could generate a SQL script to create the table. There were a few duplicate fields, but quickly got a table up and the data inserted. The data is decent. There are a lot of issues with absence of default values, empty objects with available images, addresses, etc. But nothing I’m not used to wading through, to get what I need from an API.
With the /search endpoint based upon a single keyword search, that spans title, overview and tags, the search returned a lot of companies that didn’t fit our search. As we prepared to enter each record into a database, we would apply further filters, making sure there companies founded in 2010 or greater, and met other criteria, before actually inserting.
Now we had a MySQL table filled with 772 education startups. I exported as CSV, uploaded as a Google Spreadsheet, and shared it with Audrey. Game over. She had what she needed. Overall it took about 3 hours start to finish, which she felt was tedious, and feeling there was a lack of clarity regarding the whole process--something she couldn’t have done with me and my API programming skills. This is a problem CrunchBase.
Audrey is your target audience. This needs to be dead simple. She shouldn’t have to blindly navigate the API endpoints, documentation, data structures and holes in the data. These are things I’m used to having to cobble together, but an analyst like herself, won’t have the time and patience.
I can tell the CrunchBase API was created by someone who didn’t give a shit about the data, let alone use the API to build anything. I saw a job posting from CrunchBase, so I know they are looking for someone to own the API. I hope this person can become a steward for this great set of data, and make it into a meaningful API--there is a lot of potential there.
Quick feedback for whoever takes the job, regarding some building blocks to focus on:
- Endpoints - Establish more meaningful names for the existing endpoints, while establishing some new ones
- Documentation - Get the documentation complete!! So what if you have I/O docs, if all the fields aren’t defined
- POST / PUT - Al
- low folks to add and update data either after they curate and add to it, or from their own databases
- Blog - Create a blog, that lets me know someone is home. Its so evident nobody is.
- Social - Provide me with an active Twitter, LinkedIn and Google+ account to interact with you on and share what I’m doing with your data
- Branding - I need some guidance with branding of CrunchBase data, when I publish in the wild
- Embeddable - Holy shit, the opportunity with badges, buttons and widgets around CrunchBase data. This data is made for crunching and sharing in stories
- Spreadsheet - From the I/O docs, provide export as CSV and to Google Spreadsheet. Build Google Spreadsheet templates that allow users to enter their key into cell and it will pre-populate a spreadsheet with the data they are looking for, with the analysis, visualizations and other tools analysts will need
CrunchBase is a goldmine of data. As it is, I can make use of the data and generate value. But the API won’t remain at the center of that, I will extract the value I need and not return much to the platform, because well...it sucks ass. I can’t write back, and I do so much work to get the value I do, I feel like its mine after I’m done. Especially if the platform has no person or personality that I consider an owner or steward. I don't feel beholden to give anything back. Just take...not good for your business.
You guys have a lot of potential within the CrunchBase API, I hope you dedicate the resources to make it work, and find a steward for the API that will actually care about what you have, and interact with your API consumers in a meaningful way.
comments powered by Disqus
Winning in the API Economy
|Download as PDF|
Latest Blog Posts
- My Discussion Today With 6 Hypermedia Leaders At API-Craft in Detroit
- Getting To Know Jørn Wildt For The API Craft 2014 Detroit Hypermedia Panel
- Hypermedia Feels Like We Are Still Learning To Communicate With APIs
- Getting To Know Markus Lanthaler For The API Craft 2014 Detroit Hypermedia Panel
- Getting To Know Kevin Swiber For The API Craft 2014 Detroit Hypermedia Panel
- Getting To Know Steve Klabnik For The API Craft 2014 Detroit Hypermedia Panel
- New Indix API KickStart Program Reduces Costs For Developers
- Getting To Know Mike Kelly For The API Craft 2014 Detroit Hypermedia Panel
- A Shared, Distributed Experience(Metrics) Layer For The API Driven Application Stack
- Showcasing Your API Integrations With Other Platforms
- Increasing The Focus On APIs In Higher Education Is Important
- Getting To Know Mike Amundsen For The API Craft 2014 Detroit Hypermedia Panel
- The New StrongLoop API Server Provides A Look At Future Of API Deployment
- Models For API Driven Startups Built Around Public Data
- Will You Add Me To API Evangelist And How To Spot The Cool Kids
- When I Remix APIs Using Swagger How Do I Deal With Authentication Across Multiple APIs
- It Takes A Team Of Evangelists To Raise An API
- Support For Only Two Creative Commons Licenses In The API Commons
- Machine Readable Terms of Service Didn't Read Applied To APIs Via APIs.json
- API Deployment For Non-Developers Using Zapier, Google Docs, and APISpark
- State of Hypermedia Today @ API Craft In Detroit
- Need A Formal API Standard For Your Government Agency? Fork 18F's, And Make It Your Own!
- CORS Makes Your API Portable And Remix-able
- Chief Data Officer Needs To Make The Department Of Commerce Developer Portal The Center Of API Economy
- An API Definition As The Truth In The API Contract