Moving API Docs From Human-Readable to Machine-Readable

One of the superpowers of APIs.json is the ability to evolve the human-readable aspects of API operations into machine-readable ones, because this is how we will scale to deliver the API economy all of us API believers envision in our mind's eye. I saw what Swagger (now OpenAPI) did for API documentation back in 2013, and I wanted the same for the other essential building blocks of API operations. A decade later, I am still translating getting started, plans, SDKs, road maps, change logs, and support into machine-readable artifacts as part of our API Commons work, and I am still working to translate documentation itself into machine-readable artifacts as well.

We have made huge strides when it comes to the adoption of OpenAPI across API providers. Stripe, Twilio, Plaid, GitHub, and many, many other API providers maintain their own machine-readable OpenAPI artifacts that describe the surface area of their APIs. With APIs.json, I am aiming to do the same for the essential building blocks of API operations surrounding these APIs. However, to fully realize this vision we need ALL OF THE APIs to possess accurate and up-to-date OpenAPI definitions. Like APIs.json, ideally the API provider is the one who maintains their OpenAPI, but in the absence of this I am in the business of creating as many OpenAPIs as I can, while also indexing the operations around them with APIs.json.
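For anyone who has not seen one, here is a minimal sketch of what an APIs.json index entry looks like when a provider does link their own OpenAPI. The fields follow the APIs.json format as I use it, but the provider, URLs, and names are all placeholders, not a real index entry:

```json
{
  "name": "Example API Provider",
  "description": "Machine-readable index of this provider's API operations.",
  "url": "https://example.com/apis.json",
  "apis": [
    {
      "name": "Example API",
      "humanURL": "https://example.com/docs",
      "baseURL": "https://api.example.com",
      "properties": [
        { "type": "Documentation", "url": "https://example.com/docs" },
        { "type": "OpenAPI", "url": "https://example.com/openapi.json" }
      ]
    }
  ]
}
```

When that OpenAPI property is present, my work is done for me. Everything below is about what happens when it isn't.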

I just wrote about how I am automating the discovery of new APIs across the rich categories of tags I am developing with my APIs.json work. I was able to find new APIs using the Bing Search API, and using a fingerprint I developed while profiling APIs for APIs.json, I am able to score each URL I find on the likelihood that it has a public API. Now, when I find an API and automatically discover that it has a documentation page, I want to obtain a machine-readable OpenAPI definition for that API. Ideally, the API provider has published an obvious link to their OpenAPIs, or to the GitHub repository where they manage them. I will be developing separate scripts that search Bing and GitHub (using their APIs) for each tag plus "openapi" to discover new APIs that have published OpenAPIs, but for this post I am focusing on API providers who have not published an OpenAPI.

Many API producers and consumers take for granted all of the hard work the community has put into Swagger and OpenAPI, and the transformative effect of open source API documentation renderers like Swagger UI and Redoc. If I land on an API provider's docs and they use Swagger UI or Redoc, I know I am going to walk away with a machine-readable artifact for that API. If they don't, the spectrum of possibilities for how the documentation is rendered is endless. While there are some common CMS solutions and information architectures applied when publishing API documentation, it is really difficult to automate the crawling, harvesting, and scraping of API documentation to produce an accurate machine-readable OpenAPI. I have written libraries that I can rapidly customize to navigate multiple pages and point at the paths, parameters, schema, errors, and other elements I need, but to date, I have not seen a single comprehensive solution that will ingest the HTML documentation for an API and give me back a machine-readable OpenAPI.

Cue the people who tell me artificial intelligence will do this for us. It won't. Until you bring me a model and demonstrate that you can do it across all of the APIs I have indexed, you can just fuck right off with your LLM. I am sorry. I believe AI will solve a lot of API problems for us and help us automate a lot of things, but you don't see the range of docs that I do, at the level I do. There is a lot of messy shit out there. I have more faith that we will get humans to adopt, or at least auto-generate, OpenAPI and APIs.json than I have that we will develop AI to unwind this mess. I could be wrong. I am wrong plenty of times about these things, but until I see evidence, I will keep developing my own approaches to parsing API docs and producing OpenAPIs. The problem is, I suspect I am too close to the problem to actually provide a solution, but this won't stop me from trying.

My current API documentation to OpenAPI crawler, harvester, and scraper relies on me configuring a series of "return between" scripts to parse each page, which can take 5 to 10 minutes to configure for each API provider, depending on their approach. It works, but it is time consuming, and I will keep using it until I find a better solution. My evolution of that set of scripts builds upon the recent Bing search mechanism I wrote about before, but now, for any individual property where I identify a link to API documentation, I will harvest the links from that page and keep crawling, while also parsing out the paths, parameters, schema, errors, and other common patterns you find in API documentation.
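To give a sense of what one of these scripts actually does, here is a stripped-down sketch of the "return between" primitive. The toy HTML and the markers are illustrative, not from a real provider, and tuning those markers for each provider's layout is where the 5 to 10 minutes go:

```python
# A toy page standing in for one provider's docs layout. The markers
# passed to return_between() get tuned for each provider.
html = """
<table>
  <tr><td><code>GET /v1/accounts</code></td><td>List accounts</td></tr>
  <tr><td><code>POST /v1/accounts</code></td><td>Create an account</td></tr>
</table>
"""

def return_between(haystack: str, start: str, end: str) -> list[str]:
    """Collect every substring that sits between the start and end
    markers, which is the core move of a "return between" script."""
    results, cursor = [], 0
    while True:
        s = haystack.find(start, cursor)
        if s == -1:
            break
        s += len(start)
        e = haystack.find(end, s)
        if e == -1:
            break
        results.append(haystack[s:e].strip())
        cursor = e + len(end)
    return results

print(return_between(html, "<td><code>", "</code></td>"))
# ['GET /v1/accounts', 'POST /v1/accounts']
```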
All of these elements will be spatially placed after each API path that is found. I will crawl probably two to three levels of pages, parsing paths and rating each page against its API documentation fingerprint. Then I will distill each page with a high enough rating down into OpenAPI operations and supporting schemas, establishing a more automated approach I can point at any human-readable API documentation page to reduce it into a machine-readable one.

Ideally, all of this work is done by API providers, or by artificial intelligence. I have been hearing this for the last decade, and it still hasn't happened, so I am continuing to do the work to make it happen. I remember sitting at a dinner table in Palo Alto in 2013 with Stephen Wolfram and a bunch of O'Reilly Media people, explaining what Swagger was for and sharing the first visions of APIs.json while I was working on data.json for the Obama administration. I heard from several people that night that nobody would create these Swaggers, and that even if they did, the work would shortly be replaced by artificial intelligence. Ten years later, I still don't have machine-readable documentation for all of the APIs I am profiling with APIs.json. I am going to take another stab at automating the translation of API documentation into OpenAPI, and even if I am not entirely successful, I know that I will learn a lot along the way.
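To close out, here is a rough sketch of the crawl-and-rate loop described above, so you can see the shape of what I am taking a stab at. These are not my actual scripts; the scoring heuristic, the regex, the threshold, and every name here are illustrative placeholders. The idea is to crawl a couple of levels deep, rate each page against an API documentation fingerprint, and distill whatever clears the bar into the skeleton of an OpenAPI:

```python
import re
from urllib.parse import urljoin, urlparse
import requests

METHOD_PATH = re.compile(r'\b(GET|POST|PUT|PATCH|DELETE)\s+(/[\w/{}\-]+)')

def doc_fingerprint_score(html: str) -> int:
    """Crude rating of how much a page smells like API docs: count
    method + path pairs, parameter mentions, and JSON content types."""
    return (len(METHOD_PATH.findall(html))
            + html.lower().count("parameter")
            + html.count("application/json"))

def crawl_docs(start_url: str, max_depth: int = 2, threshold: int = 5) -> dict:
    """Breadth-first crawl of a docs site, keeping pages that clear the
    fingerprint threshold and distilling their method + path pairs
    into the skeleton of an OpenAPI definition."""
    seen, queue, paths = {start_url}, [(start_url, 0)], {}
    while queue:
        url, depth = queue.pop(0)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        if doc_fingerprint_score(html) >= threshold:
            for method, path in METHOD_PATH.findall(html):
                # Parameters, schema, and errors parsed near this match
                # would be attached to the same operation.
                paths.setdefault(path, {})[method.lower()] = {"summary": f"Scraped from {url}"}
        if depth < max_depth:
            # Stay on the same host, and only queue pages we have not seen.
            for link in re.findall(r'href=["\']([^"\'#]+)', html):
                nxt = urljoin(url, link)
                if urlparse(nxt).netloc == urlparse(start_url).netloc and nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
    return {
        "openapi": "3.1.0",
        "info": {"title": f"Scraped: {start_url}", "version": "0.1.0"},
        "paths": paths,
    }
```

The real work lives in everything this sketch waves away: JavaScript-rendered docs, pagination, and deciding which scraped parameters and schema belong to which operation. That is the messy middle I will be writing about as I go.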