An Open Source Sitemap API For Every Domain
23 Sep 2019
I invest a lot of resources into spidering domains. I spend probably about $50.00 to $75.00 a month in compute resources to scrape domains I have targeted as part of my API research. This is something that will increase 3x in the next year as I expand my API search and discovery efforts. While it is very valuable to use the GitHub and Bing APIs to uncover new and interesting domains, they don’t always give me very meaningful indexes of what each site contains. I prefer actually spidering the site and looking for meaningful words from my vocabulary like Swagger, OpenAPI, Documentation, etc. These are the words that help me understand whether there is really an API there, or just a blog post mentioning an API, or some other ephemeral reference to an API. As I work to refine and evolve my API search tooling, I find myself wishing that website owners would own and operate their own sitemap API, providing a rich index of every public page available on their website.
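To make the vocabulary check concrete, here is a minimal sketch in Python. It is not my actual crawler code. The function names and the two-term threshold are assumptions made for illustration, and the vocabulary holds only the three words named above.

```python
import re
from collections import Counter

# Illustrative vocabulary: only the three terms named in this post.
API_VOCABULARY = ("swagger", "openapi", "documentation")


def count_vocabulary_hits(page_text, vocabulary=API_VOCABULARY):
    """Count case-insensitive whole-word occurrences of each vocabulary term."""
    words = Counter(re.findall(r"[a-z0-9]+", page_text.lower()))
    return {term: words[term] for term in vocabulary}


def looks_like_api_page(page_text, min_distinct_terms=2):
    """Flag a page when enough distinct vocabulary terms appear in its text.

    The threshold of 2 is an arbitrary assumption, not a tuned value.
    """
    hits = count_vocabulary_hits(page_text)
    distinct = sum(1 for count in hits.values() if count > 0)
    return distinct >= min_distinct_terms
```

A page that mentions several of these terms together is a stronger signal than a blog post that only says "API" in passing, which is why the sketch counts distinct terms rather than raw matches.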
I should probably begin to adopt Common Crawl instead of running my own scraper and crawler. However, the overhead of setting up Common Crawl and the special types of searches I’m conducting have prevented it from ever occurring. I’ll most likely just keep doing my homegrown version of domain harvesting, but I can’t help but dream and design the future I’d like to see along the way. Wouldn’t it be nice if EVERY domain had a base API for getting at the sitemap? It would let the domain owner control what is indexed and what is not, while also providing a simple, alternative, machine-readable interface for people to get access to content. Hell, I’d even pay a premium to get more structured data, or direct API access to the complete index of a website. It seems like something someone could cobble together and standardize using the ELK stack, and wrap in a simple white label API + documentation.
In my mind every domain should have a developer.[domain].[extension] subdomain set up with a suite of default API services available, with the first one being a sitemap API. Next there should be APIs and / or Atom feeds for all news, blogs, press, forums, and other common streams of information. Don’t make me scrape your blog just to make my own Atom feed. Don’t make me rely on your Twitter stream for this either. Having an API for the website sitemap, and requiring scrapers to register and use a dedicated search endpoint, will help reduce traffic and load on the primary website, and give domain owners more visibility into who is pulling information from their website. Sure, not everyone is going to respect this, but there are many of us who will, and we would be very thankful for a well defined index of your website to help us understand your operations.
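As a rough sketch of what a sitemap API response could look like, the Python below pages through an index of public URLs. No such standard exists, so everything here is hypothetical: the field names (url, title, last_modified), the pagination scheme, and the example.com pages are assumptions made for illustration.

```python
import json

# Hypothetical index of public pages a domain owner chooses to expose.
PAGES = [
    {"url": "https://example.com/", "title": "Home", "last_modified": "2019-09-01"},
    {"url": "https://example.com/blog/", "title": "Blog", "last_modified": "2019-09-20"},
    {"url": "https://example.com/docs/", "title": "Documentation", "last_modified": "2019-08-15"},
]


def build_sitemap_response(pages, page=1, per_page=2):
    """Return one page of the index as a JSON-serializable dict."""
    start = (page - 1) * per_page
    items = pages[start:start + per_page]
    has_next = start + per_page < len(pages)
    return {
        "total": len(pages),
        "page": page,
        "per_page": per_page,
        "next_page": page + 1 if has_next else None,  # None marks the last page
        "pages": items,
    }


if __name__ == "__main__":
    # What a first request to the (hypothetical) sitemap endpoint might return.
    print(json.dumps(build_sitemap_response(PAGES, page=1), indent=2))
```

A registered scraper would keep requesting next_page until it comes back empty, instead of crawling the primary website link by link.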
In the end I know this is a fantasy. Most website operators don’t understand this world, and just rely on Google to get things done for them. Most aren’t going to see the value in providing a dedicated API for accessing the content on their website. I know many of the linked data folks still believe in this wet dream, but I have too much experience to believe that website owners will ever understand and care at this level. However, that can’t stop me from dreaming, writing it up as a story, and talking about the future I would like to have. Who knows, maybe someone will take my idea and turn it into a viable service, and actually begin offering it to the more progressive domains, changing how the web is indexed and accessed. I don’t have time to do things like this. I barely have enough time and resources to cobble together and operate my scrape scripts for about 10K domains, grabbing what I need from across the companies, organizations, institutions, and government agencies I am monitoring as part of my API research.