Long Running API Requests And Differential API Responses

I am shifting my long running API operations from a PHP / EC2 based implementation to a more efficient Node.js / Lambda based solution, and a month or so back I promised James Higginbotham (@launchany) that I’d share a breakdown of my process with him. I’m running 100+, with bursts of 1000+, long running API requests for a variety of purposes, and it helps me to tell the narrative behind my code, introducing some coherence into the why and how of what I’m doing, while also sharing with others along the way. I covered my earlier process a little bit in a story a few months ago, but as I migrated the process, I wanted to flesh it out further, and make sure I wasn’t mad.

The base building block of each long running API request I am making is HTTP. The only difference between these API requests and any others I am making on a daily basis is that they are long running–I am keeping them alive for seconds, minutes, and historically hours. My previous version of this work ran as long running server side jobs using PHP, which I monitored and kept alive as long as I possibly could. My next generation scripts will have a limit of 5 minutes per API request, because of constraints imposed by Lambda, but I actually find this to be a positive constraint, and something that will help me orchestrate my long running API requests more efficiently–making them work on a schedule, and respond to events.
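To give a sense of the shape of these functions, here is a minimal sketch of what one of these Lambda handlers might look like–the target URL is a placeholder I made up, but the pattern of using context.getRemainingTimeInMillis() to shut the connection down cleanly before Lambda's 5 minute limit is the important part.

```javascript
// Minimal sketch of a long running Lambda handler (Node.js).
// The target URL is a placeholder--swap in whatever JSON API you are watching.
const https = require("https");

exports.handler = (event, context, callback) => {
  // Leave a 10 second buffer before Lambda's hard timeout.
  const budget = context.getRemainingTimeInMillis() - 10000;

  const request = https.get("https://api.example.com/events", (response) => {
    response.on("data", (chunk) => {
      // Process each chunk as it arrives over the long running connection.
      console.log(`received ${chunk.length} bytes`);
    });
    response.on("end", () => callback(null, "connection closed by server"));
  });

  request.on("error", (error) => callback(error));

  // Shut the connection down cleanly before the function times out.
  setTimeout(() => {
    request.abort();
    callback(null, "connection closed before Lambda timeout");
  }, budget);
};
```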

Ok, so why am I running these API calls? A variety of reasons. I’m monitoring a GitHub repository, waiting for changes. I’m monitoring someone’s Twitter account, or a specific Tweet, looking for a change, like a follow, favorite, or retweet. Maybe I want to know when someone asks a new question about Kafka on Stack Overflow, or Reddit. Maybe I want to understand the change schedule for a financial markets API over the course of a week. No matter the reason, they are all granular level events occurring across publicly available APIs that I am using to keep an eye on what is happening across the API sector. Ideally all of these API platforms would have webhook solutions that would allow me to define and subscribe to specific events that occur via their platform, but they don’t–so I am doing it from the outside-in, augmenting their platforms with some external event-driven architecture.

An essential ingredient in what I am doing is Streamdata.io, which provides me a way to proxy any existing JSON API, and turn it into a long running / streaming API connection using Server-Sent Events (SSE). Another essential ingredient is that I can choose to get my responses as JSON Patch, which only sends me what has changed after the initial API response comes over the pipes. I don’t receive any data unless something has changed, so I can proxy GitHub, Twitter, Stack Overflow, Reddit, and other APIs, and tailor my code to respond just to the differential updates I receive with each incremental change. I can apply the patch to my initial response, but more importantly I can take some action based upon the incremental change, triggering an event, sending a webhook, or any other action I need based upon the change in the API space time continuum I am looking for.
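Here is a rough sketch of what consuming one of these proxied streams might look like in Node.js, using the eventsource and fast-json-patch packages (my choices for illustration)–the proxy URL format and token parameter are assumptions based on how SSE proxies like Streamdata.io generally work, so check their docs for the exact details.

```javascript
// Sketch: consume an SSE stream that delivers JSON Patch updates (Node.js).
// Assumes the "eventsource" and "fast-json-patch" npm packages, and a
// Streamdata.io style proxy URL--check their docs for the exact format.
const EventSource = require("eventsource");
const jsonpatch = require("fast-json-patch");

const target = "https://api.example.com/questions"; // placeholder API
const proxy = `https://streamdata.motwin.net/${target}?X-Sd-Token=YOUR_TOKEN`;

let snapshot = null;

const stream = new EventSource(proxy);

// The first event carries the full JSON response.
stream.addEventListener("data", (event) => {
  snapshot = JSON.parse(event.data);
  console.log("initial snapshot received");
});

// Subsequent events carry only a JSON Patch (RFC 6902) describing what changed.
stream.addEventListener("patch", (event) => {
  const patch = JSON.parse(event.data);
  snapshot = jsonpatch.applyPatch(snapshot, patch).newDocument;
  // React to the change here--trigger an event, send a webhook, etc.
  console.log("applied patch:", patch);
});

stream.onerror = (error) => console.error("stream error", error);
```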

My previous scripts were deployed individually, and kept alive for as long as I directed the jobs manager. It was kind of a one size fits all approach. Now that I’m using Lambda, each script will run for 5 minutes when triggered, and then I can schedule it to run again every 5 minutes–repeating the cycle for as long as I need, based upon what I’m trying to accomplish. I can trigger each long running API request based upon a schedule, or based upon other events I’m defining, leveraging AWS CloudWatch as the logging mechanism, and AWS CloudWatch Events as the event-driven layer. I am auto-generating each Node.js Lambda script using OpenAPI definitions for each API, with a separate environment layer driving authentication, and then triggering, running, and scaling the API streams as I need, updating my AWS S3 data lake(s) and AWS RDS databases, and pushing other webhooks or notifications as I need.
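The scheduling piece is just a CloudWatch Events rule pointed at each function. Here is a sketch of wiring one up with the AWS SDK for Node.js–the rule name and function ARN are placeholders, and you would also need to grant CloudWatch Events permission to invoke the function (via lambda.addPermission), which I have left out for brevity.

```javascript
// Sketch: schedule a Lambda to re-run every 5 minutes using CloudWatch Events.
// Rule name and function ARN are placeholders for illustration.
const AWS = require("aws-sdk");
const events = new AWS.CloudWatchEvents({ region: "us-east-1" });

async function scheduleStream() {
  // Create (or update) a rule that fires every 5 minutes.
  const rule = await events
    .putRule({
      Name: "github-repo-stream-every-5-minutes",
      ScheduleExpression: "rate(5 minutes)",
      State: "ENABLED",
    })
    .promise();

  // Point the rule at the Lambda function that opens the stream.
  await events
    .putTargets({
      Rule: "github-repo-stream-every-5-minutes",
      Targets: [
        {
          Id: "github-repo-stream",
          Arn: "arn:aws:lambda:us-east-1:123456789012:function:github-repo-stream",
        },
      ],
    })
    .promise();

  console.log("rule created:", rule.RuleArn);
}

scheduleStream().catch(console.error);
```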

I am relying heavily on Streamdata.io for the long running / streaming layer on top of any existing JSON API, as well as for doing the differential heavy lifting. Every time I trigger a long running API request, I’ll have to do a diff between its initial response and the previous one, but every incremental update for the next 4:59 is handled by Streamdata.io. Then AWS Lambda is doing the rest of the triggering, scaling, logging, scheduling, and event management far more efficiently than my previous long running PHP scripts running as background jobs on a Linux EC2 server. It is a significant step up in efficiency and scalability for me, allowing me to layer an event-driven architecture on top of the existing 3rd party API infrastructure I am depending on to keep me informed of what is going on, and keep my network of API Evangelist research moving forward.
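That diff on restart is the one piece Streamdata.io can’t handle for me, since each 5 minute run begins with a fresh initial response. Here is a sketch of how I would approach it, assuming the previous run’s snapshot lives in S3 and reusing fast-json-patch’s compare() to generate the differential–the bucket and key names are placeholders.

```javascript
// Sketch: diff a new initial response against the previous run's snapshot,
// stored in S3 between Lambda invocations. Bucket/key names are placeholders.
const AWS = require("aws-sdk");
const jsonpatch = require("fast-json-patch");

const s3 = new AWS.S3();
const Bucket = "my-api-snapshots";
const Key = "github-repo-stream/latest.json";

async function diffAgainstPreviousRun(initialResponse) {
  let previous = null;
  try {
    const object = await s3.getObject({ Bucket, Key }).promise();
    previous = JSON.parse(object.Body.toString("utf-8"));
  } catch (error) {
    if (error.code !== "NoSuchKey") throw error; // first run has no snapshot
  }

  // Generate a JSON Patch describing what changed between runs.
  const patch = previous ? jsonpatch.compare(previous, initialResponse) : [];

  // Store the new snapshot for the next 5 minute cycle.
  await s3
    .putObject({
      Bucket,
      Key,
      Body: JSON.stringify(initialResponse),
      ContentType: "application/json",
    })
    .promise();

  return patch; // an empty array means nothing changed between runs
}
```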