Make Incidents Visible and Inform the Shift Left

I am always looking for ways to help teams fight the good fight when it comes to cultural change across API operations. The technology of APIs takes a lot of work, but it never requires as much work as the business and people side of API operations. Gathering stories from across the conversations I am having, and sharing them here on the blog and via the Postman blog, is how I look to scale my advice and guidance for shifting enterprise behavior. Today I wanted to aggregate a handful of stories about outages and breaches, and explore what we can do to contribute to behavioral change across teams and amongst leadership after any type of API incident.

We hear a lot about “shifting left”, moving security, testing, and other elements earlier in the API lifecycle. What we don’t hear as much about is how we handle availability and breach incidents, and how we leverage what we learn to inform team members and leadership, making sure all future work and prioritization is shaped by those lessons. It is common for teams and leadership to want to move on after an incident, but if you are looking to learn from what happened and truly shape future development, here are a few things to consider.

  • Retros - Always pause after an incident and ensure a retrospective occurs covering everything that happened, gathering as many views as possible from across the team.
  • Document - Take the time to document the incident beyond just the retrospective, establishing a record of incidents and distilling learnings into a shared organizational memory.
  • Visuals - Produce visuals that help communicate what happened, including graphs, charts, icons, images, and other visually engaging content to support policies and documentation.
  • Stories - Take the time to gather the stories of those involved, and pull relevant quotes to further support the documentation, policies, and visuals shared with leadership.
  • Policies - Develop new policies or iterate on and expand existing ones, defining machine-readable Spectral rules or other tests to help ensure an incident doesn’t happen again on any API (see the sketch after this list).
  • Remember - Find ways to keep what happened visible until the necessary change occurs across APIs, operations, teams, and leadership, highlighting the incident all along the way.
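
To make the policies bullet a little more concrete, here is a minimal sketch of what codifying a single incident learning as a Spectral rule might look like. The rule name, the rate limit header it checks for, and the incident it represents are all hypothetical, assuming an outage that was traced back to APIs that never documented rate limits for consumers.

```typescript
// Hypothetical Spectral ruleset module capturing one incident learning,
// assuming the outage traced back to APIs that never documented rate limits.
import { truthy } from "@stoplight/spectral-functions";

export default {
  rules: {
    // Hypothetical rule name tied to an internal incident record.
    "require-rate-limit-header-docs": {
      description:
        "Every documented response should declare an X-RateLimit-Limit header so consumers know when to back off.",
      message: "Response is missing X-RateLimit-Limit header documentation ({{path}}).",
      severity: "error",
      given: "$.paths[*][*].responses[*]",
      then: {
        field: "headers.X-RateLimit-Limit",
        function: truthy,
      },
    },
  },
};
```

Saved as a JavaScript ruleset (for example .spectral.js) and run in CI with something like spectral lint openapi.yaml --ruleset .spectral.js, a rule like this keeps the lesson alive long after the retrospective is over.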

I realize that nobody is going to feel like doing all of this after an outage or breach. It is merely a recommendation of things to consider. Without this evidence, the incident will quickly fade from memory, those who weren’t involved will be unlikely to learn from what happened, and any technical debt behind the incident will likely remain. Investing in a shared understanding of what happened among those involved, ensuring there is proper documentation and visuals, and sharing stories with other teams and leadership can help lead to the change you envision. It can all lead to policy change, more investment in reducing technical debt, and a stronger organizational memory of historical incidents, which translates into more learning from each outage and breach.

I will keep emphasizing how different elements of the API lifecycle are being shifted left, but I will also spend more time gathering stories about how you can recover from an incident, minimize the chances it will happen again, share knowledge across teams, and convince leadership to invest in whatever change is needed. This stuff isn’t easy. I am always trying to understand enterprise culture from the bottom-up and the top-down, and, as my guest on Breaking Changes, Gregor Hohpe, talked about, riding the architect elevator up and down, then crafting the right message to shift behavior at all levels, learning from each incident, and incrementally moving API operations in the right direction. You won’t get everything you need, but the goal is to maximize every incident as fuel for the forward motion you desire.