
The engineer who “just knows” is one resignation away. I built the fix.

At a previous job, Redis took us down in production more than once.

Not the same exact failure every time. But the same class of problem — connection pool exhaustion under traffic, manifesting in slightly different ways across different services.

Each time, we triaged it. Fixed it. Wrote something in Confluence days later when everyone was already halfway through the next sprint. And moved on.

The reason we eventually got a handle on it wasn't the Confluence docs. It was tribal knowledge — whoever happened to be on-call remembered the last time it happened and knew where to look.

That's a terrible system. It's one resignation letter away from losing everything you learned the hard way.

That's the gap I built retro-pilot to close — a multi-agent AI system that runs post-mortems so your institutional knowledge outlasts the people who built it.

→ See it in action: retro-pilot.adnankhan.me


Production incidents don't care what caused them.

Redis connection pool exhausted under traffic. TLS certificate expired and took down the service mesh. Database query regressed after a schema migration. Third-party API went down and cascaded through your services. Bad feature flag rollout caused silent data corruption.

All of them resolve the same way: someone pages in, triages fast, stops the immediate bleeding, and closes the incident. Then life moves on.

retro-pilot triggers the moment an incident is resolved — regardless of cause — and a team of AI agents asks three questions: what happened, why, and what do we change so it doesn't happen again?


Here's how the agents divide the work:

An OrchestratorAgent coordinates everything. It spawns specialist agents — one pulls logs, one queries metrics, one reads git history, one scans Slack threads. Each agent is scoped and isolated: the log agent can't touch metrics, the metrics agent can't read Slack. No cross-contamination, clean typed outputs at every boundary.
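Here's a minimal Python sketch of what that scoping looks like. The names (LogAgent, LogFindings, and so on) and signatures are illustrative, not retro-pilot's real interfaces — the point is that each agent is constructed with exactly one data source and returns exactly one typed result:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class LogFindings:
    """Typed output of the log agent -- the only thing it can return."""
    error_lines: list[str]

@dataclass(frozen=True)
class MetricFindings:
    """Typed output of the metrics agent."""
    saturated_resources: list[str]

class LogAgent:
    """Scoped to logs: constructed with a log reader and nothing else,
    so it physically cannot query metrics or read Slack."""
    def __init__(self, read_logs: Callable[[], list[str]]):
        self._read_logs = read_logs

    def run(self) -> LogFindings:
        lines = self._read_logs()
        return LogFindings(error_lines=[l for l in lines if "ERROR" in l])

class MetricsAgent:
    """Scoped to metrics only."""
    def __init__(self, query_metrics: Callable[[], dict]):
        self._query = query_metrics

    def run(self) -> MetricFindings:
        usage = self._query()
        return MetricFindings(
            saturated_resources=[k for k, v in usage.items() if v > 0.9])

# Stubbed data sources standing in for real log and metric backends.
log_agent = LogAgent(lambda: ["INFO boot", "ERROR redis pool exhausted"])
metrics_agent = MetricsAgent(lambda: {"redis_connections": 0.97, "cpu": 0.4})

print(log_agent.run().error_lines)            # ['ERROR redis pool exhausted']
print(metrics_agent.run().saturated_resources) # ['redis_connections']
```

The isolation is structural, not policy: the log agent has no handle to metrics, so a prompt injection or a hallucinated tool call can't cross the boundary.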

From there: a TimelineBuilder reconstructs what happened, a RootCauseAnalyst determines why, an ActionItemGenerator defines what changes, and a PostMortemWriter assembles the final document.
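In code, that hand-off is a straight chain where each stage consumes the previous stage's typed output. The sketch below uses the stage names from this post, but the function bodies are stubs — the real agents do LLM reasoning where these just transform data:

```python
from dataclasses import dataclass

@dataclass
class Timeline:
    events: list[str]

@dataclass
class RootCause:
    summary: str

@dataclass
class ActionItems:
    items: list[str]

def build_timeline(raw_findings: list[str]) -> Timeline:
    # TimelineBuilder: order the evidence chronologically.
    return Timeline(events=sorted(raw_findings))

def analyze_root_cause(t: Timeline) -> RootCause:
    # RootCauseAnalyst: a real agent would reason over the timeline.
    return RootCause(summary=f"Derived from {len(t.events)} events")

def generate_actions(rc: RootCause) -> ActionItems:
    # ActionItemGenerator: turn the cause into concrete changes.
    return ActionItems(items=[f"Prevent recurrence: {rc.summary}"])

def write_postmortem(t: Timeline, rc: RootCause, a: ActionItems) -> str:
    # PostMortemWriter: assemble the final document.
    return "\n".join(["# Post-mortem", *t.events, rc.summary, *a.items])

timeline = build_timeline(["14:02 pool exhausted", "13:58 traffic spike"])
cause = analyze_root_cause(timeline)
doc = write_postmortem(timeline, cause, generate_actions(cause))
print(doc.splitlines()[0])  # "# Post-mortem"
```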

Then an EvaluatorAgent — running LLM-as-judge — scores the draft and sends it back for revision if it doesn't pass. Maximum 3 cycles. Nothing gets published without a human approving it.
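The evaluate-revise loop is simple enough to sketch in a few lines. Here `evaluate` stands in for the LLM judge (returning a 0–1 score) and `revise` for a redraft; both are stubs, and the threshold is an assumed value, not retro-pilot's actual one:

```python
MAX_CYCLES = 3
PASS_THRESHOLD = 0.8

def evaluate(draft: str) -> float:
    # Stub judge: a real one scores completeness, accuracy, actionability.
    return 0.5 + 0.2 * draft.count("[revised]")

def revise(draft: str) -> str:
    # Stub redraft: a real agent rewrites using the judge's feedback.
    return draft + " [revised]"

def review_loop(draft: str) -> tuple[str, bool]:
    for _ in range(MAX_CYCLES):
        if evaluate(draft) >= PASS_THRESHOLD:
            return draft, True   # passed -- still needs human approval
        draft = revise(draft)
    return draft, False          # flagged: failed after max revisions

final, passed = review_loop("initial draft")
print(passed, final.count("[revised]"))  # True 2
```

The cap matters: without it, a strict judge and a weak writer can loop forever. After three failed cycles the draft is escalated to a human instead.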


retro-pilot doesn't search past incidents by keyword. It searches by meaning.

“Redis connection pool exhaustion” and “clients timing out under load” share almost no vocabulary. But they describe the same class of problem. ChromaDB with sentence-transformers finds that match. Keyword search — and your Confluence search bar — doesn't.

So when a new incident comes in, the OrchestratorAgent retrieves the 3 most semantically similar past post-mortems before analysis even starts. Your new incident is contextualized against your institutional knowledge — automatically, every time, whether or not the engineer on-call was there for the last one.
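The retrieval step reduces to nearest-neighbor search over embedding vectors. This toy sketch hand-writes tiny vectors in place of a sentence-transformers model (real embeddings have hundreds of dimensions, and ChromaDB handles the indexing), but the ranking logic is the same:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Pretend embeddings: nearby vectors mean similar meaning, not shared keywords.
past_postmortems = {
    "Redis connection pool exhaustion":  [0.9, 0.1, 0.0],
    "Clients timing out under load":     [0.8, 0.2, 0.1],
    "TLS certificate expiry":            [0.0, 0.9, 0.1],
    "Schema migration query regression": [0.1, 0.1, 0.9],
}

def top_k(query_vec: list[float], k: int = 3) -> list[str]:
    ranked = sorted(past_postmortems.items(),
                    key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [title for title, _ in ranked[:k]]

# A new incident embeds near the first two despite different wording.
print(top_k([0.85, 0.15, 0.05]))
```

The new incident's vector lands closest to the two connection-related post-mortems even though the titles barely overlap — that's the match a keyword index misses.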

The Confluence problem isn't that teams don't write post-mortems. It's that the knowledge doesn't compound. Every incident starts from scratch.

retro-pilot makes knowledge compound.


A note on the two-project arc:

ops-pilot (project 1) is a multi-agent system that watches your CI/CD pipeline and reacts in real time to build and deploy failures. retro-pilot is standalone — it works for any production incident regardless of cause. But the connection is intentional: ops-pilot could trigger retro-pilot automatically after a deploy-caused incident resolves. The handoff writes itself.

→ Live demo: retro-pilot.adnankhan.me

→ Code: github.com/adnanafik/retro-pilot


ops-pilot fights the fire.

retro-pilot makes sure you never fight the same fire twice.

How does your team handle institutional knowledge when the engineer who “just knows” leaves? I'm curious how common the tribal knowledge problem actually is.