The Problem

Tribal knowledge is one resignation away.

At a previous job, Redis took us down in production more than once. Not the same exact failure every time — but the same class of problem. Connection pool exhaustion under traffic, manifesting slightly differently across services.

Each time, we triaged it, fixed it, and wrote something in Confluence days later when everyone was halfway through the next sprint. The reason we eventually got a handle on it wasn't the Confluence docs — it was tribal knowledge. Whoever was on-call remembered the last time it happened and knew where to look.

That's a terrible system. It's one resignation letter away from losing everything you learned the hard way. And the Confluence search bar can't save you: it matches keywords, not meaning. “Redis connection pool exhaustion” and “connection pool timeout” overlap only partly as keywords, yet they describe the same problem.

retro-pilot closes that gap. Every incident triggers an autonomous post-mortem. Approved documents embed into a vector store so the next incident starts with the three most semantically similar past ones already in context.

Architecture

Eight specialists. One judge. One orchestrator.

Hierarchical multi-agent system. Typed Pydantic contracts at every boundary. An LLM-as-judge reviewer enforces quality before anything gets published.
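
To make "typed contract" concrete, here is a minimal sketch of what one agent might hand to the next. retro-pilot uses Pydantic for this; the sketch swaps in standard-library dataclasses so it runs with no dependencies. The names EvidenceBundle, ALLOWED_SOURCES, and every field are hypothetical, not taken from the codebase.

```python
from dataclasses import dataclass

# Hypothetical contract: class, constant, and field names are illustrative only.
ALLOWED_SOURCES = {"logs", "metrics", "git", "slack"}


@dataclass(frozen=True)
class EvidenceBundle:
    source: str
    incident_id: str
    items: tuple[str, ...] = ()

    def __post_init__(self) -> None:
        # Reject malformed bundles at the boundary, before another agent reads them.
        if self.source not in ALLOWED_SOURCES:
            raise ValueError(f"unknown evidence source: {self.source!r}")
        if not self.incident_id:
            raise ValueError("incident_id must be non-empty")
        if not all(isinstance(item, str) for item in self.items):
            raise TypeError("items must all be strings")


bundle = EvidenceBundle(source="logs", incident_id="INC-1", items=("pool timeout",))
```

The point of the pattern is that a bad bundle fails loudly where it is created, not three agents downstream.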

Orchestrator

Triggers & coordinates

Fires the moment an incident is resolved. Pulls the 3 most semantically similar past post-mortems from ChromaDB, spawns specialists, enforces isolation — the log agent can't touch metrics, metrics can't read Slack.

Evidence Crew

4 agents, in parallel

Log collector, metrics querier, git history reader, and Slack scanner. Each scoped to one data source, returns a typed evidence bundle. No cross-contamination.
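
A rough sketch of the fan-out, assuming the collectors are async functions. The function names, the dict-shaped clients, and the sample records are all made up for illustration; the real agents talk to real data sources.

```python
import asyncio


# Hypothetical collector: it receives only the client for its own source.
async def collect(source: str, client: dict) -> dict:
    await asyncio.sleep(0)  # stands in for a network call
    return {"source": source, "items": list(client["records"])}


async def gather_evidence(clients: dict[str, dict]) -> list[dict]:
    # One coroutine per source, run concurrently; results keep input order.
    tasks = [collect(name, client) for name, client in clients.items()]
    return await asyncio.gather(*tasks)


# Made-up sample data, one client per data source.
clients = {
    "logs": {"records": ["redis pool exhausted"]},
    "metrics": {"records": ["p99 latency spike"]},
    "git": {"records": ["deploy of cache client change"]},
    "slack": {"records": ["on-call paged"]},
}
bundles = asyncio.run(gather_evidence(clients))
```

Isolation here is structural: a collector cannot read Slack if it was never handed the Slack client.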

Synthesis Chain

Timeline → RCA → Actions → Writer

Four linked specialists: TimelineBuilder reconstructs what happened, RootCauseAnalyst determines why, ActionItemGenerator proposes changes, PostMortemWriter assembles the doc.
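
The chain can be pictured as four functions, each adding one field to a shared document. This is a sketch under assumptions: the function names mirror the specialists but are hypothetical, the bodies are placeholders where the real agents would call an LLM, and the timestamps are invented.

```python
from functools import reduce
from typing import Callable

Stage = Callable[[dict], dict]


def build_timeline(doc: dict) -> dict:
    # ISO-style timestamps sort chronologically as plain strings.
    return {**doc, "timeline": sorted(doc["evidence"])}


def analyze_root_cause(doc: dict) -> dict:
    # Placeholder rule: treat the earliest event as the cause.
    return {**doc, "root_cause": doc["timeline"][0][1]}


def generate_actions(doc: dict) -> dict:
    return {**doc, "actions": [f"Prevent recurrence of: {doc['root_cause']}"]}


def write_post_mortem(doc: dict) -> dict:
    lines = [f"Root cause: {doc['root_cause']}", *doc["actions"]]
    return {**doc, "draft": "\n".join(lines)}


CHAIN: list[Stage] = [build_timeline, analyze_root_cause, generate_actions, write_post_mortem]


def run_chain(evidence: list[tuple[str, str]]) -> dict:
    # Feed each stage the output of the previous one.
    return reduce(lambda doc, stage: stage(doc), CHAIN, {"evidence": evidence})


result = run_chain([
    ("2024-01-01T10:05", "errors spike"),
    ("2024-01-01T10:00", "pool exhausted"),
])
```

Order matters: each stage reads a key that only the stage before it writes.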

LLM-as-Judge

Quality gate + revision loop

An EvaluatorAgent scores the draft against a 5-dimension rubric. Low scores trigger a revision, capped at 3 cycles to prevent runaway loops. Nothing publishes without human approval.
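
The control flow of a bounded revision loop fits in a few lines. A sketch, with assumptions flagged: the 0.8 threshold, the single-number score, and the decision to return the last draft marked as failed once the budget runs out are mine, not documented behaviour. Only the cap of 3 comes from the description above.

```python
from typing import Callable

MAX_REVISIONS = 3  # the documented cap on revision cycles


def review_loop(
    draft: str,
    evaluate: Callable[[str], float],
    revise: Callable[[str], str],
    threshold: float = 0.8,  # assumed value, not from the source
) -> tuple[str, bool, int]:
    """Return (final_draft, passed, revisions_used)."""
    revisions = 0
    while True:
        if evaluate(draft) >= threshold:
            return draft, True, revisions
        if revisions == MAX_REVISIONS:
            # Budget spent: stop and hand back the draft marked as failed.
            return draft, False, revisions
        draft = revise(draft)
        revisions += 1


# Toy judge and reviser: each "!" is worth 0.3 points.
outcome = review_loop("draft", lambda d: d.count("!") * 0.3, lambda d: d + "!")
stuck = review_loop("draft", lambda d: 0.0, lambda d: d + "!")
```

The loop always terminates: the judge runs at most four times and the reviser at most three.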

See It Run

Live demo.

Three pre-recorded scenarios showing the full pipeline — from incident trigger to published post-mortem. Semantic retrieval against a seeded knowledge base demonstrates how past incidents contextualize new ones.

retro-pilot demo — hierarchical multi-agent post-mortem pipeline in action


Getting Started

Three commands.

Clone, configure, docker compose up. ChromaDB runs in its own container; no separate setup.

Clone the repo

```
git clone https://github.com/adnanafik/retro-pilot
cd retro-pilot
```

Configure your secrets

Copy .env.example to .env and fill in your Anthropic API key plus any data source you want to connect (GitHub, Datadog, Slack, Loki — all are optional and retro-pilot adapts to whichever you provide).

```
cp .env.example .env
# edit .env:
# ANTHROPIC_API_KEY=sk-ant-...
# GITHUB_TOKEN=ghp_...           (optional — for git history)
# DATADOG_API_KEY=...             (optional — for metrics)
# SLACK_TOKEN=xoxb-...            (optional — for Slack scans)
```
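
"Adapts to whichever you provide" could work roughly like this: check which variables are set and enable only those sources. A sketch, not the actual startup code. The variable names come from the listing above; the function name, the mapping, and the assumption that ANTHROPIC_API_KEY is mandatory are mine.

```python
# Variable names are from the listing above; the labels reuse its comments.
OPTIONAL_SOURCES = {
    "GITHUB_TOKEN": "git history",
    "DATADOG_API_KEY": "metrics",
    "SLACK_TOKEN": "Slack scans",
}


def enabled_sources(env: dict[str, str]) -> list[str]:
    # Assumption: the LLM key is required, everything else is optional.
    if not env.get("ANTHROPIC_API_KEY"):
        raise RuntimeError("ANTHROPIC_API_KEY is required")
    return [label for var, label in OPTIONAL_SOURCES.items() if env.get(var)]


# Placeholder values, not real credentials.
active = enabled_sources({
    "ANTHROPIC_API_KEY": "sk-ant-placeholder",
    "SLACK_TOKEN": "xoxb-placeholder",
})
```

In real use you would pass os.environ instead of a literal dict.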

Run it
```
docker compose up
```
Trigger a post-mortem by sending an incident-resolved webhook (POST /incidents), or run python -m retro_pilot.demo inside the container to walk through a pre-recorded scenario end to end.
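
For a sense of what that webhook call could look like from Python, here is a sketch that builds the request without sending it. Only the POST /incidents route comes from the text above. The host, the port, and every payload field are hypothetical, so check the repo for the real schema before relying on any of it.

```python
import json
import urllib.request


def build_incident_request(base_url: str, incident: dict) -> urllib.request.Request:
    # Only the /incidents path and the POST method come from the docs.
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/incidents",
        data=json.dumps(incident).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, port, and payload fields, for illustration only.
req = build_incident_request(
    "http://localhost:8000",
    {"id": "INC-1", "status": "resolved", "summary": "Redis connection pool exhaustion"},
)
# To actually send it: urllib.request.urlopen(req)
```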

FAQ

Common questions.

Who is this for?

Platform, SRE, and DevOps teams whose incident volume has outgrown their post-mortem discipline. If you're handling 5+ incidents a week and your team is either skipping post-mortems entirely or writing them days late with half the context lost, retro-pilot turns the process from expensive-and-optional into automatic-and-reliable.

Why semantic search? How does it actually find similar past incidents?

Confluence and Notion search by tokens — the literal words in your query. Two engineers describing the same incident a month apart will use different words, and keyword search misses the connection. retro-pilot embeds every approved post-mortem via sentence-transformers into ChromaDB. A new incident's summary is embedded the same way and a cosine-similarity lookup returns the top-3 most meaningfully similar past incidents — regardless of vocabulary overlap.
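
The ranking step itself is small. The sketch below does a top-3 cosine-similarity lookup in plain Python, with hand-written 3-dimensional vectors standing in for real sentence-transformers embeddings and a dict standing in for ChromaDB. All titles and numbers are invented for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], corpus: dict[str, list[float]], k: int = 3) -> list[str]:
    # Rank stored post-mortems by similarity to the query, highest first.
    ranked = sorted(corpus.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [title for title, _ in ranked[:k]]


# Toy vectors: real embeddings have hundreds of dimensions.
corpus = {
    "Redis connection pool exhaustion": [0.9, 0.1, 0.0],
    "Connection pool timeout": [0.8, 0.2, 0.1],
    "TLS certificate expired": [0.0, 0.1, 0.9],
    "Disk full on build agent": [0.1, 0.9, 0.2],
}
new_incident = [0.85, 0.15, 0.05]
similar = top_k(new_incident, corpus)
```

The two pool-related incidents land at the top because their vectors point the same way, whatever words their titles use.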

What's LLM-as-judge and why does it matter?

A dedicated EvaluatorAgent scores each draft post-mortem against a 5-dimension rubric: timeline accuracy, root-cause depth, action-item specificity, clarity, and factual grounding in the evidence bundle. Scores below threshold trigger a revision cycle. Without this gate, agent output quality drifts silently, and you'd be approving bad post-mortems without realizing they'd gotten worse. The judge keeps the Writer honest. Revisions are capped at 3 cycles to prevent runaway loops.
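
One plausible shape for the rubric is sketched here. The five dimensions are the ones named above. The 0-to-1 scale, the 0.7 threshold, the per-dimension pass rule, and the names RubricScores and failing_dimensions are assumptions for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RubricScores:
    # The five dimensions named in the text; a 0.0-1.0 scale is assumed.
    timeline_accuracy: float
    root_cause_depth: float
    action_item_specificity: float
    clarity: float
    factual_grounding: float


def failing_dimensions(scores: RubricScores, threshold: float = 0.7) -> list[str]:
    # Assumption: every dimension must clear the threshold on its own.
    return [f.name for f in fields(scores) if getattr(scores, f.name) < threshold]


scores = RubricScores(
    timeline_accuracy=0.9,
    root_cause_depth=0.5,
    action_item_specificity=0.8,
    clarity=0.95,
    factual_grounding=0.6,
)
needs_work = failing_dimensions(scores)
```

Returning the names of the weak dimensions, not just a pass/fail flag, gives the revision step something specific to fix.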

Does it auto-publish post-mortems?

No. Every draft that passes the judge lands in a human review queue. A human hits "publish" before anything is embedded into the knowledge base or posted to the team. This is the same design choice as ops-pilot: agents are excellent at structured synthesis; they are not yet reliable enough to be the final authority on your incident history.

Is my data sent anywhere?

Only to Anthropic for the LLM calls that run the agents. No analytics, no telemetry back to me, no third parties. retro-pilot is fully self-hosted — your incident data, logs, metrics, and Slack threads never leave your infrastructure except for the Claude round-trip. Anthropic's data policy applies to that traffic.

How does this compare to incident.io or FireHydrant?

Those are full-stack incident-management platforms — on-call rotation, status pages, runbooks, AI copilots, the whole thing. retro-pilot is laser-focused on the post-mortem layer and designed to plug into whatever incident tool you already use. It listens for an incident-resolved webhook and produces the document. No rip-and-replace: use retro-pilot to upgrade your post-mortem pipeline without touching the rest of your stack.

Like it? Star the repo.

Open source, MIT licensed, built with Claude. Contributions welcome.
