I built a team of AI agents that fixes CI/CD failures while I sleep — here's what I learned about the future of DevOps.
Three weeks ago I finally acted on a thought that had been bouncing around my head for a while: what if your CI/CD pipeline didn't just tell you something broke — it actually fixed it?
Not a Slack alert. Not a PagerDuty page waking someone up at 2am. An autonomous system that detects the failure, reads the logs, figures out the root cause, opens a fix PR, and notifies your team — all before a human has even opened their laptop.
So I built it. I called it ops-pilot.
What it actually does
ops-pilot runs four AI agents in sequence, each with a single responsibility:
Monitor polls your CI provider (GitHub Actions, GitLab CI, Jenkins) for failed runs and builds a structured failure model from the log tail.
Triage sends those logs to Claude and extracts root cause, severity, and fix confidence into a typed output — not a blob of text, a structured object your system can act on.
Fix asks the model which file to edit, fetches it from GitHub, generates a minimal patch, and opens a draft PR. Humans review before anything merges. This is non-negotiable.
Notify writes a concise Slack message and posts it to the right channel with the PR link attached.
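The four-stage flow can be sketched as a short sequential pipeline. This is a minimal illustration, not ops-pilot's actual API — the class names, thresholds, and stub triage logic below are all assumptions; the real system calls Claude where the stub stands in:

```python
from dataclasses import dataclass

# Hypothetical structures -- illustrative names, not ops-pilot's real types.
@dataclass
class Failure:
    pipeline: str
    log_tail: str

@dataclass
class Triage:
    root_cause: str
    severity: str          # "low" | "medium" | "high"
    fix_confidence: float  # 0.0 - 1.0

def triage_failure(failure: Failure) -> Triage:
    # ops-pilot sends the log tail to Claude and parses a structured reply;
    # this stub fakes that step so the sketch runs offline.
    if "ModuleNotFoundError" in failure.log_tail:
        return Triage("missing dependency", "medium", 0.9)
    return Triage("unknown", "high", 0.2)

def run_pipeline(failure: Failure) -> dict:
    triage = triage_failure(failure)
    actions = {"triage": triage, "pr_opened": False, "notified": False}
    # Fix stage only drafts a PR when confidence is high; the 0.7 cutoff is
    # an assumed value. Humans still review before anything merges.
    if triage.fix_confidence >= 0.7:
        actions["pr_opened"] = True   # would open a *draft* PR via the CI provider
    actions["notified"] = True        # Slack message with the PR link, if any
    return actions

result = run_pipeline(Failure("build", "ModuleNotFoundError: No module named 'pyarrow'"))
print(result["pr_opened"], result["triage"].root_cause)
```

The key property this sketch preserves is that triage produces a typed object, so downstream stages branch on fields rather than parsing free text.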
The whole pipeline runs in under 10 seconds per incident. The live demo is at ops-pilot.adnankhan.me if you want to see it in action.
ops-pilot also ships with Claude Code slash commands — /run, /triage, /add-pipeline, /new-provider, and /scenario — so engineers can interact with the system conversationally from their terminal without touching a dashboard or config file.
The architectural decisions that actually mattered
I've spent 5+ years as a Director of DevOps. I've seen a lot of automation projects that sounded great in a presentation and collapsed in production. Building ops-pilot forced me to be honest about where AI agents are genuinely useful and where they're still a liability.
Humans stay in the loop on merges — always.
Every fix the system generates lands as a draft PR, never auto-merged. This isn't timidity. It's the correct architecture for 2026. The agents are good at triage and suggestion. They are not yet reliable enough to push to main without review. Any system that skips this step is going to have a very bad day in production.
The provider abstraction matters more than the agents.
I built a CIProvider abstract base class with seven interface methods. GitHub Actions, GitLab CI, and Jenkins are all adapters behind it. This means the agent code doesn't care what CI system you're running — you swap the provider, not the intelligence. This is the kind of decision that separates a demo from a real platform.
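In outline, the adapter pattern looks like this. The three methods shown are illustrative stand-ins (the actual interface has seven), and the stubbed GitHub adapter fakes its API calls so the sketch is self-contained:

```python
from abc import ABC, abstractmethod

class CIProvider(ABC):
    """Adapter interface: agent code depends only on this, never on a vendor API.
    Method names are assumptions, not ops-pilot's real seven-method interface."""

    @abstractmethod
    def list_failed_runs(self) -> list[dict]: ...

    @abstractmethod
    def fetch_log_tail(self, run_id: str, lines: int = 200) -> str: ...

    @abstractmethod
    def open_draft_pr(self, branch: str, title: str, body: str) -> str: ...

class GitHubActionsProvider(CIProvider):
    # Stub responses; real code would call the GitHub REST API.
    def list_failed_runs(self) -> list[dict]:
        return [{"id": "42", "workflow": "ci.yml"}]

    def fetch_log_tail(self, run_id: str, lines: int = 200) -> str:
        return "ModuleNotFoundError: No module named 'pyarrow'"

    def open_draft_pr(self, branch: str, title: str, body: str) -> str:
        return "https://github.com/example/repo/pull/1"

def poll(provider: CIProvider) -> list[str]:
    # Agent logic is provider-agnostic: a GitLab or Jenkins adapter slots in
    # without changing this function.
    return [provider.fetch_log_tail(run["id"]) for run in provider.list_failed_runs()]

print(poll(GitHubActionsProvider()))
```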
Use open PRs as your deduplication source of truth.
Early versions of this kept a separate state table to track “already processed” failures. Then I realized: the PR itself is the state. If there's already an open PR for a failure, don't open another one. Git is your database. This eliminated an entire class of race conditions.
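One way to implement that idea is to derive a stable fingerprint from the failure and embed it in the fix branch name, then check open PRs for it before acting. A minimal sketch under those assumptions — the fingerprint scheme and branch naming are mine, not necessarily ops-pilot's:

```python
import hashlib

def fingerprint(failure: dict) -> str:
    # Deterministic slug identifying the failure class; embedded in the
    # fix PR's branch name so the PR itself carries the dedup key.
    key = f"{failure['pipeline']}:{failure['root_cause']}"
    return hashlib.sha256(key.encode()).hexdigest()[:10]

def should_open_pr(failure: dict, open_prs: list[dict]) -> bool:
    # Git is the database: an open PR whose branch carries this fingerprint
    # means the failure is already being handled -- no separate state table.
    fp = fingerprint(failure)
    return not any(fp in pr["head_branch"] for pr in open_prs)

failure = {"pipeline": "build", "root_cause": "missing pyarrow"}
branch = f"ops-pilot/fix-{fingerprint(failure)}"
print(should_open_pr(failure, []))                        # no open PR yet
print(should_open_pr(failure, [{"head_branch": branch}])) # already in flight
```

Because the check reads live PR state at decision time, a crashed or restarted poller can't double-open a fix: the source of truth survives the process.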
Simulation mode for the demo, live API for local testing.
The public demo replays three pre-recorded scenarios with zero API cost. This isn't a compromise — it's an architectural decision. A demo's job is to show the concept clearly and repeatably. Engineers who want to see the real thing clone the repo and run it themselves.
What surprised me about building with agent teams
The hardest part wasn't the AI. It was the scaffolding.
Getting Claude to triage a CI failure accurately is straightforward — feed it the log tail and the diff, ask for structured output, done. The hard part is building the environment around the agent so it can operate reliably with minimal human oversight:
Tests are the brain, not the agents.
The agents are only as useful as your ability to verify their output. If your test suite is weak, the fix agent will confidently solve the wrong problem. I spent more time on the test harness than on any individual agent.
Context pollution kills agent performance.
If you dump 10,000 lines of CI logs into the context window, you get worse results than if you intelligently extract the 50 most relevant lines. Log curation is a first-class engineering problem, not an afterthought.
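A simple version of that curation step: scan the log for error markers and keep only a small window of context around each hit. The marker list and window sizes here are illustrative defaults, not what ops-pilot actually uses:

```python
import re

# Illustrative marker set; a real system would tune this per tool/language.
ERROR_MARKERS = re.compile(
    r"(error|fail(ed|ure)?|exception|traceback|fatal|not found)", re.IGNORECASE
)

def curate_log(raw_log: str, max_lines: int = 50, context: int = 2) -> str:
    """Keep only lines near error markers instead of dumping the whole log
    into the model's context window."""
    lines = raw_log.splitlines()
    keep: set[int] = set()
    for i, line in enumerate(lines):
        if ERROR_MARKERS.search(line):
            # Keep a few lines of surrounding context for each match.
            keep.update(range(max(0, i - context), min(len(lines), i + context + 1)))
    selected = [lines[i] for i in sorted(keep)][:max_lines]
    return "\n".join(selected)

log = "step 1 ok\nstep 2 ok\nERROR: build failed\nexit code 1\nstep 3 skipped\ncleanup done"
print(curate_log(log))
```

Even this crude filter turns a 10,000-line log into a focused excerpt; smarter versions can rank stack traces and deduplicate repeated errors before anything reaches the model.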
The Anthropic multi-agent compiler article, which inspired this project, makes a point I keep coming back to: the scaffolding is the product. The LLM is a commodity. What differentiates good agentic systems is the environment, the feedback loops, and the guardrails — not which model you call.
What this means for platform engineering teams
I want to be direct about something: this is not a threat to DevOps engineers. It's a force multiplier.
At Fabletics, we had hundreds of pipelines running at any given time — deploying simultaneously across dev, QA, and production environments. When failures came in, they didn't come in one at a time. They came in waves. A bad library upgrade or a flaky integration test could light up dozens of pipelines at once, and the DevOps team would spend hours just triaging logs across GitHub Actions and Jenkins to find the actual root cause — before writing a single line of the fix. At that scale, the bottleneck isn't expertise. It's bandwidth. No team is fast enough to manually triage hundreds of concurrent failures without things slipping through.
The engineers I've managed who are most excited about agentic AI are the ones who are already great at their jobs. Because they immediately understand what it means to offload the mechanical work — the 2am “test_payment_webhook is flaky again” page, the “missing pyarrow dependency” Docker build failure that blocks three teams — and redirect that cognitive load toward architecture, reliability strategy, and the problems that actually require human judgment.
The engineers who are nervous are the ones whose value proposition is being the person who knows how to fix things. That's a real concern. But the trajectory is clear: the definition of “fixing things” is shifting from “executing the fix” to “designing the systems that execute fixes.”
Directors and VPs hiring in 2026 should be asking: which of my engineers are building toward that future, and which ones are waiting to see if it arrives?
What's next
ops-pilot is an early proof of concept, not a production-ready product. There are real gaps: it doesn't handle merge conflicts in the fix PRs gracefully, the Jenkins provider is lightly tested, and the triage agent occasionally misattributes root cause when the failure is infrastructure rather than code.
But the architecture is sound, the pattern is validated, and the direction is right. The repo is public at github.com/adnanafik/ops-pilot — built with Claude (Anthropic), MIT licensed, contributions welcome.