We Let an AI Agent Say 'I Passed.' Was It Actually Good?

We ran 10 controlled experiments to answer a question nobody seems to be asking: When an AI agent says the tests pass, does that mean the code is actually ready to ship?

There's a lot of noise right now about AI agents writing code. Faster commits. More output. Ship it yesterday.

But here's a question we haven't seen anyone answer with data: When the agent says "all tests pass," is the code actually good?

We ran the experiment. Here's what we found.

The Setup

We gave an AI agent the same task 10 times: build a Snake game from scratch using TDD. Same spec. Same model. Same prompt scaffolding. 20x20 grid, arrow key controls, collision detection, score tracking, retro aesthetic. The whole thing.

The only variable: 5 runs had CI pipeline integration. 5 didn't.

Both groups followed test-driven development. Both wrote comprehensive test suites. Both committed and pushed code. The CI-enabled group got one additional thing: real-time feedback from a CircleCI pipeline after every push. For the experiment, we built a RalphCI loop (that is, a CI-enabled Ralph loop) that pre-fetches pipeline status and injects it into the agent's prompt context, so the agent can see and act on CI results as part of its development cycle.

That's it. One variable.

What Happened

All 10 runs finished. All 10 completed every task. All 10 produced a playable Snake game with full test coverage. All 10 passed local tests.

Every single agent said "passed."

Then we checked the CI pipeline.

Metric	Without CI (5 runs)	With CI (5 runs)
Tasks completed	7/7	7/7
Local tests passing	✅	✅
Game fully playable	✅	✅
CI pipeline passing	1 out of 5	5 out of 5

80% of the runs without CI integration (4 out of 5) shipped code that failed the pipeline. The agent ran the tests on its machine, saw green, and declared victory. The pipeline disagreed.

What Went Wrong (and Why It's Instructive)

The failure was an ESLint configuration mismatch. The agent defaulted to single quotes. The project's linter required double quotes. Local tests don't check lint rules. The CI pipeline does.
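For illustration, a rule like the one below is enough to produce this kind of mismatch. This is a hypothetical fragment, not the project's actual config (which isn't shown in this post). It assumes ESLint's core `quotes` rule in a flat-config-style array; the variable name and file glob are ours.

```typescript
// Hypothetical ESLint flat-config fragment; the experiment's real config is not shown here.
// Assumes ESLint's core `quotes` rule (newer setups may use a stylistic plugin instead).
const lintConfig = [
  {
    files: ["**/*.ts"],
    rules: {
      // Report single-quoted string literals as errors, which fails the lint step.
      quotes: ["error", "double"],
    },
  },
];
```

A unit test runner never reads this file, so a test suite written entirely in single quotes can be fully green locally and still fail the lint job in CI.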

This isn't a dramatic failure. It's a mundane one. And that's the point.

AI agents don't inherently know your project's lint config, your environment variables, your CI-specific test runners, or your integration-level constraints. Local tests verify logic. CI verifies deployability. These are different things, and the gap between them is where "works on my machine" lives.

The one non-CI run that passed? The agent happened to discover the project's lint:fix command on its own. Lucky, not reliable. Not a strategy.

What the CI-Enabled Agent Actually Did

When CI failed, the agent didn't just see a red X. The RalphCI loop pre-fetches the pipeline status and injects it into the agent's prompt context at the start of its next iteration. The agent sees something like:

CI Status

Pipeline #551 FAILED with linting errors
Issue: game.test.ts using single quotes instead of double quotes
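
As a rough sketch of how a status block like that could be assembled and placed ahead of the task prompt: the `PipelineStatus` shape, its field names, and `buildPrompt` are assumptions made for illustration, not RalphCI's actual interface.

```typescript
// Hypothetical shape for a pre-fetched pipeline result; field names are assumptions.
interface PipelineStatus {
  pipelineNumber: number;
  state: "passed" | "failed" | "running";
  summary: string; // short failure category, e.g. "linting errors"
  issues: string[]; // human-readable lines extracted from the failure log
}

// Render the status as a plain-text block and prepend it to the agent's task prompt.
function buildPrompt(status: PipelineStatus | null, taskPrompt: string): string {
  if (status === null) {
    // Nothing has been fetched yet (for example, before the first push).
    return taskPrompt;
  }
  const header =
    status.state === "failed"
      ? `Pipeline #${status.pipelineNumber} FAILED with ${status.summary}`
      : `Pipeline #${status.pipelineNumber} ${status.state.toUpperCase()}`;
  const issueLines = status.issues.map((issue) => `Issue: ${issue}`);
  return ["CI Status", "", header, ...issueLines, "", taskPrompt].join("\n");
}
```

The design choice worth noting is that the status arrives as part of the prompt at the start of an iteration, so the agent does not have to remember to go and ask for it.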

Here's what happened next, across all 5 CI-enabled runs:

12 CI failures appeared in total
The agent fixed all 12 autonomously
Zero human intervention required

In every case, the agent dropped its planned task, diagnosed the root cause from the failure log, applied the fix, re-verified locally, pushed, and resumed its work once the pipeline came back green.
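
One plausible way to write down that priority order is a small decision function. This is our own simplification of the observed behavior, with hypothetical type and action names; it is not code from the agent or from RalphCI.

```typescript
// Hypothetical state and action names, chosen for illustration only.
type CiState = "passed" | "failed" | "running" | "unknown";
type Action =
  | "fix_ci"
  | "continue_planned_task"
  | "wait_for_ci"
  | "mark_ready_for_review";

interface LoopState {
  ci: CiState;
  localTestsPass: boolean;
  plannedTasksRemaining: number;
}

function nextAction(state: LoopState): Action {
  // A red pipeline preempts whatever task was planned next.
  if (state.ci === "failed") return "fix_ci";
  // Otherwise keep working until local tests pass and no tasks remain.
  if (!state.localTestsPass || state.plannedTasksRemaining > 0) {
    return "continue_planned_task";
  }
  // All tasks done: only a green pipeline allows the PR to be marked ready.
  return state.ci === "passed" ? "mark_ready_for_review" : "wait_for_ci";
}
```

For example, `nextAction({ ci: "failed", localTestsPass: true, plannedTasksRemaining: 3 })` yields `"fix_ci"` even though three planned tasks are still waiting.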

It refused to mark the PR as "Ready for review" until CI was green.

Or as Dan Lorenc put it: "Code is cheap. Green CI is priceless."

The Behavioral Shift

The most interesting thing in this experiment isn't the pass rate. It's how the agent's behavior changes when it has access to pipeline feedback.

Without CI, the agent's definition of "good" is: local tests pass. It has no reason to think otherwise. It runs the suite, sees green, declares victory, and moves on. It's not being careless. It literally has no other signal available.

With CI, the agent operates differently. It treats the pipeline as the source of truth. We watched it:

Query pipeline status before starting new work
Drop planned tasks to prioritize CI fixes
Parse failure logs to diagnose root causes
Re-verify both locally and through the pipeline before continuing
Wait for pipeline completion before declaring a PR ready

This is closer to how a senior engineer works than a junior one. Not because the model is smarter. It's the same model in both groups. The feedback loop gives it the information it needs to operate at a higher level.

An Unexpected Finding: CI Makes Agents Write More Tests

CI-enabled runs consistently produced more tests: 28-37 per run versus 21-28 for non-CI runs.

We didn't instruct the agent to write more tests in CI mode. The prompt was identical. But when the agent operates within a feedback loop that includes external verification, it appears to write more thorough test suites on its own.

We don't want to overclaim causation from 10 runs, but the pattern was consistent across all 5 CI-enabled experiments. If we had to guess: agents test more thoroughly when they know their work will be checked by something other than themselves. Which sounds a lot like what humans do.

The Velocity Problem Nobody's Talking About

Everyone's focused on making AI agents faster at writing code. But code generation was never the bottleneck, and it definitely isn't now. The bottleneck is knowing whether the code you generated is actually safe to ship. As agents accelerate how fast code gets written, the gap between "code produced" and "code that's actually deployable" doesn't shrink. It widens. More code, generated faster, with the same (or less) visibility into whether it integrates, passes lint, respects environment config, and behaves correctly outside the agent's local context.

That gap is where our 80% failure rate lives.

CI is what closes it. Not by slowing agents down, but by giving them the signal they need to self-correct. The CI-enabled runs in our experiment took more iterations and more time. They also produced 100% deployable code. The non-CI runs were "faster" and 80% broken.

Speed without confidence isn't velocity. It's drift.

The teams that will move fastest in the agent era won't be the ones generating the most code. They'll be the ones with the tightest feedback loops between code generation and verification, the ones who can point an agent at a problem and trust the pipeline to keep it honest.

What This Means If You're Building with AI Agents

This experiment was small and controlled. A Snake game, not a production codebase. But the finding generalizes to a principle:

The feedback loop defines the output quality.

Same agent. Same task. Same model. Same prompt. The only difference was whether the agent could see what the pipeline saw. That single variable moved the CI pass rate from 20% to 100%.

If you're running AI agents against real codebases without CI in the loop, you're likely in the same position as our non-CI runs: the agent is confident, the code passes local tests, and you have no signal on whether it actually integrates.

Three things we'd suggest based on what we found:

Close the feedback loop. If your agent can't see CI results, it can't fix CI failures. This sounds obvious, but most agent setups today push code and move on. The agent needs to see the pipeline status and have the opportunity to act on it.

Treat systematic failures as tuning opportunities. The ESLint quote mismatch appeared in every single run. That's not randomness; it's a systematic pattern you can tune for. Injecting your project's lint rules into the agent's prompt context, or adding a pre-commit hook, could eliminate the most common failure before it ever hits CI.
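
A minimal sketch of the prompt-context option, assuming the lint rules are available as a plain object keyed by rule name. The function name, the input shape, and the wording of the generated note are all hypothetical; we have not tested this approach yet.

```typescript
// Turn error-level lint rules into a short plain-text note for the agent's prompt.
// Input shape is an assumption: rule name -> [severity, ...options].
function lintRulesToPromptNotes(rules: Record<string, string[]>): string {
  const lines = Object.entries(rules)
    // Only rules that fail the pipeline are worth the prompt space.
    .filter(([, setting]) => setting[0] === "error")
    .map(([name, setting]) => {
      const options = setting.slice(1).join(", ");
      return options ? `- ${name}: ${options}` : `- ${name}`;
    });
  if (lines.length === 0) return "";
  return ["Lint rules enforced in CI (violations fail the pipeline):", ...lines].join("\n");
}
```

Calling it with `{ quotes: ["error", "double"], semi: ["warn", "always"] }` produces a two-line note that mentions only the `quotes` rule, since the warning-level rule would not break the build.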

Invest in your pipeline, not just your prompts. The agent's ceiling is defined by the infrastructure around it. A better prompt might make the agent write slightly better code. A CI feedback loop changes what "done" means.

What We're Doing Next

This was our first controlled study with proper sample size and identical configurations across groups. We're planning to run the same experiment design against production-scale codebases with more complex CI pipelines, compare across different models, and explore whether smarter prompt context (feeding lint rules and failure patterns directly to the agent) can reduce CI failures before they happen.

We're also interested in what this enables when you run agents in parallel: if each CI-verified run produces a genuinely deployable artifact, you can spin up 10 variants of a feature, all tested, all passing CI, and pick the best one. That changes how you think about prototyping.

All 10 runs finished before lunch. Ship it. With receipts. 🚀

This experiment was conducted by the CircleCI AI Testing Lab using RalphCI, a soon-to-be open-source tool for integrating AI coding agents with CI pipelines. The full dataset (10 runs, per-iteration metrics, activity logs) will be published alongside the repo. Stay tuned.

We're sharing this because we think developers deserve data, not just opinions, on how to work effectively with AI agents. If you've run similar experiments, we'd love to compare notes.