Hardening RalphCI loops for open source after the February 2026 study

RalphCI is a self-healing, CI-aware AI coding loop that automatically fixes CI failures. It is a CLI-orchestrated loop that keeps the model tied to real CI pipeline state, swaps in a CI Doctor to perform code surgery when the pipeline is bleeding red, runs a deterministic Review Gate before push, and refuses to treat the run as finished until the remote build is green, not only local tests.

If you are landing here cold: when we say "the agent's work is done", we mean "the CI pipeline is green", not only "the local tests pass on your laptop".

The February study already showed the problem: with CI out of the loop, local tests can pass while the pipeline still fails; with CI in the loop, what the agent calls done matches what ships. This experiment puts that hardened, CI-driven loop into the wild under an MIT license, tightening rough edges while preserving the multi-agent structure that keeps each role clear.

Code is on GitHub. Video walkthrough: Loop Lab demo on YouTube.

Hypothesis

I expected we could keep the behavior we measured in the February study, address issues that only show up under long runs, and ship RalphCI as a small CLI that other people can run without a custom one-off setup.

Stated as an if-then: if the CI status check stays at the start of each task iteration, CI failures stay ahead of new feature work, and a deterministic Review Gate still runs before push, then hardening the loop and open-sourcing the project does not weaken the rule that completion waits on a green CI pipeline. If anything, it should strengthen that rule.

This is an engineering follow-up, not another n=10 Snake batch. The February study supplies the quantitative baseline.

This piece is what we changed to make that baseline shippable.

Setup

How RalphCI fits together (plain language)

Think of the ralphci CLI as a team of coding and review agents that sits between your repo, version control, and your CI pipeline. The model still edits files and runs tests on your machine. The CLI decides what kind of help the coding agent gets next and when commits actually get pushed to version control, so your CI bill and your definition of "green" stay connected to reality.

Iterations. One task iteration is one pass:

Check pipeline status
Run the agent on the current task
Run local checks
Commit if appropriate
Repeat until your task list is complete and, when CI is enabled, the remote pipeline is in a passing state.
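
The pass described above can be sketched as a small loop with injected dependencies. This is a minimal illustration under stated assumptions, not RalphCI's source: every name (LoopDeps, runLoop, and so on) is hypothetical, the calls are synchronous for brevity where a real CLI would be async, and commit() is assumed to also push so the next status check sees the new work.

```typescript
// Sketch only: all names are hypothetical, not RalphCI's real API.
type PipelineStatus = "passed" | "failed" | "running" | "none";

interface LoopDeps {
  checkPipeline(): PipelineStatus;         // step 1: ask CI about the branch
  runAgent(status: PipelineStatus): void;  // step 2: agent works on the current task
  runLocalChecks(): boolean;               // step 3: lint and tests on this machine
  commit(): void;                          // step 4: assumed here to also push
  tasksRemaining(): number;
}

interface LoopResult {
  done: boolean;
  iterations: number;
}

function runLoop(deps: LoopDeps, ciEnabled: boolean, maxIterations: number): LoopResult {
  for (let i = 0; i < maxIterations; i++) {
    const status: PipelineStatus = ciEnabled ? deps.checkPipeline() : "none";
    const tasksDone = deps.tasksRemaining() === 0;

    // Step 5: stop only when tasks are done and, with CI on, the pipeline passed.
    if (tasksDone && (!ciEnabled || status === "passed")) {
      return { done: true, iterations: i };
    }
    // Nothing left to build and CI is still running: a real loop would sleep here.
    if (tasksDone && status === "running") continue;

    deps.runAgent(status);
    if (deps.runLocalChecks()) deps.commit();
  }
  return { done: false, iterations: maxIterations };
}
```

The maxIterations bound is an assumption added so the sketch always terminates; the article does not say how RalphCI caps a run.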

CI status early each iteration. At the start of an iteration, the tool asks CircleCI how the latest work on your branch is doing (pass, fail, or still running). Results can be cached so the CLI does not call the API again when nothing new has been pushed. That gives the agent a factual snapshot of the server-side build instead of a guess.
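
One way to picture the caching is a small lookup keyed by branch and head commit. The sketch below is an assumption about how such a cache could work, not a description of RalphCI's internals: createStatusCache and fetchStatus are hypothetical names, and fetchStatus stands in for the real (asynchronous) CircleCI API call.

```typescript
// Sketch only: hypothetical names; fetchStatus stands in for a real API call.
type PipelineStatus = "passed" | "failed" | "running";

interface CachedStatus {
  sha: string;
  status: PipelineStatus;
}

function createStatusCache(fetchStatus: (branch: string, sha: string) => PipelineStatus) {
  const entries = new Map<string, CachedStatus>();
  let apiCalls = 0;

  return {
    get(branch: string, headSha: string): PipelineStatus {
      const hit = entries.get(branch);
      // Reuse only settled results for the same commit; "running" can still change.
      if (hit && hit.sha === headSha && hit.status !== "running") {
        return hit.status;
      }
      apiCalls++;
      const status = fetchStatus(branch, headSha);
      entries.set(branch, { sha: headSha, status });
      return status;
    },
    apiCallCount(): number {
      return apiCalls;
    },
  };
}
```

Keying on the head SHA means a new push naturally invalidates the entry, while a pipeline that is still running is always re-checked.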

Build Agent versus CI Doctor. These are not two separate products. They are two modes the orchestrator selects:

Build Agent handles normal forward progress when you are not blocked on a red pipeline (or CI is not in play). The prompt stays lighter and focused on the task list.
CI Doctor turns on when the pipeline failed. The model receives full failure context (including logs) and is steered to fix what broke in CI before piling on new features. Same model; different job description. That is the "multi-agent" shape: split responsibilities instead of one blob of instructions.
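
The "same model, different job description" idea can be sketched as a mode switch that changes only the prompt. The function names and prompt wording below are illustrative assumptions, not the prompts RalphCI ships.

```typescript
// Sketch only: hypothetical names and prompt text.
type PipelineStatus = "passed" | "failed" | "running" | "none";
type Mode = "build-agent" | "ci-doctor";

// A failed pipeline is the only state that switches the orchestrator to CI Doctor.
function selectMode(status: PipelineStatus): Mode {
  return status === "failed" ? "ci-doctor" : "build-agent";
}

function buildPrompt(mode: Mode, tasks: string[], failureLogs: string = ""): string {
  if (mode === "ci-doctor") {
    // Heavier prompt: full failure context, and an instruction to fix CI first.
    return [
      "The CI pipeline failed. Fix the failure before starting new work.",
      "Failure logs:",
      failureLogs,
    ].join("\n");
  }
  // Lighter prompt: only the task list.
  return ["Continue with the task list:", ...tasks.map((t) => `- ${t}`)].join("\n");
}
```

Keeping the selection in one small function makes the split auditable: the model never decides for itself whether it is doing feature work or CI repair.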

Review Gate. Before a git push, the CLI runs your repo's lint and test commands as ordinary shell steps, with a timeout so nothing hangs forever. No LLM in that step. If lint or tests fail, the push does not go out. The point is to catch formatting and unit-test issues locally before they become another failing pipeline run.
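
A gate of this kind needs nothing more than Node's child_process module. The sketch below is an assumption about the shape of such a gate, not RalphCI's implementation: runReviewGate and GateResult are hypothetical names, and the commands and timeout are whatever the caller passes in.

```typescript
import { spawnSync } from "node:child_process";

// Sketch only: hypothetical names. No LLM is involved, just shell steps.
interface GateResult {
  ok: boolean;
  failedCommand?: string;
  reason?: "exit-code" | "timeout" | "spawn-error";
}

function runReviewGate(commands: string[], timeoutMs: number): GateResult {
  for (const command of commands) {
    // shell: true runs the string as an ordinary shell command;
    // timeout kills the child if it runs longer than timeoutMs.
    const result = spawnSync(command, { shell: true, timeout: timeoutMs, stdio: "inherit" });

    if (result.error) {
      const code = (result.error as NodeJS.ErrnoException).code;
      const reason = code === "ETIMEDOUT" ? "timeout" : "spawn-error";
      return { ok: false, failedCommand: command, reason };
    }
    if (result.status !== 0) {
      return { ok: false, failedCommand: command, reason: "exit-code" };
    }
  }
  return { ok: true };
}
```

A caller might run something like runReviewGate(["pnpm test:run"], 120000) and push only when ok is true. The gate stops at the first failing command, so a lint failure is reported without spending time on the test suite.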

Smart Push. This label means the tool does not push on every save or every commit by default. It pushes only when local checks (including the Review Gate) succeed, so you are less likely to burn CircleCI credits and time on work that already failed lint or tests on your machine. Because a push is what triggers CI, Smart Push sends the branch to the remote only after it has survived those local checks.
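
To make the saving concrete, the sketch below compares "push every commit" with a gate-conditioned push. It assumes each push triggers exactly one pipeline run, and all names (CommitAttempt, shouldPush, pipelineRunsTriggered) are hypothetical.

```typescript
// Sketch only: hypothetical names; assumes one pipeline run per push.
interface CommitAttempt {
  id: string;
  gatePassed: boolean; // result of local checks, including the Review Gate
}

function shouldPush(attempt: CommitAttempt, autoPush: boolean): boolean {
  return autoPush && attempt.gatePassed;
}

function pipelineRunsTriggered(attempts: CommitAttempt[], autoPush: boolean) {
  return {
    pushEveryCommit: attempts.length,
    smartPush: attempts.filter((a) => shouldPush(a, autoPush)).length,
  };
}

const attempts: CommitAttempt[] = [
  { id: "a1", gatePassed: false },
  { id: "a2", gatePassed: false },
  { id: "a3", gatePassed: true },
  { id: "a4", gatePassed: false },
  { id: "a5", gatePassed: true },
];

console.log(pipelineRunsTriggered(attempts, true));
```

With this made-up sequence, five commits lead to two pushes, so three pipeline runs that would have failed on lint or tests are never started.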

When is the run "done"? With CI integration enabled, finishing your tasks is not enough if the pipeline is still red. The loop keeps going until the real CircleCI pipeline is green (per your config), not only until local tests pass. That mismatch between local and remote is exactly what the February study measured in numbers.
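
The completion rule can be written as a small decision function. This is a sketch of the rule as the paragraph states it, with hypothetical names (RunState, completionVerdict); it is not taken from the RalphCI codebase.

```typescript
// Sketch only: hypothetical names encoding the completion rule above.
type PipelineStatus = "passed" | "failed" | "running" | "none";
type Verdict = "done" | "work-remaining" | "waiting-on-pipeline" | "pipeline-not-green";

interface RunState {
  tasksRemaining: number;
  localTestsPass: boolean;
  ciEnabled: boolean;
  pipeline: PipelineStatus;
}

function completionVerdict(s: RunState): Verdict {
  if (s.tasksRemaining > 0 || !s.localTestsPass) return "work-remaining";
  // Local-only mode: finished tasks and passing local tests are all we can check.
  if (!s.ciEnabled) return "done";
  if (s.pipeline === "running") return "waiting-on-pipeline";
  // With CI on, only a passing remote pipeline counts as finished.
  return s.pipeline === "passed" ? "done" : "pipeline-not-green";
}
```

The same state (all tasks done, local tests green, pipeline failed) yields "done" with CI off and "pipeline-not-green" with CI on, which is the local-versus-remote mismatch in miniature.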

Everything above is the default CI-on story. You can tune behavior in ralphci.json and CLI flags without changing that core: for example branch strategy (feature branch vs pushing straight to main), draft vs ready PRs, whether the loop auto-pushes or waits for you, approval gates before you call a run complete, and ralphci run --no-ci when you want local-only iteration with no CircleCI token. See the repo README and AGENTS.md for the full matrix. Those options are packaging; the loop in this section is the part that stays constant when CI is enabled.

Where the baseline came from

The February study holds the detail: same Snake spec, same tasks, same model family, five runs with CI off and five with CI on. One headline from it: 100% of CI-enabled runs ended green on CI; 20% of local-only runs did, with local tests passing in both groups. If you only read one piece from this line of work, that article is the quantitative backbone.

What we changed after that baseline

Work between the February study snapshot and the public repo focused on reliability, clarity, and running on someone else's machine, not on a new product story.

The core loop (Build Agent, CI Doctor, Review Gate, Smart Push, and requiring CI to pass before you call the run complete) stayed the same. The main engineering effort focused on improving how each part handles tricky situations and exceptions. For example, we fixed cases where the CI Doctor would keep running even if a fix had already been applied locally. Pipeline status checks are now limited to just the current branch, preventing confusion with unrelated CI runs. When the pipeline is still running, the tool now polls periodically instead of waiting indefinitely. We also resolved issues with duplicate commits and inaccurate activity metrics, making the logs more reliable and easier to interpret.
We removed non-essential paths. Experimental review-agent code came out. The CLI stays centered on this CI-aware, multi-agent-shaped loop.
We made the tool easy to run from source. The ralphci binary runs TypeScript through tsx, without a separate build step before you iterate.
We licensed it for reuse. MIT. github.com/CircleCI-Research/ralph-ci
We recorded a demo. The YouTube recording walks through setup, a run, and CI feedback in the activity log.
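
The polling change mentioned above (checking a running pipeline periodically instead of waiting indefinitely) can be sketched as a bounded loop. The names, the attempt cap, and the injectable sleep are assumptions made for illustration; the article does not specify RalphCI's interval or limits.

```typescript
// Sketch only: hypothetical names, interval, and attempt cap.
type PipelineStatus = "passed" | "failed" | "running";

interface PollOptions {
  intervalMs: number;
  maxAttempts: number;
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function pollUntilSettled(
  getStatus: () => Promise<PipelineStatus>,
  opts: PollOptions,
): Promise<PipelineStatus | "timed-out"> {
  const sleep = opts.sleep ?? defaultSleep;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    const status = await getStatus();
    if (status !== "running") return status; // settled: passed or failed
    if (attempt < opts.maxAttempts) await sleep(opts.intervalMs);
  }
  // Give control back to the caller instead of blocking forever.
  return "timed-out";
}
```

Returning "timed-out" as a distinct value lets the orchestrator decide what to do next (retry later, report, or stop) instead of hanging on a stuck pipeline.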

Variables

Piece	What we held constant	What we changed
Core loop	The February study as quantitative prior; same north star that completion tracks the real pipeline, not only local tests	Multi-agent orchestration as the shipped design: Build Agent vs CI Doctor, deterministic Review Gate, Smart Push. That is not one monolithic coding agent doing CI fixes, lint, tests, and features in a single undifferentiated prompt. Plus implementation hardening, docs, MIT, and OSS packaging.
Evidence / format	The February study Snake comparison stays the source for the table below	This write-up adds plain-language mechanics, demo, and release notes, not another n=10 A/B run

Results

Quantitative bar (from the February study)

You do not need the February study to use the tool, but it is where these comparison numbers come from. From the same article:

Metric	Local-only (n=5)	CI-enabled (n=5)
Passed CI pipeline	1/5 (20%)	5/5 (100%)
Avg iterations	7.0	10.4
Avg cost	$5.41	$6.76
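
A few figures can be derived from the table by plain arithmetic. The sketch below uses only the numbers shown above and assumes "Avg cost" is the mean cost per run across all five runs in a group; the derived values are not additional measurements, and with n=5 per group they are rough.

```typescript
// Inputs are the table values; everything computed below is derived, not measured.
interface GroupStats {
  label: string;
  runs: number;
  greenRuns: number;
  avgIterations: number;
  avgCostUsd: number;
}

const localOnly: GroupStats = { label: "Local-only", runs: 5, greenRuns: 1, avgIterations: 7.0, avgCostUsd: 5.41 };
const ciEnabled: GroupStats = { label: "CI-enabled", runs: 5, greenRuns: 5, avgIterations: 10.4, avgCostUsd: 6.76 };

// Fractional increase of value over base, e.g. 0.25 means 25% more.
const relativeIncrease = (base: number, value: number): number => (value - base) / base;

// Total spend in the group divided by the runs that ended green on CI.
// Yields Infinity if a group has no green runs.
const costPerGreenRun = (g: GroupStats): number => (g.avgCostUsd * g.runs) / g.greenRuns;

const iterationOverhead = relativeIncrease(localOnly.avgIterations, ciEnabled.avgIterations);
const costOverhead = relativeIncrease(localOnly.avgCostUsd, ciEnabled.avgCostUsd);

console.log(`Iterations: +${(iterationOverhead * 100).toFixed(1)}%`); // +48.6%
console.log(`Cost per run: +${(costOverhead * 100).toFixed(1)}%`);     // +25.0%
console.log(`Cost per green run, local-only: $${costPerGreenRun(localOnly).toFixed(2)}`); // $27.05
console.log(`Cost per green run, CI-enabled: $${costPerGreenRun(ciEnabled).toFixed(2)}`); // $6.76
```

Read this way, CI-enabled runs cost about a quarter more per run, but each run that actually ended green cost roughly a quarter as much as in the local-only group.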

This open-source release assumes that gap still matters. We did not run another ten Snake runs for this document.

What follows is shipping and engineering evidence, not a second controlled trial.

Hardening phase (this release)

Repository: github.com/CircleCI-Research/ralph-ci under MIT.
CI Doctor and Review Gate: Fixes for cases such as doctor spinning and branch-scoped checks, covered by tests in the repo.
Contributor flow: tsx entry, documented pnpm lint:fix && pnpm format:fix && pnpm test:run gate, conventional commit helpers for orchestration.
Demo: Loop Lab recording on YouTube.

ESLint and environment drift

The February study often surfaced ESLint quote style drift. That is representative of what CI is for when agent defaults do not match project config. Hardening does not remove linting; it reduces thrash while the agent aligns with the rules.

Takeaway

The February study asked whether CI feedback changes outcomes when other factors are held constant. This phase asked whether we could publish the mechanism behind those outcomes while keeping the same accountability: green CI pipeline state matters for completion, not only local tests, and Smart Push (see above) still limits unnecessary CircleCI runs.

The design worth naming explicitly is specialization over one generic agent. CI debugging and feature work pull different context and different prompts. The Review Gate keeps a slice of quality checks out of the model entirely and makes them more deterministic. That separation is what we hardened for open source, not a bigger single prompt.

In the February study, we reference breadth-first agent work. That is the idea of trying several variants (more than one agent run or branch on the same spec) instead of one endless depth-first pass. It only makes sense if each variant can ship, meaning CI passes. The study’s main table is what happens when that is not true: same agent, same tasks, local tests green, pipeline often still red. Open-sourcing RalphCI is about keeping that bar honest when you explore in parallel.

What's Next

Larger codebases than the Snake exercise. Real services, flaky tests, multi-job pipelines.
More models and runners. Same contract, different agents; compare cost and failure handling.
Community usage. Feedback from people wiring their own CircleCI projects.
Tighter feedback loops. A game that can build itself in 20 to 30 minutes, 100% AFK, is great. Ten minutes would be better. We are tinkering with faster, smaller validation cycles using sandboxes the local agent can drive directly, so not every check has to wait on a full commit, push, and remote pipeline round trip. The same thread shows up in CircleCI’s autonomous validation framing: delivery that keeps pace with AI-era churn, including selective testing (run tests tied to what changed instead of the full suite on every pass) so you get quicker signal without abandoning coverage discipline.
Failure-pattern intelligence. Today the loop is built to clear the red build in front of you. The next layer is memory across incidents: jobs that keep flaking, lint rules that trip the same way, fixes that do not stick. That is different from reacting to the latest log line. Chunk CLI sits near that problem: it mines GitHub PR review comments into markdown prompts for agents, wires validation into coding-agent hooks, and can run checks in cloud sandboxes before push. It is not wired into RalphCI, but it is a plausible place to experiment when we want CI assistance to learn from repetition, not only from the current failure.
Live voice narration. We are also experimenting with voice-led tooling in the spirit of Claude LiveCaster: seven personalities across the SDLC, YAML-driven personas and /loop-style narration on top of real local agent and CI pipeline activity.
New controlled runs when the question warrants a fresh n=10 design. This write-up connects the numbers from the February study to the current codebase; it does not replace new statistics.

For the original comparison tables, see the February study. For a walkthrough, use the YouTube demo. To build or fork, start from RalphCI on GitHub.