April 27, 2026

In the loop, out of the loop: the KPI shift behind agentic engineering

Imagine this. A customer reports a bug in Slack. You paste it into a channel. A plan gets drafted, reviewed, split into concrete steps. Code is written in an isolated worktree. Build runs. Tests run. Lint and format run. A pull request lands in your GitHub repo, with a diff small enough to read in a minute. You click approve. It ships.

You did not write the code. You did not open your editor. You did not context-switch. You made two decisions. "Yes, this is the right fix." "Yes, the diff looks good." Everything in between ran without you.

That is out-of-the-loop development. The teams that get there first ship faster than the rest of their industry, and spend less engineer attention doing it.

The productivity trap most teams walk into

Every team adopting agents sees output go up. PRs opened go up. Tickets closed go up. Lines of code go up.

But if human involvement goes up at the same rate, you have not won anything. You have bought more output with more attention. Your engineers still context-switch the same amount, still sit in review queues, still get pulled into the same meetings. The tools got better. The job did not. Industry data on AI coding productivity has stayed mixed for this reason: teams spend as much time specifying and reviewing as they save by having an agent generate the code.

The goal is divergence. Output climbs. Human involvement flattens, then drops. That gap is the thing you are optimizing for.

This is the shift from in-the-loop KPIs to out-of-the-loop KPIs.

In-the-loop KPIs: how much the human is still doing

In-the-loop KPIs measure the human cost of each unit of shipped work. They answer one question. For every PR merged, how much of your team's attention did it consume?

Examples:

  • Human approval rate. Share of agent-authored PRs a human approves on first pass. Low means your reviewers are correcting, not approving. Rising means the pipeline is learning your conventions.
  • PRs opened per engineer per week. Headline number, but only meaningful when paired with how much of the engineer's day went into producing each one.
  • Review time per PR. If agent-authored PRs take longer to review than human-authored ones (and the industry data says they do, roughly 1.7x on average), that is an in-the-loop cost you need to see.
  • Correction frequency. How often humans rewrite, revert, or hand-edit agent output before it ships.
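
A minimal sketch of how these numbers might fall out of your PR records. The PrRecord fields are assumptions, not any particular tracker's schema; map them onto whatever your tooling already exposes.

```python
from dataclasses import dataclass

@dataclass
class PrRecord:
    # Hypothetical fields; rename to match your own tracker.
    agent_authored: bool
    approved_first_pass: bool        # merged without a correction round
    review_minutes: float            # reviewer attention spent on this PR
    human_edited_before_merge: bool  # rewritten, reverted, or hand-edited

def in_the_loop_kpis(prs: list[PrRecord]) -> dict[str, float]:
    agent_prs = [p for p in prs if p.agent_authored]
    if not agent_prs:
        return {}
    n = len(agent_prs)
    return {
        # Share of agent-authored PRs a human approves on first pass.
        "human_approval_rate": sum(p.approved_first_pass for p in agent_prs) / n,
        # Average reviewer attention per agent-authored PR.
        "review_minutes_per_pr": sum(p.review_minutes for p in agent_prs) / n,
        # How often humans rework agent output before it ships.
        "correction_frequency": sum(p.human_edited_before_merge for p in agent_prs) / n,
    }
```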

If these numbers dominate your dashboard, you are running a faster version of the old process. Same bottlenecks, more volume.

Out-of-the-loop KPIs: how much the pipeline does on its own

Out-of-the-loop KPIs measure what ships without you. They answer a different question. How much of your delivery runs on autopilot and still meets your quality bar?

Examples:

  • Agentic approval rate. Share of PRs the pipeline approves and merges autonomously. Starts at zero. Grows as trust compounds on well-scoped work.
  • Verification pass rate without intervention. How often build, test, lint, and format gates go green without a human stepping in.
  • Time from idea to PR, hands-off. Not "idea to PR" as a blended number, but specifically the segment where no human touched it.
  • Token cost per merged PR. Drops as the pipeline learns which model fits each step. A pipeline that costs less per PR each month is a pipeline that is learning.
  • Output per engineer, normalized. The single number you take to the leadership meeting. Grows when out-of-the-loop throughput grows.
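
The out-of-the-loop side, in the same hedged spirit. Every field name below is again an assumption standing in for whatever telemetry your pipeline emits; the point is that each metric is a ratio over merged PRs, not over engineer hours.

```python
from dataclasses import dataclass

@dataclass
class MergedPr:
    # Hypothetical telemetry for one merged PR; adapt to your pipeline.
    merged_autonomously: bool    # pipeline approved and merged, no human touch
    gates_green_untouched: bool  # build/test/lint/format passed, no intervention
    hands_off_minutes: float     # the idea-to-PR segment no human touched
    token_cost_usd: float

def out_of_the_loop_kpis(prs: list[MergedPr]) -> dict[str, float]:
    if not prs:
        return {}
    n = len(prs)
    return {
        "agentic_approval_rate": sum(p.merged_autonomously for p in prs) / n,
        "verification_pass_rate": sum(p.gates_green_untouched for p in prs) / n,
        "avg_hands_off_minutes": sum(p.hands_off_minutes for p in prs) / n,
        "token_cost_per_merged_pr": sum(p.token_cost_usd for p in prs) / n,
    }
```

The divergence the opening section describes is the first number climbing while review_minutes_per_pr, from the previous sketch, flattens and then drops.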

The jump from in-the-loop to out-of-the-loop is not a feature flag. You earn it one PR at a time. The numbers move as your pipeline proves it can handle a class of work, then the next class, then the next.

How you actually earn out-of-the-loop throughput

You offload one stage of your CI pipeline at a time to Promptware. Each piece of Promptware is a self-improving software agent with specific instructions in its Program.md, scoped tools, and persistent memory. It runs a task, reflects on how it did, and writes what it learned back to its own memory. Next invocation, it starts smarter.
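
In code, that loop might look like the sketch below. To be clear, this is not Ivy Tendril's implementation: run_promptware, execute, and reflect are invented names, and the only structural assumption is the one the paragraph states, a Program.md of instructions plus a Memory folder that grows.

```python
from pathlib import Path

def run_promptware(agent_dir: Path, task: str, execute, reflect) -> str:
    """Hypothetical shape of one invocation: instructions and accumulated
    memory go in, output comes out, and one new lesson is written back."""
    program = (agent_dir / "Program.md").read_text()
    memory_dir = agent_dir / "Memory"
    memory_dir.mkdir(exist_ok=True)
    memory = "\n".join(f.read_text() for f in sorted(memory_dir.glob("*.md")))

    # Run the task with the instructions plus everything learned so far.
    output = execute(f"{program}\n\n{memory}\n\nTask: {task}")

    # Reflect on the run and persist the lesson; the next invocation reads it.
    lesson = reflect(task, output)
    next_id = len(list(memory_dir.glob("*.md"))) + 1
    (memory_dir / f"lesson-{next_id:04d}.md").write_text(lesson)
    return output
```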

Ivy Tendril ships Promptware for every stage of the lifecycle: MakePlan, ExpandPlan, SplitPlan, UpdatePlan, ExecutePlan, MakePr, and the verification gates that sit between them (DotnetBuild, DotnetTest, DotnetFormat, FrontendLint, CheckResult). You do not write one giant prompt. You compose small, specialized agents, each good at one thing, each accumulating knowledge about your codebase in its own Memory folder.
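
Composition, then, is closer to a list of stages than a clever framework. The sketch below uses the stage names from this article, but the ordering of the gates and the placement of the checkpoints is an assumption, not a statement of how Ivy Tendril wires it.

```python
# Stage names are Ivy Tendril's Promptware; the ordering is illustrative.
PIPELINE = [
    "MakePlan", "ExpandPlan", "SplitPlan", "UpdatePlan",
    # -- human checkpoint: approve the plan --
    "ExecutePlan",
    "DotnetBuild", "DotnetTest", "DotnetFormat", "FrontendLint",  # gates
    "CheckResult",
    "MakePr",
    # -- human checkpoint: approve the diff --
]

def run_pipeline(task: str, run_stage) -> str:
    """run_stage is assumed to invoke one Promptware agent, which reads its
    own Program.md and Memory folder; each stage hands its artifact forward."""
    artifact = task
    for stage in PIPELINE:
        artifact = run_stage(stage, artifact)
    return artifact
```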

The guardrails come from three places.

  1. Verification gates. Build, test, lint, format. A PR cannot advance until they pass. If they fail, the agent gets the full logs and tries again, up to a cap.
  2. Isolated execution. Every plan runs in its own git worktree. No shared branches. No cross-contamination. If an agent goes sideways, the blast radius is one folder you delete.
  3. Two human checkpoints. Review the plan before execution. Review the diff before the PR. Between those two points, the pipeline runs on its own.
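
A sketch of guardrails one and two together. The retry cap, the helper names, and the fix_with_logs callback are all assumptions; the worktree call is plain git, which is the one part here you can take at face value.

```python
import subprocess
import tempfile

MAX_ATTEMPTS = 3  # assumed cap; the article only says retries are bounded

def isolated_worktree(repo: str, branch: str) -> str:
    """One plan, one worktree. If the agent goes sideways, the blast
    radius is this folder: delete it and the branch, nothing else moves."""
    path = tempfile.mkdtemp(prefix="plan-")
    subprocess.run(["git", "-C", repo, "worktree", "add", "-b", branch, path],
                   check=True)
    return path

def run_gate(worktree: str, gate_cmd: list[str], fix_with_logs) -> bool:
    """Run one verification gate (build, test, lint, or format) inside the
    worktree. On failure, hand the full logs to the agent and retry."""
    for _ in range(MAX_ATTEMPTS):
        result = subprocess.run(gate_cmd, cwd=worktree,
                                capture_output=True, text=True)
        if result.returncode == 0:
            return True  # green: the PR may advance
        fix_with_logs(result.stdout + result.stderr)  # agent sees everything
    return False  # cap reached: stop and escalate to a human
```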

Self-improvement comes for free. The agent reflects after every run. Failures that turned into fixes get written into Memory. Patterns that worked get promoted into Program.md. The pipeline you run in month three is not the one you installed in month one. It has learned your codebase.
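
The promotion step can be embarrassingly simple. The sketch below invents a heuristic, promote anything the agent has learned twice, purely to make the Memory-to-Program.md flow concrete; Ivy Tendril's actual criterion is not documented here.

```python
from collections import Counter
from pathlib import Path

def promote_lessons(agent_dir: Path, threshold: int = 2) -> None:
    """Invented heuristic: a lesson that recurs across runs graduates from
    Memory (per-run notes) into Program.md (standing instructions)."""
    lessons = [f.read_text().strip()
               for f in (agent_dir / "Memory").glob("*.md")]
    recurring = [text for text, n in Counter(lessons).items() if n >= threshold]
    if recurring:
        program = agent_dir / "Program.md"
        program.write_text(
            program.read_text()
            + "\n\n## Learned conventions\n"
            + "\n".join(f"- {r}" for r in recurring)
        )
```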

The risk that comes with going out of the loop

Here is the honest part. "Out of the loop" is a loaded phrase in safety research. Mica Endsley's 1995 work on industrial and aviation automation showed that operators who monitor automated systems without engaging them lose situation awareness, get complacent, and fail to recover when the automation fails. Pilots staring at a working autopilot miss the moment it stops working. The phenomenon has a name: the out-of-the-loop performance problem.

The lesson applies here. If "out of the loop" means "no humans anywhere, ever," you are building the exact system Endsley warned about. The pipeline works until it does not, and when it does not, nobody understands it well enough to take over.

Ivy's answer is deliberate. Out of the execution loop, never out of the planning or review loop. Humans write the plan. Humans approve the diff. Between those two, the pipeline runs. That is the entire design.

Two checkpoints, done well, keep situation awareness intact. Engineers stay close to what is being built and why, because those are the decisions they are making. They step out of the parts that do not need them. Typing boilerplate. Running the test suite. Fixing a lint error. Pushing a branch. Chasing a flaky build on a Friday afternoon.

This is also why agent-agnostic orchestration matters. The execution engine (Claude Code, Codex, Gemini CLI, whatever ships next quarter) is swappable. The planning and review layer is stable and yours. The model you use for code generation can change every six months; your agents' Memory folders, verification gates, and approval workflow do not.

What this frees humans to do

Humans are good at work that requires taste. Product direction. Architectural calls. Deciding what not to build. Reading a diff and knowing it is wrong for reasons no linter will catch.

Humans are less good, and more miserable, at execution work that has a right answer. Reformatting a file. Wiring up boilerplate. Writing the fourth CRUD endpoint this week. Chasing a flaky test. These are the parts of the loop an agent can take. Every hour you claw back from them is an hour you spend on product, architecture, or the next thing to build.

The case for out-of-the-loop development is not raw output. It is more output per hour of engineer attention, with the attention redirected to decisions only a human can make.

How to start

Look at your current dashboard. For each metric, ask one question. Does this number go up because a human worked harder, or because the pipeline worked smarter?

If it is mostly the first, you are still in the loop. That is fine as a starting point. Pick the one stage of your CI you would most readily trust to an agent today, wrap it in Promptware, add a verification gate, and measure what happens over the next 30 days. The agentic approval rate climbs from zero. Token cost per PR drops as the Memory files fill in. Somewhere around week three, the review queue stops being the bottleneck.
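
To make the 30-day check concrete: take a weekly snapshot of the two numbers above and compare the first week to the last. The dictionary keys are assumptions matching the sketches earlier in this piece.

```python
def thirty_day_verdict(weekly: list[dict[str, float]]) -> str:
    """Expects one snapshot per week, e.g.
    {"agentic_approval_rate": 0.12, "token_cost_per_pr": 3.40}."""
    first, last = weekly[0], weekly[-1]
    learning = (
        last["agentic_approval_rate"] > first["agentic_approval_rate"]
        and last["token_cost_per_pr"] < first["token_cost_per_pr"]
    )
    if learning:
        return "Pipeline is learning: widen the class of work you hand it."
    return "Still in the loop: narrow the scope and check the Memory files."
```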

The fastest builders in their industry are not the ones whose engineers write the most code. They are the ones whose pipelines ship the most code their engineers did not have to write.

Written by
Renco Smeding