Loop Engineering: Design Systems That Prompt Your AI Agents
How I stopped prompting turn by turn and started designing loops that discover work, run agents, verify results, and remember state across sessions.
Last month I spent forty minutes in Cursor re-prompting the same auth bug. The agent kept patching the wrong layer. SQL escaping, then client validation, then a serializer tweak. Each answer looked plausible. None of it stuck because I was the loop: type prompt, read output, type next prompt. By lunch I had three half-finished diffs and zero green tests.
That is when I stopped treating the agent like a chat partner and started treating my repo like a system that could prompt the agent for me. Loop engineering is that shift. You design a small control system that discovers work, hands it to agents, verifies results, persists state, and decides what happens next, on a schedule or until a goal is met. You replace yourself as the person who prompts the agent.
The Shift: From Prompts to Loops
For two years, the workflow was simple: write a good prompt, read the output, write the next prompt. You held the agent the entire time, one turn after another. That model still works for focused tasks. But the leverage point has moved.
Prompt engineering shapes what goes into a model. Loop engineering shapes the full process around it: the tools the agent can use, the context it sees, the validation it trusts, and when it stops. A prompt optimizes for a better first answer. A loop optimizes for a better verified outcome.
Think of it in three layers. Context engineering is what the agent knows. Harness engineering is the environment for a single agent run. Loop engineering sits above both: it keeps poking agents on a schedule, spawning helpers, and feeding itself while you are not in the terminal.
The tooling names differ across Claude Code, Codex, Cursor, and Grok, but the shape converges. Once you see the pattern, you stop arguing about which product to use and start designing loops that work in any of them.
What Clicked for Me
Before loops, my mornings started with archaeology. Scroll CI logs, grep Slack, open five stale issues, forget which one mattered. After I wired a daily triage loop in Cursor, I opened my laptop to a STATE.md file: three CI failures ranked by blast radius, one stale dependency, a draft PR for a flaky test fix I had not asked for yet. The verifier sub-agent had run overnight. I reviewed the diff over coffee, merged it, and moved on.
That was the wow moment. The agent forgets between sessions. The repo does not. The state file became the spine of everything I built after.
The Inner Loop Inside Every Loop
Every loop contains a smaller cycle: plan, search, modify, verify, repair. The power is not in any single step; it is in closing the loop. A test failure is new context. A type error signals a wrong assumption.
A weak loop guesses that an apostrophe bug is SQL escaping and patches the query. A strong loop finds the form, API route, validation schema, and database path first, reproduces the failure, then changes the smallest relevant path and runs a targeted regression test.
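The cycle above can be sketched in a few lines of Python. This is a minimal illustration, not any tool's actual API: `run_inner_loop`, `act`, `verify`, and the stub functions are hypothetical names, and the agent and test runner are replaced by plain callables.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LoopResult:
    done: bool
    attempts: int
    observations: list[str] = field(default_factory=list)


def run_inner_loop(
    intent: str,
    act: Callable[[str, list[str]], str],
    verify: Callable[[str], tuple[bool, str]],
    max_attempts: int = 3,
) -> LoopResult:
    """Act, observe, and feed each observation back in as new context."""
    observations: list[str] = []
    for attempt in range(1, max_attempts + 1):
        change = act(intent, observations)   # plan, search, modify
        ok, observation = verify(change)     # verify
        observations.append(observation)     # a failure becomes context for the repair
        if ok:
            return LoopResult(True, attempt, observations)
    # Stopping rule reached: return control instead of retrying forever.
    return LoopResult(False, max_attempts, observations)


# Stub "agent": guesses first, then uses the observed failure.
def fake_act(intent: str, observations: list[str]) -> str:
    return "patch-query" if not observations else "patch-validation-schema"


# Stub "verifier": only the second change passes the targeted test.
def fake_verify(change: str) -> tuple[bool, str]:
    if change == "patch-validation-schema":
        return True, "targeted regression test passed"
    return False, "test still fails: apostrophe rejected by validation schema"


result = run_inner_loop("fix apostrophe bug", fake_act, fake_verify)
```

The design point is that `observations` is passed back into `act` on every pass, so the second attempt works from the failure rather than from the original guess.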
flowchart LR
A[Intent] --> B[Context]
B --> C[Action]
C --> D[Observation]
D --> E{Done?}
E -->|no| F[Adjust]
F --> B
E -->|yes| G[Summarize]
What One Loop Looks Like End to End
A loop is a recursive goal: you define a purpose and the AI iterates, often with sub-agents and external memory, until the goal is complete or the loop escalates to you. You design it once. You are not prompting every micro-step.
flowchart LR
A[Schedule] --> B[Triage skill]
B --> C[STATE.md]
C --> D[Worktree]
D --> E[Implementer]
E --> F[Verifier]
F --> G[MCP / PR]
G --> H{Human gate?}
H -->|safe| I[Commit / merge]
H -->|risky| J[Escalate to you]
I --> A
J --> A
Here is the morning triage pattern I keep coming back to:
- Schedule: an automation runs on a daily cadence, or a goal-style run continues until a verifiable stopping condition holds, with a separate check so the worker does not grade its own homework.
- Triage: a skill reads yesterday's CI failures, open issues, and recent commits.
- State: findings land in STATE.md, the durable spine between runs.
- Worktree: for actionable items, an isolated checkout opens so parallel agents do not collide.
- Implementer: a sub-agent drafts the fix.
- Verifier: a separate sub-agent runs tests and checks the diff against project skills.
- Connectors: the loop opens the PR and updates the ticket.
- Human gate: product intent and risky changes escalate to you with full context.
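The routing decision at the end of that list can be sketched as a small function. The `Finding` type, its fields, and the three return values are assumptions made for illustration; a real loop would derive `actionable` and `risky` from its own triage rules.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    title: str
    actionable: bool
    risky: bool  # e.g. touches product intent or is hard to revert


def route(finding: Finding) -> str:
    """Decide what the loop does next with one triage finding."""
    if not finding.actionable:
        return "record"      # stays in STATE.md, no sub-agent is spawned
    if finding.risky:
        return "escalate"    # human gate: hand over with full context
    return "implement"       # worktree, implementer, verifier, then PR


# Made-up findings, for illustration only.
findings = [
    Finding("flaky test in checkout suite", actionable=True, risky=False),
    Finding("schema migration drops a column", actionable=True, risky=True),
    Finding("stale dependency, no fix released yet", actionable=False, risky=False),
]
plan = {f.title: route(f) for f in findings}
```

Checking `actionable` before `risky` keeps triage cheap: nothing downstream runs for items that only need to be recorded.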
Six Parts Every Loop Needs
A loop that runs unattended is not one long prompt. It is a small system with six parts.
flowchart TB
subgraph blocks [Six building blocks]
S[Scheduling<br/>heartbeat]
W[Worktrees<br/>parallel isolation]
K[Skills<br/>project knowledge]
C[Connectors<br/>MCP tools]
SA[Sub-agents<br/>maker / checker]
M[Memory<br/>STATE.md]
end
S --> W --> K --> C --> SA --> M
- Scheduling: the heartbeat. Without a cadence, you have a one-off session.
- Worktrees: safe parallel execution. Two agents editing the same files is a merge disaster waiting to happen.
- Skills: persistent project knowledge written once so the agent does not re-derive your whole project from zero every run.
- Connectors (MCP): reach into real tools such as PRs, tickets, and Slack. A filesystem-only loop can only suggest.
- Sub-agents: the maker/checker split. The model that wrote the code is structurally poor at judging its own work.
- Memory / state: answers three questions: what are we working on, what did we try last time, and what is waiting for a human?
In Cursor, skills live in .cursor/skills/, MCP servers connect to external tools, rules encode persistent knowledge, and subagents handle the maker/checker split.
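One way to keep the state file honest is to give it exactly one section per question from the memory bullet above. The sketch below is an assumption about layout, not a format any tool requires; `LoopState`, `render_state`, and the section headings are hypothetical, and the date is arbitrary.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class LoopState:
    working_on: list[str] = field(default_factory=list)
    tried_last_time: list[str] = field(default_factory=list)
    waiting_for_human: list[str] = field(default_factory=list)


def render_state(state: LoopState, run_date: date) -> str:
    """Render the state as markdown, one section per question."""
    sections = [
        ("Working on", state.working_on),
        ("Tried last time", state.tried_last_time),
        ("Waiting for a human", state.waiting_for_human),
    ]
    lines = [f"# Loop state ({run_date.isoformat()})"]
    for heading, items in sections:
        lines.append("")
        lines.append(f"## {heading}")
        # An empty section is written explicitly so the next run can tell
        # "nothing to report" apart from "section missing".
        lines.extend(f"- {item}" for item in items or ["(nothing)"])
    return "\n".join(lines) + "\n"


state = LoopState(
    working_on=["flaky test fix"],
    waiting_for_human=["review draft PR"],
)
text = render_state(state, date(2025, 1, 6))
```

Writing `text` to `STATE.md` at the end of each run, and reading the file at the start of the next, is what lets the repo carry context that the agent itself does not keep.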
Four Levels You Can Stack
You do not need all four on day one. I started with levels 1 and 2. I am building toward 3 and 4.
flowchart TB
L1[Level 1: Agent loop<br/>model + tools until done]
L2[Level 2: Verification<br/>grader + retry on failure]
L3[Level 3: Event-driven<br/>cron, webhooks, Slack]
L4[Level 4: Hill-climbing<br/>traces improve the harness]
L1 --> L2 --> L3 --> L4
L4 -->|updates prompts, tools, graders| L1
Level 1: Agent loop. A model calls tools until a task is complete. Most coding agents give you this out of the box.
Level 2: Verification loop. Wrap the first pass in a grader (tests, CI, or an LLM judge) that scores output and retries on failure.
Level 3: Event-driven loop. Cron, webhooks, or Slack trigger the agent without you in the terminal.
Level 4: Hill-climbing loop. Traces from each run feed an analysis pass that rewrites prompts, tool configs, or grader rules. The outer loop reaches inside and updates the inner loops.
Human oversight belongs at every level. Graders catch broken links and failing CI. Humans catch wrong framing, product tradeoffs, and architecture calls.
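The analysis pass in Level 4 can start very small. The sketch below assumes each run leaves a trace with a `passed` flag and a `reason` string (both hypothetical field names) and simply surfaces the most frequent failure reason, so a human can decide which grader rule or skill to tighten.

```python
from collections import Counter
from typing import Optional


def top_failure(traces: list[dict]) -> Optional[tuple[str, int]]:
    """Return the most frequent failure reason across failed runs, if any."""
    reasons = Counter(t["reason"] for t in traces if not t["passed"])
    if not reasons:
        return None
    return reasons.most_common(1)[0]


# Made-up traces, for illustration only.
traces = [
    {"run": 1, "passed": False, "reason": "diff too large"},
    {"run": 2, "passed": True, "reason": ""},
    {"run": 3, "passed": False, "reason": "diff too large"},
    {"run": 4, "passed": False, "reason": "lint failure"},
]
suggestion = top_failure(traces)
```

The function only reports. Turning the result into a changed prompt, tool config, or grader rule is still a reviewed edit, which keeps the human oversight described above in place.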
Patterns and a Safe Starting Point
Daily triage is the lowest-risk entry point. Week one: report only, no auto-fix. A scheduled run discovers CI failures, stale issues, and dependency drift, then writes a report to your state file. Roll out in phases: report-only, then assisted fixes with human review, then unattended for allowlisted low-risk changes.
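The phased rollout can be encoded as a policy the loop consults before acting. This is a sketch under stated assumptions: the `Phase` names mirror the three phases above, while the `LOW_RISK` allowlist entries and the action strings are invented for illustration.

```python
from enum import Enum


class Phase(Enum):
    REPORT_ONLY = 1
    ASSISTED = 2
    UNATTENDED = 3


# Hypothetical allowlist of change kinds considered low-risk.
LOW_RISK = {"lockfile-bump", "flaky-test-retry", "typo-fix"}


def allowed_action(phase: Phase, change_kind: str) -> str:
    """Return the most the loop may do with a change in the given phase."""
    if phase is Phase.REPORT_ONLY:
        return "report"        # week one: write to the state file, nothing else
    if phase is Phase.UNATTENDED and change_kind in LOW_RISK:
        return "auto-merge"    # only allowlisted changes skip review
    return "draft-pr"          # a human reviews before anything merges
```

For example, `allowed_action(Phase.UNATTENDED, "schema-change")` falls through to `"draft-pr"`, because anything off the allowlist keeps its human review even in the last phase.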
A test-driven loop is a good second pattern. Reproduce the failure, confirm it fails for the right reason, implement the smallest fix, rerun the targeted test, broaden validation only when the narrow case passes. Compiler-driven and review-driven loops follow the same shape with different observation sources.
Triage should be cheap. Sub-agents spawn only when state says actionable. When a workflow works, turn it into a skill so you do not reinvent prompts and stopping rules every time.
What Broke for Me (and What I Guard Against)
Loops amplify judgment, good and bad. Five failure modes showed up:
- Thrashing: unclear goals, oversized diffs, or noisy validation signals.
- Overfitting to tests: CI goes green but product behavior is wrong.
- Context drift: the agent works from stale assumptions after nearby edits. Refresh context after meaningful observations.
- Token cost explosion: a tight cadence with implementer plus verifier on every run burns budget fast.
- Comprehension debt: the loop ships faster than I understand. Building loops to avoid thinking is the trap.
Four guardrails I enforce: clear stopping rules, small reversible diffs, explicit verify commands (npm test -- auth.test.ts, not "make sure it works"), and human gates for destructive or ambiguous actions.
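Two of those guardrails are easy to make mechanical. The sketch below assumes the loop shells out to the verify command and counts changed lines itself; `run_verify`, `within_diff_budget`, the five-minute timeout, and the 200-line budget are illustrative choices, not recommendations from any tool.

```python
import subprocess
import sys


def run_verify(command: list[str], timeout_s: float = 300.0) -> bool:
    """Run one explicit verify command; pass only on exit code 0."""
    try:
        completed = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except (subprocess.TimeoutExpired, OSError):
        return False  # a hung or missing command counts as a failure
    return completed.returncode == 0


def within_diff_budget(changed_lines: int, max_lines: int = 200) -> bool:
    """Reject oversized diffs so every change stays small and reversible."""
    return changed_lines <= max_lines


# Stand-in commands so the sketch runs anywhere; in a real project the
# command would be something like ["npm", "test", "--", "auth.test.ts"].
passing = run_verify([sys.executable, "-c", "raise SystemExit(0)"])
failing = run_verify([sys.executable, "-c", "raise SystemExit(1)"])
```

Treating timeouts and missing binaries as failures means the loop cannot mistake "the check never ran" for "the check passed".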
What I'm Building Next
I'm still tuning daily triage and testing overnight verifier runs. The next frontier is level 4: using run traces to tighten graders and skills instead of hand-editing them after every failure. If you want follow-up posts on what broke and the exact Cursor setup, subscribe to the newsletter.
The Bar I Actually Care About
Two people can build the exact same loop and get opposite results. One uses it to move faster on work they understand deeply. The other uses it to avoid understanding the work at all. The loop does not know the difference. You do.
Prompting directly is still powerful. But the leverage point has moved, from individual prompts to systems that discover, assign, verify, persist, and know when to hand off to you. The best loop I built did not make me obsolete. It made my mornings quieter and my merges safer. Build the loop. Stay the engineer. That is the bar.