Long-Running Coding Tasks
A long-running coding agent is an AI development agent that can carry a bounded software task through setup, implementation, testing, and review without a developer supervising every step. Reliability comes from persistent execution state, managed context, checkpoints, and a clear stop condition, not from keeping a VM alive forever.
Why coding tasks become long-running
Many useful engineering tasks take longer than a chat window or a single edit-and-test loop. A dependency upgrade may require installing packages, finding breaking API changes, updating several modules, running a slow test suite, and repairing failures. A frontend change may need a browser check after the build. A migration may require planning before any code changes are safe.
The difficult part is not merely elapsed time. Long-running work creates state that must survive each iteration:
- Environment state: installed dependencies, build artifacts, containers, and generated files.
- Repository state: partial edits, Git branches, commits, and uncommitted diagnostics.
- Reasoning state: the original request, discovered constraints, rejected approaches, and the next action.
- Verification state: which checks passed, which failed, and what still needs evidence.
- Budget state: how much runtime and model usage remain before the task should pause or ask for input.
An agent that can write a patch but cannot preserve or recover these states will struggle as the work expands. An agent that can stay active for hours but lacks a finish condition may simply spend more time wandering. Duration is a systems-design problem, not a model benchmark.
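To make the five kinds of state concrete, here is a minimal sketch of a record an agent could persist between iterations. The class name `TaskState` and every field name are hypothetical, chosen only to mirror the list above; they do not describe any particular product's storage format.

```python
from dataclasses import dataclass, field


@dataclass
class TaskState:
    """Hypothetical snapshot of the state a long task must carry forward."""

    # Environment state: a hash of the inputs the workspace was built from.
    lockfile_hash: str
    # Repository state: where the partial work lives.
    branch: str
    uncommitted_files: list[str] = field(default_factory=list)
    # Reasoning state: what was asked, what was learned, what comes next.
    request: str = ""
    constraints: list[str] = field(default_factory=list)
    rejected_approaches: list[str] = field(default_factory=list)
    next_action: str = ""
    # Verification state: check name mapped to whether it passed.
    checks: dict[str, bool] = field(default_factory=dict)
    # Budget state: runtime left before the task should pause.
    minutes_remaining: float = 0.0

    def can_resume(self) -> bool:
        """Resumable only if the next step is known and budget remains."""
        return bool(self.next_action) and self.minutes_remaining > 0

    def failing_checks(self) -> list[str]:
        """Names of checks that still need a passing run."""
        return sorted(name for name, ok in self.checks.items() if not ok)
```

A record like this is what lets a fresh process pick up the task: if `can_resume()` is false, the right move is to pause or ask for input rather than keep running.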
The reliability stack for long-running work
A practical long-running coding agent needs several layers working together.
Persistent but reproducible environments
A durable workspace avoids reinstalling packages and losing partial progress between iterations. It also lets the agent inspect the real effects of its changes: compiler errors, test output, running services, and generated diffs. But persistence alone is not enough. The setup should remain deterministic through lockfiles, bootstrap commands, environment configuration, and repository instructions. Otherwise, an old workspace can hide missing setup steps that will fail in CI or for another developer.
The right principle is: preserve the active workspace, but make it possible to rebuild that workspace from declared inputs.
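One way to apply that principle is to fingerprint the declared inputs and compare the result with the value recorded when the workspace was last built. The sketch below assumes the inputs are plain files such as lockfiles and bootstrap scripts; the function names and the file names in the usage note are illustrative.

```python
import hashlib
from pathlib import Path


def workspace_fingerprint(root: Path, declared_inputs: list[str]) -> str:
    """Hash the declared setup inputs (lockfiles, bootstrap scripts, config)."""
    digest = hashlib.sha256()
    for rel in sorted(declared_inputs):
        path = root / rel
        digest.update(rel.encode())
        digest.update(b"\0")
        # A missing input is hashed as a marker, so deleting a declared
        # file changes the fingerprint just as editing it would.
        digest.update(path.read_bytes() if path.is_file() else b"<missing>")
        digest.update(b"\0")
    return digest.hexdigest()


def needs_rebuild(root: Path, declared_inputs: list[str], recorded: str) -> bool:
    """True when the declared inputs differ from the recorded fingerprint."""
    return workspace_fingerprint(root, declared_inputs) != recorded
```

For example, `needs_rebuild(Path("."), ["package-lock.json", "bootstrap.sh"], recorded)` would signal that a preserved workspace has drifted from its declared inputs and should be rebuilt before its test results are trusted.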
Context management and compaction
Long tasks accumulate more information than a model should replay verbatim forever. Raw terminal logs, full file contents, and earlier dead ends can crowd out the current objective. A reliable agent periodically compacts that history into a concise working state: completed work, unresolved failures, important file paths, explicit constraints, and the next checks to run.
Capy's handoff model is designed for this problem. For long tasks, Build and Captain can continue work in a fresh context through a handoff that carries forward a summary, progress, and next steps. The workspace can remain useful while the reasoning context is refreshed. That is materially different from pretending one ever-growing prompt will remain efficient indefinitely.
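The shape of a compacted working state can be sketched independently of any product. The example below is a rule-based stand-in, not Capy's handoff implementation: real agents typically have the model write the summary, and the event format (`kind` and `text` keys) is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class WorkingState:
    completed: list[str] = field(default_factory=list)
    unresolved_failures: list[str] = field(default_factory=list)
    file_paths: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    next_checks: list[str] = field(default_factory=list)


def compact(events: list[dict]) -> WorkingState:
    """Fold a raw event log into a concise working state.

    Each event is a hypothetical dict with "kind" and "text" keys.
    Unknown kinds, such as raw terminal logs, are dropped.
    """
    state = WorkingState()
    buckets = {
        "done": state.completed,
        "failure": state.unresolved_failures,
        "path": state.file_paths,
        "constraint": state.constraints,
        "next_check": state.next_checks,
    }
    for event in events:
        kind, text = event["kind"], event["text"]
        if kind == "fixed":
            # A later fix clears the earlier failure with the same text.
            if text in state.unresolved_failures:
                state.unresolved_failures.remove(text)
            continue
        bucket = buckets.get(kind)
        if bucket is not None and text not in bucket:
            bucket.append(text)
    return state
```

The design point is that the output has a fixed, small set of fields: whatever produces the summary, the next context should receive completed work, open failures, key paths, constraints, and the next checks rather than the full transcript.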
Checkpoints and human decision points
A checkpoint is a recoverable point with enough evidence to continue or stop safely. Useful checkpoints include a clean dependency install, a passing focused test, a completed migration step, a build result, or a reviewable diff. Human checkpoints matter when a task reaches an architectural fork, requests a destructive operation, needs new credentials, or would spend materially more budget than expected.
This is not a failure of autonomy. The agent should execute routine steps independently and surface the small set of decisions where human judgment is valuable. A good long-running workflow alternates autonomous stretches with intentional review points.
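The four triggers for a human checkpoint can be expressed as a small gate that runs before each step. This is a sketch under stated assumptions: `ProposedStep`, its flags, and the single-number budget comparison are hypothetical simplifications, and a real agent would need some way to populate the flags.

```python
from dataclasses import dataclass


@dataclass
class ProposedStep:
    description: str
    architectural_fork: bool = False
    destructive: bool = False
    needs_new_credentials: bool = False
    estimated_cost: float = 0.0


def pause_reasons(step: ProposedStep, remaining_budget: float) -> list[str]:
    """Return the reasons a human should review this step (empty = proceed)."""
    reasons = []
    if step.architectural_fork:
        reasons.append("architectural fork")
    if step.destructive:
        reasons.append("destructive operation")
    if step.needs_new_credentials:
        reasons.append("new credentials required")
    if step.estimated_cost > remaining_budget:
        reasons.append("exceeds remaining budget")
    return reasons
```

Routine steps return an empty list and run without interruption; only steps that match one of the four conditions are surfaced, which keeps review points intentional rather than constant.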
Verifiable acceptance criteria
“Improve the checkout flow” is hard to finish reliably. “Handle network failures and invalid-card responses in src/payments/checkout.ts, follow the existing order-service pattern, and run the checkout integration suite” gives the agent an observable completion boundary.
Acceptance criteria should name behavior, scope, and the strongest relevant check. The agent can then stop when it has evidence, rather than when it has generated a plausible-looking diff. For large tasks, intermediate checks reduce the cost of discovering a mistake at the end.
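A completion boundary of this kind can be written down as data plus one executable check. The sketch below is illustrative only: `AcceptanceCriteria` and `is_done` are hypothetical names, scope is treated as a list of path prefixes, and the check is assumed to be a command whose exit code signals success.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class AcceptanceCriteria:
    behavior: str             # what should be true when the task is done
    scope: list[str]          # path prefixes the task is allowed to edit
    check_command: list[str]  # strongest relevant check, as an argv list


def is_done(criteria: AcceptanceCriteria, changed_files: list[str]) -> tuple[bool, str]:
    """Decide completion from evidence: scope first, then the check's exit code."""
    out_of_scope = [
        f for f in changed_files
        if not any(f.startswith(prefix) for prefix in criteria.scope)
    ]
    if out_of_scope:
        return False, f"out-of-scope edits: {out_of_scope}"
    result = subprocess.run(criteria.check_command, capture_output=True, text=True)
    if result.returncode != 0:
        return False, f"check failed with exit code {result.returncode}"
    return True, "check passed"
```

The agent stops when `is_done` returns `True`, not when the diff looks plausible. For a large task, several such criteria with cheaper checks can serve as the intermediate checkpoints.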
Branch isolation
Long work should not mutate a developer's main checkout or collide with another task. One branch or worktree per task makes partial progress reviewable and disposable. It also creates a clear handoff artifact: a diff or pull request that can be reviewed without importing all of the agent's internal reasoning.
Isolation becomes more important when several agents run concurrently. Parallelism without branch discipline creates merge conflicts and ambiguous ownership. Parallelism with scoped branches turns independent tasks into an understandable queue of reviewable changes.
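For local workflows, one common way to get a branch and checkout per task is `git worktree`. The sketch below builds the command for a task; the `agent/<task_id>` branch naming and the sibling-directory layout are assumptions for illustration, not a convention required by Git or by any tool discussed here.

```python
import subprocess
from pathlib import Path


def worktree_command(repo: Path, task_id: str, base: str = "main") -> list[str]:
    """Build a `git worktree add` command for one task.

    Creates branch agent/<task_id> from `base` in a sibling directory,
    so the developer's main checkout is never modified.
    """
    branch = f"agent/{task_id}"
    path = repo.parent / f"{repo.name}-{task_id}"
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path), base]


def create_worktree(repo: Path, task_id: str, base: str = "main") -> None:
    """Run the command; raises CalledProcessError if Git rejects it."""
    subprocess.run(worktree_command(repo, task_id, base), check=True)
```

Because each task owns its branch and directory, an abandoned task can be discarded with `git worktree remove` and a branch delete, without touching other tasks.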
Credit and timeout controls
Long-running does not mean unlimited. Teams need a stop policy for sessions that are blocked, looping, or simply too expensive relative to the value of the change. That policy can include maximum execution windows, credit balances, spend caps, model choices, and manual checkpoints before high-cost follow-up work.
Capy's pricing documentation makes the budget behavior explicit. Credits cover AI usage, VM runtime, and auxiliary services such as the Review Agent. When the balance runs out, active tasks pause. Auto-reload can purchase more credits to keep work moving, but an organization-wide monthly spend cap remains a hard ceiling; teams can also disable overage entirely. Pause semantics are safer than implying a VM runs forever.
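The interaction between a balance, auto-reload, and a hard monthly cap can be modeled in a few lines. This is a simplified sketch of the pause-on-exhaustion idea, not Capy's billing logic: the `Budget` fields, the fixed reload amount, and the rule that a reload must fit entirely under the cap are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    balance: float           # credits currently available
    auto_reload: bool        # whether more credits may be purchased
    reload_amount: float     # credits bought per reload
    monthly_cap: float       # hard ceiling on purchases this month
    spent_this_month: float  # purchases already made this month


def charge(budget: Budget, cost: float) -> str:
    """Try to pay for one step; return "continue" or "pause"."""
    while budget.balance < cost:
        headroom = budget.monthly_cap - budget.spent_this_month
        can_reload = (
            budget.auto_reload
            and budget.reload_amount > 0
            and headroom >= budget.reload_amount
        )
        if not can_reload:
            # Pause instead of failing: the workspace and state survive.
            return "pause"
        budget.balance += budget.reload_amount
        budget.spent_this_month += budget.reload_amount
    budget.balance -= cost
    return "continue"
```

The cap is checked before every reload, so auto-reload can keep work moving but can never push monthly purchases past the ceiling.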
How Capy handles long-running coding tasks
In Capy, a task is a coding session with its own branch and its own isolated Ubuntu VM. The Build agent works inside that VM with common development tools available: it can edit files, install packages, run commands, use Docker, test the application, and inspect the result. The main branch stays clean while work is in progress.
Capy separates planning from implementation. Captain reads the codebase, turns complex requests into task specifications, and delegates work. Build executes a scoped implementation in the VM. That split is useful for long work because planning and coding have different failure modes: the planner should decompose scope and define evidence, while the implementation agent should stay focused on editing and verification.
For tasks that outgrow one reasoning context, Capy uses context compaction through handoffs. A fresh context receives the concise state needed to continue instead of replaying every earlier token. This preserves continuity without treating prompt length as infinite.
After implementation, Capy's Review Agent adds a triage loop. It reviews PR findings, classifies them as open, resolved, or irrelevant, and can route actionable findings on Capy-generated work back to Build for fixes and re-review. Humans still decide what to merge. Review triage is especially useful after long implementations because the final diff may contain interactions that were not obvious during any one edit.
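The triage loop can be pictured as a routing step over classified findings. The sketch below is a hypothetical model, not the Review Agent's actual interface: it assumes "actionable" means a finding still marked open, and that each finding records whether it concerns agent-generated work.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OPEN = "open"
    RESOLVED = "resolved"
    IRRELEVANT = "irrelevant"


@dataclass
class Finding:
    summary: str
    status: Status
    on_agent_generated_work: bool


def route(findings: list[Finding]) -> dict[str, list[Finding]]:
    """Split open findings into an automated fix queue and a human queue.

    Resolved and irrelevant findings need no action and are left out.
    """
    queues: dict[str, list[Finding]] = {"fix": [], "human": []}
    for finding in findings:
        if finding.status is not Status.OPEN:
            continue
        key = "fix" if finding.on_agent_generated_work else "human"
        queues[key].append(finding)
    return queues
```

Whatever lands in the fix queue would be re-reviewed after the fix, and the merge decision stays with a person in either case.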
Competitor tradeoffs
“Background agent” is not one uniform product shape. Each tool makes a different tradeoff among maximum session duration, environment isolation, parallelism, automation, and developer control.
| Tool | Long-running or asynchronous model | Useful tradeoff to understand |
|---|---|---|
| Capy | One isolated Ubuntu VM and branch per task, Captain planning, Build implementation, context-compacted handoffs, and Review triage | Designed for bounded tasks that may need extended implementation and review; credits, auto-reload, and spend caps control continuation |
| GitHub Copilot cloud agent | Ephemeral GitHub Actions-powered development environment with branch-based work | Each session has a hard 59-minute maximum that cannot be extended or bypassed, so larger work needs decomposition |
| Codex cloud | Background tasks can run in parallel in their own cloud environments | Strong fit for OpenAI-native cloud delegation; Codex app automations also support scheduled background runs and dedicated worktrees |
| Cursor Cloud Agents | Isolated cloud VMs with cloned repos, dependencies, secrets, startup commands, network access, parallel runs, and remote desktop control | Environment configuration is central; Cursor supports saved snapshots, Dockerfile-based setup, MCP servers, and artifacts for verification |
| Devin | Managed sessions can run in parallel in isolated VMs, with a coordinator that scopes work and monitors progress | Adds session orchestration, ACU monitoring, playbooks, and one-time or recurring schedules for workflows that benefit from managed execution |
| Kiro autonomous agent | Preview product with asynchronous tasks in isolated sandboxes and up to 10 concurrent tasks for paid users during rollout | Emphasizes cross-task context, sandbox configuration, sub-agent coordination, and review-driven learning; preview availability and limits matter |
A fixed timeout is not automatically bad. GitHub Copilot's 59-minute cap makes the boundary easy to reason about, but it pushes teams to split work that may need longer verification. Persistent cloud environments are not automatically better either: they require careful setup, secrets handling, budget controls, and a way to detect stalled work.
Codex's automation model illustrates another distinction. A scheduled automation can start fresh or return to a thread, and Git repositories can use a dedicated background worktree so automation changes do not collide with unfinished local edits. That is useful for recurring triage, but it is a different problem from carrying one large implementation across multiple checkpoints.
Practical checklist
Before delegating a task that may run for a while, use this checklist:
- Bound the task. Describe one coherent feature, fix, migration, or refactor. Split independent outcomes into separate tasks.
- State the acceptance criteria. Name expected behavior, files or modules in scope, and the strongest relevant test, build, lint, or browser check.
- Make setup deterministic. Commit lockfiles, document bootstrap commands, and provide only the environment variables and secrets the task needs.
- Isolate the branch. Give each task its own branch, worktree, or VM-backed checkout. Do not let background work modify an active local branch unexpectedly.
- Choose checkpoints. Require progress to become observable after setup, after a focused implementation step, before expensive verification, and before merge.
- Plan context recovery. Ensure the agent can summarize completed work, preserve the workspace, and continue with a compact next-step state when the conversation grows large.
- Set cost boundaries. Decide whether to use a fixed timeout, a credit limit, auto-reload, a monthly spend cap, or a human approval before additional runtime.
- Review the final artifact. Inspect the diff, the command output, and any review findings. Merge only after the acceptance check has passed.
What to optimize for
The best long-running agent is not the one with the longest advertised runtime. It is the one that can turn an extended task into a sequence of recoverable, observable steps and stop with evidence. Persistent VMs, context compaction, branch isolation, and background execution matter because they support that workflow. They are not substitutes for a precise task definition.
For teams evaluating tools, start with a real task that includes setup, a multi-file change, and a meaningful verification command. Measure how often the agent needs help, whether it preserves useful state, how clearly it reports checkpoints, and whether the final diff is reviewable. A reliable coding agent should make long work easier to supervise, not harder to understand.
Frequently Asked Questions
What is a long-running coding agent?
Do long-running coding agents run forever?
Why does environment persistence matter for coding agents?
How should teams control the cost of long-running coding tasks?
When should a large coding task be split into smaller tasks?
Keep long coding tasks reviewable.
Use Captain to plan, Build to implement in isolated VMs, and Review to triage the result.

