Agent Architecture for Software Development
Contents
- Why architecture matters more than code generation
- The layers of a coding agent system
- Request intake and planning
- Execution workers and isolated environments
- Tools, MCP, and browser access
- State, context, and compaction
- Verification, Git boundaries, and review
- Parallel scheduling and dependency handling
- Architectural trade-offs in practice
- How to evaluate an architecture

Coding agent architecture for software development is the system design that turns an engineering request into a verified code change. It coordinates planning, model reasoning, isolated execution, tools, context management, Git operations, automated checks, review, and human approval so an agent can complete real repository work without treating code generation as the whole job.
Why architecture matters more than code generation
A coding agent is not just a language model with permission to write files. Useful software work is a stateful control problem. The system must understand a request, inspect a repository, choose a sequence of actions, execute commands, interpret failures, preserve progress, and produce an artifact that can enter a team's normal delivery process.
That is why model benchmarks alone do not tell you whether an agent platform will work for your team. Current models such as Claude Opus 4.7 or 4.6, GPT-5.5 or GPT-5.3-Codex, Gemini 3.1 Pro, and Grok 4.1 Fast may each be suitable for different workloads. The surrounding architecture determines whether the model receives the right context, whether it can safely use tools, and whether its output is verified before merge.
A robust design should make failures legible. If a test fails, the execution worker needs the command output. If two tasks overlap, the scheduler needs dependency information. If a pull request contains a subtle regression, the review layer needs the intended behavior as well as the diff. If a task runs long, the context layer needs to compact history without discarding the implementation state.
The layers of a coding agent system
The exact product boundary varies, but production systems generally need the following layers.
| Layer | Primary responsibility | Failure mode it should prevent |
|---|---|---|
| Request intake | Capture scope, repository, constraints, and desired outcome | Starting with an ambiguous or unauthorized task |
| Planner or orchestrator | Decompose work, identify dependencies, delegate, and monitor progress | Asking one session to improvise a large project end to end |
| Execution workers | Inspect code, edit files, run commands, and iterate | Producing untested snippets instead of repository changes |
| Isolated environment | Provide a controlled filesystem, branch, dependencies, and process space | Cross-task interference or damage to a developer machine |
| Tools, MCP, and browser access | Connect the agent to terminals, code search, APIs, documentation, and web interfaces | Reasoning without the evidence or capabilities the task requires |
| State, context, and compaction | Preserve goals, progress, observations, and handoffs | Losing the plot during long-running work |
| Verification | Run tests, builds, type checks, linters, and targeted acceptance checks | Treating plausible code as correct code |
| Git and PR boundary | Package changes into reviewable branches and diffs | Mixing unrelated edits or bypassing normal delivery controls |
| Review and triage | Analyze diffs, emit findings, prioritize fixes, and re-check changes | Merging defects that automated tests miss |
| Human approval | Keep consequential decisions with accountable people | Allowing autonomy to silently become merge authority |
These layers are separable even when one user interface hides the transitions. Separating them makes the system easier to reason about: a worker should not invent product scope while deep in a failing build, and a reviewer should not assume that passing tests prove the diff matches the request.
Request intake and planning
Request intake should capture more than a prompt string. At minimum, the system needs the target repository, base branch, explicit constraints, acceptance checks, relevant issue or pull request context, and the permissions available to the run. High-quality intake also records which actions require approval: opening a pull request may be routine, while changing infrastructure or merging code may not be.
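As a rough illustration, the intake fields listed above can be held in a small typed record that refuses to start planning while something essential is missing. This is a minimal sketch, not any product's schema: `TaskRequest`, `Action`, and every field name are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """Hypothetical action names, used only for this illustration."""
    OPEN_PULL_REQUEST = "open_pull_request"
    CHANGE_INFRASTRUCTURE = "change_infrastructure"
    MERGE = "merge"


@dataclass(frozen=True)
class TaskRequest:
    repository: str
    base_branch: str
    description: str
    constraints: tuple[str, ...] = ()
    acceptance_checks: tuple[str, ...] = ()
    related_context: tuple[str, ...] = ()          # issue or pull request references
    allowed_actions: frozenset[Action] = frozenset()
    approval_required: frozenset[Action] = frozenset()

    def problems(self) -> list[str]:
        """Return the reasons this request is not ready to be planned."""
        found = []
        if not self.repository:
            found.append("missing repository")
        if not self.base_branch:
            found.append("missing base branch")
        if not self.acceptance_checks:
            found.append("no acceptance checks")
        return found

    def needs_approval(self, action: Action) -> bool:
        """True when a permitted action must still wait for a human decision."""
        if action not in self.allowed_actions:
            raise PermissionError(f"{action.value} is not permitted for this run")
        return action in self.approval_required
```

Keeping permissions and approval gates on the request itself means every later layer can consult the same record instead of re-deriving what the run is allowed to do.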
The planner or orchestrator turns that request into an execution graph. For a small fix, the graph may contain one worker and one verification pass. For a larger initiative, it may contain several independent tasks, a dependency chain, and a final integration step. The planner should distinguish parallelizable work from coupled work. Two workers can often update separate packages safely; two workers rewriting the same shared interface need coordination or sequencing.
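One way to picture the execution graph is as a mapping from each task to the tasks it depends on, grouped into waves that can run concurrently. The sketch below uses Python's standard `graphlib` module; the task names and the `plan_waves` helper are invented for illustration.

```python
from graphlib import TopologicalSorter


def plan_waves(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into waves. Every task in a wave has all of its
    dependencies satisfied by earlier waves, so a wave can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the plan is circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical plan: two packages depend on a shared interface change,
# documentation is independent, and integration comes last.
plan = {
    "update-client": {"define-interface"},
    "update-server": {"define-interface"},
    "integration": {"update-client", "update-server"},
    "docs": set(),
}
waves = plan_waves(plan)
# waves == [["define-interface", "docs"],
#           ["update-client", "update-server"],
#           ["integration"]]
```

A small fix collapses to a single wave with one task, while coupled work shows up as a longer chain of waves.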
Capy uses this separation concretely. Its documentation describes Captain as the planning mode: Captain reads the codebase, creates detailed task specifications, and delegates work. Captain does not edit files or run commands. Build is the execution mode: it edits files, runs commands, installs packages, browses the web, and commits code inside an isolated Ubuntu VM. This is a useful boundary because planning quality and implementation quality are different concerns.
Execution workers and isolated environments
An execution worker needs a real development environment, not only a patch-generation API. It should be able to search the repository, read nearby conventions, edit multiple files, install dependencies, invoke the compiler, run tests, and inspect failures. The worker's loop is empirical: form a hypothesis, change the repository, run the most relevant check, and update the plan from evidence.
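The empirical loop can be sketched in a few lines. In this hypothetical version the model-driven edit and the check runner are passed in as plain callables, so the control flow stays visible; `worker_loop` and its parameters are assumptions, not a real agent API.

```python
from typing import Callable

CheckResult = tuple[bool, str]  # (passed, captured output)


def worker_loop(
    apply_change: Callable[[str], None],
    run_check: Callable[[], CheckResult],
    max_iterations: int = 5,
) -> tuple[bool, list[str]]:
    """Alternate between changing the repository and checking the result.

    `apply_change` receives the output of the previous check as evidence
    (an empty string on the first pass). The loop stops on the first passing
    check or when the iteration budget is spent.
    """
    evidence = ""
    history = []
    for _ in range(max_iterations):
        apply_change(evidence)
        passed, evidence = run_check()
        history.append(evidence)
        if passed:
            return True, history
    return False, history
```

The iteration cap matters as much as the loop itself: a worker that cannot converge should hand the recorded evidence back to a planner or a person rather than keep editing.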
Isolation is the safety and concurrency primitive behind that loop. A dedicated cloud VM gives the task its own filesystem and processes. It can install packages or run a container without polluting a developer laptop. More importantly, separate tasks do not fight over uncommitted edits, ports, generated artifacts, or dependency versions.
Capy's welcome guide states the basic contract plainly: describe the task, an agent builds it in its own VM, then review the diff and create a pull request. Capy's Build agent runs in an isolated Ubuntu VM with common languages and tools available. That architecture supports independent task branches while keeping the developer's main branch clean.
Cloud VMs are not free. They add startup cost, resource management, and environment-configuration work. For a fast, interactive one-file edit, a local session can feel more direct. For concurrent tasks, dependency installation, Docker-based services, or work that should continue in the background, the VM boundary usually earns its cost.
Tools, MCP, and browser access
The worker is only as effective as its tool surface. Filesystem reads and writes, code search, Git, and shell execution cover many repository tasks. Real workflows often also need documentation lookup, issue and pull request context, database inspection, screenshots, or interactions with a browser-based product.
Model Context Protocol (MCP) servers provide one way to expose external systems through typed tools. MCP can connect an agent to project-management systems, monitoring data, design tools, or internal services without stuffing every possible fact into the initial prompt. Browser automation fills a different gap: it can validate user-facing behavior, reproduce a UI bug, or operate systems that lack a convenient API.
Tool access should follow least privilege. A worker that only needs read-only documentation should not receive production mutation credentials. Shell commands should run inside the task environment. Sensitive actions should be logged and, where appropriate, gated by human approval. Architecture is not merely about giving an agent more tools; it is about assigning capabilities at the narrowest useful boundary.
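A least-privilege tool surface might look like the following sketch, in which each tool declares the scope it needs, sensitive tools require an explicit approval callback, and every attempt is logged. The `Tool` and `ToolBox` classes and the scope strings are hypothetical and do not describe MCP or any vendor API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Tool:
    name: str
    required_scope: str            # e.g. "docs:read" or "prod:write"
    run: Callable[[str], str]
    needs_approval: bool = False   # gate sensitive actions on a human decision


class ToolBox:
    """Runs a tool only if its scope was granted to this worker."""

    def __init__(
        self,
        tools: Iterable[Tool],
        granted_scopes: Iterable[str],
        approve: Callable[[str, str], bool] = lambda name, argument: False,
    ):
        self._tools = {tool.name: tool for tool in tools}
        self._granted = frozenset(granted_scopes)
        self._approve = approve  # defaults to rejecting every gated call
        self.audit_log: list[tuple[str, str]] = []

    def call(self, name: str, argument: str) -> str:
        tool = self._tools[name]
        if tool.required_scope not in self._granted:
            self.audit_log.append((name, "denied"))
            raise PermissionError(f"{name} requires scope {tool.required_scope}")
        if tool.needs_approval and not self._approve(name, argument):
            self.audit_log.append((name, "rejected"))
            raise PermissionError(f"{name} was not approved")
        self.audit_log.append((name, "allowed"))
        return tool.run(argument)
```

The default approval callback rejects everything, so a gated tool stays unusable until someone deliberately wires in an approval path.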
State, context, and compaction
A repository, an issue, tool outputs, diffs, and chat history can exceed the useful attention span of a model long before the task is complete. A durable coding-agent system therefore separates operational state from conversational context.
Operational state includes the current branch, changed files, command results, open findings, task status, and dependency graph. Context is the bounded information passed into the next model turn. The context layer should retain the specification, key observations, decisions, verification status, and next steps while removing redundant logs and stale exploration.
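The split between durable state and bounded context can be illustrated with a small compaction function. This is a deliberately naive sketch: `TaskState` and `compact` are hypothetical names, and a real system would likely summarize with a model rather than simply trimming.

```python
from dataclasses import dataclass, field


@dataclass
class TaskState:
    """Durable operational state, which may grow without bound."""
    specification: str
    decisions: list[str] = field(default_factory=list)
    observations: list[str] = field(default_factory=list)  # may contain repeated logs
    verification_status: str = "not run"
    next_steps: list[str] = field(default_factory=list)


def compact(state: TaskState, max_observations: int = 3) -> str:
    """Build the bounded context passed into the next model turn."""
    # Drop exact duplicates (first occurrence wins), then keep only the tail.
    unique = list(dict.fromkeys(state.observations))
    recent = unique[-max_observations:] if max_observations > 0 else []
    sections = [
        ("Specification", [state.specification]),
        ("Decisions", state.decisions),
        ("Recent observations", recent),
        ("Verification", [state.verification_status]),
        ("Next steps", state.next_steps),
    ]
    lines = []
    for title, items in sections:
        lines.append(f"{title}:")
        lines.extend(f"- {item}" for item in items)
    return "\n".join(lines)
```

The specification, decisions, and next steps always survive, while only the observation log is trimmed. That mirrors the idea that compaction should shed exploration noise and keep the implementation state.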
Capy documents a handoff mechanism for long tasks: Build and Captain can continue work in a fresh context with a concise summary, progress, and next steps. As context grows, the system provides progressive reminders before a handoff becomes required. This is a practical form of compaction. It treats context as a managed resource rather than relying on an indefinitely growing transcript.
Verification, Git boundaries, and review
Verification should happen continuously, not as a ceremonial final step. A worker should begin with targeted checks close to the edited surface, then run the broader build or test suite required by the repository. Static analysis, tests, formatting, build output, and an exercised user flow answer different questions. No single check substitutes for the rest.
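A narrow-to-broad verification pass can be sketched as an ordered list of commands whose individual outcomes are recorded rather than collapsed into one flag. The `Check` and `run_checks` names are illustrative, and the demo commands are stand-ins for a repository's real test, lint, and build commands.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Check:
    kind: str                 # e.g. "targeted-test", "type-check", "full-suite"
    command: tuple[str, ...]


@dataclass(frozen=True)
class CheckOutcome:
    kind: str
    passed: bool
    output: str


def run_checks(
    checks: list[Check],
    cwd: Optional[str] = None,
    stop_on_failure: bool = True,
) -> list[CheckOutcome]:
    """Run checks in the given order, capturing output for each one."""
    outcomes = []
    for check in checks:
        proc = subprocess.run(
            list(check.command), cwd=cwd, capture_output=True, text=True
        )
        outcome = CheckOutcome(check.kind, proc.returncode == 0, proc.stdout + proc.stderr)
        outcomes.append(outcome)
        if stop_on_failure and not outcome.passed:
            break  # fail fast so the worker sees the closest failure first
    return outcomes


if __name__ == "__main__":
    # Stand-in commands; a real run would invoke the project's own tooling.
    demo = [
        Check("targeted-test", (sys.executable, "-c", "print('unit ok')")),
        Check("full-suite", (sys.executable, "-c", "raise SystemExit(1)")),
    ]
    for result in run_checks(demo):
        print(result.kind, "passed" if result.passed else "failed")
```

Ordering the list from cheapest and closest to the edit up to the full suite keeps the feedback loop short, and the per-check record lets a reviewer see which questions were actually answered.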
Git creates an important boundary between autonomous execution and team delivery. Each task should land on a dedicated branch with a focused diff. The pull request becomes a reviewable artifact: humans and automated systems can inspect exactly what changed, compare it with the request, and decide whether it is ready to merge. The worker can prepare the change without owning the final decision.
Review deserves its own agent role. Capy's PR review documentation describes a Review agent that reads pull request diffs and emits structured findings with a title, rationale, category, severity, and code location. Captain can then triage the findings, mark false positives as irrelevant, recognize resolved issues, and delegate real fixes back to Build. This creates a controlled loop: implementation, verification, review, triage, repair, and re-review.
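To make the triage step concrete, here is a minimal sketch of a structured finding and a triage pass. The fields mirror the attributes named above (title, rationale, category, severity, code location), but the schema, the `Severity` scale, and the `triage` function are assumptions for illustration and not Capy's actual data model.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Finding:
    title: str
    rationale: str
    category: str
    severity: Severity
    location: str  # e.g. "src/auth/session.py:88"


def triage(
    findings: list[Finding],
    irrelevant_titles: frozenset[str] = frozenset(),
    resolved_titles: frozenset[str] = frozenset(),
) -> dict[str, list[Finding]]:
    """Sort findings into buckets; remaining fixes are ordered by severity."""
    buckets: dict[str, list[Finding]] = {"irrelevant": [], "resolved": [], "to_fix": []}
    for finding in findings:
        if finding.title in irrelevant_titles:
            buckets["irrelevant"].append(finding)
        elif finding.title in resolved_titles:
            buckets["resolved"].append(finding)
        else:
            buckets["to_fix"].append(finding)
    # Highest severity first, so delegated repair work starts with the riskiest issue.
    buckets["to_fix"].sort(key=lambda f: f.severity, reverse=True)
    return buckets
```

Because each finding carries a location and a rationale, the `to_fix` bucket can be turned directly into follow-up task specifications for an execution worker.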
Human approval remains necessary. Tests cannot decide whether a product compromise is acceptable. A reviewer cannot infer every operational risk. Merge authority, permission-sensitive actions, and high-impact changes should stay with accountable humans even when agents do most of the mechanical work.
Parallel scheduling and dependency handling
Parallel agents increase throughput only when the scheduler understands dependencies. A backlog is not automatically a set of safe concurrent jobs. The orchestrator should identify shared files, schema dependencies, generated artifacts, ordering constraints, and integration risks before launching workers.
The cleanest fanout pattern is one independent deliverable per environment and branch. Workers can progress concurrently, while the scheduler monitors status and routes blockers. If task B depends on an interface introduced by task A, the scheduler can wait, rebase onto A's branch, or define a stable contract before either worker begins. If two tasks unexpectedly collide, the system should surface the conflict rather than silently combine edits.
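Surfacing collisions early can be as simple as comparing the files each task expects to touch before any worker starts. The sketch below assumes the planner can produce such a list per task. The `find_collisions` helper and the file paths are hypothetical.

```python
from itertools import combinations


def find_collisions(planned_files: dict[str, set[str]]) -> dict[tuple[str, str], set[str]]:
    """Return every pair of tasks that plan to edit at least one common file."""
    collisions = {}
    for (task_a, files_a), (task_b, files_b) in combinations(sorted(planned_files.items()), 2):
        shared = files_a & files_b
        if shared:
            collisions[(task_a, task_b)] = shared
    return collisions


# Hypothetical fanout: two tasks both intend to edit the shared interface file.
planned = {
    "add-billing-endpoint": {"api/routes.py", "api/schema.py"},
    "rename-user-field": {"api/schema.py", "db/models.py"},
    "update-readme": {"README.md"},
}
conflicts = find_collisions(planned)
# conflicts == {("add-billing-endpoint", "rename-user-field"): {"api/schema.py"}}
```

A non-empty result is a signal to sequence the two tasks or agree on a contract first. File overlap is only a proxy, since schema or generated-artifact dependencies would need their own checks.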
Parallelism also applies inside a task. Read-only research, independent code searches, or unrelated verification commands may run concurrently. Editing the same file from multiple workers is a different matter: local speedups can create integration debt. The architecture should maximize useful concurrency, not worker count.
Architectural trade-offs in practice
No coding-agent architecture is universally best. The right choice depends on task size, coupling, environment needs, review requirements, and how much workflow automation a team wants.
| Pattern | Strengths | Trade-offs | Best fit |
|---|---|---|---|
| Single session | Fast feedback, low orchestration overhead, direct developer control | Context can become noisy; limited throughput; shared local state | Small fixes and interactive exploration |
| Planner → worker | Clear specifications, cleaner execution focus, easier delegation | Additional coordination step; plan can become stale if reality differs | Multi-step features and well-scoped backlog work |
| Multi-agent fanout | Concurrent progress across independent tasks | Conflict resolution, dependency tracking, and integration overhead | Batches of separable work across modules or repositories |
| Local worktrees | Lightweight branch isolation with familiar local tools | Shared machine resources; setup drift; developer machine remains involved | Desktop workflows with moderate parallelism |
| Cloud VMs | Strong task isolation, background execution, reproducible toolchains | Provisioning latency, cost, and secrets management | Autonomous work, CI-like environments, and higher concurrency |
Public products illustrate these choices without proving that one pattern wins everywhere. Devin's managed Devins documentation describes a coordinator session that scopes work, launches child sessions in isolated VMs, monitors progress, resolves conflicts, and compiles results. That is a multi-agent fanout design suited to work that can be split into packages.
GitHub Copilot cloud agent takes a GitHub-centered approach. It can research a repository, plan, make changes on a branch, execute checks in its own ephemeral GitHub Actions-powered development environment, and optionally open a pull request. Its documented constraints, including one branch and one pull request per assigned task, are meaningful architectural boundaries rather than minor implementation details.
Codex cloud runs background tasks, including parallel tasks, in its own cloud environments and can create pull requests from connected GitHub repositories. The emphasis is delegated execution with configurable environments and tools. Conductor exposes a local-workspace choice: separate isolated workspaces for separate branches, or multiple agent workflows inside one workspace when the work belongs on the same branch.
Capy's design combines a planner → worker split with isolated task VMs, branch-based delivery, and a dedicated review-and-triage loop. That is a strong fit when the goal is not merely to accelerate one coding conversation, but to move several reviewable units of software work through a consistent pipeline.
How to evaluate an architecture
Start with your failure modes, not a feature checklist. Ask whether the system can preserve a clean branch per task, reproduce your toolchain, limit credentials, verify the requested behavior, compact long-running context, and expose an auditable diff. Then ask how it behaves when work is parallel: whether dependencies are modeled, whether conflicts are surfaced, and whether reviewers can understand the origin and status of each change.
Finally, measure outcomes at the pull-request boundary. Useful metrics include acceptance-check pass rate, reviewer correction rate, time from request to reviewable diff, rework after review, conflict frequency, and the proportion of tasks that require human rescue. The best architecture for a team is the one that produces trustworthy, mergeable work with a clear operational model, not the one that simply launches the largest number of agents.
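Several of these metrics reduce to simple ratios over per-task records. The following sketch assumes a team logs one record per completed task. The `TaskRecord` fields and the `summarize` function are hypothetical, and rework after review is left out because its definition varies by team.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class TaskRecord:
    checks_passed: bool
    reviewer_corrections: int
    minutes_to_reviewable_diff: float
    had_merge_conflict: bool
    needed_human_rescue: bool


def summarize(records: list[TaskRecord]) -> dict[str, float]:
    """Aggregate pull-request-boundary metrics across completed tasks."""
    if not records:
        raise ValueError("no task records to summarize")
    total = len(records)
    return {
        "acceptance_check_pass_rate": sum(r.checks_passed for r in records) / total,
        "reviewer_corrections_per_task": sum(r.reviewer_corrections for r in records) / total,
        "median_minutes_to_reviewable_diff": median(
            r.minutes_to_reviewable_diff for r in records
        ),
        "conflict_frequency": sum(r.had_merge_conflict for r in records) / total,
        "human_rescue_rate": sum(r.needed_human_rescue for r in records) / total,
    }
```

Tracking the same summary across architectures or configurations gives a comparison grounded in delivered work rather than in the number of agents launched.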

