
3 Jun 26

Agent Architecture for Software Development

Capy Team, Product Team

Why architecture matters more than code generation
The layers of a coding agent system
Request intake and planning
Execution workers and isolated environments
Tools, MCP, and browser access
State, context, and compaction
Verification, Git boundaries, and review
Parallel scheduling and dependency handling
Architectural trade-offs in practice
How to evaluate an architecture


Coding agent architecture for software development is the system design that turns an engineering request into a verified code change. It coordinates planning, model reasoning, isolated execution, tools, context management, Git operations, automated checks, review, and human approval so an agent can complete real repository work without treating code generation as the whole job.

Why architecture matters more than code generation

A coding agent is not just a language model with permission to write files. Useful software work is a stateful control problem. The system must understand a request, inspect a repository, choose a sequence of actions, execute commands, interpret failures, preserve progress, and produce an artifact that can enter a team's normal delivery process.

That is why model benchmarks alone do not tell you whether an agent platform will work for your team. Current models such as Claude Opus 4.7 or 4.6, GPT-5.5 or GPT-5.3-Codex, Gemini 3.1 Pro, and Grok 4.1 Fast may each be suitable for different workloads. The surrounding architecture determines whether the model receives the right context, whether it can safely use tools, and whether its output is verified before merge.

A robust design should make failures legible. If a test fails, the execution worker needs the command output. If two tasks overlap, the scheduler needs dependency information. If a pull request contains a subtle regression, the review layer needs the intended behavior as well as the diff. If a task runs long, the context layer needs to compact history without discarding the implementation state.

The layers of a coding agent system

The exact product boundary varies, but production systems generally need the following layers.

Layer	Primary responsibility	Failure mode it should prevent
Request intake	Capture scope, repository, constraints, and desired outcome	Starting with an ambiguous or unauthorized task
Planner or orchestrator	Decompose work, identify dependencies, delegate, and monitor progress	Asking one session to improvise a large project end to end
Execution workers	Inspect code, edit files, run commands, and iterate	Producing untested snippets instead of repository changes
Isolated environment	Provide a controlled filesystem, branch, dependencies, and process space	Cross-task interference or damage to a developer machine
Tools, MCP, and browser access	Connect the agent to terminals, code search, APIs, documentation, and web interfaces	Reasoning without the evidence or capabilities the task requires
State, context, and compaction	Preserve goals, progress, observations, and handoffs	Losing the plot during long-running work
Verification	Run tests, builds, type checks, linters, and targeted acceptance checks	Treating plausible code as correct code
Git and PR boundary	Package changes into reviewable branches and diffs	Mixing unrelated edits or bypassing normal delivery controls
Review and triage	Analyze diffs, emit findings, prioritize fixes, and re-check changes	Merging defects that automated tests miss
Human approval	Keep consequential decisions with accountable people	Allowing autonomy to silently become merge authority

These layers are separable even when one user interface hides the transitions. Separating them makes the system easier to reason about: a worker should not invent product scope while deep in a failing build, and a reviewer should not assume that passing tests prove the diff matches the request.

Request intake and planning

Request intake should capture more than a prompt string. At minimum, the system needs the target repository, base branch, explicit constraints, acceptance checks, relevant issue or pull request context, and the permissions available to the run. High-quality intake also records which actions require approval: opening a pull request may be routine, while changing infrastructure or merging code may not be.
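The intake fields listed above can be sketched as a small record with a readiness check. This is an illustrative assumption, not Capy's schema: `TaskRequest`, `intake_problems`, and every field name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRequest:
    """Hypothetical intake record; every field name here is illustrative."""
    repository: str
    base_branch: str
    prompt: str
    constraints: tuple[str, ...] = ()
    acceptance_checks: tuple[str, ...] = ()
    context_links: tuple[str, ...] = ()              # related issue or PR references
    permissions: frozenset[str] = frozenset()        # actions the run may perform
    approval_required: frozenset[str] = frozenset()  # actions gated on a human


def intake_problems(req: TaskRequest) -> list[str]:
    """Return reasons the request is not ready to plan (empty list means ready)."""
    problems = []
    if not req.repository:
        problems.append("missing repository")
    if not req.base_branch:
        problems.append("missing base branch")
    if not req.prompt.strip():
        problems.append("empty prompt")
    if not req.acceptance_checks:
        problems.append("no acceptance checks")
    return problems


request = TaskRequest(
    repository="example/repo",
    base_branch="main",
    prompt="Fix the flaky login test",
    acceptance_checks=("pytest tests/test_login.py",),
    permissions=frozenset({"open_pr"}),
    approval_required=frozenset({"merge"}),
)
print(intake_problems(request))  # an empty list means planning can start
```

Rejecting an incomplete request at intake is cheaper than discovering the gap after a worker has already started editing.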

The planner or orchestrator turns that request into an execution graph. For a small fix, the graph may contain one worker and one verification pass. For a larger initiative, it may contain several independent tasks, a dependency chain, and a final integration step. The planner should distinguish parallelizable work from coupled work. Two workers can often update separate packages safely; two workers rewriting the same shared interface need coordination or sequencing.
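One way to separate parallelizable work from coupled work is to group the execution graph into batches whose dependencies are already satisfied. The sketch below uses Python's standard `graphlib` module; the task names and the `parallel_batches` helper are hypothetical, and a real planner would attach far more metadata to each node.

```python
from graphlib import TopologicalSorter


def parallel_batches(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into batches; tasks in the same batch have no unmet dependencies."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are circular
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make the output stable
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical plan: two packages depend on a shared interface change,
# and a final integration step depends on both packages.
plan = {
    "update-shared-interface": set(),
    "package-a": {"update-shared-interface"},
    "package-b": {"update-shared-interface"},
    "integration": {"package-a", "package-b"},
}
print(parallel_batches(plan))
```

Here the two package tasks land in the same batch and can run concurrently, while the interface change and the integration step are sequenced around them.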

Capy uses this separation concretely. Its documentation describes Captain as the planning mode: Captain reads the codebase, creates detailed task specifications, and delegates work. Captain does not edit files or run commands. Build is the execution mode: it edits files, runs commands, installs packages, browses the web, and commits code inside an isolated Ubuntu VM. This is a useful boundary because planning quality and implementation quality are different concerns.

Execution workers and isolated environments

An execution worker needs a real development environment, not only a patch-generation API. It should be able to search the repository, read nearby conventions, edit multiple files, install dependencies, invoke the compiler, run tests, and inspect failures. The worker's loop is empirical: form a hypothesis, change the repository, run the most relevant check, and update the plan from evidence.
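The empirical loop described above can be sketched as a bounded retry cycle in which each failed check feeds its output into the next attempt. Everything here is an assumption for illustration: `work_loop`, `CheckResult`, and the two callbacks stand in for whatever the model and the environment actually provide.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    passed: bool
    output: str


def work_loop(
    apply_change: Callable[[str], None],
    run_check: Callable[[], CheckResult],
    max_attempts: int = 3,
) -> tuple[bool, list[str]]:
    """Iterate change -> check -> feedback until the check passes or attempts run out."""
    evidence: list[str] = []
    feedback = ""  # the first attempt has no failure output to learn from
    for _ in range(max_attempts):
        apply_change(feedback)           # hypothetical: edit the repository
        result = run_check()             # hypothetical: run the most relevant check
        evidence.append(result.output)   # keep every observation for later review
        if result.passed:
            return True, evidence
        feedback = result.output         # the next attempt starts from the failure
    return False, evidence
```

Bounding the attempts matters: a worker that cannot converge should hand the collected evidence back to the planner or a human instead of looping indefinitely.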

Isolation is the safety and concurrency primitive behind that loop. A dedicated cloud VM gives the task its own filesystem and processes. It can install packages or run a container without polluting a developer laptop. More importantly, separate tasks do not fight over uncommitted edits, ports, generated artifacts, or dependency versions.

Capy's welcome guide states the basic contract plainly: describe the task, an agent builds it in its own VM, then review the diff and create a pull request. Capy's Build agent runs in an isolated Ubuntu VM with common languages and tools available. That architecture supports independent task branches while keeping the developer's main branch clean.

Cloud VMs are not free. They add startup cost, resource management, and environment-configuration work. For a fast, interactive one-file edit, a local session can feel more direct. For concurrent tasks, dependency installation, Docker-based services, or work that should continue in the background, the VM boundary usually earns its cost.

Tools, MCP, and browser access

The worker is only as effective as its tool surface. Filesystem reads and writes, code search, Git, and shell execution cover many repository tasks. Real workflows often also need documentation lookup, issue and pull request context, database inspection, screenshots, or interactions with a browser-based product.

Model Context Protocol (MCP) servers provide one way to expose external systems through typed tools. MCP can connect an agent to project-management systems, monitoring data, design tools, or internal services without stuffing every possible fact into the initial prompt. Browser automation fills a different gap: it can validate user-facing behavior, reproduce a UI bug, or operate systems that lack a convenient API.

Tool access should follow least privilege. A worker that only needs read-only documentation should not receive production mutation credentials. Shell commands should run inside the task environment. Sensitive actions should be logged and, where appropriate, gated by human approval. Architecture is not merely about giving an agent more tools; it is about assigning capabilities at the narrowest useful boundary.
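As a rough sketch of least privilege, a per-task policy can deny tools outside an allow-list, hold sensitive tools until a human approves, and log every decision. The `ToolPolicy` class and the tool names are hypothetical and do not describe any particular product's permission model.

```python
from dataclasses import dataclass, field


@dataclass
class ToolPolicy:
    """Hypothetical per-task capability policy."""
    allowed: frozenset[str]
    needs_approval: frozenset[str] = frozenset()
    audit_log: list[tuple[str, str]] = field(default_factory=list)

    def decide(self, tool: str, approved: bool = False) -> str:
        """Return "allow", "hold" (waiting on a human), or "deny", and log the decision."""
        if tool not in self.allowed:
            decision = "deny"
        elif tool in self.needs_approval and not approved:
            decision = "hold"
        else:
            decision = "allow"
        self.audit_log.append((tool, decision))
        return decision


policy = ToolPolicy(
    allowed=frozenset({"read_docs", "shell", "open_pr"}),
    needs_approval=frozenset({"open_pr"}),
)
print(policy.decide("read_docs"))      # inside the allow-list, no gate
print(policy.decide("open_pr"))        # allowed, but held until approved
print(policy.decide("prod_db_write"))  # never granted to this task
```

Defaulting to deny keeps the narrowest useful boundary: a capability exists for a task only if someone explicitly granted it.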

State, context, and compaction

A repository, an issue, tool outputs, diffs, and chat history can exceed the useful attention span of a model long before the task is complete. A durable coding-agent system therefore separates operational state from conversational context.

Operational state includes the current branch, changed files, command results, open findings, task status, and dependency graph. Context is the bounded information passed into the next model turn. The context layer should retain the specification, key observations, decisions, verification status, and next steps while removing redundant logs and stale exploration.
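The split between operational state and bounded context can be illustrated with a small compaction function: a summary built from durable state is always kept, and only the newest transcript entries that fit the remaining budget are included. This is a minimal sketch under stated assumptions: the names are hypothetical, and character counts stand in for a real token budget.

```python
from dataclasses import dataclass, field


@dataclass
class OperationalState:
    """Hypothetical durable state kept outside the model's context."""
    branch: str
    spec: str
    changed_files: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)
    check_results: dict[str, bool] = field(default_factory=dict)
    next_steps: list[str] = field(default_factory=list)


def build_context(state: OperationalState, transcript: list[str], budget_chars: int) -> str:
    """Keep the state summary, then add the newest transcript entries that still fit."""
    checks = "; ".join(
        f"{name}={'pass' if ok else 'fail'}" for name, ok in state.check_results.items()
    )
    header = "\n".join([
        f"Spec: {state.spec}",
        f"Branch: {state.branch}",
        f"Changed files: {', '.join(state.changed_files) or 'none'}",
        f"Decisions: {'; '.join(state.decisions) or 'none'}",
        f"Checks: {checks or 'none'}",
        f"Next steps: {'; '.join(state.next_steps) or 'none'}",
    ])
    remaining = budget_chars - len(header)  # the header is kept even if it is large
    recent: list[str] = []
    for entry in reversed(transcript):      # walk from newest to oldest
        cost = len(entry) + 1               # +1 for the joining newline
        if cost > remaining:
            break                           # older entries are dropped
        recent.append(entry)
        remaining -= cost
    return "\n".join([header, *reversed(recent)])
```

Because the summary is rebuilt from state rather than from the transcript, dropping old log output does not lose the specification, the decisions, or the verification status.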

Capy documents a handoff mechanism for long tasks: Build and Captain can continue work in a fresh context with a concise summary, progress, and next steps. As context grows, the system provides progressive reminders before a handoff becomes required. This is a practical form of compaction. It treats context as a managed resource rather than relying on an indefinitely growing transcript.

Verification, Git boundaries, and review

Verification should happen continuously, not as a ceremonial final step. A worker should begin with targeted checks close to the edited surface, then run the broader build or test suite required by the repository. Static analysis, tests, formatting, build output, and an exercised user flow answer different questions. No single check substitutes for the rest.
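A targeted-then-broad sequence can be sketched with the standard `subprocess` module: run checks in order, capture their output, and stop at the first failure so the worker sees the closest evidence first. The `run_checks` and `verified` helpers are hypothetical, and the commands below are placeholders that a real repository would replace with its own test, lint, and build commands.

```python
import subprocess
import sys


def run_checks(checks: list[tuple[str, list[str]]]) -> list[tuple[str, bool, str]]:
    """Run checks in the given order (targeted first), stopping at the first failure."""
    results = []
    for name, command in checks:
        proc = subprocess.run(command, capture_output=True, text=True)
        passed = proc.returncode == 0
        results.append((name, passed, proc.stdout + proc.stderr))
        if not passed:
            break  # fix the nearest failure before paying for broader checks
    return results


def verified(checks: list[tuple[str, list[str]]], results: list[tuple[str, bool, str]]) -> bool:
    """A change counts as verified only if every planned check ran and passed."""
    return len(results) == len(checks) and all(passed for _, passed, _ in results)


# Placeholder commands so the sketch runs anywhere.
planned = [
    ("targeted test", [sys.executable, "-c", "print('1 passed')"]),
    ("full suite", [sys.executable, "-c", "print('42 passed')"]),
]
outcome = run_checks(planned)
print(verified(planned, outcome))
```

Requiring every planned check to have run, not merely that nothing failed, guards against reporting success after an early exit.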

Git creates an important boundary between autonomous execution and team delivery. Each task should land on a dedicated branch with a focused diff. The pull request becomes a reviewable artifact: humans and automated systems can inspect exactly what changed, compare it with the request, and decide whether it is ready to merge. The worker can prepare the change without owning the final decision.

Review deserves its own agent role. Capy's PR review documentation describes a Review agent that reads pull request diffs and emits structured findings with a title, rationale, category, severity, and code location. Captain can then triage the findings, mark false positives as irrelevant, recognize resolved issues, and delegate real fixes back to Build. This creates a controlled loop: implementation, verification, review, triage, repair, and re-review.
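The triage step can be illustrated with a small data model. The finding fields follow the list above (title, rationale, category, severity, code location); the severity scale, the status values, and the `fix_queue` helper are assumptions made for this sketch, not Capy's actual schema.

```python
from dataclasses import dataclass

# Assumed severity scale; a real system may use different labels.
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass
class Finding:
    title: str
    rationale: str
    category: str
    severity: str
    path: str
    line: int
    status: str = "open"  # assumed values: "open", "irrelevant", "resolved"


def fix_queue(findings: list[Finding]) -> list[Finding]:
    """Drop triaged-out findings and order the rest by severity, most severe first."""
    still_open = [f for f in findings if f.status == "open"]
    return sorted(
        still_open,
        key=lambda f: SEVERITY_ORDER.get(f.severity, len(SEVERITY_ORDER)),
    )


findings = [
    Finding("Unused import", "Dead code", "style", "low", "app/util.py", 3),
    Finding("Missing auth check", "Endpoint skips validation", "security", "critical", "app/api.py", 41),
    Finding("Possible N+1 query", "Already batched upstream", "performance", "medium", "app/db.py", 88,
            status="irrelevant"),
]
for finding in fix_queue(findings):
    print(finding.severity, finding.title)
```

Keeping the status on the finding itself makes re-review straightforward: a later pass only needs to look at what is still open.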

Human approval remains necessary. Tests cannot decide whether a product compromise is acceptable. A reviewer cannot infer every operational risk. Merge authority, permission-sensitive actions, and high-impact changes should stay with accountable humans even when agents do most of the mechanical work.

Parallel scheduling and dependency handling

Parallel agents increase throughput only when the scheduler understands dependencies. A backlog is not automatically a set of safe concurrent jobs. The orchestrator should identify shared files, schema dependencies, generated artifacts, ordering constraints, and integration risks before launching workers.

The cleanest fanout pattern is one independent deliverable per environment and branch. Workers can progress concurrently, while the scheduler monitors status and routes blockers. If task B depends on an interface introduced by task A, the scheduler can wait, rebase onto A's branch, or define a stable contract before either worker begins. If two tasks unexpectedly collide, the system should surface the conflict rather than silently combine edits.
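Before fanout, a scheduler can compare the files each task expects to touch and surface any overlap instead of discovering it at merge time. The sketch below assumes the planner can estimate a file set per task; the `file_conflicts` helper and the task names are hypothetical.

```python
from itertools import combinations


def file_conflicts(planned_files: dict[str, set[str]]) -> dict[tuple[str, str], set[str]]:
    """Map each pair of tasks to the files both plan to edit (pairs without overlap are omitted)."""
    conflicts = {}
    for (task_a, files_a), (task_b, files_b) in combinations(sorted(planned_files.items()), 2):
        shared = files_a & files_b
        if shared:
            conflicts[(task_a, task_b)] = shared
    return conflicts


planned_edits = {
    "task-a": {"api/types.ts", "packages/a/index.ts"},
    "task-b": {"packages/b/index.ts"},
    "task-c": {"api/types.ts"},
}
print(file_conflicts(planned_edits))
```

In this example `task-a` and `task-c` both touch the shared interface file, so they should be sequenced or given a stable contract first, while `task-b` can run alongside either of them.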

Parallelism also applies inside a task. Read-only research, independent code searches, or unrelated verification commands may run concurrently. Editing the same file from multiple workers is a different matter: local speedups can create integration debt. The architecture should maximize useful concurrency, not worker count.

Architectural trade-offs in practice

No coding-agent architecture is universally best. The right choice depends on task size, coupling, environment needs, review requirements, and how much workflow automation a team wants.

Pattern	Strengths	Trade-offs	Best fit
Single session	Fast feedback, low orchestration overhead, direct developer control	Context can become noisy; limited throughput; shared local state	Small fixes and interactive exploration
Planner → worker	Clear specifications, cleaner execution focus, easier delegation	Additional coordination step; plan can become stale if reality differs	Multi-step features and well-scoped backlog work
Multi-agent fanout	Concurrent progress across independent tasks	Conflict resolution, dependency tracking, and integration overhead	Batches of separable work across modules or repositories
Local worktrees	Lightweight branch isolation with familiar local tools	Shared machine resources; setup drift; developer machine remains involved	Desktop workflows with moderate parallelism
Cloud VMs	Strong task isolation, background execution, reproducible toolchains	Provisioning latency, cost, and secrets management	Autonomous work, CI-like environments, and higher concurrency

Public products illustrate these choices without proving that one pattern wins everywhere. Devin's managed Devins documentation describes a coordinator session that scopes work, launches child sessions in isolated VMs, monitors progress, resolves conflicts, and compiles results. That is a multi-agent fanout design suited to work that can be split into packages.

GitHub Copilot cloud agent takes a GitHub-centered approach. It can research a repository, plan, make changes on a branch, execute checks in its own ephemeral GitHub Actions-powered development environment, and optionally open a pull request. Its documented constraints, including one branch and one pull request per assigned task, are meaningful architectural boundaries rather than minor implementation details.

Codex cloud runs background tasks, including parallel tasks, in its own cloud environments and can create pull requests from connected GitHub repositories. The emphasis is delegated execution with configurable environments and tools. Conductor exposes a local-workspace choice: separate isolated workspaces for separate branches, or multiple agent workflows inside one workspace when the work belongs on the same branch.

Capy's design combines a planner → worker split with isolated task VMs, branch-based delivery, and a dedicated review-and-triage loop. That is a strong fit when the goal is not merely to accelerate one coding conversation, but to move several reviewable units of software work through a consistent pipeline.

How to evaluate an architecture

Start with your failure modes, not a feature checklist. Ask whether the system can preserve a clean branch per task, reproduce your toolchain, limit credentials, verify the requested behavior, compact long-running context, and expose an auditable diff. Then ask how it behaves when work is parallel: whether dependencies are modeled, whether conflicts are surfaced, and whether reviewers can understand the origin and status of each change.

Finally, measure outcomes at the pull-request boundary. Useful metrics include acceptance-check pass rate, reviewer correction rate, time from request to reviewable diff, rework after review, conflict frequency, and the proportion of tasks that require human rescue. The best architecture for a team is the one that produces trustworthy, mergeable work with a clear operational model—not the one that simply launches the largest number of agents.
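Several of these metrics can be computed from simple per-task records. The sketch below covers a subset of the list above; the `TaskOutcome` fields and the `summarize` helper are hypothetical, and the sample values are invented purely to show the calculation.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TaskOutcome:
    """Hypothetical record captured once a task reaches the pull-request boundary."""
    checks_passed: bool
    reviewer_corrections: int
    minutes_to_reviewable_diff: float
    had_conflict: bool
    needed_human_rescue: bool


def summarize(outcomes: list[TaskOutcome]) -> dict[str, float]:
    """Aggregate per-task outcomes into rates and a median time to a reviewable diff."""
    if not outcomes:
        raise ValueError("no outcomes to summarize")
    n = len(outcomes)
    return {
        "acceptance_pass_rate": sum(o.checks_passed for o in outcomes) / n,
        "reviewer_correction_rate": sum(o.reviewer_corrections > 0 for o in outcomes) / n,
        "median_minutes_to_diff": median(o.minutes_to_reviewable_diff for o in outcomes),
        "conflict_frequency": sum(o.had_conflict for o in outcomes) / n,
        "human_rescue_rate": sum(o.needed_human_rescue for o in outcomes) / n,
    }


# Invented sample data, for illustration only.
sample = [
    TaskOutcome(True, 0, 30, False, False),
    TaskOutcome(True, 2, 50, True, False),
    TaskOutcome(False, 1, 90, False, True),
    TaskOutcome(True, 0, 40, False, False),
]
print(summarize(sample))
```

Tracking these per architecture or per workflow, rather than per model, keeps the comparison focused on what reaches reviewers.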

Frequently Asked Questions

What is coding agent architecture for software development?

Coding agent architecture is the system design that turns a software request into a verified code change. It coordinates planning, model context, isolated execution environments, tools, Git state, tests, review, and human approvals. The architecture matters because a capable model still needs reliable boundaries and feedback loops to produce mergeable work.

Why run coding agents in isolated environments?

Isolation gives each task a controlled filesystem, dependency graph, branch, and process space. It prevents parallel tasks from overwriting one another and limits the blast radius of shell commands or package installation. A cloud VM also makes it practical to reproduce the toolchain used by CI instead of depending on a developer laptop.

When should a coding system use multiple agents?

Multiple agents are useful when work can be decomposed into independent deliverables, such as unrelated fixes, separate modules, or a batch of review findings. They are less useful when several workers would compete to edit the same files or when every step depends on the previous one. Good orchestration identifies dependencies before fanout rather than assuming more concurrency is always better.

How should coding agents handle long tasks?

Long tasks need explicit state, bounded context, and a handoff or compaction strategy. The agent should preserve the task specification, completed work, verification results, and next actions while dropping redundant transcript detail. This keeps the working context useful without pretending that an ever-growing conversation is an adequate durable state store.

Does an agent-generated pull request still need human review?

Yes. Automated verification and agent review can catch many defects, but they do not replace product judgment, risk ownership, or merge authority. A strong architecture places a human approval boundary around consequential actions such as merging, production changes, and permission-sensitive workflows.

Turn agent prompts into a delivery system.

Plan with Captain, build in isolated VMs, review structured findings, and keep every change inside a clear PR boundary.
