AI
Engineering
23 Apr 26

GPT-5.5 Is the First OpenAI Model We'd Actually Run as Captain.

Capy Team, Product Team

GPT-5.5 is OpenAI’s newest model, and it’s now live in Capy. We had early access and tested it internally, mainly as the model behind our orchestration agent (Captain Capy).

To summarise how we felt from ten days of production testing:

GPT-5.5 feels very much like a smart, more Opus-coded OpenAI model. Earlier models like GPT-5.4 are insanely good at narrow coding tasks, but often feel like talking to a brick wall the moment you need to discuss implementation or actually work through a plan. We found GPT-5.5 was nicer to talk with, took much bigger end-to-end swings, and worked better than any previous OpenAI model we have tested in our Captain harness.

OpenAI told us this model would be more token efficient than its predecessors. We found that to be true. GPT-5.5 showed better tail latency in Captain than every production model we have been running against it, and it did so while taking on noticeably more ambitious work. Here is the data from our testing (the GPT-5.5 data is drawn from 495 production sessions and 56,587 model calls).

Tail latency by model over the 10-day production window: GPT-5.5 has a p95 latency of 16.8 seconds and a p99 latency of 35.4 seconds, compared with Opus 4.6 at 57.7 and 137 seconds and GPT-5.4 at 46.5 and 85.2 seconds.

Where GPT-5.5 wins

Three things stood out in real Captain work.

The first is speed you actually feel. GPT-5.5’s p95 sits at 16.8 seconds and p99 at 35.4 seconds. Opus 4.6 is at 57.7 and 137. GPT-5.4 is at 46.5 and 85.2. For an orchestrator that is constantly reasoning over long threads and firing tool calls, that tail behaviour is what makes the whole product feel responsive.
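For readers less familiar with tail-latency metrics, here is a minimal sketch of how p95 and p99 can be computed from raw per-call latencies using the nearest-rank method. The function names and the sample data are hypothetical, and this is not necessarily how the figures above were produced.

```python
def percentile(samples, q):
    """Nearest-rank percentile for an integer q in 1..100.

    Returns the smallest sample such that at least q% of all
    samples are less than or equal to it.
    """
    if not samples:
        raise ValueError("need at least one latency sample")
    if not isinstance(q, int) or not 1 <= q <= 100:
        raise ValueError("q must be an integer between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of q * n / 100, avoiding floating-point rounding.
    rank = (q * len(ordered) + 99) // 100
    return ordered[rank - 1]


def tail_latency(samples):
    """Summarise a list of latencies (in seconds) as p50, p95 and p99."""
    return {name: percentile(samples, q)
            for name, q in (("p50", 50), ("p95", 95), ("p99", 99))}


# Hypothetical data: 100 calls taking 1, 2, ..., 100 seconds.
latencies = [float(seconds) for seconds in range(1, 101)]
print(tail_latency(latencies))
```

The point of reporting p95 and p99 rather than an average is that a handful of very slow calls barely moves the mean but dominates how an interactive orchestrator feels.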

The second is ambition. Compared to earlier OpenAI models, GPT-5.5 stops scope-minimising. It takes much bigger end-to-end swings, and it does not need the usual “please stop being conservative” prompting to actually commit to a direction.

"Way more ambitious. It went end to end, got a lot of work done, and feels much more suitable for our orchestrator agent than previous testing models."

Nalin, CEO, Capy

The third is tone. GPT-5.5’s debugging updates are short, high-signal, and easy to follow in long threads. That sounds minor on paper, but it matters a lot when Captain is producing most of the text the user actually reads.

Where GPT-5.5 wins: tail latency, ambition on broad tasks, and concise, high-signal debugging updates.

Where GPT-5.5 still stalls

Those upsides held through ten days. The downsides got sharper.

The biggest one is that review-and-fix loops still do not converge. GPT-5.5 will fix a reviewer comment in a way that introduces new problems, then re-enter the same loop without catching them. That compounds fast on any PR with more than a couple of review rounds.

"The GPT-5.5 review loop is unusable. It will keep introducing more issues as it fixes review comments. I’m switching to Opus."

Justin Sun, CTO, Capy
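To make "does not converge" concrete, here is a rough sketch of a guard that classifies a review-and-fix loop from the number of open review issues after each round. The function name, the `patience` parameter, and the sample counts are all hypothetical. This is an illustration of the idea, not a description of Capy's harness.

```python
def review_loop_status(open_issues_per_round, patience=2):
    """Classify a review loop from open-issue counts, one per round.

    Returns "converged" if the latest round has zero open issues,
    "stalled" if the count has failed to beat its best value for
    `patience` consecutive rounds, and "progressing" otherwise.
    """
    if not open_issues_per_round:
        raise ValueError("need at least one round")
    if open_issues_per_round[-1] == 0:
        return "converged"

    best = open_issues_per_round[0]
    rounds_without_improvement = 0
    for count in open_issues_per_round[1:]:
        if count < best:
            # A new low: the loop is still making headway.
            best = count
            rounds_without_improvement = 0
        else:
            # Fixes introduced at least as many issues as they closed.
            rounds_without_improvement += 1

    if rounds_without_improvement >= patience:
        return "stalled"
    return "progressing"


# Hypothetical loops: one that closes out, one that churns.
print(review_loop_status([5, 3, 1, 0]))
print(review_loop_status([4, 3, 4, 5]))
```

A guard like this could be used to stop a churning loop early and hand the thread to a human or a different model, rather than letting new issues pile up round after round.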

Triage judgement is the next problem. On bug-heavy threads, we repeatedly caught GPT-5.5 delegating bug-finding to subagents prematurely instead of investigating inline as Captain. Motion is not the same thing as progress. A Captain that outsources the thing it should be doing is effectively skipping the actual job.

Task stacking and scope control are also still weak. When multiple fixes compete for attention in one thread, GPT-5.5 tends to creep into adjacent work instead of holding the line on the original ask. Operators repeatedly flagged that GPT-5.5 “doesn’t really know how to stack”.

Why GPT-5.5 review loops still stall: review loops that do not converge, premature delegation to subagents (triage judgement), and weak stacking and scope control.

How it looked in real work

Two recent sessions and one recurring pattern made the tradeoff concrete.

On our Live2D and native TTS avatar pipeline, GPT-5.5 put together one of the strongest “ambitious execution” traces we have seen from an OpenAI model: broad scope, cross-system surface area, and real delivery momentum all the way through. Across a 24-minute active span, it ran 117 assistant events and 174 tool calls, using roughly 18M input tokens and 32K output tokens. The build touched 95 files (86 added, 9 modified), largely due to bundled runtime and assets.

On a Clerk auth migration incident, GPT-5.5 found the root cause fast and moved to a workable fix path quickly. But the first-pass implementation still showed the recurring GPT-5.5 pattern: take the expedient shortcut first, clean up after review pressure. Useful energy, still needs Captain-level discipline on top.

On review-heavy cleanup threads, operators hit the churn repeatedly: GPT-5.5 fixing comments in ways that introduced new issues, then re-entering the same loop. This is exactly where we still prefer Opus today.

What this means for Capy users

Up until now, most of our users have been running a combination of Opus and Sonnet for Captain and GPT models for Build tasks. After ten days of production testing, we finally feel that OpenAI has a model that can seriously be used for Captain-style orchestration.

GPT-5.5 is a very viable replacement for Opus as Captain. On review-heavy, convergence-sensitive threads, Opus still wins. But if your work is fast-moving debugging, broad task orchestration, or “get me from issue to PR quickly”, GPT-5.5 is the first OpenAI model we would actually reach for as Captain, especially because of how token efficient it is. It’s available in Capy today, and for the next week it’s served at 50% off.

GPT-5.5 is available now as a Captain and Build model. Use it at half price for the next seven days.