GPT-5.5 Is the First OpenAI Model We'd Actually Run as Captain.
GPT-5.5 is OpenAI’s newest model, and it’s now live in Capy. We had early access and tested it internally, mainly as the model behind our orchestration agent, Captain Capy.
To summarise how we felt after ten days of production testing:
GPT-5.5 feels very much like a smart, more Opus-coded OpenAI model. Earlier models like GPT-5.4 are insanely good at narrow coding tasks, but often feel like talking to a brick wall the moment you need to discuss implementation or actually work through a plan. We found GPT-5.5 was nicer to talk with, took much bigger end-to-end swings, and worked better than any previous OpenAI model we have tested in our Captain harness.
OpenAI told us this model would be more token-efficient than its predecessors. We found that to be true. GPT-5.5 showed better tail latency in Captain than every production model we have been running against it, and it did so while still taking on noticeably more ambitious work. Here is the data from our testing (the GPT-5.5 data was drawn from 495 production sessions and 56,587 model calls).
Where GPT-5.5 wins
Three things stood out in real Captain work.
The first is speed you actually feel. GPT-5.5’s p95 latency sits at 16.8 seconds and its p99 at 35.4 seconds. Opus 4.6 is at 57.7 and 137 seconds; GPT-5.4 is at 46.5 and 85.2 seconds. For an orchestrator that is constantly reasoning over long threads and firing tool calls, that tail behaviour is what makes the whole product feel responsive.
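For anyone who wants to run the same kind of comparison on their own call logs, here is a minimal sketch of how p95 and p99 can be computed. It assumes the nearest-rank definition of a percentile, which may not be the method behind the figures above. The `percentile` helper and the sample latencies are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # 1-based rank into the sorted list, clamped so pct=0 still returns a value
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


# Hypothetical per-call latencies in seconds (illustration only, not our data)
latencies = [1.2, 0.8, 2.5, 1.1, 16.0, 0.9, 1.4, 3.2, 1.0, 35.0]

p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)
print(f"p50={p50}s p95={p95}s p99={p99}s")
```

With only a handful of samples, p95 and p99 collapse onto the slowest call, which is why tail figures are only meaningful over a large number of calls.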
The second is ambition. Compared to earlier OpenAI models, GPT-5.5 stops scope-minimising. It takes much bigger end-to-end swings, and it does not need the usual “please stop being conservative” prompting to actually commit to a direction.
"Way more ambitious. It went end to end, got a lot of work done, and feels much more suitable for our orchestrator agent than previous testing models."
The third is tone. GPT-5.5’s debugging updates are short, high-signal, and easy to follow in long threads. That sounds minor on paper, but it matters a lot when Captain is producing most of the text the user actually reads.
Where GPT-5.5 still stalls
Those upsides held through ten days. The downsides got sharper.
The biggest one is that review-and-fix loops still do not converge. GPT-5.5 will fix a reviewer comment in a way that introduces new problems, then re-enter the same loop without catching them. That compounds fast on any PR with more than a couple of review rounds.
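One way to make "does not converge" measurable is to track how many review comments are still open after each fix round and stop once that number stops improving. The sketch below is an illustration of that idea, not how Captain works; the `review_loop_status` function and its `patience` and `max_rounds` parameters are hypothetical.

```python
def review_loop_status(open_issue_counts, patience=2, max_rounds=6):
    """Classify a review-and-fix loop from the number of open review
    comments recorded after each round.

    Returns "converged" once a round ends with zero open comments,
    "escalate" when the count has not improved for `patience` rounds
    or `max_rounds` is reached, and "continue" otherwise.
    """
    best = None   # lowest open-comment count seen so far
    stale = 0     # consecutive rounds without a new best
    for round_no, count in enumerate(open_issue_counts, start=1):
        if count == 0:
            return "converged"
        if best is None or count < best:
            best, stale = count, 0
        else:
            stale += 1
        if stale >= patience or round_no >= max_rounds:
            return "escalate"
    return "continue"


# Hypothetical traces: a healthy loop, and one where fixes add new issues
print(review_loop_status([5, 3, 1, 0]))  # converged
print(review_loop_status([4, 5, 6]))     # escalate
```

A guard like this does not fix the underlying behaviour, but it turns a churning loop into an explicit hand-off point, for example to a human reviewer or a different model.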
"The GPT-5.5 review loop is unusable. It will keep introducing more issues as it fixes review comments. I’m switching to Opus."
Triage judgement is the next problem. On bug-heavy threads, we repeatedly caught GPT-5.5 delegating bug-finding to subagents prematurely instead of investigating inline as Captain. Motion is not the same thing as progress. A Captain that outsources the thing it should be doing is effectively skipping the actual job.
Task stacking and scope control are also still weak. When multiple fixes compete for attention in one thread, GPT-5.5 tends to creep into adjacent work instead of holding the line on the original ask. Operators repeatedly flagged that GPT-5.5 “doesn’t really know how to stack”.
How it looked in real work
Two recent sessions, plus a recurring pattern in review threads, made the tradeoff concrete.
On our Live2D and native TTS avatar pipeline, GPT-5.5 put together one of the strongest “ambitious execution” traces we have seen from an OpenAI model: broad scope, cross-system surface area, and real delivery momentum all the way through. Across a 24-minute active span, it ran 117 assistant events and 174 tool calls, with roughly 18M input and 32K output tokens. The build touched 95 files (86 added, 9 modified), largely due to bundled runtime and assets.
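To put those session numbers in proportion, the short calculation below derives a few ratios from them. The inputs are the figures quoted above, with the token counts treated as exact even though the text gives them as rough values. The variable names are ours, and the derived ratios are back-of-the-envelope only.

```python
# Figures quoted for the Live2D / native TTS session
active_minutes = 24
assistant_events = 117
tool_calls = 174
input_tokens = 18_000_000   # "roughly 18M", treated as exact here
output_tokens = 32_000      # "roughly 32K", treated as exact here
files_added, files_modified = 86, 9

files_touched = files_added + files_modified              # 95, matches the text
tool_calls_per_minute = tool_calls / active_minutes       # 7.25
tool_calls_per_event = tool_calls / assistant_events      # about 1.49
input_tokens_per_tool_call = input_tokens / tool_calls    # about 103,000
input_to_output_ratio = input_tokens / output_tokens      # 562.5

print(f"{files_touched} files, {tool_calls_per_minute:.2f} tool calls/min")
print(f"{tool_calls_per_event:.2f} tool calls per assistant event")
print(f"{input_tokens_per_tool_call:,.0f} input tokens per tool call")
print(f"{input_to_output_ratio:.1f}:1 input-to-output tokens")
```

The trace is heavily input-dominated, which is what you would expect from an orchestrator that spends most of its tokens reading context rather than writing.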
On a Clerk auth migration incident, GPT-5.5 found the root cause fast and moved to a workable fix path quickly. But the first-pass implementation still showed the recurring GPT-5.5 pattern: take the expedient shortcut first, clean up after review pressure. Useful energy, still needs Captain-level discipline on top.
On review-heavy cleanup threads, operators hit the churn repeatedly: GPT-5.5 fixing comments in ways that introduced new issues, then re-entering the same loop. This is exactly where we still prefer Opus today.
What this means for Capy users
Up until now, most of our users have been running a combination of Opus and Sonnet for Captain and GPT models for Build tasks. After ten days of production testing, we finally feel that OpenAI has a model that can seriously be used for Captain-style orchestration.
GPT-5.5 is a very viable replacement for Opus as Captain. On review-heavy, convergence-sensitive threads, Opus still wins. But if your work is fast-moving debugging, broad task orchestration, or “get me from issue to PR quickly”, GPT-5.5 is the first OpenAI model we would actually reach for as Captain, especially because of how token efficient it is. It’s available in Capy today — and for the next week, it’s served at 50% off.
GPT-5.5 is available now as a Captain and Build model. Use it at half price for the next seven days.
