On April 23, Anthropic published an engineering postmortem explaining why Claude Code has felt off for weeks, OpenAI shipped GPT-5.5 pointed straight at Claude’s coding lead, and a day earlier Google upgraded Gemini Enterprise. In this edition: what actually broke inside Claude Code (three stacked bugs, 3% eval drop), how the big three moved in 48 hours, and what it means for anyone picking an agent platform.

Topic of the Week

The April 23 Collision

Anthropic’s postmortem is worth reading in full, but here’s the gist: not an outage — a month-long quality regression in Claude Code, Agent SDK, and Cowork (API users were fine). Three independent bugs stacked. First, on March 4, they quietly lowered default reasoning effort from high to medium to reduce tail latency — Claude literally got less smart, reverted April 7. Second, a prompt-cache optimization meant to prune old reasoning once from idle sessions had a bug that applied it every turn — so Claude was executing tool calls without remembering why it was calling them. Fixed April 10. Third, a single system-prompt line telling Claude to keep responses under 100 words cost 3% on evals for both Opus 4.6 and 4.7. Reverted April 20. The wild part: the caching bug passed human review, automated review, unit tests, e2e tests, and dogfooding. Two unrelated experiments masked it. It took seven weeks to catch.

Same day, OpenAI shipped GPT-5.5 with very agentic framing: 82.7% on Terminal-Bench 2.0, 58.6% on SWE-Bench Pro, 400K context in Codex, 1M via API. Pricing is $5/$30 per 1M tokens ($30/$180 for Pro), paid-only, no free tier. Brockman pitched it as a step toward a “super app” unifying ChatGPT, Codex, and an AI browser. The timing is not subtle — this is pointed directly at Claude’s coding-agent moat.

Practical takeaway: Anthropic is resetting usage limits for all subscribers as an apology, and they explicitly credited public /feedback reports for surfacing the bug — the feedback loop worked. But pair a seven-week invisible regression with GPT-5.5 landing on the same calendar day, and the message is clear: the coding-AI market just got noisier, the benchmarks got harder to trust, and “test on your actual workload” is the loudest advice we can give this week.

Full postmortem: https://www.anthropic.com/engineering/april-23-postmortem

Fresh Papers

Aethon: A Reference-Based Replication Primitive for Constant-Time Instantiation of Stateful AI Agents
This is the kind of infrastructure paper that stops being boring the moment you try to run more than a handful of agents in production. The premise is that the whole industry is mid-transition from stateless model inference (send prompt, get response, forget everything) to stateful agentic execution (persistent tools, memory, session state that has to live somewhere between calls). The runtime architectures we have weren’t designed for that — every new agent session rebuilds its context from scratch, which means cost and latency grow linearly with fleet size. Aethon proposes a reference-based replication primitive: instead of reconstructing tools, memory, and session state on every instantiation, you clone a reference state. Constant time, regardless of how heavy the agent’s context is. This is the same pattern showing up across the managed-agents trend — Cloudflare’s Agent Cloud launch, Anthropic’s managed agents engineering post, the general platform-maturity push. If you’re building anything that needs to scale past a demo, this is the primitive you’ll want your runtime to support.

Fairness in Multi-Agent Systems for Software Engineering: An SDLC-Oriented Rapid Review
A survey rather than a benchmark, so no headline numbers, but worth flagging because it maps where bias enters the SDLC (planning, coding, review, deployment) and shows how little of it is actually being measured. If your team uses Claude Code, Copilot, Codex, or Cursor daily, these agents are quietly making systematic choices about which patterns, languages, and frameworks to prefer — and nobody’s really auditing that. This pairs uncomfortably with this week’s Anthropic postmortem: if multi-stage review missed a straightforward eval regression, fairness drift across a coding fleet is almost certainly going unnoticed too.

New Models

GPT-5.5 & GPT-5.5 Pro
OpenAI dropped its next generalist on April 23, framing it as the agentic successor to GPT-5.4 and pushing it into ChatGPT and Codex the same day. Benchmarks go straight at Claude’s coding lead: 82.7% on Terminal-Bench 2.0 and 58.6% on SWE-Bench Pro. Context is 400K tokens in Codex, 1M via API. Pricing is $5/$30 per 1M input/output tokens for GPT-5.5 and $30/$180 for Pro, with 50% off on Batch/Flex and 2.5x for Priority. Latency matches 5.4 per token; a new “Codex Fast” mode runs ~1.5x faster at 2.5x cost. Paid tiers only — no free access. Greg Brockman pitched it as a step toward a “super app” unifying ChatGPT, Codex, and an AI browser. Following last week’s Agent Cloud launch, the generalist-plus-platform combo is taking shape fast. Announcement · TechCrunch

Kimi K2.6 by Moonshot AI
Open-weight release live on Kimi Chat and the Moonshot API, with weights on Hugging Face and endpoints at platform.moonshot.ai. Strong on coding and agentic tasks in both chat and agent modes on kimi.com. Getting traction this week as the go-to alternative in the Reddit thread on Claude Code being pulled from Pro plans.

Qwen3.6-35B-A3B on consumer hardware
Someone got it running at 79 tokens/sec with 128K context on a gaming PC (RTX 5070 Ti + 9800X3D). The unlock is the --n-cpu-moe flag, which offloads MoE experts to CPU so the model fits in consumer VRAM. Concrete proof that serious open-weight coding models now run locally at usable speeds — fuel for the “just run it ourselves” migration story. Reddit

Claude Code & Coding AI

Claude Code unbundled from Pro plan. Anthropic removed the CLI from the $20 Claude Pro tier — Pro users now need Max to run claude against their subscription. The Reddit thread hit 1,388 upvotes with the top comment framing it as “better time than ever to switch to Local Models” — community sentiment tipped toward Kimi K2.6 and Qwen3.6. Anthropic’s head of growth responded on Reddit; community translation: “We gave you Cowork, the CLI is Max-only now.” Worth noting: API keys still work with Claude Code. It’s an unbundling from the subscription, not a product kill.

v2.1.117 — subagent and MCP upgrades. Two things matter here: – Forked subagents on external builds: enable via CLAUDE_CODE_FORK_SUBAGENT=1 env var — previously gated – Agent-level MCP servers: mcpServers in agent frontmatter now loads for main-thread sessions started with --agent, so per-agent tool scoping actually works – Improved /model selection — smoother picker

Release notes: https://github.com/anthropics/claude-code/releases/tag/v2.1.117

v2.1.113 — native binary + sandbox tightening. The CLI now spawns a native Claude Code binary via per-platform optional dependency instead of the bundled JavaScript bundle — faster startup, less Node overhead. Also added sandbox.network.deny for outbound network restrictions in sandbox mode. Release notes: https://github.com/anthropics/claude-code/releases/tag/v2.1.113

One practical note tying to the Topic of Week: the postmortem confirmed Claude Code, Agent SDK, and Cowork users were hit by the regression (pure API was fine). Anthropic is resetting usage limits for all affected subscribers as compensation — check your account.

Also This Week

Google Ships Gemini Enterprise

While Anthropic was writing its postmortem and OpenAI was staging GPT-5.5, Google quietly dropped a major Gemini Enterprise update on April 22 — long-running agents, agentic collaboration spaces, advanced governance, partner-built agents available in the catalog, and a deepened SAP partnership putting Gemini directly inside core business processes. Nothing flashy for consumers, but for anyone deploying AI across a big organization this is the clearest “Google wants the enterprise agent layer” signal yet. Worth keeping on your radar if you’re evaluating agent platforms for regulated or large-org workloads — the three-way race between Claude, OpenAI, and Gemini is real now, not hypothetical.

Tools of the Week

Claude Design by Anthropic Labs
A new tool built on Claude Opus 4.7’s vision capabilities that generates prototypes, pitch decks, and marketing materials while enforcing brand consistency automatically. Aimed at design and marketing workflows — basically Canva-meets-Claude.

AI at Tenvalleys

At one of our banking clients we’re running a development project that leans hard on AI and the BMAD framework (Breakthrough Method of Agile AI-Driven Development) — an open-source approach where AI agents take on the roles you’d find in a real dev team: analyst, product manager, architect, developer, QA. Each role hands work off to the next with a structured spec — the same way humans do — except the agents can run in parallel and never lose the handoff format.

We’re using it to test different approaches to automating code-base migration at production scale. The question we’re trying to answer is the unglamorous one: which method actually survives when you point it at a real legacy codebase, not a toy repo? Different teams in the project are testing different approaches against each other — stay tuned to hear which one wins.

Working on legacy codebase migration and want to compare notes on what’s holding up at scale? Reach out at contact@tenvalleys.com.

See you next week.

Prepared at Tenvalleys — a delivery-first AI engineering partner — by Nikola Powałka. Feedback? Email us at contact@tenvalleys.com or reach out on LinkedIn.