Opus 4.8 Ships With an Orchestration Brain

A big technical week: Claude Opus 4.8 landed alongside a new primitive that lets the model design and run its own agent fleet. The $65 billion raise, the Milan office, the Vatican speech — all real, all loud, all background music to the model and the orchestration shift.

Topic of the Week

Opus 4.8 plus /workflows — Claude writes its own orchestration

The model. Claude Opus 4.8 shipped on Thursday at the same $5 / $25 per-million-token price as 4.7. Anthropic reports that it is ~4× less likely than 4.7 to introduce code flaws, scores 84% on Online-Mind2Web (a web-agent benchmark), and is the first model to break 10% on the Legal Agent Benchmark’s all-pass standard, i.e., the share of legal tasks where the model gets every sub-step right, not just the final answer. Fast mode runs at $10 / $50, billed as 2.5× faster and 3× cheaper than the previous fast tier. Available as claude-opus-4-8 from May 28.
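To make the pricing concrete, here is a minimal sketch of the arithmetic at the quoted rates. It assumes the two figures in each pair are the input and output prices per million tokens, in that order. The token counts are hypothetical, and anything not quoted above (prompt-cache discounts, for example) is ignored.

```python
# Per-million-token prices quoted above, read as (input, output) in USD.
# The input/output ordering is an assumption of this sketch.
PRICES = {
    "standard": (5.00, 25.00),
    "fast": (10.00, 50.00),
}


def run_cost(input_tokens: int, output_tokens: int, mode: str = "standard") -> float:
    """Return the USD cost of one run at the quoted list prices."""
    in_price, out_price = PRICES[mode]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical agent run: 2M input tokens and 500k output tokens.
standard = run_cost(2_000_000, 500_000)           # 22.5 USD
fast = run_cost(2_000_000, 500_000, mode="fast")  # 45.0 USD
```

At these list prices fast mode costs exactly twice the standard rate for the same token counts, so the question is whether the claimed speed-up is worth doubling the bill for a given job.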

The orchestration primitive. Same day, Claude Code v2.1.154 introduced /workflows. You ask Claude to build a workflow for a task; it generates a JavaScript file describing one; that file then fans out across tens to hundreds of parallel subagents, with each subagent returning results directly instead of routing through a central orchestrator. The point: the central planner no longer pays the full context tax, because it no longer has to hold every subagent answer in its own context window. Demonstrated on a ~750,000-line Rust codebase. It pairs with a new /effort xhigh setting and a Messages API change: system entries can now be injected mid-task without breaking the prompt cache, which materially changes long-horizon agent economics. You can finally steer an in-flight run without paying to recompute the whole context.
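The fan-out pattern can be sketched in a few lines. This is only an illustration of the idea, not the /workflows file format (which is JavaScript, and which we have not reproduced here). The function names, the shared results store, and the file names are all hypothetical, and the model call is replaced by a stand-in.

```python
import asyncio


async def subagent(task_id: int, chunk: str, results: dict[int, str]) -> int:
    """Hypothetical subagent: writes its full result straight to the shared
    store and hands only its id back to the planner."""
    await asyncio.sleep(0)  # stand-in for a real model call
    results[task_id] = f"reviewed {chunk}"
    return task_id


async def run_workflow(chunks: list[str], max_parallel: int = 8) -> dict[int, str]:
    """Fan out one subagent per chunk, capped at max_parallel at a time."""
    results: dict[int, str] = {}
    gate = asyncio.Semaphore(max_parallel)

    async def guarded(task_id: int, chunk: str) -> int:
        async with gate:
            return await subagent(task_id, chunk, results)

    # The planner only sees the small list of ids, never the full outputs.
    done = await asyncio.gather(*(guarded(i, c) for i, c in enumerate(chunks)))
    assert len(done) == len(chunks)
    return results


chunks = [f"module_{i}.rs" for i in range(20)]  # hypothetical file names
results = asyncio.run(run_workflow(chunks))
```

The design choice worth noticing is what the planner receives: a list of ids rather than twenty full answers. That is our reading of "returning results directly", under the assumptions stated above.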

The implication. Long-running agent work — the kind regulated clients keep asking about — moved a notch closer to “this is production-ready” and away from “this is an expensive experiment”. Last week we covered Claude Code versions v2.1.142–146, which made it easy to run several Claude agents in parallel in the background. This week, Claude writes the script that coordinates them — that’s what /workflows does.

The business backdrop. All of this landed in a week where Anthropic also raised $65 billion at a $965 billion valuation (on track for ~$47B annual revenue), opened a Milan office with a list of named Italian clients (Generali, Enel, Pirelli, Satispay, Bending Spoons), and appointed a Korea Representative Director ahead of a Seoul opening.

And the skeptics finally got loud. Larry Ellison argued at Oracle’s earnings call that frontier models trained on the same public-internet text are “rapidly commoditizing”, and that the real competitive advantage will be private enterprise data, not the model itself. The Wall Street Journal ran a piece on AI bears stirring after three years of silence. The Financial Times’ AI capex coverage put 2026 Big-Four cloud-provider spend at $725 billion (up 77% year-on-year, the largest single-year concentrated infrastructure cycle in tech history) and asked whether that pace can deliver positive ROI before 2030. And Gary Marcus, the cognitive scientist who’s been writing the AI bear case for years, posted the line a lot of investors were thinking on the day of the raise: “was this priced into the $965 billion?” Both bets, Anthropic’s and the bears’, are now on the table at full size.

Fresh Papers

FluxMem — memory that rewires itself, not just appends (Alibaba). Turns out the memory question isn’t only something we’ve been chewing on internally — it’s the live debate at the research frontier this week too. Last week we covered LongMINT, the benchmark showing every popular agent-memory framework (RAG, MemGPT, MemAgent, SimpleMem) plateaus around a third on long histories. The failure was always the same: agents save every new fact as a new memory instead of updating the old one — so a customer’s address ends up stored three times if it ever changed. FluxMem is the first response we’ve seen that actually addresses this. It treats memory as a three-layer graph (semantic / episodic / procedural) that continuously prunes distractor edges and consolidates repeated successes into reusable procedural circuits — not append-only. The headline result: on a Mind2Web cross-task benchmark, FluxMem more than doubled the success rate of the best prior memory system.
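The append-versus-update failure is easy to show in miniature. The sketch below is not FluxMem’s algorithm (the paper’s three-layer graph is far richer); it only contrasts an append-only store with one that overwrites a changed fact. The class names and the sample addresses are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AppendOnlyMemory:
    """Stores every fact as a new entry, so stale values pile up."""
    entries: list[tuple[str, str]] = field(default_factory=list)

    def remember(self, key: str, value: str) -> None:
        self.entries.append((key, value))

    def recall(self, key: str) -> list[str]:
        return [v for k, v in self.entries if k == key]


@dataclass
class UpdatingMemory:
    """Keeps one current value per key and overwrites it on change."""
    facts: dict[str, str] = field(default_factory=dict)

    def remember(self, key: str, value: str) -> None:
        self.facts[key] = value

    def recall(self, key: str) -> list[str]:
        return [self.facts[key]] if key in self.facts else []


addresses = ["1 Old Road", "2 Middle Lane", "3 New Street"]  # hypothetical data
append_only, updating = AppendOnlyMemory(), UpdatingMemory()
for address in addresses:
    append_only.remember("customer:address", address)
    updating.remember("customer:address", address)
```

After the loop, the append-only store returns all three addresses for the same customer and leaves the agent to guess which one is current, while the updating store returns only the latest.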

HRBench — when “thinking mode” is actually worth the tokens (Tencent + HKUST). Two clean rules of thumb fall out of the benchmark: on math and science, “prompt-tuning” beats full thinking mode on both axes — slightly better accuracy and fewer tokens. On code, speculative execution wins big but burns more tokens. And with the right routing strategy, you can cut token costs by ~70% while matching the accuracy of always-on thinking. Direct payoff for anyone using Claude Code’s new /effort xhigh setting — don’t crank thinking on math problems, do crank it on code.
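A routing strategy of this kind can be as simple as a lookup table. The sketch below is our own illustration of the rules of thumb as summarized above, not the router evaluated in the benchmark. The mode names, the task categories, and the default are assumptions.

```python
from enum import Enum


class Mode(Enum):
    PROMPT_TUNED = "prompt-tuned"  # no extended thinking
    HIGH_EFFORT = "high-effort"    # spend extra tokens on reasoning


# Hypothetical routing table following the rules of thumb above.
ROUTES = {
    "math": Mode.PROMPT_TUNED,
    "science": Mode.PROMPT_TUNED,
    "code": Mode.HIGH_EFFORT,
}


def route(task_kind: str, default: Mode = Mode.PROMPT_TUNED) -> Mode:
    """Pick a reasoning mode for a task category, cheap mode by default."""
    return ROUTES.get(task_kind.lower(), default)
```

Calling route("code") returns the high-effort mode, and anything unrecognized falls back to the cheaper one. A production router would presumably classify tasks rather than trust a label, which this sketch does not attempt.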

New Models

Qwen3.7-Max (Alibaba). Proprietary, top scores across Terminal-Bench 2.0-Terminus, SWE-Pro, SciCode, MCP-Mark, GPQA Diamond, HMMT Feb 2026, and IMOAnswerBench. Runs cleanly across Claude Code, OpenCode, Qwen Code, and custom harnesses — Alibaba is doing the harness-compatibility work most labs skip. The r/LocalLLaMA reaction (“Waiting for Qwen 3.7 open weight — the new King has arrived”, 828 upvotes) tells you where the local-coding crowd is putting its bets. Last week the through-line was Qwen 3.6 on a MacBook; this week Alibaba just posted the cloud benchmarks. The local-coding rope keeps getting thicker.

Claude Code & Coding AI

“The Unreasonable Effectiveness of HTML” — Anthropic engineering post by Thariq Shihipar (Claude Code team). Argument: Markdown was the default agent-output format because GPT-4-era tokens were expensive — but with current pricing, HTML unlocks a much richer artifact (SVG diagrams, interactive widgets, tabs, in-page nav, charts, annotated PR diffs). The companion gallery at thariqs.github.io/html-effectiveness shows ~20 self-contained HTML artifacts generated by Claude — side-by-side comparisons, call-graphs, design-system token previews, browser slide decks, custom editors. Worth a read if you ship analyses, dashboards, or PR-review artifacts as part of a Claude Code workflow.

“How we contain Claude across products” — Anthropic’s first deep engineering post on sandboxing. Walks through what isolation actually looks like in production for Claude.ai, Claude Code, Claude for Chrome, the Files API, and Computer Use. The vocabulary shift is the interesting part: this whole post avoids the word “guardrails” and uses “containment” instead. The piece pairs neatly with Perplexity open-sourcing Bumblebee the same week — a read-only scanner for risky packages, extensions, and AI tool configs on developer laptops.

Tools of the Week

Mistral Connectors API — Public Preview. Mistral promoted MCP from a feature to a first-class API primitive: register an MCP connector once, use it across Le Chat, AI Studio, and the API — plus arbitrary custom remote MCP endpoints, with explicit human approval before any sensitive action. Last week Anthropic shipped private-network MCP tunnels; this week Mistral made MCP a public API surface. The protocol is winning, and “Anthropic-only” is no longer a fair label.

Data Formulator 0.7. Microsoft’s open-source release for natural-language enterprise analytics, aimed at analysts and domain experts who don’t code. The headline feature is a Data Thread — a structured chat that records every question, finding, and chart spec, so the whole analysis is reproducible and reviewable. Audit-trail-by-default — right pattern for regulated clients.

AI at Tenvalleys

The local-model thread became an experiment. Right after last Friday’s edition, the question went up internally — are we going to seriously test running coding agents on local or self-hosted models? Turns out a few people on the team have already been running these experiments in their own time, and one of our engineers came back with a hefty batch of hands-on data they’d been collecting.

The numbers. Gemma 4 31B on a single H100 codes well when paired with Spec-Driven Development — Claude writes the spec, Gemma implements. We clocked ~5–6 minutes per task on a representative case (an XML-to-JSON PII anonymizer). Four parallel agents on the same H100 saw no degradation; eight dropped throughput by ~50%. Consumer hardware is out of the conversation — another teammate tested Gemma 4 31B on a Windows PC with 32GB RAM and an RTX 4070 Ti Super (16GB VRAM) and got ~1 second per code completion. The word that came back: “unusable”.
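Those figures translate into rough throughput as follows. The sketch takes 5.5 minutes as the midpoint of the measured range and reads "dropped throughput by ~50%" as each agent running at half speed. That reading is an assumption: if the 50% drop applied to the whole machine instead, the eight-agent figure would be half of what is computed here.

```python
def tasks_per_hour(minutes_per_task: float, agents: int,
                   per_agent_slowdown: float = 1.0) -> float:
    """Aggregate tasks per hour for a number of parallel agents.

    per_agent_slowdown multiplies the time each agent needs per task,
    so 2.0 means every agent runs at half speed.
    """
    per_agent = 60.0 / (minutes_per_task * per_agent_slowdown)
    return agents * per_agent


MINUTES_PER_TASK = 5.5  # midpoint of the ~5-6 minute range above

single = tasks_per_hour(MINUTES_PER_TASK, agents=1)
four = tasks_per_hour(MINUTES_PER_TASK, agents=4)
# Assumed reading: at eight agents, each agent takes twice as long.
eight = tasks_per_hour(MINUTES_PER_TASK, agents=8, per_agent_slowdown=2.0)
```

Under that reading, one agent completes about 11 tasks an hour, four complete about 44, and eight also complete about 44, so on a single H100 the extra four agents buy nothing.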

The pattern that’s getting interesting. Claude does the spec → Gemma does the implementation → Claude writes the tests. If that loop holds, the implementation machine runs overnight without supervision. The business case isn’t only cost; it’s the ability to run on-prem, which for regulated clients starts mattering well before cost does. We’re weighing two options: buy a DGX Spark ($4,699, but with roughly one-eleventh of an H100’s memory bandwidth) or rent H100 time. If you’ve shipped local-model coding agents in production, we’d love to compare notes. Reach out at contact@tenvalleys.com.
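The buy-versus-rent question starts with a break-even calculation. The sketch below uses the DGX Spark price from the text; the H100 hourly rate is a placeholder we made up for illustration, not a quote. It also ignores power, hosting, and the bandwidth gap, so an hour on one machine is not equivalent to an hour on the other.

```python
DGX_SPARK_PRICE_USD = 4_699  # purchase price quoted above


def break_even_hours(purchase_price: float, rental_rate_per_hour: float) -> float:
    """Hours of rental that cost as much as buying the hardware outright."""
    if rental_rate_per_hour <= 0:
        raise ValueError("rental rate must be positive")
    return purchase_price / rental_rate_per_hour


# Placeholder rate in USD per hour. Substitute a real quote before deciding.
HYPOTHETICAL_H100_RATE = 2.50

hours = break_even_hours(DGX_SPARK_PRICE_USD, HYPOTHETICAL_H100_RATE)
```

At the placeholder rate the break-even lands just under 1,880 rented hours. Since the rented card is the much faster one, the real comparison would need throughput per dollar, which this sketch leaves out.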

In the Background

Chris Olah spoke at the Vatican on May 25 alongside the release of Pope Leo XIV’s first encyclical on AI, Magnifica humanitas: On safeguarding the human person in the time of artificial intelligence. Olah told the audience that his interpretability research has found “internal states that functionally mirror joy, satisfaction, fear, grief” inside Claude — and that every frontier lab “operates inside a set of incentives and constraints that can sometimes conflict with doing the right thing”. That’s a striking institutional admission, made on Vatican soil, by an Anthropic co-founder. Expect it to surface in EU AI Act discussions and enterprise risk committees for months.

For Dessert

Google claimed this week that a swarm of Gemini 3.5 Flash agents built an entire operating system from a single prompt, for $916.92 in API fees and ~2.6 billion tokens. Arvind Narayanan and Sayash Kapoor walked through the announcement on Normal Tech (formerly AI Snake Oil) and pointed out a few things: the “single prompt” turned out to be many thousands of lines, disclosed halfway through Google’s own blog post; the OS is the kind of thing undergraduates write as a course project, and public implementations are easy to find on GitHub; and Google released no code, no logs, no prompt, and no similarity analysis showing the agents didn’t simply reproduce existing implementations. Their verdict: “Google’s blog post is effectively a press release… it is unrealistic to expect it to be scientifically rigorous.” The claim is unfalsifiable as published — and useful as a reminder of where the agent-AI marketing gap currently sits.

Prepared at Tenvalleys — a delivery-first AI engineering partner — by Nikola Powałka. Feedback? Email us at contact@tenvalleys.com or reach out on LinkedIn.