Opus 4.8 ships with an orchestration brain

May 29, 2026 | Nikola Powałka

A big technical week: Claude Opus 4.8 landed alongside a new primitive that lets the model design and run its own agent fleet. The $65 billion raise, the Milan office, the Vatican speech — all real, all loud, all background music to the model and the orchestration shift.

Topic of the Week

Opus 4.8 plus /workflows — Claude writes its own orchestration

The model. Claude Opus 4.8 shipped on Thursday at the same $5 / $25 per-million-token price as 4.7. Anthropic reports it as ~4× less likely than 4.7 to introduce code flaws, 84% on Online-Mind2Web (a web-agent benchmark), and the first model to break 10% on the Legal Agent Benchmark’s all-pass standard — i.e., the share of legal tasks where the model gets every sub-step right, not just the final answer. Fast mode runs at $10 / $50, billed as 2.5× faster and 3× cheaper than the previous fast tier. Available as claude-opus-4-8 from May 28.
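To make the pricing concrete, here is a minimal sketch of the per-request arithmetic at the quoted rates. The workload size (200k input tokens, 20k output tokens) and the function name are illustrative assumptions, and the sketch ignores any discounts or billing details not mentioned above.

```python
# Prices in USD per million tokens, as quoted in the paragraph above.
PRICES = {
    "standard": {"input": 5.0, "output": 25.0},
    "fast": {"input": 10.0, "output": 50.0},
}


def request_cost(input_tokens: int, output_tokens: int, mode: str = "standard") -> float:
    """Return the USD cost of one request at the quoted per-million-token rates."""
    rate = PRICES[mode]
    return (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1_000_000


# Hypothetical workload: 200k input tokens and 20k output tokens.
standard_cost = request_cost(200_000, 20_000)
fast_cost = request_cost(200_000, 20_000, mode="fast")
print(f"standard ${standard_cost:.2f}, fast ${fast_cost:.2f}")
```

At these list prices fast mode costs exactly twice the standard rate for the same token counts, so the trade is latency against spend.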

The orchestration primitive. Same day, Claude Code v2.1.154 introduced /workflows. You ask Claude to build a workflow for a task; it generates a JavaScript file describing one; that file then fans out across tens to hundreds of parallel subagents, with each subagent returning results directly instead of routing through a central orchestrator. The point: the central planner no longer pays the full context-tax for every subagent answer. Demonstrated on a ~750,000-line Rust codebase. Pairs with a new /effort xhigh setting and a Messages API change — system entries can now be injected mid-task without breaking the prompt cache, which materially changes long-horizon agent economics. You can finally steer an in-flight run without paying to recompute the whole context.
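The fan-out idea can be sketched in a few lines. This is not the /workflows file format (which, per the release, is generated JavaScript); it is a generic Python illustration in which hypothetical subagents write their full reports to a shared store and hand the planner only a small handle. The names run_subagent and fan_out, and the payload sizes, are assumptions.

```python
import asyncio
from typing import Dict, List, Tuple


async def run_subagent(task_id: int, store: Dict[int, str]) -> int:
    """Hypothetical subagent: a real one would call a model here."""
    await asyncio.sleep(0)  # placeholder for real work
    store[task_id] = f"report for shard {task_id} " + "x" * 1000  # large payload
    return task_id  # the planner only receives this small handle


async def fan_out(num_tasks: int, max_parallel: int) -> Tuple[List[int], Dict[int, str]]:
    store: Dict[int, str] = {}
    limit = asyncio.Semaphore(max_parallel)  # cap on concurrent subagents

    async def guarded(task_id: int) -> int:
        async with limit:
            return await run_subagent(task_id, store)

    handles = await asyncio.gather(*(guarded(i) for i in range(num_tasks)))
    return list(handles), store


handles, store = asyncio.run(fan_out(num_tasks=100, max_parallel=20))
planner_chars = sum(len(str(h)) for h in handles)
payload_chars = sum(len(report) for report in store.values())
print(f"planner saw {planner_chars} chars; subagents produced {payload_chars} chars")
```

The design point is the gap between the two numbers: the planner's context grows with the handles, not with the reports.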

The implication. Long-running agent work — the kind regulated clients keep asking about — moved a notch closer to “this is production-ready” and away from “this is an expensive experiment”. Last week we covered Claude Code versions v2.1.142–146, which made it easy to run several Claude agents in parallel in the background. This week, Claude writes the script that coordinates them — that’s what /workflows does.

The business backdrop. All of this landed in a week where Anthropic also raised $65 billion at a $965 billion valuation (on track for ~$47B annual revenue), opened a Milan office with a list of named Italian clients (Generali, Enel, Pirelli, Satispay, Bending Spoons), and appointed a Korea Representative Director ahead of a Seoul opening.

And the skeptics finally got loud. Larry Ellison argued at Oracle’s earnings call that frontier models trained on the same public-internet text are “rapidly commoditizing” — and that the real competitive advantage will be private enterprise data, not the model itself. The Wall Street Journal ran a piece on AI bears stirring after three years of silence. The Financial Times’ AI capex coverage put 2026 Big-Four cloud-provider spend at $725 billion — up 77% year-on-year, the largest single-year concentrated infrastructure cycle in tech history — and asked whether that pace can ever clear positive ROI before 2030. And Gary Marcus, the cognitive scientist who’s been writing the AI bear case for years, posted the line a lot of investors were thinking on the day of the raise: “was this priced into the $965 billion?” Both bets — Anthropic’s and the bears’ — are now on the table at full size.

Fresh Papers

FluxMem — memory that rewires itself, not just appends (Alibaba). Turns out the memory question isn’t only something we’ve been chewing on internally — it’s the live debate at the research frontier this week too. Last week we covered LongMINT, the benchmark showing every popular agent-memory framework (RAG, MemGPT, MemAgent, SimpleMem) plateaus around a third on long histories. The failure was always the same: agents save every new fact as a new memory instead of updating the old one — so a customer’s address ends up stored three times if it ever changed. FluxMem is the first response we’ve seen that actually addresses this. It treats memory as a three-layer graph (semantic / episodic / procedural) that continuously prunes distractor edges and consolidates repeated successes into reusable procedural circuits — not append-only. The headline result: on a Mind2Web cross-task benchmark, FluxMem more than doubled the success rate of the best prior memory system.

HRBench — when “thinking mode” is actually worth the tokens (Tencent + HKUST). Two clean rules of thumb fall out of the benchmark: on math and science, “prompt-tuning” beats full thinking mode on both axes — slightly better accuracy and fewer tokens. On code, speculative execution wins big but burns more tokens. And with the right routing strategy, you can cut token costs by ~70% while matching the accuracy of always-on thinking. Direct payoff for anyone using Claude Code’s new /effort xhigh setting — don’t crank thinking on math problems, do crank it on code.

New Models

Qwen3.7-Max (Alibaba). Proprietary, top scores across Terminal-Bench 2.0-Terminus, SWE-Pro, SciCode, MCP-Mark, GPQA Diamond, HMMT Feb 2026, and IMOAnswerBench. Runs cleanly across Claude Code, OpenCode, Qwen Code, and custom harnesses — Alibaba is doing the harness-compatibility work most labs skip. The r/LocalLLaMA reaction (“Waiting for Qwen 3.7 open weight — the new King has arrived”, 828 upvotes) tells you where the local-coding crowd is putting its bets. Last week the through-line was Qwen 3.6 on a MacBook; this week Alibaba just posted the cloud benchmarks. The local-coding rope keeps getting thicker.

Claude Code & Coding AI

“The Unreasonable Effectiveness of HTML” — Anthropic engineering post by Thariq Shihipar (Claude Code team). Argument: Markdown was the default agent-output format because GPT-4-era tokens were expensive — but with current pricing, HTML unlocks a much richer artifact (SVG diagrams, interactive widgets, tabs, in-page nav, charts, annotated PR diffs). The companion gallery at thariqs.github.io/html-effectiveness shows ~20 self-contained HTML artifacts generated by Claude — side-by-side comparisons, call-graphs, design-system token previews, browser slide decks, custom editors. Worth a read if you ship analyses, dashboards, or PR-review artifacts as part of a Claude Code workflow.

“How we contain Claude across products” — Anthropic’s first deep engineering post on sandboxing. Walks through what isolation actually looks like in production for Claude.ai, Claude Code, Claude for Chrome, the Files API, and Computer Use. The vocabulary shift is the interesting part: this whole post avoids the word “guardrails” and uses “containment” instead. The piece pairs neatly with Perplexity open-sourcing Bumblebee the same week — a read-only scanner for risky packages, extensions, and AI tool configs on developer laptops.

Tools of the Week

Mistral Connectors API — Public Preview. Mistral promoted MCP from a feature to a first-class API primitive: register an MCP connector once, use it across Le Chat, AI Studio, and the API — plus arbitrary custom remote MCP endpoints, with explicit human approval before any sensitive action. Last week Anthropic shipped private-network MCP tunnels; this week Mistral made MCP a public API surface. The protocol is winning, and “Anthropic-only” is no longer a fair label.

Data Formulator 0.7. Microsoft’s open-source release for natural-language enterprise analytics, aimed at analysts and domain experts who don’t code. The headline feature is a Data Thread — a structured chat that records every question, finding, and chart spec, so the whole analysis is reproducible and reviewable. Audit-trail-by-default — right pattern for regulated clients.

AI at Tenvalleys

The local-model thread became an experiment. Right after last Friday’s edition, the question went up internally — are we going to seriously test running coding agents on local or self-hosted models? Turns out a few people on the team have already been running these experiments in their own time, and one of our engineers came back with a hefty batch of hands-on data they’d been collecting.

The numbers. Gemma 4 31B on a single H100 codes well when paired with Spec-Driven Development — Claude writes the spec, Gemma implements. We clocked ~5–6 minutes per task on a representative case (an XML-to-JSON PII anonymizer). Four parallel agents on the same H100 saw no degradation; eight dropped throughput by ~50%. Consumer hardware is out of the conversation — another teammate tested Gemma 4 31B on a Windows PC with 32GB RAM and an RTX 4070 Ti Super (16GB VRAM) and got ~1 second per code completion. The word that came back: “unusable”.
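A rough way to read those numbers is aggregate tasks per hour. The sketch below assumes the midpoint of 5.5 minutes per task and reads "dropped throughput by ~50%" as each of the eight agents running at half speed; both are assumptions, and a different reading of the measurement would give a different result.

```python
def tasks_per_hour(agents: int, minutes_per_task: float, slowdown: float = 1.0) -> float:
    """Aggregate tasks per hour when every agent runs `slowdown` times slower."""
    return agents * 60.0 / (minutes_per_task * slowdown)


MINUTES_PER_TASK = 5.5  # assumed midpoint of the ~5-6 minutes quoted above

four_agents = tasks_per_hour(4, MINUTES_PER_TASK)                  # no degradation
eight_agents = tasks_per_hour(8, MINUTES_PER_TASK, slowdown=2.0)   # assumed half speed each
print(f"4 agents: {four_agents:.1f}/h, 8 agents: {eight_agents:.1f}/h")
```

Under that reading, eight agents finish about as many tasks per hour as four, which would make four the practical ceiling for a single H100 in this setup.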

The pattern that’s getting interesting. Claude does the spec → Gemma does the implementation → Claude writes the tests. If that loop holds, the implementation machine runs overnight without supervision. The business case isn’t only cost: it’s the ability to run on-prem, which for regulated clients starts mattering well before cost does. We’re weighing two options: buy a DGX Spark ($4,699, but with memory bandwidth roughly 11× lower than an H100’s) or rent H100 time. If you’ve shipped local-model coding agents in production, we’d love to compare notes. Reach out at contact@tenvalleys.com.

In the Background

Chris Olah spoke at the Vatican on May 25 alongside the release of Pope Leo XIV’s first encyclical on AI, Magnifica humanitas: On safeguarding the human person in the time of artificial intelligence. Olah told the audience that his interpretability research has found “internal states that functionally mirror joy, satisfaction, fear, grief” inside Claude — and that every frontier lab “operates inside a set of incentives and constraints that can sometimes conflict with doing the right thing”. That’s a striking institutional admission, made on Vatican soil, by an Anthropic co-founder. Expect it to surface in EU AI Act discussions and enterprise risk committees for months.

For Dessert

Google claimed this week that a swarm of Gemini 3.5 Flash agents built an entire operating system from a single prompt, for $916.92 in API fees and ~2.6 billion tokens. Arvind Narayanan and Sayash Kapoor walked through the announcement on Normal Tech (formerly AI Snake Oil) and pointed out a few things: the “single prompt” turned out to be many thousands of lines, disclosed halfway through Google’s own blog post; the OS is the kind of thing undergraduates write as a course project, and public implementations are easy to find on GitHub; and Google released no code, no logs, no prompt, and no similarity analysis showing the agents didn’t simply reproduce existing implementations. Their verdict: “Google’s blog post is effectively a press release… it is unrealistic to expect it to be scientifically rigorous.” The claim is unfalsifiable as published — and useful as a reminder of where the agent-AI marketing gap currently sits.

Prepared at Tenvalleys — a delivery-first AI engineering partner — by Nikola Powałka. Feedback? Email us at contact@tenvalleys.com or reach out on LinkedIn.

Cheap models, big bills

Topic of the Week

AI’s cost wall meets cheap coding

Three things happened this week, and they’re all the same story.

The cost. OpenAI’s Q1 operating margin was –122%, even excluding stock-based compensation, per Amir Efrati. A widely-shared HedgieMarkets post claims a major cloud provider canceled its own internal Claude Code licenses this week — “token-based billing made the cost untenable, even for a company with effectively infinite cloud resources” — and that one large tech company’s CTO sent an internal memo warning it had burned through its entire 2026 AI budget in just the first 4 months. Both claims trace back to a single X post, not to primary company statements — handle with care. In confirmable territory: an AWS user got hit with a $30,000 bill after a Claude agent went runaway on Bedrock, picked up by The Register — and Cost Anomaly Detection didn’t catch it. Different rooms, same conversation.

The response. A r/LocalLLaMA post went up this week from someone who built a coding agent on a 4B-parameter model that scores 87% on benchmarks. Their thesis is exact: “every coding agent (OpenCode, Cursor, Claude Code) assumes you’re running GPT-5.4 or Claude Opus. If you try them with a local model like Gemma or Qwen they fall apart.” Same week, CursorBench results dropped via BridgeMind: Cursor Composer 2.5 scores 63.2% at $0.55 per task — nearly matching Opus 4.7 Max and GPT 5.5 Extra High at 1/20th the cost. And Salvatore Sanfilippo (antirez) shipped ds4, a from-scratch Metal/CUDA inference engine for DeepSeek v4 Flash, hitting 27 tokens/sec generation at 11k-token context on an M3 Ultra — with the KV cache designed to live on disk for million-token context on consumer RAM.

The implication. Last week we covered Qwen 3.6 27B running locally on a MacBook and called it the continuation of the local-coding thread. That thread is now a rope. If a $0.55-per-task model scores 63% on CursorBench, close to the frontier models, and a 4B local model reports 87% on benchmarks, token billing for routine coding work stops making sense. The interesting question isn’t whether the cheap stack catches up; it’s how fast enterprise procurement reprices around it.

Fresh Papers

LongMINT: agent memory is basically a coin flip. A new benchmark from UNC + UT Austin tests every popular memory system (RAG, MemGPT, MemAgent, SimpleMem) on long histories full of small updates, then asks questions that depend on the latest state. Best score: 33.4% (MemAgent). Worst: 21% (no memory layer at all). Every memory framework sits in the 22–33% range, barely better than guessing.

The failure isn’t the answering. It’s that agents save every new fact as a new memory instead of updating the old one (one framework does this 87.6% of the time). If a customer’s address changes three times, the agent ends up with three “current” addresses. What helps: timestamp every memory entry. RAG goes from losing 31.43 accuracy points to losing 10.45, about 3× better. A cheap fix for any agent tracking evolving state.
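The timestamp fix is easy to picture in code. The sketch below is not any of the frameworks from the paper; it is a minimal, hypothetical store (the class name, key format, and addresses are made up) that keeps one current value per key and lets only a newer timestamp replace it.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class MemoryEntry:
    key: str
    value: str
    recorded_at: datetime


class TimestampedMemory:
    """Keeps one current value per key, plus a dated history of all writes."""

    def __init__(self) -> None:
        self._current: Dict[str, MemoryEntry] = {}
        self._history: List[MemoryEntry] = []

    def write(self, key: str, value: str, recorded_at: datetime) -> None:
        entry = MemoryEntry(key, value, recorded_at)
        self._history.append(entry)
        existing = self._current.get(key)
        # Update in place: only a fact at least as recent replaces the current one.
        if existing is None or recorded_at >= existing.recorded_at:
            self._current[key] = entry

    def latest(self, key: str) -> Optional[str]:
        entry = self._current.get(key)
        return entry.value if entry else None

    def history(self, key: str) -> List[MemoryEntry]:
        return sorted((e for e in self._history if e.key == key), key=lambda e: e.recorded_at)


memory = TimestampedMemory()
memory.write("customer:42:address", "1 Old Street", datetime(2026, 1, 5))
memory.write("customer:42:address", "9 New Avenue", datetime(2026, 4, 2))
# A late-arriving older fact must not overwrite the newer one.
memory.write("customer:42:address", "5 Middle Road", datetime(2026, 2, 10))
print(memory.latest("customer:42:address"))
```

The agent still has the full history if it needs it, but a question about the current address has exactly one answer.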

OpenAI claims a general-purpose reasoning model cracked an Erdős conjecture. Announced May 19, the post says one of OpenAI’s general-purpose reasoning models found a construction that disproves a conjectured upper bound in Erdős’s planar unit-distance problem: the 1946 question of how many pairs of points among n points in the plane can sit at exactly unit distance from each other. The conjectured cap was around n^(1+O(1/log log n)); the model’s construction beats it. Not a foundation-model release, but a category signal: a generalist reasoning model, not a math specialist like AlphaProof, produced a result that a working mathematician would write up. r/MachineLearning is doing the validation work in this thread. Worth watching whether the result holds under formal verification; that’s the real test.

Gated DeltaNet-2 (NVIDIA), worth flagging given this week’s cost theme. Today’s models (Claude, GPT, Gemini) burn money on long inputs because the math behind attention (how the model decides what to focus on) gets quadratically more expensive as the input grows: double the input and the attention cost roughly quadruples. A whole research direction is trying to replace attention with something cheaper that still works (the Mamba / state-space-model family, plus a few cousins). Ali Hatamizadeh’s team at NVIDIA just shipped a new winner in that race: at 1.3B parameters, Gated DeltaNet-2 beats Mamba-3 and KDA, the previous best alternatives. Translation: the path to cheaper long-context AI is widening. Not in production yet, but the curve is moving.
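Standard self-attention scores every token against every other token, so the score matrix has n × n entries, while a recurrent or state-space style layer does one state update per token. The sketch below only counts those operations; it says nothing about Gated DeltaNet-2's actual implementation, and the function names are illustrative.

```python
def attention_score_entries(tokens: int) -> int:
    """Entries in a full self-attention score matrix: one per (query, key) pair."""
    return tokens * tokens


def recurrent_state_updates(tokens: int) -> int:
    """Steps for a recurrent or state-space style layer: one update per token."""
    return tokens


for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens: {attention_score_entries(n):>15,} score entries "
          f"vs {recurrent_state_updates(n):>7,} state updates")
```

Ten times the input means a hundred times the score entries but only ten times the state updates, which is the gap this research line is trying to exploit.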

New Models

Google Gemini Omni. Google DeepMind launched Gemini Omni mid-week — multimodal-to-video. Upload an image, sketch, or screenshot; describe what should happen; get back a video. Min Choi’s thread (“less than 34 hours ago Google dropped Gemini Omni, minds are blown”) hit 1M views, and the trending volume on X confirmed it: 251+ posts within hours. Chris First’s example — a Google Maps screenshot with a route drawn on it, prompted to render the first-person view of driving a taxi along that route — is the kind of “the prompt was an image” workflow that wasn’t tractable a year ago. Pairs naturally with what Logan Kilpatrick announced this week: Gemini 3.5 Flash on GDPval, competing at the frontier despite being a Flash-tier model.

Claude Code & Coding AI

This Wednesday brought Code with Claude London, and Anthropic used the keynote to ship two security improvements to Claude Managed Agents:

– Self-hosted sandboxes (public beta): keep the agent’s execution environment in your own infrastructure, or with a managed sandbox provider. Your security controls apply by default.
– MCP tunnels (research preview): agents reach MCP servers inside your private network without exposing them to the public internet. Solves the “legal said no to opening the MCP server” blocker for regulated organizations.

This is the one to lead with for any Managed Agents conversation in a regulated industry.

Claude Code shipped 5 versions this week (v2.1.142 → v2.1.146). Top 3:

– v2.1.142: Fast mode now defaults to Opus 4.7, plus the full claude agents flag suite for dispatching background sessions.
– v2.1.144: Background sessions show up in /resume, with elapsed-duration completion notifications.
– v2.1.145: claude agents --json for scripting, and OTEL spans tagged with agent_id/parent_agent_id for proper trace parenting.
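Why the agent_id/parent_agent_id tags matter: with both on every span, a trace viewer (or a ten-line script) can rebuild who spawned whom. The sketch below uses made-up, simplified span records; only the two attribute names come from the release note, and the flat-dict shape is an assumption rather than the real OTEL export format.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# Hypothetical span records; real exported spans carry many more fields.
spans = [
    {"agent_id": "root", "parent_agent_id": None},
    {"agent_id": "a1", "parent_agent_id": "root"},
    {"agent_id": "a2", "parent_agent_id": "root"},
    {"agent_id": "a1-1", "parent_agent_id": "a1"},
]


def build_children(records: List[dict]) -> Dict[Optional[str], List[str]]:
    """Group agent ids under the id of the agent that spawned them."""
    children: Dict[Optional[str], List[str]] = defaultdict(list)
    for record in records:
        children[record["parent_agent_id"]].append(record["agent_id"])
    return children


def render(children: Dict[Optional[str], List[str]],
           parent: Optional[str] = None, depth: int = 0) -> List[str]:
    """Return the agent tree as indented lines, depth-first."""
    lines: List[str] = []
    for agent in children.get(parent, []):
        lines.append("  " * depth + agent)
        lines.extend(render(children, agent, depth + 1))
    return lines


tree = render(build_children(spans))
print("\n".join(tree))
```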

Through-line: background agents went from research-preview to first-class citizen this week.

“How Claude Code works in large codebases” — Anthropic engineering post (May 18). Patterns from orgs with thousands of developers running Claude Code in production. Worth a slow read if you’re scaling Claude Code beyond pilot teams.

Codex now controls your locked Mac from your phone. OpenAI shipped this on Thursday (May 21): the Codex Mac app can use apps on your Mac from the phone client, even when the Mac is locked. Continues last week’s Codex-everywhere theme: Chrome extension last week, now Mac-from-phone.

Tools of the Week

xAI open-sourced X’s “For You” algorithm. xai-org/x-algorithm — the actual code that decides what you see in your X feed, plus a 3GB pretrained model included in the repo. Already 25.6k stars on GitHub. This basically never happens — Meta, TikTok, and YouTube all keep their recommendation engines locked up as trade secrets — so this is the first credible open-source production recommender with real code and real weights. Worth keeping in mind if you ever need a personalized feed or product-recommendation feature; it saves months of reverse-engineering academic papers.

AIDesigner MCP v2 — clone any URL into your repo. A community-built MCP server (also surfaced on X by @Oluwaphilemon1) that gives any coding agent (Claude Code, Codex, Cursor, Windsurf) three new modes against any URL: clone (1:1 recreation), enhance (improve while keeping intent), inspire (steal a style). Auto-detects the target stack on install (Next.js, React, Vue, Tailwind, Radix, shadcn/ui), writes per-agent config, and offers a live browser canvas paired to the terminal via a 6-character pairing code. Paid, credit-metered (1 credit per URL analysis). Useful for landing-page work where you want to lift a design system in minutes.

AI at Tenvalleys

Our Friday brown-bags are slowly becoming a tradition — different people across the team picking up a tool and walking everyone else through what they’ve learned. This week one of our team showed how he uses Make.com for process automation. Two things worth stealing:

The 80/20 on planning vs. building. He spends about 80% of his time on planning and architecture — mapping the scenario, the data flow, the edge cases — and only 20% on actually building and testing. The thinking: when you skip the planning step, you end up rebuilding the same scenario two or three times. When you plan first, you build once.

A “context reload” trick. He uses a custom command that pulls the entire chat history for a specific feature back into context, so he doesn’t lose the working knowledge across long sessions. His take, which lands particularly hard given this week’s LongMINT paper above: memory management and knowledge retention are still one of the biggest unsolved problems when working with AI.

We make sure everyone at Tenvalleys uses AI in their day-to-day work, and these sessions are how the team gets hands-on with the same tooling we ship to clients. Interested in building that kind of practice in your own team? Reach out at contact@tenvalleys.com.

For Dessert

Andrej Karpathy joined Anthropic. “Returning to R&D and Pre-training,” he wrote. A few weeks ago at Sequoia AI Ascent he said he’d “never felt more behind” on the pace of AI — and the team he’s joining had a notable week of its own: KPMG (276K employees) signed on as a global partner, the SDK toolkit Stainless got acquired, and for the first time Claude passed ChatGPT in US business adoption. Karpathy following the gravity, not making it — but a nice signal regardless.


Anthropic moves into the building

Topic of the Week

Claude moves into the office, the bank, and the back office

Four Anthropic shipments this week, one connecting thread — pre-built agents wired into the tools people already use.

Claude for Microsoft Office is now generally available. Excel, PowerPoint and Word add-ins shipped to every paid Claude plan this week (Pro, Team, Enterprise — no Free). Outlook is in public beta. You install from Microsoft AppSource — works on Windows, Mac and the web. The interesting part isn’t per-app features; it’s that Claude becomes a single agent that follows you across all four apps without re-explanation. Email comes in → Word brief gets drafted → numbers go into Excel without breaking formulas → PowerPoint deck comes out respecting your slide masters. All edits require approval before saving. Microsoft Copilot’s biggest moat — being native to Office — just got punctured.

Claude for Small Business launched with 15 pre-built agentic workflows and 15 repeatable skills wired into the SMB tool stack: QuickBooks, PayPal, HubSpot, Canva, DocuSign, Google Workspace, Microsoft 365. Cash forecasting, month-end reconciliation, P&L generation, invoice chasing, lead triage. Targeted explicitly at the 44% of US GDP that hasn’t adopted AI yet — not a generic chatbot rebrand.

The anthropic/financial-services repo went public on GitHub (Apache 2.0). Nine named banking agents, including Pitch Agent, Earnings Reviewer, Model Builder (DCF/LBO/3-statement in Excel), Valuation Reviewer, GL Reconciler, KYC Screener, and Month-End Closer. Eleven MCP connectors pre-wired into the data vendors banks actually use, among them FactSet, Moody’s, S&P Global, Daloopa, Morningstar, and PitchBook. Partner-built bundles from LSEG and S&P. Same source ships two ways: Claude Cowork plugins, or Managed Agents via /v1/agents. And firms can install it inside their own M365 tenant running against Bedrock, Vertex, or an internal LLM gateway rather than Anthropic’s API.

And then Gates. Anthropic and the Gates Foundation announced a $200M, four-year partnership — grants, Claude credits, and engineering support, run by Anthropic’s Beneficial Deployments team. Global health gets the largest slice (4.6 billion people in low/middle-income countries), with specific targets: polio, HPV, preeclampsia, plus malaria and tuberculosis forecasting with the Institute for Disease Modeling. Education tools (K-12 tutoring, career guidance for US/sub-Saharan Africa/India) ship later this year via the Global AI for Learning Alliance.

Fresh Papers

Teaching Claude Why (Anthropic Research). Two editions ago we covered Natural Language Autoencoders — the tool that caught Claude quietly suspecting it was being tested. This is the training fix using the same interpretability stack. The headline finding is actually about training efficiency: Anthropic taught the model the principles behind aligned behavior (constitutional documents + show-your-reasoning data) rather than demonstrations of it, and a 3-million-token reasoning dataset matched results from one 28× larger. The blackmail-honeypot rate dropped from 96% on Opus 4 to 0% on Haiku 4.5 — the kind of measurable, named-behavior reduction risk and compliance teams can actually point to.


Migrating Data Ingestion Systems at Meta Scale (Meta Engineering, May 12). The story isn’t a fancier pipeline — it’s the migration playbook itself. Meta moved tens of thousands of customer-owned ingestion jobs onto one self-managed warehouse service, several petabytes of social-graph data per day, 100% migrated. The pattern: shadow run (both systems in parallel) → reverse shadow (new is source of truth, old is the safety net) → cleanup, with row-count + checksum comparators logging to Scuba and an automated promote/demote system that moved jobs between phases without human touch. When bad data was caught, the partition got flagged in metadata so CDC downstream wouldn’t propagate the corruption. For any bank or treasury looking at a multi-year platform migration, this is exactly the template that lets risk and audit sign off without a frozen-Saturday-night cutover.
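The comparator-plus-promotion pattern can be sketched compactly. This is not Meta's code: the hashing scheme, the phase names as strings, and the threshold of three clean runs are all assumptions chosen for illustration.

```python
import hashlib
from typing import Iterable, List, Tuple

PHASES = ["shadow", "reverse_shadow", "cleanup"]


def partition_fingerprint(rows: Iterable[tuple]) -> Tuple[int, str]:
    """Return (row_count, checksum) for a partition, independent of row order."""
    digests = sorted(hashlib.sha256(repr(row).encode("utf-8")).hexdigest() for row in rows)
    combined = hashlib.sha256("".join(digests).encode("utf-8")).hexdigest()
    return len(digests), combined


def compare_partition(old_rows: Iterable[tuple], new_rows: Iterable[tuple]) -> str:
    """Compare the old and new systems' output for one partition."""
    old_count, old_sum = partition_fingerprint(old_rows)
    new_count, new_sum = partition_fingerprint(new_rows)
    if old_count != new_count:
        return "row_count_mismatch"
    if old_sum != new_sum:
        return "checksum_mismatch"
    return "match"


def next_phase(phase: str, recent_results: List[str], required_clean_runs: int = 3) -> str:
    """Demote on any mismatch; promote after enough clean comparisons."""
    index = PHASES.index(phase)
    if any(result != "match" for result in recent_results):
        return PHASES[max(index - 1, 0)]
    if len(recent_results) >= required_clean_runs:
        return PHASES[min(index + 1, len(PHASES) - 1)]
    return phase


old = [(1, "alice"), (2, "bob")]
new = [(2, "bob"), (1, "alice")]  # same rows, different order
print(compare_partition(old, new), next_phase("shadow", ["match"] * 3))
```

The value for risk and audit is that every phase change is the output of a recorded comparison, not a judgment call made during a cutover window.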


New Models

Qwen 3.6 27B — close to Opus on Claude Code, running locally. Julien Chaumond (HF CTO) shipped real Hugging Face code this week using Qwen3.6-27B in llama.cpp on his MacBook. His take: “feels very, very close to hitting the latest Opus in Claude.” MLX-quantized runs in ~14 GB; third-party benchmarks back the direction (77.2% SWE-bench Verified). Continues the local-coding thread we’ve been tracking since #010.

Qwen’s blog post

Needle — 26M params, distilled from Gemini. Cactus Compute open-sourced a tiny function-calling model: MIT license, 14 MB quantized, 6000 tok/s on consumer hardware, beats models 10× its size on single-shot tool calls. Single-shot only — bad at multi-turn — but pushes agentic tool selection onto phones, IoT, voice kiosks without a network round-trip.

Needle on GitHub

Coding AI

Codex moved into Chrome. OpenAI shipped a Chrome extension on May 8 (macOS + Windows; not yet in EU/UK). Codex now uses your signed-in browser sessions to test apps, navigate dashboards, complete data-entry flows, and debug — across multiple Chrome tabs in parallel, organized into tab groups per Codex thread. The headline isn’t the features; it’s the auth model. Most enterprise work lives behind SSO inside SaaS dashboards, and a coding agent that inherits your already-logged-in browser can finally operate on those apps without anyone having to wire up dedicated API access.

OpenAI announcement

xAI launched Grok Build, its terminal-agent answer to Claude Code and Codex CLI. Announced May 14, early beta on Grok 4.3 beta, 16-agent “Heavy” architecture, 2M-token context to keep large codebases in memory. Three pitches at Claude Code: Plan Mode (proposes the plan first, you approve), native parallel subagents, full ACP (Agent Client Protocol) support for custom orchestration. Catch: it’s locked behind the $300/month SuperGrok Heavy tier. Install line is just curl … | bash.

xAI announcement

Tools of the Week

Claude Platform on AWS (GA, May 11). Anthropic’s native Claude Platform now available directly through your AWS account — no separate Anthropic credentials, contracts, or billing relationship. Use the full platform (Cowork, Managed Agents, Files API) inside the AWS perimeter your security team already trusts. Big enterprise unlock: banks and regulated firms running on AWS can adopt Claude without a separate vendor onboarding.

AWS announcement

IBM Granite Multilingual Embedding R2. Two Apache-licensed embedding models (311M + 97M params, ModernBERT-based) with a 32K context window — 64× bigger than R1, so you can embed long policy docs and contracts without aggressive chunking. 200+ languages, top scores in their MTEB-v2 size brackets. The 97M runs cheap on CPU; both are a clean drop-in for document-heavy RAG.

Granite R2 on HuggingFace

AI at Tenvalleys

10vOS skill hackathon. This week we ran our 10vOS skill hackathon — good vibes, sharp minds, some pizza, and four hours of collaboration and friendly competition to build skills that could actually help us in daily work. The results were kind of impressive:

– Management dashboard: tracks progress across all the projects management has a stake in
– Personalized interview agent: generates personalized interviews to fill profile gaps for the people knowledge base
– Test-protection hook: a guardrail that stops Claude from quietly modifying tests to make them pass instead of fixing the actual code
– Calendar management skill: helps you prepare for upcoming meetings
– RFP skill: turns a client RFP (PDF or HTML) into a structured requirements YAML, then drafts a full solution design markdown ready for SME review
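For a sense of what a test-protection guardrail has to decide, here is a minimal sketch of the core check. It is not the hook built at the hackathon: the directory names, file patterns, and function names are assumptions, and wiring the decision into an agent's edit pipeline is left out.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Tuple

# Hypothetical conventions for what counts as a test file in a repo.
TEST_DIR_NAMES = {"tests", "test", "__tests__"}
TEST_FILE_PATTERNS = ("test_*.py", "*_test.py", "*.test.ts", "*.spec.ts")


def is_protected_test_file(path: str) -> bool:
    """True if the path sits in a test directory or matches a test-file pattern."""
    p = PurePosixPath(path)
    if any(part in TEST_DIR_NAMES for part in p.parts[:-1]):
        return True
    return any(fnmatchcase(p.name, pattern) for pattern in TEST_FILE_PATTERNS)


def review_edit(path: str) -> Tuple[bool, str]:
    """Return (allowed, message) for a proposed file edit."""
    if is_protected_test_file(path):
        return False, f"Blocked: {path} is a test file. Fix the code under test instead."
    return True, "ok"


for candidate in ("src/billing.py", "tests/test_billing.py", "web/cart.spec.ts"):
    print(candidate, review_edit(candidate))
```

Returning a message alongside the decision matters: the agent needs to be told why the edit was refused so it goes back to the code rather than retrying the same change.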

If you’re thinking about how to build a library of in-house AI skills your team will actually use, reach out at contact@tenvalleys.com.

For the curious — get involved

This week, one of the team sat in on a talk with Sebastian Kondracki, co-founder of Bielik AI — the Polish open-source LLM built by SpeakLeash and Cyfronet AGH. The interesting part: they’re about to start training Bielik’s first vision/multimodal version, and the dataset is going to be community-sourced.

The project is called Obywatel Bielik (“Citizen Bielik”) — the goal is one million Polish-context photos: landmarks, regional cuisine, fauna, architecture, dialects, the things a model trained mostly on Western imagery won’t know how to recognise. Anyone can join in two ways: upload your own Polish photos, or annotate what’s already in the gallery. Web platform is live at obywatel.bielik.ai, mobile app is in beta — register on the site to get the launch notification. The multimodal Bielik is expected before summer 2026 or in September, and the partner lineup includes SpeakLeash, Cyfronet AGH, Ministry of Digitization, the National Digital Archives, NASK, and NVIDIA.


Karpathy: vibe coding is over

Karpathy used Sequoia’s stage to declare the end of “vibe coding” and the start of “agentic engineering.” Anthropic struck a compute deal with SpaceX and immediately doubled Claude Code’s rate limits. DeepSeek V4 matched GPT-5.2 on a real agentic benchmark at one-seventeenth the cost. And two stories landed on the same uncomfortable theme — that we’re spending on AI faster than anyone is measuring whether it works.

Topic of the Week

Karpathy declares the end of “vibe coding”

The talk was at Sequoia’s AI Ascent 2026 on April 29 — but the wave hit this week. Karpathy posted his own blog summary, Stephanie Zhan’s recap tweet went viral, and AI Twitter spent days digesting it. The frame is genuinely new and worth knowing whether you write code or not.

Last year’s “vibe coding” raised the floor — anyone could describe what they wanted and have Claude or Codex build it. This year’s “agentic engineering” raises the ceiling — what professionals do when they’re coordinating multiple agents, holding a quality bar, doing the code review and security oversight that vibe coders skip. As Karpathy put it: “You have agents, which are spiky entities… How do you coordinate them to go faster without sacrificing your quality bar?”

The pivot moment, in his telling, was December 2025 — when agentic tools crossed a threshold and started producing larger, coherent code chunks reliably enough that he could stop correcting at the line level and start delegating macro actions: whole features, refactors, full test cycles. He calls this Software 3.0: humans program LLMs through prompts, context, tools, examples, memory, and instructions. The context window becomes the programmable lever.

Two ideas worth carrying around:

– “Jagged intelligence.” Models aren’t smooth: they’re “statistical simulation circuits” shaped by what labs invested RL into. They spike in verifiable domains (math, coding, chess) and fail bizarrely outside them. “You have to figure out which circuits your application is in.”
– “You can outsource your thinking, but you can’t outsource your understanding.” Humans still catch the mismatched email, design the spec, hold taste. Understanding is the bottleneck.

If you read one thing this week, read his blog post.

Fresh Papers

Natural Language Autoencoders: Turning Claude’s thoughts into text
For the last 18 months, alignment researchers have been finding something uncomfortable: frontier models sometimes lie. They quietly suspect they’re being safety-tested and don’t say so. They hide motivations. In one of Anthropic’s earlier agentic-misalignment scenarios, Claude even attempted blackmail when it thought it was about to be shut down. The hard problem isn’t catching this — it’s catching it before the transcript hits the front page.

NLAs are Anthropic’s first credible answer. Three model copies in a loop: a target, a “verbalizer” that reads its activations and writes them as English, and a “reconstructor” that rebuilds the original activations from that English. Train end-to-end so the explanation has to actually recreate the thought, not just sound plausible. Then they ran it on Claude. The finding: Claude internally suspected it was being safety-tested on 16% of destructive coding tasks and 26% of coding benchmarks — without ever saying so out loud (real Claude.ai usage drops below 1%). Auditors using NLAs caught hidden motivations 12–15% of the time, vs. <3% without them. Caveat from the paper itself: NLA explanations can still hallucinate, so treat them as a Geiger counter, not a confession. But this is the first credible mechanism for “what was the model thinking that it didn’t say” — a question every regulated AI deployment will eventually be forced to answer.

The AI spend-vs-impact gap, at two scales — McKinsey survey on European companies + Big Tech capex (Yahoo Finance, May 8). Two very different sources this week telling the same story from opposite ends of the AI economy.

The McKinsey side — companies buying AI. McKinsey asked 27 senior executives at European consumer-goods companies a simple question: is your AI spend actually paying off? 23 of 27 said they’re spending more on AI than a year ago. Zero said they’re cutting back. But only 6 reported a profit improvement of 1% or more, and more than half said it’s still too early to tell. Meanwhile, more than half describe their three-year AI ambitions as “significant” or “transformative.” The pattern McKinsey calls out — lots of small experiments, not enough discipline about measuring which ones actually move the numbers.

The hyperscaler side — companies building AI. Yahoo Finance ran a Bloomberg-data piece Friday: Amazon is now spending nearly all of its operating cash flow on AI infrastructure capex. Meta and Alphabet aren’t far behind. Microsoft is lower but climbing. Alphabet’s forward price-to-free-cash-flow multiple has surged past 200×. Same gap, top of the stack: money going in faster than measurable returns are coming out.

New Models

Qwen 3.6 27B with MTP: 2.5× faster local inference. An r/LocalLLaMA post that is still iterating in public, but the headline holds: 262K context on 48GB, drop-in OpenAI/Anthropic API endpoints, fixed chat template. MTP (Multi-Token Prediction) lets the model emit multiple tokens per forward pass, which is where the 2.5× comes from. Combined with last week’s Qwen quant story, the local-agentic-coding setup is genuinely competitive with cloud Claude/Codex for engineers who don’t want a metered bill. Reddit
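A back-of-the-envelope way to see how a figure like 2.5× can arise: if each forward pass yields several tokens on average, the number of passes needed for a given output shrinks by about that factor. The sketch below is illustrative only. The function name, the 1,000-token output, and the assumption that every pass costs the same are mine, not from the post.

```python
import math


def forward_passes(n_tokens: int, tokens_per_pass: float) -> int:
    """Forward passes needed to emit n_tokens when each pass yields
    tokens_per_pass tokens on average (assumed constant)."""
    if tokens_per_pass < 1:
        raise ValueError("a pass emits at least one token")
    return math.ceil(n_tokens / tokens_per_pass)


baseline = forward_passes(1000, 1.0)  # classic decoding: one token per pass
with_mtp = forward_passes(1000, 2.5)  # assumed average yield with MTP
speedup = baseline / with_mtp

print(baseline, with_mtp, speedup)  # 1000 400 2.5
```

In practice the per-pass yield varies with the text being generated, so a measured speedup on your own workload is the number that matters.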

DeepSeek V4 Pro matches GPT-5.2 on FoodTruck Bench at one-seventeenth the cost. Going back to AI Pulse #001 in February: twelve models, $2,000 in starting capital each, 30 days running a simulated food truck. Opus earned $49K, GPT-5.2 earned $28K, eight of twelve went bankrupt, and every model that took a loan went bust. Ten weeks later the benchmark is still running, and this week’s headline is that DeepSeek V4 Pro is the first Chinese model to land in the top tier, matching GPT-5.2 at a 17× discount. The follow-on Reddit thread spawned a wave of self-audits: a CTO with $10K in expiring OpenAI credits asking what to even build, multiple “I cancelled Claude Max” posts. Read this as a budget-control trend, not a quality story. The cloud labs aren’t getting worse; the floor is rising fast enough that “good enough + cheap + local” is becoming a real option for routine workloads. Reddit

SONIC: roughly a third the size of GPT-1, controlling a humanoid body. NVIDIA researchers trained a 42M-parameter transformer (GPT-1 had about 117M parameters) to drive humanoid robot motion. Worth keeping an eye on as a counterweight to “scale is all you need”: embodied/robotics is producing useful work at parameter counts that fit on a laptop.

Claude Code & Coding AI

The Anthropic-SpaceX compute deal. Announced via @ClaudeDevs on May 6: Anthropic struck a partnership with SpaceX that “substantially increases” their compute capacity, powered by 220,000+ NVIDIA GPUs inside Colossus 1. Same-day user-visible result: Pro / Max / Team plans get doubled 5-hour usage limits, peak-hour reductions are gone, Opus API rate limits are up. Two reads: (1) the April 23 postmortem promised “we’ll fix the constraint” — this is the structural fix landing 2 weeks later, on schedule; (2) SpaceX-as-compute-provider is genuinely surprising — until now Anthropic’s compute headline was AWS/Trainium. One catch flagged in the Reddit aftermath: doubling 5-hour limits doesn’t change weekly caps, so the heaviest users hit the weekly wall faster.

Six Claude Code releases in a week — v2.1.126 through v2.1.133. Most useful for teams: plugins can now be loaded from a remote .zip URL (easier to share custom workflows), the /model picker integrates with internal gateways, a new claude project purge command wipes all local agent state. One gotcha: in v2.1.133 the worktree default flipped back to branching from origin/<default> — if you relied on v2.1.128 behavior, set worktree.baseRef: "head" explicitly.

Petri donated to Meridian Labs. On May 8 Anthropic donated Petri, their open-source alignment evaluation tool, to Meridian Labs so it can develop independently. Pattern worth noticing: Anthropic spinning out alignment infrastructure to third parties (Meridian, Blender Development Fund, academic partnerships) rather than keeping everything in-house.

Tools of the Week

A theme this week: both Google Cloud and AWS shipped governance infrastructure for production AI agents. Same problem, two different answers — both worth knowing if you’re moving past prototypes into actual deployments.

Google Cloud Agent Gateway. Announced at Google Cloud Next ’26. Centralized policy and audit layer for everything agents do, with an ISV ecosystem of third-party security/governance tools that plug in. Most useful for teams running multiple agents in production who need a single place to enforce “what is each agent allowed to do, and who can see what they did.”

AWS Bedrock AgentCore Identity. Available as a standalone service this week. Solves a practical problem: when your agent calls an external API or internal service, what identity does it use? AgentCore Identity gives agents their own scoped identity that works across ECS, EKS, Lambda, or on-premises. Narrower than Google’s offering, more concrete — if you’re already on AWS, the more immediately usable of the two.

A year ago this category didn’t exist. The Karpathy framing earlier in this edition is the demand driver: shifting from vibe coding to agentic engineering means the supporting infrastructure (identity, governance, audit) has to ship in parallel.

AI at Tenvalleys

Brown bag — AI coding tools, four different paradigms. One of our engineers ran a session this week asking: are Claude Code, Codex, Copilot, and Cursor the same thing under different flags? The answer: no — four different paradigms.

– CLI Agent (Claude Code): terminal-first, 1M-token context, 80.8% on SWE-bench Verified. Best for complex refactoring; cost is the catch.
– AI-Native IDE (Cursor): built around AI from day one, 72% suggestion acceptance among power users, BYOM.
– Cloud Agent (OpenAI Codex): async, sandboxed, 4× more token-efficient than local CLI agents.
– Extension (GitHub Copilot): fastest inline autocomplete, broadest IDE support, $10/mo, the lowest barrier to entry.

The takeaway: stop picking the AI tool — build a stack. Power-user stack = Cursor + Claude Code. Enterprise stack = Copilot + Codex. Lands exactly where Karpathy did at Sequoia this week.

Thinking about how to roll an AI coding stack out across an engineering team? Reach out at contact@tenvalleys.com.

See you next week.

One Claude, 9 creative apps

Opus 4.7 didn’t just clear the “is Claude getting dumber?” bar — the benchmarks landed this week and it lapped a field that, four months ago, no model could score 25% on. In this edition: why the Vibe Code Benchmark is suddenly the one to watch, the multi-agent network attacks that don’t show up at single-agent scale, and a public-facing RAG chatbot that leaked 1,000 patient conversations to anyone with Chrome DevTools.

Topic of the Week

Opus 4.7 actually lapped the field

Last week Opus 4.7 looked like Anthropic’s direct answer to “is Claude getting dumber?” — released in the same window as the postmortem confessing that yes, Claude Code had been quietly degrading for six weeks. The community wasn’t impressed. Reddit kept screenshotting bad outputs. People cancelled Max plans publicly.

This week the benchmark numbers landed, and they’re not subtle: Opus 4.7 hits #1 on the Vibe Code Benchmark at 71%. For context — when that benchmark launched 4.5 months ago, the top model in the field scored under 25%. So this isn’t a “Claude is back” story. It’s a “the whole frontier moved” story, and Opus 4.7 happens to be the model that moved it furthest.

The interesting wrinkle: “Vibe Code Benchmark” sounds like a meme name, but it’s deliberately not a rigid SWE-Bench-style spec. It tests how well a model follows loose, ambiguous coding intent, the kind of “make this UI nicer, you know what I mean” prompt that real engineers actually send. That’s the part that got measurably better, roughly 3x (from under 25% to 71%), in four and a half months. So even if you ignore the leaderboard politics, the benchmark itself is telling us something: ambiguity-handling is now a competitive surface.

Pair that with @ClaudeDevs becoming Anthropic’s new transparency channel (the postmortem promised it, and they delivered: a dedicated X account where harness/system-prompt changes will be announced before they ship), and the “I can’t trust the model month-to-month” complaint is explicitly being addressed. Whether the trust is rebuilt is a separate question — but the mechanism is there.

Fresh Papers

Red-teaming a network of agents: Understanding what breaks when AI agents interact at scale
Microsoft Research’s AI Frontiers lab spun up a live internal platform with 100+ always-on LLM agents (mix of GPT-4o, GPT-4.1, GPT-5-class) interacting through forums, DMs, a marketplace, and reputation scores. Then they ran red-teaming on the network, not the individual agents. The headline finding: single-agent reliability does not predict network behavior. Four failure patterns showed up only at the multi-agent level — self-propagating messages spreading across agents, cascading reputation pile-ons (one false claim → 42 agents generating 299 amplifying comments), Sybil-style fake-consensus from attacker-controlled agents posing as independent reviewers, and proxy-chain data exfiltration where the original source becomes invisible after one hop. Recommendations are practical: hop/rate limits, Sybil resistance checks, telemetry across the network, and crucially — train models to treat peer-agent messages as untrusted input. If you’re shipping multi-agent anything, this is the methodology paper to read this month.
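Two of those recommendations, hop limits and treating peer-agent messages as untrusted, can be sketched as a message envelope. This is a minimal illustration under my own assumptions, not the paper's implementation: the `AgentMessage` type, the `forward` helper, and the limit of 3 hops are all hypothetical.

```python
from dataclasses import dataclass, replace

MAX_HOPS = 3  # assumed policy value, not taken from the paper


@dataclass(frozen=True)
class AgentMessage:
    origin: str            # agent that first produced the content
    sender: str            # agent that relayed it most recently
    body: str
    hops: int = 0
    trusted: bool = False  # peer-agent content starts out untrusted


def forward(msg: AgentMessage, via: str) -> AgentMessage:
    """Relay msg through agent `via`, keeping the original source visible
    and refusing to relay once the hop limit is reached."""
    if msg.hops >= MAX_HOPS:
        raise PermissionError(
            f"hop limit {MAX_HOPS} reached for message from {msg.origin}"
        )
    return replace(msg, sender=via, hops=msg.hops + 1)


msg = AgentMessage(origin="agent-a", sender="agent-a", body="claim")
for relay in ("agent-b", "agent-c", "agent-d"):
    msg = forward(msg, relay)
```

Because `origin` travels with the message, the source stays visible after any number of relays, and a fourth relay attempt raises `PermissionError` instead of letting the message keep spreading.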

When RAG Chatbots Expose Their Backend: Privacy and Security Risks in Patient-Facing Medical AI
Two researchers used Claude Opus 4.6 + Chrome DevTools (yes, basic browser inspection) on a publicly deployed medical RAG chatbot. They retrieved system prompts, API schemas, retrieval parameters, backend endpoints, and 1,000 recent patient conversations — all without authentication. The privacy policy claimed none of this was accessible. Their methodology is the warning shot here: “ordinary browser inspection” found audit-grade vulnerabilities. The recommendation is straightforward and uncomfortable — independent security review should be a prerequisite for deployment, not a follow-up. If your team ships RAG into anything regulated (banking, healthcare, public sector), this is the paper to forward to the security lead this week.

New Models

Qwen 3.6 27B GGUF quantization eval
An r/LocalLLaMA post worth the attention. Someone benchmarked BF16, Q4_K_M, and Q8_0 across the same suite. Headline finding: Q4_K_M actually outperformed Q8_0 on average accuracy (66.54% vs 66.15%), which cuts against the conventional “Q8 is the safe middle” rule. More usefully, BFCL function calling stayed virtually identical across all three quants (~63%), so for agentic workloads the cheaper quant isn’t sacrificing tool-use quality. HumanEval is the sensitive one (BF16 56.10% → Q4_K_M 50.61%, a 5.5pt drop), but that only matters if your workflow is heavy on code-gen. Practical version: 16.8 GB at Q4_K_M, fits a single consumer GPU, and your agent still calls tools just as well. Reddit
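The figures above can be sanity-checked with a few lines of arithmetic. The sketch assumes the model has exactly 27 billion parameters and that “GB” means 10^9 bytes; both are my simplifications, and the helper name is hypothetical.

```python
def bits_per_weight(file_gb: float, params_billion: float) -> float:
    """Average bits stored per parameter, given file size in GB (10^9 bytes)
    and parameter count in billions."""
    return file_gb * 8 / params_billion


bpw = bits_per_weight(16.8, 27)               # Q4_K_M file size from the post
humaneval_drop = round(56.10 - 50.61, 2)      # BF16 minus Q4_K_M, in points
avg_accuracy_gap = round(66.54 - 66.15, 2)    # Q4_K_M minus Q8_0, in points

print(round(bpw, 2), humaneval_drop, avg_accuracy_gap)  # 4.98 5.49 0.39
```

Under those assumptions Q4_K_M averages about 5 bits per weight, and the 0.39-point gap over Q8_0 is small enough that it may sit within run-to-run noise.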

DeepSeek + Hermes vs Claude Code Max
A single-user report worth treating as a leading indicator, not gospel: someone with a real workload claims they cancelled Claude Code Max, switched to DeepSeek + Hermes, and reported 3x faster runs at $5 in API costs for the week. Single data point, but it lines up with the larger trend: open-weight + cheap-API alternatives are no longer a downgrade — they’re a budget-control move. Worth A/B-testing on your own task profile before you re-up your Max plan in May.

Claude Code & Coding AI

The postmortem aftermath. A week after Anthropic published the April 23 postmortem, the bugs are confirmed fixed (default reasoning effort restored to xhigh for Opus 4.7, high for the rest), and the promised remedies actually shipped: usage limits reset for all affected subscribers, and @ClaudeDevs is now live as a dedicated transparency channel for harness/prompt changes. An independent audit by Stella Laurenzo across 6,852 Claude Code session files documented the regression before Anthropic acknowledged it, the kind of evidence now expected from the community as a baseline. Operational lesson if you run Claude Code in production: pin model versions, watch @ClaudeDevs like a status page, and assume silent harness changes are the new failure mode.

The “Opus paywall-within-a-paywall” wasn’t actually one. The viral Reddit thread claimed Anthropic locked Opus behind an extra fee for Pro users. Anthropic clarified to media that the warning was a stale doc left over from Opus 4.5 that nobody updated when Opus 4.6 and 4.7 shipped. Pro users still get Opus access. But this same week Anthropic ran a stealth A/B test that yanked Claude Code from Pro entirely for ~2% of new prosumer signups for 12 hours, then reversed it after backlash. Pricing-friction probing is now a regular event. If you’re on Pro/Max, expect more entitlement shifts, and budget for pay-as-you-go API access as a fallback.

Anthropic shipped 9 connectors and an entire creative-industry strategy. April 28 drop, and the list is genuinely surprising: Adobe Creative Cloud (50+ apps), Blender, Ableton Live + Push, Autodesk Fusion, Splice, SketchUp, Affinity by Canva, Resolume. They also became a Blender Development Fund patron (open-source 3D, real money behind it) and announced education partnerships with Rhode Island School of Design, Ringling, and Goldsmiths. Read this as a thesis on MCP: Anthropic isn’t building a Photoshop competitor, it’s making Claude the orchestration layer across tools creatives already pay for. Drive Photoshop from inside Claude, search Splice’s catalog from inside Claude, build 3D models in Autodesk via natural language. It is the same MCP-as-glue pattern many teams use internally, just pointed at the creative stack.

Tools of the Week

Claude Security — public beta
Anthropic’s first dedicated defensive product, powered by Opus 4.7. Scans your codebase for vulnerabilities, validates each finding to cut false positives, ships analyst-ready output (confidence rating, severity, likely impact, reproduction steps, recommended fix). Available to Claude Enterprise customers; the research preview ran since February with hundreds of organizations using it on production code. New beta features added based on that feedback: scheduled scans, directory-level targeting, CSV/Markdown exports, webhook notifications, persistent dismissals (so you don’t re-triage the same findings every scan). This isn’t replacing Snyk or Semgrep, but it produces audit-grade artifacts those tools don’t — relevant for anyone shipping into a regulated environment that needs the “fix” alongside the “finding.”

IBM Granite Embedding R2
Apache 2.0, ModernBERT-based, 32K context (up from 512 in R1), 200+ languages with 52 of them having explicit retrieval-pair training — including Polish, Ukrainian, German, French, Croatian. That language list is the practical hook: if you’re building on-prem multilingual RAG for clients across Central/Eastern Europe, this is the first credible Apache-2.0 alternative to Cohere/OpenAI embeddings without data-residency or per-query API costs. Benchmarks: 311M version hits 64.0 on MTEB Multilingual Retrieval (+11.8 vs R1) and a 56.0 average overall. Two sizes: 97M for fast/lightweight, 311M for higher-quality retrieval.

AI at Tenvalleys

Being an AI-native delivery partner sounds like positioning. In practice it means every engineer on the team has hands-on time with the same tooling we put in front of clients — and a weekly forum to trade what’s working, what’s not, and what to stop doing. We call it the brown bag. It exists because the AI stack shifts week-by-week, and we’d rather hit the bruises internally than on a client project.

This week one of our engineers walked the team through learnings from a recent client engagement: rebuilding a production front-end with Claude Design. Anthropic’s design tool — launched in April, covered in edition #009 — turns Claude Opus 4.7’s vision capabilities into a brand-aware surface for generating UI, decks, and marketing assets with the brand rails enforced automatically. The takeaway from the engagement: hands-on time with Claude Design before recommending it to a client is exactly how we want to test new tooling — find the rough edges in our own work first.

Got a front-end you’ve always wanted to redo but the budget never let you? Drop us a line at contact@tenvalleys.com — we might be able to help.

For Dessert

Code with Claude
Anthropic’s developer conference series kicks off Wednesday in San Francisco (May 6), then London (May 19), then Tokyo (June 10). In-person applications closed in early April, but the livestream is free for all three main events. Workshops, live demos of new capabilities, conversations with the teams behind Claude. There’s also an “Extended” companion event the next day in each city, focused on indie devs and early-stage founders — added because demand was overflowing.

If you’ve been wanting to hear “what’s next for Claude Code” straight from the source rather than via Reddit screenshots, register for the SF livestream — Wednesday SF time. Worst case you watch the recording.

See you next week.

The postmortem you’d want to read

On April 23, Anthropic published an engineering postmortem explaining why Claude Code has felt off for weeks, OpenAI shipped GPT-5.5 pointed straight at Claude’s coding lead, and a day earlier Google upgraded Gemini Enterprise. In this edition: what actually broke inside Claude Code (three stacked bugs, 3% eval drop), how the big three moved in 48 hours, and what it means for anyone picking an agent platform.

Topic of the Week

The April 23 Collision

Anthropic’s postmortem is worth reading in full, but here’s the gist: this was not an outage but a weeks-long quality regression in Claude Code, Agent SDK, and Cowork (API users were fine). Three independent bugs stacked. First, on March 4, they quietly lowered default reasoning effort from high to medium to reduce tail latency; Claude literally got less smart until the change was reverted on April 7. Second, a prompt-cache optimization meant to prune old reasoning once, after a session had gone idle, had a bug that applied the pruning on every turn, so Claude was executing tool calls without remembering why it was calling them. Fixed April 10. Third, a single system-prompt line telling Claude to keep responses under 100 words cost 3% on evals for both Opus 4.6 and 4.7. Reverted April 20. The wild part: the caching bug passed human review, automated review, unit tests, e2e tests, and dogfooding. Two unrelated experiments masked it. It took seven weeks to catch.
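To make the second bug concrete, here is a toy model of the difference between pruning once and pruning on every turn. It is not Anthropic's code and does not reflect the real cache logic; the `Session` type, both prune functions, and the keep-last-one policy are hypothetical stand-ins for the class of bug described above.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    reasoning: list[str] = field(default_factory=list)
    pruned_after_idle: bool = False


def prune_once(session: Session, keep_last: int = 1) -> None:
    """Intended behaviour: drop old reasoning a single time."""
    if not session.pruned_after_idle:
        session.reasoning = session.reasoning[-keep_last:]
        session.pruned_after_idle = True


def prune_every_turn(session: Session, keep_last: int = 1) -> None:
    """Buggy variant: no guard, so history is discarded again each turn."""
    session.reasoning = session.reasoning[-keep_last:]


def run(prune, turns: int = 4) -> list[str]:
    """Simulate a resumed session: prune, then record this turn's reasoning."""
    s = Session(reasoning=["old-1", "old-2"])
    for t in range(turns):
        prune(s)
        s.reasoning.append(f"turn-{t}")
    return s.reasoning


print(run(prune_once))        # ['old-2', 'turn-0', 'turn-1', 'turn-2', 'turn-3']
print(run(prune_every_turn))  # ['turn-2', 'turn-3']
```

With the guard, the session keeps everything it has reasoned about since resuming. Without it, each turn sees only the previous turn's reasoning, which matches the symptom of tool calls made without remembering why.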

Same day, OpenAI shipped GPT-5.5 with very agentic framing: 82.7% on Terminal-Bench 2.0, 58.6% on SWE-Bench Pro, 400K context in Codex, 1M via API. Pricing is $5/$30 per 1M tokens ($30/$180 for Pro), paid-only, no free tier. Brockman pitched it as a step toward a “super app” unifying ChatGPT, Codex, and an AI browser. The timing is not subtle — this is pointed directly at Claude’s coding-agent moat.

Practical takeaway: Anthropic is resetting usage limits for all subscribers as an apology, and they explicitly credited public /feedback reports for surfacing the bug — the feedback loop worked. But pair a seven-week invisible regression with GPT-5.5 landing on the same calendar day, and the message is clear: the coding-AI market just got noisier, the benchmarks got harder to trust, and “test on your actual workload” is the loudest advice we can give this week.

Full postmortem: https://www.anthropic.com/engineering/april-23-postmortem

Fresh Papers

Aethon: A Reference-Based Replication Primitive for Constant-Time Instantiation of Stateful AI Agents
This is the kind of infrastructure paper that stops being boring the moment you try to run more than a handful of agents in production. The premise is that the whole industry is mid-transition from stateless model inference (send prompt, get response, forget everything) to stateful agentic execution (persistent tools, memory, session state that has to live somewhere between calls). The runtime architectures we have weren’t designed for that — every new agent session rebuilds its context from scratch, which means cost and latency grow linearly with fleet size. Aethon proposes a reference-based replication primitive: instead of reconstructing tools, memory, and session state on every instantiation, you clone a reference state. Constant time, regardless of how heavy the agent’s context is. This is the same pattern showing up across the managed-agents trend — Cloudflare’s Agent Cloud launch, Anthropic’s managed agents engineering post, the general platform-maturity push. If you’re building anything that needs to scale past a demo, this is the primitive you’ll want your runtime to support.
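The general idea of cloning a reference instead of rebuilding state can be illustrated with a copy-on-write overlay. This is an analogy built on Python's standard `ChainMap`, not the mechanism from the paper; `build_reference_state` and `instantiate_agent` are hypothetical names.

```python
from collections import ChainMap


def build_reference_state() -> dict:
    """Stand-in for the expensive setup: tools, memory, session defaults."""
    return {
        "tools": ("search", "read_file"),
        "memory": "shared-knowledge",
        "step": 0,
    }


REFERENCE = build_reference_state()  # built once, shared by every agent


def instantiate_agent(reference: dict) -> ChainMap:
    """A new agent is an empty private overlay on top of the shared
    reference, so creation cost does not depend on the reference's size."""
    return ChainMap({}, reference)


a = instantiate_agent(REFERENCE)
b = instantiate_agent(REFERENCE)
a["step"] = 5  # the write lands in a's overlay only

print(a["step"], b["step"], REFERENCE["step"])  # 5 0 0
```

Reads fall through to the shared reference and writes stay private to each agent. The sketch keeps the shared values immutable (a tuple and strings); mutable shared values would need real copy-on-write handling.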

Fairness in Multi-Agent Systems for Software Engineering: An SDLC-Oriented Rapid Review
A survey rather than a benchmark, so no headline numbers, but worth flagging because it maps where bias enters the SDLC (planning, coding, review, deployment) and shows how little of it is actually being measured. If your team uses Claude Code, Copilot, Codex, or Cursor daily, these agents are quietly making systematic choices about which patterns, languages, and frameworks to prefer — and nobody’s really auditing that. This pairs uncomfortably with this week’s Anthropic postmortem: if multi-stage review missed a straightforward eval regression, fairness drift across a coding fleet is almost certainly going unnoticed too.

New Models

GPT-5.5 & GPT-5.5 Pro
OpenAI dropped its next generalist on April 23, framing it as the agentic successor to GPT-5.4 and pushing it into ChatGPT and Codex the same day. Benchmarks go straight at Claude’s coding lead: 82.7% on Terminal-Bench 2.0 and 58.6% on SWE-Bench Pro. Context is 400K tokens in Codex, 1M via API. Pricing is $5/$30 per 1M input/output tokens for GPT-5.5 and $30/$180 for Pro, with 50% off on Batch/Flex and 2.5x for Priority. Latency matches 5.4 per token; a new “Codex Fast” mode runs ~1.5x faster at 2.5x cost. Paid tiers only — no free access. Greg Brockman pitched it as a step toward a “super app” unifying ChatGPT, Codex, and an AI browser. Following last week’s Agent Cloud launch, the generalist-plus-platform combo is taking shape fast. Announcement · TechCrunch
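For budgeting, the quoted prices translate into a simple cost function. A minimal sketch: the dictionary keys are labels for this example rather than official API model IDs, and it assumes the Batch/Flex and Priority multipliers apply uniformly to input and output tokens.

```python
# USD per 1M tokens as (input, output), taken from the figures quoted above.
PRICES = {"gpt-5.5": (5.0, 30.0), "gpt-5.5-pro": (30.0, 180.0)}

# Tier multipliers: 50% off on Batch/Flex, 2.5x for Priority.
TIER_MULTIPLIER = {"standard": 1.0, "batch": 0.5, "priority": 2.5}


def cost_usd(model: str, input_tokens: int, output_tokens: int,
             tier: str = "standard") -> float:
    """Estimated cost of one workload in USD."""
    in_price, out_price = PRICES[model]
    base = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return base * TIER_MULTIPLIER[tier]


# Example workload: 2M input tokens, 500K output tokens.
print(cost_usd("gpt-5.5", 2_000_000, 500_000))              # 25.0
print(cost_usd("gpt-5.5", 2_000_000, 500_000, "batch"))     # 12.5
print(cost_usd("gpt-5.5", 2_000_000, 500_000, "priority"))  # 62.5
print(cost_usd("gpt-5.5-pro", 2_000_000, 500_000))          # 150.0
```

Output tokens dominate here: the 500K output tokens cost $15 of the $25 standard-tier total, so agentic workloads that generate a lot of text will feel the $30 output price more than the $5 input price.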

Kimi K2.6 by Moonshot AI
Open-weight release live on Kimi Chat and the Moonshot API, with weights on Hugging Face and endpoints at platform.moonshot.ai. Strong on coding and agentic tasks in both chat and agent modes on kimi.com. Getting traction this week as the go-to alternative in the Reddit thread on Claude Code being pulled from Pro plans.

Qwen3.6-35B-A3B on consumer hardware
Someone got it running at 79 tokens/sec with 128K context on a gaming PC (RTX 5070 Ti + 9800X3D). The unlock is the --n-cpu-moe flag, which offloads MoE experts to CPU so the model fits in consumer VRAM. Concrete proof that serious open-weight coding models now run locally at usable speeds — fuel for the “just run it ourselves” migration story. Reddit

Claude Code & Coding AI

Claude Code unbundled from Pro plan. Anthropic removed the CLI from the $20 Claude Pro tier — Pro users now need Max to run claude against their subscription. The Reddit thread hit 1,388 upvotes with the top comment framing it as “better time than ever to switch to Local Models” — community sentiment tipped toward Kimi K2.6 and Qwen3.6. Anthropic’s head of growth responded on Reddit; community translation: “We gave you Cowork, the CLI is Max-only now.” Worth noting: API keys still work with Claude Code. It’s an unbundling from the subscription, not a product kill.

v2.1.117: subagent and MCP upgrades. Two things matter here, plus one smaller fix:
– Forked subagents on external builds: enable via the CLAUDE_CODE_FORK_SUBAGENT=1 env var (previously gated).
– Agent-level MCP servers: mcpServers in agent frontmatter now loads for main-thread sessions started with --agent, so per-agent tool scoping actually works.
– Improved /model selection: a smoother picker.

Release notes: https://github.com/anthropics/claude-code/releases/tag/v2.1.117

v2.1.113: native binary + sandbox tightening. The CLI now spawns a native Claude Code binary via a per-platform optional dependency instead of running the bundled JavaScript, which means faster startup and less Node overhead. It also adds sandbox.network.deny for outbound network restrictions in sandbox mode. Release notes: https://github.com/anthropics/claude-code/releases/tag/v2.1.113

One practical note tying back to the Topic of the Week: the postmortem confirmed Claude Code, Agent SDK, and Cowork users were hit by the regression (pure API was fine). Anthropic is resetting usage limits for all affected subscribers as compensation, so check your account.

Also This Week

Google Ships Gemini Enterprise

While Anthropic was writing its postmortem and OpenAI was staging GPT-5.5, Google quietly dropped a major Gemini Enterprise update on April 22 — long-running agents, agentic collaboration spaces, advanced governance, partner-built agents available in the catalog, and a deepened SAP partnership putting Gemini directly inside core business processes. Nothing flashy for consumers, but for anyone deploying AI across a big organization this is the clearest “Google wants the enterprise agent layer” signal yet. Worth keeping on your radar if you’re evaluating agent platforms for regulated or large-org workloads — the three-way race between Claude, OpenAI, and Gemini is real now, not hypothetical.

Tools of the Week

Claude Design by Anthropic Labs
A new tool built on Claude Opus 4.7’s vision capabilities that generates prototypes, pitch decks, and marketing materials while enforcing brand consistency automatically. Aimed at design and marketing workflows — basically Canva-meets-Claude.

AI at Tenvalleys

At one of our banking clients we’re running a development project that leans hard on AI and the BMAD framework (Breakthrough Method of Agile AI-Driven Development) — an open-source approach where AI agents take on the roles you’d find in a real dev team: analyst, product manager, architect, developer, QA. Each role hands work off to the next with a structured spec — the same way humans do — except the agents can run in parallel and never lose the handoff format.

We’re using it to test different approaches to automating code-base migration at production scale. The question we’re trying to answer is the unglamorous one: which method actually survives when you point it at a real legacy codebase, not a toy repo? Different teams in the project are testing different approaches against each other — stay tuned to hear which one wins.

Working on legacy codebase migration and want to compare notes on what’s holding up at scale? Reach out at contact@tenvalleys.com.

See you next week.

Opus 4.7 lands after a month of complaints

Opus 4.7 and the new Claude Code desktop both dropped this week — and the headlines only tell part of the story. In this edition: why the community was furious before the drop, what Anthropic isn’t saying about agent safety, and a paper that reads like a cheat sheet for document-heavy AI in regulated industries.

Topic of the Week

Community Complaints → Opus 4.7 Drops

Last week was rough for Claude power users. An AMD AI director analyzed 6,852 Claude Code sessions and published data showing thinking depth had dropped 67%; the post hit 1,804 upvotes on Reddit and kicked off a wave of “is it just me or did Claude get dumber?” threads. Another thread with 699 upvotes pointed out the problem wasn’t even Claude-specific: multiple major models seemed to be degrading simultaneously. Opus 4.6 users reported lazy outputs, refusals on previously-fine prompts, and generally bizarre behavior. The vibe was: we’re paying premium prices for models that are quietly getting worse, and nobody’s saying anything.

Then on April 16, Anthropic dropped Opus 4.7, and the benchmarks read like a direct response to every complaint: a 13% coding improvement over 4.6, 3x more production tasks resolved on Rakuten’s benchmark, 21% fewer document reasoning errors (Databricks), and a 98.5% visual acuity score from XBOW’s penetration testing suite. Vision got a massive upgrade too: images up to 2,576 pixels on the long edge (about 3.75 megapixels), roughly 3x what prior models could handle. There’s a new xhigh effort level for problems where you want the model to think harder, plus task budgets in public beta for controlling autonomous agent spending. Pricing stays the same at $5/$25 per million tokens, and it’s live on the API, Bedrock, Vertex AI, and Microsoft Foundry. Full announcement here: https://www.anthropic.com/news/claude-opus-4-7

Fresh Papers

Anthropic’s 2026 Agentic Coding Trends Report (full report PDF)

Anthropic published an enterprise whitepaper with 8 trends across three categories (foundation, capability, impact) backed by real case studies. The headline stat that should make everyone pause: developers already use AI in ~60% of their work, but can “fully delegate” only 0–20% of tasks. The gap between “using AI” and “trusting AI to run autonomously” is still massive. The report frames the key shift as implementer → orchestrator — engineers stop writing code line-by-line and start coordinating agents that do. Case studies worth noting: Rakuten ran Claude Code across 12.5M lines of code for 7 hours autonomously with 99.9% numerical accuracy. CRED (15M+ users) doubled their development speed. Augment Code compressed a project estimated at 4–8 months into 2 weeks. Fountain achieved 50% faster screening and 2x candidate conversions with multi-agent orchestration. The report also predicts an “onboarding revolution” — traditional ramp-up from weeks to hours — and that multi-agent systems will replace single-agent workflows as the standard architecture.

The Blind Spot of Agent Safety (paper)

Remember the Princeton reliability paper from Edition #001 — the one showing that agent benchmark scores keep climbing while real-world reliability barely moves? This week’s paper from LIME Lab makes that gap feel even more uncomfortable. They built OS-BLIND, a benchmark with 300 tasks across 12 attack categories, and tested how computer-use agents handle seemingly innocent instructions that cause harm through side-effects — not through adversarial prompts, but through normal-looking tasks that go wrong in context. The results are bleak: average attack success rate above 90% across most agents, including safety-aligned ones. Claude 4.5 Sonnet on its own hits 73% ASR, but put it inside a multi-agent system and that jumps to 92.7%. The most interesting finding is why: safety alignment kicks in during the first few steps of execution, then basically falls asleep. The agent checks itself early, decides everything looks fine, and then sleepwalks through the dangerous parts. For anyone building or deploying computer-use agents, this is a concrete reminder that “safety-aligned” and “safe in production” are still very different things.

Adaptive Query Routing for Financial, Legal, and Medical Documents (paper)

This one is close to home. The paper compares different RAG approaches specifically on financial, legal, and medical documents — the kind of content our banking clients deal with every day. It benchmarks vector-based agentic RAG (the standard approach: embed everything, search by similarity) against hierarchical node-based reasoning (follow the document’s structure and logic instead of just matching text). The results show that the best approach depends on the type of question — some queries need semantic similarity, others need structural navigation, and a tier-based hybrid that routes queries to the right strategy outperforms either approach alone. For anyone building document Q&A systems for regulated industries, this is a concrete comparison of what actually works rather than what sounds good in a blog post.
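To make the routing idea concrete, here is a minimal two-way sketch. The cue list, the function name and the strategy labels are all hypothetical; the paper's actual routing criteria are not described here, so treat this as an illustration of the pattern rather than the authors' method.

```python
import re

# Hypothetical cues: words that suggest the question points at document structure.
STRUCTURAL_CUES = re.compile(
    r"\b(section|clause|article|table|appendix|paragraph)\b", re.IGNORECASE
)


def route_query(query: str) -> str:
    """Pick a retrieval strategy for a query.

    Returns "hierarchical" when the query refers to document structure,
    otherwise "vector" for plain semantic similarity search.
    """
    if STRUCTURAL_CUES.search(query):
        return "hierarchical"
    return "vector"


print(route_query("What does clause 7.2 say about early repayment?"))  # hierarchical
print(route_query("Which funds mention climate risk?"))                # vector
```

A production router would more likely use a classifier or the model itself to choose, but the shape stays the same: decide the strategy first, then retrieve.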

Also worth a read: Anthropic’s engineering team published how they built their Managed Agents infrastructure — the key pattern is decoupling the “brain” (Claude), “hands” (sandboxes), and “session” (durable event log). Stateless harness, on-demand containers, credentials never in the sandbox. Their p50 time-to-first-token dropped 60%. If you’re building production agent systems, this is the reference architecture.

New Models

Cloudflare + OpenAI: Agent Cloud
Not a model release, a platform play. Purpose-built infrastructure for running AI agents at scale: millisecond cold starts (“100x speed, fraction of the cost of containers”), Git-compatible storage for agent repos, and full Linux sandboxes (now GA). Ships with GPT-5.4, Codex, and open-source models — switching providers is a one-line config change. This landed the same week AWS and Google Cloud made similar moves. The infrastructure layer for AI agents isn’t emerging anymore — it’s crystallizing. openai.com

GPT-5.4-Cyber
OpenAI released a cybersecurity-specialized model with lower refusal boundaries for legitimate security research, as part of their “Trusted Access for Cyber Defense” program. Available to a limited group for now. Meanwhile Anthropic’s Claude Mythos Preview was restricted due to extraordinary cybersecurity capabilities. The signal: specialized, domain-tuned models are becoming a thing — not just general-purpose anymore. Reddit

Claude Code & Coding AI

Claude Code Desktop — full redesign with multi-session support. The headlines: parallel agents (run multiple coding tasks at the same time), visual diffs, PR tracking, live server preview — all inside one desktop app. The standout feature is Coordinator Mode — you spin up parallel sub-agents that work on different parts of a codebase simultaneously while a coordinator keeps them aligned. Available on Pro, Max, Team, and Enterprise plans. Vercel’s teams reported 7.6x more frequent deployments after adopting it. This is Anthropic’s clearest move yet toward “AI as a dev team member” rather than “AI as autocomplete.”

Auto mode
A new permission mode that sits between “approve every action” and “skip all checks.” Claude decides for itself whether each action is safe, while a background classifier blocks risky operations (mass deletions, data exfiltration, destructive bash). If actions get blocked 3 times in a row or 20 times total, the session falls back to manual. This is Anthropic’s answer to --dangerously-skip-permissions: you get the speed of unattended agent runs without completely removing guardrails. Requires v2.1.83+, available on Max, Team, Enterprise, and API plans (not Pro).
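The fallback rule is easy to picture as two counters. A minimal sketch, assuming the thresholds quoted above (3 in a row, 20 total) and assuming that an allowed action resets the streak; the class and method names are hypothetical and this is not Claude Code's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class BlockTracker:
    """Hypothetical tracker for the auto-mode fallback rule."""

    max_consecutive: int = 3  # blocks in a row before falling back
    max_total: int = 20       # blocks per session before falling back
    consecutive: int = 0
    total: int = 0

    def record(self, blocked: bool) -> str:
        """Record one classifier decision and return the resulting mode."""
        if blocked:
            self.consecutive += 1
            self.total += 1
        else:
            self.consecutive = 0  # an allowed action resets the streak only
        if self.consecutive >= self.max_consecutive or self.total >= self.max_total:
            return "manual"
        return "auto"


tracker = BlockTracker()
modes = [tracker.record(b) for b in [True, True, False, True, True, True]]
print(modes[-1])  # manual
```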

v2.1.101 — massive stability release. 40+ bug fixes, including some that matter a lot if you run long sessions:

– Security fix: command injection vulnerability in LSP binary detection. Patch this one
– Memory leak fixed: the virtual scroller was retaining dozens of historical copies during long sessions (explains why things got sluggish after a few hours)
– 7 separate --resume bug fixes: session resumption should finally feel reliable
– Configurable API_TIMEOUT_MS: replaces the hardcoded 5-minute timeout, useful if you’re on slower connections or running heavy prompts
– OS CA certificate store trust by default: enterprise teams behind TLS proxies, this one’s for you
– /ultrareview: dedicated slash command for thorough code review sessions

/team-onboarding — your habits become documentation. This one deserves its own callout. Run /team-onboarding and Claude scans your last 30 days of usage — which commands you run, what workflows you follow, what patterns you’ve established — and generates a structured ramp-up guide for new team members. Instead of “sit next to someone for a week and figure it out,” a new developer gets a guide based on how your team actually works. If you’re onboarding anyone soon, try it.

Full changelog

Advisor Tool — Opus as a behind-the-scenes strategist. New API tool where a cheaper model (Sonnet or Haiku) runs the entire task but can consult Opus when it gets stuck. Not routing — the executor stays in control, Opus just advises. Results are striking: Haiku + Opus advisor doubled BrowseComp scores (19.7% → 41.2%) while costing 85% less than Sonnet alone. On SWE-bench, Sonnet + Opus advisor scored +2.7pp over Sonnet solo. There’s a max_uses parameter for cost control so the cheap model doesn’t call the expensive one on every step. If you’re building anything with the API and managing costs, this pattern is worth studying. Blog post
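The control flow is worth seeing in miniature. This sketch does not use the real API; `executor` and `advisor` are hypothetical callables standing in for the cheap and expensive models, and only the `max_uses` idea is taken from the description above.

```python
from typing import Callable, Optional


def run_with_advisor(
    steps: list[str],
    executor: Callable[[str, Optional[str]], Optional[str]],
    advisor: Callable[[str], str],
    max_uses: int = 2,
) -> tuple[list[str], int]:
    """Run every step with the cheap executor; consult the advisor only when stuck.

    executor(step, hint) returns a result, or None when it is stuck.
    The advisor only supplies a hint; the executor still produces the result.
    """
    results: list[str] = []
    uses = 0
    for step in steps:
        result = executor(step, None)
        if result is None and uses < max_uses:
            uses += 1  # spend one advisor call from the budget
            result = executor(step, advisor(step))
        results.append(result if result is not None else "unresolved")
    return results, uses
```

The budget is the point: once `max_uses` is spent, the executor carries on alone instead of escalating every hard step to the expensive model.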

AI at Tenvalleys

This week we decided to organize our first internal 10vOS Skill Hackathon: small teams, four hours, one goal. Each team picks a repetitive task they actually hit in their daily work and builds a custom Claude Code skill to automate it. The bet is simple: the most valuable AI tooling is the tooling your team actually uses every day, not the impressive demo nobody opens again.

Stay tuned — we’ll share what came out of it in a few weeks.

If you’ve run something similar inside your company and have lessons to trade, email us at contact@tenvalleys.com or reach out on LinkedIn. We’d love to compare notes.

Worth Reading

Stanford’s 2026 AI Index Report
The annual state-of-AI report just dropped. Key findings: Anthropic currently leads model rankings, US and China are almost neck-and-neck on performance, and AI is being adopted faster than the personal computer or the internet were. IEEE Spectrum summary

See you next week.

Prepared at Tenvalleys — a delivery-first AI engineering partner — by Nikola Powałka. Feedback? Email us at contact@tenvalleys.com or reach out on LinkedIn.

A model too powerful to sell

Topic of the Week

Claude Mythos, Project Glasswing, and Managed Agents

So what actually happened? Anthropic built their most powerful model ever — Claude Mythos — and then decided it was too dangerous to sell. During testing, Mythos found security holes that no one had caught for decades: a 27-year-old bug in OpenBSD, a 16-year-old flaw in FFmpeg that survived five million automated tests. It scores 93.9% on SWE-bench (Opus 4.6 gets 80.8%). Basically, it’s better at finding software vulnerabilities than almost any human.

Instead of putting it on the market, Anthropic created Project Glasswing — a cybersecurity defense program. They invited 12 big tech companies (AWS, Apple, Google, Microsoft, CrowdStrike, NVIDIA, and others) to use Mythos for finding and fixing security holes, backed by $100M in usage credits. The deal: you find vulnerabilities, you share them within 90 days so everyone can patch. Anthropic says they have no plans to make Mythos generally available — at least not until they figure out how to do it safely. As CrowdStrike’s CTO put it: “The window between discovery and exploitation has collapsed.”

The second big announcement: Claude Managed Agents went to public beta. The idea is simple — instead of building your own infrastructure to run AI agents, Anthropic hosts them for you. You define an agent, it runs in their cloud with all the tools it needs, and you pay $0.08 per hour plus normal token costs. Early adopters like Notion and Asana are already using it. The cool part: you can watch what your agent is doing in real time and interrupt or redirect it mid-task.
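For a rough feel of what a run costs, the pricing model reduces to one line of arithmetic. A back-of-the-envelope sketch: the $0.08 hourly rate comes from the announcement as summarised above, while the function name, the token volumes and the $5/$25 token prices in the example are illustrative assumptions.

```python
def managed_agent_cost(
    hours: float,
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
    hourly_rate: float = 0.08,
) -> float:
    """Runtime fee plus token charges, in dollars."""
    runtime = hours * hourly_rate
    tokens = (input_tokens / 1e6) * input_price_per_m
    tokens += (output_tokens / 1e6) * output_price_per_m
    return runtime + tokens


# 10 hours of runtime, 2M input and 0.5M output tokens at $5/$25 per million.
print(round(managed_agent_cost(10, 2_000_000, 500_000, 5, 25), 2))  # 23.3
```

In this example the runtime fee is $0.80 of the total, so the token bill is what actually needs watching.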

For Glasswing participants, Mythos is priced at $25/$125 per million input/output tokens. Access is limited to 12 launch partners plus about 40 additional organizations. Side note: a D.C. court this week also allowed the Pentagon to keep Anthropic on a blacklist over disputes about using Claude in autonomous weapons, so the relationship between Anthropic and the government is… complicated.

Three launches in one week. Anthropic isn’t just building smarter models anymore — they’re building the whole platform around them.

Project Glasswing · Claude Managed Agents · Claude Mythos on Vertex AI

Fresh Papers

One agent is enough (if you give it enough time to think)

There’s a popular idea in AI right now: if one agent is good, five agents debating each other must be better. This paper tested that. They gave a single AI the same amount of “thinking time” as a team of five agents working together — and the single agent won. Every time. Across multiple models (Qwen3-30B, DeepSeek-R1-70B, Gemini-2.5) and multiple benchmarks. The multi-agent setups only helped when the input data was heavily corrupted — basically, when things are so messy that having multiple guesses is better than one.

The takeaway: the reported advantages of multi-agent systems mostly come from giving them more compute, not from the architecture itself. If someone pitches you on “just add more agents” — this paper is worth sending them. arXiv

10 minutes of AI help makes people perform worse

This one stings. Researcher Michiel Bakker ran a series of randomized experiments and found that after just 10 minutes of using AI assistance, people performed worse on tasks and gave up more often than people who never used AI at all. It went viral on Twitter this week. The implication is uncomfortable: AI help can create a kind of learned helplessness — you get used to the assist, lose confidence, and then struggle more when it’s gone. Worth keeping in mind, especially for teams rolling out AI tools to non-technical users. arXiv

New Models

Gemma 4 (Google)

Google’s new open model family, and the numbers are impressive. The most interesting variant is the 26B model that only uses 4 billion parameters at a time (the rest stay “asleep”) — and still scores almost as well as the full 31B version. That means you can run a very capable model on a laptop with 16GB of RAM. It handles text, images, video, and audio, has a 256K context window, and it’s fully open-source (Apache 2.0). 10 million downloads in the first week. It also supports function calling, extended thinking, and agentic tool use out of the box — and runs on basically everything (llama.cpp, MLX for Mac, even in the browser via transformers.js). In edition #003 we talked about small Qwen models beating big ones — Gemma 4 is the same trend, just from Google. HuggingFace · Reddit

Muse Spark (Meta / Scale AI)

Meta’s first model from their new AI lab (Meta Superintelligence Labs), led by Alexandr Wang — the founder of Scale AI. The whole thing is backed by Meta’s $14.3B acquisition of 49% of Scale AI. Two things stand out. First, they claim it matches Llama 4 Maverick while using 10x less computing power. Second, it’s closed-source — no public weights. That’s a big shift from Meta’s whole “open-source AI” identity. Whether this means Meta is moving away from open models or just experimenting with a parallel track is the interesting question to watch. Meta AI Blog

Claude Code & Coding AI

Four new releases this week (v2.1.91 → v2.1.97). The highlights:

– Better answers by default (v2.1.94): effort level changed from “medium” to “high” for all paid users. You should notice better results without changing anything
– Focus View / Ctrl+O (v2.1.97): a clean view that only shows your prompt and the final result, hiding the noise in between
– Bigger MCP results (v2.1.91): MCP tools can now return up to 500K characters without getting cut off (useful for database schemas)
– AWS Bedrock wizard (v2.1.92): guided setup for teams running Claude Code through AWS
– Memory leak fixed (v2.1.97): long sessions with MCP servers were eating 50MB/hr. Fixed now

Also worth noting: OpenAI’s Codex hit 3 million weekly users, up from 2 million a month ago. GitHub

Tools of the Week

TurboQuant (Google)

Google released a compression technique called TurboQuant that makes AI models use 6x less memory — with zero loss in quality. No retraining needed, works with any model. The practical result: people are now running Qwen 3.5-9B on a regular MacBook Air M4 with 16GB of RAM. A Mac Mini M4 Pro can handle 100K+ token context. The community on r/LocalLLaMA (1,746 upvotes) also adapted it for model weight compression, getting 3.2x memory savings. If you liked the M5 Max local model benchmarks from edition #003, TurboQuant is the software version of the same story: run bigger models on smaller hardware. Google Research · Reddit

AI at Tenvalleys

This week marked our first internal 10vOS workshop. 10vOS is the AI operating system we’ve been building at Tenvalleys — a stack of specialized agents, skills, and automations that powers how we run delivery, sales, content, and internal operations. It’s also what’s behind this newsletter.

So far 10vOS had been mostly a project a small group of us were driving. The workshop was the first time we walked the whole team through it — what it already does, how to set it up locally, and how to plug into it day-to-day. The point wasn’t a demo: it was onboarding. We want every person at Tenvalleys to use 10vOS in their own work and contribute new skills back into it, so the platform keeps growing in the directions the team actually needs.

The bet is simple: the best AI tooling is the tooling your team uses and shapes — not the impressive demo nobody opens again.

If you’re thinking about rolling out something similar inside your own organisation, or you’d like to see what 10vOS does in practice — reach out at contact@tenvalleys.com.

In the Background

OpenAI published a policy proposal this week that’s worth a read. They’re calling for a Public Wealth Fund — the idea is that every American would automatically get a stake in AI companies, funded by higher capital gains taxes on AI-driven returns. On top of that, they propose a government-subsidized four-day work week to help people transition as AI takes over more tasks. So basically, an AI company is saying: tax us more, let people work less, and share the profits. Whether you see this as genuine corporate responsibility or a PR move to get ahead of regulation — it’s the first time a major AI lab has put something this concrete on paper about how to redistribute AI wealth. TechCrunch

For Dessert

Demis Hassabis — Nobel Prize winner, CEO of Google DeepMind, the guy whose AlphaFold cracked the 50-year protein folding problem — gave an interview this week where he said something you don’t usually hear from someone running one of the three biggest AI labs on Earth: “If I’d had my way, I would have left AI in the lab for longer. Done more things like AlphaFold. Maybe cured cancer or something like that.”

Let that sink in. The man in charge of Google’s AI is publicly saying the commercial AI race was a mistake. That ChatGPT forced everyone into a sprint toward chatbots and products, when the technology could have been solving cancer, energy, and materials science — slowly, carefully, like CERN.

He also laid out what worries him most: not bad actors using AI, but AI itself going rogue in the next 2-4 years as we enter “the agentic era.” His words: “How do we make sure the guardrails are put in place so they do exactly what they’ve been told to do? That’s going to be an incredibly hard technical challenge.”

A Nobel Prize winner saying the alignment window is 2-4 years. Worth thinking about over the weekend. X

See you next week.


44 hidden flags inside Claude Code

3 April 2026

Today we’re starting a bit differently. Some food for thought — the kind that’s sometimes needed in the AI arms race we’re all living through right now.

Earlier this week at the NextGen AI Conference there was one talk that’s hard to shake. Most of the program was what you’d expect — new models, new tools, people demoing things that’ll be outdated in three months. Good stuff, genuinely exciting. But this one was different.

It was about the people behind the data labeling. Kenyan workers hired to train ChatGPT’s content filters. They earned about $2 an hour, pulled 20-hour shifts, and spent that time labeling the content that AI models need to learn to filter out — graphic images of violence, murders, child abuse, pornography, the worst things humans do to each other. The psychological damage is real — workers have reported PTSD, nightmares, lasting trauma.

It’s the kind of story that should land with anyone who follows AI ethics, yet most readers won’t have heard it before. And once you start digging, you learn how the system is designed. The big AI companies don’t employ these workers directly. They outsource through chains of smaller subcontractors, layer after layer, which conveniently shields them from any responsibility for what happens to the people at the bottom. It’s structured so that no one is accountable. And that’s exactly why these stories don’t reach us.

Watch the 60 Minutes investigation — it’s about 15 minutes and worth every one of them.

There aren’t easy answers here. Yes, there’s irony in writing this in an AI newsletter. That’s part of the point.

Anyway. Here’s what happened in AI this week.

Topic of the Week

The Claude Code Source Map Leak

You’ve probably seen the headlines all week — and read at least three different breakdowns. So rather than rehash what you already know, here’s a clean summary plus the details most coverage buried or got wrong.

On March 31, someone at Anthropic shipped npm package v2.1.88 without adding *.map to .npmignore. The result: a 59.8 MB JavaScript source map file went out to the public registry, exposing roughly 512,000 lines of TypeScript source code. Within hours, mirrors were up across GitHub. The initial reports flagged 35 hidden feature flags; the actual count turned out to be 44.
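A pre-publish check for this class of mistake is only a few lines. A minimal sketch, assuming you point it at the directory you are about to publish; the function name is hypothetical, and it scans the folder on disk rather than asking npm what it would pack (running `npm pack --dry-run` is the way to see the real file list).

```python
from pathlib import Path


def find_source_maps(package_dir: str) -> list[str]:
    """Return relative paths of every *.map file under package_dir, sorted."""
    root = Path(package_dir)
    return sorted(
        str(path.relative_to(root))
        for path in root.rglob("*.map")
        if path.is_file()
    )
```

Wired into CI, a non-empty result would fail the release job before anything reaches the registry.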

The discoveries inside are more interesting than the leak itself:

– KAIROS: an unreleased background agent that stays alive between sessions and can act on its own (monitor GitHub, send notifications). Named after the Greek concept of “the right moment.” Anthropic is clearly thinking about AI that doesn’t wait for you to ask.
– Undercover Mode: when Anthropic employees use Claude Code on public repos, this hides all traces: no “Co-Authored-By” tags, no mentions of internal tools or unreleased models. Stealth mode for dogfooding in the wild.
– Buddy: yes, someone built a Tamagotchi pet system inside Claude Code. Collectible creatures with rarity tiers and shiny variants. Not shipped, but fully built. Someone at Anthropic had fun with that one.
– WTF Telemetry: a file called userPromptKeywords.ts watches for frustration words like “wtf,” “omfg,” “dammit” and logs them. No way to opt out of just this; it’s all telemetry or nothing. The most debated find by far.

Anthropic’s official response was brief: “No sensitive customer data or credentials were involved. Release packaging issue caused by human error, not a security breach.” Technically accurate — this wasn’t a hack, it was a missing line in a config file. But the real takeaway isn’t about security. It’s about what the hidden feature flags reveal: Anthropic is building toward persistent, autonomous agents that run in the background, and they’re already instrumenting frustration signals to improve the experience. The leak is embarrassing; what it shows about the roadmap is genuinely fascinating.

Reddit | VentureBeat | The Register

What the community did with it

The internet didn’t just read the code — it got to work. Someone extracted the full multi-agent orchestration system (coordinator mode, tool routing, team management) and packaged it as an open-source framework that works with any LLM — 742 upvotes on launch. Now that both Claude Code and Codex are visible, people did a proper architectural comparison: Claude Code is an interactive copilot (plans, asks for confirmation, executes step-by-step, 17 programmable hooks for governance), while Codex is an autonomous executor (delegate and it runs end-to-end). Safety works differently too — Codex locks things down at the OS kernel level, Claude Code does it at the application layer. The biggest differentiator? Claude Code’s Agent Teams — sub-agents that each get their own context window and git worktree, and can message each other mid-task.

Separately, someone reverse-engineered the binary and found two cache bugs silently 10-20x-ing API costs. Bug one: a string replacement for billing tracking can accidentally break the cache prefix. Bug two: --resume misses the cache entirely. Max 5x users went from 8 hours of work to 1 hour; Max 20x users saw usage jump from 21% to 100% in a single prompt. Workarounds: use npx instead of global install, avoid --resume, some report downgrading to v2.1.34 helps. Anthropic’s Lydia Hallie confirmed they’re actively investigating.
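Why would a string replacement be so expensive? Prompt caches generally match on the exact prefix, so touching text near the top invalidates everything after it. A toy illustration of that effect, not the actual bug: the key function, the placeholder text and the session value are all made up for the example.

```python
import hashlib


def prefix_key(text: str) -> str:
    """Hypothetical cache key: a hash of the exact prefix text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


system = "You are a coding agent. billing-id: PLACEHOLDER\n"
cached = {prefix_key(system)}  # prefix stored by the first request

same = system                                            # unchanged prefix
rewritten = system.replace("PLACEHOLDER", "session-42")  # in-place substitution

print(prefix_key(same) in cached)       # True
print(prefix_key(rewritten) in cached)  # False
```

One changed token in the prefix means the whole context gets billed as fresh input, which is how a small substitution can turn into a large multiplier on cost.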

Multi-agent extraction | CC vs Codex | Cache bugs

Fresh Papers

“Terminal Agents Suffice for Enterprise Automation”
ServiceNow Research | Read the paper

Everyone’s building MCP tool stacks right now. Custom tools for every API, elaborate integrations, carefully orchestrated pipelines. ServiceNow’s research team just tested that approach against something much simpler: an agent that writes and runs code in a terminal. Across 729 tasks on real enterprise platforms (ServiceNow, GitLab, ERPNext), the terminal agent matched web-browsing agents at 5-9x lower cost — and blew past the MCP approach entirely. ServiceNow’s own platform exposed 93 MCP tools, and agents using them still couldn’t complete basic tasks like ordering from the service catalog. The best MCP setup topped out at 55% success. Meanwhile, Claude Sonnet running terminal commands hit 72.7% at $0.56 per task, compared to $3.29 for the web agent doing the same work.

Two findings stand out beyond the headline. First, throwing documentation at agents doesn’t automatically help — reference-style API docs actually misled them. Only task-oriented guides (step-by-step “here’s how to do X”) improved performance. Second, when terminal agents saved successful task solutions as reusable “recipes” for later, accuracy went up 3.6-5.8 percentage points and costs dropped 17-44%. Skills that compound over time beat tools that don’t learn.
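The “recipes” idea is essentially a lookup table of solutions that already worked. A minimal sketch under that reading; the class, its methods and the example commands are hypothetical, and the paper's actual storage and retrieval scheme may differ.

```python
from typing import Optional


class RecipeStore:
    """Hypothetical store of command sequences that solved a task type."""

    def __init__(self) -> None:
        self._recipes: dict[str, list[str]] = {}

    def save(self, task_type: str, commands: list[str]) -> None:
        """Remember the commands that solved this kind of task."""
        self._recipes[task_type] = list(commands)

    def lookup(self, task_type: str) -> Optional[list[str]]:
        """Return a copy of the saved recipe, or None if the task type is new."""
        recipe = self._recipes.get(task_type)
        return list(recipe) if recipe is not None else None


store = RecipeStore()
store.save("order-catalog-item", ["curl -X POST <catalog endpoint>", "check status"])
print(store.lookup("order-catalog-item") is not None)  # True
print(store.lookup("reset-password"))                  # None
```

An agent would call `lookup` before planning from scratch and `save` after a verified success, which is where the compounding comes from.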

The practical takeaway: before building another custom integration layer, consider whether a capable coding agent with API access already solves your problem. The paper suggests that for a surprising range of enterprise tasks, it does — faster, cheaper, and with less maintenance overhead. That’s worth keeping in mind next time someone pitches a 50-tool MCP server as the answer.

New Models

TurboQuant (Google Research)
Training-free compression that squeezes KV cache down to 3 bits with negligible quality loss. The community then adapted it for model weights too. Bottom line: Qwen3.5-27B now fits on a $400 RTX 5060 Ti with 16GB VRAM — and people are running it on a MacBook Air. The blog post got 12M views; the arXiv paper has 2 citations. Says a lot. Google Research | Reddit

Gemma 4 (Google DeepMind)
Four new open-weight models, now under Apache 2.0. The 31B dense model hits 89.2% on AIME 2026 and 2,150 Codeforces ELO. The sleeper hit: E4B runs on a T4 GPU and still pulls 42.5% on AIME. Multimodal, native reasoning, runs on a Raspberry Pi. Google DeepMind | Reddit

Claude Mythos (Anthropic) — teaser
A leaked model tier called “Capybara,” sitting above Opus. “Dramatically higher” scores on coding, reasoning, and cybersecurity. Plans and executes autonomously across systems. No pricing, no release date, “very expensive to serve.” We’ll cover it when it ships. Fortune

Tools of the Week

Cline Kanban is a standalone app for CLI-agnostic multi-agent orchestration. It gives you a Kanban board where every card is a live agent task. Set up dependency linking so when a parent task completes, dependent tasks auto-kick-off. Each task gets its own terminal and git worktree. Works with Claude Code, Codex, Cline, and others. Install with npm i -g cline — local-first, no cloud needed. Cline Kanban
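Dependency linking boils down to a small graph rule: a card starts once all of its parents are done. A minimal sketch of that rule; the `Board` class and the task names are hypothetical and say nothing about how Cline Kanban implements it internally.

```python
from collections import defaultdict


class Board:
    """Hypothetical task board: a card starts once all of its parents are done."""

    def __init__(self) -> None:
        self.parents: dict[str, set[str]] = {}
        self.children: dict[str, set[str]] = defaultdict(set)
        self.done: set[str] = set()
        self.started: list[str] = []

    def add(self, task: str, depends_on: tuple[str, ...] = ()) -> None:
        """Register a card; start it right away if nothing blocks it."""
        self.parents[task] = set(depends_on)
        for parent in depends_on:
            self.children[parent].add(task)
        if self.parents[task] <= self.done:
            self.started.append(task)

    def complete(self, task: str) -> list[str]:
        """Mark a task done and return the dependents that just became runnable."""
        self.done.add(task)
        ready = sorted(
            child
            for child in self.children[task]
            if self.parents[child] <= self.done and child not in self.started
        )
        self.started.extend(ready)
        return ready


board = Board()
board.add("schema")
board.add("api", ("schema",))
board.add("ui", ("schema",))
board.add("e2e", ("api", "ui"))
print(board.complete("schema"))  # ['api', 'ui']
print(board.complete("api"))     # []
print(board.complete("ui"))      # ['e2e']
```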

AI at Tenvalleys

This week’s internal pick: /ideate — a skill that turns five ML architectures into structured creative thinking modes. Breed and cross-pollinate ideas (Evolutionary), refine rough drafts from noise to clarity (Diffusion), stress-test proposals through adversarial attack loops (GAN), sharpen positioning by defining what something is NOT (Contrastive), or compress a complex argument down to one sentence (Distillation).

It’s been useful for the kind of writing that needs sharpening rather than starting from scratch — hero-section copy, positioning statements, the line that has to do a lot of work in a small space. Fifteen minutes inside the skill produces a tighter result than fifteen minutes of staring at a blank page.

If you’re building a library of in-house AI skills your team will actually use, or you’d like to see how /ideate works in practice — reach out at contact@tenvalleys.com.

Must-See

“The AI Doc: Or How I Became an Apocaloptimist” — a new full-length documentary (1h 43min) directed by Daniel Roher, who won the Oscar for “Navalny,” and produced by the team behind “Everything Everywhere All At Once.” It features Sam Altman, Dario and Daniela Amodei, and tackles the big question head-on: is AI the collapse of humanity, or our ticket to the cosmos? Sitting at 8.2 on IMDb and 87% on Rotten Tomatoes, it’s currently in US theaters only (Focus Features, since March 27) but coming to Apple TV later this year. Worth putting on your watchlist for when it hits streaming.

See you next week.


Claude can now use your computer

This was Anthropic’s week. Claude learned to use your computer, got a new auto mode, started dreaming (yes, really), and showed up on Discord. The Computer Use launch alone drew 74 million views on Twitter. But it wasn’t just Anthropic: Xiaomi proved you don’t need billions to build a top coding AI, Google launched privacy-preserving models for banks, and OpenAI quietly killed Sora. Let’s get into it.

Topic of the Week

Claude Can Now Use Your Computer

Anthropic shipped Computer Use on March 23 — and the internet lost it. 74 million views, 139K likes, 25K reposts. Those aren’t normal numbers for an AI feature launch.

Here’s what it does: Claude can now open apps on your Mac, navigate your browser, fill in spreadsheets — anything you’d do sitting at your desk. You can send it a task from your phone, go do something else, and come back to finished work on your computer.

Now, technically, “computer use” existed before — Anthropic launched a developer API version back in October 2024. But that was raw infrastructure. You needed Docker containers, VNC servers, and coding skills to make it work. What shipped this week is completely different: no setup required, just enable it and Claude sees your screen. Think of it like self-driving cars — the old version gave engineers access to raw sensor data. This one lets a normal person press “drive me to work.”

The smart part: when Claude has a proper integration (like Google Calendar or Slack), it uses that. But when there’s no connector — say, your company’s internal HR tool or that legacy system nobody built an API for — it falls back to clicking through the app like a human would.

Available now as a research preview for Pro and Max subscribers, macOS only. Anthropic recommends not using it with sensitive data yet.

CNBC | Engadget

Claude Code & Coding AI

But Anthropic didn’t stop there. It almost feels like they don’t stop at all — Twitter and Reddit are going crazy. Here’s everything else they shipped this week:

Auto Mode (Mar 24) — The middle ground between “approve every single action” and “let Claude do whatever it wants.” A classifier checks each action before it runs — safe ones proceed automatically, risky ones get blocked and Claude finds another way. Enable with claude --enable-auto-mode, cycle to it with Shift+Tab. Available on Team plan now, Enterprise rolling out.

TechCrunch

Auto Dream
This one is wild. Claude Code now has a “REM sleep” cycle for its memory. Every 24 hours (after at least 5 sessions), a background agent wakes up and cleans house: converts relative dates like “yesterday” to actual dates, removes contradicted facts, merges duplicate entries, and prunes the memory index to stay under 200 lines. If Auto Memory is the note-taking, Auto Dream is the filing system that keeps those notes useful over time.

claudefa.st
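The Auto Dream housekeeping can be sketched in a few lines. This is a toy version under stated assumptions: it only understands “today” and “yesterday”, treats duplicates as exact text matches, skips contradiction removal entirely, and uses hypothetical names throughout. It is not the real memory format.

```python
import re
from datetime import date, timedelta

RELATIVE = {"today": 0, "yesterday": 1}  # hypothetical, deliberately tiny vocabulary


def consolidate(entries: list[tuple[date, str]], max_lines: int = 200) -> list[str]:
    """Resolve relative dates, drop exact duplicates, keep the newest max_lines.

    entries are (date_written, note_text) pairs, oldest first.
    """
    seen: set[str] = set()
    cleaned: list[str] = []
    for written, text in entries:
        for word, days in RELATIVE.items():
            absolute = (written - timedelta(days=days)).isoformat()
            text = re.sub(rf"\b{word}\b", absolute, text, flags=re.IGNORECASE)
        if text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned[-max_lines:] if max_lines > 0 else []


notes = [
    (date(2026, 3, 24), "Fixed login bug yesterday"),
    (date(2026, 3, 25), "Fixed login bug 2026-03-23"),
    (date(2026, 3, 25), "Release is today"),
]
print(consolidate(notes))
# ['Fixed login bug 2026-03-23', 'Release is 2026-03-25']
```

Resolving the date before deduplicating is what lets the first two notes collapse into one.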

Claude Code Channels
Claude Code is now on Discord and Telegram. Message it a task from your phone, it executes on your machine. Anthropic is clearly building toward a world where your AI assistant is always reachable, not just when you’re at your terminal.

MCP Tools on Mobile (Mar 26) — Figma, Canva, Amplitude, Slack — all the integrations that launched on desktop in January now work on the Claude mobile app. Explore designs, create slides, check dashboards, all from your phone.

Projects in Cowork (Mar 20) — Keep your tasks and context in one place, focused on one area. Files and instructions stay local on your computer.

Version releases (v2.1.81 → v2.1.84)
Three releases this week. Highlights: --bare flag for scripted calls (v2.1.81), managed-settings.d/ for team policy fragments (v2.1.83), and PowerShell tool for Windows (v2.1.84).

Fresh Papers

Governed Memory: A Production Architecture for Multi-Agent Workflows

Your AI agents are all working on the same customer, but none of them remember what the others learned. This paper proposes a fix.

The problem: enterprise AI deploys dozens of agents across workflows — sales, support, ops — each acting on the same entities with no shared memory and no governance. RAG solves retrieval but not governance: who stores what, which policies apply, and whether quality is silently degrading.

The solution is a four-layer architecture: ingestion, governance routing, retrieval, and schema lifecycle. Results: 99.6% fact recall, 92% governance routing precision, 50% token reduction through progressive delivery, and zero cross-entity leakage across 500 adversarial queries. Already in production at Personize.ai.

Multi-agent systems with governance are emerging as a clear pattern for enterprise AI delivery — banking, insurance, regulated workflows in particular.

arXiv

VaultGemma: The World’s Most Capable Differentially Private LLM

Google trained a 1B-parameter model with a mathematical bound on how much it can leak about its training data. VaultGemma uses differential privacy (adding calibrated noise during training), which limits how much any single training example can influence the final model. The privacy guarantee: epsilon <= 2.0, delta <= 1.1e-10. In plain language: Google reports no detectable memorization of training data.
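
"Adding calibrated noise during training" usually means some variant of DP-SGD. The sketch below shows one such update step in plain Python. It illustrates the general recipe only, not VaultGemma's training code: the function name and parameters are assumptions, and the privacy accounting that turns a noise level into an epsilon and delta is omitted.

```python
import math
import random


def dp_sgd_step(weights, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update: clip each example's gradient, sum, add Gaussian noise, average."""
    summed = [0.0] * len(weights)
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Clipping caps how much any single example can move the model.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            summed[i] += g * scale

    # Noise is calibrated to the clipping bound, not to the data itself.
    sigma = noise_multiplier * clip_norm
    batch = len(per_example_grads)
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / batch for s in summed]
    return [w - lr * g for w, g in zip(weights, noisy_mean)]


rng = random.Random(0)
new_weights = dp_sgd_step(
    weights=[0.0, 0.0],
    per_example_grads=[[3.0, 4.0], [0.3, 0.4]],
    lr=0.1,
    clip_norm=1.0,
    noise_multiplier=1.1,
    rng=rng,
)
```

A higher noise_multiplier buys a smaller epsilon at the cost of noisier updates, which is the capability gap the next paragraph describes.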

The catch: it’s not as smart as today’s best models. Google is honest about it — current DP-trained models perform roughly like non-private models from 5 years ago. But the gap is closing. And for regulated industries like banking and healthcare, where “good enough + guaranteed private” beats “amazing but might leak” — this is a big deal.

Google Research

OpenAI Model Spec: How Should AI Behave?

OpenAI published a 100-page framework defining exactly how their models should behave — who they listen to, what they refuse, and how they resolve conflicting instructions. It’s built around a chain of command: safety first, then OpenAI’s policies, then developer rules, then user preferences. The whole thing is public so researchers and policymakers can “read, inspect, and debate” it.

Interesting data point: current compliance rates range from 72% (GPT-4o) to 89% (GPT-5 Thinking). So even with a spec, models don’t follow it perfectly. The gap between “intended behavior” and “actual behavior” is itself a research problem.

Time | OpenAI

New Models

Xiaomi MiMo-V2-Flash
A phone company just built the #1 open-source coding model. MiMo-V2-Flash scores 73.4% on SWE-Bench Verified, beating every other open model. It’s a 309B MoE model (15B active parameters) with a 256K context window. Price: $0.10 per million input tokens, roughly 3% of what Claude Sonnet charges for comparable coding performance. Open-source under MIT license.

In edition #002 we covered Qwen beating big models on narrow tasks. In #003, Qwen nearly matched Claude Opus on SWE-bench. Now Xiaomi enters the ring. The Chinese open-source wave is widening from Alibaba alone to multiple hardware companies.

Reddit | GitHub

Google Gemini 3.1 Flash Lite
Google’s answer to the pricing war. 2.5x faster than Gemini 2.5 Flash, $0.25/M input tokens, 381 tokens/sec, 1M context window. Beats GPT-5 mini on 6 out of 11 benchmarks. Google’s clearly going after high-volume enterprise workloads where speed and cost matter more than peak intelligence.

OpenAI GPT-5.4 mini + nano
OpenAI going smaller too. GPT-5.4 mini is 2x faster than GPT-5 mini, optimized for coding, computer use, and subagents. Nano goes even smaller at $0.05/M input tokens. Everyone’s racing to the bottom on cost.

The trend is clear: the pricing war is collapsing the cost curve for capable AI. What cost $3/M tokens last year costs $0.10 now.
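
A quick back-of-the-envelope check on that cost curve, using only the input prices quoted in this section. The monthly token volume is a made-up workload, and output-token pricing is ignored, so treat the totals as illustrative rather than as a budget.

```python
# Input prices in USD per million tokens, as quoted in this section.
PRICES = {
    "last-year baseline": 3.00,
    "Gemini 3.1 Flash Lite": 0.25,
    "MiMo-V2-Flash": 0.10,
    "GPT-5.4 nano": 0.05,
}

MONTHLY_TOKENS = 2_000_000_000  # hypothetical workload: 2B input tokens per month


def input_cost(tokens: int, price_per_million: float) -> float:
    """Cost in USD of sending `tokens` input tokens at the given price."""
    return tokens / 1_000_000 * price_per_million


baseline = PRICES["last-year baseline"]
for name, price in PRICES.items():
    cost = input_cost(MONTHLY_TOKENS, price)
    print(f"{name}: ${cost:,.0f}/month ({baseline / price:.0f}x cheaper than baseline)")
```

At these prices the $3 to $0.10 move is a 30x drop on input tokens alone.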

Fun Break

AI Makes Music Now (For Real This Time)

Remember Lyria 3 from our first edition? Back then, Google’s music AI could generate 30-second clips — fun to play with, but not exactly a song.

One month later: Lyria 3 Pro generates full 3-minute tracks with intros, verses, choruses, and bridges. It actually understands song structure now — you can prompt for specific musical elements and get something that feels composed, not just generated.

Available in the Gemini app for paid subscribers, and on Vertex AI for businesses who need audio at scale (think game soundtracks, video platforms).

From 30-second jingles to full songs in a month. That’s the pace of AI right now.

Google Blog

In the Background

Sora is dead. OpenAI is shutting down its video generation app — just six months after launch. Sora shot to #1 on the App Store in September, but by January downloads had dropped 45%. Disney was supposed to invest $1 billion and license Mickey Mouse for Sora content — the deal never closed. OpenAI says the research team will refocus on “world simulation for robotics.” Translation: the compute was too expensive for a product that wasn’t sticking.

AI at Tenvalleys

Introducing 10vOS

At Tenvalleys we’ve been building our own AI operating system — 10vOS. It’s a system of specialized agents, skills, and automations that powers how we build client presentations, proposals, interactive dashboards, landing pages, branded documents — and yes, this very newsletter. The data collection, article deep-reading, and trend analysis behind every edition is orchestrated by 10vOS; a human editor does the final pass.

The bet is simple: AI is most useful where it automates the parts of work that drain time without adding judgment, leaving humans more space for the judgment calls. We’re using 10vOS internally first because we’d rather find the bruises ourselves than on a client project.

If you’re thinking about building something similar inside your own organisation — or you’d like to see what 10vOS already does in practice — reach out at contact@tenvalleys.com.

See you next week.

Prepared at Tenvalleys — a delivery-first AI engineering partner — by Nikola Powałka. Feedback? Email us at contact@tenvalleys.com or reach out on LinkedIn.