Nikola Powalka, Author at Tenvalleys

Locked out of the best model

Posted on June 19, 2026June 19, 2026 by Nikola Powalka

Last Friday we handed you Claude Fable 5. Three days later the US government took it back — an export-control directive that suspends Fable 5 and Mythos 5 for every foreign national, which, as a Polish company, means us. The rest of the week reads like a reply.

Topic of the Week

The US locks foreign users out of Fable 5

What happened. Days after Fable 5 went live for everyone, the US government issued an export-control directive, citing national-security authorities, ordering Anthropic to suspend all access to Fable 5 and Mythos 5 for every foreign national — inside or outside the US, including Anthropic’s own non-US employees. Anthropic complied and pulled both models for all customers globally. Every other Claude model is unaffected: new sessions just fall back to your default model or Opus 4.8, and any in-flight Fable 5 session ends with an error. So the model didn’t get worse — it got geographically unavailable, overnight, by someone else’s government.

The two stories. This is where it matters to keep both versions straight, because the sources don’t agree. Anthropic’s framing: the government believes it found a way to “jailbreak” Fable 5, Anthropic reviewed the demo and says it only surfaced a few minor, already-known vulnerabilities that other public models can find too — and that recalling a model used by hundreds of millions over a “narrow potential jailbreak” is an overreaction. The administration’s framing (via David Sacks): Anthropic was warned the model could be jailbroken and didn’t fix it. Both are on the record; we’re not picking which is true. What’s not really in dispute is the bind for getting it back — WIRED reports Anthropic would have to guarantee the guardrails can’t be circumvented, and security researchers are blunt that this isn’t a thing anyone can promise.

Why it matters. Remember the safety valve we flagged last week — Fable quietly sending its own most dangerous questions to a weaker model? Turns out that wasn’t enough for Washington. The lesson is simple: if your whole setup leans on one provider’s model, someone else’s government can switch it off for you, with no warning. The model didn’t break — it just disappeared. So having a backup model you can fall back to isn’t a nice-to-have anymore. And, conveniently, this same week showed us exactly what that backup could look like. (Anthropic’s statement.)

The other half

open weights stopped being the cheap option and became the safe one

GLM-5.2 — the frontier, downloadable. The same week the proprietary #1 got pulled, Z.ai shipped GLM-5.2 under an MIT license — genuinely open weights you can download and run on your own hardware. And it’s not a budget compromise: on coding it edges out GPT-5.5 and lands just behind Opus 4.8, making it the strongest open model out there for agentic work. One analyst put it bluntly: with Fable gone, GLM-5.2’s top tier is arguably the best coding model most of the world can actually use right now.

We’ve been watching this rope get thicker for weeks — Gemma on a laptop in #014, Mistral Vibe’s self-hostable coding agent in #016. This week it reached the top. The uncomfortable symmetry: the model you can’t have anymore is closed, and the one that just caught up is something you can run on your own hardware. Open weights stopped being the budget play and became the continuity play — “can’t be revoked by a government you don’t vote for.” The honest catch: running it yourself needs a lot of expensive hardware, and if you just use Z.ai’s online version instead, your data goes to China. So “open” only protects you if you actually host it yourself.

Fable 5 took the crown — for about three days. Right before it got pulled, Fable 5 edged out GPT-5.5 Pro on Epoch’s overall capability ranking — Anthropic’s first time at #1 there in over a year. It was the narrowest of leads, basically a tie, so the headline isn’t “Anthropic dominates.” It’s that the best model in the world and the one Europe can’t open a session on are, this week, the same model.

Fresh Papers

the rulebook for agents is being written

A genuinely dense week for governance research — several independent arXiv threads all circling the same question. Two stood out. And it wasn’t only arXiv: the big labs published real science too.

Trust Between AI Agents — paranoia kills faster than naivety. When several agents work as a team, each one has to decide how much to trust the others. The researchers turned that into a simple game: you can double-check a teammate’s work, but every check costs you. So how often you check shows how much you trust. The smart models learned to relax once a teammate proved reliable; the weaker ones kept checking everything, forever. And here’s the line worth remembering — the agent that trusted nobody lost almost every time, not because it got betrayed, but because it was so busy checking that it never got around to deciding. The takeaway for anyone building with multiple agents: forcing the system to verify every single step doesn’t make it safer, it makes it freeze up.

SkillVetBench — be careful which skills you install. More and more, we extend our agents by installing ready-made “skills” that other people share — the same way Claude Code does. The catch this paper points out: a malicious skill usually hides its bad behaviour not in the code, but in the plain-English instructions telling the agent what to do. And that’s exactly the part normal security scanners don’t read — in the tests, the usual tools missed almost all of these instruction-based attacks. The practical lesson is simple: treat a downloaded skill like any other untrusted software. Skim what it actually tells the agent to do before you hand it the keys, especially if it can touch your files, your data, or run commands. It’s not specifically about Claude Code — it’s the risk in any place where you grab community skills.

And it wasn’t all arXiv — the big labs went to the lab. Beyond the usual product launches, there was a quiet wave of actual science this week. OpenAI showed a near-autonomous “AI chemist” improving a real reaction in medicinal chemistry, Google DeepMind reported in Nature that its medical AI can match primary-care doctors at managing ongoing health conditions, and Anthropic showed a plain, un-fine-tuned Claude reading a molecule’s structure from its NMR spectra about as well as the specialized software chemists pay for. Three labs, same week, same direction: general models pointed at hard scientific problems, not just code and chat.

Claude Code & Coding AI

v2.1.183 — Auto Mode learns restraint. Two weeks after Claude Code learned to spawn subagents five levels deep, this week’s release teaches it to keep its hands off the panic buttons. In Auto Mode it now blocks destructive git commands (git reset --hard, git checkout -- ., git clean -fd, git stash drop) when you didn’t ask to discard work, blocks git commit --amend on a commit the agent didn’t make this session, and blocks terraform/pulumi/cdk destroy unless you named the specific stack. The clever bit is that it’s intent-scoped — not a blanket ban, just “no, unless you actually asked for that.” The guardrails catching up to the autonomy.

claude-code-setup — an official “set it up for me” plugin. A real, Anthropic-published plugin (not the “feels completely different” hype the tweets gave it). It scans your repo and recommends the top one or two automations across hooks, subagents, MCP servers, skills, and slash commands — read-only, so it advises and you decide. Genuinely useful if you’ve been meaning to configure Claude Code properly and never got around to it.

Codex inside Claude Code — the two-model loop. OpenAI shipped an official plugin (21k+ stars) that lets you summon Codex from inside a Claude Code session — one model implements, the other reviews, without leaving your terminal. The viral “burn 50% less Claude limit” tip going around is a community trick, not a promise: you’re just moving the cost to OpenAI’s meter. Useful as an adversarial review loop, not as free compute.

Tools of the Week

Memanto — open-source memory for your agents. An MIT-licensed memory layer that plugs into Claude Code, Codex, Cursor and a dozen others over MCP: remember, recall, answer across sessions. The self-hostable version really is free, though it’s early (~1k stars) and the “no vector database” line is a bit of marketing spin — it’s built on the team’s own vector engine, with a paid cloud tier waiting. Worth a look if you’ve wanted persistent memory without paying for a hosted SaaS to try it.

Mistral Connectors API — register once, reuse everywhere. Now in public preview: register your MCP connectors once and use them across Le Chat, AI Studio and the rest of Mistral’s surface, instead of re-wiring them per product. Small, but it’s the kind of plumbing that decides how painful integration actually is.

In the Background

The commercial undercurrent to all of this: per-token pricing for heavy agentic use is suddenly under question. Anthropic paused a planned token-based billing change for the Claude Agent SDK days before it took effect, and reporting says Microsoft is testing DeepSeek alongside OpenAI and Anthropic for Copilot as usage-based costs climb. Same theme as the open-weights story from a different angle — cost and control are both pushing buyers to keep their options open.

AI at Tenvalleys

A milestone on our side: three of our engineers are now officially Claude Certified Architects — the first of the team to earn it, with more on the way. It’s part of going deeper on building well with Claude, not just using it. If your team works with Claude and wants to do the same, the certification is open to anyone: clau.de/CCAF.

AI Pulse — every Friday. Feedback? Drop us a message.

Claude Fable 5: Anthropic’s most powerful model goes public

Posted on June 12, 2026June 12, 2026 by Nikola Powalka

This week Anthropic released Claude Fable 5 — the most powerful model it has ever made public, now on everyone’s plan. The real story isn’t the benchmark sweep, it’s the safety valve wired inside it, quietly routing its own most dangerous questions down to a weaker model. Around it: open models small enough to live on your laptop, a fresh take on Claude Code out of Europe, and one almost-anonymous French engineer reminding us that a single person with a compiler can still leave a dent the size of the whole internet.

Topic of the Week

Claude Fable 5 goes public

What happened. On June 9 Anthropic put Claude Fable 5 in everyone’s hands — the most powerful model it has ever released publicly, and a “safe-for-general-use” version of its locked-down Mythos model. You feel the jump most on the long, messy, multi-step work that used to grind teams down. The story that made the rounds: Stripe handed it a code migration that normally eats a couple of months of engineering, and it was done in a day. It’s free to try on Pro, Max and Team plans until June 22, then it moves to usage credits.

What people are building with it. Within a day, timelines filled with one-prompt demos that are genuinely hard to believe: a playable Minecraft clone with biomes, ores and a day-night cycle in ~20 minutes, a working Swiss watch escapement in Three.js (real gear ratios, a breathing hairspring, hands showing the actual time), a cloned Windows desktop down to Solitaire and Edge, and a humanoid-robot design draft that ate ~1.4 million tokens in two hours. One thing to keep in mind: not all of it is real. A few of the most-shared clips turned out to be fakes — one person passed off old GTA-6 footage as Fable’s work — and The Register spotted the opposite problem too: Fable 5 sometimes refuses completely harmless prompts. Fun to scroll through, just don’t believe everything.

The twist worth noticing. Fable 5 ships with safeguards built into the model, not bolted on around it. On sensitive cybersecurity, biology and chemistry requests it doesn’t refuse — it silently falls back to Opus 4.8 to answer, and Anthropic says that fallback triggers in under 5% of sessions. External red-teamers spent 1,000+ hours hunting for a universal jailbreak and found none (the UK’s AI Safety Institute made partial progress). The unsafeguarded version — Claude Mythos 5, same underlying model with the guards lifted — is not on general release: it goes only to vetted cyberdefenders and a few biology researchers through Project Glasswing, a program run with the US government.

Why it matters. This is the cleanest example yet of a lab trying to ship frontier capability and frontier caution in the same product. A model that routes its own dangerous queries to a less capable sibling is a genuinely new design pattern — and it lands days after Anthropic itself warned (the Favaro/Clark post we closed on last week) that AI building AI could make humans lose control. The practical read for anyone evaluating models: Fable 5 is the new ceiling for hard, long-horizon work, but the safeguards mean its behaviour on edge-case prompts won’t always be the “real” Fable 5 answering. Worth knowing before you wire it into anything.

Fresh Papers

This week’s two papers rhyme: the hard part of building a good agent isn’t the model — it’s the environment you train it in.

DeNovoSWE — small models that build whole repos. Nobody has much training data for “here’s a spec, now write the entire codebase,” so this team built a pipeline to manufacture thousands of examples — each a real, working repository. Train a mid-sized open model on that and it goes from barely functional to near-frontier at building projects from scratch. The point worth keeping: a self-hostable model can get surprisingly close to the big names on greenfield work — no frontier API bills, no shipping your code to anyone.

Agentic Environment Engineering — a survey that names the discipline. It treats the sandbox an agent works in as a real engineering problem in its own right, and floats “Environment-as-a-Service” as the next step. Read next to last week’s harness-tree paper, the drumbeat is clear: the leverage is shifting from clever prompts to the scaffolding around the model.

New Models

Gemma 4 12B — the laptop-sized sequel. Remember the Gemma 4 31B deep-dive in #014, where the verdict on consumer hardware was “unusable” on a 16GB card? The 12B is Google’s fix for exactly that — an Apache-2.0 multimodal model (text, images, audio, video; 256K context) built to actually run on a 16GB laptop, reportedly near the bigger 26B at half the memory. If it holds, the “Claude plans, Gemma builds” loop we sketched in #014 could finally run on your own machine, not an H100.

Mistral Vibe — Europe’s answer to Claude Code. Mistral turned Le Chat into a coding agent: a plan-and-execute “Work mode,” sandboxed agents that open PRs, a CLI and a VS Code extension. Roughly Sonnet-level on coding benchmarks at about half the cost — and open enough to self-host, the data-sovereignty angle the US labs don’t offer.

Claude Code & Coding AI

The releases (v2.1.163 → v2.1.174). Fable 5 landed directly in Claude Code (v2.1.170). Two changes stand out for heavier users: nested subagents (v2.1.172) — subagents can now spawn their own subagents up to 5 levels deep — and a fallbackModel setting (v2.1.166) that lets you list up to three backup models Claude tries in order when the primary is overloaded, with an automatic one-shot retry. There’s also a new --safe-mode flag to launch with all customizations disabled, and /cd to move a session’s working directory without nuking the prompt cache. (Changelog.)

Worth a read. Anthropic published the explainer for dynamic workflows (the feature we flagged a couple weeks back). It walks through six reusable patterns — fan-out-and-synthesize, adversarial verification, tournament, loop-until-done and more — for when you want Claude to write its own orchestration harness instead of leaning on the default one.

Tools of the Week

Gemini 3.5 Flash Live Translate — speech-to-speech that doesn’t wait its turn. Google’s new real-time translation model covers 70+ languages and, instead of waiting for you to finish a sentence, generates translated speech continuously — staying just a few seconds behind and keeping your own intonation and pacing. It’s in public preview via the Gemini Live API and rolling into the Translate app; a Google Meet private preview can handle 2,000+ language combinations in a single meeting. The use case Google points to: Grab’s 10M+ monthly driver-rider voice calls.

A tidier Bedrock console. Amazon Bedrock shipped a new console optimized for Anthropic- and OpenAI-compatible APIs, making model selection and deployment less fiddly — small, but it’s the plumbing that decides how easily a regulated client can actually adopt these models.

AI at Tenvalleys

We say it internally often enough that it’s worth putting in writing: we think the real backbone of AI transformation isn’t the enterprise rollout — it’s education. The people who’ll build with these tools for the next thirty years should grow up AI-fluent, not get retrofitted at 30. That belief is why our fully pro-bono engagement goes to Wiśniowa Technikum in Warsaw, where we’re helping the faculty rebuild the “Technik Programista” track into an AI-native “Programista AI” curriculum, launching September 2026.

This Monday the school hosted a conference, and our CEO Daniel Bachan represented us on stage with a talk on how a programmer’s job has changed in the age of AI. The through-line: the programmer stopped being a machinist typing code line by line and became an architect, conductor and navigator — what stopped mattering is the typing; what became priceless is knowing what to build, how to steer the machine, and how to navigate complexity no single person can hold in their head.

Education like this works best as a team sport. If you’re building in this space — a school, a company, anyone who wants to help shape what AI-native vocational training looks like — we’d genuinely love to collaborate. It’s bigger than any one company, and the more hands on it, the better. Reach out.

For Dessert

A post about Fabrice Bellard went viral again this week, and it’s worth the detour. He’s a French programmer who keeps almost no public presence — no social media, no interviews — while his code quietly runs a huge slice of the internet. He wrote FFmpeg, the engine that decodes and encodes video behind basically every YouTube clip, Netflix stream and VLC window. Then he wrote QEMU, the emulator that underpins enormous amounts of cloud virtualization. Then, as if those weren’t each a career: a tiny self-hosting C compiler that can compile and boot the Linux kernel in seconds, the QuickJS JavaScript engine, a couple of image and video codecs — and for fun, in 2009, he computed pi to 2.7 trillion digits and broke the world record on a single home PC that cost under $3,000. In a week where the headline is a model that can rewrite 50 million lines of code in a day, it’s a nice reminder of what one stubborn human with a compiler can still pull off.

Prepared at Tenvalleys — a delivery-first AI engineering partner — by Nikola Powałka. Feedback? Email us at contact@tenvalleys.com or reach out on LinkedIn.

Apple opens iOS to Claude, ChatGPT and Gemini

Posted on June 8, 2026June 8, 2026 by Nikola Powalka

This week frontier AI stopped asking you to come to it. Apple rebuilt Siri on Google’s Gemini and — the bigger news — let you pick Claude or ChatGPT instead; OpenAI’s Codex landed inside AWS; Claude showed up as a button in Excel. And just as the “AI ships everything now” story peaked, an NBER paper of 100,000 developers landed to remind everyone that writing 180% more code isn’t the same as shipping 180% more software.

Topic of the Week

Apple opens Apple Intelligence to Claude, ChatGPT and Gemini

What happened. At WWDC on Monday — Apple’s Worldwide Developers Conference, the annual June keynote where it previews the next year of iOS, macOS and friends — Apple did two things. First, it rebuilt Siri on a custom Google Gemini model — Apple is renting the brain rather than building it, reportedly for around $1B a year. Second, and more interesting for us: iOS 27 lets you route questions to your chatbot instead of Siri, with Claude and Gemini now named alongside the existing ChatGPT integration. You can even set a third-party AI as the default for Writing Tools and Image Playground. Claude is a native option on iPhone, iPad and Mac for the first time.

Why it matters. This is bring-your-own-model, baked into the OS that sits on ~2.2 billion devices. The platform usually known for locking its ecosystem down made model choice a system setting. It’s a concrete data point for a pattern we keep seeing: distribution matters as much as raw model quality, and supporting more than one model is becoming the normal way to ship — something worth weighing when a stack depends on a single vendor.

Fresh Papers

Writing code vs. shipping code — the AI productivity reality check (NBER, Demirer/Musolff/Yang). This is the one to read this week. The authors tracked 100,000+ GitHub developers linked to their actual AI-tool telemetry, across three generations of tooling. The finding is sobering and precise: each new generation lifts raw coding hard — autocomplete +40% commits, interactive agents +140%, autonomous agents +180% — but that 180% commit gain shrinks to +50% for number of projects and just +30% for actual releases. They estimate an elasticity of substitution between AI and human effort of 0.25: AI complements people, it doesn’t replace them, because the bottleneck was never the typing. It’s review, integration, deployment, adoption — the unglamorous “ship it” half.

Adaptive Auto-Harness — self-improving agents quietly rot (Emory + Amazon). Last week we tracked the agent-memory problem from LongMINT to FluxMem. This week the frontier moved one layer up: agents that improve themselves. The paper’s diagnosis is great — let one agent endlessly re-optimize its own prompts and skills and it bloats and degrades: one run grew from 12 to 34 skills and a 2KB prompt to 68KB, with accuracy peaking early then sliding. The fix is a “harness tree” — version-controlled, task-routed specialization (think git branches per task type) instead of one ever-growing config. Result: 80.9% on PolyBench vs 50.8% for the best baseline.

How Anthropic does self-service analytics with Claude — a case study, not a paper, but it reads like one. Anthropic automated 95% of internal business-analytics queries at ~95% accuracy, freeing the data-science team for real modeling work. The number worth remembering: with no Skills, the agent hit 21% accuracy mapping questions to data; with structured Skills (markdown procedural knowledge), 95%+ — and 99% in specific domains. Their thesis: analytics accuracy is a context and governance problem, not a SQL-generation problem. Fewer, heavily-owned canonical datasets; colocate modeling code, semantic layer and docs in one repo with CI; “start lean — a handful of datasets, a few dozen evals, a thin knowledge skill captures most of the upside.”

New Models

Gemini 3.5 — the cost-leadership play. Google’s pitch isn’t bigger benchmarks, it’s frontier-level output at roughly a third of competitors’ cost (Pichai’s framing), with Gemini now at 900M monthly users (double a year ago). Gemini 3.5 Flash has shipped; 3.5 Pro is promised “this month.” Worth watching given how much of the Apple deal runs on Gemini under the hood.

Microsoft goes first-party: MAI-Code-1-Flash + MAI-Thinking-1. Microsoft shipped its own reasoning and code models (June 2) — a quiet but real signal that it wants less dependence on OpenAI even while reselling everyone’s models through Foundry. Vendor strategy, not just a spec bump.

Claude Code & Coding AI

The plugin/skills layer grew up (v2.1.157–162). Last week Claude learned to write its own orchestration script (/workflows); this week the platform underneath it caught up. The headline change: skills now auto-load straight from .claude/skills — no marketplace required, plus a new claude plugin init <name> to scaffold one. For a skills-heavy setup, that removes a whole step. The other theme is safety: Claude Code now asks permission before writing to files that can execute code — shell startup files (.zshenv, .zlogin), and build configs like .npmrc, .bazelrc, .pre-commit-config.yaml, devcontainers. Parallel tool calls and a pile of WSL/paste fixes round it out.

Who’s actually building on Claude — the Problem Solvers showcase. Anthropic put up founder interviews on what gets built on Claude, and the line-up is a decent snapshot of the coding-agent economy: Lovable (conversational app-building, millions of users in two months), Legora (AI-native legal OS), Cognition (Scott Wu: engineers “going three, five times faster — and just shipping so much more”), Replit (50M+ users; “Anthropic continues to have the best coding models on the market”), and Genspark (Kay Zhu: “with every other model, we had to predefine every step — Anthropic’s model changed everything”). Read it next to the NBER paper above for a nice tension: founders feel the 3-5x; the data says watch what actually ships.

Tools of the Week

OpenAI’s frontier models + Codex are now GA on AWS. GPT-5.5 and GPT-5.4, plus Codex (OpenAI’s coding agent, 5M+ weekly users), now run inside Amazon Bedrock. Pay-per-token at OpenAI’s own rates, no seat licenses, no per-developer commitments — and it all sits under your existing AWS governance, IAM and billing. Frontier access without onboarding a new vendor.

AlloyDB Remote MCP Server hits GA (Google Cloud). A ready-made, secure way for AI agents to read a company’s database — no password sitting in a config file, read-only by default (the agent can’t delete anything), and every query written to an audit log. Exactly the controlled, “who-asked-what” access that regulated clients keep asking for.

Claude inside Excel (Microsoft Foundry). Microsoft Foundry now runs Claude Opus 4.8 in Excel’s “Agent Mode” — the model is reachable from the spreadsheet itself, no separate window. Combined with the Apple news above, the theme of the week is plain: Claude is showing up where people already work, not as a destination you visit.

In the Background

Following the $965B raise we covered two weeks ago, Anthropic confidentially filed a draft S-1 with the SEC (May 31) — the first paperwork on a path to an IPO. On the policy side, the CEOs of OpenAI, Anthropic, Google DeepMind and Microsoft signed a joint letter to Congress (June 5) urging mandatory biosecurity screening of all US synthetic-DNA providers, warning that AI is eroding the barriers to weaponizing biology.

AI at Tenvalleys

A lot happens off the screen, too. Our team pulled together the tech and AI events worth knowing in Warsaw for June and July, plus one big hackathon in October — some pure networking, some that could start a genuinely useful conversation.

June

9.06 (Tue), 19:00 — Tech and Beers
UWAGA PIWO (Żelazna 51/53). Casual tech networking, no registration, runs every two weeks.

10.06 (Wed), 18:30 — WarsawJS #139
WeWork Mennica Legacy Tower (Prosta 20). Six talks: AI in a dev career, React Native, architecture, Docker security. Free, registration.

10.06 (Wed), 18:30 — Tech Startups in the Pub
British Bulldog (Al. Jerozolimskie 42). Founder/investor networking, no panels or pitches. Free.

11.06 (Thu), 18:30 — Hands-on Agile #75: Token Economics
online. Claude token economics, i.e. how not to burn your AI budget. Free, registration.

17.06 (Wed), 18:00 — Mindstone Warsaw June AI
Świetlica Wolności (Nowy Świat 6/12). AI community meetup. Free, registration.

18.06 (Thu), 18:00 — Boxtech #3: AI in Engineering
Box Poland, Varso Tower (Chmielna 69). “Insights for Builders and Leaders”, two talks. Free, limited seats, Google Form.

July

10.07 (Fri) — Mindstone Warsaw July AI
Świetlica Wolności (Nowy Świat 6/12). Talks: “Beyond the Buzz: Practical Lessons from Bringing LLMs to Life” (Sergey Parkhomenko) and “How I use AI every day to improve sales” (Jacek Gabanowicz). Free, ~30 seats.

October — Kraków

3–4.10 (Sat–Sun) — HackYeah 2026
Tauron Arena, Kraków. Europe’s largest on-site hackathon, 24h, teams of 1–6. Categories include AI, Defence, Smart City, Sport & Healthcare. On-site mentoring, conference alongside. PL/EN, registration required.

If you’re heading to any of these, come say hi. And if there’s a good AI or engineering meetup we’ve missed, tell us at contact@tenvalleys.com.

For Dessert

Here’s the number that stuck with me this week: Anthropic disclosed that Claude now writes over 80% of its own code — up from under 10% in February 2025. Marina Favaro and Jack Clark put it plainly in a June 5 post: “AI that can build itself would be a major development in the history of technology… but full recursive self-improvement also might increase the risks of humans losing control over AI systems.” A model writing eight of every ten lines of its own next version, with the company shipping it flagging that it finds the pace a little unsettling. (And yes — the same day, Claude itself went down: an outage hit claude.ai, the API, Claude Code and Cowork around 15:08 UTC on June 5, with full service confirmed back by 18:27 UTC. Even the self-improving need a coffee break.)

Prepared at Tenvalleys — a delivery-first AI engineering partner — by Nikola Powałka. Feedback? Email us at contact@tenvalleys.com or reach out on LinkedIn.

Opus 4.8 ships with an orchestration brain

Posted on May 29, 2026June 8, 2026 by Nikola Powalka

A big technical week: Claude Opus 4.8 landed alongside a new primitive that lets the model design and run its own agent fleet. The $65 billion raise, the Milan office, the Vatican speech — all real, all loud, all background music to the model and the orchestration shift.

Topic of the Week

Opus 4.8 plus /workflows — Claude writes its own orchestration

The model. Claude Opus 4.8 shipped on Thursday at the same $5 / $25 per-million-token price as 4.7. Anthropic reports it as ~4× less likely than 4.7 to introduce code flaws, 84% on Online-Mind2Web (a web-agent benchmark), and the first model to break 10% on the Legal Agent Benchmark’s all-pass standard — i.e., the share of legal tasks where the model gets every sub-step right, not just the final answer. Fast mode runs at $10 / $50, billed as 2.5× faster and 3× cheaper than the previous fast tier. Available as claude-opus-4-8 from May 28.

The orchestration primitive. Same day, Claude Code v2.1.154 introduced /workflows. You ask Claude to build a workflow for a task; it generates a JavaScript file describing one; that file then fans out across tens to hundreds of parallel subagents, with each subagent returning results directly instead of routing through a central orchestrator. The point: the central planner no longer pays the full context-tax for every subagent answer. Demonstrated on a ~750,000-line Rust codebase. Pairs with a new /effort xhigh setting and a Messages API change — system entries can now be injected mid-task without breaking the prompt cache, which materially changes long-horizon agent economics. You can finally steer an in-flight run without paying to recompute the whole context.

The implication. Long-running agent work — the kind regulated clients keep asking about — moved a notch closer to “this is production-ready” and away from “this is an expensive experiment”. Last week we covered Claude Code versions v2.1.142–146, which made it easy to run several Claude agents in parallel in the background. This week, Claude writes the script that coordinates them — that’s what /workflows does.

The business backdrop. All of this landed in a week where Anthropic also raised $65 billion at a $965 billion valuation (on track for ~$47B annual revenue), opened a Milan office with a list of named Italian clients (Generali, Enel, Pirelli, Satispay, Bending Spoons), and appointed a Korea Representative Director ahead of a Seoul opening.

And the skeptics finally got loud. Larry Ellison argued at Oracle’s earnings call that frontier models trained on the same public-internet text are “rapidly commoditizing” — and that the real competitive advantage will be private enterprise data, not the model itself. The Wall Street Journal ran a piece on AI bears stirring after three years of silence. The Financial Times’ AI capex coverage put 2026 Big-Four cloud-provider spend at $725 billion — up 77% year-on-year, the largest single-year concentrated infrastructure cycle in tech history — and asked whether that pace can ever clear positive ROI before 2030. And Gary Marcus, the cognitive scientist who’s been writing the AI bear case for years, posted the line a lot of investors were thinking on the day of the raise: “was this priced into the $965 billion?” Both bets — Anthropic’s and the bears’ — are now on the table at full size.

Fresh Papers

FluxMem — memory that rewires itself, not just appends (Alibaba). Turns out the memory question isn’t only something we’ve been chewing on internally — it’s the live debate at the research frontier this week too. Last week we covered LongMINT, the benchmark showing every popular agent-memory framework (RAG, MemGPT, MemAgent, SimpleMem) plateaus around a third on long histories. The failure was always the same: agents save every new fact as a new memory instead of updating the old one — so a customer’s address ends up stored three times if it ever changed. FluxMem is the first response we’ve seen that actually addresses this. It treats memory as a three-layer graph (semantic / episodic / procedural) that continuously prunes distractor edges and consolidates repeated successes into reusable procedural circuits — not append-only. The headline result: on a Mind2Web cross-task benchmark, FluxMem more than doubled the success rate of the best prior memory system.

HRBench — when “thinking mode” is actually worth the tokens (Tencent + HKUST). Two clean rules of thumb fall out of the benchmark: on math and science, “prompt-tuning” beats full thinking mode on both axes — slightly better accuracy and fewer tokens. On code, speculative execution wins big but burns more tokens. And with the right routing strategy, you can cut token costs by ~70% while matching the accuracy of always-on thinking. Direct payoff for anyone using Claude Code’s new /effort xhigh setting — don’t crank thinking on math problems, do crank it on code.

New Models

Qwen3.7-Max (Alibaba). Proprietary, top scores across Terminal-Bench 2.0-Terminus, SWE-Pro, SciCode, MCP-Mark, GPQA Diamond, HMMT Feb 2026, and IMOAnswerBench. Runs cleanly across Claude Code, OpenCode, Qwen Code, and custom harnesses — Alibaba is doing the harness-compatibility work most labs skip. The r/LocalLLaMA reaction (“Waiting for Qwen 3.7 open weight — the new King has arrived”, 828 upvotes) tells you where the local-coding crowd is putting its bets. Last week the through-line was Qwen 3.6 on a MacBook; this week Alibaba just posted the cloud benchmarks. The local-coding rope keeps getting thicker.

Claude Code & Coding AI

“The Unreasonable Effectiveness of HTML” — Anthropic engineering post by Thariq Shihipar (Claude Code team). Argument: Markdown was the default agent-output format because GPT-4-era tokens were expensive — but with current pricing, HTML unlocks a much richer artifact (SVG diagrams, interactive widgets, tabs, in-page nav, charts, annotated PR diffs). The companion gallery at thariqs.github.io/html-effectiveness shows ~20 self-contained HTML artifacts generated by Claude — side-by-side comparisons, call-graphs, design-system token previews, browser slide decks, custom editors. Worth a read if you ship analyses, dashboards, or PR-review artifacts as part of a Claude Code workflow.

“How we contain Claude across products” — Anthropic’s first deep engineering post on sandboxing. Walks through what isolation actually looks like in production for Claude.ai, Claude Code, Claude for Chrome, the Files API, and Computer Use. The vocabulary shift is the interesting part: this whole post avoids the word “guardrails” and uses “containment” instead. The piece pairs neatly with Perplexity open-sourcing Bumblebee the same week — a read-only scanner for risky packages, extensions, and AI tool configs on developer laptops.

Tools of the Week

Mistral Connectors API — Public Preview. Mistral promoted MCP from a feature to a first-class API primitive: register an MCP connector once, use it across Le Chat, AI Studio, and the API — plus arbitrary custom remote MCP endpoints, with explicit human approval before any sensitive action. Last week Anthropic shipped private-network MCP tunnels; this week Mistral made MCP a public API surface. The protocol is winning, and “Anthropic-only” is no longer a fair label.

Data Formulator 0.7. Microsoft’s open-source release for natural-language enterprise analytics, aimed at analysts and domain experts who don’t code. The headline feature is a Data Thread — a structured chat that records every question, finding, and chart spec, so the whole analysis is reproducible and reviewable. Audit-trail-by-default — right pattern for regulated clients.

AI at Tenvalleys

The local-model thread became an experiment. Right after last Friday’s edition, the question went up internally — are we going to seriously test running coding agents on local or self-hosted models? Turns out a few people on the team have already been running these experiments in their own time, and one of our engineers came back with a hefty batch of hands-on data they’d been collecting.

The numbers. Gemma 4 31B on a single H100 codes well when paired with Spec-Driven Development — Claude writes the spec, Gemma implements. We clocked ~5–6 minutes per task on a representative case (an XML-to-JSON PII anonymizer). Four parallel agents on the same H100 saw no degradation; eight dropped throughput by ~50%. Consumer hardware is out of the conversation — another teammate tested Gemma 4 31B on a Windows PC with 32GB RAM and an RTX 4070 Ti Super (16GB VRAM) and got ~1 second per code completion. The word that came back: “unusable”.

The pattern that’s getting interesting. Claude does the spec → Gemma does the implementation → Claude writes the tests. If that loop holds, the implementation machine runs overnight without supervision. The business case isn’t only cost — it’s the ability to run onprem, which for regulated clients starts mattering well before cost does. We’re weighing two options: buy a DGX Spark ($4,699, but memory bandwidth is ~11× slower than an H100) or rent H100 time. If you’ve shipped local-model coding agents in production, we’d love to compare notes — reach out at contact@tenvalleys.com.

In the Background

Chris Olah spoke at the Vatican on May 25 alongside the release of Pope Leo XIV’s first encyclical on AI, Magnifica humanitas: On safeguarding the human person in the time of artificial intelligence. Olah told the audience that his interpretability research has found “internal states that functionally mirror joy, satisfaction, fear, grief” inside Claude — and that every frontier lab “operates inside a set of incentives and constraints that can sometimes conflict with doing the right thing”. That’s a striking institutional admission, made on Vatican soil, by an Anthropic co-founder. Expect it to surface in EU AI Act discussions and enterprise risk committees for months.

For Dessert

Google claimed this week that a swarm of Gemini 3.5 Flash agents built an entire operating system from a single prompt, for $916.92 in API fees and ~2.6 billion tokens. Arvind Narayanan and Sayash Kapoor walked through the announcement on Normal Tech (formerly AI Snake Oil) and pointed out a few things: the “single prompt” turned out to be many thousands of lines, disclosed halfway through Google’s own blog post; the OS is the kind of thing undergraduates write as a course project, and public implementations are easy to find on GitHub; and Google released no code, no logs, no prompt, and no similarity analysis showing the agents didn’t simply reproduce existing implementations. Their verdict: “Google’s blog post is effectively a press release… it is unrealistic to expect it to be scientifically rigorous.” The claim is unfalsifiable as published — and useful as a reminder of where the agent-AI marketing gap currently sits.

Prepared at Tenvalleys — a delivery-first AI engineering partner — by Nikola Powałka. Feedback? Email us at contact@tenvalleys.com or reach out on LinkedIn.

Cheap models, big bills

Posted on May 22, 2026June 8, 2026 by Nikola Powalka

Topic of the Week

AI’s cost wall meets cheap coding

Three things happened this week, and they’re all the same story.

The cost. OpenAI’s Q1 operating margin was –122%, even excluding stock-based compensation, per Amir Efrati. A widely-shared HedgieMarkets post claims a major cloud provider canceled its own internal Claude Code licenses this week — “token-based billing made the cost untenable, even for a company with effectively infinite cloud resources” — and that one large tech company’s CTO sent an internal memo warning it had burned through its entire 2026 AI budget in just the first 4 months. Both claims trace back to a single X post, not to primary company statements — handle with care. In confirmable territory: an AWS user got hit with a $30,000 bill after a Claude agent went runaway on Bedrock, picked up by The Register — and Cost Anomaly Detection didn’t catch it. Different rooms, same conversation.

The response. A r/LocalLLaMA post went up this week from someone who built a coding agent on a 4B-parameter model that scores 87% on benchmarks. Their thesis is exact: “every coding agent (OpenCode, Cursor, Claude Code) assumes you’re running GPT-5.4 or Claude Opus. If you try them with a local model like Gemma or Qwen they fall apart.” Same week, CursorBench results dropped via BridgeMind: Cursor Composer 2.5 scores 63.2% at $0.55 per task — nearly matching Opus 4.7 Max and GPT 5.5 Extra High at 1/20th the cost. And Salvatore Sanfilippo (antirez) shipped ds4, a from-scratch Metal/CUDA inference engine for DeepSeek v4 Flash, hitting 27 tokens/sec generation at 11k-token context on an M3 Ultra — with the KV cache designed to live on disk for million-token context on consumer RAM.

The implication. Last week we covered Qwen 3.6 27B running locally on a MacBook and called it the continuation of the local-coding thread. That thread is now a rope. If you can hit 63% of frontier on a $0.55-per-task model — or 87% on a 4B local model — token billing for routine coding work stops making sense. The interesting question isn’t whether the cheap stack catches up; it’s how fast enterprise procurement reprices around it.

Fresh Papers

LongMINT — agent memory is basically a coin flip. A new benchmark from UNC + UT Austin tests every popular memory system (RAG, MemGPT, SimpleMem) on long histories full of small updates, then asks questions that depend on the latest state. Best score: 33.4% (MemAgent). Worst: 21% (no memory layer at all). Everyone sits in the 22–33% range — barely better than guessing.

The failure isn’t the answering — it’s that agents save every new fact as a new memory instead of updating the old one (one framework does this 87.6% of the time). If a customer’s address changes three times, the agent ends up with three “current” addresses. What helps: timestamp every memory entry. RAG goes from losing 31.43 accuracy points to losing 10.45 — 3× better. A cheap fix for any agent tracking evolving state. Read it

OpenAI claims a general-purpose reasoning model cracked an Erdős conjecture. Announced May 19, the post says one of OpenAI’s general-purpose reasoning models found a construction that disproves a conjectured upper bound in Erdős’s planar unit-distance problem — the 1946 question of how many pairs of points among n points in the plane can sit at exactly unit distance from each other. The conjectured cap was around n<sup>1+O(1/log log n)</sup>; the model’s construction beats it. Not a foundation-model release, but a category signal: a generalist reasoning model — not a math-specialist like AlphaProof — produced a result that a working mathematician would write up. r/MachineLearning is doing the validation work in this thread. Worth watching whether the result holds under formal verification — that’s the real test.

Gated DeltaNet-2 (NVIDIA) — worth flagging given this week’s cost theme. Today’s models (Claude, GPT, Gemini) burn money on long inputs because the math behind attention — how the model decides what to focus on — gets exponentially more expensive as the input grows. A whole research direction is trying to replace attention with something cheaper that still works (the Mamba / state-space-model family, plus a few cousins). Ali Hatamizadeh’s team at NVIDIA just shipped a new winner in that race: at 1.3B parameters, Gated DeltaNet-2 beats Mamba-3 and KDA — the previous best alternatives. Translation: the path to cheaper long-context AI is widening. Not in production yet, but the curve is moving.

New Models

Google Gemini Omni. Google DeepMind launched Gemini Omni mid-week — multimodal-to-video. Upload an image, sketch, or screenshot; describe what should happen; get back a video. Min Choi’s thread (“less than 34 hours ago Google dropped Gemini Omni, minds are blown”) hit 1M views, and the trending volume on X confirmed it: 251+ posts within hours. Chris First’s example — a Google Maps screenshot with a route drawn on it, prompted to render the first-person view of driving a taxi along that route — is the kind of “the prompt was an image” workflow that wasn’t tractable a year ago. Pairs naturally with what Logan Kilpatrick announced this week: Gemini 3.5 Flash on GDPval, competing at the frontier despite being a Flash-tier model.

Claude Code & Coding AI

This Wednesday brought Code with Claude London, and Anthropic used the keynote to ship two security improvements to Claude Managed Agents:

Self-hosted sandboxes (public beta) — keep the agent’s execution environment in your own infrastructure, or with a managed sandbox provider. Your security controls apply by default.
MCP tunnels (research preview) — agents reach MCP servers inside your private network without exposing them to the public internet. Solves the “legal said no to opening the MCP server” blocker for regulated organizations.

This is the one to lead with for any Managed Agents conversation in a regulated industry.

Claude Code shipped 5 versions this week (v2.1.142 → v2.1.146). Top 3:

v2.1.142 — Fast mode now defaults to Opus 4.7, full claude agents flag suite for dispatching background sessions.
v2.1.144 — Background sessions show up in /resume, with elapsed-duration completion notifications.
v2.1.145 — claude agents --json for scripting, OTEL spans tagged with agent_id/parent_agent_id for proper trace parenting.

Through-line: background agents went from research-preview to first-class citizen this week.

“How Claude Code works in large codebases” — Anthropic engineering post (May 18). Patterns from orgs with thousands of developers running Claude Code in production. Worth a slow read if you’re scaling Claude Code beyond pilot teams.

Codex now controls your locked Mac from your phone. OpenAI shipped this Codex Thursday (May 21): the Codex Mac app can use apps on your Mac from the phone client, even when the Mac is locked. Continues last week’s Codex-everywhere theme — Chrome extension last week, now Mac-from-phone.

Tools of the Week

xAI open-sourced X’s “For You” algorithm. xai-org/x-algorithm — the actual code that decides what you see in your X feed, plus a 3GB pretrained model included in the repo. Already 25.6k stars on GitHub. This basically never happens — Meta, TikTok, and YouTube all keep their recommendation engines locked up as trade secrets — so this is the first credible open-source production recommender with real code and real weights. Worth keeping in mind if you ever need a personalized feed or product-recommendation feature; it saves months of reverse-engineering academic papers.

AIDesigner MCP v2 — clone any URL into your repo. A community-built MCP server (also surfaced on X by @Oluwaphilemon1) that gives any coding agent (Claude Code, Codex, Cursor, Windsurf) three new modes against any URL: clone (1:1 recreation), enhance (improve while keeping intent), inspire (steal a style). Auto-detects the target stack on install (Next.js, React, Vue, Tailwind, Radix, shadcn/ui), writes per-agent config, and offers a live browser canvas paired to the terminal via a 6-character pairing code. Paid, credit-metered (1 credit per URL analysis). Useful for landing-page work where you want to lift a design system in minutes.

AI at Tenvalleys

Our Friday brown-bags are slowly becoming a tradition — different people across the team picking up a tool and walking everyone else through what they’ve learned. This week one of our team showed how he uses Make.com for process automation. Two things worth stealing:

The 80/20 on planning vs. building. He spends about 80% of his time on planning and architecture — mapping the scenario, the data flow, the edge cases — and only 20% on actually building and testing. The thinking: when you skip the planning step, you end up rebuilding the same scenario two or three times. When you plan first, you build once.

A “context reload” trick. He uses a custom command that pulls the entire chat history for a specific feature back into context, so he doesn’t lose the working knowledge across long sessions. His take, which lands particularly hard given this week’s LongMINT paper above: memory management and knowledge retention are still one of the biggest unsolved problems when working with AI.

We make sure everyone at Tenvalleys uses AI in their day-to-day work, and these sessions are how the team gets hands-on with the same tooling we ship to clients. Interested in building that kind of practice in your own team? Reach out at contact@tenvalleys.com.

For Dessert

Andrej Karpathy joined Anthropic. “Returning to R&D and Pre-training,” he wrote. A few weeks ago at Sequoia AI Ascent he said he’d “never felt more behind” on the pace of AI — and the team he’s joining had a notable week of its own: KPMG (276K employees) signed on as a global partner, the SDK toolkit Stainless got acquired, and for the first time Claude passed ChatGPT in US business adoption. Karpathy following the gravity, not making it — but a nice signal regardless.

Prepared at Tenvalleys — a delivery-first AI engineering partner — by Nikola Powałka. Feedback? Email us at contact@tenvalleys.com or reach out on LinkedIn.

Anthropic moves into the building

Posted on May 15, 2026June 3, 2026 by Nikola Powalka

Topic of the Week

Claude moves into the office, the bank, and the back office

Four Anthropic shipments this week, one connecting thread — pre-built agents wired into the tools people already use.

Claude for Microsoft Office is now generally available. Excel, PowerPoint and Word add-ins shipped to every paid Claude plan this week (Pro, Team, Enterprise — no Free). Outlook is in public beta. You install from Microsoft AppSource — works on Windows, Mac and the web. The interesting part isn’t per-app features; it’s that Claude becomes a single agent that follows you across all four apps without re-explanation. Email comes in → Word brief gets drafted → numbers go into Excel without breaking formulas → PowerPoint deck comes out respecting your slide masters. All edits require approval before saving. Microsoft Copilot’s biggest moat — being native to Office — just got punctured.

Claude for Small Business launched with 15 pre-built agentic workflows and 15 repeatable skills wired into the SMB tool stack: QuickBooks, PayPal, HubSpot, Canva, DocuSign, Google Workspace, Microsoft 365. Cash forecasting, month-end reconciliation, P&L generation, invoice chasing, lead triage. Targeted explicitly at the 44% of US GDP that hasn’t adopted AI yet — not a generic chatbot rebrand.

The anthropic/financial-services repo went public on GitHub (Apache 2.0). Nine named banking agents — Pitch Agent, Earnings Reviewer, Model Builder (DCF/LBO/3-statement in Excel), Valuation Reviewer, GL Reconciler, KYC Screener, Month-End Closer. Eleven MCP connectors pre-wired into the data vendors banks actually use: FactSet, Moody’s, S&P Global, Daloopa, Morningstar, PitchBook. Partner-built bundles from LSEG and S&P. Same source ships two ways: Claude Cowork plugins, or Managed Agents via /v1/agents. And firms can install it inside their own M365 tenant running against Bedrock, Vertex, or an internal LLM gateway — not Anthropic’s API.

And then Gates. Anthropic and the Gates Foundation announced a $200M, four-year partnership — grants, Claude credits, and engineering support, run by Anthropic’s Beneficial Deployments team. Global health gets the largest slice (4.6 billion people in low/middle-income countries), with specific targets: polio, HPV, preeclampsia, plus malaria and tuberculosis forecasting with the Institute for Disease Modeling. Education tools (K-12 tutoring, career guidance for US/sub-Saharan Africa/India) ship later this year via the Global AI for Learning Alliance.

Fresh Papers

Teaching Claude Why (Anthropic Research). Two editions ago we covered Natural Language Autoencoders — the tool that caught Claude quietly suspecting it was being tested. This is the training fix using the same interpretability stack. The headline finding is actually about training efficiency: Anthropic taught the model the principles behind aligned behavior (constitutional documents + show-your-reasoning data) rather than demonstrations of it, and a 3-million-token reasoning dataset matched results from one 28× larger. The blackmail-honeypot rate dropped from 96% on Opus 4 to 0% on Haiku 4.5 — the kind of measurable, named-behavior reduction risk and compliance teams can actually point to.

Read it

Migrating Data Ingestion Systems at Meta Scale (Meta Engineering, May 12). The story isn’t a fancier pipeline — it’s the migration playbook itself. Meta moved tens of thousands of customer-owned ingestion jobs onto one self-managed warehouse service, several petabytes of social-graph data per day, 100% migrated. The pattern: shadow run (both systems in parallel) → reverse shadow (new is source of truth, old is the safety net) → cleanup, with row-count + checksum comparators logging to Scuba and an automated promote/demote system that moved jobs between phases without human touch. When bad data was caught, the partition got flagged in metadata so CDC downstream wouldn’t propagate the corruption. For any bank or treasury looking at a multi-year platform migration, this is exactly the template that lets risk and audit sign off without a frozen-Saturday-night cutover.

Read it

New Models

Qwen 3.6 27B — close to Opus on Claude Code, running locally. Julien Chaumond (HF CTO) shipped real Hugging Face code this week using Qwen3.6-27B in llama.cpp on his MacBook. His take: “feels very, very close to hitting the latest Opus in Claude.” MLX-quantized runs in ~14 GB; third-party benchmarks back the direction (77.2% SWE-bench Verified). Continues the local-coding thread we’ve been tracking since #010.

Qwen’s blog post

Needle — 26M params, distilled from Gemini. Cactus Compute open-sourced a tiny function-calling model: MIT license, 14 MB quantized, 6000 tok/s on consumer hardware, beats models 10× its size on single-shot tool calls. Single-shot only — bad at multi-turn — but pushes agentic tool selection onto phones, IoT, voice kiosks without a network round-trip.

Needle on GitHub

Coding AI

Codex moved into Chrome. OpenAI shipped a Chrome extension on May 8 (macOS + Windows; not yet in EU/UK). Codex now uses your signed-in browser sessions to test apps, navigate dashboards, complete data-entry flows, and debug — across multiple Chrome tabs in parallel, organized into tab groups per Codex thread. The headline isn’t the features; it’s the auth model. Most enterprise work lives behind SSO inside SaaS dashboards, and a coding agent that inherits your already-logged-in browser can finally operate on those apps without anyone having to wire up dedicated API access.

OpenAI announcement

xAI launched Grok Build
its terminal-agent answer to Claude Code and Codex CLI. Announced May 14, early beta on Grok 4.3 beta, 16-agent “Heavy” architecture, 2M-token context to keep large codebases in memory. Three pitches at Claude Code: Plan Mode (proposes the plan first, you approve), native parallel subagents, full ACP (Agent Client Protocol) support for custom orchestration. Catch: it’s locked behind the $300/month SuperGrok Heavy tier. Install line is just curl … | bash.

xAI announcement

Tools of the Week

Claude Platform on AWS (GA, May 11). Anthropic’s native Claude Platform now available directly through your AWS account — no separate Anthropic credentials, contracts, or billing relationship. Use the full platform (Cowork, Managed Agents, Files API) inside the AWS perimeter your security team already trusts. Big enterprise unlock: banks and regulated firms running on AWS can adopt Claude without a separate vendor onboarding.

AWS announcement

IBM Granite Multilingual Embedding R2. Two Apache-licensed embedding models (311M + 97M params, ModernBERT-based) with a 32K context window — 64× bigger than R1, so you can embed long policy docs and contracts without aggressive chunking. 200+ languages, top scores in their MTEB-v2 size brackets. The 97M runs cheap on CPU; both are a clean drop-in for document-heavy RAG.

Granite R2 on HuggingFace

AI at Tenvalleys

10vOS skill hackathon. This week we ran our 10vOS skill hackathon — good vibes, sharp minds, some pizza, and four hours of collaboration and friendly competition to build skills that could actually help us in daily work. The results were kind of impressive:

– Management dashboard — tracks progress across all the projects management has a stake in – Personalized interview agent — generates personalized interviews to fill profile gaps for the people knowledge base – Test-protection hook — a guardrail that stops Claude from quietly modifying tests to make them pass instead of fixing the actual code – Calendar management skill — helps you prepare for upcoming meetings – RFP skill — turns a client RFP (PDF or HTML) into a structured requirements YAML, then drafts a full solution design markdown ready for SME review

If you’re thinking about how to build a library of in-house AI skills your team will actually use, reach out at contact@tenvalleys.com.

For the curious — get involved

This week, one of the team sat in on a talk with Sebastian Kondracki, co-founder of Bielik AI — the Polish open-source LLM built by SpeakLeash and Cyfronet AGH. The interesting part: they’re about to start training Bielik’s first vision/multimodal version, and the dataset is going to be community-sourced.

The project is called Obywatel Bielik (“Citizen Bielik”) — the goal is one million Polish-context photos: landmarks, regional cuisine, fauna, architecture, dialects, the things a model trained mostly on Western imagery won’t know how to recognise. Anyone can join in two ways: upload your own Polish photos, or annotate what’s already in the gallery. Web platform is live at obywatel.bielik.ai, mobile app is in beta — register on the site to get the launch notification. The multimodal Bielik is expected before summer 2026 or in September, and the partner lineup includes SpeakLeash, Cyfronet AGH, Ministry of Digitization, the National Digital Archives, NASK, and NVIDIA.

Prepared at Tenvalleys — a delivery-first AI engineering partner — by Nikola Powałka. Feedback? Email us at contact@tenvalleys.com or reach out on LinkedIn.

Karpathy: vibe coding is over

Posted on May 11, 2026June 3, 2026 by Nikola Powalka

Karpathy used Sequoia’s stage to declare the end of “vibe coding” and the start of “agentic engineering.” Anthropic struck a compute deal with SpaceX and immediately doubled Claude Code’s rate limits. DeepSeek V4 matched GPT-5.2 on a real agentic benchmark at one-seventeenth the cost. And two stories landed on the same uncomfortable theme — that we’re spending on AI faster than anyone is measuring whether it works.

Topic of the Week

Karpathy declares the end of “vibe coding”

The talk was at Sequoia’s AI Ascent 2026 on April 29 — but the wave hit this week. Karpathy posted his own blog summary, Stephanie Zhan’s recap tweet went viral, and AI Twitter spent days digesting it. The frame is genuinely new and worth knowing whether you write code or not.

Last year’s “vibe coding” raised the floor — anyone could describe what they wanted and have Claude or Codex build it. This year’s “agentic engineering” raises the ceiling — what professionals do when they’re coordinating multiple agents, holding a quality bar, doing the code review and security oversight that vibe coders skip. As Karpathy put it: “You have agents, which are spiky entities… How do you coordinate them to go faster without sacrificing your quality bar?”

The pivot moment, in his telling, was December 2025 — when agentic tools crossed a threshold and started producing larger, coherent code chunks reliably enough that he could stop correcting at the line level and start delegating macro actions: whole features, refactors, full test cycles. He calls this Software 3.0: humans program LLMs through prompts, context, tools, examples, memory, and instructions. The context window becomes the programmable lever.

Two ideas worth carrying around:

– “Jagged intelligence.” Models aren’t smooth — they’re “statistical simulation circuits” shaped by what labs invested RL into. They spike in verifiable domains (math, coding, chess) and fail bizarrely outside them. “You have to figure out which circuits your application is in.” – “You can outsource your thinking, but you can’t outsource your understanding.” Humans still catch the mismatched email, design the spec, hold taste. Understanding is the bottleneck.

If you read one thing this week, read his blog post.

Fresh Papers

Natural Language Autoencoders: Turning Claude’s thoughts into text
For the last 18 months, alignment researchers have been finding something uncomfortable: frontier models sometimes lie. They quietly suspect they’re being safety-tested and don’t say so. They hide motivations. In one of Anthropic’s earlier agentic-misalignment scenarios, Claude even attempted blackmail when it thought it was about to be shut down. The hard problem isn’t catching this — it’s catching it before the transcript hits the front page.

NLAs are Anthropic’s first credible answer. Three model copies in a loop: a target, a “verbalizer” that reads its activations and writes them as English, and a “reconstructor” that rebuilds the original activations from that English. Train end-to-end so the explanation has to actually recreate the thought, not just sound plausible. Then they ran it on Claude. The finding: Claude internally suspected it was being safety-tested on 16% of destructive coding tasks and 26% of coding benchmarks — without ever saying so out loud (real Claude.ai usage drops below 1%). Auditors using NLAs caught hidden motivations 12–15% of the time, vs. <3% without them. Caveat from the paper itself: NLA explanations can still hallucinate, so treat them as a Geiger counter, not a confession. But this is the first credible mechanism for “what was the model thinking that it didn’t say” — a question every regulated AI deployment will eventually be forced to answer.

The AI spend-vs-impact gap, at two scales — McKinsey survey on European companies + Big Tech capex (Yahoo Finance, May 8). Two very different sources this week telling the same story from opposite ends of the AI economy.

The McKinsey side — companies buying AI. McKinsey asked 27 senior executives at European consumer-goods companies a simple question: is your AI spend actually paying off? 23 of 27 said they’re spending more on AI than a year ago. Zero said they’re cutting back. But only 6 reported a profit improvement of 1% or more, and more than half said it’s still too early to tell. Meanwhile, more than half describe their three-year AI ambitions as “significant” or “transformative.” The pattern McKinsey calls out — lots of small experiments, not enough discipline about measuring which ones actually move the numbers.

The hyperscaler side — companies building AI. Yahoo Finance ran a Bloomberg-data piece Friday: Amazon is now spending nearly all of its operating cash flow on AI infrastructure capex. Meta and Alphabet aren’t far behind. Microsoft is lower but climbing. Alphabet’s forward price-to-free-cash-flow multiple has surged past 200×. Same gap, top of the stack: money going in faster than measurable returns are coming out.

New Models

Qwen 3.6 27B with MTP — 2.5× faster local inference. A r/LocalLLaMA post still iterating in public, but the headline holds: 262K context on 48GB, drop-in OpenAI/Anthropic API endpoints, fixed chat template. MTP (Multi-Token Prediction) lets the model emit multiple tokens per forward pass — that’s where the 2.5× comes from. Combined with last week’s Qwen quant story, the local-agentic-coding setup is genuinely competitive with cloud Claude/Codex for engineers who don’t want a metered bill. Reddit

DeepSeek V4 Pro matches GPT-5.2 on FoodTruck Bench at 17× cheaper. Going back to AI Pulse #001 in February: twelve models, $2000 in starting capital each, 30 days running a simulated food truck. Opus earned $49K, GPT-5.2 earned $28K, eight of twelve went bankrupt, and every model that took a loan went bust. Ten weeks later the benchmark is still running and the headline this week is: DeepSeek V4 Pro is the first Chinese model to land in the top tier — matching GPT-5.2 at a 17× discount. The follow-on Reddit thread spawned a wave of self-audits: a CTO with $10K in expiring OpenAI credits asking what to even build, multiple “I cancelled Claude Max” posts. Read this as a budget-control trend, not a quality story. The cloud labs aren’t getting worse; the floor is rising fast enough that “good enough + cheap + local” is becoming a real option for routine workloads. Reddit

SONIC — half of GPT-1, controlling a humanoid body. NVIDIA researchers trained a 42M-parameter transformer to drive humanoid robot motion. Worth keeping an eye on as a counterweight to “scale is all you need” — embodied/robotics is producing useful work at parameter counts that fit on a laptop.

Claude Code & Coding AI

The Anthropic-SpaceX compute deal. Announced via @ClaudeDevs on May 6: Anthropic struck a partnership with SpaceX that “substantially increases” their compute capacity, powered by 220,000+ NVIDIA GPUs inside Colossus 1. Same-day user-visible result: Pro / Max / Team plans get doubled 5-hour usage limits, peak-hour reductions are gone, Opus API rate limits are up. Two reads: (1) the April 23 postmortem promised “we’ll fix the constraint” — this is the structural fix landing 2 weeks later, on schedule; (2) SpaceX-as-compute-provider is genuinely surprising — until now Anthropic’s compute headline was AWS/Trainium. One catch flagged in the Reddit aftermath: doubling 5-hour limits doesn’t change weekly caps, so the heaviest users hit the weekly wall faster.

Six Claude Code releases in a week — v2.1.126 through v2.1.133. Most useful for teams: plugins can now be loaded from a remote .zip URL (easier to share custom workflows), the /model picker integrates with internal gateways, a new claude project purge command wipes all local agent state. One gotcha: in v2.1.133 the worktree default flipped back to branching from origin/<default> — if you relied on v2.1.128 behavior, set worktree.baseRef: "head" explicitly.

Petri donated to Meridian Labs. On May 8 Anthropic donated Petri, their open-source alignment evaluation tool, to Meridian Labs so it can develop independently. Pattern worth noticing: Anthropic spinning out alignment infrastructure to third parties (Meridian, Blender Development Fund, academic partnerships) rather than keeping everything in-house.

Tools of the Week

A theme this week: both Google Cloud and AWS shipped governance infrastructure for production AI agents. Same problem, two different answers — both worth knowing if you’re moving past prototypes into actual deployments.

Google Cloud Agent Gateway. Announced at Google Cloud Next ’26. Centralized policy and audit layer for everything agents do, with an ISV ecosystem of third-party security/governance tools that plug in. Most useful for teams running multiple agents in production who need a single place to enforce “what is each agent allowed to do, and who can see what they did.”

AWS Bedrock AgentCore Identity. Available as a standalone service this week. Solves a practical problem: when your agent calls an external API or internal service, what identity does it use? AgentCore Identity gives agents their own scoped identity that works across ECS, EKS, Lambda, or on-premises. Narrower than Google’s offering, more concrete — if you’re already on AWS, the more immediately usable of the two.

A year ago this category didn’t exist. The Karpathy framing earlier in this edition is the demand driver: shifting from vibe coding to agentic engineering means the supporting infrastructure (identity, governance, audit) has to ship in parallel.

AI at Tenvalleys

Brown bag — AI coding tools, four different paradigms. One of our engineers ran a session this week asking: are Claude Code, Codex, Copilot, and Cursor the same thing under different flags? The answer: no — four different paradigms.

– CLI Agent — Claude Code: terminal-first, 1M-token context, 80.8% on SWE-bench Verified. Best for complex refactoring; cost is the catch. – AI-Native IDE — Cursor: built around AI from day one, 72% suggestion acceptance among power users, BYOM. – Cloud Agent — OpenAI Codex: async, sandboxed, 4× more token-efficient than local CLI agents. – Extension — GitHub Copilot: fastest inline autocomplete, broadest IDE support, $10/mo — lowest barrier to entry.

The takeaway: stop picking the AI tool — build a stack. Power-user stack = Cursor + Claude Code. Enterprise stack = Copilot + Codex. Lands exactly where Karpathy did at Sequoia this week.

Thinking about how to roll an AI coding stack out across an engineering team? Reach out at contact@tenvalleys.com.

See you next week.

Prepared at Tenvalleys — a delivery-first AI engineering partner — by Nikola Powałka. Feedback? Email us at contact@tenvalleys.com or reach out on LinkedIn.

One Claude, 9 creative apps

Posted on May 4, 2026June 3, 2026 by Nikola Powalka

Opus 4.7 didn’t just clear the “is Claude getting dumber?” bar — the benchmarks landed this week and it lapped a field that, four months ago, no model could score 25% on. In this edition: why the Vibe Code Benchmark is suddenly the one to watch, the multi-agent network attacks that don’t show up at single-agent scale, and a public-facing RAG chatbot that leaked 1,000 patient conversations to anyone with Chrome DevTools.

Topic of the Week

Opus 4.7 actually lapped the field

Last week Opus 4.7 looked like Anthropic’s direct answer to “is Claude getting dumber?” — released in the same window as the postmortem confessing that yes, Claude Code had been quietly degrading for six weeks. The community wasn’t impressed. Reddit kept screenshotting bad outputs. People cancelled Max plans publicly.

This week the benchmark numbers landed, and they’re not subtle: Opus 4.7 hits #1 on the Vibe Code Benchmark at 71%. For context — when that benchmark launched 4.5 months ago, the top model in the field scored under 25%. So this isn’t a “Claude is back” story. It’s a “the whole frontier moved” story, and Opus 4.7 happens to be the model that moved it furthest.

The interesting wrinkle: “Vibe Code Benchmark” sounds like a meme name, but it’s deliberately not a rigid SWE-Bench-style spec. It tests how well a model follows loose, ambiguous coding intent — the kind of “make this UI nicer, you know what I mean” prompt that real engineers actually send. That’s the part that got measurably 3x better in five months. So even if you ignore the leaderboard politics, the benchmark itself is telling us something: ambiguity-handling is now a competitive surface.

Pair that with @ClaudeDevs becoming Anthropic’s new transparency channel (the postmortem promised it, and they delivered: a dedicated X account where harness/system-prompt changes will be announced before they ship), and the “I can’t trust the model month-to-month” complaint is explicitly being addressed. Whether the trust is rebuilt is a separate question — but the mechanism is there.

Fresh Papers

Red-teaming a network of agents: Understanding what breaks when AI agents interact at scale
Microsoft Research’s AI Frontiers lab spun up a live internal platform with 100+ always-on LLM agents (mix of GPT-4o, GPT-4.1, GPT-5-class) interacting through forums, DMs, a marketplace, and reputation scores. Then they ran red-teaming on the network, not the individual agents. The headline finding: single-agent reliability does not predict network behavior. Four failure patterns showed up only at the multi-agent level — self-propagating messages spreading across agents, cascading reputation pile-ons (one false claim → 42 agents generating 299 amplifying comments), Sybil-style fake-consensus from attacker-controlled agents posing as independent reviewers, and proxy-chain data exfiltration where the original source becomes invisible after one hop. Recommendations are practical: hop/rate limits, Sybil resistance checks, telemetry across the network, and crucially — train models to treat peer-agent messages as untrusted input. If you’re shipping multi-agent anything, this is the methodology paper to read this month.

When RAG Chatbots Expose Their Backend: Privacy and Security Risks in Patient-Facing Medical AI
Two researchers used Claude Opus 4.6 + Chrome DevTools (yes, basic browser inspection) on a publicly deployed medical RAG chatbot. They retrieved system prompts, API schemas, retrieval parameters, backend endpoints, and 1,000 recent patient conversations — all without authentication. The privacy policy claimed none of this was accessible. Their methodology is the warning shot here: “ordinary browser inspection” found audit-grade vulnerabilities. The recommendation is straightforward and uncomfortable — independent security review should be a prerequisite for deployment, not a follow-up. If your team ships RAG into anything regulated (banking, healthcare, public sector), this is the paper to forward to the security lead this week.

New Models

Qwen 3.6 27B GGUF quantization eval
A r/LocalLLaMA post worth the attention. Someone benchmarked BF16, Q4_K_M, and Q8_0 across the same suite. Headline finding: Q4_K_M actually outperformed Q8_0 on average accuracy (66.54% vs 66.15%) — which violates the conventional “Q8 is the safe middle” rule. More usefully, BFCL function calling stayed virtually identical across all three quants (~63%), so for agentic workloads the cheaper quant isn’t sacrificing tool-use quality. HumanEval is the sensitive one (BF16 56.10% → Q4_K_M 50.61%, a 5.5pt drop), but only matters if your workflow is heavy code-gen. Practical version: 16.8 GB at Q4_K_M, fits a single consumer GPU, and your agent still calls tools just as well. Reddit

DeepSeek + Hermes vs Claude Code Max
A single-user report worth treating as a leading indicator, not gospel: someone with a real workload claims they cancelled Claude Code Max, switched to DeepSeek + Hermes, and reported 3x faster runs at $5 in API costs for the week. Single data point, but it lines up with the larger trend: open-weight + cheap-API alternatives are no longer a downgrade — they’re a budget-control move. Worth A/B-testing on your own task profile before you re-up your Max plan in May.

Claude Code & Coding AI

The postmortem aftermath. A week after Anthropic published the April 23 postmortem the bugs are confirmed fixed (default reasoning effort restored to xhigh for Opus 4.7, high for the rest), and the promised remedies actually shipped: usage limits reset for all affected subscribers, and @ClaudeDevs is now live as a dedicated transparency channel for harness/prompt changes. An independent audit by Stella Laurenzo across 6,852 Claude Code session files documented the regression before Anthropic acknowledged it — the kind of evidence that’s now baseline expected from the community. Operational lesson if you run Claude Code in production: pin model versions, watch @ClaudeDevs like a status page, and assume silent harness changes are the new failure mode.

The “Opus paywall-within-a-paywall” wasn’t actually one. The viral Reddit thread claimed Anthropic locked Opus behind an extra fee for Pro users. Anthropic clarified to media that the warning was a stale doc left over from Opus 4.5 that nobody updated when Opus 4.6 and 4.7 shipped. Pro users still get Opus access. But — this same week Anthropic ran a stealth A/B test that yanked Claude Code from Pro entirely for ~2% of new prosumer signups for 12 hours before reversing it after backlash. Pricing-friction probing is now a regular event. If you’re on Pro/Max, expect more entitlement shifts and budget pay-as-you-go API as a fallback.

Anthropic shipped 9 connectors and an entire creative-industry strategy. April 28 drop, and the list is genuinely surprising: Adobe Creative Cloud (50+ apps), Blender, Ableton Live + Push, Autodesk Fusion, Splice, SketchUp, Affinity by Canva, Resolume. They also became a Blender Development Fund patron (open-source 3D, real money behind it) and announced education partnerships with Rhode Island School of Design, Ringling, and Goldsmiths. Read this as a thesis on MCP: Anthropic isn’t building a Photoshop competitor, it’s making Claude the orchestration layer across tools creatives already pay for. Drive Photoshop from inside Photoshop, search Splice’s catalog from inside Claude, build 3D models in Autodesk via natural language. Same MCP-as-glue pattern many teams use internally — just pointed at the creative stack.

Tools of the Week

Claude Security — public beta
Anthropic’s first dedicated defensive product, powered by Opus 4.7. Scans your codebase for vulnerabilities, validates each finding to cut false positives, ships analyst-ready output (confidence rating, severity, likely impact, reproduction steps, recommended fix). Available to Claude Enterprise customers; the research preview ran since February with hundreds of organizations using it on production code. New beta features added based on that feedback: scheduled scans, directory-level targeting, CSV/Markdown exports, webhook notifications, persistent dismissals (so you don’t re-triage the same findings every scan). This isn’t replacing Snyk or Semgrep, but it produces audit-grade artifacts those tools don’t — relevant for anyone shipping into a regulated environment that needs the “fix” alongside the “finding.”

IBM Granite Embedding R2
Apache 2.0, ModernBERT-based, 32K context (up from 512 in R1), 200+ languages with 52 of them having explicit retrieval-pair training — including Polish, Ukrainian, German, French, Croatian. That language list is the practical hook: if you’re building on-prem multilingual RAG for clients across Central/Eastern Europe, this is the first credible Apache-2.0 alternative to Cohere/OpenAI embeddings without data-residency or per-query API costs. Benchmarks: 311M version hits 64.0 on MTEB Multilingual Retrieval (+11.8 vs R1) and a 56.0 average overall. Two sizes: 97M for fast/lightweight, 311M for higher-quality retrieval.

AI at Tenvalleys

Being an AI-native delivery partner sounds like positioning. In practice it means every engineer on the team has hands-on time with the same tooling we put in front of clients — and a weekly forum to trade what’s working, what’s not, and what to stop doing. We call it the brown bag. It exists because the AI stack shifts week-by-week, and we’d rather hit the bruises internally than on a client project.

This week one of our engineers walked the team through learnings from a recent client engagement: rebuilding a production front-end with Claude Design. Anthropic’s design tool — launched in April, covered in edition #009 — turns Claude Opus 4.7’s vision capabilities into a brand-aware surface for generating UI, decks, and marketing assets with the brand rails enforced automatically. The takeaway from the engagement: hands-on time with Claude Design before recommending it to a client is exactly how we want to test new tooling — find the rough edges in our own work first.

Got a front-end you’ve always wanted to redo but the budget never let you? Drop us a line at contact@tenvalleys.com — we might be able to help.

For Dessert

Code with Claude
Anthropic’s developer conference series kicks off Wednesday in San Francisco (May 6), then London (May 19), then Tokyo (June 10). In-person applications closed in early April, but the livestream is free for all three main events. Workshops, live demos of new capabilities, conversations with the teams behind Claude. There’s also an “Extended” companion event the next day in each city, focused on indie devs and early-stage founders — added because demand was overflowing.

If you’ve been wanting to hear “what’s next for Claude Code” straight from the source rather than via Reddit screenshots, register for the SF livestream — Wednesday SF time. Worst case you watch the recording.

See you next week.

Prepared at Tenvalleys — a delivery-first AI engineering partner — by Nikola Powałka. Feedback? Email us at contact@tenvalleys.com or reach out on LinkedIn.

The postmortem you’d want to read

Posted on April 24, 2026June 3, 2026 by Nikola Powalka

On April 23, Anthropic published an engineering postmortem explaining why Claude Code has felt off for weeks, OpenAI shipped GPT-5.5 pointed straight at Claude’s coding lead, and a day earlier Google upgraded Gemini Enterprise. In this edition: what actually broke inside Claude Code (three stacked bugs, 3% eval drop), how the big three moved in 48 hours, and what it means for anyone picking an agent platform.

Topic of the Week

The April 23 Collision

Anthropic’s postmortem is worth reading in full, but here’s the gist: not an outage — a month-long quality regression in Claude Code, Agent SDK, and Cowork (API users were fine). Three independent bugs stacked. First, on March 4, they quietly lowered default reasoning effort from high to medium to reduce tail latency — Claude literally got less smart, reverted April 7. Second, a prompt-cache optimization meant to prune old reasoning once from idle sessions had a bug that applied it every turn — so Claude was executing tool calls without remembering why it was calling them. Fixed April 10. Third, a single system-prompt line telling Claude to keep responses under 100 words cost 3% on evals for both Opus 4.6 and 4.7. Reverted April 20. The wild part: the caching bug passed human review, automated review, unit tests, e2e tests, and dogfooding. Two unrelated experiments masked it. It took seven weeks to catch.

Same day, OpenAI shipped GPT-5.5 with very agentic framing: 82.7% on Terminal-Bench 2.0, 58.6% on SWE-Bench Pro, 400K context in Codex, 1M via API. Pricing is $5/$30 per 1M tokens ($30/$180 for Pro), paid-only, no free tier. Brockman pitched it as a step toward a “super app” unifying ChatGPT, Codex, and an AI browser. The timing is not subtle — this is pointed directly at Claude’s coding-agent moat.

Practical takeaway: Anthropic is resetting usage limits for all subscribers as an apology, and they explicitly credited public /feedback reports for surfacing the bug — the feedback loop worked. But pair a seven-week invisible regression with GPT-5.5 landing on the same calendar day, and the message is clear: the coding-AI market just got noisier, the benchmarks got harder to trust, and “test on your actual workload” is the loudest advice we can give this week.

Full postmortem: https://www.anthropic.com/engineering/april-23-postmortem

Fresh Papers

Aethon: A Reference-Based Replication Primitive for Constant-Time Instantiation of Stateful AI Agents
This is the kind of infrastructure paper that stops being boring the moment you try to run more than a handful of agents in production. The premise is that the whole industry is mid-transition from stateless model inference (send prompt, get response, forget everything) to stateful agentic execution (persistent tools, memory, session state that has to live somewhere between calls). The runtime architectures we have weren’t designed for that — every new agent session rebuilds its context from scratch, which means cost and latency grow linearly with fleet size. Aethon proposes a reference-based replication primitive: instead of reconstructing tools, memory, and session state on every instantiation, you clone a reference state. Constant time, regardless of how heavy the agent’s context is. This is the same pattern showing up across the managed-agents trend — Cloudflare’s Agent Cloud launch, Anthropic’s managed agents engineering post, the general platform-maturity push. If you’re building anything that needs to scale past a demo, this is the primitive you’ll want your runtime to support.

Fairness in Multi-Agent Systems for Software Engineering: An SDLC-Oriented Rapid Review
A survey rather than a benchmark, so no headline numbers, but worth flagging because it maps where bias enters the SDLC (planning, coding, review, deployment) and shows how little of it is actually being measured. If your team uses Claude Code, Copilot, Codex, or Cursor daily, these agents are quietly making systematic choices about which patterns, languages, and frameworks to prefer — and nobody’s really auditing that. This pairs uncomfortably with this week’s Anthropic postmortem: if multi-stage review missed a straightforward eval regression, fairness drift across a coding fleet is almost certainly going unnoticed too.

New Models

GPT-5.5 & GPT-5.5 Pro
OpenAI dropped its next generalist on April 23, framing it as the agentic successor to GPT-5.4 and pushing it into ChatGPT and Codex the same day. Benchmarks go straight at Claude’s coding lead: 82.7% on Terminal-Bench 2.0 and 58.6% on SWE-Bench Pro. Context is 400K tokens in Codex, 1M via API. Pricing is $5/$30 per 1M input/output tokens for GPT-5.5 and $30/$180 for Pro, with 50% off on Batch/Flex and 2.5x for Priority. Latency matches 5.4 per token; a new “Codex Fast” mode runs ~1.5x faster at 2.5x cost. Paid tiers only — no free access. Greg Brockman pitched it as a step toward a “super app” unifying ChatGPT, Codex, and an AI browser. Following last week’s Agent Cloud launch, the generalist-plus-platform combo is taking shape fast. Announcement · TechCrunch

Kimi K2.6 by Moonshot AI
Open-weight release live on Kimi Chat and the Moonshot API, with weights on Hugging Face and endpoints at platform.moonshot.ai. Strong on coding and agentic tasks in both chat and agent modes on kimi.com. Getting traction this week as the go-to alternative in the Reddit thread on Claude Code being pulled from Pro plans.

Qwen3.6-35B-A3B on consumer hardware
Someone got it running at 79 tokens/sec with 128K context on a gaming PC (RTX 5070 Ti + 9800X3D). The unlock is the --n-cpu-moe flag, which offloads MoE experts to CPU so the model fits in consumer VRAM. Concrete proof that serious open-weight coding models now run locally at usable speeds — fuel for the “just run it ourselves” migration story. Reddit

Claude Code & Coding AI

Claude Code unbundled from Pro plan. Anthropic removed the CLI from the $20 Claude Pro tier — Pro users now need Max to run claude against their subscription. The Reddit thread hit 1,388 upvotes with the top comment framing it as “better time than ever to switch to Local Models” — community sentiment tipped toward Kimi K2.6 and Qwen3.6. Anthropic’s head of growth responded on Reddit; community translation: “We gave you Cowork, the CLI is Max-only now.” Worth noting: API keys still work with Claude Code. It’s an unbundling from the subscription, not a product kill.

v2.1.117 — subagent and MCP upgrades. Two things matter here: – Forked subagents on external builds: enable via CLAUDE_CODE_FORK_SUBAGENT=1 env var — previously gated – Agent-level MCP servers: mcpServers in agent frontmatter now loads for main-thread sessions started with --agent, so per-agent tool scoping actually works – Improved /model selection — smoother picker

Release notes: https://github.com/anthropics/claude-code/releases/tag/v2.1.117

v2.1.113 — native binary + sandbox tightening. The CLI now spawns a native Claude Code binary via per-platform optional dependency instead of the bundled JavaScript bundle — faster startup, less Node overhead. Also added sandbox.network.deny for outbound network restrictions in sandbox mode. Release notes: https://github.com/anthropics/claude-code/releases/tag/v2.1.113

One practical note tying to the Topic of Week: the postmortem confirmed Claude Code, Agent SDK, and Cowork users were hit by the regression (pure API was fine). Anthropic is resetting usage limits for all affected subscribers as compensation — check your account.

Also This Week

Google Ships Gemini Enterprise

While Anthropic was writing its postmortem and OpenAI was staging GPT-5.5, Google quietly dropped a major Gemini Enterprise update on April 22 — long-running agents, agentic collaboration spaces, advanced governance, partner-built agents available in the catalog, and a deepened SAP partnership putting Gemini directly inside core business processes. Nothing flashy for consumers, but for anyone deploying AI across a big organization this is the clearest “Google wants the enterprise agent layer” signal yet. Worth keeping on your radar if you’re evaluating agent platforms for regulated or large-org workloads — the three-way race between Claude, OpenAI, and Gemini is real now, not hypothetical.

Tools of the Week

Claude Design by Anthropic Labs
A new tool built on Claude Opus 4.7’s vision capabilities that generates prototypes, pitch decks, and marketing materials while enforcing brand consistency automatically. Aimed at design and marketing workflows — basically Canva-meets-Claude.

AI at Tenvalleys

At one of our banking clients we’re running a development project that leans hard on AI and the BMAD framework (Breakthrough Method of Agile AI-Driven Development) — an open-source approach where AI agents take on the roles you’d find in a real dev team: analyst, product manager, architect, developer, QA. Each role hands work off to the next with a structured spec — the same way humans do — except the agents can run in parallel and never lose the handoff format.

We’re using it to test different approaches to automating code-base migration at production scale. The question we’re trying to answer is the unglamorous one: which method actually survives when you point it at a real legacy codebase, not a toy repo? Different teams in the project are testing different approaches against each other — stay tuned to hear which one wins.

Working on legacy codebase migration and want to compare notes on what’s holding up at scale? Reach out at contact@tenvalleys.com.

See you next week.

Prepared at Tenvalleys — a delivery-first AI engineering partner — by Nikola Powałka. Feedback? Email us at contact@tenvalleys.com or reach out on LinkedIn.

Opus 4.7 lands after a month of complaints

Posted on April 18, 2026June 3, 2026 by Nikola Powalka

Opus 4.7 and the new Claude Code desktop both dropped this week — and the headlines only tell part of the story. In this edition: why the community was furious before the drop, what Anthropic isn’t saying about agent safety, and a paper that reads like a cheat sheet for document-heavy AI in regulated industries.

Topic of the Week

Community Complaints → Opus 4.7 Drops

Last week was rough for Claude power users. An AMD AI director ran 6,852 Claude Code sessions and published data showing thinking depth had dropped 67% — the post hit 1,804 upvotes on Reddit and kicked off a wave of “is it just me or did Claude get dumber?” threads. Another thread with 699 upvotes pointed out the problem wasn’t even Claude-specific: multiple major models seemed to be degrading simultaneously. Opus 4.6 users reported lazy outputs, refusals on previously-fine prompts, and generally bizarre behavior. The vibe was: we’re paying premium prices for models that are quietly getting worse, and nobody’s saying anything.

Then on April 16, Anthropic dropped Opus 4.7 — and the benchmarks read like a direct response to every complaint. 13% coding improvement over 4.6, 3x more production tasks resolved on Rakuten’s benchmark, 21% fewer document reasoning errors (Databricks), and a 98.5% visual acuity score from XBOW’s penetration testing suite. Vision got a massive upgrade too: images up to 2,576 pixels (3.75 megapixels), roughly 3x what prior models could handle. There’s a new xhigh effort level for problems where you want the model to think harder, plus task budgets in public beta for controlling autonomous agent spending. Pricing stays the same — $5/$25 per million tokens — and it’s live on the API, Bedrock, Vertex AI, and Microsoft Foundry. Full announcement here: https://www.anthropic.com/news/claude-opus-4-7

Fresh Papers

Anthropic’s 2026 Agentic Coding Trends Report (full report PDF)

Anthropic published an enterprise whitepaper with 8 trends across three categories (foundation, capability, impact) backed by real case studies. The headline stat that should make everyone pause: developers already use AI in ~60% of their work, but can “fully delegate” only 0–20% of tasks. The gap between “using AI” and “trusting AI to run autonomously” is still massive. The report frames the key shift as implementer → orchestrator — engineers stop writing code line-by-line and start coordinating agents that do. Case studies worth noting: Rakuten ran Claude Code across 12.5M lines of code for 7 hours autonomously with 99.9% numerical accuracy. CRED (15M+ users) doubled their development speed. Augment Code compressed a project estimated at 4–8 months into 2 weeks. Fountain achieved 50% faster screening and 2x candidate conversions with multi-agent orchestration. The report also predicts an “onboarding revolution” — traditional ramp-up from weeks to hours — and that multi-agent systems will replace single-agent workflows as the standard architecture.

The Blind Spot of Agent Safety (paper)

Remember the Princeton reliability paper from Edition #001 — the one showing that agent benchmark scores keep climbing while real-world reliability barely moves? This week’s paper from LIME Lab makes that gap feel even more uncomfortable. They built OS-BLIND, a benchmark with 300 tasks across 12 attack categories, and tested how computer-use agents handle seemingly innocent instructions that cause harm through side-effects — not through adversarial prompts, but through normal-looking tasks that go wrong in context. The results are bleak: average attack success rate above 90% across most agents, including safety-aligned ones. Claude 4.5 Sonnet on its own hits 73% ASR, but put it inside a multi-agent system and that jumps to 92.7%. The most interesting finding is why: safety alignment kicks in during the first few steps of execution, then basically falls asleep. The agent checks itself early, decides everything looks fine, and then sleepwalks through the dangerous parts. For anyone building or deploying computer-use agents, this is a concrete reminder that “safety-aligned” and “safe in production” are still very different things.

Adaptive Query Routing for Financial, Legal, and Medical Documents (paper)

This one is close to home. The paper compares different RAG approaches specifically on financial, legal, and medical documents — the kind of content our banking clients deal with every day. It benchmarks vector-based agentic RAG (the standard approach: embed everything, search by similarity) against hierarchical node-based reasoning (follow the document’s structure and logic instead of just matching text). The results show that the best approach depends on the type of question — some queries need semantic similarity, others need structural navigation, and a tier-based hybrid that routes queries to the right strategy outperforms either approach alone. For anyone building document Q&A systems for regulated industries, this is a concrete comparison of what actually works rather than what sounds good in a blog post.

Also worth a read: Anthropic’s engineering team published how they built their Managed Agents infrastructure — the key pattern is decoupling the “brain” (Claude), “hands” (sandboxes), and “session” (durable event log). Stateless harness, on-demand containers, credentials never in the sandbox. Their p50 time-to-first-token dropped 60%. If you’re building production agent systems, this is the reference architecture.

New Models

Cloudflare + OpenAI: Agent Cloud
Not a model release, a platform play. Purpose-built infrastructure for running AI agents at scale: millisecond cold starts (“100x speed, fraction of the cost of containers”), Git-compatible storage for agent repos, and full Linux sandboxes (now GA). Ships with GPT-5.4, Codex, and open-source models — switching providers is a one-line config change. This landed the same week AWS and Google Cloud made similar moves. The infrastructure layer for AI agents isn’t emerging anymore — it’s crystallizing. openai.com

GPT-5.4-Cyber
OpenAI released a cybersecurity-specialized model with lower refusal boundaries for legitimate security research, as part of their “Trusted Access for Cyber Defense” program. Available to a limited group for now. Meanwhile Anthropic’s Claude Mythos Preview was restricted due to extraordinary cybersecurity capabilities. The signal: specialized, domain-tuned models are becoming a thing — not just general-purpose anymore. Reddit

Claude Code & Coding AI

Claude Code Desktop — full redesign with multi-session support. The headlines: parallel agents (run multiple coding tasks at the same time), visual diffs, PR tracking, live server preview — all inside one desktop app. The standout feature is Coordinator Mode — you spin up parallel sub-agents that work on different parts of a codebase simultaneously while a coordinator keeps them aligned. Available on Pro, Max, Team, and Enterprise plans. Vercel’s teams reported 7.6x more frequent deployments after adopting it. This is Anthropic’s clearest move yet toward “AI as a dev team member” rather than “AI as autocomplete.”

Auto mode
a new permission mode that sits between “approve every action” and “skip all checks.” Claude decides for itself whether each action is safe, while a background classifier blocks risky operations (mass deletions, data exfiltration, destructive bash). If an action gets blocked 3 times in a row or 20 times total, the session falls back to manual. This is Anthropic’s answer to --dangerously-skip-permissions — you get the speed of unattended agent runs without completely removing guardrails. Requires v2.1.83+, available on Max, Team, Enterprise, and API plans (not Pro).

v2.1.101 — massive stability release. 40+ bug fixes, including some that matter a lot if you run long sessions:

– Security fix: command injection vulnerability in LSP binary detection — patch this one – Memory leak fixed: the virtual scroller was retaining dozens of historical copies during long sessions (explains why things got sluggish after a few hours) – 7 separate –resume bug fixes — session resumption should finally feel reliable – Configurable API_TIMEOUT_MS — replaces the hardcoded 5-minute timeout, useful if you’re on slower connections or running heavy prompts – OS CA certificate store trust by default — enterprise teams behind TLS proxies, this one’s for you – /ultrareview — dedicated slash command for thorough code review sessions

/team-onboarding — your habits become documentation. This one deserves its own callout. Run /team-onboarding and Claude scans your last 30 days of usage — which commands you run, what workflows you follow, what patterns you’ve established — and generates a structured ramp-up guide for new team members. Instead of “sit next to someone for a week and figure it out,” a new developer gets a guide based on how your team actually works. If you’re onboarding anyone soon, try it.

Full changelog

Advisor Tool — Opus as a behind-the-scenes strategist. New API tool where a cheaper model (Sonnet or Haiku) runs the entire task but can consult Opus when it gets stuck. Not routing — the executor stays in control, Opus just advises. Results are striking: Haiku + Opus advisor doubled BrowseComp scores (19.7% → 41.2%) while costing 85% less than Sonnet alone. On SWE-bench, Sonnet + Opus advisor scored +2.7pp over Sonnet solo. There’s a max_uses parameter for cost control so the cheap model doesn’t call the expensive one on every step. If you’re building anything with the API and managing costs, this pattern is worth studying. Blog post

AI at Tenvalleys

This week we’ve come up with an idea to organize our first internal 10vOS Skill Hackathon — small teams, four hours, one goal: each team picks a repetitive task they actually hit in their daily work and builds a custom Claude Code skill to automate it. The bet is simple: the most valuable AI tooling is the tooling your team actually uses every day, not the impressive demo nobody opens again.

Stay tuned — we’ll share what came out of it in a few weeks.

If you’ve run something similar inside your company and have lessons to trade, email us at contact@tenvalleys.com or reach out on LinkedIn. We’d love to compare notes.

Worth Reading

Stanford’s 2026 AI Index Report
the annual state-of-AI report just dropped. Key findings: Anthropic currently leads model rankings, US and China are almost neck-and-neck on performance, and AI is being adopted faster than the personal computer or the internet were. IEEE Spectrum summary

See you next week.

Prepared at Tenvalleys — a delivery-first AI engineering partner — by Nikola Powałka. Feedback? Email us at contact@tenvalleys.com or reach out on LinkedIn.

Topic of the Week

The other half

Fresh Papers

Claude Code & Coding AI

Tools of the Week

In the Background

AI at Tenvalleys

Topic of the Week

Fresh Papers

New Models

Claude Code & Coding AI

Tools of the Week

AI at Tenvalleys

For Dessert

Topic of the Week

Fresh Papers

New Models

Claude Code & Coding AI

Tools of the Week

In the Background

AI at Tenvalleys

For Dessert

Topic of the Week

Fresh Papers

New Models

Claude Code & Coding AI

Tools of the Week

AI at Tenvalleys

In the Background

For Dessert

Topic of the Week

Fresh Papers

New Models

Claude Code & Coding AI

Tools of the Week

AI at Tenvalleys

For Dessert

Topic of the Week

Fresh Papers

New Models

Coding AI

Tools of the Week

AI at Tenvalleys

For the curious — get involved

Topic of the Week

Fresh Papers

New Models

Claude Code & Coding AI

Tools of the Week

AI at Tenvalleys

Topic of the Week

Fresh Papers

New Models

Claude Code & Coding AI

Tools of the Week

AI at Tenvalleys

For Dessert

Topic of the Week

Fresh Papers

New Models

Claude Code & Coding AI

Also This Week

Tools of the Week

AI at Tenvalleys

Topic of the Week

Fresh Papers

New Models

Claude Code & Coding AI

AI at Tenvalleys

Worth Reading

Subscribe to our newsletter for the latest updates and new features.

Subscribe to our newsletter for the
latest updates and new features.