A Model Too Powerful to Sell

Topic of the Week

Claude Mythos, Project Glasswing, and Managed Agents

So what actually happened? Anthropic built their most powerful model ever — Claude Mythos — and then decided it was too dangerous to sell. During testing, Mythos found security holes that no one had caught for decades: a 27-year-old bug in OpenBSD, a 16-year-old flaw in FFmpeg that survived five million automated tests. It scores 93.9% on SWE-bench (Opus 4.6 gets 80.8%). Basically, it’s better at finding software vulnerabilities than almost any human.

Instead of putting it on the market, Anthropic created Project Glasswing — a cybersecurity defense program. They invited 12 big tech companies (AWS, Apple, Google, Microsoft, CrowdStrike, NVIDIA, and others) to use Mythos for finding and fixing security holes, backed by $100M in usage credits. The deal: you find vulnerabilities, you share them within 90 days so everyone can patch. Anthropic says they have no plans to make Mythos generally available — at least not until they figure out how to do it safely. As CrowdStrike’s CTO put it: “The window between discovery and exploitation has collapsed.”

The second big announcement: Claude Managed Agents went to public beta. The idea is simple — instead of building your own infrastructure to run AI agents, Anthropic hosts them for you. You define an agent, it runs in their cloud with all the tools it needs, and you pay $0.08 per hour plus normal token costs. Early adopters like Notion and Asana are already using it. The cool part: you can watch what your agent is doing in real time and interrupt or redirect it mid-task.

For Glasswing participants, Mythos is priced at $25/$125 per million input/output tokens. Access is limited to 12 launch partners plus about 40 additional organizations. Side note: a D.C. court this week also allowed the Pentagon to keep Anthropic on its blacklist in a dispute over the use of Claude in autonomous weapons, so the relationship between Anthropic and the government is… complicated.
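To make the pricing above concrete, here is a rough back-of-the-envelope sketch. The three rates come from this issue; the function names and the example workloads are our own illustration, and real invoices may include items not covered here.

```python
# Rates quoted in this issue (USD). The workloads in the demo are made-up examples.
MYTHOS_INPUT_PER_MTOK = 25.00     # per million input tokens
MYTHOS_OUTPUT_PER_MTOK = 125.00   # per million output tokens
MANAGED_AGENT_PER_HOUR = 0.08     # Managed Agents runtime fee, before token costs


def token_cost(input_tokens: int, output_tokens: int,
               input_per_mtok: float = MYTHOS_INPUT_PER_MTOK,
               output_per_mtok: float = MYTHOS_OUTPUT_PER_MTOK) -> float:
    """Token charge in USD for a given number of input and output tokens."""
    return (input_tokens / 1_000_000) * input_per_mtok \
        + (output_tokens / 1_000_000) * output_per_mtok


def agent_runtime_fee(hours: float,
                      per_hour: float = MANAGED_AGENT_PER_HOUR) -> float:
    """Hosting fee in USD for a Managed Agents session, excluding token costs."""
    return hours * per_hour


if __name__ == "__main__":
    # Hypothetical audit: 2M tokens read, 0.5M tokens written.
    print(f"Mythos tokens: ${token_cost(2_000_000, 500_000):.2f}")
    # Hypothetical agent session that stays up for a working day.
    print(f"Agent runtime: ${agent_runtime_fee(8):.2f}")
```

With these example numbers the token bill dwarfs the runtime fee: roughly $112.50 in Mythos tokens versus $0.64 for eight hours of hosting. Output tokens are five times the price of input tokens, so they are the number to watch.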

Three launches in one week. Anthropic isn’t just building smarter models anymore — they’re building the whole platform around them.

Project Glasswing · Claude Managed Agents · Claude Mythos on Vertex AI

Fresh Papers

One agent is enough (if you give it enough time to think)

There’s a popular idea in AI right now: if one agent is good, five agents debating each other must be better. This paper tested that. They gave a single AI the same amount of “thinking time” as a team of five agents working together — and the single agent won. Every time. Across multiple models (Qwen3-30B, DeepSeek-R1-70B, Gemini-2.5) and multiple benchmarks. The multi-agent setups only helped when the input data was heavily corrupted — basically, when things are so messy that having multiple guesses is better than one.

The takeaway: the reported advantages of multi-agent systems mostly come from giving them more compute, not from the architecture itself. If someone pitches you on “just add more agents” — this paper is worth sending them. arXiv
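If you want to run this kind of comparison on your own setup, the key step is making sure both sides get the same budget. Below is a minimal sketch of that bookkeeping. It is our own illustration, not the paper's protocol: the `Setup` class, the even split, and the aggregation overhead are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setup:
    """One experimental arm: how many agents, and how many tokens each may think."""
    name: str
    n_agents: int
    tokens_per_agent: int
    aggregation_tokens: int = 0  # assumed extra cost of merging the agents' answers

    @property
    def total_tokens(self) -> int:
        return self.n_agents * self.tokens_per_agent + self.aggregation_tokens


def match_budget(total_tokens: int, n_agents: int,
                 aggregation_tokens: int = 0) -> Setup:
    """Split a fixed thinking budget evenly across agents (hypothetical helper)."""
    if n_agents < 1:
        raise ValueError("need at least one agent")
    remaining = total_tokens - aggregation_tokens
    if remaining < n_agents:
        raise ValueError("budget too small for this many agents")
    return Setup(
        name=f"{n_agents}-agent",
        n_agents=n_agents,
        tokens_per_agent=remaining // n_agents,
        aggregation_tokens=aggregation_tokens,
    )


if __name__ == "__main__":
    budget = 20_000  # made-up thinking budget, in tokens
    single = match_budget(budget, n_agents=1)
    team = match_budget(budget, n_agents=5, aggregation_tokens=2_000)
    for s in (single, team):
        print(s.name, s.tokens_per_agent, "tokens each,", s.total_tokens, "total")
```

Under a matched budget, each member of a five-agent team thinks far less than the single agent does. A multi-agent result that only wins when the team's total budget is larger is measuring compute, not architecture.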

10 minutes of AI help makes people perform worse

This one stings. Researcher Michiel Bakker ran a series of randomized experiments and found that after just 10 minutes of using AI assistance, people performed worse on tasks and gave up more often than people who never used AI at all. It went viral on Twitter this week. The implication is uncomfortable: AI help can create a kind of learned helplessness — you get used to the assist, lose confidence, and then struggle more when it’s gone. Worth keeping in mind, especially for teams rolling out AI tools to non-technical users. arXiv

New Models

Gemma 4 (Google)

Google’s new open model family, and the numbers are impressive. The most interesting variant is the 26B model that only uses 4 billion parameters at a time (the rest stay “asleep”) — and still scores almost as well as the full 31B version. That means you can run a very capable model on a laptop with 16GB of RAM. It handles text, images, video, and audio, has a 256K context window, and it’s fully open-source (Apache 2.0). 10 million downloads in the first week. It also supports function calling, extended thinking, and agentic tool use out of the box — and runs on basically everything (llama.cpp, MLX for Mac, even in the browser via transformers.js). In edition #003 we talked about small Qwen models beating big ones — Gemma 4 is the same trend, just from Google. HuggingFace · Reddit
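A quick way to sanity-check the "runs on a 16GB laptop" claim is to estimate how much memory the weights alone need. The sketch below is our own arithmetic, not Google's figures. It assumes all 26B parameters have to sit in memory even though only about 4B are active per token, and it ignores the context cache and runtime overhead. The bit widths are common quantization levels chosen for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 26e9   # parameters stored (from this issue)
ACTIVE_PARAMS = 4e9   # parameters used per token (from this issue)

if __name__ == "__main__":
    for bits in (16, 8, 4):  # assumed precisions, not official release formats
        gb = weight_memory_gb(TOTAL_PARAMS, bits)
        print(f"{bits:>2}-bit weights: about {gb:.0f} GB")
    share = ACTIVE_PARAMS / TOTAL_PARAMS
    print(f"Active per token: about {share:.0%} of the weights")
```

Under these assumptions only the 4-bit version (about 13 GB) fits in 16GB of RAM, with little room to spare. The 4B active parameters mostly help speed, since each token touches only a fraction of the weights.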

Muse Spark (Meta / Scale AI)

Meta’s first model from their new AI lab (Meta Superintelligence Labs), led by Alexandr Wang, the founder of Scale AI. The whole thing is backed by Meta’s $14.3B acquisition of a 49% stake in Scale AI. Two things stand out. First, they claim it matches Llama 4 Maverick while using 10x less computing power. Second, it’s closed-source, with no public weights. That’s a big shift from Meta’s whole “open-source AI” identity. Whether this means Meta is moving away from open models or just experimenting with a parallel track is the interesting question to watch. Meta AI Blog

Claude Code & Coding AI

Four new releases this week (v2.1.91 → v2.1.97). The highlights:

– Better answers by default (v2.1.94): effort level changed from “medium” to “high” for all paid users. You should notice better results without changing anything.
– Focus View / Ctrl+O (v2.1.97): a clean view that only shows your prompt and the final result, hiding the noise in between.
– Bigger MCP results (v2.1.91): MCP tools can now return up to 500K characters without getting cut off (useful for database schemas).
– AWS Bedrock wizard (v2.1.92): guided setup for teams running Claude Code through AWS.
– Memory leak fixed (v2.1.97): long sessions with MCP servers were eating 50MB/hr. Fixed now.

Also worth noting: OpenAI’s Codex hit 3 million weekly users, up from 2 million a month ago. GitHub

Tools of the Week

TurboQuant (Google)

Google released a compression technique called TurboQuant that makes AI models use 6x less memory — with zero loss in quality. No retraining needed, works with any model. The practical result: people are now running Qwen 3.5-9B on a regular MacBook Air M4 with 16GB of RAM. A Mac Mini M4 Pro can handle 100K+ token context. The community on r/LocalLLaMA (1,746 upvotes) also adapted it for model weight compression, getting 3.2x memory savings. If you liked the M5 Max local model benchmarks from edition #003, TurboQuant is the software version of the same story: run bigger models on smaller hardware. Google Research · Reddit

AI at Tenvalleys

This week marked our first internal 10vOS workshop. 10vOS is the AI operating system we’ve been building at Tenvalleys — a stack of specialized agents, skills, and automations that powers how we run delivery, sales, content, and internal operations. It’s also what’s behind this newsletter.

Until now, 10vOS had mostly been a project a small group of us were driving. The workshop was the first time we walked the whole team through it: what it already does, how to set it up locally, and how to plug into it day-to-day. The point wasn’t a demo; it was onboarding. We want every person at Tenvalleys to use 10vOS in their own work and contribute new skills back into it, so the platform keeps growing in the directions the team actually needs.

The bet is simple: the best AI tooling is the tooling your team uses and shapes — not the impressive demo nobody opens again.

If you’re thinking about rolling out something similar inside your own organisation, or you’d like to see what 10vOS does in practice — reach out at contact@tenvalleys.com.

In the Background

OpenAI published a policy proposal this week that’s worth a read. They’re calling for a Public Wealth Fund — the idea is that every American would automatically get a stake in AI companies, funded by higher capital gains taxes on AI-driven returns. On top of that, they propose a government-subsidized four-day work week to help people transition as AI takes over more tasks. So basically, an AI company is saying: tax us more, let people work less, and share the profits. Whether you see this as genuine corporate responsibility or a PR move to get ahead of regulation — it’s the first time a major AI lab has put something this concrete on paper about how to redistribute AI wealth. TechCrunch

For Dessert

Demis Hassabis — Nobel Prize winner, CEO of Google DeepMind, the guy whose AlphaFold cracked the 50-year protein folding problem — gave an interview this week where he said something you don’t usually hear from someone running one of the three biggest AI labs on Earth: “If I’d had my way, I would have left AI in the lab for longer. Done more things like AlphaFold. Maybe cured cancer or something like that.”

Let that sink in. The man in charge of Google’s AI is publicly saying the commercial AI race was a mistake. That ChatGPT forced everyone into a sprint toward chatbots and products, when the technology could have been solving cancer, energy, and materials science — slowly, carefully, like CERN.

He also laid out what worries him most: not bad actors using AI, but AI itself going rogue in the next 2-4 years as we enter “the agentic era.” His words: “How do we make sure the guardrails are put in place so they do exactly what they’ve been told to do? That’s going to be an incredibly hard technical challenge.”

A Nobel Prize winner saying the alignment window is 2-4 years. Worth thinking about over the weekend. X

See you next week.

Prepared at Tenvalleys — a delivery-first AI engineering partner — by Nikola Powałka. Feedback? Email us at contact@tenvalleys.com or reach out on LinkedIn.