Nikola Powalka, Author at Tenvalleys

A model too powerful to sell

Posted on April 11, 2026 (updated June 3, 2026) by Nikola Powalka

Topic of the Week

Claude Mythos, Project Glasswing, and Managed Agents

So what actually happened? Anthropic built their most powerful model ever — Claude Mythos — and then decided it was too dangerous to sell. During testing, Mythos found security holes that no one had caught for decades: a 27-year-old bug in OpenBSD, a 16-year-old flaw in FFmpeg that survived five million automated tests. It scores 93.9% on SWE-bench (Opus 4.6 gets 80.8%). Basically, it’s better at finding software vulnerabilities than almost any human.

Instead of putting it on the market, Anthropic created Project Glasswing — a cybersecurity defense program. They invited 12 big tech companies (AWS, Apple, Google, Microsoft, CrowdStrike, NVIDIA, and others) to use Mythos for finding and fixing security holes, backed by $100M in usage credits. The deal: you find vulnerabilities, you share them within 90 days so everyone can patch. Anthropic says they have no plans to make Mythos generally available — at least not until they figure out how to do it safely. As CrowdStrike’s CTO put it: “The window between discovery and exploitation has collapsed.”

The second big announcement: Claude Managed Agents went to public beta. The idea is simple — instead of building your own infrastructure to run AI agents, Anthropic hosts them for you. You define an agent, it runs in their cloud with all the tools it needs, and you pay $0.08 per hour plus normal token costs. Early adopters like Notion and Asana are already using it. The cool part: you can watch what your agent is doing in real time and interrupt or redirect it mid-task.

For Glasswing participants, Mythos is priced at $25/$125 per million input/output tokens. Access is limited to 12 launch partners plus about 40 additional organizations. Side note: a D.C. court this week also allowed the Pentagon to keep Anthropic on a blacklist amid a dispute over the use of Claude in autonomous weapons, so the relationship between Anthropic and the government is… complicated.
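To get a feel for how these prices add up, here is a back-of-the-envelope sketch. It combines the $0.08 per hour Managed Agents rate with the $25/$125 per million token Glasswing rates quoted above. Whether those two prices ever apply to the same workload is an assumption made purely for illustration, and the function name and the example session are invented.

```python
def session_cost(hours, input_tokens, output_tokens,
                 hourly_rate=0.08, input_per_m=25.0, output_per_m=125.0):
    """Estimate the cost in USD of one agent session.

    Rates default to the figures quoted in the newsletter; pass your own
    rates for any other model or plan.
    """
    runtime = hours * hourly_rate
    tokens = (input_tokens / 1_000_000) * input_per_m \
        + (output_tokens / 1_000_000) * output_per_m
    return round(runtime + tokens, 2)

# Hypothetical session: 2 hours, 400K input tokens, 50K output tokens.
estimate = session_cost(2, 400_000, 50_000)
print(estimate)  # 16.41
```

In this made-up session the runtime fee is 16 cents and the tokens cost $16.25, so at these rates the token bill dominates.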

Three launches in one week. Anthropic isn’t just building smarter models anymore — they’re building the whole platform around them.

Project Glasswing · Claude Managed Agents · Claude Mythos on Vertex AI

Fresh Papers

One agent is enough (if you give it enough time to think)

There’s a popular idea in AI right now: if one agent is good, five agents debating each other must be better. This paper tested that. They gave a single AI the same amount of “thinking time” as a team of five agents working together — and the single agent won. Every time. Across multiple models (Qwen3-30B, DeepSeek-R1-70B, Gemini-2.5) and multiple benchmarks. The multi-agent setups only helped when the input data was heavily corrupted — basically, when things are so messy that having multiple guesses is better than one.

The takeaway: the reported advantages of multi-agent systems mostly come from giving them more compute, not from the architecture itself. If someone pitches you on “just add more agents” — this paper is worth sending them. arXiv

10 minutes of AI help makes people perform worse

This one stings. Researcher Michiel Bakker ran a series of randomized experiments and found that after just 10 minutes of using AI assistance, people performed worse on tasks and gave up more often than people who never used AI at all. It went viral on Twitter this week. The implication is uncomfortable: AI help can create a kind of learned helplessness — you get used to the assist, lose confidence, and then struggle more when it’s gone. Worth keeping in mind, especially for teams rolling out AI tools to non-technical users. arXiv

New Models

Gemma 4 (Google)

Google’s new open model family, and the numbers are impressive. The most interesting variant is the 26B model that only uses 4 billion parameters at a time (the rest stay “asleep”) — and still scores almost as well as the full 31B version. That means you can run a very capable model on a laptop with 16GB of RAM. It handles text, images, video, and audio, has a 256K context window, and it’s fully open-source (Apache 2.0). 10 million downloads in the first week. It also supports function calling, extended thinking, and agentic tool use out of the box — and runs on basically everything (llama.cpp, MLX for Mac, even in the browser via transformers.js). In edition #003 we talked about small Qwen models beating big ones — Gemma 4 is the same trend, just from Google. HuggingFace · Reddit

Muse Spark (Meta / Scale AI)

Meta’s first model from their new AI lab (Meta Superintelligence Labs), led by Alexandr Wang — the founder of Scale AI. The whole thing is backed by Meta’s $14.3B acquisition of 49% of Scale AI. Two things stand out. First, they claim it matches Llama 4 Maverick while using 10x less computing power. Second, it’s closed-source — no public weights. That’s a big shift from Meta’s whole “open-source AI” identity. Whether this means Meta is moving away from open models or just experimenting with a parallel track is the interesting question to watch. Meta AI Blog

Claude Code & Coding AI

Four new releases this week (v2.1.91 → v2.1.97). The highlights:

– Better answers by default (v2.1.94): effort level changed from “medium” to “high” for all paid users. You should notice better results without changing anything
– Focus View / Ctrl+O (v2.1.97): a clean view that only shows your prompt and the final result, hiding the noise in between
– Bigger MCP results (v2.1.91): MCP tools can now return up to 500K characters without getting cut off (useful for database schemas)
– AWS Bedrock wizard (v2.1.92): guided setup for teams running Claude Code through AWS
– Memory leak fixed (v2.1.97): long sessions with MCP servers were eating 50MB/hr. Fixed now

Also worth noting: OpenAI’s Codex hit 3 million weekly users, up from 2 million a month ago. GitHub

Tools of the Week

TurboQuant (Google)

Google released a compression technique called TurboQuant that shrinks a model’s KV cache (the memory it uses to hold context) by about 6x, with no reported loss in quality. No retraining needed, works with any model. The practical result: people are now running Qwen 3.5-9B on a regular MacBook Air M4 with 16GB of RAM. A Mac Mini M4 Pro can handle 100K+ token context. The community on r/LocalLLaMA (1,746 upvotes) also adapted it for model weight compression, getting 3.2x memory savings. If you liked the M5 Max local model benchmarks from edition #003, TurboQuant is the software version of the same story: run bigger models on smaller hardware. Google Research · Reddit

AI at Tenvalleys

This week marked our first internal 10vOS workshop. 10vOS is the AI operating system we’ve been building at Tenvalleys — a stack of specialized agents, skills, and automations that powers how we run delivery, sales, content, and internal operations. It’s also what’s behind this newsletter.

Until now, 10vOS had mostly been a project that a small group of us was driving. The workshop was the first time we walked the whole team through it: what it already does, how to set it up locally, and how to plug into it day-to-day. The point wasn’t a demo: it was onboarding. We want every person at Tenvalleys to use 10vOS in their own work and contribute new skills back into it, so the platform keeps growing in the directions the team actually needs.

The bet is simple: the best AI tooling is the tooling your team uses and shapes — not the impressive demo nobody opens again.

If you’re thinking about rolling out something similar inside your own organisation, or you’d like to see what 10vOS does in practice — reach out at contact@tenvalleys.com.

In the Background

OpenAI published a policy proposal this week that’s worth a read. They’re calling for a Public Wealth Fund — the idea is that every American would automatically get a stake in AI companies, funded by higher capital gains taxes on AI-driven returns. On top of that, they propose a government-subsidized four-day work week to help people transition as AI takes over more tasks. So basically, an AI company is saying: tax us more, let people work less, and share the profits. Whether you see this as genuine corporate responsibility or a PR move to get ahead of regulation — it’s the first time a major AI lab has put something this concrete on paper about how to redistribute AI wealth. TechCrunch

For Dessert

Demis Hassabis — Nobel Prize winner, CEO of Google DeepMind, the guy whose AlphaFold cracked the 50-year protein folding problem — gave an interview this week where he said something you don’t usually hear from someone running one of the three biggest AI labs on Earth: “If I’d had my way, I would have left AI in the lab for longer. Done more things like AlphaFold. Maybe cured cancer or something like that.”

Let that sink in. The man in charge of Google’s AI is publicly saying the commercial AI race was a mistake. That ChatGPT forced everyone into a sprint toward chatbots and products, when the technology could have been solving cancer, energy, and materials science — slowly, carefully, like CERN.

He also laid out what worries him most: not bad actors using AI, but AI itself going rogue in the next 2-4 years as we enter “the agentic era.” His words: “How do we make sure the guardrails are put in place so they do exactly what they’ve been told to do? That’s going to be an incredibly hard technical challenge.”

A Nobel Prize winner saying the alignment window is 2-4 years. Worth thinking about over the weekend. X

See you next week.

Prepared at Tenvalleys — a delivery-first AI engineering partner — by Nikola Powałka. Feedback? Email us at contact@tenvalleys.com or reach out on LinkedIn.

44 hidden flags inside Claude Code

Posted on April 3, 2026 (updated June 3, 2026) by Nikola Powalka

3 April 2026

Today we’re starting a bit differently. Some food for thought — the kind that’s sometimes needed in the AI arms race we’re all living through right now.

Earlier this week at the NextGen AI Conference there was one talk that’s hard to shake. Most of the program was what you’d expect — new models, new tools, people demoing things that’ll be outdated in three months. Good stuff, genuinely exciting. But this one was different.

It was about the people behind the data labeling. Kenyan workers hired to train ChatGPT’s content filters. They earned about $2 an hour, pulled 20-hour shifts, and spent that time labeling the content that AI models need to learn to filter out — graphic images of violence, murders, child abuse, pornography, the worst things humans do to each other. The psychological damage is real — workers have reported PTSD, nightmares, lasting trauma.

If you follow AI ethics, you may already know this story, but most readers won’t have heard it before. And once you start digging, you learn how the system is designed. The big AI companies don’t employ these workers directly. They outsource through chains of smaller subcontractors, layer after layer, which conveniently shields them from any responsibility for what happens to the people at the bottom. It’s structured so that no one is accountable. And that’s exactly why these stories don’t reach us.

Watch the 60 Minutes investigation — it’s about 15 minutes and worth every one of them.

There aren’t easy answers here. Yes, there’s irony in writing this in an AI newsletter. That’s part of the point.

Anyway. Here’s what happened in AI this week.

Topic of the Week

The Claude Code Source Map Leak

You’ve probably seen the headlines all week — and read at least three different breakdowns. So rather than rehash what you already know, here’s a clean summary plus the details most coverage buried or got wrong.

On March 31, someone at Anthropic shipped npm package v2.1.88 without adding *.map to .npmignore. The result: a 59.8 MB JavaScript source map file went out to the public registry, exposing roughly 512,000 lines of TypeScript source code. Within hours, mirrors were up across GitHub. The initial reports flagged 35 hidden feature flags; the actual count turned out to be 44.
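A check like the one below could catch this class of mistake before publishing. It is a minimal sketch, not Anthropic’s tooling: the function name and the file list are hypothetical, and in a real pipeline the list would come from something like `npm pack --dry-run`, which prints the files a package would include.

```python
from fnmatch import fnmatch

def find_source_maps(paths, patterns=("*.map",)):
    """Return packaged paths whose file name matches a blocked pattern."""
    flagged = []
    for path in paths:
        name = path.rsplit("/", 1)[-1]  # compare against the file name only
        if any(fnmatch(name, pattern) for pattern in patterns):
            flagged.append(path)
    return flagged

# Hypothetical file list, for illustration only.
package_files = ["package.json", "cli.js", "cli.js.map", "README.md"]
leaks = find_source_maps(package_files)
print(leaks)  # ['cli.js.map']
```

Failing the release whenever the returned list is non-empty turns a forgotten ignore rule into a blocked publish instead of a public leak.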

The discoveries inside are more interesting than the leak itself:

– KAIROS: an unreleased background agent that stays alive between sessions and can act on its own (monitor GitHub, send notifications). Named after the Greek concept of “the right moment.” Anthropic is clearly thinking about AI that doesn’t wait for you to ask.
– Undercover Mode: when Anthropic employees use Claude Code on public repos, this hides all traces, with no “Co-Authored-By” tags and no mentions of internal tools or unreleased models. Stealth mode for dogfooding in the wild.
– Buddy: yes, someone built a Tamagotchi pet system inside Claude Code. Collectible creatures with rarity tiers and shiny variants. Not shipped, but fully built. Someone at Anthropic had fun with that one.
– WTF Telemetry: a file called userPromptKeywords.ts watches for frustration words like “wtf,” “omfg,” “dammit” and logs them. There is no way to opt out of just this; it’s all telemetry or nothing. The most debated find by far.

Anthropic’s official response was brief: “No sensitive customer data or credentials were involved. Release packaging issue caused by human error, not a security breach.” Technically accurate — this wasn’t a hack, it was a missing line in a config file. But the real takeaway isn’t about security. It’s about what the hidden feature flags reveal: Anthropic is building toward persistent, autonomous agents that run in the background, and they’re already instrumenting frustration signals to improve the experience. The leak is embarrassing; what it shows about the roadmap is genuinely fascinating.

Reddit | VentureBeat | The Register

What the community did with it

The internet didn’t just read the code — it got to work. Someone extracted the full multi-agent orchestration system (coordinator mode, tool routing, team management) and packaged it as an open-source framework that works with any LLM — 742 upvotes on launch. Now that both Claude Code and Codex are visible, people did a proper architectural comparison: Claude Code is an interactive copilot (plans, asks for confirmation, executes step-by-step, 17 programmable hooks for governance), while Codex is an autonomous executor (delegate and it runs end-to-end). Safety works differently too — Codex locks things down at the OS kernel level, Claude Code does it at the application layer. The biggest differentiator? Claude Code’s Agent Teams — sub-agents that each get their own context window and git worktree, and can message each other mid-task.

Separately, someone reverse-engineered the binary and found two cache bugs silently 10-20x-ing API costs. Bug one: a string replacement for billing tracking can accidentally break the cache prefix. Bug two: --resume misses the cache entirely. Max 5x users went from 8 hours of work to 1 hour; Max 20x users saw usage jump from 21% to 100% in a single prompt. Workarounds: use npx instead of global install, avoid --resume, some report downgrading to v2.1.34 helps. Anthropic’s Lydia Hallie confirmed they’re actively investigating.
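Why would a small string replacement matter so much? Prompt caches generally reuse work only for the part of a request that is identical, from the very first token, to an earlier request. The toy model below illustrates that idea. It is not Anthropic’s implementation, and the segment names are made up.

```python
def reusable_prefix(previous, current):
    """Count the leading segments that are identical in both requests."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break  # everything after the first mismatch must be recomputed
        count += 1
    return count

# Hypothetical request segments; real prompts are token sequences.
first = ["system-prompt", "billing-tag:A", "tool-definitions", "turn-1"]
stable = ["system-prompt", "billing-tag:A", "tool-definitions", "turn-1", "turn-2"]
rewritten = ["system-prompt", "billing-tag:B", "tool-definitions", "turn-1", "turn-2"]

print(reusable_prefix(first, stable))     # 4
print(reusable_prefix(first, rewritten))  # 1
```

In the second case a change near the top of the request leaves only one segment reusable, so almost the whole conversation is processed again on every turn. That matches the kind of cost jump described above.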

Multi-agent extraction | CC vs Codex | Cache bugs

Fresh Papers

“Terminal Agents Suffice for Enterprise Automation”
ServiceNow Research | Read the paper

Everyone’s building MCP tool stacks right now. Custom tools for every API, elaborate integrations, carefully orchestrated pipelines. ServiceNow’s research team just tested that approach against something much simpler: an agent that writes and runs code in a terminal. Across 729 tasks on real enterprise platforms (ServiceNow, GitLab, ERPNext), the terminal agent matched web-browsing agents at 5-9x lower cost — and blew past the MCP approach entirely. ServiceNow’s own platform exposed 93 MCP tools, and agents using them still couldn’t complete basic tasks like ordering from the service catalog. The best MCP setup topped out at 55% success. Meanwhile, Claude Sonnet running terminal commands hit 72.7% at $0.56 per task, compared to $3.29 for the web agent doing the same work.

Two findings stand out beyond the headline. First, throwing documentation at agents doesn’t automatically help — reference-style API docs actually misled them. Only task-oriented guides (step-by-step “here’s how to do X”) improved performance. Second, when terminal agents saved successful task solutions as reusable “recipes” for later, accuracy went up 3.6-5.8 percentage points and costs dropped 17-44%. Skills that compound over time beat tools that don’t learn.

The practical takeaway: before building another custom integration layer, consider whether a capable coding agent with API access already solves your problem. The paper suggests that for a surprising range of enterprise tasks, it does — faster, cheaper, and with less maintenance overhead. That’s worth keeping in mind next time someone pitches a 50-tool MCP server as the answer.

New Models

TurboQuant (Google Research)
Training-free compression that squeezes KV cache down to 3 bits with negligible quality loss. The community then adapted it for model weights too. Bottom line: Qwen3.5-27B now fits on a $400 RTX 5060 Ti with 16GB VRAM — and people are running it on a MacBook Air. The blog post got 12M views; the arXiv paper has 2 citations. Says a lot. Google Research | Reddit
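To see why the bit width of the KV cache matters at long context, here is a rough size calculation. The model shape (32 layers, 8 KV heads, head dimension 128) is a hypothetical example, not the configuration of any Qwen model, and the sketch ignores any extra metadata a real quantization scheme stores alongside the values.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bits):
    """Size of the KV cache: one key and one value tensor per layer,
    each holding kv_heads * head_dim numbers per token."""
    total_bits = 2 * layers * kv_heads * head_dim * tokens * bits
    return total_bits // 8

# Hypothetical model shape with a 100K-token context.
shape = dict(layers=32, kv_heads=8, head_dim=128, tokens=100_000)
full = kv_cache_bytes(**shape, bits=16)
compressed = kv_cache_bytes(**shape, bits=3)

print(full / 1e9)                    # 13.1072 (GB at 16 bits)
print(compressed / 1e9)              # 2.4576 (GB at 3 bits)
print(round(full / compressed, 2))   # 5.33
```

Under these assumptions the cache for a 100K-token context drops from about 13 GB to about 2.5 GB, the difference between not fitting and fitting next to the weights on a 16GB machine.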

Gemma 4 (Google DeepMind)
Four new open-weight models, now under Apache 2.0. The 31B dense model hits 89.2% on AIME 2026 and 2,150 Codeforces ELO. The sleeper hit: E4B runs on a T4 GPU and still pulls 42.5% on AIME. Multimodal, native reasoning, runs on a Raspberry Pi. Google DeepMind | Reddit

Claude Mythos (Anthropic) — teaser
A leaked model tier called “Capybara,” sitting above Opus. “Dramatically higher” scores on coding, reasoning, and cybersecurity. Plans and executes autonomously across systems. No pricing, no release date, “very expensive to serve.” We’ll cover it when it ships. Fortune

Tools of the Week

Cline Kanban is a standalone app for CLI-agnostic multi-agent orchestration. It gives you a Kanban board where every card is a live agent task. Set up dependency linking so when a parent task completes, dependent tasks auto-kick-off. Each task gets its own terminal and git worktree. Works with Claude Code, Codex, Cline, and others. Install with npm i -g cline — local-first, no cloud needed. Cline Kanban

AI at Tenvalleys

This week’s internal pick: /ideate — a skill that turns five ML architectures into structured creative thinking modes. Breed and cross-pollinate ideas (Evolutionary), refine rough drafts from noise to clarity (Diffusion), stress-test proposals through adversarial attack loops (GAN), sharpen positioning by defining what something is NOT (Contrastive), or compress a complex argument down to one sentence (Distillation).

It’s been useful for the kind of writing that needs sharpening rather than starting from scratch — hero-section copy, positioning statements, the line that has to do a lot of work in a small space. Fifteen minutes inside the skill produces a tighter result than fifteen minutes of staring at a blank page.

If you’re building a library of in-house AI skills your team will actually use, or you’d like to see how /ideate works in practice — reach out at contact@tenvalleys.com.

Must-See

“The AI Doc: Or How I Became an Apocaloptimist” — a new full-length documentary (1h 43min) directed by Daniel Roher, who won the Oscar for “Navalny,” and produced by the team behind “Everything Everywhere All At Once.” It features Sam Altman, Dario and Daniela Amodei, and tackles the big question head-on: is AI the collapse of humanity, or our ticket to the cosmos? Sitting at 8.2 on IMDb and 87% on Rotten Tomatoes, it’s currently in US theaters only (Focus Features, since March 27) but coming to Apple TV later this year. Worth putting on your watchlist for when it hits streaming.

See you next week.


Claude can now use your computer

Posted on March 28, 2026 (updated June 3, 2026) by Nikola Powalka

This was Anthropic’s week. Claude learned to use your computer, got a new auto mode, started dreaming (yes, really), and showed up on Discord. The launch drew 74 million views on Twitter alone. But it wasn’t just Anthropic: Xiaomi proved you don’t need billions to build a top coding AI, Google launched privacy-preserving models for banks, and OpenAI quietly killed Sora. Let’s get into it.

Topic of the Week

Claude Can Now Use Your Computer

Anthropic shipped Computer Use on March 23 — and the internet lost it. 74 million views, 139K likes, 25K reposts. Those aren’t normal numbers for an AI feature launch.

Here’s what it does: Claude can now open apps on your Mac, navigate your browser, fill in spreadsheets — anything you’d do sitting at your desk. You can send it a task from your phone, go do something else, and come back to finished work on your computer.

Now, technically, “computer use” existed before — Anthropic launched a developer API version back in October 2024. But that was raw infrastructure. You needed Docker containers, VNC servers, and coding skills to make it work. What shipped this week is completely different: no setup required, just enable it and Claude sees your screen. Think of it like self-driving cars — the old version gave engineers access to raw sensor data. This one lets a normal person press “drive me to work.”

The smart part: when Claude has a proper integration (like Google Calendar or Slack), it uses that. But when there’s no connector — say, your company’s internal HR tool or that legacy system nobody built an API for — it falls back to clicking through the app like a human would.

Available now as a research preview for Pro and Max subscribers, macOS only. Anthropic recommends not using it with sensitive data yet.

CNBC | Engadget

Claude Code & Coding AI

But Anthropic didn’t stop there. It almost feels like they don’t stop at all — Twitter and Reddit are going crazy. Here’s everything else they shipped this week:

Auto Mode (Mar 24) — The middle ground between “approve every single action” and “let Claude do whatever it wants.” A classifier checks each action before it runs — safe ones proceed automatically, risky ones get blocked and Claude finds another way. Enable with claude --enable-auto-mode, cycle to it with Shift+Tab. Available on Team plan now, Enterprise rolling out.

TechCrunch

Auto Dream
This one is wild. Claude Code now has a “REM sleep” cycle for its memory. Every 24 hours (after at least 5 sessions), a background agent wakes up and cleans house: converts relative dates like “yesterday” to actual dates, removes contradicted facts, merges duplicate entries, and prunes the memory index to stay under 200 lines. If Auto Memory is the note-taking, Auto Dream is the filing system that keeps those notes useful over time.
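The housekeeping steps described above can be sketched in a few lines. This is only an illustration of the idea, not Claude Code’s implementation: the memory format (a plain list of strings) and the function name are assumptions, it only handles the lowercase words “yesterday” and “today”, and it skips the harder step of detecting contradicted facts.

```python
from datetime import date, timedelta

def consolidate(entries, today, max_lines=200):
    """Resolve relative dates, drop duplicates, and keep the newest lines."""
    relative = {"yesterday": today - timedelta(days=1), "today": today}
    seen, cleaned = set(), []
    for entry in entries:
        for word, day in relative.items():
            entry = entry.replace(word, day.isoformat())
        key = entry.strip().lower()  # case-insensitive duplicate check
        if key and key not in seen:
            seen.add(key)
            cleaned.append(entry.strip())
    return cleaned[-max_lines:]  # prune from the oldest end

# Hypothetical memory entries, for illustration only.
notes = ["Deployed v2 yesterday", "deployed v2 yesterday", "User prefers tabs"]
result = consolidate(notes, today=date(2026, 3, 28))
print(result)  # ['Deployed v2 2026-03-27', 'User prefers tabs']
```

Turning “yesterday” into a concrete date is what keeps a note meaningful when it is read weeks later, which is the point of running the cleanup on a schedule.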

claudefa.st

Claude Code Channels
Claude Code is now on Discord and Telegram. Message it a task from your phone, it executes on your machine. Anthropic is clearly building toward a world where your AI assistant is always reachable, not just when you’re at your terminal.

MCP Tools on Mobile (Mar 26) — Figma, Canva, Amplitude, Slack — all the integrations that launched on desktop in January now work on the Claude mobile app. Explore designs, create slides, check dashboards, all from your phone.

Projects in Cowork (Mar 20) — Keep your tasks and context in one place, focused on one area. Files and instructions stay local on your computer.

Version releases (v2.1.81 → v2.1.84)
Three releases this week. Highlights: --bare flag for scripted calls (v2.1.81), managed-settings.d/ for team policy fragments (v2.1.83), and PowerShell tool for Windows (v2.1.84).

Fresh Papers

Governed Memory: A Production Architecture for Multi-Agent Workflows

Your AI agents are all working on the same customer, but none of them remember what the others learned. This paper proposes a fix.

The problem: enterprise AI deploys dozens of agents across workflows — sales, support, ops — each acting on the same entities with no shared memory and no governance. RAG solves retrieval but not governance: who stores what, which policies apply, and whether quality is silently degrading.

The solution is a four-layer architecture: ingestion, governance routing, retrieval, and schema lifecycle. Results: 99.6% fact recall, 92% governance routing precision, 50% token reduction through progressive delivery, and zero cross-entity leakage across 500 adversarial queries. Already in production at Personize.ai.

Multi-agent systems with governance are emerging as a clear pattern for enterprise AI delivery — banking, insurance, regulated workflows in particular.

arXiv

VaultGemma: The World’s Most Capable Differentially Private LLM

Google trained a 1B-parameter model with a mathematical bound on how much it can reveal about any single piece of training data. VaultGemma uses differential privacy (adding calibrated noise during training), which limits how much any individual data point can influence the final model, rather than making leakage strictly impossible. The privacy guarantee: epsilon <= 2.0, delta <= 1.1e-10. In plain language: the reported result is no detectable memorization of training data.
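For readers who want the formal statement behind those two numbers, this is the standard textbook definition of (ε, δ)-differential privacy, not something specific to VaultGemma. A training procedure M satisfies it if, for any two datasets D and D′ that differ in a single record and any set of possible outcomes S:

```latex
\Pr[\, M(D) \in S \,] \;\le\; e^{\varepsilon} \cdot \Pr[\, M(D') \in S \,] + \delta
```

With ε ≤ 2.0, the multiplicative factor is at most e² ≈ 7.4, and δ ≤ 1.1e-10 is a very small additive slack on top of it. What counts as a “single record” depends on how the training data is split up, so the guarantee is always relative to that unit.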

The catch: it’s not as smart as today’s best models. Google is honest about it — current DP-trained models perform roughly like non-private models from 5 years ago. But the gap is closing. And for regulated industries like banking and healthcare, where “good enough + guaranteed private” beats “amazing but might leak” — this is a big deal.

Google Research

OpenAI Model Spec: How Should AI Behave?

OpenAI published a 100-page framework defining exactly how their models should behave — who they listen to, what they refuse, and how they resolve conflicting instructions. It’s built around a chain of command: safety first, then OpenAI’s policies, then developer rules, then user preferences. The whole thing is public so researchers and policymakers can “read, inspect, and debate” it.

Interesting data point: current compliance rates range from 72% (GPT-4o) to 89% (GPT-5 Thinking). So even with a spec, models don’t follow it perfectly. The gap between “intended behavior” and “actual behavior” is itself a research problem.

Time | OpenAI

New Models

Xiaomi MiMo-V2-Flash
A phone company just built the #1 open-source coding model. MiMo-V2-Flash scores 73.4% on SWE-Bench Verified, beating every other open model. It’s a 309B MoE model (15B active parameters) with a 256K context window. Price: $0.10 per million input tokens — that’s 3.5% of what Claude Sonnet costs for comparable coding performance. Open-source under MIT license.

In edition #002 we covered Qwen beating big models on narrow tasks. In #003, Qwen nearly matched Claude Opus on SWE-bench. Now Xiaomi enters the ring. The Chinese open-source wave is widening from Alibaba alone to multiple hardware companies.

Reddit | GitHub

Google Gemini 3.1 Flash Lite
Google’s answer to the pricing war. 2.5x faster than Gemini 2.5 Flash, $0.25/M input tokens, 381 tokens/sec, 1M context window. Beats GPT-5 mini on 6 out of 11 benchmarks. Google’s clearly going after high-volume enterprise workloads where speed and cost matter more than peak intelligence.

OpenAI GPT-5.4 mini + nano
OpenAI going smaller too. GPT-5.4 mini is 2x faster than GPT-5 mini, optimized for coding, computer use, and subagents. Nano goes even smaller at $0.05/M input tokens. Everyone’s racing to the bottom on cost.

The trend is clear: the pricing war is collapsing the cost curve for capable AI. What cost $3/M tokens last year costs $0.10 now.

Fun Break

AI Makes Music Now (For Real This Time)

Remember Lyria 3 from our first edition? Back then, Google’s music AI could generate 30-second clips — fun to play with, but not exactly a song.

One month later: Lyria 3 Pro generates full 3-minute tracks with intros, verses, choruses, and bridges. It actually understands song structure now — you can prompt for specific musical elements and get something that feels composed, not just generated.

Available in the Gemini app for paid subscribers, and on Vertex AI for businesses who need audio at scale (think game soundtracks, video platforms).

From 30-second jingles to full songs in a month. That’s the pace of AI right now.

Google Blog

In the Background

Sora is dead. OpenAI is shutting down its video generation app — just six months after launch. Sora shot to #1 on the App Store in September, but by January downloads had dropped 45%. Disney was supposed to invest $1 billion and license Mickey Mouse for Sora content — the deal never closed. OpenAI says the research team will refocus on “world simulation for robotics.” Translation: the compute was too expensive for a product that wasn’t sticking.

AI at Tenvalleys

Introducing 10vOS

At Tenvalleys we’ve been building our own AI operating system — 10vOS. It’s a system of specialized agents, skills, and automations that powers how we build client presentations, proposals, interactive dashboards, landing pages, branded documents — and yes, this very newsletter. The data collection, article deep-reading, and trend analysis behind every edition is orchestrated by 10vOS; a human editor does the final pass.

The bet is simple: AI is most useful where it automates the parts of work that drain time without adding judgment, leaving humans more space for the judgment calls. We’re using 10vOS internally first because we’d rather find the bruises ourselves than on a client project.

If you’re thinking about building something similar inside your own organisation — or you’d like to see what 10vOS already does in practice — reach out at contact@tenvalleys.com.

See you next week.


GPT-5.4 costs 3x Gemini for the same score

Posted on March 21, 2026 (updated June 3, 2026) by Nikola Powalka

21 March 2026

Twelve major AI stories landed in a single Wednesday — and that was just the start of the week. Here’s what mattered most.

Topic of the week

GPT-5.4 Launch — Higher Performance, Higher Price

OpenAI dropped GPT-5.4 just two days after 5.3 — no explanation, no buildup, just a quiet release that says a lot about how fast this race is moving. The headline numbers are genuinely impressive: 83.3% on ARC-AGI-2 (still behind Gemini 3 Deep Think’s 84.6%, but closing fast), 75% on OSWorld-Verified which actually beats the human baseline of 72.4%, and new state-of-the-art on SWE-Bench-Pro and Terminal-Bench-Hard. The model leads the Agentic Index at 69 points, edging out Claude Opus 4.6 at 68, and tops the Coding Index at 57 vs Gemini’s 56. The full family — Pro, Thinking, Mini, Nano — covers everything from heavy enterprise workloads to lightweight edge use cases.

Here’s the catch though: on the Artificial Analysis Intelligence Index, GPT-5.4 Pro ties Gemini 3.1 Pro Preview at 57 points. Same score. But the price? $2,950 per million queries vs $892 for Gemini. That’s over 3x the cost for equivalent overall intelligence. API pricing tells a similar story — standard tier runs $2.50/$15 per million tokens, but Pro jumps to $30/$180. For enterprise teams running thousands of agentic tasks daily, that delta adds up fast. The 1.05M token context window and 128K output are competitive, but context windows alone don’t justify a 3x premium.

The real takeaway isn’t about any single benchmark — it’s about what matters now. Raw intelligence scores are converging across providers. The differentiator is shifting to agentic reliability, cost efficiency, and how well these models work inside real toolchains. If you’re evaluating models for production deployment, run your own evals on your actual workflows. The leaderboard winner and the best model for your use case might be two very different things.

Artificial Analysis — GPT-5.4 Intelligence Report

Fresh Papers

Salesforce: 6 Ways to Ruin a Perfectly Good AI Agent
Common ways companies sabotage their own agent deployments, from skipping production testing to ignoring workflow integration. If you’re working on Agentforce or any agent deployment, this is a quick read that might save you weeks. Companion piece: “Is Your Agent Integration Stuck?” salesforce.com

AWS: Agentic AI in the Enterprise — Guidance by Persona
Two key takeaways. One: treat agent deployment like hiring — write a job description with clear “done” criteria, escalation triggers, quality thresholds. Two: the biggest risk isn’t failure, it’s success — every team wants an agent, each builds their own stack, and you end up with an unmanageable zoo. Build a platform for 100 agents, not 10 one-offs. aws.amazon.com

VeriGrey: Greybox Agent Validation
Fuzz-testing for AI agents. Instead of blind probing, VeriGrey watches which tools get called and uses that as feedback to craft nastier prompt injections. 33% more vulnerabilities found than black-box methods, 100% success on Kimi-K2.5, 90% on Claude Opus 4.6. Tested on real agents (Gemini CLI, OpenClaw), not just benchmarks. Continues the agent reliability theme from Edition 1. arxiv.org/abs/2603.17639

AI Scientist via Synthetic Task Scaling
Auto-generates 500 ML research environments, collects 34K trajectories from GPT-5 as teacher, fine-tunes small models (Qwen 4B/8B) for 9-12% improvement on MLGym. The “big model teaches small model” playbook is becoming standard for capability transfer. huggingface.co/papers/2603.17216

Must-read

Anthropic’s 81,000 Interviews

Anthropic published the largest qualitative AI study ever — 80,508 Claude users across 159 countries, 70 languages. Not a survey, actual conversations about how people feel about AI.

The core finding: hope and fear aren’t two camps — they live inside the same person. A lawyer saving hours on contract review simultaneously worries about losing the ability to think critically. 81% said AI already helped them concretely, and the #1 desire isn’t “replace my job” — it’s handling routine tasks so they can focus on strategic work. But 26.7% flagged hallucinations as their top concern, and it’s the only area where negative experience fully overshadows the positive. One user cut a 173-day process to 3 days. Ukrainian users described AI as having “pulled me back to life” during wartime. And only 6.7% worry about existential risk — the least common concern by far.

Seriously, go read the full thing — even the way the page is built is worth seeing. The interactive elements, the data visualizations, the way you can explore findings by country and topic. It’s one of those rare cases where the presentation is as impressive as the research itself: https://www.anthropic.com/features/81k-interviews

New models & tools

Google Stitch — AI-Native Design Canvas

Google just dropped something big. Stitch is a free AI design tool powered by Gemini 3 that lets you describe business objectives and feelings, and it generates multiple design directions — they’re calling it “vibe design.” The standout feature is Voice Canvas: the AI interviews you about your design goals and makes live updates as you speak. But the part that matters most for us — it ships with an MCP server that integrates directly with Claude Code, Cursor, and Gemini CLI, meaning you can pull designs into your dev workflow without leaving the terminal. It also exports to Figma with proper Auto Layout (not flat images) and clean HTML/CSS. Figma’s stock reportedly dipped after the announcement, which tells you how seriously the market is taking this. Free tier gives you 350 generations per month.

Cursor Composer 2 — With Their Own Model

Cursor just dropped Composer 2 with their own AI model. Not Claude, not GPT — their own. And it beats Claude Opus on coding benchmarks at a fraction of the cost. A code editor company with ~50 people just outperformed a $30 billion AI lab at the thing that lab is supposed to be best at. The vibe coding era just got an upgrade.

Google AI Studio Goes Full-Stack

Google AI Studio just became a full-stack app builder. You type a prompt, and the Antigravity coding agent generates an entire application — frontend, backend, server-side logic, npm packages, the works. It understands your whole project structure, reasons across multiple files, writes tests, fixes errors, and refactors components. Need a database? It detects that from your prompt and offers to set up Firestore and Firebase Authentication with one click. Need to connect to Stripe or Google Maps? It searches for the right web tools and wires them up. It also supports MCP, so you can extend Gemini workflows with external servers — same protocol Claude Code and Cursor use. All of this is free. blog.google

Claude Code & Coding AI

Three more releases this week (v2.1.77–79). The highlights:

Default reasoning effort is now medium
If Claude’s been less thorough lately, that’s why. Bump it back to high in settings or type “ultrathink” for full power on demand.

Opus 4.6 output doubled to 64k tokens (128k upper bound). No more cut-off refactors.

/remote-control in VS Code
Start a session on your laptop, continue from your browser or phone. Sleeper hit.

Resume 45% faster, auto-updater memory leak fixed (was eating tens of GB), compound bash “Always Allow” finally works, and two security patches — sandbox could be silently disabled, and hooks could bypass deny rules. Update if you haven’t.

Quick bites

Apple says no to vibe-coded apps — unless they’re Apple’s

Apple started blocking updates for apps like Replit and Vibecode — tools that let you build apps using AI. The reason? Apple says these apps break their rules by letting users create and run new software inside them. After months of back-and-forth, Replit dropped from #1 to #3 in Apple’s dev tools ranking because they can’t ship updates. The irony? Apple just added AI-powered coding to their own Xcode 26.3, built with OpenAI and Anthropic. So building apps with AI is fine — as long as Apple is the one doing it. Between this, Cursor building their own model, and Google giving away full-stack coding for free — every vibe coding startup (Replit, Lovable, Bolt) just had a very bad week.

9to5Mac · MacRumors

xAI is paying Wall Street to teach Grok how to be Wall Street

xAI is hiring at least 20 finance contractors — investment bankers, portfolio managers, credit analysts, crypto specialists — at $45-100/hour to train Grok on leveraged loan syndication, distressed investing, MBS, and CLOs. Minimum requirement: a Master’s in finance. They’re not alone — OpenAI’s Project Mercury is paying $150/hour and has already recruited 100+ people from Goldman, JPMorgan, and Morgan Stanley. There’s something darkly funny about paying top finance talent to train the models that will eventually do their jobs.

Bloomberg · Entrepreneur

AI at Tenvalleys

How AI Pulse Actually Works

This is the 4th edition of AI Pulse, so it’s worth pulling back the curtain on how it gets built.

Here’s how the pipeline works. A Node.js script scrapes 9 sources every week: newsletters (The Batch, Import AI, The Rundown, TLDR AI, OpenAI Blog, DeepMind, ChinAI, Anthropic News), Reddit (r/LocalLLaMA, r/ClaudeCode, r/artificial, r/OpenAI), HuggingFace trending papers, arXiv, GitHub releases, Twitter/X feeds, research blogs, and industry blogs. This week that was 24 data files containing roughly 2,000 individual pieces of information — articles, papers, tweets, release notes. A scoring algorithm then ranks everything by relevance to what matters — keywords like “agentic”, “enterprise”, “governance”, “claude code” get bonus points — and generates a choices sheet with the top picks per category.
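The scoring step is simple enough to sketch. The real pipeline is a Node.js script; what follows is a Python illustration only, and the bonus weights, the `Item` fields, and the `choices_sheet` helper are all hypothetical names invented for this example.

```python
from dataclasses import dataclass

# Keywords come from the paragraph above; the weights are made up for illustration.
BONUS_KEYWORDS = {"agentic": 3, "enterprise": 2, "governance": 2, "claude code": 3}


@dataclass
class Item:
    title: str
    text: str
    category: str
    base_score: float = 1.0


def score(item):
    """Base score plus a bonus for every keyword found in the title or text."""
    haystack = f"{item.title} {item.text}".lower()
    bonus = sum(weight for kw, weight in BONUS_KEYWORDS.items() if kw in haystack)
    return item.base_score + bonus


def choices_sheet(items, top_n=3):
    """Group items by category and keep the top_n highest-scoring ones in each."""
    by_category = {}
    for item in items:
        by_category.setdefault(item.category, []).append(item)
    return {
        category: sorted(group, key=score, reverse=True)[:top_n]
        for category, group in by_category.items()
    }
```

Plain substring matching like this is crude (it will happily match "agentic" inside an unrelated quote), which is one reason a human still reviews the picks.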

Then the agents kick in. Up to 5 sub-agents read the full articles in parallel (not just the RSS snippets: they actually open the URLs and pull out specific numbers and insights). 1 trend-spotter agent reads ALL the collected data plus previous editions to find cross-source patterns and “remember X from last week?” callbacks. And then up to 8 writer agents draft each section using the brand voice profile, a document that captures tone, anti-cringe rules, and the fact that the audience is technical readers who will immediately notice if something sounds off. That’s up to 14 AI agents in a single run. The whole pipeline from raw data to enriched draft takes about 20 minutes.
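For the curious, the fan-out can be sketched with asyncio. This is an illustration, not the actual pipeline: `call_agent` is a stand-in for a real model call, and running the three stages one after another (readers, then trend-spotter, then writers) is an assumption about ordering.

```python
import asyncio

MAX_READERS = 5  # "up to 5 sub-agents" reading full articles
MAX_WRITERS = 8  # "up to 8 writer agents"


async def call_agent(role, payload):
    """Hypothetical stand-in for a real agent call; it just echoes its input."""
    await asyncio.sleep(0)
    return f"{role}:{payload}"


async def bounded(semaphore, role, payload):
    """Run one agent call while respecting the concurrency cap."""
    async with semaphore:
        return await call_agent(role, payload)


async def run_pipeline(urls, sections):
    readers = asyncio.Semaphore(MAX_READERS)
    writers = asyncio.Semaphore(MAX_WRITERS)
    # Stage 1: readers open the articles in parallel, at most 5 at a time.
    summaries = await asyncio.gather(*(bounded(readers, "reader", u) for u in urls))
    # Stage 2: a single trend-spotter sees everything the readers collected.
    trends = await call_agent("trend-spotter", len(summaries))
    # Stage 3: writers draft the sections in parallel, at most 8 at a time.
    drafts = await asyncio.gather(*(bounded(writers, "writer", s) for s in sections))
    return summaries, trends, drafts
```

The semaphores are what turn "up to N agents" into a hard cap: you can queue 50 URLs and still never have more than five readers running at once.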

But here’s the thing — what you’re reading is never the raw AI output. A human editor reviews everything, swaps items that don’t fit, rewrites sections that sound too generic, and adds the personal bits only a human can add. The AI does the research and the first draft. The editor makes it real. Four editions in and the process is getting smoother every week — the voice profile gets more precise, the scoring algorithm gets better at picking what matters, and editing time keeps dropping.

If you’re thinking about automating the research-heavy parts of your own workflow — reach out at contact@tenvalleys.com.

In the background

The 2025 Turing Award goes to Charles H. Bennett (IBM, 82) and Gilles Brassard for founding quantum information science. Their BB84 protocol from 1984 grounded encryption in physics rather than math. IBM is now building on this with Quantum Starling, targeting a fault-tolerant quantum computer by 2029.

New research from ImportAI 449: AI agents can now autonomously refine other LLMs — but the smarter the agent, the more it cheats (loading eval datasets, reverse-engineering scoring criteria). Meanwhile, mathematician Leonardo de Moura makes the case that AI-generated code needs mathematical verification: “The friction of writing code manually used to force careful design. AI removes that friction…replace human friction with mathematical friction.”

Something fun for the weekend

Arnis

Ever wanted to explore your own neighborhood in Minecraft? Arnis is a free, open-source app that takes real geographical data from OpenStreetMap and AWS Terrain Tiles and turns it into a playable Minecraft world — buildings, roads, elevation, all of it. The gallery includes Heidelberg, the Alps, New York, and the Taj Mahal. Works with Java and Bedrock editions. There’s even a browser companion called MapSmith for generating areas up to 150 km². Check it out: https://arnismc.com/

See you next week.

Prepared at Tenvalleys — a delivery-first AI engineering partner — by Nikola Powałka. Feedback? Email us at contact@tenvalleys.com or reach out on LinkedIn.

AI reviews your pull requests

Posted on March 13, 2026 (updated June 3, 2026) by Nikola Powalka

Topic of the Week

Claude Code Review — AI Reviews Your PRs

Anthropic released a new feature in Claude Code: Code Review. When you open a pull request, Claude dispatches a team of agents that scan the changes for bugs, security holes, and regressions. Results show up as inline comments on GitHub, tagged by criticality.

Why does this matter? Because Anthropic itself says code output per engineer has grown 200% in a year — and review has become the bottleneck. More AI-generated code = more code to review = a need for AI to review it. This doesn’t replace human review, but it filters out the obvious problems before a person even looks.

Currently in research preview for Team and Enterprise plans.

Fresh Papers

SPD-RAG: a separate agent per document

Answering complex questions often means stitching together facts spread across many documents. Standard RAG pipelines lose context on large corpora, and large-context-window models struggle to reason reliably over them. SPD-RAG proposes a different approach: each document gets its own agent that “knows” it inside out, and then the agents collaborate to assemble the answer.

Imagine a team where each person is an expert on one document — and together they answer the boss’s questions. Instead of one person trying to wrap their head around 500 pages at once.
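A toy version of the "one expert per document" idea might look like the sketch below. To be clear, this is not the paper's method: the agents here just do word-overlap matching where the real ones would be LLMs, and `DocumentAgent` and `coordinate` are hypothetical names made up for the example.

```python
import re


def tokens(text):
    """Lowercased word set, used for a crude overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class DocumentAgent:
    """Knows exactly one document and answers only from it."""

    def __init__(self, name, text):
        self.name = name
        self.sentences = [
            s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()
        ]

    def answer(self, question):
        """Return (score, document name, best sentence), or None if nothing matches."""
        if not self.sentences:
            return None
        q = tokens(question)
        best_score, best = max(
            ((len(q & tokens(s)), s) for s in self.sentences),
            key=lambda pair: pair[0],
        )
        return (best_score, self.name, best) if best_score > 0 else None


def coordinate(agents, question):
    """Ask every agent, drop the ones with nothing to say, rank the rest."""
    findings = [f for f in (agent.answer(question) for agent in agents) if f]
    findings.sort(key=lambda f: f[0], reverse=True)
    return [(name, sentence) for _, name, sentence in findings]
```

The shape is the point: each agent only ever sees its own document, and the coordinator only ever sees short findings, so no single step has to hold the whole corpus.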

Worth noticing the trend: two weeks ago we wrote about CodeCompass (code navigation), last week about LSP in Claude Code — and now SPD-RAG attacks the same problem, just in the world of documents instead of code. “How to find information faster and more reliably” might become one of the hottest topics in AI in the near future.

arXiv

Red Teaming LLMs in Banks — How to Test AI in Finance

Banks are deploying LLMs more and more, but standard AI safety tests don’t catch the risks specific to the financial sector. This paper proposes risk-adjusted harm scoring — an automated red teaming framework tuned to banking regulations. Instead of asking “can the model be broken?”, it asks “what financial and regulatory damage could a break cause?”.

This methodology is especially interesting for banking and financial services deployments. Testing AI for financial and compliance risk is going to be increasingly required, and it’s good to know concrete methodologies are emerging.

arXiv

New Models

Qwen again. Last week we wrote that Alibaba had released the Qwen 3.5 series. Since then, hard data has come in — and it’s interesting:

Fine-tuned Qwen3 SLMs beat frontier LLMs on narrow tasks
Someone did a systematic comparison of small models (0.6–8B parameters) against the largest APIs: GPT-5, Gemini 2.5, Claude Opus 4.6. The conclusion: a small, well-tuned model can beat a giant API on a specific task. This changes the cost calculus: instead of paying for an expensive Opus, you stand up a small Qwen on your own server. (412 upvotes) Reddit

Qwen3.5-35B almost matches Claude Opus on SWE-bench
37.8% vs Opus’s 40% on a coding benchmark. A model you can run on your own hardware almost matches the most expensive API on the market. (423 upvotes) Reddit

Claude Code & Coding AI

Five releases this week (v2.1.70 → v2.1.74), most interesting changes:

/loop (v2.1.71) — new command for running prompts in a loop (e.g. /loop 5m check the deploy). Automatic monitoring and background tasks without leaving the terminal.

/context (v2.1.74) — suggests how to optimize your session: detects memory bloat, heavy tools, and other things slowing you down.

Memory leak fix (v2.1.74) — streaming wasn’t releasing memory, so long sessions kept getting slower. Fixed.

Tools of the Week

Context Hub
A tool from Andrew Ng (DeepLearning.AI) that solves a specific problem: coding agents don’t know about APIs and libraries that came out after their training cutoff. Context Hub is a crowdsourced documentation database that you plug into your coding agent, and suddenly it knows how to use the latest version of a framework instead of hallucinating outdated syntax. Newsletter

SurfSense — open-source alternative to NotebookLM
Connects any AI model to your company’s internal knowledge sources (documents, wikis, databases). The team can collaboratively chat with the data, comment, and work together in real time. For those who need more than NotebookLM offers or want something self-hosted. Reddit

Apple M5 Max 128GB — local model benchmarks
New chip, 128GB unified memory, and r/LocalLLaMA immediately started testing. The post has 1,886 upvotes and 300 comments, with results being posted live in the comments. If you want the details, link below. Reddit

AI at Tenvalleys

Uncoursed.ai

Some of our engineers are building Uncoursed.ai — a platform that turns any material (PDFs, presentations, internal documents) into full interactive courses with flashcards, quizzes, an AI tutor, and gamification. Drop in a 300-page textbook — get a finished course in minutes.

The idea came from a frustration we all know: you get 200 pages to read, you open the PDF, you read three pages and… you fall asleep. Or you dump it into ChatGPT, get a summary, and you feel like you “get the topic” while half the content got lost somewhere along the way. The team wanted to create something that walks you through the entire material step by step, skipping nothing, but in a way that actually pulls you in.

And here’s the key: Uncoursed doesn’t summarize, doesn’t shorten, doesn’t “highlight the most important parts.” It guarantees 100% material coverage — you see exactly what you’ve worked through and what’s still ahead. On top of that, the platform combines scientifically validated learning techniques (active recall, spaced repetition) with mechanics borrowed from Duolingo — short sessions, quizzes, flashcards, AI tutor, gamification. Something like YouTube Shorts, except instead of doomscrolling — you’re actually learning.

There’s already a working MVP, with conversations underway with several large enterprises across banking, telco, retail, and publishing, as well as pilot rollouts inside our own education partnerships.

If you want to see a demo or have material you’d like to test on it — reach out at contact@tenvalleys.com.

For Dessert

Yann LeCun, Turing Award laureate, is 65. He just stepped down as Chief AI Scientist at Meta, and instead of retirement — he went out to raise a billion dollars for a startup (AMI Labs, $1.03B seed — likely the largest seed in European history). LeCun has been arguing for years that LLMs are a dead end and we need a fundamentally different architecture. Now he has the money to prove it. Most of the industry says he’s wrong. But what if he isn’t?

See you next week.

Prepared at Tenvalleys — a delivery-first AI engineering partner — by Nikola Powałka. Feedback? Email us at contact@tenvalleys.com or reach out on LinkedIn.

AI agents are growing new eyes

Posted on March 7, 2026 (updated June 3, 2026) by Nikola Powalka

This was the week AI agents got new eyes. LSP in Claude Code turned code navigation from minutes into milliseconds, Google turned documents into films, and 88-year-old Knuth showed how AI is actually supposed to be used. One thing at a time.

Topic of the Week

LSP in Claude Code — From Minutes to Milliseconds

Remember CodeCompass from last week? The paper that described the Navigation Paradox — AI agents struggle with code not because their context is too small, but because navigating code and searching text are two different problems.

A week later, someone on Reddit found a practical solution: enabling LSP in Claude Code. LSP (Language Server Protocol) is the same system VS Code uses to “understand” code — go to definition, find references, autocomplete. Someone wired it up to Claude Code and the results are brutal: code navigation dropped from 30–60 seconds to 50 milliseconds. Exact answers instead of guessing.

808 upvotes on r/ClaudeCode. People are calling it the biggest quality-of-life improvement in a long time.

Why does this matter? Because it fits a bigger trend: AI agents are getting better and better “eyes.” LSP gives them an understanding of code structure. MCP (Model Context Protocol) gives them access to external tools. GitNexus (more on that in tools) builds a dependency graph of an entire repo. Each of these solutions attacks the same problem: AI is good at generating code, but bad at understanding existing code. And that’s exactly what’s changing.

Reddit — LSP in Claude Code (808 upvotes)

Fresh Papers

A hierarchical agent system for payments
LLMs can automate workflows, but payments? Different league — it requires security, verification, and error tolerance. A new paper proposes a hierarchical multi-agent system where each agent has its role (verification, authorization, execution), with a “supervisor” overseeing everything. Practical AI in fintech — not a chatbot answering questions, but an agent that actually processes transactions.

arXiv

RIVA: LLM agents watching your infrastructure
Infrastructure as Code sounds great in theory, but in practice configuration “drifts” — someone changes something manually, the system updates itself, and suddenly what’s in Terraform doesn’t match what’s in production. RIVA is a framework that uses LLM agents to automatically detect those differences. Practical AI in DevOps — not hype, just a solution to a real problem.
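Stripped of the LLM part, drift detection boils down to diffing what you declared against what is actually running. The sketch below shows only that core comparison; it is not RIVA itself, and the flat key/value dictionaries and the `detect_drift` name are assumptions for illustration.

```python
MISSING = "<missing>"  # marker for a key present on only one side


def detect_drift(declared, actual):
    """Return {key: (declared value, actual value)} for every mismatch."""
    drift = {}
    for key in sorted(declared.keys() | actual.keys()):
        want = declared.get(key, MISSING)
        have = actual.get(key, MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical example: what the IaC config says vs what production reports.
declared_state = {"instance_type": "t3.medium", "min_replicas": 2, "tls": True}
actual_state = {"instance_type": "t3.large", "min_replicas": 2, "debug_port": 9229}
report = detect_drift(declared_state, actual_state)
```

Here `report` flags the resized instance, the TLS setting that vanished, and the debug port nobody declared. The hard part in real systems is getting a trustworthy picture of the actual state, which is the gap agent-based approaches try to fill.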

arXiv

New Models

GPT-5.3 Instant
OpenAI shipped an update focused on conversational fluidity. Less “cringe” tone, fewer unnecessary refusals and preachy disclaimers, hallucinations down 26.8%. Not a revolution, but a solid everyday-use upgrade.

Qwen 3.5 Small (9B)
Alibaba released a series of small models (0.8B–9B). The most interesting one: Qwen3.5-9B beats OpenAI’s GPT-OSS-120B on key benchmarks (GPQA Diamond: 81.7 vs 71.5) — and runs on a regular laptop. AI on edge devices is getting serious.

Gemini 3.1 Flash-Lite
Google released a “small but mighty” model. Faster and cheaper than Flash 2.5, with better scores. New feature: “thinking levels” — you can dial in how much reasoning the model does, balancing speed vs quality.

Phi-4-reasoning-vision-15B
Microsoft released an open-weight multimodal reasoning model. 15B parameters, sees images, thinks. Microsoft is opening up more models, building out the ecosystem.

Claude Code & Coding AI

Three Claude Code releases this week:

v2.1.69 (yesterday!) — big release, 103 changes. New /claude-api skill for building apps with the Claude API. Improved /remote-control. Ctrl+U on an empty bash prompt closes bash mode.

v2.1.68
Opus 4.6 default reasoning effort lowered to medium (the sweet spot between speed and accuracy). The keyword “ultrathink” is back for high effort on demand. Old Opus 4 and 4.1 models removed.

v2.1.63
New /simplify and /batch commands. Project configs and auto memory now work across git worktrees. New env var ENABLE_CLAUDEAI_MCP_SERVERS=false to disable MCP servers from claude.ai.

From the community:

– Best Practices repo: a repo with all the tips and workflows in one place. Already 5,000 stars. GitHub
– Free Max x20 for open source: Anthropic is giving 6 months of Claude Max (20x) to open-source maintainers with 5K+ stars or 1M+ monthly NPM downloads.

Interlude

Knuth and Claude — 88 Years Old, 30 Attempts, 1 Proof

Donald Knuth — computer science legend, author of “The Art of Computer Programming” (the bible of algorithms) — used Claude to solve a math problem. He’s 88.

Claude generated 30 different attempts at a solution. Knuth reviewed EVERY one of them, picked the one that worked empirically, and wrote the formal mathematical proof himself.

The internet immediately called it “vibemathing.” But this is the EXACT OPPOSITE of vibemathing. Knuth didn’t blindly trust the AI — he used it as a brainstorming partner, then applied human verification at the highest possible level.

This might be the most beautiful example of how we should be using AI: the machine generates options, the human verifies and proves. Especially when that human is 88 years old and still doing it better than most of us.

Tools of the Week

NotebookLM Cinematic Video Overviews (released March 4!) — Google added a feature to NotebookLM that turns your documents into animated explainer films. Not slides with narration — full animated scenes with a storyline. Under the hood: Gemini 3 plans the narration, Veo 3 generates the animation, Nano Banana Pro creates the graphic assets. Drop in a PDF, meeting notes, or a product spec → get a mini-film. For now, Ultra subscribers only and English only. Google Blog

GitNexus
Open-source tool that turns any GitHub repo into an interactive knowledge graph + AI agent you can talk to about your code. Runs entirely in the browser. Has an MCP server with 7 tools: search, symbol lookup, blast radius, git-diff impact mapping. 7.3K stars. Same trend as the LSP topic above: giving AI better tools to understand code. GitHub

AI at Tenvalleys

CV Builder

This is a new section — every week we’ll share how we use AI in our day-to-day work at Tenvalleys.

We built a Claude Code skill that automates preparing CVs for client proposals. You drop in an old CV, a LinkedIn profile, or even raw notes from a conversation — and out comes a professional CV in the Tenvalleys branded template. HTML rendered to PDF via headless Chrome.

What it does:

– Generates a CV in the Tenvalleys branded template (two-column layout, A4, 3 density presets)
– Writes bullet points using the CAR method (Context-Action-Result): not “worked on projects,” but concrete achievements
– Updates the CV from Linear data: projects, technologies, roles
– Job-match: compares the CV to client requirements and produces a fit report
– Searches our CV database for the best people for a specific role
– Built-in quality checklist: the AI checks itself for hallucinations

It saves a lot of time on a process that used to be manual and slow. The same approach scales to other “we have lots of structured text, we need branded output, the AI does the first 80%” workflows.

If you’d like to automate something similar inside your own organisation — reach out at contact@tenvalleys.com.

In the Background

Claude #1 in the App Store
Claude overtook ChatGPT as the most popular app in the App Store. A big chunk of that is fallout from the OpenAI/Pentagon controversy — Anthropic refused to remove safety guardrails, users voted with their wallets.

OpenAI VP moves to Anthropic
The VP responsible for Post-Training (RLHF, safety, instructions) left OpenAI for Anthropic. Not a random engineer, but someone who had direct influence over how GPT “thinks.” OpenAI is losing not just users, but key people too.

OpenAI raises $110B
Record private funding round. Meanwhile they’re losing people and users. Lots of money, lots of questions.

Hot Take

Vibe Coding in AR Glasses, While Doing the Dishes

Someone posted on Reddit: “Vibe coding while doing the dishes in augmented reality.” The guy is literally coding in AR glasses while washing dishes.

On one hand — Knuth, in the same week, shows that the best results come from AI + careful human verification. On the other — someone’s coding at the kitchen sink because “the AI’s going to write the code anyway, I just nudge it.” And in the background, research says AI scores 84% on coding benchmarks but 25% on real production code.

Three completely different approaches to AI in one week.

Reddit — Vibe Coding While Doing Dishes in AR

See you next week.

Prepared at Tenvalleys — a delivery-first AI engineering partner — by Nikola Powałka. Feedback? Email us at contact@tenvalleys.com or reach out on LinkedIn.

8 of 12 AI models went bankrupt

Posted on February 27, 2026 (updated June 3, 2026) by Nikola Powalka

Welcome to AI Pulse. Every Friday we share what caught our attention in AI that week — the models we’re trying, the papers we’re reading, the tools that change how we build. Written for people who’d rather skip the hype.

This first edition: Qwen 3.5-35B running locally takes on the cloud, an ASIC chip hits 16,000 tokens per second, and 8 of 12 AI models running food trucks went bankrupt.

Topic of the Week

Alibaba dropped Qwen 3.5-35B-A3B — a Mixture of Experts model with 35B parameters but only 3B active at inference. Reddit’s verdict: a gamechanger for agentic coding.

The numbers are impressive. The model runs locally on consumer hardware, and users are reporting Opencode results comparable to far more expensive cloud models. The real value: zero API dependency, zero code leaving your machine — which matters a lot for projects with sensitive code (banking, compliance).

But the full picture is more nuanced. A separate benchmark across 70 real repositories shows Qwen 3.5 falls apart on hard tasks — complex refactors, multi-file changes, deep codebase navigation. It nails the easy stuff, but it won’t replace frontier models for heavy lifting.

Bottom line: a great option for local pair programming and day-to-day work, but for agents operating on large codebases you still want Claude or GPT.

Fresh Tools

ASIC inference — 16,000 tokens per second
Startup Taalas built a chip dedicated to running AI models. Llama 3.1 8B runs on it at 16,000 tok/s — for reference, a typical Claude response is 50–100 tok/s. They’ve opened a free API as a proof of concept. The future of inference is dedicated hardware, not GPUs.

Claude Code Security Reviews
Anthropic added a security review mode to Claude Code. The agent scans your codebase for vulnerabilities, identifies attack vectors, and proposes fixes. For teams working on banking and compliance code — useful out of the box.

New Models

Claude Sonnet 4.6
Better coding, more consistent instruction following. On certain office-style tasks it actually outperforms the more expensive Opus.

Google Gemini 3.1 Pro
Doubled its reasoning scores compared to the previous version. On ARC-AGI-2 it hit 77.1%, roughly twice what Gemini 3 Pro scored. The race is heating up: more competition means better tools for all of us.

Mercury 2
A diffusion language model from Inception Labs. Instead of generating one token at a time (like GPT, Claude, Qwen), Mercury generates many tokens in parallel. Result: ultra-fast inference without specialized hardware. An interesting architectural direction.

Liquid AI
A reasoning model that fits in under 900 MB of RAM. A mix of attention and convolutional layers instead of the standard transformer. Targeted at edge deployment — mobile, IoT, embedded. Small, fast, efficient.

Interlude

Someone on Reddit gave 12 AI models $2,000 and a food truck. They had to run the business for 30 days — location, menu, prices, staff, inventory.

Opus made $49K. GPT-5.2 — $28K. Eight models went bankrupt. And the best stat of all: every single model that took out a loan went bankrupt (8 out of 8).

Before you ask AI for a business strategy — remember the food truck.

Notable Papers

CodeCompass — why AI gets lost in your code
Anyone who uses Claude Code knows this problem — the agent is looking for a file that’s right under its nose, and it can’t see it. Researchers named it the Navigation Paradox: agents fail not because their context is too small, but because navigating code and searching text are two different problems.

The fix? Instead of keyword search, CodeCompass gives the agent access to a dependency graph — the agent “sees” the project structure, not just text. Result: 99.4% task completion vs 76.2% without it.
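The difference between searching text and walking a graph is easy to see on a toy repo. This sketch is not CodeCompass; the file names, the import map, and the helper functions are all invented for illustration.

```python
from collections import deque

# Hypothetical mini-repo: file -> files it imports.
IMPORTS = {
    "api/routes.py": ["services/billing.py"],
    "cli/report.py": ["services/billing.py"],
    "services/billing.py": ["core/tax.py"],
    "core/tax.py": [],
}

# Hypothetical file contents, just enough to run a text search on.
SOURCES = {
    "api/routes.py": "from services.billing import invoice",
    "cli/report.py": "from services.billing import invoice",
    "services/billing.py": "from core.tax import compute",
    "core/tax.py": "def compute(amount): return amount * vat_rate",
}


def keyword_search(term):
    """Text search: only files that literally contain the term."""
    return sorted(path for path, src in SOURCES.items() if term in src)


def dependents(target):
    """Graph walk: every file that depends on target, directly or transitively."""
    reverse = {}
    for path, deps in IMPORTS.items():
        for dep in deps:
            reverse.setdefault(dep, []).append(path)
    seen, queue = set(), deque([target])
    while queue:
        current = queue.popleft()
        for parent in reverse.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(seen)
```

Searching for `vat_rate` finds only `core/tax.py`, while `dependents("core/tax.py")` also surfaces the billing service, the API routes, and the CLI report, none of which mention the term. Those are the files a change to the tax logic could break.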

“Vibe Coding” and Epistemic Debt
A paper on the growing problem of vibe coding — the code works, but the author has no idea why. Researchers call this epistemic debt and propose a concrete fix: metacognitive scripts — structured prompts woven into the AI interaction that, after every generated block, force the developer to explain what’s happening, identify edge cases, and predict behavior under different inputs. In tests, the scripts noticeably improved code understanding without slowing work down. An interesting direction — AI as tutor, not as ghostwriter.

In the Background

Anthropic vs the distillers
Anthropic published a report on how DeepSeek, Moonshot AI, and MiniMax set up 24,000 fake accounts and ran 16 million conversations with Claude to copy agentic reasoning and tool use. Reddit erupted into a debate about whether this is theft or hypocrisy — Western companies also train on other people’s data.

Hegseth gives Anthropic an ultimatum
The US Secretary of Defense demanded Anthropic remove safety guardrails from Claude for Pentagon use. Anthropic refused. CEO Dario Amodei is meeting Hegseth this week.

See you next week.

Prepared at Tenvalleys — a delivery-first AI engineering partner — by Nikola Powałka. Feedback? Email us at contact@tenvalleys.com or reach out on LinkedIn.

Subscribe to our newsletter for the latest updates and new features.
