GPT-5.4 costs 3x Gemini for the same score

21 March 2026

Twelve major AI stories landed in a single Wednesday — and that was just the start of the week. Here’s what mattered most.

Topic of the week

GPT-5.4 Launch — Higher Performance, Higher Price

OpenAI dropped GPT-5.4 just two days after 5.3 — no explanation, no buildup, just a quiet release that says a lot about how fast this race is moving. The headline numbers are genuinely impressive: 83.3% on ARC-AGI-2 (still behind Gemini 3 Deep Think’s 84.6%, but closing fast), 75% on OSWorld-Verified which actually beats the human baseline of 72.4%, and new state-of-the-art on SWE-Bench-Pro and Terminal-Bench-Hard. The model leads the Agentic Index at 69 points, edging out Claude Opus 4.6 at 68, and tops the Coding Index at 57 vs Gemini’s 56. The full family — Pro, Thinking, Mini, Nano — covers everything from heavy enterprise workloads to lightweight edge use cases.

Here’s the catch though: on the Artificial Analysis Intelligence Index, GPT-5.4 Pro ties Gemini 3.1 Pro Preview at 57 points. Same score. But the price? $2,950 per million queries vs $892 for Gemini. That’s over 3x the cost for equivalent overall intelligence. API pricing tells a similar story — standard tier runs $2.50/$15 per million tokens, but Pro jumps to $30/$180. For enterprise teams running thousands of agentic tasks daily, that delta adds up fast. The 1.05M token context window and 128K output are competitive, but context windows alone don’t justify a 3x premium.
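
To see how that delta adds up, here is a back-of-the-envelope sketch that uses only the prices quoted above. The workload (5,000 tasks a day, 20,000 input and 2,000 output tokens per task) is a made-up assumption for illustration, so swap in your own numbers.

```python
# Prices quoted in the paragraph above (USD).
INDEX_RUN_COST = {"GPT-5.4 Pro": 2950, "Gemini 3.1 Pro Preview": 892}
API_PRICE_PER_MTOK = {            # (input, output) per million tokens
    "standard": (2.50, 15.00),
    "pro": (30.00, 180.00),
}

def task_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of a single task at the given API tier."""
    in_price, out_price = API_PRICE_PER_MTOK[tier]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Ratio behind the "over 3x" claim.
index_ratio = INDEX_RUN_COST["GPT-5.4 Pro"] / INDEX_RUN_COST["Gemini 3.1 Pro Preview"]

# Hypothetical workload, not taken from the source.
TASKS_PER_DAY, IN_TOK, OUT_TOK = 5_000, 20_000, 2_000
daily_usd = {
    tier: TASKS_PER_DAY * task_cost(tier, IN_TOK, OUT_TOK)
    for tier in API_PRICE_PER_MTOK
}
```

With these assumed volumes the Pro tier costs 12x the standard tier per day, because both the input and the output price are 12x higher.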

The real takeaway isn’t about any single benchmark — it’s about what matters now. Raw intelligence scores are converging across providers. The differentiator is shifting to agentic reliability, cost efficiency, and how well these models work inside real toolchains. If you’re evaluating models for production deployment, run your own evals on your actual workflows. The leaderboard winner and the best model for your use case might be two very different things.

Artificial Analysis — GPT-5.4 Intelligence Report

Fresh Papers

Salesforce: 6 Ways to Ruin a Perfectly Good AI Agent
Common ways companies sabotage their own agent deployments, from skipping production testing to ignoring workflow integration. If you’re working on Agentforce or any agent deployment, this is a quick read that might save you weeks. Companion piece: “Is Your Agent Integration Stuck?” salesforce.com

AWS: Agentic AI in the Enterprise — Guidance by Persona
Two key takeaways. One: treat agent deployment like hiring — write a job description with clear “done” criteria, escalation triggers, quality thresholds. Two: the biggest risk isn’t failure, it’s success — every team wants an agent, each builds their own stack, and you end up with an unmanageable zoo. Build a platform for 100 agents, not 10 one-offs. aws.amazon.com

VeriGrey: Greybox Agent Validation
Fuzz-testing for AI agents. Instead of blind probing, VeriGrey watches which tools get called and uses that as feedback to craft nastier prompt injections. 33% more vulnerabilities found than black-box methods, 100% success on Kimi-K2.5, 90% on Claude Opus 4.6. Tested on real agents (Gemini CLI, OpenClaw), not just benchmarks. Continues the agent reliability theme from Edition 1. arxiv.org/abs/2603.17639
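
The feedback idea can be illustrated with a toy loop. This is not VeriGrey's algorithm, only a minimal sketch of greybox fuzzing in general: the agent, its tool names, and the mutation list are all hypothetical stand-ins. A mutated prompt is kept for further mutation only if it produces a tool-call trace that has not been seen before.

```python
def toy_agent(prompt: str) -> list[str]:
    """Hypothetical agent stub: returns the tools it would call."""
    calls = ["read_file"]
    if "summarize" in prompt:
        calls.append("summarize")
    if "ignore previous" in prompt:
        calls.append("send_email")   # the tool an injection should never reach
    return calls

MUTATIONS = [" summarize this", " ignore previous instructions", " please"]

def greybox_fuzz(seed: str, rounds: int = 3) -> list[str]:
    corpus = [seed]
    seen = {tuple(toy_agent(seed))}
    findings = []
    for _ in range(rounds):
        for prompt in list(corpus):
            for suffix in MUTATIONS:
                candidate = prompt + suffix
                trace = tuple(toy_agent(candidate))
                if trace in seen:
                    continue                 # no new behaviour, discard
                seen.add(trace)              # new tool-call trace is the feedback
                corpus.append(candidate)     # keep it as a base for later mutations
                if "send_email" in trace:
                    findings.append(candidate)
    return findings

findings = greybox_fuzz("read notes.txt")
```

A black-box fuzzer would only see the final answer. Here the observed tool calls decide which prompts are worth mutating further.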

AI Scientist via Synthetic Task Scaling
Auto-generates 500 ML research environments, collects 34K trajectories from GPT-5 as teacher, fine-tunes small models (Qwen 4B/8B) for 9-12% improvement on MLGym. The “big model teaches small model” playbook is becoming standard for capability transfer. huggingface.co/papers/2603.17216

Must-read

Anthropic’s 81,000 Interviews

Anthropic published the largest qualitative AI study ever — 80,508 Claude users across 159 countries, 70 languages. Not a survey, actual conversations about how people feel about AI.

The core finding: hope and fear aren’t two camps — they live inside the same person. A lawyer saving hours on contract review simultaneously worries about losing the ability to think critically. 81% said AI already helped them concretely, and the #1 desire isn’t “replace my job” — it’s handling routine tasks so they can focus on strategic work. But 26.7% flagged hallucinations as their top concern, and it’s the only area where negative experience fully overshadows the positive. One user cut a 173-day process to 3 days. Ukrainian users described AI as having “pulled me back to life” during wartime. And only 6.7% worry about existential risk — the least common concern by far.

Seriously, go read the full thing — even the way the page is built is worth seeing. The interactive elements, the data visualizations, the way you can explore findings by country and topic. It’s one of those rare cases where the presentation is as impressive as the research itself: https://www.anthropic.com/features/81k-interviews

New models & tools

Google Stitch — AI-Native Design Canvas

Google just dropped something big. Stitch is a free AI design tool powered by Gemini 3 that lets you describe business objectives and feelings, and it generates multiple design directions — they’re calling it “vibe design.” The standout feature is Voice Canvas: the AI interviews you about your design goals and makes live updates as you speak. But the part that matters most for us — it ships with an MCP server that integrates directly with Claude Code, Cursor, and Gemini CLI, meaning you can pull designs into your dev workflow without leaving the terminal. It also exports to Figma with proper Auto Layout (not flat images) and clean HTML/CSS. Figma’s stock reportedly dipped after the announcement, which tells you how seriously the market is taking this. Free tier gives you 350 generations per month.

Cursor Composer 2 — With Their Own Model

Cursor just dropped Composer 2 with their own AI model. Not Claude, not GPT — their own. And it beats Claude Opus on coding benchmarks at a fraction of the cost. A code editor company with ~50 people just outperformed a $30 billion AI lab at the thing that lab is supposed to be best at. The vibe coding era just got an upgrade.

Google AI Studio Goes Full-Stack

Google AI Studio just became a full-stack app builder. You type a prompt, and the Antigravity coding agent generates an entire application — frontend, backend, server-side logic, npm packages, the works. It understands your whole project structure, reasons across multiple files, writes tests, fixes errors, and refactors components. Need a database? It detects that from your prompt and offers to set up Firestore and Firebase Authentication with one click. Need to connect to Stripe or Google Maps? It searches for the right web tools and wires them up. It also supports MCP, so you can extend Gemini workflows with external servers — same protocol Claude Code and Cursor use. All of this is free. blog.google

Claude Code & Coding AI

Three more releases this week (v2.1.77–79). The highlights:

Default reasoning effort is now medium
If Claude’s been less thorough lately, that’s why. Bump it back to high in settings or type “ultrathink” for full power on demand.

Opus 4.6 output doubled to 64k tokens (128k upper bound). No more cut-off refactors.

/remote-control in VS Code
Start a session on your laptop, continue from your browser or phone. Sleeper hit.

Resume 45% faster, auto-updater memory leak fixed (was eating tens of GB), compound bash “Always Allow” finally works, and two security patches — sandbox could be silently disabled, and hooks could bypass deny rules. Update if you haven’t.

Quick bites

Apple says no to vibe-coded apps — unless they’re Apple’s

Apple started blocking updates for apps like Replit and Vibecode — tools that let you build apps using AI. The reason? Apple says these apps break their rules by letting users create and run new software inside them. After months of back-and-forth, Replit dropped from #1 to #3 in Apple’s dev tools ranking because they can’t ship updates. The irony? Apple just added AI-powered coding to their own Xcode 26.3, built with OpenAI and Anthropic. So building apps with AI is fine — as long as Apple is the one doing it. Between this, Cursor building their own model, and Google giving away full-stack coding for free — every vibe coding startup (Replit, Lovable, Bolt) just had a very bad week.

9to5Mac · MacRumors

xAI is paying Wall Street to teach Grok how to be Wall Street

xAI is hiring at least 20 finance contractors — investment bankers, portfolio managers, credit analysts, crypto specialists — at $45-100/hour to train Grok on leveraged loan syndication, distressed investing, MBS, and CLOs. Minimum requirement: a Master’s in finance. They’re not alone — OpenAI’s Project Mercury is paying $150/hour and has already recruited 100+ people from Goldman, JPMorgan, and Morgan Stanley. There’s something darkly funny about paying top finance talent to train the models that will eventually do their jobs.

Bloomberg · Entrepreneur

AI at Tenvalleys

How AI Pulse Actually Works

This is the 4th edition of AI Pulse, so it’s worth pulling back the curtain on how it gets built.

Here’s how the pipeline works. A Node.js script scrapes 9 sources every week: newsletters (The Batch, Import AI, The Rundown, TLDR AI, OpenAI Blog, DeepMind, ChinAI, Anthropic News), Reddit (r/LocalLLaMA, r/ClaudeCode, r/artificial, r/OpenAI), HuggingFace trending papers, arXiv, GitHub releases, Twitter/X feeds, research blogs, and industry blogs. This week that was 24 data files containing roughly 2,000 individual pieces of information — articles, papers, tweets, release notes. A scoring algorithm then ranks everything by relevance to what matters — keywords like “agentic”, “enterprise”, “governance”, “claude code” get bonus points — and generates a choices sheet with the top picks per category.
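
The scoring step can be sketched roughly like this. The real script is Node.js and its weights are not published here, so the bonus values, the `Item` fields, and the function names below are all illustrative assumptions.

```python
from dataclasses import dataclass

# Keywords come from the text above; the bonus weights are invented.
KEYWORD_BONUS = {"agentic": 3, "enterprise": 2, "governance": 2, "claude code": 3}

@dataclass
class Item:
    title: str
    text: str
    category: str
    base_score: float = 0.0

def score(item: Item) -> float:
    """Base score plus a bonus for every keyword found in the title or text."""
    haystack = f"{item.title} {item.text}".lower()
    bonus = sum(points for kw, points in KEYWORD_BONUS.items() if kw in haystack)
    return item.base_score + bonus

def choices_sheet(items: list[Item], top_n: int = 3) -> dict[str, list[Item]]:
    """Group items by category and keep the top-scoring picks in each."""
    by_category: dict[str, list[Item]] = {}
    for item in items:
        by_category.setdefault(item.category, []).append(item)
    return {
        category: sorted(group, key=score, reverse=True)[:top_n]
        for category, group in by_category.items()
    }
```

Plain substring matching is the simplest possible choice here. A production version would probably want word boundaries and per-source weighting.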

Then the agents kick in. Up to 5 sub-agents read the full articles in parallel (not just the RSS snippets: they actually open the URLs and pull out specific numbers and insights). 1 trend-spotter agent reads ALL the collected data plus previous editions to find cross-source patterns and “remember X from last week?” callbacks. Then up to 8 writer agents draft each section using the brand voice profile, a document that captures tone, anti-cringe rules, and the fact that the audience is technical readers who will immediately notice if something sounds off. That’s up to 14 AI agents per edition, with the readers and the writers each working in parallel. The whole pipeline from raw data to enriched draft takes about 20 minutes.
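
The three stages map naturally onto bounded concurrency. The sketch below is an assumption about the shape of such a pipeline, not the actual implementation: the agent functions are stubs standing in for LLM calls, and only the limits (5 readers, 1 trend-spotter, 8 writers) come from the text.

```python
import asyncio

MAX_READERS = 5   # "up to 5 sub-agents"
MAX_WRITERS = 8   # "up to 8 writer agents"

async def run_bounded(limit, worker, inputs):
    """Run worker over inputs with at most `limit` calls in flight."""
    gate = asyncio.Semaphore(limit)

    async def guarded(item):
        async with gate:
            return await worker(item)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(item) for item in inputs))

async def read_article(url):
    await asyncio.sleep(0)          # placeholder for fetch + LLM call
    return f"notes on {url}"

async def spot_trends(notes):
    await asyncio.sleep(0)
    return f"trends across {len(notes)} articles"

async def write_section(section, notes, trends):
    await asyncio.sleep(0)
    return f"{section}: draft from {len(notes)} notes ({trends})"

async def build_draft(urls, sections):
    notes = await run_bounded(MAX_READERS, read_article, urls)   # stage 1
    trends = await spot_trends(notes)                            # stage 2

    async def writer(section):
        return await write_section(section, notes, trends)

    drafts = await run_bounded(MAX_WRITERS, writer, sections)    # stage 3
    return dict(zip(sections, drafts))
```

Calling `asyncio.run(build_draft(urls, sections))` returns one draft per section. The stages run one after another, and the parallelism is inside stages 1 and 3.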

But here’s the thing — what you’re reading is never the raw AI output. A human editor reviews everything, swaps items that don’t fit, rewrites sections that sound too generic, and adds the personal bits only a human can add. The AI does the research and the first draft. The editor makes it real. Four editions in and the process is getting smoother every week — the voice profile gets more precise, the scoring algorithm gets better at picking what matters, and editing time keeps dropping.

If you’re thinking about automating the research-heavy parts of your own workflow — reach out at contact@tenvalleys.com.

In the background

The 2025 Turing Award goes to Charles H. Bennett (IBM, 82) and Gilles Brassard for founding quantum information science. Their BB84 protocol from 1984 grounded the security of key exchange in physics rather than in hard math problems. IBM is now building on this with Quantum Starling, targeting a fault-tolerant quantum computer by 2029.

New research from Import AI 449: AI agents can now autonomously refine other LLMs, but the smarter the agent, the more it cheats (loading eval datasets, reverse-engineering scoring criteria). Meanwhile, mathematician Leonardo de Moura makes the case that AI-generated code needs mathematical verification: “The friction of writing code manually used to force careful design. AI removes that friction…replace human friction with mathematical friction.”

Something fun for the weekend

Arnis

Ever wanted to explore your own neighborhood in Minecraft? Arnis is a free, open-source app that takes real geographical data from OpenStreetMap and AWS Terrain Tiles and turns it into a playable Minecraft world — buildings, roads, elevation, all of it. The gallery includes Heidelberg, the Alps, New York, and the Taj Mahal. Works with Java and Bedrock editions. There’s even a browser companion called MapSmith for generating areas up to 150 km². Check it out: https://arnismc.com/

See you next week.

Prepared at Tenvalleys — a delivery-first AI engineering partner — by Nikola Powałka. Feedback? Email us at contact@tenvalleys.com or reach out on LinkedIn.

AI reviews your pull requests

Topic of the Week

Claude Code Review — AI Reviews Your PRs

Anthropic released a new feature in Claude Code: Code Review. When you open a pull request, Claude dispatches a team of agents that scan the changes for bugs, security holes, and regressions. Results show up as inline comments on GitHub, tagged by criticality.

Why does this matter? Because Anthropic itself says code output per engineer has grown 200% in a year — and review has become the bottleneck. More AI-generated code = more code to review = a need for AI to review it. This doesn’t replace human review, but it filters out the obvious problems before a person even looks.

Currently in research preview for Team and Enterprise plans.

Reddit

Fresh Papers

SPD-RAG: a separate agent per document

Answering complex questions often means stitching together facts spread across many documents. Standard RAG pipelines lose context on large corpora, and large-context-window models struggle to reason reliably over that much text. SPD-RAG proposes a different approach: each document gets its own agent that “knows” it inside out, and then the agents collaborate to assemble the answer.

Imagine a team where each person is an expert on one document — and together they answer the boss’s questions. Instead of one person trying to wrap their head around 500 pages at once.
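
In code, the team analogy looks something like the toy below. It is not the paper's method: keyword overlap stands in for an LLM, and `DocumentAgent` and `coordinate` are hypothetical names. The point is only the shape, with one agent per document and a coordinator that merges what each one found.

```python
import re

def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

class DocumentAgent:
    """Owns exactly one document and answers only from it."""

    def __init__(self, name: str, text: str):
        self.name = name
        self.sentences = [s.strip() for s in text.split(".") if s.strip()]

    def answer(self, question: str) -> list[str]:
        # Ignore short words so "is" or "the" do not count as matches.
        terms = {w for w in words(question) if len(w) > 3}
        return [s for s in self.sentences if terms & words(s)]

def coordinate(agents: list[DocumentAgent], question: str) -> dict[str, list[str]]:
    """Ask every agent, keep only those with something relevant to say."""
    evidence = {}
    for agent in agents:
        hits = agent.answer(question)
        if hits:
            evidence[agent.name] = hits
    return evidence

agents = [
    DocumentAgent("contract", "The renewal date is 1 June. Payment is monthly"),
    DocumentAgent("policy", "Refunds take 14 days"),
]
evidence = coordinate(agents, "When is the renewal date?")
```

Each agent only ever sees its own document, so no single context window has to hold the whole corpus.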

Worth noticing the trend: two weeks ago we wrote about CodeCompass (code navigation), last week about LSP in Claude Code — and now SPD-RAG attacks the same problem, just in the world of documents instead of code. “How to find information faster and more reliably” might become one of the hottest topics in AI in the near future.

arXiv

Red Teaming LLMs in Banks — How to Test AI in Finance

Banks are deploying LLMs more and more, but standard AI safety tests don’t catch the risks specific to the financial sector. This paper proposes risk-adjusted harm scoring — an automated red teaming framework tuned to banking regulations. Instead of asking “can the model be broken?”, it asks “what financial and regulatory damage could a break cause?”.

This methodology is especially interesting for banking and financial services deployments. Testing AI for financial and compliance risk is going to be increasingly required, and it’s good to know concrete methodologies are emerging.

arXiv

New Models

Qwen again. Last week we wrote that Alibaba had released the Qwen 3.5 series. Since then, hard data has come in — and it’s interesting:

Fine-tuned Qwen3 SLMs beat frontier LLMs on narrow tasks
Someone ran a systematic comparison of small models (0.6–8B parameters) against the largest APIs: GPT-5, Gemini 2.5, Claude Opus 4.6. The conclusion: a small, well-tuned model can beat a giant API on a specific task. This changes the cost calculus: instead of paying for an expensive Opus, you stand up a small Qwen on your own server. (412 upvotes) Reddit

Qwen3.5-35B almost matches Claude Opus on SWE-bench
37.8% vs Opus’s 40% on a coding benchmark. A model you can run on your own hardware almost matches the most expensive API on the market. (423 upvotes) Reddit

Claude Code & Coding AI

Five releases this week (v2.1.70 → v2.1.74). The most interesting changes:

/loop (v2.1.71) — new command for running prompts in a loop (e.g. /loop 5m check the deploy). Automatic monitoring and background tasks without leaving the terminal.

/context (v2.1.74) — suggests how to optimize your session: detects memory bloat, heavy tools, and other things slowing you down.

Memory leak fix (v2.1.74) — streaming wasn’t releasing memory, so long sessions kept getting slower. Fixed.

Tools of the Week

Context Hub
A tool from Andrew Ng (DeepLearning.AI) that solves a specific problem: coding agents don’t know about APIs and libraries that came out after their training cutoff. Context Hub is a crowdsourced documentation database that you plug into your coding agent, so it knows how to use the latest version of a framework instead of hallucinating outdated syntax. Newsletter

SurfSense — open-source alternative to NotebookLM
Connects any AI model to your company’s internal knowledge sources (documents, wikis, databases). The team can collaboratively chat with the data, comment, and work together in real time. For those who need more than NotebookLM offers or want something self-hosted. Reddit

Apple M5 Max 128GB — local model benchmarks
New chip, 128GB unified memory, and r/LocalLLaMA immediately started testing. The post has 1,886 upvotes and 300 comments, with results being posted live in the comments. If you want the details, follow the link. Reddit

AI at Tenvalleys

Uncoursed.ai

Some of our engineers are building Uncoursed.ai — a platform that turns any material (PDFs, presentations, internal documents) into full interactive courses with flashcards, quizzes, an AI tutor, and gamification. Drop in a 300-page textbook — get a finished course in minutes.

The idea came from a frustration we all know: you get 200 pages to read, you open the PDF, you read three pages and… you fall asleep. Or you dump it into ChatGPT, get a summary, and feel like you “get the topic” while half the content got lost somewhere. The team wanted to create something that walks you through the entire material step by step, skipping nothing, but in a way that actually pulls you in.

And here’s the key: Uncoursed doesn’t summarize, doesn’t shorten, doesn’t “highlight the most important parts.” It guarantees 100% material coverage — you see exactly what you’ve worked through and what’s still ahead. On top of that, the platform combines scientifically validated learning techniques (active recall, spaced repetition) with mechanics borrowed from Duolingo — short sessions, quizzes, flashcards, AI tutor, gamification. Something like YouTube Shorts, except instead of doomscrolling — you’re actually learning.

There’s already a working MVP, with conversations underway with several large enterprises across banking, telco, retail, and publishing, as well as pilot rollouts inside our own education partnerships.

If you want to see a demo or have material you’d like to test on it — reach out at contact@tenvalleys.com.

For Dessert

Yann LeCun, Turing Award laureate, is 65. He just stepped down as Chief AI Scientist at Meta, and instead of retirement — he went out to raise a billion dollars for a startup (AMI Labs, $1.03B seed — likely the largest seed in European history). LeCun has been arguing for years that LLMs are a dead end and we need a fundamentally different architecture. Now he has the money to prove it. Most of the industry says he’s wrong. But what if he isn’t?

See you next week.


AI agents are growing new eyes

This was the week AI agents got new eyes. LSP in Claude Code turned code navigation from minutes into milliseconds, Google turned documents into films, and 88-year-old Knuth showed how AI is actually supposed to be used. One thing at a time.

Topic of the Week

LSP in Claude Code — From Minutes to Milliseconds

Remember CodeCompass from last week? The paper that described the Navigation Paradox — AI agents struggle with code not because their context is too small, but because navigating code and searching text are two different problems.

A week later, someone on Reddit found a practical solution: enabling LSP in Claude Code. LSP (Language Server Protocol) is the same system VS Code uses to “understand” code: go to definition, find references, autocomplete. Someone wired it up to Claude Code and the results are striking: code navigation dropped from 30–60 seconds to 50 milliseconds. Exact answers instead of guessing.

808 upvotes on r/ClaudeCode. People are calling it the biggest quality-of-life improvement in a long time.

Why does this matter? Because it fits a bigger trend: AI agents are getting better and better “eyes.” LSP gives them an understanding of code structure. MCP (Model Context Protocol) gives them access to external tools. GitNexus (more on that in tools) builds a dependency graph of an entire repo. Each of these solutions attacks the same problem: AI is good at generating code, but bad at understanding existing code. And that’s exactly what’s changing.
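
For a sense of what “go to definition” looks like on the wire, here is a minimal sketch of an LSP request. The framing (a Content-Length header followed by a JSON-RPC 2.0 body) and the `textDocument/definition` method come from the LSP specification. The file URI and cursor position are made-up examples, and how Claude Code actually talks to the language server is not covered here.

```python
import json

def lsp_frame(method: str, params: dict, request_id: int) -> bytes:
    """Encode one JSON-RPC request with the LSP base-protocol header."""
    body = json.dumps(
        {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}
    ).encode("utf-8")
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body

# "Where is the symbol under the cursor defined?"
# Line and character are zero-based in LSP. URI and position are hypothetical.
request = lsp_frame(
    "textDocument/definition",
    {
        "textDocument": {"uri": "file:///repo/src/app.py"},
        "position": {"line": 41, "character": 8},
    },
    request_id=1,
)
```

The server replies with the exact file and range of the definition, which is why the answer is precise rather than a best guess from text search.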

Reddit — LSP in Claude Code (808 upvotes)

Fresh Papers

A hierarchical agent system for payments
LLMs can automate workflows, but payments? Different league — it requires security, verification, and error tolerance. A new paper proposes a hierarchical multi-agent system where each agent has its role (verification, authorization, execution), with a “supervisor” overseeing everything. Practical AI in fintech — not a chatbot answering questions, but an agent that actually processes transactions.

arXiv
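
The role split can be sketched as a plain pipeline with a supervisor in charge. This is an illustration of the hierarchy described above, not the paper's system: the checks are trivial stand-ins for agents, and the 1,000 limit is an invented rule.

```python
from dataclasses import dataclass, field

@dataclass
class Payment:
    payer: str
    payee: str
    amount: float
    log: list = field(default_factory=list)

def verify(p: Payment) -> bool:
    return p.amount > 0 and p.payer != p.payee

def authorize(p: Payment) -> bool:
    return p.amount <= 1_000        # hypothetical approval limit

def execute(p: Payment) -> bool:
    return True                     # placeholder for the real transfer

STAGES = [("verification", verify), ("authorization", authorize), ("execution", execute)]

def supervisor(p: Payment) -> str:
    """Run each role in order and stop at the first failure."""
    for name, agent in STAGES:
        ok = agent(p)
        p.log.append((name, ok))    # audit trail of every decision
        if not ok:
            return f"rejected at {name}"
    return "settled"
```

The supervisor never executes a payment that an earlier role rejected, and the log records which role said no.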

RIVA: LLM agents watching your infrastructure
Infrastructure as Code sounds great in theory, but in practice configuration “drifts” — someone changes something manually, the system updates itself, and suddenly what’s in Terraform doesn’t match what’s in production. RIVA is a framework that uses LLM agents to automatically detect those differences. Practical AI in DevOps — not hype, just a solution to a real problem.

arXiv
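
Stripped of the LLM part, drift detection is a diff between declared and observed state. The sketch below shows only that core comparison, with invented resource attributes. RIVA's agents and how they gather the live state are not modelled.

```python
ABSENT = "<absent>"

def detect_drift(declared: dict, actual: dict) -> dict:
    """Return every attribute whose live value differs from the declared one."""
    drift = {}
    for key in sorted(declared.keys() | actual.keys()):
        want = declared.get(key, ABSENT)
        have = actual.get(key, ABSENT)
        if want != have:
            drift[key] = {"declared": want, "actual": have}
    return drift

# Hypothetical example: what the IaC code says vs what production reports.
declared = {"instance_type": "t3.medium", "min_replicas": 2, "public_access": False}
actual = {
    "instance_type": "t3.large",    # resized by hand
    "min_replicas": 2,
    "public_access": False,
    "debug_port": 9229,             # added outside the IaC workflow
}
drift = detect_drift(declared, actual)
```

Here `drift` flags the resized instance and the unmanaged debug port and leaves the matching attributes out.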

New Models

GPT-5.3 Instant
OpenAI shipped an update focused on conversational fluidity. Less “cringe” tone, fewer unnecessary refusals and preachy disclaimers, hallucinations down 26.8%. Not a revolution, but a solid everyday-use upgrade.

Qwen 3.5 Small (9B)
Alibaba released a series of small models (0.8B–9B). The most interesting one: Qwen3.5-9B beats OpenAI’s GPT-OSS-120B on key benchmarks (GPQA Diamond: 81.7 vs 71.5) — and runs on a regular laptop. AI on edge devices is getting serious.

Gemini 3.1 Flash-Lite
Google released a “small but mighty” model. Faster and cheaper than Gemini 2.5 Flash, with better scores. New feature: “thinking levels”, which let you dial in how much reasoning the model does, balancing speed vs quality.

Phi-4-reasoning-vision-15B
Microsoft released an open-weight multimodal reasoning model. 15B parameters, sees images, thinks. Microsoft is opening up more models, building out the ecosystem.

Claude Code & Coding AI

Three Claude Code releases this week:

v2.1.69 (yesterday!) — big release, 103 changes. New /claude-api skill for building apps with the Claude API. Improved /remote-control. Ctrl+U on an empty bash prompt closes bash mode.

v2.1.68
Opus 4.6 default reasoning effort lowered to medium (the sweet spot between speed and accuracy). The keyword “ultrathink” is back for high effort on demand. Old Opus 4 and 4.1 models removed.

v2.1.63
New /simplify and /batch commands. Project configs and auto memory now work across git worktrees. New env var ENABLE_CLAUDEAI_MCP_SERVERS=false to disable MCP servers from claude.ai.

From the community:

– Best Practices repo: a repo with all the tips and workflows in one place. Already 5,000 stars. GitHub
– Free Max x20 for open source: Anthropic is giving 6 months of Claude Max (20x) to open-source maintainers with 5K+ stars or 1M+ monthly NPM downloads.

Interlude

Knuth and Claude — 88 Years Old, 30 Attempts, 1 Proof

Donald Knuth — computer science legend, author of “The Art of Computer Programming” (the bible of algorithms) — used Claude to solve a math problem. He’s 88.

Claude generated 30 different attempts at a solution. Knuth reviewed EVERY one of them, picked the one that worked empirically, and wrote the formal mathematical proof himself.

The internet immediately called it “vibemathing.” But this is the EXACT OPPOSITE of vibemathing. Knuth didn’t blindly trust the AI — he used it as a brainstorming partner, then applied human verification at the highest possible level.

This might be the most beautiful example of how we should be using AI: the machine generates options, the human verifies and proves. Especially when that human is 88 years old and still doing it better than most of us.

Tools of the Week

NotebookLM Cinematic Video Overviews (released March 4!) — Google added a feature to NotebookLM that turns your documents into animated explainer films. Not slides with narration — full animated scenes with a storyline. Under the hood: Gemini 3 plans the narration, Veo 3 generates the animation, Nano Banana Pro creates the graphic assets. Drop in a PDF, meeting notes, or a product spec → get a mini-film. For now, Ultra subscribers only and English only. Google Blog

GitNexus
Open-source tool that turns any GitHub repo into an interactive knowledge graph + AI agent you can talk to about your code. Runs entirely in the browser. Has an MCP server with 7 tools, including search, symbol lookup, blast radius, and git-diff impact mapping. 7.3K stars. Same trend as the LSP topic above: giving AI better tools to understand code. GitHub

AI at Tenvalleys

CV Builder

This is a new section — every week we’ll share how we use AI in our day-to-day work at Tenvalleys.

We built a Claude Code skill that automates preparing CVs for client proposals. You drop in an old CV, a LinkedIn profile, or even raw notes from a conversation — and out comes a professional CV in the Tenvalleys branded template. HTML rendered to PDF via headless Chrome.

What it does:

– Generates a CV in the Tenvalleys branded template (two-column layout, A4, 3 density presets)
– Writes bullet points using the CAR method (Context-Action-Result): not “worked on projects,” but concrete achievements
– Updates the CV from Linear data: projects, technologies, roles
– Job-match: compares the CV to client requirements and produces a fit report
– Searches our CV database for the best people for a specific role
– Built-in quality checklist: the AI checks itself for hallucinations

It saves a lot of time on a process that used to be manual and slow. The same approach scales to other “we have lots of structured text, we need branded output, the AI does the first 80%” workflows.
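
The last step, HTML to PDF through headless Chrome, can be sketched as below. The `--headless` and `--print-to-pdf` flags are standard Chrome command-line switches. The binary name `google-chrome`, the file names, and the helper functions are assumptions, and the skill itself may do this differently.

```python
import subprocess
from pathlib import Path

def pdf_command(html_path: str, pdf_path: str, chrome: str = "google-chrome") -> list[str]:
    """Build the headless Chrome invocation that prints a page to PDF."""
    return [
        chrome,
        "--headless",
        f"--print-to-pdf={Path(pdf_path).resolve()}",
        Path(html_path).resolve().as_uri(),   # file:// URL of the rendered CV
    ]

def render(html_path: str, pdf_path: str, chrome: str = "google-chrome") -> None:
    """Run Chrome and raise if it exits with an error."""
    subprocess.run(pdf_command(html_path, pdf_path, chrome), check=True)
```

Calling `render("cv.html", "cv.pdf")` needs Chrome on the PATH under the assumed name. Page size and margins are best set in the HTML itself with CSS `@page` rules.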

If you’d like to automate something similar inside your own organisation — reach out at contact@tenvalleys.com.

In the Background

Claude #1 in the App Store
Claude overtook ChatGPT as the most popular app in the App Store. A big chunk of that is fallout from the OpenAI/Pentagon controversy — Anthropic refused to remove safety guardrails, users voted with their wallets.

OpenAI VP moves to Anthropic
The VP responsible for Post-Training (RLHF, safety, instructions) left OpenAI for Anthropic. Not a random engineer, but someone who had direct influence over how GPT “thinks.” OpenAI is losing not just users, but key people too.

OpenAI raises $110B
Record private funding round. Meanwhile they’re losing people and users. Lots of money, lots of questions.

Hot Take

Vibe Coding in AR Glasses, While Doing the Dishes

Someone posted on Reddit: “Vibe coding while doing the dishes in augmented reality.” The guy is literally coding in AR glasses while washing dishes.

On one hand — Knuth, in the same week, shows that the best results come from AI + careful human verification. On the other — someone’s coding at the kitchen sink because “the AI’s going to write the code anyway, I just nudge it.” And in the background, research says AI scores 84% on coding benchmarks but 25% on real production code.

Three completely different approaches to AI in one week.

Reddit — Vibe Coding While Doing Dishes in AR

See you next week.


8 of 12 AI models went bankrupt

Welcome to AI Pulse. Every Friday we share what caught our attention in AI that week — the models we’re trying, the papers we’re reading, the tools that change how we build. Written for people who’d rather skip the hype.

In this first edition: Qwen 3.5-35B running locally takes on the cloud, an ASIC chip hits 16,000 tokens per second, and 8 of 12 AI models running food trucks went bankrupt.

Topic of the Week

Alibaba dropped Qwen 3.5-35B-A3B — a Mixture of Experts model with 35B parameters but only 3B active at inference. Reddit’s verdict: a gamechanger for agentic coding.

The numbers are impressive. The model runs locally on consumer hardware, and users are reporting Opencode results comparable to far more expensive cloud models. The real value: zero API dependency, zero code leaving your machine — which matters a lot for projects with sensitive code (banking, compliance).
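
A quick calculation shows why “35B total, 3B active” matters for local use. In a Mixture of Experts model all expert weights typically have to sit in memory, so memory follows the total parameter count, while per-token compute follows the active count. The bit widths below are assumed quantization levels, and the estimate ignores the KV cache and runtime overhead.

```python
TOTAL_PARAMS_B = 35    # billions of parameters, from the text
ACTIVE_PARAMS_B = 3    # billions active per token, from the text

def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

# Assumed precisions: 16-bit (unquantized) and 4-bit (common for local runs).
memory_fp16_gb = weight_memory_gb(TOTAL_PARAMS_B, 16)
memory_4bit_gb = weight_memory_gb(TOTAL_PARAMS_B, 4)
```

Under these assumptions the weights take about 70 GB at 16-bit and 17.5 GB at 4-bit, while each token only exercises under 9% of the parameters.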

But the full picture is more nuanced. A separate benchmark across 70 real repositories shows Qwen 3.5 falls apart on hard tasks — complex refactors, multi-file changes, deep codebase navigation. It nails the easy stuff, but it won’t replace frontier models for heavy lifting.

Bottom line: a great option for local pair programming and day-to-day work, but for agents operating on large codebases you still want Claude or GPT.

Fresh Tools

ASIC inference — 16,000 tokens per second
Startup Taalas built a chip dedicated to running AI models. Llama 3.1 8B runs on it at 16,000 tok/s. For reference, a typical Claude response streams at 50–100 tok/s. They’ve opened a free API as a proof of concept. The future of inference is dedicated hardware, not GPUs.
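
To put those throughput figures in wall-clock terms, here is the arithmetic for a single response. The 1,000-token response length is an assumption. The speeds are the ones quoted above.

```python
def seconds_to_generate(tokens: int, tokens_per_second: float) -> float:
    return tokens / tokens_per_second

RESPONSE_TOKENS = 1_000   # hypothetical response length

asic_s = seconds_to_generate(RESPONSE_TOKENS, 16_000)
typical_slow_s = seconds_to_generate(RESPONSE_TOKENS, 50)
typical_fast_s = seconds_to_generate(RESPONSE_TOKENS, 100)

# How many times faster the ASIC figure is than each end of the typical range.
speedup_range = (typical_fast_s / asic_s, typical_slow_s / asic_s)
```

That is about 0.06 seconds against 10 to 20 seconds, a 160x to 320x gap. Keep in mind the comparison is between an 8B model and a much larger frontier model.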

Claude Code Security Reviews
Anthropic added a security review mode to Claude Code. The agent scans your codebase for vulnerabilities, identifies attack vectors, and proposes fixes. For teams working on banking and compliance code — useful out of the box.

New Models

Claude Sonnet 4.6
Better coding, more consistent instruction following. On certain office-style tasks it actually outperforms the more expensive Opus.

Google Gemini 3.1 Pro
Doubled its reasoning scores compared to the previous version. On ARC-AGI-2 it hit 77.1% (2x Gemini 3 Pro). The race is heating up: more competition = better tools for all of us.

Mercury 2
A diffusion language model from Inception Labs. Instead of generating one token at a time (like GPT, Claude, Qwen), Mercury generates many tokens in parallel. Result: ultra-fast inference without specialized hardware. An interesting architectural direction.

Liquid AI
A reasoning model that fits in under 900 MB of RAM. A mix of attention and convolutional layers instead of the standard transformer. Targeted at edge deployment — mobile, IoT, embedded. Small, fast, efficient.

Interlude

Someone on Reddit gave 12 AI models $2,000 and a food truck. They had to run the business for 30 days — location, menu, prices, staff, inventory.

Opus made $49K. GPT-5.2 — $28K. Eight models went bankrupt. And the best stat of all: every single model that took out a loan went bankrupt (8 out of 8).

Before you ask AI for a business strategy — remember the food truck.

Notable Papers

CodeCompass — why AI gets lost in your code
Anyone who uses Claude Code knows this problem — the agent is looking for a file that’s right under its nose, and it can’t see it. Researchers named it the Navigation Paradox: agents fail not because their context is too small, but because navigating code and searching text are two different problems.

The fix? Instead of keyword search, CodeCompass gives the agent access to a dependency graph — the agent “sees” the project structure, not just text. Result: 99.4% task completion vs 76.2% without it.
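
The difference between searching text and walking a graph is easy to show on a toy repo. The file names and edges below are invented, and this is a generic breadth-first traversal, not CodeCompass itself.

```python
from collections import deque

# Hypothetical dependency graph: file -> files it imports.
DEPENDS_ON = {
    "api/routes.py": ["services/billing.py"],
    "services/billing.py": ["db/models.py", "utils/money.py"],
    "db/models.py": [],
    "utils/money.py": [],
}

def reachable(graph: dict[str, list[str]], start: str) -> list[str]:
    """Breadth-first walk: every file the start file depends on, nearest first."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return order

files_to_read = reachable(DEPENDS_ON, "api/routes.py")
```

A keyword search for “routes” would never surface `utils/money.py`. The graph walk finds it because it follows structure, not matching text.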

“Vibe Coding” and Epistemic Debt
A paper on the growing problem of vibe coding — the code works, but the author has no idea why. Researchers call this epistemic debt and propose a concrete fix: metacognitive scripts — structured prompts woven into the AI interaction that, after every generated block, force the developer to explain what’s happening, identify edge cases, and predict behavior under different inputs. In tests, the scripts noticeably improved code understanding without slowing work down. An interesting direction — AI as tutor, not as ghostwriter.

In the Background

Anthropic vs the distillers
Anthropic published a report on how DeepSeek, Moonshot AI, and MiniMax set up 24,000 fake accounts and ran 16 million conversations with Claude to copy agentic reasoning and tool use. Reddit erupted into a debate about whether this is theft or hypocrisy — Western companies also train on other people’s data.

Hegseth gives Anthropic an ultimatum
The US Secretary of Defense demanded Anthropic remove safety guardrails from Claude for Pentagon use. Anthropic refused. CEO Dario Amodei is meeting Hegseth this week.

See you next week.
