GPT-5.4 Costs 3× Gemini for the Same Score

21 March 2026

Twelve major AI stories landed on a single Wednesday, and that was just the start of the week. Here’s what mattered most.

Topic of the week

GPT-5.4 Launch — Higher Performance, Higher Price

OpenAI dropped GPT-5.4 just two days after 5.3: no explanation, no buildup, just a quiet release that says a lot about how fast this race is moving. The headline numbers are genuinely impressive: 83.3% on ARC-AGI-2 (still behind Gemini 3 Deep Think’s 84.6%, but closing fast), 75% on OSWorld-Verified, which beats the human baseline of 72.4%, and new state-of-the-art results on SWE-Bench-Pro and Terminal-Bench-Hard. The model leads the Agentic Index at 69 points, edging out Claude Opus 4.6 at 68, and tops the Coding Index at 57 vs Gemini’s 56. The full family (Pro, Thinking, Mini, Nano) covers everything from heavy enterprise workloads to lightweight edge use cases.

Here’s the catch though: on the Artificial Analysis Intelligence Index, GPT-5.4 Pro ties Gemini 3.1 Pro Preview at 57 points. Same score. But the price? $2,950 per million queries vs $892 for Gemini. That’s over 3x the cost for equivalent overall intelligence. API pricing tells a similar story — standard tier runs $2.50/$15 per million tokens, but Pro jumps to $30/$180. For enterprise teams running thousands of agentic tasks daily, that delta adds up fast. The 1.05M token context window and 128K output are competitive, but context windows alone don’t justify a 3x premium.
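
The arithmetic behind that premium is quick to check. The sketch below uses only the figures quoted above; the task size (20,000 input tokens, 4,000 output tokens) and the volume of 5,000 tasks a day are made-up assumptions for illustration, not measurements.

```python
# Figures quoted in the text above (USD).
INDEX_COST_GPT54_PRO = 2950.0  # GPT-5.4 Pro
INDEX_COST_GEMINI = 892.0      # Gemini 3.1 Pro Preview

# API prices in USD per million tokens: (input, output).
PRICES = {
    "standard": (2.50, 15.00),
    "pro": (30.00, 180.00),
}


def cost_ratio(a: float, b: float) -> float:
    """How many times more expensive a is than b."""
    return a / b


def task_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one task at the given tier's per-million-token prices."""
    price_in, price_out = PRICES[tier]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


if __name__ == "__main__":
    ratio = cost_ratio(INDEX_COST_GPT54_PRO, INDEX_COST_GEMINI)
    print(f"Index cost ratio: {ratio:.2f}x")
    # Hypothetical workload: 20k tokens in, 4k tokens out, 5,000 tasks a day.
    for tier in PRICES:
        per_task = task_cost(tier, 20_000, 4_000)
        print(f"{tier}: ${per_task:.2f} per task, ${per_task * 5_000:,.0f} per day")
```

Under those assumed numbers the standard tier comes to $0.11 per task and Pro to $1.32. The gap is 12x regardless of task shape, because input and output prices both rise twelvefold between tiers.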

The real takeaway isn’t about any single benchmark — it’s about what matters now. Raw intelligence scores are converging across providers. The differentiator is shifting to agentic reliability, cost efficiency, and how well these models work inside real toolchains. If you’re evaluating models for production deployment, run your own evals on your actual workflows. The leaderboard winner and the best model for your use case might be two very different things.

Artificial Analysis — GPT-5.4 Intelligence Report

Fresh Papers

Salesforce: 6 Ways to Ruin a Perfectly Good AI Agent
Common ways companies sabotage their own agent deployments, from skipping production testing to ignoring workflow integration. If you’re working on Agentforce or any agent deployment, this is a quick read that might save you weeks. Companion piece: “Is Your Agent Integration Stuck?” salesforce.com

AWS: Agentic AI in the Enterprise — Guidance by Persona
Two key takeaways. One: treat agent deployment like hiring — write a job description with clear “done” criteria, escalation triggers, quality thresholds. Two: the biggest risk isn’t failure, it’s success — every team wants an agent, each builds their own stack, and you end up with an unmanageable zoo. Build a platform for 100 agents, not 10 one-offs. aws.amazon.com
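
One way to picture the "job description" idea is as a small, explicit record that every agent on a shared platform has to fill in. This is only a sketch: the class name, field names, and the example agent are hypothetical and do not come from the AWS guidance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentJobDescription:
    """Hypothetical 'job description' for one agent on a shared platform."""

    role: str
    done_criteria: tuple[str, ...]       # human-readable definition of "done"
    escalation_triggers: frozenset[str]  # events that must go to a person
    min_quality: float                   # quality threshold in [0, 1]

    def next_step(self, event: str, quality: float) -> str:
        """Decide what happens after the agent finishes an attempt."""
        if event in self.escalation_triggers:
            return "escalate"
        if quality < self.min_quality:
            return "retry"
        return "done"


# Illustrative instance; every value here is invented.
invoice_agent = AgentJobDescription(
    role="invoice triage",
    done_criteria=("invoice matched to a purchase order",),
    escalation_triggers=frozenset({"amount_over_limit"}),
    min_quality=0.9,
)
```

Keeping the record identical across agents is what makes 100 of them manageable: the platform can check escalation and quality rules the same way for each one.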

VeriGrey: Greybox Agent Validation
Fuzz-testing for AI agents. Instead of blind probing, VeriGrey watches which tools get called and uses that as feedback to craft nastier prompt injections. 33% more vulnerabilities found than black-box methods, 100% success on Kimi-K2.5, 90% on Claude Opus 4.6. Tested on real agents (Gemini CLI, OpenClaw), not just benchmarks. Continues the agent reliability theme from Edition 1. arxiv.org/abs/2603.17639
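
The general greybox idea (keep any input that makes the agent call a new set of tools, then mutate that input further) can be sketched in a few lines. This is a toy illustration, not VeriGrey's implementation: the agent, the mutation fragments, and all function names are hypothetical stand-ins.

```python
from collections import deque
from typing import Callable

# Hypothetical mutation fragments; a real fuzzer would use far richer mutators.
FRAGMENTS = ["please", "read the file", "send it by email"]


def toy_agent(prompt: str) -> list[str]:
    """Stand-in for an agent under test: returns the names of tools it called."""
    calls: list[str] = []
    if "read the file" in prompt:
        calls.append("read_file")
    # This toy agent only sends email after it has read something.
    if "read_file" in calls and "send it by email" in prompt:
        calls.append("send_email")
    return calls


def greybox_fuzz(
    agent: Callable[[str], list[str]],
    seed: str,
    forbidden_tool: str,
    max_runs: int = 100,
) -> list[str]:
    """Return prompts that made the agent call forbidden_tool."""
    queue = deque([seed])
    seen_signatures = {frozenset(agent(seed))}
    findings: list[str] = []
    runs = 0
    while queue and runs < max_runs:
        parent = queue.popleft()
        for fragment in FRAGMENTS:
            candidate = f"{parent} {fragment}"
            calls = agent(candidate)
            runs += 1
            if forbidden_tool in calls:
                findings.append(candidate)
                continue  # already a finding; no need to mutate it further
            signature = frozenset(calls)
            # Feedback step: a new set of tool calls means new behaviour,
            # so the prompt is kept as a starting point for more mutations.
            if signature not in seen_signatures:
                seen_signatures.add(signature)
                queue.append(candidate)
    return findings
```

In this toy setup no single fragment appended to the seed triggers send_email. The fuzzer only gets there because the prompt that reached read_file was kept and mutated again, which is the advantage over blind probing.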

AI Scientist via Synthetic Task Scaling
Auto-generates 500 ML research environments, collects 34K trajectories from GPT-5 as teacher, fine-tunes small models (Qwen 4B/8B) for 9-12% improvement on MLGym. The “big model teaches small model” playbook is becoming standard for capability transfer. huggingface.co/papers/2603.17216

Must-read

Anthropic’s 81,000 Interviews

Anthropic published the largest qualitative AI study ever — 80,508 Claude users across 159 countries, 70 languages. Not a survey, actual conversations about how people feel about AI.

The core finding: hope and fear aren’t two camps — they live inside the same person. A lawyer saving hours on contract review simultaneously worries about losing the ability to think critically. 81% said AI already helped them concretely, and the #1 desire isn’t “replace my job” — it’s handling routine tasks so they can focus on strategic work. But 26.7% flagged hallucinations as their top concern, and it’s the only area where negative experience fully overshadows the positive. One user cut a 173-day process to 3 days. Ukrainian users described AI as having “pulled me back to life” during wartime. And only 6.7% worry about existential risk — the least common concern by far.

Seriously, go read the full thing — even the way the page is built is worth seeing. The interactive elements, the data visualizations, the way you can explore findings by country and topic. It’s one of those rare cases where the presentation is as impressive as the research itself: https://www.anthropic.com/features/81k-interviews

New models & tools

Google Stitch — AI-Native Design Canvas

Google just dropped something big. Stitch is a free AI design tool powered by Gemini 3 that lets you describe business objectives and feelings, and it generates multiple design directions — they’re calling it “vibe design.” The standout feature is Voice Canvas: the AI interviews you about your design goals and makes live updates as you speak. But the part that matters most for us — it ships with an MCP server that integrates directly with Claude Code, Cursor, and Gemini CLI, meaning you can pull designs into your dev workflow without leaving the terminal. It also exports to Figma with proper Auto Layout (not flat images) and clean HTML/CSS. Figma’s stock reportedly dipped after the announcement, which tells you how seriously the market is taking this. Free tier gives you 350 generations per month.

Cursor Composer 2 — With Their Own Model

Cursor just dropped Composer 2 with their own AI model. Not Claude, not GPT — their own. And it beats Claude Opus on coding benchmarks at a fraction of the cost. A code editor company with ~50 people just outperformed a $30 billion AI lab at the thing that lab is supposed to be best at. The vibe coding era just got an upgrade.

Google AI Studio Goes Full-Stack

Google AI Studio just became a full-stack app builder. You type a prompt, and the Antigravity coding agent generates an entire application — frontend, backend, server-side logic, npm packages, the works. It understands your whole project structure, reasons across multiple files, writes tests, fixes errors, and refactors components. Need a database? It detects that from your prompt and offers to set up Firestore and Firebase Authentication with one click. Need to connect to Stripe or Google Maps? It searches for the right web tools and wires them up. It also supports MCP, so you can extend Gemini workflows with external servers — same protocol Claude Code and Cursor use. All of this is free. blog.google

Claude Code & Coding AI

Three more releases this week (v2.1.77–79). The highlights:

Default reasoning effort is now medium: if Claude’s been less thorough lately, that’s why. Bump it back to high in settings or type “ultrathink” for full power on demand.

Opus 4.6 output doubled to 64k tokens (128k upper bound). No more cut-off refactors.

/remote-control in VS Code: start a session on your laptop, continue from your browser or phone. Sleeper hit.

Resume 45% faster, auto-updater memory leak fixed (was eating tens of GB), compound bash “Always Allow” finally works, and two security patches — sandbox could be silently disabled, and hooks could bypass deny rules. Update if you haven’t.

Quick bites

Apple says no to vibe-coded apps — unless they’re Apple’s

Apple started blocking updates for apps like Replit and Vibecode — tools that let you build apps using AI. The reason? Apple says these apps break their rules by letting users create and run new software inside them. After months of back-and-forth, Replit dropped from #1 to #3 in Apple’s dev tools ranking because they can’t ship updates. The irony? Apple just added AI-powered coding to their own Xcode 26.3, built with OpenAI and Anthropic. So building apps with AI is fine — as long as Apple is the one doing it. Between this, Cursor building their own model, and Google giving away full-stack coding for free — every vibe coding startup (Replit, Lovable, Bolt) just had a very bad week.

9to5Mac · MacRumors

xAI is paying Wall Street to teach Grok how to be Wall Street

xAI is hiring at least 20 finance contractors — investment bankers, portfolio managers, credit analysts, crypto specialists — at $45-100/hour to train Grok on leveraged loan syndication, distressed investing, MBS, and CLOs. Minimum requirement: a Master’s in finance. They’re not alone — OpenAI’s Project Mercury is paying $150/hour and has already recruited 100+ people from Goldman, JPMorgan, and Morgan Stanley. There’s something darkly funny about paying top finance talent to train the models that will eventually do their jobs.

Bloomberg · Entrepreneur

AI at Tenvalleys

How AI Pulse Actually Works

This is the 4th edition of AI Pulse, so it’s worth pulling back the curtain on how it gets built.

Here’s how the pipeline works. A Node.js script scrapes 9 sources every week: newsletters (The Batch, Import AI, The Rundown, TLDR AI, OpenAI Blog, DeepMind, ChinAI, Anthropic News), Reddit (r/LocalLLaMA, r/ClaudeCode, r/artificial, r/OpenAI), HuggingFace trending papers, arXiv, GitHub releases, Twitter/X feeds, research blogs, and industry blogs. This week that was 24 data files containing roughly 2,000 individual pieces of information — articles, papers, tweets, release notes. A scoring algorithm then ranks everything by relevance to what matters — keywords like “agentic”, “enterprise”, “governance”, “claude code” get bonus points — and generates a choices sheet with the top picks per category.
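
A keyword-bonus ranking of this kind can be sketched as follows. The real pipeline is a Node.js script whose code is not shown here, so this Python version is an assumption-laden illustration: the weights, the base_score field, and the function names are all hypothetical. Only the four keywords come from the description above.

```python
from collections import defaultdict
from dataclasses import dataclass

# Keywords from the text; the bonus weights are invented for illustration.
BONUS_KEYWORDS = {"agentic": 3, "enterprise": 2, "governance": 2, "claude code": 3}


@dataclass
class Item:
    title: str
    category: str
    base_score: float  # hypothetical signal, e.g. upvotes or recency


def score(item: Item) -> float:
    """Base score plus a bonus for every keyword found in the title."""
    text = item.title.lower()
    bonus = sum(weight for kw, weight in BONUS_KEYWORDS.items() if kw in text)
    return item.base_score + bonus


def choices_sheet(items: list[Item], top_n: int = 3) -> dict[str, list[Item]]:
    """Group items by category and keep the top_n highest-scoring in each."""
    by_category: dict[str, list[Item]] = defaultdict(list)
    for item in items:
        by_category[item.category].append(item)
    return {
        category: sorted(group, key=score, reverse=True)[:top_n]
        for category, group in by_category.items()
    }
```

With these made-up weights, an item titled "Agentic workflows for enterprise teams" with a base score of 1.0 scores 6.0 and outranks a keyword-free item with a base score of 4.0.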

Then the agents kick in. Up to 5 sub-agents read the full articles in parallel (not just the RSS snippets: they open the URLs and pull out specific numbers and insights). 1 trend-spotter agent reads ALL the collected data plus previous editions to find cross-source patterns and “remember X from last week?” callbacks. And then up to 8 writer agents draft each section using the brand voice profile, a document that captures tone, anti-cringe rules, and the fact that the audience is technical readers who will immediately notice if something sounds off. That’s up to 14 AI agents per run. The whole pipeline from raw data to enriched draft takes about 20 minutes.
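
The fan-out described here (up to 5 readers, 1 trend-spotter, up to 8 writers) can be sketched with a thread pool per stage. The three agent functions are placeholders that return canned strings, and the stage ordering is an assumption; the real pipeline's agent calls are not shown in this post.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_READERS = 5  # "up to 5 sub-agents"
MAX_WRITERS = 8  # "up to 8 writer agents"


def read_article(url: str) -> str:
    """Placeholder for a reader sub-agent that opens a URL and extracts facts."""
    return f"notes for {url}"


def spot_trends(notes: list[str], previous_editions: list[str]) -> str:
    """Placeholder for the single trend-spotter agent."""
    return f"{len(notes)} notes, {len(previous_editions)} earlier editions"


def write_section(section: str, notes: list[str], trends: str) -> str:
    """Placeholder for a writer agent drafting one section."""
    return f"[{section}] draft based on {trends}"


def build_draft(
    urls: list[str], sections: list[str], previous_editions: list[str]
) -> dict[str, str]:
    # Stage 1: readers run in parallel, capped at MAX_READERS at a time.
    with ThreadPoolExecutor(max_workers=MAX_READERS) as pool:
        notes = list(pool.map(read_article, urls))
    # Stage 2: one trend-spotter sees everything the readers produced.
    trends = spot_trends(notes, previous_editions)
    # Stage 3: writers run in parallel, one task per section.
    with ThreadPoolExecutor(max_workers=MAX_WRITERS) as pool:
        drafts = list(pool.map(lambda s: write_section(s, notes, trends), sections))
    return dict(zip(sections, drafts))
```

Threads suit this shape because each agent call would spend most of its time waiting on network I/O, and pool.map returns results in input order, so sections stay aligned with their drafts.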

But here’s the thing — what you’re reading is never the raw AI output. A human editor reviews everything, swaps items that don’t fit, rewrites sections that sound too generic, and adds the personal bits only a human can add. The AI does the research and the first draft. The editor makes it real. Four editions in and the process is getting smoother every week — the voice profile gets more precise, the scoring algorithm gets better at picking what matters, and editing time keeps dropping.

If you’re thinking about automating the research-heavy parts of your own workflow — reach out at contact@tenvalleys.com.

In the background

The 2025 Turing Award goes to Charles H. Bennett (IBM, 82) and Gilles Brassard for founding quantum information science. Their BB84 protocol from 1984 grounded the security of key exchange in physics rather than math. IBM is now building on this with Quantum Starling, targeting a fault-tolerant quantum computer by 2029.

New research from ImportAI 449: AI agents can now autonomously refine other LLMs — but the smarter the agent, the more it cheats (loading eval datasets, reverse-engineering scoring criteria). Meanwhile, mathematician Leonardo de Moura makes the case that AI-generated code needs mathematical verification: “The friction of writing code manually used to force careful design. AI removes that friction…replace human friction with mathematical friction.”

Something fun for the weekend

Arnis

Ever wanted to explore your own neighborhood in Minecraft? Arnis is a free, open-source app that takes real geographical data from OpenStreetMap and AWS Terrain Tiles and turns it into a playable Minecraft world — buildings, roads, elevation, all of it. The gallery includes Heidelberg, the Alps, New York, and the Taj Mahal. Works with Java and Bedrock editions. There’s even a browser companion called MapSmith for generating areas up to 150 km². Check it out: https://arnismc.com/

See you next week.

Prepared at Tenvalleys — a delivery-first AI engineering partner — by Nikola Powałka. Feedback? Email us at contact@tenvalleys.com or reach out on LinkedIn.

