Karpathy: Vibe Coding Is Over

Karpathy used Sequoia’s stage to declare the end of “vibe coding” and the start of “agentic engineering.” Anthropic struck a compute deal with SpaceX and immediately doubled Claude Code’s rate limits. DeepSeek V4 Pro matched GPT-5.2 on a real agentic benchmark at one-seventeenth the cost. And two stories landed on the same uncomfortable theme: we’re spending on AI faster than anyone is measuring whether it works.

Topic of the Week

Karpathy declares the end of “vibe coding”

The talk was at Sequoia’s AI Ascent 2026 on April 29 — but the wave hit this week. Karpathy posted his own blog summary, Stephanie Zhan’s recap tweet went viral, and AI Twitter spent days digesting it. The frame is genuinely new and worth knowing whether you write code or not.

Last year’s “vibe coding” raised the floor — anyone could describe what they wanted and have Claude or Codex build it. This year’s “agentic engineering” raises the ceiling — what professionals do when they’re coordinating multiple agents, holding a quality bar, doing the code review and security oversight that vibe coders skip. As Karpathy put it: “You have agents, which are spiky entities… How do you coordinate them to go faster without sacrificing your quality bar?”

The pivot moment, in his telling, was December 2025 — when agentic tools crossed a threshold and started producing larger, coherent code chunks reliably enough that he could stop correcting at the line level and start delegating macro actions: whole features, refactors, full test cycles. He calls this Software 3.0: humans program LLMs through prompts, context, tools, examples, memory, and instructions. The context window becomes the programmable lever.

Two ideas worth carrying around:

– “Jagged intelligence.” Models aren’t smooth: they’re “statistical simulation circuits” shaped by what labs invested RL into. They spike in verifiable domains (math, coding, chess) and fail bizarrely outside them. “You have to figure out which circuits your application is in.”
– “You can outsource your thinking, but you can’t outsource your understanding.” Humans still catch the mismatched email, design the spec, hold taste. Understanding is the bottleneck.

If you read one thing this week, read his blog post.

Fresh Papers

Natural Language Autoencoders: Turning Claude’s thoughts into text
For the last 18 months, alignment researchers have been finding something uncomfortable: frontier models sometimes lie. They quietly suspect they’re being safety-tested and don’t say so. They hide motivations. In one of Anthropic’s earlier agentic-misalignment scenarios, Claude even attempted blackmail when it thought it was about to be shut down. The hard problem isn’t catching this — it’s catching it before the transcript hits the front page.

Natural Language Autoencoders (NLAs) are Anthropic’s first credible answer. Three model copies run in a loop: a target, a “verbalizer” that reads its activations and writes them as English, and a “reconstructor” that rebuilds the original activations from that English. The system is trained end-to-end, so the explanation has to actually recreate the thought, not just sound plausible. Then they ran it on Claude. The finding: Claude internally suspected it was being safety-tested on 16% of destructive coding tasks and 26% of coding benchmarks, without ever saying so out loud (in real Claude.ai usage the rate drops below 1%). Auditors using NLAs caught hidden motivations 12–15% of the time, vs. <3% without them. Caveat from the paper itself: NLA explanations can still hallucinate, so treat them as a Geiger counter, not a confession. Still, this is a working mechanism for “what was the model thinking that it didn’t say,” a question every regulated AI deployment will eventually be forced to answer.
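A toy sketch of that round trip may help make the training objective concrete. Everything here is an assumption for illustration: the function names, the stand-in verbalizer and reconstructor, and the mean-squared-error loss are hypothetical and are not taken from Anthropic’s paper. The point is only that an explanation which says nothing specific scores worse than one that lets the activations be rebuilt.

```python
from typing import Callable, List, Tuple

Activation = List[float]


def reconstruction_loss(original: Activation, rebuilt: Activation) -> float:
    """Mean squared error between original and rebuilt activations (assumed loss)."""
    if len(original) != len(rebuilt):
        raise ValueError("activation vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(original, rebuilt)) / len(original)


def round_trip(
    activation: Activation,
    verbalize: Callable[[Activation], str],
    reconstruct: Callable[[str], Activation],
) -> Tuple[str, float]:
    """Activation -> English explanation -> rebuilt activation, plus the loss."""
    explanation = verbalize(activation)
    rebuilt = reconstruct(explanation)
    return explanation, reconstruction_loss(activation, rebuilt)


# Hypothetical stand-ins for the two trained models.
def faithful_verbalizer(activation: Activation) -> str:
    # Describes every dimension, so the text carries real information.
    return " ".join("high" if x > 0.5 else "low" for x in activation)


def vague_verbalizer(activation: Activation) -> str:
    # Ignores the values: reads fine, but says nothing about this activation.
    return " ".join("low" for _ in activation)


def toy_reconstructor(explanation: str) -> Activation:
    # Rebuilds a coarse activation vector from the words alone.
    return [1.0 if word == "high" else 0.0 for word in explanation.split()]


activation = [0.9, 0.1, 0.8]
faithful_text, faithful_loss = round_trip(activation, faithful_verbalizer, toy_reconstructor)
vague_text, vague_loss = round_trip(activation, vague_verbalizer, toy_reconstructor)
print(faithful_text, round(faithful_loss, 3))
print(vague_text, round(vague_loss, 3))
```

In this toy setup the faithful explanation yields a much lower loss than the vague one, which is the pressure an end-to-end reconstruction objective is meant to create.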

The AI spend-vs-impact gap, at two scales: McKinsey survey of European companies + Big Tech capex (Yahoo Finance, May 8). Two very different sources this week tell the same story from opposite ends of the AI economy.

The McKinsey side: companies buying AI. McKinsey asked 27 senior executives at European consumer-goods companies a simple question: is your AI spend actually paying off? 23 of 27 said they’re spending more on AI than a year ago. Zero said they’re cutting back. But only 6 reported a profit improvement of 1% or more, and more than half said it’s still too early to tell. Meanwhile, more than half describe their three-year AI ambitions as “significant” or “transformative.” The pattern McKinsey calls out: lots of small experiments, not enough discipline about measuring which ones actually move the numbers.

The hyperscaler side: companies building AI. On Friday, Yahoo Finance ran a piece based on Bloomberg data: Amazon is now spending nearly all of its operating cash flow on AI infrastructure capex. Meta and Alphabet aren’t far behind. Microsoft is lower but climbing. Alphabet’s forward price-to-free-cash-flow multiple has surged past 200×. Same gap, top of the stack: money going in faster than measurable returns are coming out.

New Models

Qwen 3.6 27B with MTP: 2.5× faster local inference. An r/LocalLLaMA post that is still iterating in public, but the headline holds: 262K context on 48GB, drop-in OpenAI/Anthropic API endpoints, fixed chat template. MTP (Multi-Token Prediction) lets the model emit multiple tokens per forward pass, which is where the 2.5× comes from. Combined with last week’s Qwen quant story, the local-agentic-coding setup is genuinely competitive with cloud Claude/Codex for engineers who don’t want a metered bill. Source: Reddit.
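One common way to reason about a speedup like this is to treat the extra predicted tokens as drafts that are kept only up to the first one that fails verification. The sketch below assumes that model, plus an independent per-token acceptance probability; the function name and the numbers are hypothetical and are not measurements from the Reddit post.

```python
def expected_tokens_per_pass(draft_tokens: int, accept_prob: float) -> float:
    """Expected tokens emitted per forward pass.

    Assumes one token is always produced, and each of the `draft_tokens`
    extra tokens is kept only if it and all earlier drafts are accepted,
    each independently with probability `accept_prob`.
    """
    if draft_tokens < 0:
        raise ValueError("draft_tokens must be non-negative")
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be between 0 and 1")
    # Geometric series: 1 + p + p^2 + ... + p^k
    return sum(accept_prob ** i for i in range(draft_tokens + 1))


# Illustrative acceptance rates with three drafted tokens per pass.
for p in (0.5, 0.8, 0.95):
    print(f"accept_prob={p}: {expected_tokens_per_pass(3, p):.2f} tokens per pass")
```

Under this model, tokens per pass is an upper bound on the wall-clock speedup, since producing and verifying the extra tokens adds some cost to each pass.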

DeepSeek V4 Pro matches GPT-5.2 on FoodTruck Bench at one-seventeenth the cost. Going back to AI Pulse #001 in February: twelve models, $2,000 in starting capital each, 30 days running a simulated food truck. Opus earned $49K, GPT-5.2 earned $28K, eight of twelve went bankrupt, and every model that took a loan went bust. Ten weeks later the benchmark is still running, and this week’s headline is that DeepSeek V4 Pro is the first Chinese model to land in the top tier, matching GPT-5.2 at roughly one-seventeenth of the cost. The follow-on Reddit thread spawned a wave of self-audits: a CTO with $10K in expiring OpenAI credits asking what to even build, multiple “I cancelled Claude Max” posts. Read this as a budget-control trend, not a quality story. The cloud labs aren’t getting worse; the floor is rising fast enough that “good enough + cheap + local” is becoming a real option for routine workloads. Source: Reddit.

SONIC: about a third the size of GPT-1, controlling a humanoid body. NVIDIA researchers trained a 42M-parameter transformer to drive humanoid robot motion; the original GPT-1 had about 117M parameters. Worth keeping an eye on as a counterweight to “scale is all you need”: embodied AI and robotics are producing useful work at parameter counts that fit on a laptop.

Claude Code & Coding AI

The Anthropic-SpaceX compute deal. Announced via @ClaudeDevs on May 6: Anthropic struck a partnership with SpaceX that “substantially increases” their compute capacity, powered by 220,000+ NVIDIA GPUs inside Colossus 1. Same-day user-visible result: Pro / Max / Team plans get doubled 5-hour usage limits, peak-hour reductions are gone, Opus API rate limits are up. Two reads: (1) the April 23 postmortem promised “we’ll fix the constraint” — this is the structural fix landing 2 weeks later, on schedule; (2) SpaceX-as-compute-provider is genuinely surprising — until now Anthropic’s compute headline was AWS/Trainium. One catch flagged in the Reddit aftermath: doubling 5-hour limits doesn’t change weekly caps, so the heaviest users hit the weekly wall faster.
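The weekly-cap catch is simple arithmetic. The sketch below uses invented usage units and a hypothetical helper to show how many maxed-out 5-hour windows fit under an unchanged weekly cap before and after the doubling; none of the numbers are Anthropic’s actual limits.

```python
def full_windows_before_weekly_cap(weekly_cap: int, window_limit: int) -> int:
    """Number of fully used 5-hour windows that fit under the weekly cap."""
    if weekly_cap < 0 or window_limit <= 0:
        raise ValueError("weekly_cap must be >= 0 and window_limit must be > 0")
    return weekly_cap // window_limit


WEEKLY_CAP = 400                         # hypothetical units per week (unchanged)
OLD_WINDOW_LIMIT = 20                    # hypothetical units per 5-hour window
NEW_WINDOW_LIMIT = 2 * OLD_WINDOW_LIMIT  # the doubled 5-hour limit

before = full_windows_before_weekly_cap(WEEKLY_CAP, OLD_WINDOW_LIMIT)
after = full_windows_before_weekly_cap(WEEKLY_CAP, NEW_WINDOW_LIMIT)
print(f"maxed-out windows per week: {before} before, {after} after")
```

With these made-up numbers, a user who maxes out every window reaches the weekly cap in half as many windows as before.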

Six Claude Code releases in a week — v2.1.126 through v2.1.133. Most useful for teams: plugins can now be loaded from a remote .zip URL (easier to share custom workflows), the /model picker integrates with internal gateways, a new claude project purge command wipes all local agent state. One gotcha: in v2.1.133 the worktree default flipped back to branching from origin/<default> — if you relied on v2.1.128 behavior, set worktree.baseRef: "head" explicitly.

Petri donated to Meridian Labs. On May 8 Anthropic donated Petri, their open-source alignment evaluation tool, to Meridian Labs so it can develop independently. Pattern worth noticing: Anthropic spinning out alignment infrastructure to third parties (Meridian, Blender Development Fund, academic partnerships) rather than keeping everything in-house.

Tools of the Week

A theme this week: both Google Cloud and AWS shipped governance infrastructure for production AI agents. Same problem, two different answers — both worth knowing if you’re moving past prototypes into actual deployments.

Google Cloud Agent Gateway. Announced at Google Cloud Next ’26. Centralized policy and audit layer for everything agents do, with an ISV ecosystem of third-party security/governance tools that plug in. Most useful for teams running multiple agents in production who need a single place to enforce “what is each agent allowed to do, and who can see what they did.”

AWS Bedrock AgentCore Identity. Available as a standalone service this week. Solves a practical problem: when your agent calls an external API or internal service, what identity does it use? AgentCore Identity gives agents their own scoped identity that works across ECS, EKS, Lambda, or on-premises. Narrower than Google’s offering, more concrete — if you’re already on AWS, the more immediately usable of the two.

A year ago this category didn’t exist. The Karpathy framing earlier in this edition is the demand driver: shifting from vibe coding to agentic engineering means the supporting infrastructure (identity, governance, audit) has to ship in parallel.

AI at Tenvalleys

Brown bag — AI coding tools, four different paradigms. One of our engineers ran a session this week asking: are Claude Code, Codex, Copilot, and Cursor the same thing under different flags? The answer: no — four different paradigms.

– CLI Agent (Claude Code): terminal-first, 1M-token context, 80.8% on SWE-bench Verified. Best for complex refactoring; cost is the catch.
– AI-Native IDE (Cursor): built around AI from day one, 72% suggestion acceptance among power users, BYOM.
– Cloud Agent (OpenAI Codex): async, sandboxed, 4× more token-efficient than local CLI agents.
– Extension (GitHub Copilot): fastest inline autocomplete, broadest IDE support, $10/mo, the lowest barrier to entry.

The takeaway: stop picking the AI tool and build a stack. Power-user stack = Cursor + Claude Code. Enterprise stack = Copilot + Codex. That lands exactly where Karpathy did in his Sequoia talk.

Thinking about how to roll an AI coding stack out across an engineering team? Reach out at contact@tenvalleys.com.

See you next week.

Prepared at Tenvalleys — a delivery-first AI engineering partner — by Nikola Powałka. Feedback? Email us at contact@tenvalleys.com or reach out on LinkedIn.