Topic of the Week
AI’s cost wall meets cheap coding
Three things happened this week, and they’re all the same story.
The cost. OpenAI’s Q1 operating margin was –122%, even excluding stock-based compensation, per Amir Efrati. A widely shared HedgieMarkets post claims that a major cloud provider canceled its own internal Claude Code licenses this week (“token-based billing made the cost untenable, even for a company with effectively infinite cloud resources”) and that one large tech company’s CTO sent an internal memo warning it had burned through its entire 2026 AI budget in the first four months of the year. Both claims trace back to a single X post, not to primary company statements, so handle them with care. In confirmable territory: The Register reported that an AWS user was hit with a $30,000 bill after a Claude agent ran away on Bedrock, and Cost Anomaly Detection didn’t catch it. Different rooms, same conversation.
The response. An r/LocalLLaMA post went up this week from someone who built a coding agent on a 4B-parameter model that scores 87% on benchmarks. Their thesis is specific: “every coding agent (OpenCode, Cursor, Claude Code) assumes you’re running GPT-5.4 or Claude Opus. If you try them with a local model like Gemma or Qwen they fall apart.” Same week, CursorBench results dropped via BridgeMind: Cursor Composer 2.5 scores 63.2% at $0.55 per task, nearly matching Opus 4.7 Max and GPT 5.5 Extra High at 1/20th the cost. And Salvatore Sanfilippo (antirez) shipped ds4, a from-scratch Metal/CUDA inference engine for DeepSeek v4 Flash, hitting 27 tokens/sec generation at 11k-token context on an M3 Ultra, with the KV cache designed to live on disk to allow million-token context on machines with consumer-grade RAM.
The implication. Last week we covered Qwen 3.6 27B running locally on a MacBook and called it the continuation of the local-coding thread. That thread is now a rope. If a $0.55-per-task model can score 63.2% on CursorBench, close to the frontier models, and a 4B local model can hit 87% on its benchmarks, token billing for routine coding work stops making sense. The interesting question isn’t whether the cheap stack catches up; it’s how fast enterprise procurement reprices around it.
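A back-of-envelope check of those numbers, as a minimal sketch. It takes the $0.55 per task and the "1/20th the cost" ratio from the CursorBench item above at face value; the monthly task volume is a made-up assumption used only to make the arithmetic concrete, not a figure from any source.

```python
# Figures quoted in the CursorBench item above.
CHEAP_COST_PER_TASK = 0.55   # USD per task, Cursor Composer 2.5
COST_RATIO = 20              # "1/20th the cost" of the frontier models

# Hypothetical workload, chosen only for illustration.
TASKS_PER_MONTH = 10_000


def implied_frontier_cost(cheap_cost: float, ratio: float) -> float:
    """Per-task frontier cost implied by the quoted ratio."""
    return cheap_cost * ratio


def monthly_bill(cost_per_task: float, tasks: int) -> float:
    """Total monthly spend at a flat per-task cost."""
    return cost_per_task * tasks


frontier_cost = implied_frontier_cost(CHEAP_COST_PER_TASK, COST_RATIO)
cheap_bill = monthly_bill(CHEAP_COST_PER_TASK, TASKS_PER_MONTH)
frontier_bill = monthly_bill(frontier_cost, TASKS_PER_MONTH)

print(f"implied frontier cost per task: ${frontier_cost:.2f}")
print(f"monthly bill, cheap model:      ${cheap_bill:,.2f}")
print(f"monthly bill, frontier model:   ${frontier_bill:,.2f}")
```

Under these assumptions the quoted ratio implies roughly $11 per task for the frontier models, so the gap scales linearly with however many routine tasks a team actually runs.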
Fresh Papers
LongMINT: agent memory does worse than a coin flip. A new benchmark from UNC + UT Austin tests every popular memory system (RAG, MemGPT, SimpleMem) on long histories full of small updates, then asks questions that depend on the latest state. Best score: 33.4% (MemAgent). Worst: 21% (no memory layer at all). Every memory system sits in the 22–33% range, barely above the no-memory baseline.
The failure isn’t in the answering. It’s that agents save every new fact as a new memory instead of updating the old one (one framework does this 87.6% of the time). If a customer’s address changes three times, the agent ends up with three “current” addresses. What helps: timestamp every memory entry. With timestamps, RAG’s accuracy loss shrinks from 31.43 points to 10.45, roughly a 3× smaller drop. A cheap fix for any agent tracking evolving state.
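The failure mode and the timestamp fix can be sketched in a few lines. This is an illustrative toy, not the paper's method or any framework's API: `MemoryEntry`, `append_only_lookup`, and `latest_lookup` are hypothetical names, and the addresses and dates are invented.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class MemoryEntry:
    key: str               # what the fact is about, e.g. "customer_42.address"
    value: str             # the fact itself
    recorded_at: datetime  # when the agent stored it


def append_only_lookup(memories: List[MemoryEntry], key: str) -> List[str]:
    """Failure mode: every stored value for the key looks equally current."""
    return [m.value for m in memories if m.key == key]


def latest_lookup(memories: List[MemoryEntry], key: str) -> Optional[str]:
    """Timestamp fix: return only the most recently recorded value."""
    matches = [m for m in memories if m.key == key]
    if not matches:
        return None
    return max(matches, key=lambda m: m.recorded_at).value


# Hypothetical history: the same customer's address changes three times.
memories = [
    MemoryEntry("customer_42.address", "1 Oak St", datetime(2026, 1, 5)),
    MemoryEntry("customer_42.address", "9 Elm Ave", datetime(2026, 3, 12)),
    MemoryEntry("customer_42.address", "4 Pine Rd", datetime(2026, 5, 2)),
]

candidates = append_only_lookup(memories, "customer_42.address")
current = latest_lookup(memories, "customer_42.address")
print(candidates)  # all three addresses, with nothing marking which is current
print(current)     # only the most recently recorded address
```

The design choice here is to keep the full history and resolve "current" at read time; overwriting the old entry at write time would be the other obvious option, at the cost of losing the audit trail.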
OpenAI claims a general-purpose reasoning model cracked an Erdős conjecture. Announced May 19, the post says one of OpenAI’s general-purpose reasoning models found a construction that disproves a conjectured upper bound in Erdős’s planar unit-distance problem: the 1946 question of how many pairs of points among n points in the plane can sit at exactly unit distance from each other. The conjectured cap was around n^(1+O(1/log log n)); the model’s construction beats it. Not a foundation-model release, but a category signal: a generalist reasoning model, not a math specialist like AlphaProof, produced a result that a working mathematician would write up. r/MachineLearning is doing the validation work in this thread. Worth watching whether the result holds under formal verification; that’s the real test.
Gated DeltaNet-2 (NVIDIA): worth flagging given this week’s cost theme. Today’s models (Claude, GPT, Gemini) burn money on long inputs because the math behind attention (how the model decides what to focus on) gets quadratically more expensive as the input grows: double the input and the attention cost roughly quadruples. A whole research direction is trying to replace attention with something cheaper that still works (the Mamba / state-space-model family, plus a few cousins). Ali Hatamizadeh’s team at NVIDIA just shipped a new winner in that race: at 1.3B parameters, Gated DeltaNet-2 beats Mamba-3 and KDA, the previous best alternatives. Translation: the path to cheaper long-context AI is widening. Not in production yet, but the curve is moving.
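A rough sketch of why input length matters, assuming the standard picture: full self-attention scores every token against every other token (about n² pairs), while a recurrent or state-space layer makes one fixed-size state update per token (about n steps). The functions below only count operations and ignore constant factors; they are not a model of Gated DeltaNet-2 or of any production system.

```python
def attention_pair_count(n_tokens: int) -> int:
    """Pairwise scores in full self-attention: every token against every token."""
    return n_tokens * n_tokens


def recurrent_step_count(n_tokens: int) -> int:
    """State updates in a linear-time layer: one per token."""
    return n_tokens


for n in (1_000, 10_000, 100_000):
    attn = attention_pair_count(n)
    rec = recurrent_step_count(n)
    print(f"{n:>7,} tokens: attention {attn:>15,} pairs, "
          f"recurrent {rec:>7,} steps, ratio {attn // rec:,}x")
```

Growing the input 10× multiplies the attention count by 100 but the recurrent count by only 10, which is the gap this research direction is trying to exploit.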
New Models
Google Gemini Omni. Google DeepMind launched Gemini Omni mid-week — multimodal-to-video. Upload an image, sketch, or screenshot; describe what should happen; get back a video. Min Choi’s thread (“less than 34 hours ago Google dropped Gemini Omni, minds are blown”) hit 1M views, and the trending volume on X confirmed it: 251+ posts within hours. Chris First’s example — a Google Maps screenshot with a route drawn on it, prompted to render the first-person view of driving a taxi along that route — is the kind of “the prompt was an image” workflow that wasn’t tractable a year ago. Pairs naturally with what Logan Kilpatrick announced this week: Gemini 3.5 Flash on GDPval, competing at the frontier despite being a Flash-tier model.
Claude Code & Coding AI
This Wednesday brought Code with Claude London, and Anthropic used the keynote to ship two security improvements to Claude Managed Agents:
– Self-hosted sandboxes (public beta): keep the agent’s execution environment in your own infrastructure, or with a managed sandbox provider. Your security controls apply by default.
– MCP tunnels (research preview): agents reach MCP servers inside your private network without exposing them to the public internet. Solves the “legal said no to opening the MCP server” blocker for regulated organizations.
This is the one to lead with for any Managed Agents conversation in a regulated industry.
Claude Code shipped 5 versions this week (v2.1.142 → v2.1.146). Top 3:
– v2.1.142: Fast mode now defaults to Opus 4.7, full claude agents flag suite for dispatching background sessions.
– v2.1.144: Background sessions show up in /resume, with elapsed-duration completion notifications.
– v2.1.145: claude agents --json for scripting, OTEL spans tagged with agent_id/parent_agent_id for proper trace parenting.
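To make "trace parenting" concrete, here is a minimal sketch of how a pair of attributes like agent_id/parent_agent_id lets a tool rebuild the dispatch hierarchy. Only the two attribute names come from the release note; the span dictionaries, the agent ids, and the helper functions are hypothetical and do not reflect Claude Code's actual telemetry schema or any OpenTelemetry SDK call.

```python
from collections import defaultdict

# Hypothetical flattened spans; only the two attribute names are from the release note.
spans = [
    {"name": "plan",        "agent_id": "root",   "parent_agent_id": None},
    {"name": "run tests",   "agent_id": "bg-1",   "parent_agent_id": "root"},
    {"name": "fix lint",    "agent_id": "bg-2",   "parent_agent_id": "root"},
    {"name": "rerun tests", "agent_id": "bg-1-a", "parent_agent_id": "bg-1"},
]


def children_by_parent(spans):
    """Group agent ids under the agent that dispatched them (no duplicates)."""
    tree = defaultdict(list)
    for span in spans:
        siblings = tree[span["parent_agent_id"]]
        if span["agent_id"] not in siblings:
            siblings.append(span["agent_id"])
    return tree


def render(tree, parent=None, depth=0):
    """Return the agent hierarchy as indented lines, depth-first."""
    lines = []
    for agent_id in tree.get(parent, []):
        lines.append("  " * depth + agent_id)
        lines.extend(render(tree, agent_id, depth + 1))
    return lines


tree = children_by_parent(spans)
outline = render(tree)
print("\n".join(outline))
```

Without the parent attribute, the four spans would be a flat list; with it, a background session's work nests under the session that launched it.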
Through-line: background agents went from research-preview to first-class citizen this week.
“How Claude Code works in large codebases” — Anthropic engineering post (May 18). Patterns from orgs with thousands of developers running Claude Code in production. Worth a slow read if you’re scaling Claude Code beyond pilot teams.
Codex now controls your locked Mac from your phone. OpenAI shipped this on Thursday (May 21): the phone client can now drive the Codex Mac app to use apps on your Mac, even when the Mac is locked. Continues last week’s Codex-everywhere theme: Chrome extension last week, now Mac-from-phone.
Tools of the Week
xAI open-sourced X’s “For You” algorithm. xai-org/x-algorithm — the actual code that decides what you see in your X feed, plus a 3GB pretrained model included in the repo. Already 25.6k stars on GitHub. This basically never happens — Meta, TikTok, and YouTube all keep their recommendation engines locked up as trade secrets — so this is the first credible open-source production recommender with real code and real weights. Worth keeping in mind if you ever need a personalized feed or product-recommendation feature; it saves months of reverse-engineering academic papers.
AIDesigner MCP v2 — clone any URL into your repo. A community-built MCP server (also surfaced on X by @Oluwaphilemon1) that gives any coding agent (Claude Code, Codex, Cursor, Windsurf) three new modes against any URL: clone (1:1 recreation), enhance (improve while keeping intent), inspire (steal a style). Auto-detects the target stack on install (Next.js, React, Vue, Tailwind, Radix, shadcn/ui), writes per-agent config, and offers a live browser canvas paired to the terminal via a 6-character pairing code. Paid, credit-metered (1 credit per URL analysis). Useful for landing-page work where you want to lift a design system in minutes.
AI at Tenvalleys
Our Friday brown-bags are slowly becoming a tradition — different people across the team picking up a tool and walking everyone else through what they’ve learned. This week one of our team showed how he uses Make.com for process automation. Two things worth stealing:
The 80/20 on planning vs. building. He spends about 80% of his time on planning and architecture — mapping the scenario, the data flow, the edge cases — and only 20% on actually building and testing. The thinking: when you skip the planning step, you end up rebuilding the same scenario two or three times. When you plan first, you build once.
A “context reload” trick. He uses a custom command that pulls the entire chat history for a specific feature back into context, so he doesn’t lose the working knowledge across long sessions. His take, which lands particularly hard given this week’s LongMINT paper above: memory management and knowledge retention are still one of the biggest unsolved problems when working with AI.
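One way such a context reload could work, as a loose sketch only. The actual command shown in the session is not described in detail, so everything here is an assumption: `ChatMessage`, `reload_context`, the feature tags, the sample messages, and the character budget are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChatMessage:
    feature: str  # tag identifying which feature the message belongs to
    role: str     # "user" or "assistant"
    text: str


def reload_context(history, feature, max_chars=4000):
    """Rebuild a context block from every message tagged with the feature.

    When the tagged history exceeds max_chars, the newest messages are kept.
    """
    lines = [f"{m.role}: {m.text}" for m in history if m.feature == feature]
    kept = []
    used = 0
    for line in reversed(lines):      # walk from newest to oldest
        if used + len(line) > max_chars:
            break
        kept.append(line)
        used += len(line)
    return "\n".join(reversed(kept))  # restore chronological order


# Invented history mixing two features.
history = [
    ChatMessage("invoice-sync", "user", "Map the invoice webhook fields."),
    ChatMessage("crm-import", "user", "Deduplicate contacts by email."),
    ChatMessage("invoice-sync", "assistant", "Mapped: id, amount, due_date."),
]

context = reload_context(history, "invoice-sync")
print(context)
```

Filtering by feature keeps unrelated threads out of the prompt, and the budget is a crude guard against blowing past the model's context window.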
We make sure everyone at Tenvalleys uses AI in their day-to-day work, and these sessions are how the team gets hands-on with the same tooling we ship to clients. Interested in building that kind of practice in your own team? Reach out at contact@tenvalleys.com.
For Dessert
Andrej Karpathy joined Anthropic. “Returning to R&D and Pre-training,” he wrote. A few weeks ago at Sequoia AI Ascent he said he’d “never felt more behind” on the pace of AI — and the team he’s joining had a notable week of its own: KPMG (276K employees) signed on as a global partner, the SDK toolkit Stainless got acquired, and for the first time Claude passed ChatGPT in US business adoption. Karpathy following the gravity, not making it — but a nice signal regardless.
Prepared at Tenvalleys — a delivery-first AI engineering partner — by Nikola Powałka. Feedback? Email us at contact@tenvalleys.com or reach out on LinkedIn.


