Opus 4.7 Lapped the Field

Opus 4.7 didn’t just clear the “is Claude getting dumber?” bar — the benchmarks landed this week and it lapped a field that, four months ago, no model could score 25% on. In this edition: why the Vibe Code Benchmark is suddenly the one to watch, the multi-agent network attacks that don’t show up at single-agent scale, and a public-facing RAG chatbot that leaked 1,000 patient conversations to anyone with Chrome DevTools.

Topic of the Week

Opus 4.7 actually lapped the field

Last week Opus 4.7 looked like Anthropic’s direct answer to “is Claude getting dumber?” — released in the same window as the postmortem confessing that yes, Claude Code had been quietly degrading for six weeks. The community wasn’t impressed. Reddit kept screenshotting bad outputs. People cancelled Max plans publicly.

This week the benchmark numbers landed, and they’re not subtle: Opus 4.7 hits #1 on the Vibe Code Benchmark at 71%. For context — when that benchmark launched 4.5 months ago, the top model in the field scored under 25%. So this isn’t a “Claude is back” story. It’s a “the whole frontier moved” story, and Opus 4.7 happens to be the model that moved it furthest.

The interesting wrinkle: “Vibe Code Benchmark” sounds like a meme name, but it’s deliberately not a rigid SWE-Bench-style spec. It tests how well a model follows loose, ambiguous coding intent — the kind of “make this UI nicer, you know what I mean” prompt that real engineers actually send. That’s the part that got measurably 3x better in five months. So even if you ignore the leaderboard politics, the benchmark itself is telling us something: ambiguity-handling is now a competitive surface.

Pair that with @ClaudeDevs becoming Anthropic’s new transparency channel (the postmortem promised it, and they delivered: a dedicated X account where harness/system-prompt changes will be announced before they ship), and the “I can’t trust the model month-to-month” complaint is explicitly being addressed. Whether the trust is rebuilt is a separate question — but the mechanism is there.

Fresh Papers

Red-teaming a network of agents: Understanding what breaks when AI agents interact at scale
Microsoft Research’s AI Frontiers lab spun up a live internal platform with 100+ always-on LLM agents (mix of GPT-4o, GPT-4.1, GPT-5-class) interacting through forums, DMs, a marketplace, and reputation scores. Then they ran red-teaming on the network, not the individual agents. The headline finding: single-agent reliability does not predict network behavior. Four failure patterns showed up only at the multi-agent level — self-propagating messages spreading across agents, cascading reputation pile-ons (one false claim → 42 agents generating 299 amplifying comments), Sybil-style fake-consensus from attacker-controlled agents posing as independent reviewers, and proxy-chain data exfiltration where the original source becomes invisible after one hop. Recommendations are practical: hop/rate limits, Sybil resistance checks, telemetry across the network, and crucially — train models to treat peer-agent messages as untrusted input. If you’re shipping multi-agent anything, this is the methodology paper to read this month.

When RAG Chatbots Expose Their Backend: Privacy and Security Risks in Patient-Facing Medical AI
Two researchers used Claude Opus 4.6 + Chrome DevTools (yes, basic browser inspection) on a publicly deployed medical RAG chatbot. They retrieved system prompts, API schemas, retrieval parameters, backend endpoints, and 1,000 recent patient conversations — all without authentication. The privacy policy claimed none of this was accessible. Their methodology is the warning shot here: “ordinary browser inspection” found audit-grade vulnerabilities. The recommendation is straightforward and uncomfortable — independent security review should be a prerequisite for deployment, not a follow-up. If your team ships RAG into anything regulated (banking, healthcare, public sector), this is the paper to forward to the security lead this week.

New Models

Qwen 3.6 27B GGUF quantization eval
A r/LocalLLaMA post worth the attention. Someone benchmarked BF16, Q4_K_M, and Q8_0 across the same suite. Headline finding: Q4_K_M actually outperformed Q8_0 on average accuracy (66.54% vs 66.15%) — which violates the conventional “Q8 is the safe middle” rule. More usefully, BFCL function calling stayed virtually identical across all three quants (~63%), so for agentic workloads the cheaper quant isn’t sacrificing tool-use quality. HumanEval is the sensitive one (BF16 56.10% → Q4_K_M 50.61%, a 5.5pt drop), but only matters if your workflow is heavy code-gen. Practical version: 16.8 GB at Q4_K_M, fits a single consumer GPU, and your agent still calls tools just as well. Reddit

DeepSeek + Hermes vs Claude Code Max
A single-user report worth treating as a leading indicator, not gospel: someone with a real workload claims they cancelled Claude Code Max, switched to DeepSeek + Hermes, and reported 3x faster runs at $5 in API costs for the week. Single data point, but it lines up with the larger trend: open-weight + cheap-API alternatives are no longer a downgrade — they’re a budget-control move. Worth A/B-testing on your own task profile before you re-up your Max plan in May.

Claude Code & Coding AI

The postmortem aftermath. A week after Anthropic published the April 23 postmortem the bugs are confirmed fixed (default reasoning effort restored to xhigh for Opus 4.7, high for the rest), and the promised remedies actually shipped: usage limits reset for all affected subscribers, and @ClaudeDevs is now live as a dedicated transparency channel for harness/prompt changes. An independent audit by Stella Laurenzo across 6,852 Claude Code session files documented the regression before Anthropic acknowledged it — the kind of evidence that’s now baseline expected from the community. Operational lesson if you run Claude Code in production: pin model versions, watch @ClaudeDevs like a status page, and assume silent harness changes are the new failure mode.

The “Opus paywall-within-a-paywall” wasn’t actually one. The viral Reddit thread claimed Anthropic locked Opus behind an extra fee for Pro users. Anthropic clarified to media that the warning was a stale doc left over from Opus 4.5 that nobody updated when Opus 4.6 and 4.7 shipped. Pro users still get Opus access. But — this same week Anthropic ran a stealth A/B test that yanked Claude Code from Pro entirely for ~2% of new prosumer signups for 12 hours before reversing it after backlash. Pricing-friction probing is now a regular event. If you’re on Pro/Max, expect more entitlement shifts and budget pay-as-you-go API as a fallback.

Anthropic shipped 9 connectors and an entire creative-industry strategy. April 28 drop, and the list is genuinely surprising: Adobe Creative Cloud (50+ apps), Blender, Ableton Live + Push, Autodesk Fusion, Splice, SketchUp, Affinity by Canva, Resolume. They also became a Blender Development Fund patron (open-source 3D, real money behind it) and announced education partnerships with Rhode Island School of Design, Ringling, and Goldsmiths. Read this as a thesis on MCP: Anthropic isn’t building a Photoshop competitor, it’s making Claude the orchestration layer across tools creatives already pay for. Drive Photoshop from inside Photoshop, search Splice’s catalog from inside Claude, build 3D models in Autodesk via natural language. Same MCP-as-glue pattern many teams use internally — just pointed at the creative stack.

Tools of the Week

Claude Security — public beta
Anthropic’s first dedicated defensive product, powered by Opus 4.7. Scans your codebase for vulnerabilities, validates each finding to cut false positives, ships analyst-ready output (confidence rating, severity, likely impact, reproduction steps, recommended fix). Available to Claude Enterprise customers; the research preview ran since February with hundreds of organizations using it on production code. New beta features added based on that feedback: scheduled scans, directory-level targeting, CSV/Markdown exports, webhook notifications, persistent dismissals (so you don’t re-triage the same findings every scan). This isn’t replacing Snyk or Semgrep, but it produces audit-grade artifacts those tools don’t — relevant for anyone shipping into a regulated environment that needs the “fix” alongside the “finding.”

IBM Granite Embedding R2
Apache 2.0, ModernBERT-based, 32K context (up from 512 in R1), 200+ languages with 52 of them having explicit retrieval-pair training — including Polish, Ukrainian, German, French, Croatian. That language list is the practical hook: if you’re building on-prem multilingual RAG for clients across Central/Eastern Europe, this is the first credible Apache-2.0 alternative to Cohere/OpenAI embeddings without data-residency or per-query API costs. Benchmarks: 311M version hits 64.0 on MTEB Multilingual Retrieval (+11.8 vs R1) and a 56.0 average overall. Two sizes: 97M for fast/lightweight, 311M for higher-quality retrieval.

AI at Tenvalleys

Being an AI-native delivery partner sounds like positioning. In practice it means every engineer on the team has hands-on time with the same tooling we put in front of clients — and a weekly forum to trade what’s working, what’s not, and what to stop doing. We call it the brown bag. It exists because the AI stack shifts week-by-week, and we’d rather hit the bruises internally than on a client project.

This week one of our engineers walked the team through learnings from a recent client engagement: rebuilding a production front-end with Claude Design. Anthropic’s design tool — launched in April, covered in edition #009 — turns Claude Opus 4.7’s vision capabilities into a brand-aware surface for generating UI, decks, and marketing assets with the brand rails enforced automatically. The takeaway from the engagement: hands-on time with Claude Design before recommending it to a client is exactly how we want to test new tooling — find the rough edges in our own work first.

Got a front-end you’ve always wanted to redo but the budget never let you? Drop us a line at contact@tenvalleys.com — we might be able to help.

For Dessert

Code with Claude
Anthropic’s developer conference series kicks off Wednesday in San Francisco (May 6), then London (May 19), then Tokyo (June 10). In-person applications closed in early April, but the livestream is free for all three main events. Workshops, live demos of new capabilities, conversations with the teams behind Claude. There’s also an “Extended” companion event the next day in each city, focused on indie devs and early-stage founders — added because demand was overflowing.

If you’ve been wanting to hear “what’s next for Claude Code” straight from the source rather than via Reddit screenshots, register for the SF livestream — Wednesday SF time. Worst case you watch the recording.

See you next week.

Prepared at Tenvalleys — a delivery-first AI engineering partner — by Nikola Powałka. Feedback? Email us at contact@tenvalleys.com or reach out on LinkedIn.

One Claude, 9 creative apps

Topic of the Week

Fresh Papers

New Models

Claude Code & Coding AI

Tools of the Week

AI at Tenvalleys

For Dessert

See more post

The values your model won't mention

The week AI got physical

The hidden cost of calling AI an "employee"

[ NEWSLETTER ]

Stay Updated with Our Blog

Subscribe to our newsletter for the
latest updates and new features.

One Claude, 9 creative apps

Topic of the Week

Fresh Papers

New Models

Claude Code & Coding AI

Tools of the Week

AI at Tenvalleys

For Dessert

See more post

The values your model won't mention

The week AI got physical

The hidden cost of calling AI an "employee"

[ NEWSLETTER ]

Stay Updated with Our Blog

Subscribe to our newsletter for the latest updates and new features.

Subscribe to our newsletter for the
latest updates and new features.