Opus 4.7 and the new Claude Code desktop both dropped this week — and the headlines only tell part of the story. In this edition: why the community was furious before the drop, what Anthropic isn’t saying about agent safety, and a paper that reads like a cheat sheet for document-heavy AI in regulated industries.

Topic of the Week

Community Complaints → Opus 4.7 Drops

Last week was rough for Claude power users. An AMD AI director ran 6,852 Claude Code sessions and published data showing thinking depth had dropped 67% — the post hit 1,804 upvotes on Reddit and kicked off a wave of “is it just me or did Claude get dumber?” threads. Another thread with 699 upvotes pointed out the problem wasn’t even Claude-specific: multiple major models seemed to be degrading simultaneously. Opus 4.6 users reported lazy outputs, refusals on previously-fine prompts, and generally bizarre behavior. The vibe was: we’re paying premium prices for models that are quietly getting worse, and nobody’s saying anything.

Then on April 16, Anthropic dropped Opus 4.7, and the benchmarks read like a direct response to every complaint: a 13% coding improvement over 4.6, 3x more production tasks resolved on Rakuten’s benchmark, 21% fewer document reasoning errors (Databricks), and a 98.5% visual acuity score from XBOW’s penetration testing suite. Vision got a massive upgrade too: images up to 2,576 pixels on the long edge (about 3.75 megapixels), roughly 3x what prior models could handle. There’s a new xhigh effort level for problems where you want the model to think harder, plus task budgets in public beta for controlling autonomous agent spending. Pricing stays the same ($5/$25 per million input/output tokens), and it’s live on the API, Bedrock, Vertex AI, and Microsoft Foundry. Full announcement here: https://www.anthropic.com/news/claude-opus-4-7
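To make the unchanged pricing concrete, here is a rough back-of-the-envelope sketch. It assumes the $5/$25 figures are USD per million input and output tokens respectively; the helper name and the example token counts are invented for illustration.

```python
# Prices quoted above, assumed to be USD per million input / output tokens.
INPUT_USD_PER_MTOK = 5.00
OUTPUT_USD_PER_MTOK = 25.00


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request in USD (hypothetical helper)."""
    input_cost = input_tokens / 1_000_000 * INPUT_USD_PER_MTOK
    output_cost = output_tokens / 1_000_000 * OUTPUT_USD_PER_MTOK
    return input_cost + output_cost


# Made-up example: a 200k-token prompt that produces a 10k-token answer.
cost = estimate_cost_usd(200_000, 10_000)
print(f"${cost:.2f}")
```

Output tokens cost five times as much as input tokens at these rates, so long generated answers move the bill faster than large prompts do.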

Fresh Papers

Anthropic’s 2026 Agentic Coding Trends Report (full report PDF)

Anthropic published an enterprise whitepaper with 8 trends across three categories (foundation, capability, impact) backed by real case studies. The headline stat that should make everyone pause: developers already use AI in ~60% of their work, but can “fully delegate” only 0–20% of tasks. The gap between “using AI” and “trusting AI to run autonomously” is still massive. The report frames the key shift as implementer → orchestrator: engineers stop writing code line-by-line and start coordinating agents that do. Case studies worth noting: Rakuten ran Claude Code across 12.5M lines of code for 7 hours autonomously with 99.9% numerical accuracy. CRED (15M+ users) doubled their development speed. Augment Code compressed a project estimated at 4–8 months into 2 weeks. Fountain achieved 50% faster screening and 2x candidate conversions with multi-agent orchestration. The report also predicts an “onboarding revolution,” with traditional ramp-up shrinking from weeks to hours, and that multi-agent systems will replace single-agent workflows as the standard architecture.

The Blind Spot of Agent Safety (paper)

Remember the Princeton reliability paper from Edition #001 — the one showing that agent benchmark scores keep climbing while real-world reliability barely moves? This week’s paper from LIME Lab makes that gap feel even more uncomfortable. They built OS-BLIND, a benchmark with 300 tasks across 12 attack categories, and tested how computer-use agents handle seemingly innocent instructions that cause harm through side-effects — not through adversarial prompts, but through normal-looking tasks that go wrong in context. The results are bleak: average attack success rate above 90% across most agents, including safety-aligned ones. Claude 4.5 Sonnet on its own hits 73% ASR, but put it inside a multi-agent system and that jumps to 92.7%. The most interesting finding is why: safety alignment kicks in during the first few steps of execution, then basically falls asleep. The agent checks itself early, decides everything looks fine, and then sleepwalks through the dangerous parts. For anyone building or deploying computer-use agents, this is a concrete reminder that “safety-aligned” and “safe in production” are still very different things.

Adaptive Query Routing for Financial, Legal, and Medical Documents (paper)

This one is close to home. The paper compares different RAG approaches specifically on financial, legal, and medical documents — the kind of content our banking clients deal with every day. It benchmarks vector-based agentic RAG (the standard approach: embed everything, search by similarity) against hierarchical node-based reasoning (follow the document’s structure and logic instead of just matching text). The results show that the best approach depends on the type of question — some queries need semantic similarity, others need structural navigation, and a tier-based hybrid that routes queries to the right strategy outperforms either approach alone. For anyone building document Q&A systems for regulated industries, this is a concrete comparison of what actually works rather than what sounds good in a blog post.
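The routing idea can be pictured with a toy sketch. This is not the paper's method: the keyword cues, function names, and two-way split below are all assumptions made up for illustration, and both retrieval functions are placeholders.

```python
import re
from typing import Callable

# Hypothetical cues that a question points at document structure, not topic.
STRUCTURAL_CUES = re.compile(
    r"\b(section|clause|article|appendix|table|page|paragraph)\b",
    re.IGNORECASE,
)


def vector_search(query: str) -> str:
    # Placeholder for embedding-based similarity retrieval.
    return f"[vector] {query}"


def hierarchical_search(query: str) -> str:
    # Placeholder for walking the document's section tree.
    return f"[hierarchical] {query}"


def route(query: str) -> Callable[[str], str]:
    """Pick a retrieval strategy from surface cues in the query."""
    if STRUCTURAL_CUES.search(query):
        return hierarchical_search
    return vector_search


def answer(query: str) -> str:
    return route(query)(query)


print(answer("What does clause 14.2 say about termination?"))
print(answer("How is liquidity risk managed?"))
```

A real router would likely use a classifier or the model itself rather than keywords; the point is only that the strategy is chosen per query instead of fixed per system.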

Also worth a read: Anthropic’s engineering team published how they built their Managed Agents infrastructure — the key pattern is decoupling the “brain” (Claude), “hands” (sandboxes), and “session” (durable event log). Stateless harness, on-demand containers, credentials never in the sandbox. Their p50 time-to-first-token dropped 60%. If you’re building production agent systems, this is the reference architecture.
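A minimal sketch of that brain / hands / session split, under heavy assumptions: every class and function name here is hypothetical, the "brain" is a stub instead of a model call, and an in-memory list stands in for durable storage. It is meant to show the shape of the pattern, not Anthropic's implementation.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SessionLog:
    """Append-only event log; stands in for durable storage (hypothetical)."""
    events: list = field(default_factory=list)

    def append(self, kind: str, payload: str) -> None:
        self.events.append(json.dumps({"kind": kind, "payload": payload}))

    def replay(self) -> list:
        return [json.loads(e) for e in self.events]


def brain(history: list) -> str:
    # Stub for the model call: choose the next command from the history.
    done = sum(1 for e in history if e["kind"] == "tool_result")
    return "echo step" if done < 2 else "STOP"


def hands(command: str) -> str:
    # Stub for a sandbox: it receives a command, never credentials.
    return f"ran: {command}"


def harness_step(log: SessionLog) -> bool:
    """Stateless step: all context is rebuilt from the log on each call."""
    action = brain(log.replay())
    if action == "STOP":
        return False
    log.append("tool_call", action)
    log.append("tool_result", hands(action))
    return True


log = SessionLog()
while harness_step(log):
    pass
print(len(log.events))
```

Because the harness keeps nothing between calls, any worker could pick up the session after a crash by replaying the same log.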

New Models

Cloudflare + OpenAI: Agent Cloud
Not a model release, a platform play. Purpose-built infrastructure for running AI agents at scale: millisecond cold starts (“100x speed, fraction of the cost of containers”), Git-compatible storage for agent repos, and full Linux sandboxes (now GA). Ships with GPT-5.4, Codex, and open-source models — switching providers is a one-line config change. This landed the same week AWS and Google Cloud made similar moves. The infrastructure layer for AI agents isn’t emerging anymore — it’s crystallizing. openai.com

GPT-5.4-Cyber
OpenAI released a cybersecurity-specialized model with lower refusal boundaries for legitimate security research, as part of their “Trusted Access for Cyber Defense” program. Available to a limited group for now. Meanwhile Anthropic’s Claude Mythos Preview was restricted due to extraordinary cybersecurity capabilities. The signal: specialized, domain-tuned models are becoming a thing — not just general-purpose anymore. Reddit

Claude Code & Coding AI

Claude Code Desktop — full redesign with multi-session support. The headlines: parallel agents (run multiple coding tasks at the same time), visual diffs, PR tracking, live server preview — all inside one desktop app. The standout feature is Coordinator Mode — you spin up parallel sub-agents that work on different parts of a codebase simultaneously while a coordinator keeps them aligned. Available on Pro, Max, Team, and Enterprise plans. Vercel’s teams reported 7.6x more frequent deployments after adopting it. This is Anthropic’s clearest move yet toward “AI as a dev team member” rather than “AI as autocomplete.”

Auto mode
A new permission mode that sits between “approve every action” and “skip all checks.” Claude decides for itself whether each action is safe, while a background classifier blocks risky operations (mass deletions, data exfiltration, destructive bash). If an action gets blocked 3 times in a row or 20 times total, the session falls back to manual. This is Anthropic’s answer to --dangerously-skip-permissions: you get the speed of unattended agent runs without completely removing guardrails. Requires v2.1.83+, available on Max, Team, Enterprise, and API plans (not Pro).
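The fallback rule is easy to picture as a pair of counters. The sketch below is an illustration only, not Claude Code's actual implementation: the class name is hypothetical, and it assumes that an allowed action resets the "in a row" streak.

```python
from dataclasses import dataclass

MAX_CONSECUTIVE_BLOCKS = 3   # "3 times in a row"
MAX_TOTAL_BLOCKS = 20        # "20 times total"


@dataclass
class BlockTracker:
    """Hypothetical tracker for classifier decisions in one session."""
    consecutive: int = 0
    total: int = 0

    def record(self, blocked: bool) -> str:
        """Record one classifier decision and return the resulting mode."""
        if blocked:
            self.consecutive += 1
            self.total += 1
        else:
            self.consecutive = 0  # assumption: an allowed action ends the streak
        if (self.consecutive >= MAX_CONSECUTIVE_BLOCKS
                or self.total >= MAX_TOTAL_BLOCKS):
            return "manual"
        return "auto"


tracker = BlockTracker()
decisions = [True, True, False, True, True, True]
modes = [tracker.record(blocked) for blocked in decisions]
print(modes)
```

In this run the first two blocks are forgiven because an allowed action follows them; the session only drops to manual on the third consecutive block at the end.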

v2.1.101 — massive stability release. 40+ bug fixes, including some that matter a lot if you run long sessions:

– Security fix: command injection vulnerability in LSP binary detection (patch this one)
– Memory leak fixed: the virtual scroller was retaining dozens of historical copies during long sessions (explains why things got sluggish after a few hours)
– 7 separate --resume bug fixes: session resumption should finally feel reliable
– Configurable API_TIMEOUT_MS: replaces the hardcoded 5-minute timeout, useful if you’re on slower connections or running heavy prompts
– OS CA certificate store trust by default: enterprise teams behind TLS proxies, this one’s for you
– /ultrareview: dedicated slash command for thorough code review sessions

/team-onboarding — your habits become documentation. This one deserves its own callout. Run /team-onboarding and Claude scans your last 30 days of usage — which commands you run, what workflows you follow, what patterns you’ve established — and generates a structured ramp-up guide for new team members. Instead of “sit next to someone for a week and figure it out,” a new developer gets a guide based on how your team actually works. If you’re onboarding anyone soon, try it.

Full changelog

Advisor Tool — Opus as a behind-the-scenes strategist. New API tool where a cheaper model (Sonnet or Haiku) runs the entire task but can consult Opus when it gets stuck. Not routing — the executor stays in control, Opus just advises. Results are striking: Haiku + Opus advisor doubled BrowseComp scores (19.7% → 41.2%) while costing 85% less than Sonnet alone. On SWE-bench, Sonnet + Opus advisor scored +2.7pp over Sonnet solo. There’s a max_uses parameter for cost control so the cheap model doesn’t call the expensive one on every step. If you’re building anything with the API and managing costs, this pattern is worth studying. Blog post
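The control flow behind this pattern can be sketched in plain Python. This is not the real API request shape: only the max_uses name comes from the description above, and the function, its parameters, and the stub executor are hypothetical stand-ins for model calls.

```python
from typing import Callable, Optional


def run_task(
    steps: list,
    executor: Callable[[str, Optional[str]], str],
    advisor: Callable[[str], str],
    is_stuck: Callable[[str], bool],
    max_uses: int,
) -> tuple:
    """Run every step with the executor; consult the advisor only when
    stuck, and only while the max_uses budget lasts."""
    results, advisor_calls = [], 0
    for step in steps:
        output = executor(step, None)
        if is_stuck(output) and advisor_calls < max_uses:
            advisor_calls += 1
            hint = advisor(step)
            # The executor retries with the advice; it stays in control.
            output = executor(step, hint)
        results.append(output)
    return results, advisor_calls


def cheap_executor(step: str, hint: Optional[str]) -> str:
    # Stub for the cheaper model: "hard" steps fail without a hint.
    if hint is None and "hard" in step:
        return "STUCK"
    return f"done: {step}"


results, calls = run_task(
    ["easy a", "hard b", "hard c"],
    executor=cheap_executor,
    advisor=lambda step: f"hint for {step}",  # stub for the stronger model
    is_stuck=lambda out: out == "STUCK",
    max_uses=1,
)
print(results, calls)
```

With max_uses=1 the second hard step stays stuck, which shows the trade-off: a tight budget caps spend but leaves some steps without help.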

AI at Tenvalleys

This week we decided to organize our first internal 10vOS Skill Hackathon: small teams, four hours, one goal. Each team picks a repetitive task they actually hit in their daily work and builds a custom Claude Code skill to automate it. The bet is simple: the most valuable AI tooling is the tooling your team actually uses every day, not the impressive demo nobody opens again.

Stay tuned — we’ll share what came out of it in a few weeks.

If you’ve run something similar inside your company and have lessons to trade, email us at contact@tenvalleys.com or reach out on LinkedIn. We’d love to compare notes.

Worth Reading

Stanford’s 2026 AI Index Report
The annual state-of-AI report just dropped. Key findings: Anthropic currently leads model rankings, US and China are almost neck-and-neck on performance, and AI is being adopted faster than the personal computer or the internet were. IEEE Spectrum summary

See you next week.

Prepared at Tenvalleys — a delivery-first AI engineering partner — by Nikola Powałka. Feedback? Email us at contact@tenvalleys.com or reach out on LinkedIn.