Claude Opus 4.7 Review: Is It Worthy of the Title of Strongest Model?
Original Title: "Opus 4.7 Never Intended to Be the 'Strongest Model': Everyone Hyped Claude's Speed Can't Keep Up with Anthropic's Pace"
Original Source: Silicon Pro
On April 16, 2026, Anthropic officially released Claude Opus 4.7, just over two months after the previous generation Opus 4.6.
After a frantic run of product and model updates, a new Anthropic release naturally arrives with a sense of occasion. You have probably already seen plenty of first-look reviews calling Opus 4.7 the "strongest model," with "humanity is finished" and "unemployment alert" takes once again making the rounds.
But let's take a look at what Anthropic actually released.
This release's tone is actually quite unusual.
In the announcement, Anthropic stated outright that Opus 4.7 is less capable than Claude Mythos Preview, a model available only to a handful of partners such as Apple, Google, Microsoft, and Nvidia, and not accessible to ordinary developers and users.
More striking than that admission: Opus 4.7 is not just weaker than the legendary Mythos; on some key capabilities it is weaker than the previous generation.
One number in Opus 4.7's performance table stands out as abnormal: the long-context benchmark MRCR v2 @1M dropped from 78.3% in Opus 4.6 to 32.2%, a plunge of 46 percentage points.
It is vanishingly rare for a flagship iteration to cut one of its own ace capabilities by more than half.
And Anthropic did it deliberately.
So while everyone keeps reflexively crowning each new model the "strongest," they are actually falling behind Anthropic's own pace.

Anthropic isn't even trying to dress this up
Opus 4.7 was never meant to be the "strongest model." It is a release with explicit trade-offs, a scalpel rather than a sledgehammer, and it departs from the release playbooks top model vendors have used in the past. It also marks a direction those vendors are now collectively turning toward: having sensed that "great leaps forward" in the base model are no longer sustainable, Anthropic is edging closer to the release strategies of companies like Apple and Microsoft in their mature, commercialized product stages.
This may be the real significance of 4.7.
1. Coding Ability: Real Improvement Behind the Numbers
The best way to understand these changes is to take a close look at what this release actually delivers.
Here is the complete rundown of the Opus 4.7 release—what has improved, what has deteriorated, developer feedback firsthand, and whether migration is necessary.
Official Announcement: https://www.anthropic.com/news/claude-opus-4-7
Coding performance is the centerpiece of this release of Opus 4.7.

SWE-bench Verified (500 real GitHub issues, requiring models to produce patches that pass tests) has increased from 80.8% in Opus 4.6 to 87.6% in Opus 4.7, a nearly 7 percentage point improvement, making it the top performer among publicly available models. Compared to Gemini 3.1 Pro's 80.6%, the difference is significant.
SWE-bench Pro is a more challenging version, covering a complete engineering pipeline in four programming languages. Opus 4.7 has jumped from 53.4% to 64.3%, an 11 percentage point increase. Compared to GPT-5.4's 57.7% and Gemini 3.1 Pro's 54.2%, Opus 4.7 is clearly ahead in this benchmark.
CursorBench is a practical benchmark from Cursor, specifically measuring a model's programming assistance quality in a real IDE environment. Opus 4.6 scored 58%, while Opus 4.7 jumped to 70%, a 12 percentage point improvement. Cursor co-founder Michael Truell stated in the official announcement, "This is a significant leap in capabilities, providing stronger creative reasoning when tackling challenges."
Partner Tested Data:
· Rakuten: The number of production tasks resolved by Opus 4.7 is three times that of Opus 4.6, with double-digit increases in code quality and test quality ratings
· Factory: Task success rate increased by 10-15%, with significantly fewer mid-task failures
· Cognition (Devin's company): Model "can work continuously for hours without disconnecting"
· CodeRabbit: Recall rate increased by over 10%, "slightly faster than GPT-5.4 xhigh mode"
· Bolt: In longer application build tasks, Opus 4.7 outperformed Opus 4.6, "showing up to a 10% improvement in the best case scenario, without the regression issues seen in the past"
· Terminal-Bench 2.0: Opus 4.7 addressed three tasks that no previous Claude model (or competitor) could handle, including one that required cross-repository multi-file reasoning to fix a race condition

These data points all point in one direction: Opus 4.7 has improved markedly on long-duration, cross-file, context-maintaining programming tasks. That directly addresses the biggest user complaints about Opus 4.6 over the past two months: tasks abandoned halfway through execution and the model getting lost in multi-file bugs.
2. Visual Capability: The Most Underestimated Improvement in This Release
Visual accuracy benchmark XBOW jumped from 54.5% to 98.5%. This is not an incremental improvement, but a reconstruction-level leap.
Specific spec changes:
· Maximum image resolution increased from around 1.15 million pixels (longest edge 1,568 pixels) to approximately 3.75 million pixels (longest edge 2,576 pixels), over 3 times that of the previous generation
· Model coordinates now correspond 1:1 with actual pixels, eliminating the need for manual scaling factor conversion in computer vision tasks
· CharXiv Visual Reasoning Benchmark: Without Tools 82.1%, With Tools 91.0%
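The 1:1 coordinate mapping removes a conversion step computer-use pipelines previously needed. A minimal sketch of the difference, with illustrative numbers rather than an official formula:

```python
def model_to_pixel(x, y, sent_width, actual_width):
    """Map model-reported coordinates back to real screen pixels.

    Under Opus 4.6, large screenshots were downscaled (longest edge
    capped at 1,568 px) before the model saw them, so click targets
    had to be rescaled client-side (illustrative logic, not an
    official algorithm). Under 4.7, coordinates map 1:1, so the
    scale factor collapses to 1.0 and this step becomes a no-op.
    """
    scale = actual_width / sent_width
    return round(x * scale), round(y * scale)

# 4.6-style: a 2560-px-wide screenshot downscaled to 1568 px for the model
print(model_to_pixel(784, 300, sent_width=1568, actual_width=2560))
# 4.7-style: image sent at native resolution, coordinates pass through
print(model_to_pixel(784, 300, sent_width=2560, actual_width=2560))
```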

What does this mean in practice?
For teams building computer-use products, this upgrade could be decisive. In the Opus 4.6 era, computer use was stuck at "good enough for a demo, not for production": the misclick rate was too high and too unpredictable. A 98.5% visual accuracy means the feature has, for the first time, crossed the threshold for reliable deployment. Several tech blogs put it bluntly in their reviews: if you shelved a computer-use product because of Opus 4.6's misclick rate, 4.7 has cleared that obstacle.
Firsthand feedback on Reddit (r/ClaudeAI): one user wrote, "The visual improvement is the crucial one. I've done a lot of side projects trying to get the model to iteratively improve its output in a visual feedback loop, and the results have always been chaotic. I'm really curious whether 4.7 fixes this."
In addition to computer use, other benefiting scenarios include: document scanning analysis (able to read smaller fonts, recognize finer details in charts), screenshot understanding, dashboard applications, and complex PDF processing.
Cost Consideration: Higher-resolution images will consume more tokens. If your application scenario does not require high image detail, it is recommended to downsample before input.
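The downsampling recommendation is simple arithmetic. A sketch of fitting an image under a pixel budget before upload, using the article's ~1.15 MP figure as the default (the budget value is an assumption you should tune to your own cost tolerance):

```python
def fit_within(width, height, max_pixels=1_150_000):
    """Return (new_width, new_height) scaled so the image stays under
    max_pixels while preserving aspect ratio. Defaults to the older
    ~1.15 MP budget; pass a larger budget only when the extra detail
    (small fonts, dense charts) is actually needed.
    """
    pixels = width * height
    if pixels <= max_pixels:
        return width, height  # already small enough: pass through
    scale = (max_pixels / pixels) ** 0.5  # area scales with the square
    return int(width * scale), int(height * scale)

print(fit_within(2576, 1456))  # large screenshot shrunk under the budget
print(fit_within(800, 600))    # already small: unchanged
```

Apply the resulting dimensions with whatever image library your pipeline already uses before attaching the image to a request.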

3. The Biggest Setback: Long-Context Collapse
MRCR v2 @1M (Million-Token Long-Context Recall):
· 4.6: 78.3%
· 4.7: 32.2%
A plunge of 46 percentage points: from nearly 80% down to barely a third.
A drop like this is almost unprecedented in flagship model history. MRCR v2 was a capability Anthropic promoted heavily in the Opus 4.6 era; the framing at the time was that this was the first context window of that magnitude at which a model remained genuinely usable. With 4.7, that "qualitative change" has simply vanished.
Why? The tokenizer has been changed.
Opus 4.7 uses a new tokenizer, and the same input text now yields approximately 1.0-1.35 times as many tokens, with the exact multiplier depending on the content type.
The immediate ramifications:
· The nominal 200K/1M context window is unchanged, but the same amount of text now takes up more of it.
· Actual token consumption for long-running agent workflows has risen by up to roughly 35%.
· Pricing is unchanged (input $5, output $25 per million tokens), but the effective cost of a given workload has gone up.
Anthropic's official statement is that the new tokenizer "has improved text processing efficiency," but benchmark data shows a significant regression in long-context scenarios.
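The cost impact of the multiplier is easy to quantify. A sketch using the article's pricing and a hypothetical long-task workload (the 400K-input / 40K-output figures are made up for illustration):

```python
def task_cost(input_tokens, output_tokens,
              in_price=5.0, out_price=25.0, token_multiplier=1.0):
    """Dollar cost of one task at Opus pricing ($ per million tokens).

    token_multiplier models the new tokenizer: per the announcement,
    the same text now maps to roughly 1.0-1.35x as many tokens.
    """
    m = token_multiplier
    return (input_tokens * m * in_price + output_tokens * m * out_price) / 1e6

base = task_cost(400_000, 40_000)                          # 4.6-era tokenizer
worst = task_cost(400_000, 40_000, token_multiplier=1.35)  # 4.7 worst case
print(f"${base:.2f} -> ${worst:.2f} (+{worst / base - 1:.0%})")
```

Note this captures only the tokenizer effect; higher default effort tiers add thinking tokens on top of it.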
Search capabilities have also regressed:
· BrowseComp (Web Deep Information Retrieval): Opus 4.6 was at 83.7% while Opus 4.7 is at 79.3%.
· GPT-5.4 Pro scored 89.3% in this area, Gemini 3.1 Pro scored 85.9%, and Opus 4.7 currently ranks last among the major competing models.
Search and long text happen to be the most common scenarios for many enterprise users.
Firsthand developer feedback from Hacker News (a post with 275 upvotes and 215 comments):
"Turning off adaptive thinking and manually cranking up the effort slider is what got me back to baseline performance. Phrases like 'it looks good in our internal tests' are no longer sufficient; everyone is seeing the same issue."
"In 4.7, human-readable reasoning token summaries are no longer included in the output by default; you have to add display: summarized to the API request to get them back."
These are all issues reported by actual users. However, this is also a choice made proactively by Anthropic.
4. New Behavioral Trait: Self-Validation and More Literal Instruction Following
A noteworthy statement in the Opus 4.7 official announcement is: The model validates its output before reporting results.
Hex's technical team provided a specific example during testing: when data is missing, Opus 4.7 will truthfully report "data does not exist" instead of providing a seemingly reasonable but actually fabricated answer—a pitfall that Opus 4.6 would fall into. The fintech platform Block's assessment of this was: "It can detect its own logic errors during the planning phase, speeding up execution, showing a clear improvement over the previous Claude model."
However, self-validation has brought about another associated behavioral change: Opus 4.7 interprets instructions more literally.
This poses a significant migration risk. If you meticulously tuned prompts for Opus 4.6, 4.7 may not "read between the lines" like 4.6 would, but strictly follow the literal meaning you have written. Anthropic explicitly mentioned this in the official migration guide and recommended conducting regression testing on key prompts before deploying 4.7.
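The recommended regression testing can be as simple as running tuned prompts through 4.7 once and checking outputs against minimal expectations. A sketch of such a harness (the check format and example data are illustrative, not an official tool):

```python
def regression_check(outputs, checks):
    """Flag prompts whose new-model outputs miss required content.

    outputs: {prompt_id: model_output} collected from a 4.7 trial run.
    checks:  {prompt_id: list of substrings the output must contain}.
    Returns the prompt_ids that failed, i.e. prompts to re-tune
    before switching model IDs. (Harness shape is illustrative.)
    """
    failed = []
    for pid, required in checks.items():
        text = outputs.get(pid, "")
        if not all(substr in text for substr in required):
            failed.append(pid)
    return failed

# A prompt that 4.6 handled via implied intent may now need explicit wording:
outputs = {"summarize": "Here is the summary...", "json_report": "{'a': 1}"}
checks = {"summarize": ["summary"], "json_report": ['"a"']}  # expected strict JSON
print(regression_check(outputs, checks))  # -> ['json_report']
```

In practice you would also diff token counts per prompt, since the tokenizer change shifts those independently of output quality.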
A practical reference point from Hex's CTO: For the low-effort tier, Opus 4.7 performs roughly equivalent to the mid-effort tier of Opus 4.6.
5. Reasoning Control: xhigh, Task Budgets, and /ultrareview
Opus 4.6's run was marked by an episode that damaged user trust: on February 9, adaptive thinking became the default mode, and on March 3, Claude Code's default reasoning depth was lowered from the highest tier to medium, officially to "balance intelligence, latency, and cost." The episode, dubbed "the intelligence gate," drew wide attention after a senior director at AMD questioned it on GitHub.
Opus 4.7's response was to give users more explicit control over reasoning depth.
xhigh effort tier: a new reasoning-intensity level between the existing high and max levels. Claude Code's planning defaults have now been updated to xhigh.
However, the developer community has a direct question about xhigh, as stated by a Reddit user: "Opus 4.6 defaults to medium, and 4.7 defaults to xhigh. I'm curious about the reasoning behind this decision because raising the effort tier obviously results in more token consumption."
In other words, what users see as a "return control to the user" fix is actually an increase in the default tier, meaning the same task now requires burning more tokens. Coupled with the tokenizer changes, this is a double cost increase.
Task Budgets (In Public Beta): A token budget control mechanism for long tasks. Developers set a total token budget (minimum 20K), and the model can dynamically see the remaining balance during execution to allocate resources accordingly. This is to prevent stopping midway due to token overspending and avoid unnecessary computation waste.
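The budget idea can be approximated client-side today. A sketch of tracking spend across agent steps and stopping cleanly before exhaustion (this is illustrative only and not the actual Task Budgets API surface; only the 20K minimum comes from the article):

```python
class TaskBudget:
    """Client-side sketch of the task-budget idea: track tokens spent
    across agent steps and wrap up before the budget runs out."""

    MIN_BUDGET = 20_000  # minimum budget per the announcement

    def __init__(self, total_tokens):
        if total_tokens < self.MIN_BUDGET:
            raise ValueError(f"budget must be >= {self.MIN_BUDGET}")
        self.total = total_tokens
        self.spent = 0

    def record(self, tokens):
        """Log tokens consumed by one step; return the remaining balance."""
        self.spent += tokens
        return self.remaining

    @property
    def remaining(self):
        return self.total - self.spent

    def should_wrap_up(self, reserve=2_000):
        """True when it's time to stop exploring and finalize output."""
        return self.remaining <= reserve

budget = TaskBudget(20_000)
budget.record(17_500)
print(budget.remaining, budget.should_wrap_up())  # 2500 False
budget.record(1_000)
print(budget.remaining, budget.should_wrap_up())  # 1500 True
```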
Claude Code New /ultrareview Command: A special code review session focused on bug fixing and design issues, running a deep review once, with Pro and Max users receiving 3 free sessions per month.
Auto Mode Open to Max Users: Previously only available in the Enterprise plan, now also accessible to Max users. In auto mode, Claude can make decisions autonomously, reducing the need to interrupt users for input. Boris Cherny, the head of the Claude Code team, stated: "Give Claude a task, let it run, and come back to verified results."
6. Benchmark Overview: Wins and Losses
Below are the current key benchmark data released (source: Anthropic Official System Card and Partner Evaluations).
Programming and Engineering (Opus 4.7 Leading)

Vision and Multimodal (Opus 4.7 Significantly Leading)

Knowledge Work (Opus 4.7 Leading)

Comprehensive Evaluation (Opus 4.7 Clearly Steps Up)

General Reasoning (The Big Three Roughly Tied)

This benchmark has become saturated and is no longer an effective competitive differentiator.
Research Task Type (GPT-5.4 Leads, Opus 4.7 Falls Back)

Long-Form Context (Opus 4.7 Significantly Regresses)

Summary of Model Selection Logic: In the areas of programming, engineering agent, vision, and financial legal knowledge work, Opus 4.7 has a clear advantage; for research-intensive tasks and open-network retrieval, GPT-5.4 is stronger; in long-form context scenarios, Opus 4.7 falls far behind its predecessor, which is the most concerning point.
7. Security Barrier: Mythos's Milestone
This section is often overlooked as a "security boilerplate statement" in press releases, but it is key to understanding Anthropic's current strategy.
On April 7, Anthropic announced Project Glasswing: making the Claude Mythos Preview available to nine partners (Apple, Google, Microsoft, Nvidia, Amazon, Cisco, CrowdStrike, JPMorgan Chase, and Broadcom), specifically for defensive cybersecurity scenarios.
Mythos is Anthropic's most powerful model to date. According to The Hacker News, it can autonomously discover zero-day vulnerabilities, identifying thousands of previously unknown vulnerabilities in major operating systems and browsers. However, due to this capability, it has also been deemed to have significant misuse risks and is therefore not publicly released.
Opus 4.7 is the first test sample along this line. During the training phase, Anthropic actively reduced the model's ability to launch cybersecurity attacks (while trying to retain defensive capabilities) and implemented a real-time barrier system for automatically detecting and blocking high-risk cybersecurity requests. The original announcement stated: "We will learn from the actual deployment of Opus 4.7 to determine the effectiveness of this barrier before deciding whether to extend it to Mythos-level models."
In other words, every developer using Opus 4.7 is helping Anthropic calibrate the security fence.
Gizmodo's Review: This release takes on a "bold marketing strategy—proactively promoting their new model as 'less generally capable than other options,'" which is extremely rare in flagship releases.
If security professionals need to use Opus 4.7 for legitimate penetration testing, vulnerability research, or red teaming, they need to apply to join the Cyber Verification Program.
8. Pricing and Migration: Nominal Stays Put, Real Cost Goes Up
Pricing: Input at $5/million tokens, output at $25/million tokens, the same as Opus 4.6. API model ID is claude-opus-4-7. Supported platforms include Claude API, Amazon Bedrock, Google Cloud Vertex AI, Microsoft Foundry, with GitHub Copilot also onboarded.
However, as noted earlier, the tokenizer change means the same input now yields roughly 1.0-1.35 times as many tokens. Combined with the extra thinking tokens at higher default effort levels, the actual cost of a long-running agent workflow may be 2-3 times that of Opus 4.6 at equivalent settings.
Anthropic has also cut Claude Code's cache TTL from one hour to five minutes: step away from your computer for more than five minutes and the context cache expires and must be reloaded, burning through tokens faster. Reddit is full of users complaining that "the quota burns faster than a waterfall."
List of Disruptive Changes for Existing Opus 4.6 Users:
1. Extended Thinking Budgets parameter has been removed; passing it will return a 400 error, and adaptive thinking mode should be used instead
2. Sampling parameters such as temperature, top_p, top_k have been removed; prompting should be used to control output behavior
3. Stricter literal instruction following: prompts tuned for Opus 4.6 need retesting; do not simply swap the model ID.
4. Tokenizer changes have altered token counts. It is recommended to first run samples on real traffic before proceeding with a full migration.
5. The default output no longer includes reasoning token summaries; set display: summarized explicitly to get them back.
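Most of these breaking changes can be scrubbed mechanically from request payloads. A sketch of such a migration pass, where the parameter names mirror the list above and are assumptions from this article, not verified against the current API reference:

```python
# Parameters the article says 4.7 rejects (names are illustrative):
REMOVED_PARAMS = {"temperature", "top_p", "top_k", "thinking_budget"}

def migrate_request(payload):
    """Return a copy of an Opus 4.6 request dict rewritten for 4.7,
    per the breaking-change list above. Does not mutate the input."""
    migrated = {k: v for k, v in payload.items() if k not in REMOVED_PARAMS}
    migrated["model"] = "claude-opus-4-7"
    # Reasoning summaries are now opt-in:
    migrated.setdefault("display", "summarized")
    return migrated

old = {"model": "claude-opus-4-6", "temperature": 0.2, "top_p": 0.9,
       "max_tokens": 4096, "messages": [{"role": "user", "content": "hi"}]}
print(migrate_request(old))
```

This only fixes rejected parameters; the prompt-behavior and token-count differences still require the regression testing described earlier.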
Best Practice: The official Anthropic migration guide suggests running Opus 4.7 on representative production traffic before the final switch, comparing token consumption and task quality before making a decision.
Precision in execution can be terrifying.
Opus 4.7 is a targeted upgrade with clear advantages but considerable trade-offs. Every one of those trade-offs was designed by Anthropic itself, and to a large extent you are the one footing the bill.
The bright side of this model's progress:
· 87.6% on SWE-bench Verified, 64.3% on SWE-bench Pro, 70% on CursorBench, and a 3x increase in Rakuten's production tasks - these are the perceptible improvements in programming capability within a production environment
· Visual capabilities rebuilding (XBOW 54.5% → 98.5%, 3x increase in resolution, pixel-perfect 1:1 mapping), enabling computer use for reliable deployment for the first time
· xhigh tier, task budgets, /ultrareview - an explicit response to the "dumbing down" controversy
· 90.9% on BigLaw, 64.4% on Finance Agent, clearly leading in specialized knowledge work like financial legal matters
What has been given up:
· MRCR v2 @1M dropped from 78.3% to 32.2%, nearly halving long-context capabilities
· BrowseComp drops from 83.7% to 79.3%, search capability overtaken by both GPT-5.4 and Gemini 3.1 Pro
· tokenizer changes + increased default effort + shortened cache TTL = triple stealth price hike
· Mythos held in reserve, a sign that Anthropic has even stronger cards in hand and is choosing not to play them
This release is the most authentic yet, not the "strongest model" nor the "strongest public model," but rather: an iteration with clear trade-offs.
The latest news is that Claude Code hit $2.5 billion in annualized revenue in February. Opus 4.7 is the next move in that product line.
Coding and vision are the additions; long context and search are the subtractions; the sticker price holds steady while the real bill rises. With Opus 4.7, Anthropic is working a balance: repairing the trust damage left by Opus 4.6 while running a live security exercise ahead of a broader opening of Mythos-level models. More importantly, it wants to cash in its current lead, converting users' preference for its products into the kind of inertia that survives imperfect but indispensable generations, building the love-hate stickiness that mature companies like Apple have achieved, and with it an ecosystem of real commercial value.