DeepSeek-V4-Pro Review: Benchmark Scores, Strengths & Weaknesses

DeepSeek-V4-Pro is the most advanced flagship reasoning model DeepSeek has released to date. According to benchmark data collected by DataLearner, it is highly competitive, with its clearest advantages in code generation and competitive programming, while it still shows noticeable limitations in comprehensive cross-domain reasoning. This review compares it with the peer models GLM 5.1 and Kimi K2.6, and with earlier DeepSeek versions, to lay out its strengths, weaknesses, and suitable business scenarios.

Leading Performance in Programming

Programming is DeepSeek-V4-Pro's core competitive edge, and it opens a clear gap over most contemporary domestic AI models.

LiveCodeBench is a dynamic benchmark intended to reflect practical coding ability. In deep thinking mode, DeepSeek-V4-Pro scores 93.50, ranking first among 118 tested models. It beats Kimi K2.6 (89.60) by 3.9 points. Compared with earlier DeepSeek versions, it is 10.2 points above DeepSeek V3.2 (83.30), 18.7 points above V3.1 (74.80), and 20.2 points above R1-0528 (73.30). A jump of this size suggests systematic optimization and a substantial improvement in overall coding performance.

In competitive programming, measured as a Codeforces rating, DeepSeek-V4-Pro reaches 3206 in deep thinking mode, up 820 points from V3.2's 2386. Competitive programming ratings are non-linear, so gains near the top of the scale are harder to achieve. A rating above 3000 indicates problem-solving proficiency close to that of top human programmers, placing the model in the global first tier.

Results on the SWE-bench series show a more mixed picture in software engineering scenarios. On SWE-bench Verified, DeepSeek-V4-Pro scores 80.60, marginally ahead of Kimi K2.6's 80.20. On SWE-bench Multilingual, which covers defect repair in multilingual code repositories, it scores 76.20, slightly behind Kimi K2.6's 76.70. On the more challenging public test set SWE-Bench Pro, it scores only 55.40, below GLM 5.1's 58.40 and Kimi K2.6's 58.60. The data shows that DeepSeek-V4-Pro holds a solid lead in routine code generation and competitive programming, while its ability to fix complex software engineering defects is comparable to or slightly weaker than its rivals'.
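As a quick check on the SWE-bench comparison, the margins can be recomputed from the scores quoted above. This is only an illustrative sketch: the dictionary layout and the helper name `margin_vs_best_rival` are assumptions made for this example, and the numbers are simply copied from the paragraph.

```python
# Scores copied from the SWE-bench paragraph above (higher is better).
SWE_SCORES = {
    "SWE-bench Verified": {"DeepSeek-V4-Pro": 80.60, "Kimi K2.6": 80.20},
    "SWE-bench Multilingual": {"DeepSeek-V4-Pro": 76.20, "Kimi K2.6": 76.70},
    "SWE-Bench Pro": {"DeepSeek-V4-Pro": 55.40, "GLM 5.1": 58.40, "Kimi K2.6": 58.60},
}


def margin_vs_best_rival(scores, model="DeepSeek-V4-Pro"):
    """Return (best rival name, model score minus that rival's score)."""
    rivals = {name: score for name, score in scores.items() if name != model}
    best = max(rivals, key=rivals.get)
    return best, round(scores[model] - rivals[best], 2)


for bench, scores in SWE_SCORES.items():
    rival, margin = margin_vs_best_rival(scores)
    # A positive margin means DeepSeek-V4-Pro leads; negative means it trails.
    print(f"{bench}: {margin:+.2f} vs {rival}")
```

The margins stay within half a point on Verified and Multilingual and widen to about three points on SWE-Bench Pro, which matches the "comparable or slightly weaker" reading.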

Top-tier Mathematical Reasoning Ability Among Domestic Models

DeepSeek-V4-Pro delivers excellent mathematical reasoning performance, placing it among the best mainstream domestic models.

IMO-AnswerBench targets Olympiad-level mathematics problems and is effective at separating models by depth of logical reasoning. DeepSeek-V4-Pro scores 89.80 on this benchmark, ahead of GLM 5.1 (83.80) by 6.0 points and Kimi K2.6 (86.00) by 3.8 points. It ranks third among 17 evaluated models, which puts its mathematical reasoning at a world-class level.

GPQA Diamond focuses on scientific reasoning and graduate-level knowledge questions. DeepSeek-V4-Pro scores 90.10, essentially level with Kimi K2.6's 90.50, and both models are about 4 points ahead of GLM 5.1's 86.20.

Comparison with earlier generations also shows clear progress. On GPQA Diamond, V3.2 scored 82.40, while V3.1 and R1-0528 scored 80.10 and 81.00 respectively. The new version's 90.10 is 7.7 points above V3.2, one of the most striking gains in this release.

A Clear Gap in Ultra-hard Cross-domain Comprehensive Reasoning

HLE, short for Humanity's Last Exam, is widely regarded as the toughest benchmark for comprehensive knowledge reasoning, designed to probe the limits of a model's knowledge. With deep thinking mode and web access tools enabled, DeepSeek-V4-Pro's best score is 48.20. Under the same test conditions, GLM 5.1 scores 52.30 and Kimi K2.6 scores 54.00. The two competitors lead by 4.1 and 5.8 points respectively, a meaningful gap in high-difficulty cross-domain reasoning.

Although it trails its current rivals, the model has improved dramatically over its predecessors. Its HLE score is nearly double V3.2's 25.10 and roughly triple V3.1's 15.90. This growth reflects a real capability upgrade, yet there is still ample room for improvement on complex cross-domain knowledge reasoning tasks.

Markedly Improved Agent and Tool-use Performance

Agent capability is another highlight of DeepSeek-V4-Pro, especially in information retrieval and terminal tool operation.

BrowseComp evaluates complex information retrieval with web tools available. DeepSeek-V4-Pro scores 83.40, slightly ahead of Kimi K2.6 (83.20) and well ahead of GLM 5.1 (79.30). Terminal Bench 2.0 tests practical proficiency in a command-line terminal, where the model scores 67.90, ahead of Kimi K2.6 (66.70) and GLM 5.1 (63.50).

Compared with earlier versions, agent-related capability has grown sharply. The BrowseComp score rises from V3.2's 51.40 to 83.40, a gain of 32.0 points, and Terminal Bench 2.0 rises from 46.40 to 67.90, a gain of 21.5 points. In relative terms both scores grow by more than 40% (about 62% and 46% respectively), which indicates that autonomous agent operation was a core optimization target for this model.
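The "over 40%" figure is a relative growth rate, not a difference in points, and the two are easy to confuse. A minimal sketch of the arithmetic follows, using only the scores quoted above; the `gains` helper and the tuple layout are assumptions made for illustration.

```python
# Scores copied from the paragraph above: (earlier score, DeepSeek-V4-Pro score).
AGENT_SCORES = {
    "BrowseComp": (51.40, 83.40),
    "Terminal Bench 2.0": (46.40, 67.90),
}


def gains(old: float, new: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = new - old
    return absolute, absolute / old * 100


for bench, (old, new) in AGENT_SCORES.items():
    points, percent = gains(old, new)
    print(f"{bench}: +{points:.1f} points, +{percent:.1f}% relative")
```

Because the earlier scores are well below 100, the relative gain is larger than the point gain in both cases.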

Reasonable Pricing and Practical Cost Performance

Official API pricing differs across the three models, so each has a cost advantage in different scenarios. DeepSeek-V4-Pro charges 1.74 US dollars per million input tokens and 3.48 US dollars per million output tokens. GLM 5.1 charges 1.40 US dollars per million input tokens and 4.40 US dollars per million output tokens. Kimi K2.6 has the lowest input price at 0.95 US dollars per million tokens, with an output price of 4.00 US dollars per million tokens.

In terms of overall cost performance, Kimi K2.6 has the edge on input cost, but its higher output price raises total spend on output-intensive tasks such as writing long code or solving complicated mathematical problems. GLM 5.1 has a mid-range input price but the most expensive output pricing. DeepSeek-V4-Pro has the highest input price of the three but the lowest output price, so combined with its strong programming and mathematical reasoning results it is cost-effective for code-centric development and computation-heavy workloads.
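Which model is cheapest depends on the mix of input and output tokens, so it can help to run the numbers for a specific workload. The sketch below uses only the per-million-token prices quoted above; the function names and the sample workload of 1M input and 3M output tokens are hypothetical, chosen purely for illustration.

```python
# USD per million tokens, copied from the pricing paragraph: (input, output).
PRICES = {
    "DeepSeek-V4-Pro": (1.74, 3.48),
    "GLM 5.1": (1.40, 4.40),
    "Kimi K2.6": (0.95, 4.00),
}


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Total API cost in USD for a given token volume."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def break_even_ratio(a: str, b: str) -> float:
    """Output/input token ratio at which models a and b cost the same.

    Solves a_in*I + a_out*O == b_in*I + b_out*O for O/I.
    Assumes the two models have different output prices.
    """
    (a_in, a_out), (b_in, b_out) = PRICES[a], PRICES[b]
    return (a_in - b_in) / (b_out - a_out)


# Hypothetical output-heavy workload: 1M input tokens, 3M output tokens.
for model in PRICES:
    print(f"{model}: ${cost_usd(model, 1_000_000, 3_000_000):.2f}")

for rival in ("GLM 5.1", "Kimi K2.6"):
    ratio = break_even_ratio("DeepSeek-V4-Pro", rival)
    print(f"DeepSeek-V4-Pro is cheaper than {rival} above {ratio:.2f} output tokens per input token")
```

Under these prices, DeepSeek-V4-Pro becomes cheaper than Kimi K2.6 once output volume exceeds roughly 1.5 times input volume, and cheaper than GLM 5.1 above a ratio of roughly 0.37. For input-heavy workloads such as long-document retrieval the ordering can reverse.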

Overall Evaluation and Applicable Scenarios

DeepSeek-V4-Pro is clearly positioned as a powerful specialist model in the domestic AI market. It leads decisively in competitive programming and automatic code generation, and it maintains first-class mathematical reasoning. Its autonomous agent and tool-use capability has also improved substantially over earlier versions.

The model also has clear drawbacks. It cannot match GLM 5.1 and Kimi K2.6 in the extreme cross-domain comprehensive reasoning measured by the HLE benchmark, and it is slightly weaker at repairing complex software engineering defects.

For businesses focused on code generation, mathematical computation, and technical documentation, DeepSeek-V4-Pro is the best choice among domestic AI models. For scenarios that require broad knowledge reasoning and multi-domain comprehensive analysis, Kimi K2.6 is the better fit.

