
MCP-Bench Opens Up A New Wave Of LLM Agent Evaluation! Challenges For Complex Tasks And Real-World Scenarios


3 main points
✔️ MCP-Bench is a benchmark that leverages 28 servers and 250 tools to evaluate LLMs on realistic, complex tasks
✔️ Tasks with fuzzy instructions and cross-domain dependencies measure LLM capabilities from multiple angles
✔️ Experimental results show that while basic execution accuracy has converged, significant differences remain in long-horizon planning and reasoning ability

MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
written by Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, Eugene Siow
(Submitted on 28 Aug 2025)
Comments: Published on arXiv.
Subjects: Computation and Language (cs.CL)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

In this paper, a new benchmark, MCP-Bench, is proposed to assess the ability of LLMs to perform realistic, complex tasks.

Traditional benchmarks often assume a single API call or an artificially connected tool chain, and thus fail to measure the long-horizon planning and handling of fuzzy instructions across multiple tools that real-world use demands.

MCP-Bench overcomes this challenge by leveraging the Model Context Protocol (MCP) to combine 28 MCP servers with 250 real-world tools.
It reproduces realistic tasks in areas as diverse as finance, scientific computing, travel planning, and academic search, and evaluates whether agents can correctly discover tools, understand dependencies, and build complex workflows.

The benchmark provides a framework for systematically testing capabilities such as tool schema understanding, long-horizon planning, evidence-grounded reporting, and cross-domain coordination, and large-scale experiments on 20 advanced LLMs reveal the challenges that remain.

Proposed Methodology

The proposed methodology of MCP-Bench is unique in that it measures the multifaceted capabilities of LLM agents while reproducing realistic tool usage scenarios.

First, the tools provided through the MCP servers are collected and their input-output dependencies analyzed.
Natural-language tasks are then synthesized from these dependencies and further transformed into "fuzzy descriptions" that omit explicit tool names and procedures, testing the agent's ability to infer the appropriate tools from context.
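The two steps above can be sketched in a few lines. This is a minimal illustration, not the paper's actual pipeline: the tool registry, the dependency rule (an output field matching another tool's input field), and the fuzzification strategy are all assumptions made for the example.

```python
# Hypothetical sketch: tool names, schemas, and the fuzzification rule
# are illustrative assumptions, not MCP-Bench's real implementation.

TOOLS = {
    "flight_search": {"outputs": ["flight_id", "price"]},
    "hotel_search":  {"outputs": ["hotel_id", "nightly_rate"]},
    "budget_check":  {"inputs":  ["price", "nightly_rate"]},
}

def dependency_edges(tools):
    """Link tools whose outputs feed another tool's inputs."""
    edges = []
    for src, s in tools.items():
        for dst, d in tools.items():
            if src != dst and set(s.get("outputs", [])) & set(d.get("inputs", [])):
                edges.append((src, dst))
    return sorted(edges)

def fuzzify(task, tool_names):
    """Remove explicit tool names so the agent must infer them from context."""
    for name in tool_names:
        task = task.replace(name, "an appropriate tool")
    return task

explicit = "Use flight_search and hotel_search, then run budget_check."
print(dependency_edges(TOOLS))
# → [('flight_search', 'budget_check'), ('hotel_search', 'budget_check')]
print(fuzzify(explicit, TOOLS))
# → Use an appropriate tool and an appropriate tool, then run an appropriate tool.
```

The recovered edges define the workflow the agent is expected to reconstruct from the fuzzy instruction alone.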

The evaluation is performed in a two-tiered structure.
First, a rule-based evaluation measures tool name adequacy, schema compliance, execution success rate, and dependency compliance.
Second, an LLM is used as a judge to score task completion, evidence-grounded reporting, appropriateness of tool selection, and the coherence and efficiency of planning.

This design allows for a rigorous assessment of long-horizon planning and cross-domain coordination skills that traditional benchmarks cannot measure.
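The rule-based tier can be sketched as follows. The four metric names follow the article; the trajectory format, field names, and the averaging scheme are assumptions made for illustration, not the paper's actual scorer.

```python
# Illustrative sketch of the rule-based evaluation tier. The trajectory
# and registry structures are hypothetical.

def rule_based_scores(trajectory, registry):
    """Score one agent trajectory on the four rule-based criteria."""
    calls = trajectory["calls"]
    # Tool name adequacy: the called tool must exist in the registry.
    valid_name = [c["tool"] in registry for c in calls]
    # Schema compliance: every argument must be a declared parameter.
    schema_ok = [set(c["args"]) <= set(registry.get(c["tool"], [])) for c in calls]
    # Execution success: the call must have returned without error.
    executed = [c.get("ok", False) for c in calls]
    # Dependency compliance: every declared prerequisite ran earlier.
    seen, dep_ok = set(), []
    for c in calls:
        dep_ok.append(all(p in seen for p in c.get("requires", [])))
        seen.add(c["tool"])

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return {
        "tool_name_validity":    mean(valid_name),
        "schema_compliance":     mean(schema_ok),
        "execution_success":     mean(executed),
        "dependency_compliance": mean(dep_ok),
    }

registry = {"search": ["query"], "summarize": ["text"]}
traj = {"calls": [
    {"tool": "search", "args": {"query": "MCP"}, "ok": True},
    {"tool": "summarize", "args": {"text": "..."}, "ok": True,
     "requires": ["search"]},
]}
print(rule_based_scores(traj, registry))
```

Because these checks are deterministic, they complement the LLM-judge tier, which handles the criteria that resist rule-based scoring, such as planning coherence.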

Experiments

The authors evaluated 20 advanced LLMs using MCP-Bench.

Experiments were conducted in both single-server and multi-server environments and covered 104 different complex tasks.
The results showed that the strongest models (e.g., GPT-5, o3, gpt-oss-120b) achieved near-100% accuracy in schema understanding and tool naming, but significant differences emerged in higher-order capabilities such as long-horizon planning, dependency awareness, and parallel-execution efficiency.

The smaller models in particular performed reasonably in the single-server setting, but their scores dropped sharply when moving to the multi-server setting, indicating weakness in maintaining dependencies.
The top models, by contrast, remained relatively stable even on cross-domain, long-horizon workflows.

These results indicate that while the gap in raw tool-call accuracy is shrinking, strategic reasoning and planning are the differentiators among current LLMs.

