
LiveMCPBench: A New Benchmark For Evaluating LLM Agents In Large Tool Environments

3 main points
✔️ Proposed a new benchmark, LiveMCPBench, to evaluate LLM agents in large MCP tool environments
✔️ Built LiveMCPTool, containing 70 servers and 527 tools, together with LiveMCPEval, an LLM-as-a-Judge evaluation framework
✔️ Experiments show Claude-Sonnet-4 achieved about a 79% success rate, while many other models revealed significant performance gaps and limitations

LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?
written by Guozhao Mo, Wenliang Zhong, Jiawei Chen, Xuanang Chen, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, Le Sun
(Submitted on 3 Aug 2025)
Comments: Our code and data will be publicly available at this https URL

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

This paper proposes a new benchmark, LiveMCPBench, for evaluating agent performance in large tool environments.

Conventional benchmarks assume a small number of APIs and simulated tool environments, and do not adequately reflect the diverse, dynamic tool environments found in practice.
To address this issue, the authors have constructed LiveMCPTool, which consists of 70 MCP servers and 527 actual tools, utilizing a standardized interface called MCP (Model Context Protocol).
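MCP standardizes how clients discover and invoke tools using JSON-RPC 2.0 messages. The sketch below shows the shape of the two core messages a client sends; the method names follow the MCP specification, but the tool name "web_search" and its arguments are hypothetical placeholders, not tools from LiveMCPTool.

```python
import json

def make_request(req_id, method, params=None):
    """Build a JSON-RPC 2.0 request of the kind an MCP client sends."""
    msg = {"jsonrpc": "2.0", "id": req_id, "method": method}
    if params is not None:
        msg["params"] = params
    return json.dumps(msg)

# 1. Ask the server which tools it exposes.
list_req = make_request(1, "tools/list")

# 2. Invoke one tool by name with structured arguments
#    (tool name and arguments here are hypothetical).
call_req = make_request(2, "tools/call", {
    "name": "web_search",
    "arguments": {"query": "flight prices"},
})

print(list_req)
print(call_req)
```

Because every server speaks this same wire format, a benchmark like LiveMCPBench can aggregate hundreds of heterogeneous tools behind one interface.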
Furthermore, they introduced LiveMCPEval, which enables automated evaluation, and by using LLMs as evaluators, they achieved an 81% agreement rate with human evaluators.

They also proposed the MCP Copilot Agent, which performs tool exploration and sequential tool use, and compared 10 state-of-the-art models.
The results showed that Claude-Sonnet-4 achieved a success rate of about 79%, while most of the models achieved only 30-50%, confirming that there are significant differences in their capabilities in a large-scale tool environment.

This study provides a realistic and reproducible basis for evaluation and lays the foundation for future agent research.

Proposed Methodology

The authors designed a framework consisting of four main elements to evaluate whether agents can effectively utilize a large set of MCP tools.

First, a variety of everyday tasks were designed: 95 practical tasks were collected across 6 domains, including office work, daily-life information, finance, travel, and shopping.
These are real-world tasks that involve time-varying information and the integrated use of multiple tools.

Second, they built LiveMCPTool, which contains 70 servers and 527 tools; it eliminates external API key dependencies and is readily available to researchers.

Third, they proposed LiveMCPEval, a method in which an LLM judges the agent's tool-use process.
This allows robust evaluation in environments that include solution diversity and time dependence.
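The core of such an LLM-as-a-Judge step can be sketched as follows: the judge model sees the task and the agent's full tool-use trajectory and returns a verdict. The prompt wording and the SUCCESS/FAILURE verdict format here are assumptions for illustration, not the paper's exact implementation.

```python
def build_judge_prompt(task: str, trajectory: list) -> str:
    """Render the task and the agent's tool-use trajectory into a judge prompt.

    `trajectory` is a list of dicts with hypothetical keys
    "tool", "args", and "result".
    """
    steps = "\n".join(
        f"Step {i}: called {s['tool']}({s['args']}) -> {s['result']}"
        for i, s in enumerate(trajectory, 1)
    )
    return (
        "You are evaluating an AI agent.\n"
        f"Task: {task}\n"
        f"Trajectory:\n{steps}\n"
        "Did the agent complete the task? Answer SUCCESS or FAILURE."
    )

def parse_verdict(judge_output: str) -> bool:
    """Map the judge model's free-text answer to a boolean success flag."""
    return judge_output.strip().upper().startswith("SUCCESS")

# Example prompt for a toy one-step trajectory:
prompt = build_judge_prompt(
    "find tomorrow's weather",
    [{"tool": "get_weather", "args": "tomorrow", "result": "sunny"}],
)
print(prompt)
```

Judging the whole trajectory rather than only the final answer is what lets this style of evaluation tolerate multiple valid solution paths and time-dependent results.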

Finally, they developed the MCP Copilot Agent, which integrates tool search and sequential execution based on the ReAct strategy.
This framework overcomes the API instability and small scale of conventional benchmarks and provides a realistic, reproducible evaluation environment.
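A ReAct-style control loop like the one underlying the MCP Copilot Agent can be sketched as below: the model alternates between choosing a tool, observing its result, and reasoning over the accumulated history until it emits a final answer. The `llm` and `tools` callables are hypothetical stand-ins, not the paper's actual components.

```python
def react_loop(llm, tools, task, max_steps=10):
    """Minimal ReAct-style loop: act, observe, append to history, repeat.

    `llm(prompt)` returns either ("final", answer) or ("call", tool_name, args);
    `tools` maps tool names to callables. Both are hypothetical interfaces.
    """
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        action = llm("\n".join(history))
        if action[0] == "final":
            return action[1]
        _, name, args = action
        observation = tools[name](args)  # execute the chosen tool
        history.append(f"Action: {name}({args})")
        history.append(f"Observation: {observation}")
    return None  # step budget exhausted

# Usage with toy stand-ins: the "model" calls one tool, then finishes.
def toy_llm(prompt):
    if "Observation" in prompt:
        return ("final", "done")
    return ("call", "search", "query")

result = react_loop(toy_llm, {"search": lambda q: "hit"}, "find something")
print(result)  # -> done
```

The step budget (`max_steps`) matters in large tool environments: without it, an agent that keeps retrieving the wrong tool would loop indefinitely.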

Experiments

Experiments were conducted on 10 major models, including Claude-Opus-4, Claude-Sonnet-4, GPT-4.1, Gemini-2.5-Pro, and DeepSeek-V3, to compare performance using 95 tasks.

The LLM-as-a-Judge method, with DeepSeek-V3 as the judge, was used for evaluation, and the results were compared with human evaluations.
As a result, Claude-Sonnet-4 achieved the highest success rate of 78.95%, followed by Claude-Opus-4 at 70.53%.
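As a sanity check on these figures (an inference, not stated in the article), the reported percentages are consistent with whole-task counts out of the 95 benchmark tasks: 75/95 and 67/95.

```python
# Success rates as whole-task counts out of 95 tasks (inferred, not stated).
print(round(75 / 95 * 100, 2))  # -> 78.95
print(round(67 / 95 * 100, 2))  # -> 70.53
```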

On the other hand, GPT-4.1 and Gemini-2.5-Pro only achieved around 40%, confirming that many models failed to find and combine tools.
In particular, tool misuse, failure to specify parameters, and "Retrieve Error," in which an appropriate tool was not found, were the main failure factors.

In addition, an analysis of each model's usage behavior showed that the Claude models actively explored and used multiple tools, while other models tended to rely on a single tool.
A cost/performance trade-off analysis also identified Claude-Sonnet-4 and Qwen2.5-72B as cost-effective models.

Based on these results, they conclude that many of the current models still have limitations in large tool environments, and that future improvements in task decomposition and dynamic planning capabilities are required.

