
The Challenge of "Embodied Web Agents," the Next-Generation AI That Fuses the Physical and the Digital
3 main points
✔️ Proposes "Embodied Web Agents" that act by integrating the physical environment with information from the web
✔️ Builds a new simulation environment that combines realistic 3D scenes with web interfaces
✔️ Experiments reveal a large performance gap between humans and current AI models, exposing the challenges of integrated intelligence
Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence
written by Yining Hong, Rui Sun, Bingxuan Li, Xingcheng Yao, Maxine Wu, Alexander Chien, Da Yin, Ying Nian Wu, Zhecan James Wang, Kai-Wei Chang
(Submitted on 18 Jun 2025 (v1), last revised 20 Jun 2025 (this version, v2))
Comments: Published on arXiv.
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Robotics (cs.RO)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
This paper starts from the observation that conventional AI agents treat "acting in physical space" and "using knowledge on the web" as two separate capabilities, and proposes a new framework to integrate them. The researchers formulate this as the concept of Embodied Web Agents (EWAs).
These agents combine the ability to perceive and manipulate the real world with the ability to dynamically acquire and reason about information online. This allows them to handle complex tasks end to end, for example checking the ingredients in a real kitchen, searching the web for a recipe, and then cooking.
The authors built an integrated simulation environment that combines realistic 3D scenes with a web interface and evaluated agents in five domains: cooking, shopping, sightseeing, navigation, and location estimation. The results show that current AI models still fall far short of human performance, indicating both the challenges and the potential of integrated physical and digital intelligence.
Proposed Methodology
For the proposed "Embodied Web Agents," the authors designed a dedicated task environment that handles the physical and digital worlds in an integrated manner.
This environment consists of (1) an outdoor space built on Google Street View and Google Earth, (2) a high-fidelity indoor simulation using AI2-THOR, and (3) multiple web interfaces, including recipe sites, maps, and encyclopedias.
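The three-layer composition can be pictured as a small task configuration. The sketch below is purely illustrative: the class and field names are hypothetical and are not taken from the paper's released code.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class OutdoorEnv:
    """Outdoor navigation layer built on Google Street View / Google Earth imagery."""
    start_panorama_id: str
    destination: str

@dataclass
class IndoorEnv:
    """High-fidelity indoor simulation layer (AI2-THOR-style scenes)."""
    scene_name: str
    interactable_objects: List[str] = field(default_factory=list)

@dataclass
class WebEnv:
    """Web layer: the kinds of sites the agent may browse."""
    sites: List[str] = field(default_factory=lambda: ["recipes", "maps", "wiki", "shop"])

@dataclass
class EmbodiedWebTask:
    """One benchmark task that spans all three layers."""
    instruction: str
    outdoor: OutdoorEnv
    indoor: IndoorEnv
    web: WebEnv
```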
To integrate these three components, the state space (physical and digital states), the action space (movement, manipulation, and web operations), and the observation space (visual and textual input) are explicitly defined, and the agent works through its tasks while switching freely between environments. In addition, the benchmark is built from a variety of scenarios: roughly 1,500 tasks that systematically evaluate cross-domain reasoning in cooking, shopping, and travel.
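How an agent might traverse this unified state/action/observation space can be sketched as a simple control loop. This is a hypothetical illustration: the environment and agent interfaces (reset, observe, step, act) are assumed for clarity, not drawn from the paper's implementation.

```python
def run_episode(task, agent, physical_env, web_env, max_steps=50):
    """Hypothetical rollout: the agent freely switches between the 3D scene and the web."""
    mode = "physical"                        # start embodied in the simulated scene
    obs = physical_env.reset(task)           # visual observation (rendered frame)
    for _ in range(max_steps):
        action = agent.act(obs, task.instruction, mode)
        if action.kind == "switch":          # cross-environment transition
            mode = "web" if mode == "physical" else "physical"
            obs = web_env.observe() if mode == "web" else physical_env.observe()
        elif action.kind in ("move", "manipulate"):
            obs = physical_env.step(action)  # movement / object manipulation
        elif action.kind in ("click", "type", "search"):
            obs = web_env.step(action)       # web operations return page text
        elif action.kind == "stop":
            break
    return physical_env.evaluate(task), web_env.evaluate(task)
```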
Through this design, the benchmark tests not only the execution of individual actions but also higher-level abilities such as planning how to coordinate actions with knowledge and checking that perceptual and textual information remain consistent.
Experiments
In the experiments, the authors ran state-of-the-art large language models (GPT-4o, Gemini 2.0 Flash, Qwen-VL-Plus, InternVL2.5) on the proposed benchmark and compared their performance with that of humans. Four evaluation metrics were used: overall accuracy, web task accuracy, physical task accuracy, and task completion rate.
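The four metrics can be thought of as simple aggregates over per-task results. The sketch below is illustrative only: the record fields (task_correct, web_correct, physical_correct, completed) are assumed labels, and the paper's exact metric definitions may differ.

```python
def summarize(records):
    """Aggregate per-task result records into the four reported metrics (illustrative)."""
    n = len(records)
    return {
        "overall_accuracy":  sum(r["task_correct"] for r in records) / n,
        "web_accuracy":      sum(r["web_correct"] for r in records) / n,
        "physical_accuracy": sum(r["physical_correct"] for r in records) / n,
        "completion_rate":   sum(r["completed"] for r in records) / n,
    }
```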
GPT-4o achieved the highest accuracy on the navigation, shopping, and travel tasks, but even then its overall accuracy topped out in the 30% range. Success rates were relatively high for acquiring web information, while acting in the physical environment and integrating the two remained major challenges.
The cooking task, which requires reasoning and acting from visual information, proved markedly harder, with an overall accuracy of only about 6%. Error analysis showed that "cross-domain errors" (failures to switch between environments and to keep information consistent across them), rather than failures of individual actions, accounted for more than 60% of all errors, highlighting the bottleneck for integrated intelligence.
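This kind of error analysis amounts to counting annotated failure traces. A minimal sketch, assuming each failure has already been labeled with a category such as "cross_domain", "web_only", or "physical_only" (labels invented here for illustration):

```python
from collections import Counter

def error_breakdown(failures):
    """Share of each failure category among all annotated failure traces."""
    counts = Counter(f["category"] for f in failures)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}

# Toy example with cross-domain errors dominating the failure distribution
failures = ([{"category": "cross_domain"}] * 13
            + [{"category": "web_only"}] * 4
            + [{"category": "physical_only"}] * 3)
print(error_breakdown(failures))   # cross_domain accounts for 65% of errors here
```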