
What Is DualTHOR? A Next-Generation Simulator For Testing How Dual-Arm Robots Adapt To Reality
3 main points
✔️ DualTHOR is a high-fidelity simulator for realistic dual-arm humanoid manipulation
✔️ A contingency mechanism enables realistic task evaluation that includes action failures
✔️ Current VLMs are weak at dual-arm coordination and replanning, and DualTHOR exposes these limitations
DualTHOR: A Dual-Arm Humanoid Simulation Platform for Contingency-Aware Planning
written by Boyu Li, Siyuan He, Hang Xu, Haoqi Yuan, Yu Zang, Liwei Hu, Junpeng Yue, Zhenxiong Jiang, Pengbo Hu, Börje F. Karlsson, Yehui Tang, Zongqing Lu
(Submitted on 19 Jun 2025)
Comments: Published on arXiv.
Subjects: Robotics (cs.RO)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
This paper proposes DualTHOR, a high-fidelity simulation platform that reproduces how a dual-arm humanoid robot executes tasks in the real world and evaluates the planning ability and robustness of the agents controlling it. Many conventional simulators are designed around wheeled or single-arm robots and tend to omit physical uncertainty and possible failures. As a result, agents developed in them have transferred poorly to the real world.
DualTHOR is built as an extension of AI2-THOR. It provides a diverse task suite for dual-arm robots (Unitree H1 and Agibot X1), physics-based inverse kinematics, and continuous motion control, and it mimics failures during execution (breakage, spills, etc.) through a "contingency mechanism." This mechanism is expected to help bridge the gap between what an agent intends at planning time and the uncertainty it meets in reality, fostering agents that adapt well to the real world.
A baseline evaluation using state-of-the-art vision-language models (VLMs) was also conducted, showing that current models still struggle with dual-arm tasks and with uncertainty.
Proposed Methodology
DualTHOR follows the basic design of AI2-THOR but assumes a humanoid dual-arm robot. It consists mainly of the following three elements.
The first is a task design dedicated to dual-arm manipulation. Assuming actions that are difficult to achieve with one arm (e.g., holding a cup with one hand and pouring water with the other), it defines a wide range of household tasks that require complex manipulation. Tasks are categorized as "dual-arm required," "dual-arm optional," and "single-arm," so that a model's generality and flexibility can be assessed across all three.
The second is physically continuous action control. Instead of the conventional "instantaneous state transitions," the Unity engine and inverse kinematics (IK) are used to reproduce smooth movements. IK is configured differently for the two robots: X1 provides single-arm control, while H1 provides coordinated dual-arm control with full-body coordination.
The third is the contingency mechanism. Each action is designed to fail with a certain probability (e.g., a cup breaking or liquid spilling), which forces the model to come up with a recovery plan. This makes it possible to evaluate a model's ability to replan when a plan fails, not only its ability to produce the initial plan.
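As a rough illustration of how the three task categories could be used in practice, the following Python sketch tags each task with a category and filters out tasks that a robot with only one usable arm cannot attempt. The names `ArmCategory`, `Task`, and `tasks_for_robot` are hypothetical and are not taken from DualTHOR; the example tasks are made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class ArmCategory(Enum):
    # The three categories named in the article.
    DUAL_ARM_REQUIRED = "dual-arm required"
    DUAL_ARM_OPTIONAL = "dual-arm optional"
    SINGLE_ARM = "single-arm"


@dataclass(frozen=True)
class Task:
    # Hypothetical task record; DualTHOR's real task schema may differ.
    description: str
    category: ArmCategory


def tasks_for_robot(tasks, usable_arms):
    """Return the tasks that a robot with `usable_arms` arms could attempt."""
    if usable_arms >= 2:
        return list(tasks)
    # With one arm, only tasks that strictly need two arms are excluded.
    return [t for t in tasks if t.category is not ArmCategory.DUAL_ARM_REQUIRED]


# Illustrative tasks only, not drawn from the benchmark.
EXAMPLE_TASKS = [
    Task("hold a cup and pour water into it", ArmCategory.DUAL_ARM_REQUIRED),
    Task("carry two apples to the table", ArmCategory.DUAL_ARM_OPTIONAL),
    Task("open the fridge", ArmCategory.SINGLE_ARM),
]
```

Calling `tasks_for_robot(EXAMPLE_TASKS, usable_arms=1)` keeps the optional and single-arm tasks and drops the one that needs both hands at once.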
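The idea of probabilistic action failure can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not DualTHOR's implementation: the action names, failure probabilities, and the `FAILURE_TABLE`, `execute`, and `run_plan` helpers are all hypothetical, and the "recovery" here is a naive retry of the same action rather than genuine replanning.

```python
import random
from dataclasses import dataclass


@dataclass
class Outcome:
    success: bool
    event: str  # "ok" on success, otherwise the name of the failure event


# Hypothetical table: action -> (failure probability, failure event).
# The values are illustrative and are not taken from the paper.
FAILURE_TABLE = {
    "pick_up_cup": (0.2, "cup_broken"),
    "pour_water": (0.3, "liquid_spilled"),
}


def execute(action, rng):
    """Execute one action, failing with the probability listed for it."""
    p_fail, event = FAILURE_TABLE.get(action, (0.0, "none"))
    if rng.random() < p_fail:
        return Outcome(False, event)
    return Outcome(True, "ok")


def run_plan(plan, rng, max_retries=2):
    """Run actions in order; retry a failed action up to `max_retries` times.

    Returns (plan_succeeded, log), where log lists (action, event) pairs.
    """
    log = []
    for action in plan:
        for _ in range(max_retries + 1):
            outcome = execute(action, rng)
            log.append((action, outcome.event))
            if outcome.success:
                break
        else:
            # Every attempt failed, so the whole plan is abandoned.
            return False, log
    return True, log
```

A planner evaluated under such a mechanism would replace the retry loop with its own recovery logic, for example cleaning up a spill before pouring again, which is the ability the contingency mechanism is meant to measure.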
Experiment
The benchmark used in the experiment comprises 10 different rooms, 68 different objects, and 356 tasks. The evaluated systems include large-scale VLMs such as GPT-4o and Gemini 1.5 Pro, open models such as Qwen2.5-VL, and structured prompting approaches such as DAG-Plan.
The evaluation was divided into three task categories ("dual-arm required," "dual-arm optional," and "single-arm") and three difficulty levels (Easy, Medium, and Hard) that differ in how often actions succeed, so that task success rates could be compared under a variety of conditions.
As a result, the existing VLMs consistently achieved lower success rates on dual-arm required tasks than on the other categories, and their performance dropped markedly on complex tasks and when contingencies occurred. For example, even with DAG-Plan, the success rate on dual-arm tasks was only about 40%, and the authors observed cases where models could not adequately handle dynamic replanning or interference between the two arms.
The experiment also used "continuous physical rendering" (e.g., depicting the gradual accumulation of water) to check whether a VLM could follow visual changes and update its understanding and plan accordingly. With this design, DualTHOR exposes the limitations of current technology and presents a clear challenge for the future development of VLMs.
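To make the breakdown concrete, the Python sketch below aggregates per-episode results into a success rate for each (category, difficulty) cell. The record format and the `success_rates` helper are assumptions made for illustration, and the sample records are invented; they are not results from the paper.

```python
from collections import defaultdict


def success_rates(records):
    """Compute the success rate for each (category, difficulty) pair.

    `records` is an iterable of (category, difficulty, succeeded) tuples,
    a hypothetical format chosen for this sketch.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [successes, total]
    for category, difficulty, succeeded in records:
        cell = counts[(category, difficulty)]
        cell[0] += int(succeeded)
        cell[1] += 1
    return {key: wins / total for key, (wins, total) in counts.items()}


# Invented sample episodes, for illustration only.
SAMPLE_RECORDS = [
    ("dual-arm required", "Hard", True),
    ("dual-arm required", "Hard", False),
    ("single-arm", "Easy", True),
    ("single-arm", "Easy", True),
]
```

Reporting one rate per cell, rather than a single overall number, is what lets a benchmark of this kind show where performance falls off, such as on dual-arm required tasks at higher difficulty.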