What Is DualTHOR? Next Generation Simulator For Dual-Arm Robots' Adaptability To Reality

28/07/2025

3 main points
✔️ DualTHOR, a highly accurate simulator for realistic dual-armed humanoid manipulation
✔️ Contingency mechanism allows realistic task evaluation including action failure
✔️ Current VLMs are weak in dual-arm coordination and replanning, and DualTHOR has revealed their limitations

DualTHOR: A Dual-Arm Humanoid Simulation Platform for Contingency-Aware Planning
written by Boyu Li, Siyuan He, Hang Xu, Haoqi Yuan, Yu Zang, Liwei Hu, Junpeng Yue, Zhenxiong Jiang, Pengbo Hu, Börje F. Karlsson, Yehui Tang, Zongqing Lu
(Submitted on 19 Jun 2025)
Comments: Published on arXiv.
Subjects: Robotics (cs.RO)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

This paper proposes DualTHOR, a highly accurate simulation platform that reproduces how a dual-armed humanoid robot executes tasks in the real world and evaluates its planning ability and robustness. Many conventional simulators are designed around wheeled or single-armed robots and tend to omit physical uncertainties and possible failures, which limits how well results obtained in them carry over to the real world.

DualTHOR is built as an extension of AI2-THOR. It provides a diverse task suite for dual-armed robots (Unitree H1 and Agibot X1), physics-based inverse kinematics, and continuous motion control, and it mimics failures during execution (breakage, spills, etc.) through a "contingency mechanism." This mechanism is expected to help bridge the gap between intentions during planning and uncertainties in reality, thus fostering agents that are highly adaptable to the real world.

A baseline evaluation using recent Vision-Language Models (VLMs) was also conducted, showing that current models still struggle with dual-arm tasks and uncertainty.

Proposed Methodology

DualTHOR follows the basic design of AI2-THOR, but assumes a humanoid dual-armed robot. It consists mainly of the following three elements.

The first is a task design dedicated to dual-arm manipulation. Assuming actions that are difficult to achieve with one arm (e.g., holding a cup with one hand and pouring water with the other), the platform defines a wealth of in-home tasks that require complex operations. Tasks are categorized as "dual-arm required," "dual-arm optional," and "single-arm," so that models can be evaluated across a range of coordination demands.

The second is physically continuous action control. Instead of conventional instantaneous state transitions, the Unity engine and inverse kinematics (IK) are used to reproduce smooth movements. IK is configured differently for the two robots: X1 provides single-arm control, while H1 provides coordinated dual-arm control with full-body coordination.
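To make the idea of IK-driven continuous motion concrete, here is a minimal sketch for a planar two-link arm. It is not DualTHOR's solver, which runs inside Unity on full humanoid models; the function names, link lengths, and the linear joint-space interpolation are all assumptions chosen for illustration.

```python
import math

def two_link_ik(x, y, l1, l2):
    """Analytic IK for a planar two-link arm (illustrative, not DualTHOR's solver).

    Returns one of the two joint-angle solutions (q1, q2) in radians.
    """
    d2 = x * x + y * y
    cos_q2 = (d2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    if not -1.0 <= cos_q2 <= 1.0:
        raise ValueError("target out of reach")
    q2 = math.acos(cos_q2)
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
    return q1, q2

def forward(q1, q2, l1, l2):
    """Forward kinematics: end-effector position for given joint angles."""
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return x, y

def interpolate(start, goal, steps):
    """Linear joint-space waypoints from start to goal (excludes start)."""
    return [
        tuple(s + (g - s) * t / steps for s, g in zip(start, goal))
        for t in range(1, steps + 1)
    ]

# Hypothetical link lengths (metres) and target position.
L1, L2 = 0.30, 0.25
goal = two_link_ik(0.30, 0.20, L1, L2)
path = interpolate((0.0, 0.0), goal, steps=10)
```

Stepping through `path` one waypoint per simulation frame moves the arm gradually to the target, in contrast to a simulator that jumps directly to the final state.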

The third is the contingency mechanism. Each action has a certain probability of failing (e.g., a cup breaking or liquid spilling), which forces the model to come up with a recovery plan. This makes it possible to evaluate a model's ability to replan when a plan fails.
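The contingency idea can be sketched as a plan executor in which each action may fail with some probability and a failure triggers recovery steps before a retry. Everything below is hypothetical: the action names, failure probabilities, and recovery table are invented for illustration and are not taken from DualTHOR.

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class Outcome:
    action: str
    success: bool
    contingency: Optional[str] = None  # e.g. "cup_broken"

# Hypothetical table: action -> (failure probability, resulting contingency).
FAILURE_TABLE = {
    "pick_up_cup": (0.2, "cup_broken"),
    "pour_water": (0.3, "water_spilled"),
}

# Hypothetical recovery steps for each contingency.
RECOVERY = {
    "cup_broken": ["fetch_new_cup"],
    "water_spilled": ["wipe_table"],
}

def execute(action, rng):
    """Run one action; it fails with the probability listed in FAILURE_TABLE."""
    p_fail, contingency = FAILURE_TABLE.get(action, (0.0, None))
    if rng.random() < p_fail:
        return Outcome(action, False, contingency)
    return Outcome(action, True)

def run_with_replanning(plan, rng, max_steps=50):
    """Execute a plan; on failure, insert recovery steps and retry the action."""
    queue = list(plan)
    log = []
    while queue and len(log) < max_steps:
        action = queue.pop(0)
        outcome = execute(action, rng)
        log.append(outcome)
        if not outcome.success:
            queue = RECOVERY[outcome.contingency] + [action] + queue
    return log

log = run_with_replanning(["pick_up_cup", "pour_water"], random.Random(0))
```

In a real evaluation the recovery steps would come from the model under test rather than a fixed table; the fixed table here only shows where replanning enters the loop.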

Experiment

In the experiments, a benchmark of 356 tasks spanning 10 different rooms and 68 different objects was used to evaluate large-scale VLMs such as GPT-4o and Gemini 1.5 Pro, open models such as Qwen2.5-VL, and structured prompting approaches such as DAG-Plan.

The evaluation was divided into three task categories ("dual-arm required," "dual-arm optional," and "single-arm") and three difficulty levels (Easy, Medium, and Hard) that differ in how often actions succeed, so that task success rates could be compared under a variety of conditions.

As a result, existing VLMs showed lower success rates on dual-arm required tasks across the board, and their performance dropped significantly on complex tasks and under contingencies. For example, even with DAG-Plan, the success rate on dual-arm tasks was only about 40%, and the authors observed cases in which models failed to handle dynamic replanning and interference between the two arms.

The experiment also used "continuous physical rendering" (e.g., depicting the gradual accumulation of water) to check whether a VLM could follow visual changes and update its understanding and planning. With this kind of design, DualTHOR exposes the limitations of current technology and presents a clear challenge for the future development of VLMs.
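A breakdown like this amounts to grouping episode results by (category, difficulty) and computing a success rate per group. The sketch below shows one way to do that; the episode records are made-up placeholders, not results from the paper, and the function name is an assumption.

```python
from collections import defaultdict

def success_rates(episodes):
    """Success rate per (category, difficulty) from (category, difficulty, success) records."""
    counts = defaultdict(lambda: [0, 0])  # key -> [successes, total]
    for category, difficulty, success in episodes:
        entry = counts[(category, difficulty)]
        entry[0] += int(success)
        entry[1] += 1
    return {key: wins / total for key, (wins, total) in counts.items()}

# Placeholder records for illustration only (not data from the paper).
episodes = [
    ("dual-arm required", "Hard", False),
    ("dual-arm required", "Hard", True),
    ("dual-arm optional", "Easy", True),
    ("single-arm", "Easy", True),
    ("single-arm", "Easy", False),
    ("single-arm", "Easy", True),
]
rates = success_rates(episodes)
```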

nakata
