
ImmerseGen: Agent-Guided, Lightweight × Highly Realistic Next-Generation VR Scene Generation
3 main points
✔️ Lightweight geometry and RGBA textures efficiently generate highly immersive 3D VR scenes
✔️ Agents select and place assets for visual consistency and spatial accuracy
✔️ Dynamic effects and environmental sounds provide a multi-sensory real-time VR experience
ImmerseGen: Agent-Guided Immersive World Generation with Alpha-Textured Proxies
written by Jinyan Yuan, Bangbang Yang, Keke Wang, Panwang Pan, Lin Ma, Xuehai Zhang, Xiao Liu, Zhaopeng Cui, Yuewen Ma
(Submitted on 17 Jun 2025 (v1), last revised 18 Jun 2025 (this version, v2))
Comments: Project webpage: this https URL
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Overview
This paper proposes ImmerseGen, a new approach to the automatic generation of high-quality 3D scenes in immersive VR spaces.
Unlike conventional methods that rely on complex high-polygon modeling or 3D Gaussian representations, ImmerseGen composes scenes hierarchically from lightweight geometric proxies and high-quality RGBA textures. Its core techniques are viewpoint-centric terrain texture generation driven by user text prompts, simplified placement of midground and foreground objects, and the integration of multi-sensory cues through ambient sounds and dynamic effects. In particular, by delegating asset selection and placement to agents, the method avoids the bottlenecks of prior approaches, such as limited spatial understanding and redundant asset generation.
The method also achieves rendering performance of nearly 80 FPS on mobile VR headsets powered by the Snapdragon XR2, combining real-time performance with a high degree of immersion. Experiments showed superior results in aesthetic quality, realism, and text consistency compared to prior methods.
Proposed Method
At the core of ImmerseGen is an agent-driven generation pipeline that builds a hierarchical 3D world from text input.
First, an appropriate terrain template is retrieved for the user's prompt, and the terrain and sky are then textured at high resolution through viewpoint-centric UV mapping. In this step, a depth-conditioned diffusion model with ControlNet generates panoramic images that conform to the terrain geometry.
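To make the texturing step concrete, below is a minimal sketch of depth-conditioned panorama generation using a ControlNet pipeline from Hugging Face diffusers. The specific checkpoints, the resolution, and the way the depth map is rendered from the terrain template are assumptions for illustration; the paper does not disclose its exact models.

```python
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Depth-conditioned ControlNet; these checkpoints are assumptions, not the paper's.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# In ImmerseGen the depth map is rendered from the retrieved terrain template
# at the user's viewpoint; here a synthetic gradient stands in for it.
depth = np.tile(np.linspace(0, 255, 768, dtype=np.uint8), (512, 1))
depth_image = Image.fromarray(depth).convert("RGB")

prompt = "a serene alpine meadow at sunset, photorealistic, panoramic view"
texture = pipe(prompt, image=depth_image, num_inference_steps=30).images[0]
texture.save("terrain_texture.png")  # would then be mapped onto the terrain UVs
```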
Next, a Vision-Language Model (VLM)-based agent selects the midground and foreground objects and assigns each a proxy mesh according to its distance from the viewer: billboard-style textures for the midground and alpha textures on low-poly meshes for the foreground. For placement, the agent semantically analyzes a grid overlaid on the rendered image to narrow down suitable locations from coarse to fine.
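The coarse-to-fine grid placement could be sketched as follows. The grid sizes and the `vlm_cell_score` helper are hypothetical: in the paper the scoring comes from a vision-language model judging the overlaid grid, which is stubbed out here with a random score.

```python
import random
import numpy as np

def vlm_cell_score(crop: np.ndarray, asset: str) -> float:
    """Stand-in for a VLM query like 'is this cell a plausible spot for {asset}?'.
    A real agent would send the annotated crop to a vision-language model."""
    return random.random()  # placeholder score

def best_cell(image: np.ndarray, asset: str, rows: int, cols: int, y0=0, x0=0):
    """Score every cell of a grid overlaid on `image`; return the winning box
    (top, left, height, width) in full-image coordinates."""
    h, w = image.shape[:2]
    ch, cw = h // rows, w // cols
    boxes = [(y0 + r * ch, x0 + c * cw, ch, cw)
             for r in range(rows) for c in range(cols)]
    return max(boxes, key=lambda b: vlm_cell_score(
        image[b[0] - y0:b[0] - y0 + b[2], b[1] - x0:b[1] - x0 + b[3]], asset))

def place_asset(image: np.ndarray, asset: str) -> tuple[int, int]:
    # Coarse pass: 4x4 grid over the full rendered view.
    y, x, ch, cw = best_cell(image, asset, rows=4, cols=4)
    # Fine pass: subdivide the winning cell with a 3x3 grid.
    cell = image[y:y + ch, x:x + cw]
    fy, fx, fh, fw = best_cell(cell, asset, rows=3, cols=3, y0=y, x0=x)
    return (fy + fh // 2, fx + fw // 2)  # pixel anchor for the asset

view = np.zeros((512, 768, 3), dtype=np.uint8)  # placeholder rendered view
print(place_asset(view, "pine tree"))
```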
Finally, each asset is synthesized with a contextual RGBA texture so that it blends naturally into the background. Visual effects such as wind, rain, and drifting clouds, together with ambient sounds such as birdsong and flowing water, are added to give viewers a multi-sensory immersive experience.
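At render time, blending an RGBA asset texture into the scene ultimately reduces to standard alpha compositing; a minimal "over" operator in NumPy is shown below (the function name and per-pixel formulation are illustrative, not taken from the paper).

```python
import numpy as np

def alpha_over(bg_rgb: np.ndarray, fg_rgba: np.ndarray) -> np.ndarray:
    """Composite an RGBA asset texture over an RGB background (Porter-Duff 'over')."""
    fg = fg_rgba[..., :3].astype(np.float32)
    alpha = fg_rgba[..., 3:4].astype(np.float32) / 255.0
    out = fg * alpha + bg_rgb.astype(np.float32) * (1.0 - alpha)
    return out.clip(0, 255).astype(np.uint8)

# Example: a semi-transparent red patch composited over a gray background.
bg = np.full((64, 64, 3), 128, dtype=np.uint8)
fg = np.zeros((64, 64, 4), dtype=np.uint8)
fg[16:48, 16:48] = (255, 0, 0, 180)
composited = alpha_over(bg, fg)
```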
Experimentation
In order to verify the effectiveness of ImmerseGen from multiple perspectives, comparative experiments were conducted with existing scene generation methods such as Infinigen, DreamScene360, WonderWorld, and LayerPano3D.
The evaluation metrics were text consistency (CLIP-Score), aesthetic quality (CLIP-Aesthetic), and a VLM-based visual score (QA-Quality). The results confirmed the aesthetic quality and consistency of the generated scenes, with ImmerseGen scoring highest on CLIP-Aesthetic and QA-Quality.
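For reference, a CLIP-Score of this kind can be computed as the cosine similarity between CLIP image and text embeddings; here is a sketch using Hugging Face transformers. The openai/clip-vit-base-patch32 checkpoint is an assumption for illustration; the paper does not state which CLIP variant it used.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Checkpoint is an assumption; the paper's exact CLIP variant is unspecified.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP embeddings of a rendered scene and its prompt."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).item())
```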
Runtime performance on VR devices was also compared: this method achieved an average of 79 FPS, while the other methods reached only 8 to 14 FPS. Ablation studies showed that terrain depth adaptation and grid-based analysis for asset placement had a noticeable impact on rendering quality. A user study likewise found that the majority of participants preferred scenes generated by ImmerseGen over those of the other methods.
These results confirm that lightweight proxy structures and agent-driven design contribute to the generation of practical and visually superior immersive VR spaces.