TriMM: Collaborative Multimodal Coding For High-quality 3D Generation

26/09/2025

3 main points
✔️ TriMM achieves high-quality 3D generation with collaborative multimodal coding that integrates RGB, RGBD, and point clouds
✔️ 2D/3D loss and VAE compression are introduced to enable efficient learning of both texture and geometric structures
✔️ Outperforms existing methods in evaluations on standard datasets, achieving high-definition 3D generation even with small amounts of data

Collaborative Multi-Modal Coding for High-Quality 3D Generation
written by Ziang Cao, Zhaoxi Chen, Liang Pan, Ziwei Liu
(Submitted on 21 Aug 2025)
Comments: Published on arXiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Overview

This research proposes TriMM, a new framework that jointly uses multiple modalities (RGB images, RGBD images, and point clouds) for high-quality 3D generation.

Conventional 3D generation models tend to rely on a single modality, especially RGB images. RGB images are rich in texture information, but they leave geometric structure ambiguous and carry no explicit depth.
As a result, such models have difficulty reproducing complex structures and occluded regions.

TriMM uses modality-specific encoders to extract texture information from RGB images and geometric information from point clouds and depth data, and integrates them through "collaborative multimodal coding."
This unified latent representation is transformed into high-definition 3D assets by a latent diffusion model based on a triplane structure.
Furthermore, the robustness and expressiveness of the reconstruction are improved by introducing 2D and 3D auxiliary losses.
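
To make the "triplane structure" mentioned above more concrete, the following is a minimal sketch of how a feature for a 3D point can be read from three axis-aligned feature planes. It is an illustration only, not the paper's implementation: the function names `_lookup` and `query_triplane`, the nearest-neighbour lookup, and the choice to sum the three plane features are all assumptions made for this example (real systems typically use bilinear interpolation, and some concatenate instead of summing).

```python
def _lookup(plane, u, v):
    """Nearest-neighbour lookup on an R x R grid of feature vectors.

    u and v are coordinates in [-1, 1]; out-of-range values are clamped.
    """
    res = len(plane)
    i = min(res - 1, max(0, int((u + 1.0) / 2.0 * res)))
    j = min(res - 1, max(0, int((v + 1.0) / 2.0 * res)))
    return plane[i][j]


def query_triplane(planes, point):
    """Project a 3D point onto the XY, XZ, and YZ planes and sum the features.

    Summation is an assumed aggregation rule for this sketch.
    """
    x, y, z = point
    f_xy = _lookup(planes["xy"], x, y)
    f_xz = _lookup(planes["xz"], x, z)
    f_yz = _lookup(planes["yz"], y, z)
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]
```

The appeal of this layout is that three 2D grids are far cheaper to store and to denoise than one dense 3D grid, while still giving every 3D location a feature vector.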

The proposed method performs well even with small amounts of training data, with results comparable to those of conventional models that depend on large-scale data.

Proposed Methodology

The core of TriMM is a method called Collaborative Multi-Modal Coding.

Here, the three types of input (RGB, RGBD, and point clouds) are each processed by a dedicated encoder and projected to a triplane representation.
RGB provides dense texture information, RGBD adds depth cues that convey three-dimensional shape, and point clouds define fine geometric structure.
These features are integrated so that each modality's strengths are exploited while its weaknesses are compensated by the others; in particular, cross-attention and residual connections maintain consistency across modalities.
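
As a rough illustration of how attention across modalities combined with a residual connection can fuse two sets of features, here is a minimal single-head sketch in plain Python. It is not the paper's architecture: the learned query/key/value projections are omitted (identity projections are assumed), and the names `cross_attend` and `fuse_with_residual` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention: each query token attends to all tokens
    of the other modality. Identity Q/K/V projections are assumed."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[j] for w, kv in zip(weights, keys_values))
                    for j in range(d)])
    return out


def fuse_with_residual(primary, other):
    """Add the attended features back onto the primary tokens (residual)."""
    attended = cross_attend(primary, other)
    return [[p + a for p, a in zip(p_row, a_row)]
            for p_row, a_row in zip(primary, attended)]
```

For example, `fuse_with_residual(rgb_tokens, geometry_tokens)` keeps the RGB tokens intact and adds geometry information on top, so the fused output cannot lose what the primary modality already encoded.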

Furthermore, the triplane latent diffusion model enables efficient and accurate generation in a compressed latent space.
During training, the authors introduce a hybrid loss function that combines a 2D loss based on rendered images and depth maps with a 3D loss based on the SDF (Signed Distance Function).
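
The hybrid loss described above can be sketched as a weighted sum of a 2D term and a 3D term. The specific choices here are assumptions for illustration: mean squared error on rendered RGB, L1 on rendered depth, L1 on SDF values, and the weights `w_rgb`, `w_depth`, and `w_sdf` are hypothetical and not taken from the paper.

```python
def mse(pred, target):
    """Mean squared error over two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def l1(pred, target):
    """Mean absolute error over two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def hybrid_loss(rgb_pred, rgb_gt, depth_pred, depth_gt, sdf_pred, sdf_gt,
                w_rgb=1.0, w_depth=1.0, w_sdf=1.0):
    """2D supervision (rendered RGB and depth) plus 3D supervision (SDF).

    Inputs are flat lists of floats; the weights are assumed values.
    """
    loss_2d = w_rgb * mse(rgb_pred, rgb_gt) + w_depth * l1(depth_pred, depth_gt)
    loss_3d = w_sdf * l1(sdf_pred, sdf_gt)
    return loss_2d + loss_3d
```

The 2D term supervises what the asset looks like from sampled viewpoints, while the SDF term supervises the surface directly in 3D, including regions that no rendered view happens to cover.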

This configuration achieves both sharp textures and accurate geometry.
Compressing the latent representation with a VAE further improves the stability and efficiency of training.
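
For readers unfamiliar with VAEs, the two standard ingredients are the reparameterization step and the KL regularizer toward a unit Gaussian. The sketch below shows both for a diagonal Gaussian latent. It is generic VAE machinery, not TriMM's specific encoder or decoder, and the function names are hypothetical.

```python
import math
import random


def reparameterize(mu, logvar, eps=None, rng=random):
    """Sample z = mu + sigma * eps with sigma = exp(0.5 * logvar).

    eps may be supplied for deterministic use; otherwise it is drawn
    from a standard normal distribution.
    """
    if eps is None:
        eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, logvar, eps)]


def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, logvar))
```

The KL term keeps the compressed latents close to a simple prior, which is one common reason a diffusion model trained on those latents tends to be easier to optimize.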

Experiments

The experiments use standard 3D datasets such as Objaverse, Google Scanned Objects (GSO), and OmniObject3D for evaluation.

First, in reconstruction experiments on Objaverse, TriMM outperformed single-modality variants (RGB-only, RGBD-only, and point-cloud-only) in both texture quality (PSNR) and geometric accuracy (Chamfer Distance, F-score).
Next, on unseen objects from GSO and OmniObject3D, TriMM achieved results competitive with or better than existing state-of-the-art methods.
In particular, integrating RGB texture, point cloud geometry, and RGBD depth significantly outperformed methods that rely on any single modality.
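
The three metrics named above can be sketched as follows. Conventions vary between papers (for example, squared versus unsquared distances in Chamfer Distance, and the F-score threshold), and this article does not state which ones TriMM uses, so the definitions below are one common choice rather than the paper's exact protocol. The brute-force nearest-neighbour search is only suitable for small point sets.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for flat pixel lists (higher is better)."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)


def _nearest_dists(src, dst):
    """Distance from each point in src to its nearest neighbour in dst."""
    return [min(math.dist(s, d) for d in dst) for s in src]


def chamfer_distance(a, b):
    """Symmetric Chamfer Distance using unsquared distances (lower is better)."""
    return (sum(_nearest_dists(a, b)) / len(a)
            + sum(_nearest_dists(b, a)) / len(b))


def f_score(pred, gt, tau):
    """Harmonic mean of precision and recall at distance threshold tau."""
    precision = sum(d <= tau for d in _nearest_dists(pred, gt)) / len(pred)
    recall = sum(d <= tau for d in _nearest_dists(gt, pred)) / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

PSNR is computed on rendered images, while Chamfer Distance and F-score are computed on points sampled from the predicted and ground-truth surfaces, which is why the two groups measure texture and geometry respectively.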

Furthermore, ablation studies show that the reconstruction loss, mixed 2D/3D supervision, and the VAE contribute to improved performance.
In user studies, TriMM's outputs were also rated as more natural and of higher quality than those of other methods.

Based on these results, TriMM delivers high-quality 3D generation even with small amounts of data and offers an effective answer to the fundamental problem of scarce 3D data.

nakata