MVTracker: A Multi-view 3D Point Tracking Method That Achieves High Accuracy With A Small Number Of Cameras

21/09/2025

3 main points
✔️ MVTracker is the first data-driven multi-view 3D point tracking method that works with a small number of cameras
✔️ Combines 3D feature point cloud, k-nearest neighbor correlation, and transformer for long-term tracking
✔️ Experiments show practical performance with 2 cm error and 7.2 FPS, significantly better than conventional methods

Multi-View 3D Point Tracking
written by Frano Rajič, Haofei Xu, Marko Mihajlovic, Siyuan Li, Irem Demir, Emircan Gündoğdu, Lei Ke, Sergey Prokudin, Marc Pollefeys, Siyu Tang
(Submitted on 28 Aug 2025)
Comments: ICCV 2025, Oral. Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

This paper proposes a new method, MVTracker, for tracking arbitrary 3D points using multiple camera images.

Conventional monocular methods are vulnerable to depth ambiguity and occlusion, making it difficult to track 3D points with high accuracy in the real world.
Existing multi-camera methods, in turn, require more than 20 cameras and per-sequence optimization, making them impractical.

MVTracker is the first data-driven multi-view 3D tracker that can operate with a realistic number of cameras (e.g., four) and supports online processing.
The method integrates features and depth information from multiple views to build a 3D feature point cloud, from which correlation calculations are performed using k-nearest neighbor search.

Furthermore, sequential trajectory updates using a spatio-temporal transformer module enable long-term 3D point tracking.
Validated on real datasets such as Panoptic Studio and DexYCB, the method achieved accuracy significantly better than conventional methods and set a new standard as a versatile and efficient point tracking platform.

Proposed Method

The central idea of MVTracker is to generate a unified 3D feature point cloud from multi-view video and compute point-to-point correlations within it.

A feature map is extracted from each frame using a CNN and projected into 3D space using depth information (sensor derived or estimated) and camera parameters.
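
The lifting step can be illustrated with a minimal sketch. It assumes a pinhole camera model and a camera-to-world matrix per view. The function names (`unproject_view`, `build_feature_cloud`) and the data layout are hypothetical choices for illustration, not the authors' implementation.

```python
def unproject_view(depth, feats, intrinsics, cam_to_world):
    """Lift one view's per-pixel features to world-space 3D points.

    depth:        H x W nested list of depths in metres (<= 0 means invalid)
    feats:        H x W nested list of feature vectors (e.g. CNN outputs)
    intrinsics:   (fx, fy, cx, cy) of an assumed pinhole camera
    cam_to_world: 3 x 4 matrix [R | t] as nested lists
    """
    fx, fy, cx, cy = intrinsics
    points, features = [], []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # skip pixels without a usable depth value
            # Back-project pixel (u, v) with depth z into the camera frame.
            cam = ((u - cx) * z / fx, (v - cy) * z / fy, z)
            # Apply rotation and translation to reach the world frame.
            world = tuple(
                sum(r[i] * cam[i] for i in range(3)) + r[3] for r in cam_to_world
            )
            points.append(world)
            features.append(feats[v][u])
    return points, features


def build_feature_cloud(views):
    """Merge the lifted points of all views into one feature point cloud."""
    all_points, all_features = [], []
    for depth, feats, intrinsics, cam_to_world in views:
        pts, fts = unproject_view(depth, feats, intrinsics, cam_to_world)
        all_points.extend(pts)
        all_features.extend(fts)
    return all_points, all_features
```

Because every view is mapped into the same world frame, adding or removing a camera only changes the number of points in the merged cloud, not its structure.
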
Within the feature point cloud thus constructed, local correlations are calculated using k-nearest neighbor search and combined with appearance similarity and spatial offset for tracking.
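
As a rough illustration of the k-nearest-neighbor correlation, the sketch below pairs a dot-product appearance similarity with the 3D offset to each neighbor. The brute-force search, the dot-product similarity, and the name `knn_correlation` are assumptions made for this example; the article does not specify how the paper computes these quantities.

```python
import math


def knn_correlation(query_xyz, query_feat, cloud_xyz, cloud_feat, k=4):
    """Return (similarity, offset) pairs for the k cloud points nearest to a query.

    query_xyz:  current 3D estimate of the tracked point
    query_feat: feature vector of the tracked point
    cloud_xyz:  list of 3D points in the feature point cloud
    cloud_feat: list of feature vectors, aligned with cloud_xyz
    """
    # Brute-force neighbor search; a spatial index would be used at scale.
    nearest = sorted(
        range(len(cloud_xyz)),
        key=lambda i: math.dist(query_xyz, cloud_xyz[i]),
    )[:k]
    result = []
    for i in nearest:
        # Appearance similarity: dot product between the two feature vectors.
        similarity = sum(a * b for a, b in zip(query_feat, cloud_feat[i]))
        # Spatial offset: vector from the query to the neighbor.
        offset = tuple(c - q for c, q in zip(cloud_xyz[i], query_xyz))
        result.append((similarity, offset))
    return result
```

Each pair tells the tracker how well a neighbor matches in appearance and in which direction it lies, which is the kind of signal a downstream network can use to refine the point estimate.
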

The data is then input to a transformer that processes the time series using a sliding window method and successively updates the point locations and features through a self-attention mechanism.
This mechanism makes the system robust to occlusion and complex motion. Training is performed on simulated data (5,000 sequences generated with Kubric), and the loss function combines a position error term with a visibility prediction term.
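
A loss that combines position error and visibility could look like the sketch below. The article does not give the exact form, so the L1 position term, the binary cross-entropy visibility term, and the `vis_weight` parameter are all assumptions chosen for illustration.

```python
import math


def tracking_loss(pred_xyz, gt_xyz, vis_logits, gt_vis, vis_weight=1.0):
    """Position loss plus weighted visibility loss, averaged over points.

    pred_xyz, gt_xyz: lists of predicted / ground-truth 3D positions
    vis_logits:       predicted visibility logits (before the sigmoid)
    gt_vis:           ground-truth visibility flags (1 = visible, 0 = occluded)
    vis_weight:       hypothetical weight balancing the two terms
    """
    n = len(pred_xyz)
    # Assumed position term: mean L1 distance between prediction and target.
    position = sum(
        sum(abs(p - g) for p, g in zip(pred, gt))
        for pred, gt in zip(pred_xyz, gt_xyz)
    ) / n
    # Assumed visibility term: numerically stable binary cross-entropy on logits.
    visibility = sum(
        max(x, 0.0) - x * v + math.log1p(math.exp(-abs(x)))
        for x, v in zip(vis_logits, gt_vis)
    ) / n
    return position + vis_weight * visibility
```
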

Compared to the conventional triplane representation, the 3D point cloud representation has fewer information losses and can flexibly accommodate different numbers and arrangements of cameras, which is a key advantage.

Experiments

The authors evaluated MVTracker's performance on several datasets, including Panoptic Studio, DexYCB, and MV-Kubric.

As metrics, they used positional accuracy (δavg), median trajectory error (MTE), occlusion accuracy (OA), and Average Jaccard (AJ).
The results showed that MVTracker achieved an AJ of 86.0 with Panoptic Studio and an AJ of 71.6 with DexYCB, both significantly better than conventional methods.
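
To make the metrics concrete, here is a minimal sketch of how MTE, δavg, and OA might be computed for one set of predictions. The distance thresholds and the name `evaluate_tracks` are hypothetical; the benchmark's own thresholds and averaging conventions may differ, and AJ is omitted for brevity.

```python
import math
from statistics import median


def evaluate_tracks(pred_xyz, gt_xyz, pred_vis, gt_vis,
                    thresholds=(0.01, 0.02, 0.05, 0.1, 0.2)):
    """Compute MTE, delta_avg and OA over a flat list of point predictions.

    thresholds: hypothetical distance thresholds in metres for delta_avg
    """
    # Position errors are only measured where the ground truth is visible.
    errors = [
        math.dist(p, g) for p, g, visible in zip(pred_xyz, gt_xyz, gt_vis) if visible
    ]
    # MTE: median Euclidean distance to the ground truth.
    mte = median(errors)
    # delta_avg: fraction of points within each threshold, averaged over thresholds.
    delta_avg = sum(
        sum(e < t for e in errors) / len(errors) for t in thresholds
    ) / len(thresholds)
    # OA: fraction of points whose predicted visibility matches the ground truth.
    oa = sum(pv == gv for pv, gv in zip(pred_vis, gt_vis)) / len(gt_vis)
    return {"mte": mte, "delta_avg": delta_avg, "oa": oa}
```

Lower is better for MTE, while δavg and OA are fractions where higher is better.
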

In particular, the median trajectory error on DexYCB was as low as 2.0 cm.
The performance tends to improve as the number of input views is increased, reaching AJ 79.2 with 8 views.

Furthermore, the method is robust to the source of the depth input (sensor-derived or estimated), and accuracy improves further when sensor depth is used.
The inference speed reached 7.2 FPS, showing that MVTracker is better suited for real-time processing than conventional optimization-based methods.

From these results, we conclude that MVTracker is a promising approach that balances accuracy, efficiency, and versatility in real-world applications.


nakata
