MV-SAM: Multi-view Promptable Segmentation using Pointmap Guidance

NVIDIA, POSTECH
arXiv 2026

MV-SAM extends the Segment Anything Model (SAM) to multi-view images without using annotated 3D or video datasets.
🔥 Our MV-SAM outperforms SAMv3 as well.

Video Teaser

Abstract

Promptable segmentation has emerged as a powerful paradigm in computer vision, enabling users to guide models in parsing complex scenes with prompts such as clicks, boxes, or textual cues. Recent advances, exemplified by the Segment Anything Model (SAM), have extended this paradigm to videos and multi-view images. However, the lack of 3D awareness often leads to inconsistent results, necessitating costly per-scene optimization to enforce 3D consistency. In this work, we introduce MV-SAM, a framework for multi-view segmentation that achieves 3D consistency using pointmaps: 3D points reconstructed from unposed images by recent visual geometry models. Leveraging the one-to-one pixel-to-point correspondence of pointmaps, MV-SAM lifts images and prompts into 3D space, eliminating the need for explicit 3D networks or annotated 3D data. Specifically, MV-SAM extends SAM by lifting image embeddings from its pretrained encoder into 3D point embeddings, which are decoded by a transformer that cross-attends to 3D prompt embeddings. This design aligns 2D interactions with 3D geometry, enabling the model to implicitly learn masks that are consistent across views through 3D positional embeddings. Trained on the SA-1B dataset, our method generalizes well across domains, outperforming SAM2-Video and achieving performance comparable to per-scene optimization baselines on the NVOS, SPIn-NeRF, ScanNet++, uCo3D, and DL3DV benchmarks. Code will be released.
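
To make the lifting step concrete, the sketch below shows how pixel-aligned pointmaps can turn SAM-style image embeddings and a 2D click into 3D point tokens and a 3D prompt token that a decoder cross-attends over. This is a minimal illustration of the idea in the abstract; the module names, the Fourier positional embedding, the single-query decoder, and all shapes are our assumptions for exposition, not the released MV-SAM architecture.

# Minimal, illustrative sketch of pointmap-guided lifting and prompt
# decoding. Module names and the decoder layout are assumptions for
# exposition, not the authors' released API. Assumes pointmaps of shape
# (V, H, W, 3) from a visual geometry model, aligned 1:1 with pixels.

import torch
import torch.nn as nn


def fourier_pe(xyz: torch.Tensor, num_freqs: int = 8) -> torch.Tensor:
    """Sinusoidal embedding of 3D points: (..., 3) -> (..., 6 * num_freqs)."""
    freqs = 2.0 ** torch.arange(num_freqs, device=xyz.device)     # (F,)
    ang = xyz.unsqueeze(-1) * freqs                               # (..., 3, F)
    return torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(-2)  # (..., 6F)


class MVSAMSketch(nn.Module):
    def __init__(self, embed_dim: int = 256, num_freqs: int = 8, heads: int = 8):
        super().__init__()
        self.pe_proj = nn.Linear(6 * num_freqs, embed_dim)      # 3D positional embed
        self.prompt_proj = nn.Linear(6 * num_freqs, embed_dim)  # 3D prompt embed
        self.cross_attn = nn.MultiheadAttention(embed_dim, heads, batch_first=True)
        self.mask_head = nn.Linear(embed_dim, embed_dim)
        self.num_freqs = num_freqs

    def lift_images(self, img_embed, pointmap):
        """img_embed: (V, C, H, W) from the frozen SAM encoder.
        pointmap: (V, H, W, 3) pixel-aligned 3D points.
        Returns 3D point tokens of shape (1, V*H*W, C)."""
        V, C, H, W = img_embed.shape
        pe = self.pe_proj(fourier_pe(pointmap, self.num_freqs))  # (V, H, W, C)
        tokens = img_embed.permute(0, 2, 3, 1) + pe              # (V, H, W, C)
        return tokens.reshape(1, V * H * W, C)

    def lift_prompt(self, pointmap, view, y, x):
        """A 2D click (view, y, x) becomes a 3D prompt token via the
        one-to-one pixel-to-point correspondence of the pointmap."""
        p3d = pointmap[view, y, x]                               # (3,)
        return self.prompt_proj(fourier_pe(p3d, self.num_freqs)).reshape(1, 1, -1)

    def forward(self, img_embed, pointmap, view, y, x):
        points = self.lift_images(img_embed, pointmap)           # (1, N, C)
        prompt = self.lift_prompt(pointmap, view, y, x)          # (1, 1, C)
        q, _ = self.cross_attn(prompt, points, points)           # (1, 1, C)
        # Per-point mask logits: dot product of the decoded prompt token
        # with every 3D point token, reshaped back into per-view masks.
        logits = (self.mask_head(q) * points).sum(-1)            # (1, N)
        V, _, H, W = img_embed.shape
        return logits.reshape(V, H, W)


# Usage with random tensors standing in for real encoder features/pointmaps:
V, C, H, W = 4, 256, 64, 64
model = MVSAMSketch(embed_dim=C)
masks = model(torch.randn(V, C, H, W), torch.randn(V, H, W, 3), 0, 32, 32)
print(masks.shape)  # torch.Size([4, 64, 64])

Because every pixel in every view maps to a point in a shared 3D space, views of the same surface receive the same 3D positional embedding, which is what lets a decoder of this form produce masks that agree across views without per-scene optimization.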

Method & Results

Video results

(left) SAMv2, (middle) Ours, (right) Ground truth

Video summary

BibTeX

@article{jeong2025mv,
  title={MV-SAM: Multi-view Promptable Segmentation using Pointmap Guidance},
  author={Jeong, Yoonwoo and Sun, Cheng and Wang, Yu-Chiang Frank and Cho, Minsu and Choe, Jaesung},
  journal={arXiv},
  year={2026},
}