Q-MiniSAM2: A Quantization-based Benchmark for Resource-Efficient Video Segmentation

Xuanxuan Ren; Xiangyu Li; Kun Wei; Xu Yang; Yanhua Yang

doi:10.24963/ijcai.2025/204

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence

Main Track. Pages 1829-1837. https://doi.org/10.24963/ijcai.2025/204

Segment Anything Model 2 (SAM2) is a new-generation, high-precision model for image and video segmentation with broad application prospects across computer vision. However, as a large-scale model, its heavy memory demands and expensive computation pose challenges for practical deployment. This paper presents Q-MiniSAM2, an efficient Quantization-based segmentation benchmark tailored to optimize SAM2 by Minimizing memory consumption and accelerating computation. We begin by applying Post-Training Quantization (PTQ) to SAM2, which requires only a relatively small dataset for network calibration and thereby eliminates the need for retraining. Building upon PTQ, we further introduce a Hierarchy-based Video Quantization method to enhance the model’s capacity to capture video semantics and temporal correlations across different time scales. Furthermore, we observe that SAM2’s memory overhead is predominantly concentrated in the processing of historical frames. Because these frames are separated by short time intervals and change almost imperceptibly from one to the next, much of the cross-attention computed over them is redundant and significantly increases memory and computational costs. To tackle this issue, an Adaptive Mutual-KV mechanism is proposed to mitigate excessive cross-attention by leveraging inter-frame similarities. Comprehensive experiments demonstrate that the proposed approach achieves superior performance compared to state-of-the-art methods, underscoring its potential for efficient and scalable video segmentation.
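As a rough illustration of the calibration step that PTQ relies on, the sketch below derives a quantization scale from a small calibration set and then quantizes values without any retraining. It is a generic symmetric, per-tensor, 8-bit scheme chosen for simplicity, not the method of the paper; the function names (`calibrate_scale`, `quantize`, `dequantize`) and the random calibration data are hypothetical.

```python
import random


def calibrate_scale(calibration_samples, num_bits=8):
    """Pick a symmetric scale from the largest magnitude seen during calibration."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for sample in calibration_samples for v in sample)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Map floats to signed integer codes, clamping to the representable range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


# Hypothetical calibration set: 8 small samples standing in for activations.
random.seed(0)
calibration = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(8)]

scale = calibrate_scale(calibration)
sample = calibration[0]
codes = quantize(sample, scale)
restored = dequantize(codes, scale)

# For inputs inside the calibrated range, rounding error is at most half a step.
max_error = max(abs(a - b) for a, b in zip(sample, restored))
```

The only data-dependent quantity is the scale, which is why a small calibration set can suffice. Practical PTQ pipelines typically use per-channel scales and more robust range estimates than a plain maximum.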

Keywords:

Computer Vision: CV: Recognition (object detection, categorization)

Computer Vision: CV: Multimodal learning

Computer Vision: CV: Segmentation, grouping and shape analysis