Enabling Visual Foundation Models to Teach Compact Students via Mixture of Distillation
Xinye Yang, Shang Wang, Li Luking, Yipeng Chen
Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 11145-11153.
https://doi.org/10.24963/ijcai.2025/1281
In this paper, we present a novel Mixture of Distillation (MoD) framework for distilling lightweight student models using Visual Foundation Models (VFMs) as teachers. Knowledge distillation (KD) is a crucial training strategy for improving model performance. However, conventional KD methods face two main challenges: (1) selecting and training appropriate teacher models and (2) designing effective knowledge distillation techniques. To address the first challenge, we leverage recent VFMs such as CLIP, Grounding DINO, and SAM as teachers, capitalizing on their remarkable zero-shot generalization abilities and low fine-tuning requirements for new tasks, thereby avoiding expensive retraining of teachers. For the second challenge, our MoD framework extracts and decomposes the feature and logit knowledge from VFMs into multiple knowledge experts, which capture modality-specific information across batches, channels, and instances. Each knowledge expert undergoes separate projection, reshaping, normalization, and learnable magnitude operations. We then employ sparse knowledge gates, a softmax function followed by a KeepTopK operation, to weight the different knowledge experts. In this way, our MoD not only bridges the distillation gap between VFMs and students but also enables the adaptive transfer of useful knowledge across different domains. Extensive experiments on various classification, detection, and medical segmentation tasks validate the effectiveness of our approach against other models. Moreover, our MoD framework demonstrates the potential for transferring zero-shot abilities from VFMs without relying on ground-truth labels. Notably, MoD achieves impressive performance, attaining 72.48% accuracy for RepViT with a 76.20% CLIP teacher on ImageNet-1K without annotations.
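To make the gating mechanism concrete, below is a minimal PyTorch sketch of the two components the abstract names: a knowledge expert (separate projection, reshaping, normalization, and a learnable magnitude) and a sparse knowledge gate (softmax followed by a KeepTopK operation). All class and variable names, dimensions, and the MSE-based per-expert loss are illustrative assumptions written from the abstract alone; the paper's actual implementation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class KnowledgeExpert(nn.Module):
    """One knowledge expert: separate projection, reshaping,
    normalization, and a learnable magnitude (names are our own)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)    # separate projection
        self.norm = nn.LayerNorm(out_dim)         # normalization
        self.scale = nn.Parameter(torch.ones(1))  # learnable magnitude

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        if feat.dim() > 2:                         # reshape: pool spatial dims
            feat = feat.flatten(2).mean(-1)
        return self.scale * self.norm(self.proj(feat))


class SparseKnowledgeGate(nn.Module):
    """Sparse knowledge gate: softmax over per-expert scores followed
    by KeepTopK, which zeroes all but the k largest gate weights."""

    def __init__(self, feat_dim: int, num_experts: int, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(feat_dim, num_experts)
        self.k = k

    def forward(self, pooled_feat: torch.Tensor) -> torch.Tensor:
        weights = F.softmax(self.gate(pooled_feat), dim=-1)  # (B, E)
        _, topk_idx = weights.topk(self.k, dim=-1)           # KeepTopK
        mask = torch.zeros_like(weights).scatter(-1, topk_idx, 1.0)
        kept = weights * mask
        return kept / kept.sum(dim=-1, keepdim=True)         # renormalize


# Toy usage: gate per-sample distillation losses from three experts
# (e.g., batch-, channel-, and instance-level knowledge). The teacher
# features here are random stand-ins for VFM outputs.
if __name__ == "__main__":
    B, D, E = 8, 256, 3
    student_feat = torch.randn(B, D)
    teacher_feat = torch.randn(B, 128)
    experts = nn.ModuleList(KnowledgeExpert(D, 128) for _ in range(E))
    gate = SparseKnowledgeGate(D, E, k=2)

    per_expert_loss = torch.stack(
        [F.mse_loss(ex(student_feat), teacher_feat,
                    reduction="none").mean(-1)
         for ex in experts], dim=-1)              # (B, E)
    mod_loss = (gate(student_feat) * per_expert_loss).sum(-1).mean()
    print(mod_loss.item())
```

With k smaller than the number of experts, each sample's distillation loss is a convex combination of only its top-k experts, which is one plausible reading of how the sparse gates let useful knowledge transfer adaptively across domains.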
Keywords:
Computer Vision: CV: Efficiency and Optimization
Computer Vision: CV: Machine learning for vision
Machine Learning: ML: Deep learning architectures
Machine Learning: ML: Foundation models
