DisPIM: Distilling PreTrained Image Models for Generalizable Visuo-Motor Control

Haitao Wang, Hejun Wu

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 8796-8804. https://doi.org/10.24963/ijcai.2025/978

We introduce DisPIM, a framework that leverages pretrained image models (PIMs) for visuo-motor control. Applying PIMs to visuo-motor control is difficult because of the distribution shift between visual environmental states and the pretraining datasets. Under this shift, fine-tuning PIMs specifically for visuo-motor control may hurt their generalizability, while adding extra tunable parameters for task-specific actions leads to high computational costs. DisPIM addresses these challenges with a novel feature distillation approach, which yields a compact model that not only inherits the generalization capability of PIMs but also acquires task-specific skills for visuo-motor control. This balance is mainly achieved through a target Q-ensemble mechanism inspired by double Q-learning. The Q-ensemble adaptively adjusts the distillation rate so as to balance the generalization objective and task-specific ability during training. With this balancing mechanism, DisPIM achieves both task-specific and generalizable control at low computational cost. Across a range of algorithms, task domains, and evaluation metrics, in both simulation and on a real robot, DisPIM demonstrates significant improvements in generalization and overall performance with low computational overhead.
Keywords:
Robotics: ROB: Behavior and control
Robotics: ROB: Manipulation
Robotics: ROB: Robotics and vision
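
The sketch below is a minimal, hypothetical illustration of the idea described in the abstract: distill features from a frozen PIM into a compact encoder, while a target Q-ensemble (in the spirit of double Q-learning) modulates how strongly the distillation term is weighted against the task-specific RL term. All names (CompactEncoder, adaptive_distill_weight, the critic interfaces, and the specific weighting rule) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of PIM feature distillation with a Q-ensemble-driven
# distillation weight; not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CompactEncoder(nn.Module):
    """Small student encoder intended to inherit the frozen PIM's representations."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def adaptive_distill_weight(target_qs: torch.Tensor) -> torch.Tensor:
    """Assumed rule: higher disagreement across the target Q-ensemble suggests
    unreliable value estimates, so lean more on distillation (generalization);
    low disagreement lets the task-specific TD objective dominate."""
    disagreement = target_qs.std(dim=0).mean()
    return torch.sigmoid(disagreement)  # distillation weight in (0, 1)


def combined_loss(student, frozen_pim, policy, critics, target_critics,
                  batch, gamma: float = 0.99) -> torch.Tensor:
    obs, act, rew, next_obs, done = batch
    z = student(obs)

    # 1) Feature distillation toward the frozen pretrained image model.
    with torch.no_grad():
        z_teacher = frozen_pim(obs)
    distill_loss = F.mse_loss(z, z_teacher)

    # 2) Task-specific TD loss with a clipped (double-Q style) ensemble target.
    with torch.no_grad():
        z_next = student(next_obs)
        next_act = policy(z_next)
        target_qs = torch.stack([
            tc(torch.cat([z_next, next_act], dim=-1)).squeeze(-1)
            for tc in target_critics
        ])  # shape: (ensemble_size, batch)
        td_target = rew + gamma * (1.0 - done) * target_qs.min(dim=0).values

    q_preds = [c(torch.cat([z, act], dim=-1)).squeeze(-1) for c in critics]
    td_loss = sum(F.mse_loss(q, td_target) for q in q_preds) / len(q_preds)

    # 3) Adaptive balance between generalization (distillation) and task skill (TD).
    w = adaptive_distill_weight(target_qs)
    return w * distill_loss + (1.0 - w) * td_loss
```

For instance, `critics` and `target_critics` could each be a list of small MLPs mapping the concatenated feature-action vector to a scalar Q-value, and `policy` a deterministic actor over the student features; the exact architectures and update schedule are left open here.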