Weakly-Supervised Movie Trailer Generation Driven by Multi-Modal Semantic Consistency

Sidan Zhu; Yutong Wang; Hongteng Xu; Dixin Luo

doi:10.24963/ijcai.2025/1137

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence

AI, Arts & Creativity. Pages 10234-10242. https://doi.org/10.24963/ijcai.2025/1137

As an essential movie promotional tool, trailers are designed to capture the audience's interest through the skillful editing of key movie shots. Although some attempts have been made at automatic trailer generation, existing methods often rely on predefined rules or manual fine-grained annotations and fail to fully leverage the multi-modal information of movies, resulting in unsatisfactory trailers. In this study, we introduce a weakly-supervised trailer generation method driven by multi-modal semantic consistency. Specifically, we design a multi-modal trailer generation framework that selects and sorts key movie shots based on input music and movie metadata (e.g., category tags and plot keywords) and adds narration to the generated trailer based on movie subtitles. We use two pseudo-scores derived from the proposed framework as labels to train the model under a weakly-supervised learning paradigm, ensuring trailerness consistency for key shot selection and emotion consistency for key shot sorting, respectively. As a result, the proposed model can be learned solely from movie-trailer pairs, without any fine-grained annotations. Both objective experimental results and subjective user studies demonstrate the superior performance of our method over previous works. The code is available at https://github.com/Dixin-Lab/MMSC.

Keywords:

Theory and philosophy of arts and creativity in AI systems: Autonomous creative or artistic AI

Methods and resources: AI methods for better understanding human creative processes

Application domains: Images, movies and visual arts

Methods and resources: Machine learning, deep learning, neural models, reinforcement learning