Empowering Multimodal Road Traffic Profiling with Vision Language Models and Frequency Spectrum Fusion

Haolong Xiang, Xiaolong Xu, Guangdong Wang, Xuyun Zhang, Xiaoyong Li, Qi Zhang, Amin Beheshti, Wei Fan

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 2694-2702. https://doi.org/10.24963/ijcai.2025/300

With rapid urbanization, smart traffic profiling based on multimodal data sources plays a significant role in ensuring safe travel, reducing traffic congestion, and optimizing urban mobility. Most existing methods for road-level traffic profiling use single-modality data, i.e., they mainly focus on image processing with deep vision models or auxiliary analysis of textual data. However, joint modeling and multimodal fusion of the textual and visual modalities have rarely been studied in road traffic profiling, which largely hinders accurate prediction or classification of traffic conditions. To address this issue, we propose a novel multimodal learning and fusion framework for road traffic profiling, named TraffiCFUS. Specifically, given traffic images, TraffiCFUS first uses Vision Language Models (VLMs) to generate text and then creates tailored prompt instructions to refine this text according to the specific scene requirements of road traffic profiling. Next, we apply the discrete Fourier transform to convert multimodal data from the spatial domain to the frequency domain and perform a cross-modal spectrum transform to filter out information irrelevant to traffic profiling. Furthermore, the processed spatial multimodal data is combined to generate a fusion loss and an interaction loss with contrastive learning. Finally, extensive experiments on four real-world datasets demonstrate superior performance compared with state-of-the-art approaches.
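
To make the frequency-domain step more concrete, the following is a minimal sketch of one way a cross-modal spectrum filter could look. It is not the TraffiCFUS implementation: the abstract does not specify the transform, so the function name `cross_modal_spectrum_filter`, the magnitude-based gating rule, and the feature shapes are all assumptions chosen for illustration. The sketch applies the discrete Fourier transform to text and image feature vectors, rescales each modality's spectrum by the normalized spectral magnitude of the other modality, and maps the result back with the inverse transform.

```python
import numpy as np


def cross_modal_spectrum_filter(text_feat, image_feat, eps=1e-8):
    """Hypothetical cross-modal filter (an assumption, not the paper's method).

    Both inputs are real arrays of shape (batch, dim). Each modality's
    spectrum is gated by the normalized spectral magnitude of the other.
    """
    # Discrete Fourier transform along the feature dimension.
    text_spec = np.fft.fft(text_feat, axis=-1)
    image_spec = np.fft.fft(image_feat, axis=-1)

    # Gates in [0, 1): frequencies that are weak in the other modality
    # are attenuated, frequencies that are strong there are kept.
    image_mag = np.abs(image_spec)
    text_mag = np.abs(text_spec)
    gate_for_text = image_mag / (image_mag.max(axis=-1, keepdims=True) + eps)
    gate_for_image = text_mag / (text_mag.max(axis=-1, keepdims=True) + eps)

    # The gates are real and symmetric in frequency for real inputs, so the
    # inverse transform is real up to rounding error; keep the real part.
    text_filtered = np.fft.ifft(text_spec * gate_for_text, axis=-1).real
    image_filtered = np.fft.ifft(image_spec * gate_for_image, axis=-1).real
    return text_filtered, image_filtered
```

In this sketch the gate is a fixed function of the spectra; a learned filter would be a natural alternative, and the abstract leaves that choice open. The filtered features could then be passed to whatever fusion and contrastive objectives the model uses.
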
Keywords:
Data Mining: DM: Mining spatial and/or temporal data
Data Mining: DM: Applications