PatternCIR Benchmark and TisCIR: Advancing Zero-Shot Composed Image Retrieval in Remote Sensing

Zhechun Liang, Tao Huang, Fangfang Wu, Shiwen Xue, Zhenyu Wang, Weisheng Dong, Xin Li, Guangming Shi

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 1530-1538. https://doi.org/10.24963/ijcai.2025/171

Remote sensing composed image retrieval (RSCIR) is a new vision-language task that takes a composed query of an image and text and searches intricate remote sensing imagery for a target image that satisfies both conditions. However, Patterncom, the existing attribute-based benchmark for RSCIR, has significant flaws, including the lack of query text sentences and paired triplets, which make it unable to evaluate the latest methods. To address this, we propose the Zero-Shot Query Text Generator (ZS-QTG), which generates full query text sentences from attributes, and, building on ZS-QTG, we develop the PatternCIR benchmark. PatternCIR rectifies Patterncom’s deficiencies and enables the evaluation of existing methods. Additionally, we explore zero-shot composed image retrieval methods that do not rely on massive pre-collected triplets for training. Existing methods use only the text during retrieval and perform poorly in RSCIR. To improve this, we propose Text-image Sequential Training of Composed Image Retrieval (TisCIR). TisCIR sequentially trains multiple self-masking projection and fine-grained image attention modules, which enables it to filter out conflicting information between the image and text and to improve retrieval by using both modalities in harmony. TisCIR outperforms existing methods by 12.40% to 62.03% on PatternCIR, achieving state-of-the-art performance in RSCIR. The data and code are publicly available.
Keywords:
Computer Vision: CV: Image and video retrieval 
Computer Vision: CV: Multimodal learning
Computer Vision: CV: Scene analysis and understanding   
Computer Vision: CV: Vision, language and reasoning