Diagram Visual Grounding: Learning to See with Gestalt-Perceptual Attention

Xin Hu; Lingling Zhang; Jun Liu; Xinyu Zhang; Wenjun Wu; Qianying Wang

doi:10.24963/ijcai.2023/93

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence

Main Track. Pages 837-845. https://doi.org/10.24963/ijcai.2023/93

Diagram visual grounding aims to capture the correlation between a language expression and local objects in a diagram, and plays an important role in applications such as textbook question answering and cross-modal retrieval. Most diagrams consist of a few colors and simple geometric shapes. This results in sparse low-level visual features, which further widens the gap between the low-level visual and high-level semantic features of diagrams and makes diagram visual grounding challenging. To address these issues, we propose a gestalt-perceptual attention model to align diagram objects with language expressions. For low-level visual features, inspired by gestalt principles, which model human visual perception, we build a gestalt-perception graph network to supplement the features learned by the traditional backbone network. For high-level semantic features, we design a multi-modal context attention mechanism to facilitate interaction between diagrams and language expressions and thereby enhance the semantics of diagrams. Finally, guided by the diagram features and the linguistic embedding, the target query is gradually decoded to generate the coordinates of the referred object. Comprehensive experiments on diagrams and natural images demonstrate that the proposed model achieves superior performance over the competitors. Our code will be released at https://github.com/AIProCode/GPA.
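
To make the idea of interaction between diagram features and language expressions concrete, the following is a minimal sketch of generic scaled dot-product cross-attention in plain Python. It is not the paper's multi-modal context attention mechanism, whose details the abstract does not give. The function names, the toy feature vectors, and the choice to let diagram regions attend over word embeddings are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product attention (illustrative, not the paper's model).

    Each query vector attends over all key vectors and returns a
    weighted sum of the corresponding value vectors.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity between this query and every key, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Hypothetical toy inputs: three diagram-region features, two word embeddings.
diagram_regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
word_embeddings = [[2.0, 0.0], [0.0, 2.0]]

# Each diagram region gathers language context from the words it matches best.
enhanced = cross_attention(diagram_regions, word_embeddings, word_embeddings)
```

In this toy setup the first region aligns with the first word and so draws most of its context from it, while the third region is equally similar to both words and receives their average. A real model would use learned projections and many more dimensions.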

Keywords:

Computer Vision: CV: Vision and language

Computer Vision: CV: Recognition (object detection, categorization)

Computer Vision: CV: Representation learning