Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention

Kai Liu, Tianyi Wu, Cong Liu, Guodong Guo

Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence
Main Track. Pages 1187-1193. https://doi.org/10.24963/ijcai.2022/166

Recently, Transformers have shown promising performance in various vision tasks. To reduce the quadratic computation complexity caused by each query attending to all keys/values, various methods have constrained the range of attention to local regions, where each query only attends to keys/values within a hand-crafted window. However, these hand-crafted window partition mechanisms are data-agnostic and ignore the input content, so a query is likely to attend to irrelevant keys/values. To address this issue, we propose Dynamic Group Attention (DG-Attention), which dynamically divides all queries into multiple groups and selects the most relevant keys/values for each group. Our DG-Attention can flexibly model more relevant dependencies without the spatial constraints used in hand-crafted window-based attention. Built on DG-Attention, we develop a general vision transformer backbone named Dynamic Group Transformer (DGT). Extensive experiments show that our models outperform state-of-the-art methods on multiple common vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation.
Keywords:
Computer Vision: Recognition (object detection, categorization)
Computer Vision: Segmentation
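
For illustration, below is a minimal PyTorch-style sketch of the dynamic grouping idea described in the abstract. It is a sketch under stated assumptions, not the paper's implementation: the class name DynamicGroupAttention, the hard assignment of queries to learned group prototypes, the top-k key/value selection per group, and the hyperparameters num_groups and topk are all illustrative choices.

```python
# Minimal sketch of the dynamic-group attention idea (single-head for brevity).
# Assumptions not taken from the paper: queries are grouped by similarity to
# learned prototypes, and each group keeps the top-k keys/values ranked by
# their similarity to the group's mean query.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicGroupAttention(nn.Module):
    def __init__(self, dim, num_groups=4, topk=16):
        super().__init__()
        self.num_groups = num_groups
        self.topk = topk
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Learned prototypes that decide which group each query joins.
        self.prototypes = nn.Parameter(torch.randn(num_groups, dim))

    def forward(self, x):                                  # x: (B, N, C)
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)             # each (B, N, C)
        # Hard-assign every query to its most similar prototype (data-dependent
        # grouping, in contrast to a fixed hand-crafted window partition).
        assign = (q @ self.prototypes.t()).argmax(dim=-1)  # (B, N)

        out = torch.zeros_like(x)
        for b in range(B):
            for g in range(self.num_groups):
                idx = (assign[b] == g).nonzero(as_tuple=True)[0]
                if idx.numel() == 0:
                    continue
                qg = q[b, idx]                             # (Ng, C)
                # Rank all keys by relevance to this group and keep the top-k,
                # so each group attends only to its most relevant keys/values.
                scores = k[b] @ qg.mean(dim=0)             # (N,)
                sel = scores.topk(min(self.topk, N)).indices
                kg, vg = k[b, sel], v[b, sel]              # (K, C)
                attn = F.softmax((qg @ kg.t()) * self.scale, dim=-1)
                out[b, idx] = attn @ vg                    # (Ng, C)
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 196, 64)                            # e.g. 14x14 tokens
    module = DynamicGroupAttention(dim=64, num_groups=4, topk=32)
    print(module(x).shape)                                 # torch.Size([2, 196, 64])
```

A practical backbone would vectorize the per-group loops and use multi-head attention; this sketch only illustrates the data-dependent query grouping and per-group key/value selection that the abstract describes.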