ScaleFormer: Revisiting the Transformer-based Backbones from a Scale-wise Perspective for Medical Image Segmentation

Huimin Huang; Shiao Xie; Lanfen Lin; Yutaro Iwamoto; Xian-Hua Han; Yen-Wei Chen; Ruofeng Tong

doi:10.24963/ijcai.2022/135

Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence

Main Track. Pages 964-971. https://doi.org/10.24963/ijcai.2022/135

Recently, a variety of vision transformers have been developed owing to their capability of modeling long-range dependencies. In current transformer-based backbones for medical image segmentation, convolutional layers are either replaced with pure transformers, or transformers are added to the deepest encoder to learn global context. However, two main challenges remain from a scale-wise perspective: (1) intra-scale problem: existing methods fall short in extracting local-global cues at each scale, which may impair the signal propagation of small objects; (2) inter-scale problem: existing methods fail to explore distinctive information from multiple scales, which may hinder representation learning for objects with widely variable size, shape and location. To address these limitations, we propose a novel backbone, namely ScaleFormer, with two appealing designs: (1) A scale-wise intra-scale transformer is designed to couple the CNN-based local features with the transformer-based global cues at each scale, where the row-wise and column-wise global dependencies can be extracted by a lightweight Dual-Axis MSA. (2) A simple and effective spatial-aware inter-scale transformer is designed to enable interaction among consensual regions across multiple scales, which can highlight the cross-scale dependency and resolve the complex scale variations. Experimental results on different benchmarks demonstrate that ScaleFormer outperforms the current state-of-the-art methods. The code is publicly available at: https://github.com/ZJUGiveLab/ScaleFormer.
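
The abstract describes the Dual-Axis MSA only as extracting row-wise and column-wise global dependencies. As a rough illustration of that idea, the following NumPy sketch applies single-head self-attention along each row and along each column of a feature map and adds both results to the input. Everything here is an assumption made for illustration: the function names, the single-head form, the random projection matrices and the residual sum are hypothetical and are not taken from the authors' released code.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def axis_attention(feat, wq, wk, wv):
    """Single-head self-attention over the second axis of feat.

    feat has shape (N, L, C): N independent sequences of L tokens.
    wq, wk, wv are (C, C) projection matrices (hypothetical parameters).
    """
    q, k, v = feat @ wq, feat @ wk, feat @ wv
    # scores has shape (N, L, L): tokens attend only within their own sequence.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def dual_axis_attention(feat, params_row, params_col):
    """Row-wise plus column-wise attention on an (H, W, C) feature map.

    Assumed fusion: residual sum of the input and both axis outputs.
    """
    # Row-wise: each of the H rows is a sequence of W tokens.
    row_out = axis_attention(feat, *params_row)
    # Column-wise: swap axes so each of the W columns is a sequence of H tokens.
    col_out = axis_attention(np.swapaxes(feat, 0, 1), *params_col)
    col_out = np.swapaxes(col_out, 0, 1)
    return feat + row_out + col_out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H, W, C = 32, 32, 16
    feat = rng.standard_normal((H, W, C))
    params_row = tuple(rng.standard_normal((C, C)) * 0.1 for _ in range(3))
    params_col = tuple(rng.standard_normal((C, C)) * 0.1 for _ in range(3))
    out = dual_axis_attention(feat, params_row, params_col)
    print(out.shape)
```

Under these assumptions, a 32 x 32 feature map needs 2 x 32^3 = 65,536 attention scores (rows plus columns), compared with 1,024^2 = 1,048,576 for full self-attention over all positions, which is one plausible reading of why the abstract calls the module lightweight.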

Keywords:

Computer Vision: Segmentation

Computer Vision: Biomedical Image Analysis
