Code is available here (no implementation was published at the time of writing this review)
An amazing paper from Microsoft Research Asia presents a brand new vision Transformer, called Swin Transformer, that
can serve as a general-purpose backbone, just like the usual CNNs in computer vision and Transformers in natural language processing (NLP).
There are two main problems with using Transformers for computer vision. First, existing Transformer-based models use tokens of a fixed scale, whereas, unlike word tokens, visual elements can vary substantially in scale. Second, the computational complexity of self-attention is quadratic in image size, which causes problems in vision tasks that require dense predictions at the pixel level.
The authors offer two strategies to solve these challenges:
- Hierarchical feature maps, which allow convenient use of techniques such as feature pyramid networks (FPN) or U-Net for dense predictions.
- Computing self-attention locally within non-overlapping windows, each containing an equal number of patches, to achieve linear complexity.
Swin Transformer outperforms the previous state-of-the-art approaches on both COCO object detection and ADE20K semantic segmentation, while achieving the best speed-accuracy trade-off on image classification.
The overall pipeline is:
- Splitting the RGB image into non-overlapping patches (tokens).
- Applying a linear embedding layer to project the raw patch features to an arbitrary dimension.
- Applying several Swin Transformer blocks with modified self-attention computation, keeping the number of tokens unchanged.
- Reducing the number of tokens with patch merging layers, producing the same feature map resolutions as those of typical CNNs.
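The first two steps can be sketched in plain NumPy (patch size 4 and C = 96 are the Swin-T defaults; the random matrix merely stands in for the learned embedding):

```python
import numpy as np

# A minimal sketch of patch splitting + linear embedding; the random
# weight matrix stands in for the learned projection.
patch, C = 4, 96
img = np.random.rand(224, 224, 3)                      # H x W x RGB
H, W = img.shape[0] // patch, img.shape[1] // patch    # 56 x 56 patches
tokens = img.reshape(H, patch, W, patch, 3).transpose(0, 2, 1, 3, 4)
tokens = tokens.reshape(H * W, patch * patch * 3)      # (3136, 48) raw patch features
W_embed = np.random.rand(patch * patch * 3, C)
embed = tokens @ W_embed                               # linear embedding -> (3136, 96)
```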
Standard global self-attention is poorly suited to representations of high-resolution images because of its quadratic complexity.
The authors propose computing self-attention within local windows instead.
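The windowing itself can be sketched as follows (a NumPy sketch assuming a (B, H, W, C) layout of patch features; the names are mine, not the authors' code). Self-attention is then run independently inside each window:

```python
import numpy as np

def window_partition(x: np.ndarray, M: int) -> np.ndarray:
    """Split a (B, H, W, C) map of patch features into non-overlapping
    M x M windows, returning shape (num_windows * B, M, M, C)."""
    B, H, W, C = x.shape
    x = x.reshape(B, H // M, M, W // M, M, C)
    return x.transpose(0, 1, 3, 2, 4, 5).reshape(-1, M, M, C)

x = np.random.rand(1, 8, 8, 96)       # an 8 x 8 grid of patches, C = 96
windows = window_partition(x, 4)      # four 4 x 4 windows: (4, 4, 4, 96)
```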
Here is the comparison of the computational complexities of a global MSA (multi-head self-attention) module and the new window-based one (W-MSA):

Ω(MSA) = 4hwC² + 2(hw)²C
Ω(W-MSA) = 4hwC² + 2M²hwC

- M is the window size: each window contains M × M patches
- The image consists of h × w patches
- C is the embedding dimension

The global version is quadratic in the number of patches hw, while the windowed one is linear when M is fixed.
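To get a feel for the gap, we can plug Swin-T defaults (h = w = 56 patches, C = 96, window size M = 7) into the paper's two formulas, Ω(MSA) = 4hwC² + 2(hw)²C and Ω(W-MSA) = 4hwC² + 2M²hwC:

```python
# Swin-T numbers: a 224x224 image with 4x4 patches gives 56x56 patches.
h = w = 56
C, M = 96, 7

msa = 4 * h * w * C**2 + 2 * (h * w) ** 2 * C      # quadratic in h*w
w_msa = 4 * h * w * C**2 + 2 * M**2 * h * w * C    # linear in h*w for fixed M

print(msa / w_msa)   # global attention is ~14x more expensive at this resolution
```

The gap widens further as the input resolution grows, since only the global term scales quadratically.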
Moreover, the window partitioning is shifted between consecutive MSA blocks, so that the next block's windows straddle the previous block's boundaries, creating
additional connections across windows.
Computation of Swin Transformer blocks:
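With ẑˡ and zˡ denoting the output features of the (S)W-MSA module and the MLP of block l, and LN denoting LayerNorm, the paper writes two consecutive blocks as:

```latex
\begin{aligned}
\hat{z}^{l}   &= \text{W-MSA}(\mathrm{LN}(z^{l-1})) + z^{l-1},\\
z^{l}         &= \text{MLP}(\mathrm{LN}(\hat{z}^{l})) + \hat{z}^{l},\\
\hat{z}^{l+1} &= \text{SW-MSA}(\mathrm{LN}(z^{l})) + z^{l},\\
z^{l+1}       &= \text{MLP}(\mathrm{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1}.
\end{aligned}
```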
Shifted window partitioning results in more windows, some of which are smaller than M x M. The authors therefore propose an efficient batch computation approach based on cyclically shifting the feature map toward the top-left. Since a batched window may then be composed of sub-windows that are not adjacent in the feature map, a mask is applied to restrict self-attention to each sub-window.
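The cyclic shift itself is a one-liner (a NumPy sketch with a shift of M // 2 as in the paper; the attention mask for non-adjacent sub-windows is omitted here):

```python
import numpy as np

# Cyclic shift toward the top-left; np.roll stands in for the torch.roll
# typically used in practice.
def cyclic_shift(x: np.ndarray, shift: int) -> np.ndarray:
    # x: (B, H, W, C) map of patch features
    return np.roll(x, shift=(-shift, -shift), axis=(1, 2))

x = np.arange(16).reshape(1, 4, 4, 1)
shifted = cyclic_shift(x, 2)                            # bottom-right wraps to top-left
restored = np.roll(shifted, shift=(2, 2), axis=(1, 2))  # reverse shift after attention
```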
Also, a relative position bias is added to the attention logits.
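Concretely, the paper computes Attention(Q, K, V) = SoftMax(QKᵀ/√d + B)V, where B is looked up from a small learnable table with (2M − 1)² entries per head. A sketch of computing the lookup indices (function name is mine):

```python
import numpy as np

def relative_position_index(M: int) -> np.ndarray:
    """For an M x M window, map every pair of patches to an index into the
    (2M - 1)^2-entry relative position bias table."""
    coords = np.stack(np.meshgrid(np.arange(M), np.arange(M), indexing="ij"))
    flat = coords.reshape(2, -1)                  # (2, M*M) patch coordinates
    rel = flat[:, :, None] - flat[:, None, :]     # pairwise offsets in [-(M-1), M-1]
    rel = rel + (M - 1)                           # shift offsets to start at 0
    return rel[0] * (2 * M - 1) + rel[1]          # unique index per (dy, dx) pair

idx = relative_position_index(4)   # (16, 16) indices into a 49-entry table
```

Because the index depends only on the relative offset, all patch pairs with the same displacement share one learned bias.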
Experiments were conducted on ImageNet-1K image classification, COCO object detection, and ADE20K semantic segmentation.
ImageNet-22K was used for pre-training and ImageNet-1K for fine-tuning.
Here we can see that Swin Transformers achieve a great speed-accuracy trade-off compared with state-of-the-art CNNs.
Swin Transformers score better than ResNet-50, ResNeXt, and DeiT as backbones for Cascade Mask R-CNN and other detection models. Also, the inference speed is much higher than DeiT's because of the linear complexity with respect to input image size.
The proposed model surpasses other backbones on ADE20K as well.
Finally, there is an interesting ablation study showing that the shifted window approach outperforms single window partitioning with only a small latency overhead.
I find this paper captivating and useful since it opens new possibilities of developing a unified architecture for computer vision and natural language processing tasks, which can benefit both fields and accelerate shared research.