Vision Mamba: Like a Vision Transformer but Better | by Sascha Kirch | Sep, 2024

This is part 4 of my new multi-part series 🐍 Towards Mamba State Space Models for Images, Videos and Time Series.

The field of computer vision has seen incredible advances in recent years. One of the key enablers of this development has undoubtedly been the introduction of the Transformer. While the Transformer revolutionized natural language processing, it took some years to transfer its capabilities to the vision domain. Probably the most prominent paper was the Vision Transformer (ViT), a model that is still used as the backbone in many modern architectures.

Once again, it is the Transformer’s O(L²) complexity that limits its application as the image’s resolution grows. Equipped with the Mamba selective state space model, we can now let history repeat itself and transfer the success of SSMs from sequence data to non-sequence data: images.
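To make that scaling gap concrete, here is a quick back-of-the-envelope sketch (not from the paper; the patch size of 16 and the square-image setup are assumptions for illustration). A ViT-style model splits an H×W image into P×P patches, yielding L = (H/P)·(W/P) tokens, so self-attention cost grows with L² while a linear-time SSM scan grows only with L:

```python
def num_tokens(height: int, width: int, patch: int = 16) -> int:
    """Number of patch tokens for an image of the given resolution.

    Patch size 16 is an illustrative assumption (common in ViT variants).
    """
    return (height // patch) * (width // patch)

# Compare how the dominant term grows as resolution increases:
# self-attention ~ L^2, a selective SSM scan ~ L.
for side in (224, 512, 1248):
    L = num_tokens(side, side)
    print(f"{side}x{side}: L = {L:5d}, attention ~ L^2 = {L**2:,}, SSM ~ L = {L:,}")
```

At 224×224 there are only 196 tokens and the quadratic term is tolerable; at 1248×1248 the token count grows to 6,084 and the L² term explodes by roughly three orders of magnitude more than the linear one, which is exactly the regime where the memory and speed advantages quoted below appear.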

❗ Spoiler Alert: Vision Mamba is 2.8x faster than DeiT and saves 86.8% GPU memory on high-resolution images (1248×1248). In this article, you’ll see how…