Dec 8, 2023

Mamba: Selective State Space Model That Outperformed Transformers

An introduction to Mamba, selective state space models, and why linear-time sequence modeling is exciting for language models.

For the past few years, transformers have been the leading architecture in deep learning. Foundation models are almost universally based on the Transformer architecture and its core attention module. Scaled up, these models have shown remarkable capabilities: ChatGPT and Claude are both based on transformer-like architectures. But is this "the final" or the best architecture? Transformers have limitations, such as the inability to model anything beyond a finite window and quadratic scaling with window length. Many more efficient attention variants have been proposed, but usually at the expense of effectiveness. A few days ago, researchers from Carnegie Mellon University and Princeton University published a new architecture called Mamba in the paper "Mamba: Linear-Time Sequence Modeling with Selective State Spaces".

Motivation

The authors argue that a fundamental problem of sequence modeling is compressing context into a smaller state. What makes this hard is selectivity: the context-aware ability to focus on or filter out inputs as they are folded into a sequential state. One way to build a selection mechanism into a model is to make the parameters that govern interactions along the sequence input-dependent, for example the recurrent dynamics of an RNN or the convolution kernel of a CNN.

Selective State Space Models

Structured state space models (SSMs), the family that selective SSMs build on, can be interpreted as a combination of recurrent neural networks (RNNs) and convolutional neural networks (CNNs), with inspiration from classical state space models (Kalman 1960). When their parameters are fixed over time, these models can be computed very efficiently as either a recurrence or a convolution, with linear or near-linear scaling in sequence length.
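To make the "recurrence or convolution" claim concrete, here is a minimal sketch in plain Python. It assumes a toy, already discretized SSM with a diagonal state and a single input channel; the function names and parameter values are made up for illustration and are not taken from the paper's code. The point is only that unrolling the recurrence gives a fixed kernel, so both views produce the same output.

```python
def ssm_recurrent(a, b, c, xs):
    """Run h_t = a * h_{t-1} + b * x_t, y_t = c . h_t with a diagonal state."""
    h = [0.0] * len(a)
    ys = []
    for x in xs:
        h = [a_n * h_n + b_n * x for a_n, b_n, h_n in zip(a, b, h)]
        ys.append(sum(c_n * h_n for c_n, h_n in zip(c, h)))
    return ys


def ssm_kernel(a, b, c, length):
    """Kernel K_k = sum_n c_n * a_n**k * b_n, obtained by unrolling the recurrence."""
    return [
        sum(c_n * a_n ** k * b_n for a_n, b_n, c_n in zip(a, b, c))
        for k in range(length)
    ]


def ssm_convolutional(a, b, c, xs):
    """Causal convolution y_t = sum_{k<=t} K_k * x_{t-k} (naive O(L^2) version)."""
    kernel = ssm_kernel(a, b, c, len(xs))
    return [
        sum(kernel[k] * xs[t - k] for k in range(t + 1))
        for t in range(len(xs))
    ]


# Toy, made-up parameters: two state dimensions, fixed over time.
a = [0.9, 0.5]
b = [1.0, 0.3]
c = [0.4, -0.2]
xs = [1.0, 0.0, 2.0, -1.0, 0.5]

y_rec = ssm_recurrent(a, b, c, xs)
y_conv = ssm_convolutional(a, b, c, xs)
```

The convolution here is written naively for readability; the near-linear scaling mentioned above comes from evaluating the same convolution with an FFT. The equivalence relies on a, b and c being the same at every time step.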

Selection Mechanism. The authors propose a simple selection mechanism: the SSM parameters are made functions of the input. This allows the model to filter out irrelevant information and to remember relevant information indefinitely.
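A rough sketch of what "input-dependent parameters" can look like, again in plain Python. It assumes a single scalar input channel, a diagonal state matrix A with negative entries, a step size delta passed through softplus, and simple linear maps from the input to B, C and delta. All names (selective_ssm, w_B, w_C, w_delta, b_delta) and values are hypothetical, and the discretization is a simplified one, so treat this as an illustration of the idea rather than the paper's exact parameterization.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z)); keeps the step size positive."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_ssm(xs, A, w_B, w_C, w_delta, b_delta):
    """Recurrence whose step size, input map and output map depend on x_t."""
    h = [0.0] * len(A)
    ys = []
    for x in xs:
        delta = softplus(w_delta * x + b_delta)      # input-dependent step size
        B = [w * x for w in w_B]                     # input-dependent input map
        C = [w * x for w in w_C]                     # input-dependent output map
        a_bar = [math.exp(delta * a_n) for a_n in A] # how much old state is kept
        b_bar = [delta * b_n for b_n in B]           # how much new input is written
        h = [ab * h_n + bb * x for ab, bb, h_n in zip(a_bar, b_bar, h)]
        ys.append(sum(c_n * h_n for c_n, h_n in zip(C, h)))
    return ys


A = [-1.0, -2.0]    # assumed diagonal, negative for stability
w_B = [1.0, 0.5]
w_C = [0.5, 1.0]

# Tiny step size: the state is held and inputs are effectively ignored.
ignored = selective_ssm([3.0, -1.0, 2.0], A, w_B, w_C, w_delta=0.0, b_delta=-50.0)

# Large step size: the old state is forgotten, only the current input matters.
reset_1 = selective_ssm([3.0, -1.0, 2.0], A, w_B, w_C, w_delta=0.0, b_delta=50.0)
reset_2 = selective_ssm([0.0, 5.0, 2.0], A, w_B, w_C, w_delta=0.0, b_delta=50.0)
```

In the two extreme settings the step size is pinned by the bias so the effect is easy to see; with a nonzero w_delta the model can move between "ignore this token" and "overwrite the state with this token" per input, which is the filtering and remembering behaviour described above.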

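Hardware-aware. Once the parameters depend on the input, the model is no longer time-invariant, so the efficient convolutional form can no longer be used. The authors therefore propose a hardware-aware algorithm that computes the model recurrently with a scan instead of a convolution, which they report to be up to 3x faster than previous methods on A100 GPUs.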
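Why can a recurrence be computed with a scan at all? A minimal sketch, assuming a scalar recurrence h_t = a_t * h_{t-1} + b_t with h before the first step equal to 0: each step is an affine map, and composing affine maps is associative, so prefixes can be combined in a tree-like pattern instead of strictly left to right. The helper names below are made up for illustration, and the doubling loop only simulates the parallel rounds on a CPU; the authors' actual implementation is a fused GPU kernel and is not reproduced here.

```python
def sequential_scan(a, b):
    """Reference: h_t = a_t * h_{t-1} + b_t, computed one step at a time."""
    h = 0.0
    hs = []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        hs.append(h)
    return hs


def combine(left, right):
    """Compose two affine steps h -> a*h + b; 'left' happens first."""
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)


def doubling_scan(a, b):
    """Inclusive scan in about log2(L) rounds; each round is parallelizable."""
    elems = list(zip(a, b))
    step = 1
    while step < len(elems):
        prev = elems
        # Every position is updated independently of the others in this round.
        elems = [
            combine(prev[i - step], prev[i]) if i >= step else prev[i]
            for i in range(len(prev))
        ]
        step *= 2
    # With a zero initial state, h_t is the offset of the composed map.
    return [b_t for _, b_t in elems]


# Made-up, time-varying coefficients, as in a selective SSM.
a = [0.9, 0.5, 0.8, 0.1, 0.7, 0.3]
b = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5]

hs_seq = sequential_scan(a, b)
hs_par = doubling_scan(a, b)
```

This simple doubling scheme does more total work than the sequential loop (roughly L log L combines); work-efficient parallel scans exist, and which variant is fastest depends on the hardware.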

Architecture. The authors simplify prior deep sequence model architectures by combining the design of prior SSM architectures with the MLP block of Transformers into a single block. The result is a simple, homogeneous architecture (Mamba) that incorporates selective state spaces.

Results

By making the SSM parameters input-dependent, Mamba selectively focuses on the relevant information in a sequence while keeping computation efficient. The authors report about 5x higher inference throughput than Transformers and linear scaling in sequence length. Mamba was tested on language, audio, and genomics tasks. Interestingly, the Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size: Mamba-2.8B achieves 63.3% accuracy on average across multiple datasets, whereas GPT-Neo 2.7B achieves 56.5%, Pythia-6.9B 61.7%, and OPT-6.7B 62.9%.

The authors have released the code publicly. You can run a demo on Google Colab and find a model on HuggingFace (mamba-chat). More results and the authors' answers to reviewers are available on the OpenReview platform.

It is exciting to see that other architectures can not only match the performance of transformers but also beat them. Mamba, even at a smaller size of 3B, outperforms some 7B open-source models. If this architecture scales as well as transformers do, we may see more efficient 70B models with similar or even better performance. This gives hope for future development.
