AI Atlas: Diverting Our Attention Once Again: A Look at Mamba
AI breakthroughs, concepts, and techniques that are tangibly valuable, specific, and actionable. Written by Glasswing Founder and Managing Partner, Rudina Seseri
In my last post I wrote about Hyena, a model developed at Stanford that challenges the existing assumption in AI that “attention is all you need.” In the field of AI, “attention” refers to the ability of language models, such as those utilizing transformers, to compare each word of a sentence to every other word, allowing the model to develop a stronger grasp of the context behind an entire passage. However, while transformer and attention models have undeniably driven a new wave of GenAI technologies, the computational cost required to scale them remains a significant drawback.
Thus, many researchers have begun to explore whether attention is the only way to unlock the incredible capabilities of large language models (LLMs). This week, I dive into another proposed alternative: Mamba.
🗺️ What is Mamba?
Mamba is an AI architecture announced at the end of last year by researchers at Carnegie Mellon University and Princeton University (go, East Coast!), designed as an alternative to transformers. Rather than utilizing attention to handle long data sequences, Mamba builds off a mechanism known as a State Space Model (SSM). In simple terms, an SSM is a “box” that holds onto key information over time. This box can be viewed in different ways depending on the data being processed and the desired outcome. One view treats the model like a Convolutional Neural Network (CNN), processing an entire sequence in parallel and therefore taking little time to train, while another treats it like a Recurrent Neural Network (RNN), which is slower to train but generates outputs one step at a time with very little overhead. This combination enables SSMs to remain computationally efficient during both training and inference, with the ability to work on substantially longer data contexts.
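To make the “two views of one box” idea concrete, here is a minimal, purely illustrative Python sketch of a simple linear SSM. The matrices A, B, C and all dimensions are toy values I have made up, not Mamba’s actual parameters; the point is only that the same model can be run step by step like an RNN or expressed as a convolution like a CNN, and the two produce identical outputs.

```python
import numpy as np

# Toy discrete state space model (illustrative only):
#   h_t = A @ h_{t-1} + B * x_t   (the "box" that carries information forward)
#   y_t = C @ h_t                 (what the model emits at each step)
state_dim, seq_len = 4, 8
rng = np.random.default_rng(0)
A = 0.9 * np.eye(state_dim)              # how the box's contents decay/evolve
B = rng.normal(size=(state_dim, 1))      # how each new input enters the box
C = rng.normal(size=(1, state_dim))      # how the box is read out
x = rng.normal(size=(seq_len, 1))        # a toy input sequence

# Recurrent view (like an RNN): cheap, step-by-step -- good for generation.
h = np.zeros((state_dim, 1))
y_recurrent = []
for t in range(seq_len):
    h = A @ h + B * x[t, 0]
    y_recurrent.append((C @ h).item())

# Convolutional view (like a CNN): the same outputs can be computed in parallel
# by convolving x with a precomputed kernel K_t = C A^t B -- good for training.
K = np.array([(C @ np.linalg.matrix_power(A, t) @ B).item() for t in range(seq_len)])
y_convolutional = [sum(K[j] * x[t - j, 0] for j in range(t + 1)) for t in range(seq_len)]

print(np.allclose(y_recurrent, y_convolutional))  # True: two views, one model
```

The convolutional form is what makes training parallelizable, while the recurrent form is what keeps generation cheap at inference time.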
In addition to the advancements of SSMs, Mamba introduces two key innovations:
- Selection mechanism: Central to Mamba’s design is a unique selection mechanism that adapts the SSM’s parameters based on the input. In other words, Mamba can filter out less relevant data and focus on key information, much like a student approaching SAT reading comprehension with a strategy: skim the questions first to know what to look for, then read the passage, then answer each question while referring back to the source. (A toy sketch of this idea follows this list.)
- Hardware-aware algorithm: Because Mamba’s parameters change with each input, it cannot rely on the fast convolutional shortcut, so its authors designed an algorithm built around the memory hierarchy of modern GPUs, computing the model’s recurrent scan in fast on-chip memory rather than writing large intermediate states out to slower memory. This is reminiscent of Liquid Neural Networks, RNNs whose behavior adapts dynamically to avoid unnecessary processing. The result is an architecture that is significantly more efficient at processing long sequences than previous methods such as transformers.
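To give a flavor of the selection mechanism, here is a deliberately simplified sketch in which the SSM’s parameters are derived from each input, so the model can effectively ignore tokens it deems irrelevant. All names (W_B, W_C, W_gate) are hypothetical illustrations; a real implementation uses learned projections and the hardware-aware scan rather than a Python loop.

```python
import numpy as np

# Toy sketch of the "selection" idea (illustrative only, not Mamba's actual code):
# instead of fixed B and C, the model derives them from each input, so some
# tokens barely touch the state (filtered out) while others are written strongly.
state_dim, seq_len, d_in = 4, 6, 3
rng = np.random.default_rng(1)
A = 0.95 * np.eye(state_dim)
W_B = rng.normal(size=(state_dim, d_in))   # hypothetical projection: input -> B_t
W_C = rng.normal(size=(state_dim, d_in))   # hypothetical projection: input -> C_t
W_gate = rng.normal(size=(d_in,))          # hypothetical projection: input -> gate

x = rng.normal(size=(seq_len, d_in))
h = np.zeros(state_dim)
outputs = []
for t in range(seq_len):
    gate = 1.0 / (1.0 + np.exp(-W_gate @ x[t]))  # in [0, 1]: how much to "let in"
    B_t = W_B @ x[t]                              # input-dependent write direction
    C_t = W_C @ x[t]                              # input-dependent readout
    h = A @ h + gate * B_t                        # irrelevant tokens (gate ~ 0) are ignored
    outputs.append(C_t @ h)

print(np.round(outputs, 3))
```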
With the incorporation of these characteristics, Mamba is classified as a selective structured state space sequence model, an alliteration (SSSSS) mimicking the sound of a snake, hence the name for the architecture.
🤔 What is the significance of Mamba and what are its limitations?
Mamba has demonstrated exceptional performance across a wide range of domains, matching and even surpassing state-of-the-art transformer models. Rather than compressing every input into a fixed representation indiscriminately, which risks losing important context over time, Mamba controls whether and how each input is remembered. This means it can theoretically retain important information across millions of datapoints while keeping short-term details only as long as they are needed.
- Efficiency in handling long sequences: Mamba is particularly good at handling very long sequences of data, and its performance even shows promise on sequences up to a million datapoints long. In other words, Mamba can read to the end of an entire textbook and still be able to answer questions about the table of contents from the first page.
- Faster processing: Mamba generates output with up to 5x higher throughput than comparable transformers, an extremely valuable feature in real-time applications such as customer interactions.
- Versatility: Mamba maintains its quality across a variety of applications and modalities, including language, audio, and genomic data.
Mamba is still in the testing stages, and there are plenty of open questions about how it compares to the massive closed-source models developed by OpenAI and Anthropic.
- Still a proof of concept: As researchers and practitioners delve deeper into the capabilities of Mamba, we can anticipate further breakthroughs and/or stumbling blocks, making it an exciting prospect to track.
- Limited memory: While transformers can look back at every input they have received, albeit with hefty scaling costs, Mamba and SSMs compress everything into a fixed-size state, which acts as a “maximum size” for the box in which they store information. (See the back-of-the-envelope illustration after this list.)
- Viability for non-sequential data: Mamba has shown substantial promise for data that follows a sequence, such as natural language and audio, but it remains to be seen whether its performance delivers value for applications such as image recognition, which do not have an inherent ordering over time.
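As a rough back-of-the-envelope illustration of that “maximum size” concern (using assumed, illustrative numbers rather than measurements of any real model), compare how per-layer memory grows with context length for attention versus a fixed-size SSM state:

```python
# Illustrative numbers only; constants and exact architectures are ignored.
d_model, state_dim = 1024, 16

for seq_len in (1_000, 100_000, 1_000_000):
    kv_cache = seq_len * d_model     # attention caches grow with every token seen
    ssm_state = d_model * state_dim  # an SSM keeps one fixed-size state
    print(f"{seq_len:>9} tokens | attention cache ~{kv_cache:>13,} values"
          f" | SSM state ~{ssm_state:,} values")
```

The flip side of constant memory is that everything the model wants to remember has to fit inside that fixed-size state.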
🛠️ Applications of Mamba
Mamba is exciting for many generative AI use cases, given how many of them require modeling extremely long sequences.
- Natural language interaction: Faster processing and increased accuracy would make live conversations with chatbots or AI interfaces more responsive and accurate.
- Summarization and information retrieval: The longer context length of Mamba and SSMs could enable AI systems to operate over much larger bodies of information, such as an entire sales database, resulting in more accurate information retrieval and summarization functionalities.
- Audio generation: Applying Mamba as a backbone for speech generation systems could improve response times in voice systems, ultimately contributing to a more human-like interaction.
Stay up-to-date on the latest AI news by subscribing to Rudina’s AI Atlas.