April 3, 2024
5 mins

Jamba - A hybrid Transformer-Mamba Language Model

Jamba by AI21 Labs combines Transformer layers with Mamba (state space model) layers and adds mixture-of-experts (MoE) layers to get a compute-efficient model with high throughput.
Paper Link

Key takeaways:

  • Jamba is a new large language model (LLM) by AI21 Labs that uses a hybrid architecture combining Transformer and Mamba layers, along with a mixture-of-experts (MoE) component.
  • This hybrid design allows Jamba to achieve state-of-the-art performance on standard language model benchmarks while also supporting long context lengths (up to 256K tokens) and maintaining a manageable memory footprint.
  • Jamba is also highly efficient, with throughput up to 3x that of comparable models like Mixtral.
  • The Jamba architecture is flexible and allows for different configurations depending on hardware and performance requirements.
  • The release of Jamba under a permissive license encourages further exploration and optimization of this novel architecture by the community.

Introduction

Jamba is a new publicly available large language model based on a novel hybrid architecture. It combines Transformer layers (the architecture behind GPT-4 and most other popular LLMs such as Llama 2 and Mistral) with Mamba layers (a recent state space model) and mixture-of-experts (MoE) components.

Jamba's Hybrid Approach

Transformers are the most popular architecture for training LLMs; models like Mistral, Llama 2, and the GPT series are all transformer-based. However, using transformers has two big drawbacks:

  • High memory and compute requirements hinder the processing of long contexts, as the key-value (KV) cache becomes a limiting factor (see the back-of-envelope estimate after this list). Only recently have we seen long contexts in Gemini 1.5, which uses a sparser version of the transformer (a switch-transformer-style Mixture-of-Experts implementation).
  • Lack of a single summary state entails slow inference and low throughput, since each generated token performs a computation on the entire context.
  • Lack of a single summary state entails slow inference and low throughput, since each generated token performs a computation on the entire context.
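
To see why the KV cache bites, here is a rough back-of-envelope estimate for a Llama-2-7B-like transformer (assuming 32 layers, 32 KV heads of dimension 128, fp16 values) at a 256K-token context. The configuration is illustrative rather than an exact accounting, but it lines up with the roughly 128GB the Jamba paper reports for Llama-2 7B at this length, versus about 4GB for Jamba.

```python
# Back-of-envelope KV-cache size for a Llama-2-7B-like transformer
# (illustrative configuration, not an exact accounting of any specific model).
def kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                   context_len=256 * 1024, bytes_per_value=2):
    # keys + values (x2), cached per layer, per head, per token, in fp16
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value

print(kv_cache_bytes() / 2**30, "GiB")   # 128.0 GiB -- far beyond one 80GB GPU
```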

Recent state space models (SSMs) are more efficient to train, more capable at handling long-distance relationships, and require less compute, but they come with issues of their own: they do not offer the same performance as a similarly sized transformer architecture.

Jamba bridges this gap by strategically combining Transformer and Mamba layers in a certain ratio. Varying the ratio of Transformer/Mamba layers allows balancing memory usage, efficient training, and long context capabilities.

Jamba also includes MoE layers, which increase model capacity (the total number of available parameters) without increasing compute requirements (the number of active parameters). GPT-4, Google Gemini 1.5, Mixtral, and Claude are all reported to use MoE. In Jamba, MoE is applied to some of the MLP layers: the more MoE layers, and the more experts in each MoE layer, the larger the total number of model parameters.
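
As a concrete (if simplified) illustration of "total versus active parameters", here is a minimal top-k routed MoE MLP in PyTorch. This is a sketch of the general technique, not AI21's implementation; the 16 experts and top-2 routing mirror the configuration described in the paper, while everything else is a placeholder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEMLP(nn.Module):
    """Minimal top-k routed mixture-of-experts MLP (illustrative sketch)."""
    def __init__(self, d_model, d_hidden, n_experts=16, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.SiLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                          # x: (batch, seq, d_model)
        scores = self.router(x)                    # (batch, seq, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # normalize over the chosen experts
        out = torch.zeros_like(x)
        # Every parameter below is "available", but each token only activates
        # its top_k experts -- that is the total vs. active parameter gap.
        for e, expert in enumerate(self.experts):
            expert_out = expert(x)                 # dense for clarity; real MoE routes tokens
            for k in range(self.top_k):
                mask = (idx[..., k] == e).unsqueeze(-1).to(x.dtype)
                out = out + mask * weights[..., k:k + 1] * expert_out
        return out

moe = MoEMLP(d_model=64, d_hidden=256)
print(moe(torch.randn(2, 8, 64)).shape)   # torch.Size([2, 8, 64])
```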

Architecture

Figure 1:(a) A single Jamba block. (b) Different types of layers.

Jamba's architecture is built upon "Jamba blocks" that interleave Transformer and Mamba layers. Each layer within a block consists of either an attention module or a Mamba module, followed by a multi-layer perceptron (MLP).

The ratio of Transformer to Mamba layers within each block can be adjusted to prioritize different objectives. For instance, increasing the proportion of Mamba layers reduces memory requirements and improves throughput, especially for long sequences.

Furthermore, Jamba incorporates a mixture-of-experts (MoE) component, where some MLP layers are replaced with MoE layers. This allows for increased model capacity without significantly increasing compute requirements.

This flexible design empowers Jamba to adapt to different hardware and performance needs, making it a versatile solution for various applications.
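
To make the interleaving concrete, below is a minimal PyTorch sketch of a Jamba-style block. The Mamba mixer is a placeholder (a real implementation would use an actual SSM layer, for example from the `mamba_ssm` package), normalization and MLP details are simplified, and the 8-layer block with a 1:7 attention-to-Mamba ratio follows the released configuration described in the paper.

```python
import torch
import torch.nn as nn

class PlaceholderMamba(nn.Module):
    """Stand-in for a real Mamba SSM layer (e.g. the `mamba_ssm` package);
    here it is just a linear map so the sketch stays self-contained."""
    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        return self.proj(x)

class JambaLayer(nn.Module):
    """One layer: an attention or Mamba mixer, then an MLP, with residual connections."""
    def __init__(self, d_model, use_attention, n_heads=8):
        super().__init__()
        self.use_attention = use_attention
        self.norm1 = nn.LayerNorm(d_model)      # the paper uses RMSNorm
        self.mixer = (nn.MultiheadAttention(d_model, n_heads, batch_first=True)
                      if use_attention else PlaceholderMamba(d_model))
        self.norm2 = nn.LayerNorm(d_model)
        # In Jamba, the MLP in every other layer is replaced by a MoE MLP
        # (see the routing sketch earlier in this post).
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        h = self.norm1(x)
        if self.use_attention:
            h, _ = self.mixer(h, h, h, need_weights=False)
        else:
            h = self.mixer(h)
        x = x + h
        return x + self.mlp(self.norm2(x))

def jamba_block(d_model=512, n_layers=8, attn_every=8):
    # Released configuration per the paper: 8 layers per block with a 1:7
    # attention-to-Mamba ratio, i.e. one attention layer out of every eight.
    return nn.Sequential(*[
        JambaLayer(d_model, use_attention=(i % attn_every == attn_every - 1))
        for i in range(n_layers)
    ])

x = torch.randn(2, 16, 512)        # (batch, sequence length, d_model)
print(jamba_block()(x).shape)      # torch.Size([2, 16, 512])
```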

Advantages

Jamba's hybrid architecture offers several advantages over existing LLMs:

  • High Performance: Jamba achieves performance comparable to state-of-the-art Transformer-based models of similar size on standard language benchmarks.
  • Long Context Support: Jamba supports context lengths of up to 256K tokens, significantly exceeding the capabilities of other publicly available models. This allows Jamba to process and understand information within a much broader context, leading to more accurate and relevant responses.
  • High Throughput: Jamba boasts up to 3x higher throughput than comparable models, especially when dealing with long contexts. This translates to faster processing and improved efficiency for tasks requiring real-time responses.
  • Manageable Memory Footprint: Jamba's use of Mamba layers significantly reduces the memory required for the key-value cache compared to pure Transformer models. This enables Jamba to fit on a single 80GB GPU even when processing long texts, making it more accessible and cost-effective.
Fig: Jamba's performance on the needle-in-a-haystack task.

Potential Business Use Cases

Jamba's unique capabilities make it well-suited for various business applications that require long context handling and high throughput:

  • Content generation and all the existing use cases you would expect from an LLM.
  • Cost savings when deployed on an enterprise cloud due to manageable memory footprint.
  • High performance in similar scenarios compared to transformer-based models.

I mean, this is another language model and would have similar applications. Enterprises should get excited about the memory footprint part, given it is clearly going to bring down their GPU costs, at least for inference and hopefully for training too, though the paper makes no mention of that.

In Context Learning

In LLMs, in-context learning is when you pass a context as part of the prompt, in a zero-shot or few-shot setting, and the model answers based on the context provided. In pure Mamba models, this does not work well (see below). The hybrid architecture enables businesses to deploy a Mamba-based model that can capture and preserve long-distance relationships while saving on compute and training costs.
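
For readers new to the term, here is a toy few-shot prompt. Nothing in it is specific to Jamba; it simply shows the kind of "learning from the prompt" that the hybrid architecture preserves.

```python
# A toy few-shot prompt: the "learning" happens entirely in the context window.
few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: The battery lasts all day and the screen is gorgeous.
Sentiment: Positive

Review: It stopped working after a week and support never replied.
Sentiment: Negative

Review: Setup took five minutes and everything just worked.
Sentiment:"""

# A model with working in-context learning should complete this with "Positive"
# without any fine-tuning; pure SSMs tend to struggle with exactly this.
```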

Why does this work?

The combination of Transformer and Mamba layers in Jamba works well because it leverages the strengths of both architectures.

Transformers excel at capturing long-distance relationships within text, which is crucial for tasks like question answering and summarization. However, they can be computationally expensive and memory-intensive, especially for long contexts.

Mamba layers, on the other hand, are more efficient and handle long contexts well, but may not be as good at capturing long-distance relationships and struggle with in-context learning. That is because the SSM lacks an attention mechanism.

Effect of MoE

MoE improves Transformer language models while keeping compute manageable. While there is no conclusive evidence that it works well for SSMs, the paper shows the effect of mixture-of-experts on the hybrid model: MoE improves the performance of the hybrid Attention-Mamba architecture at large scale (7B parameters trained on 50B tokens).

In summary, by combining these two types of layers, Jamba gets the best of both worlds: it can capture long-distance relationships effectively while also being efficient and able to handle long contexts. With MoE, the sparsity improves the performance of a hybrid model across all benchmarks and keeps the compute manageable.

Conclusion

Jamba represents a significant step forward in LLM development, demonstrating the potential of hybrid architectures to overcome the limitations of existing models. Its ability to handle long contexts, maintain high throughput, and deliver state-of-the-art performance makes it a powerful tool for various business applications.

PS: Jamba supports a context length of 256K tokens.
