Jamba is a new publicly available large language model built on a novel hybrid architecture. It combines Transformer layers (the basis of GPT-4 and its predecessors, as well as most other popular LLMs such as Llama 2 and Mistral) with Mamba layers (a recent state space model) and Mixture-of-Experts (MoE) components.
Transformers are the most popular architecture for LLMs; Mistral, Llama 2, and the GPT series are all Transformer-based. Using Transformers, however, has two big drawbacks: a large memory footprint, because the key-value (KV) cache that attention maintains grows with context length, and slow inference on long contexts, because every generated token must attend over the entire context.
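As a rough, purely illustrative sketch of the compute side (the sequence lengths below are arbitrary, not Jamba's): processing a prompt with full attention compares every token with every other token, so the work grows quadratically with context length.

```python
# Illustrative only: full attention compares every token with every other token,
# so the work per layer grows quadratically with context length.
for seq_len in (4_096, 32_768, 262_144):
    print(f"{seq_len:>7} tokens -> {seq_len**2:.2e} token-pair comparisons per attention layer")
```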
Recent state space models (SSMs) such as Mamba are more efficient to train and handle long contexts with far less compute and memory, but they come with issues of their own: they do not offer the same performance as a similarly sized Transformer.
Jamba bridges this gap by interleaving Transformer and Mamba layers in a chosen ratio (one attention layer for every seven Mamba layers in the released model). Varying the Transformer/Mamba ratio allows balancing memory usage, training efficiency, and long-context capability.
Jamba also includes MoE layers, which increase model capacity (the total number of parameters) without increasing compute requirements (the number of active parameters per token). MoE is used by other prominent models such as Mixtral, and is widely reported to underlie GPT-4 and Google Gemini 1.5. In Jamba, MoE is applied to some of the MLP layers: the more MoE layers, and the more experts in each MoE layer, the larger the total number of model parameters.
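A toy calculation makes the capacity-vs-compute distinction concrete. The sketch below counts only MLP parameters; the model dimensions are hypothetical, while the 16-experts, top-2, MoE-every-other-layer pattern matches what the paper reports for Jamba.

```python
def mlp_param_counts(n_layers, d_model, d_ff, n_experts=16, top_k=2, moe_every=2):
    """Rough MLP-only parameter count for a decoder stack where every
    `moe_every`-th MLP is an MoE layer with `n_experts` experts, of which
    `top_k` are active per token (attention/Mamba parameters ignored)."""
    per_mlp = 2 * d_model * d_ff                 # up- and down-projection of one MLP
    n_moe = n_layers // moe_every
    n_dense = n_layers - n_moe
    total = n_dense * per_mlp + n_moe * n_experts * per_mlp
    active = n_dense * per_mlp + n_moe * top_k * per_mlp
    return total, active

# Hypothetical dimensions, just to show the gap between total and active parameters.
total, active = mlp_param_counts(n_layers=32, d_model=4096, d_ff=14336)
print(f"total: {total/1e9:.1f}B   active per token: {active/1e9:.1f}B")
```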
Figure 1: (a) A single Jamba block. (b) Different types of layers.
Jamba's architecture is built from "Jamba blocks" that interleave Transformer and Mamba layers. Each block stacks several layers, and each layer contains either an attention or a Mamba module, followed by a multi-layer perceptron (MLP).
The ratio of Transformer to Mamba layers within each block can be adjusted to prioritize different objectives. For instance, increasing the proportion of Mamba layers shrinks the KV cache (only attention layers need one) and improves throughput, especially for long sequences, as the sketch below illustrates.
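The memory argument is easy to see from the KV cache alone. The shapes below are hypothetical and only meant to show how the cache shrinks when just one layer in eight uses attention.

```python
def kv_cache_gb(n_attn_layers, seq_len, n_kv_heads=8, head_dim=128, dtype_bytes=2, batch=1):
    """GB needed to cache keys and values (hence the factor of 2) for all attention layers."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * dtype_bytes * batch / 1e9

SEQ_LEN, TOTAL_LAYERS = 256_000, 32
print(f"all-attention model : {kv_cache_gb(TOTAL_LAYERS, SEQ_LEN):.1f} GB")
print(f"1-in-8 attention    : {kv_cache_gb(TOTAL_LAYERS // 8, SEQ_LEN):.1f} GB")
```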
Furthermore, Jamba incorporates a mixture-of-experts (MoE) component, where some MLP layers are replaced with MoE layers. This allows for increased model capacity without significantly increasing compute requirements.
This flexible design empowers Jamba to adapt to different hardware and performance needs, making it a versatile solution for various applications.
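To make the block structure concrete, here is a minimal structural sketch in PyTorch. It is not the official implementation: the Mamba module is replaced by a GRU purely as a self-contained stand-in sequence mixer, causal masking is omitted, and the MoE MLPs are represented by plain MLPs with a comment marking where they would go. The 8-layer block with a single attention layer mirrors the 1:7 attention:Mamba ratio described in the paper; the exact position of the attention layer here is arbitrary.

```python
import torch
import torch.nn as nn

class JambaStyleLayer(nn.Module):
    """One layer in the Jamba sense: a sequence mixer (attention or a stand-in
    for Mamba) followed by an MLP, each with pre-norm and a residual connection.
    The GRU is NOT Mamba; it is only a placeholder to keep the sketch runnable."""
    def __init__(self, d_model, use_attention, n_heads=8, ff_mult=4):
        super().__init__()
        self.use_attention = use_attention
        self.norm1 = nn.LayerNorm(d_model)
        if use_attention:
            # Causal masking is omitted to keep the sketch short.
            self.mixer = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        else:
            self.mixer = nn.GRU(d_model, d_model, batch_first=True)  # placeholder for Mamba
        self.norm2 = nn.LayerNorm(d_model)
        # In Jamba, every other layer's MLP is replaced by an MoE layer; a plain MLP is used here.
        self.mlp = nn.Sequential(
            nn.Linear(d_model, ff_mult * d_model),
            nn.GELU(),
            nn.Linear(ff_mult * d_model, d_model),
        )

    def forward(self, x):
        h = self.norm1(x)
        if self.use_attention:
            h, _ = self.mixer(h, h, h, need_weights=False)
        else:
            h, _ = self.mixer(h)
        x = x + h                           # residual around the sequence mixer
        return x + self.mlp(self.norm2(x))  # residual around the MLP

class JambaStyleBlock(nn.Module):
    """A block of n_layers layers with a single attention layer, mirroring the
    1:7 attention:Mamba ratio when n_layers=8."""
    def __init__(self, d_model=512, n_layers=8, attn_index=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [JambaStyleLayer(d_model, use_attention=(i == attn_index)) for i in range(n_layers)]
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

x = torch.randn(2, 16, 512)           # (batch, sequence, d_model)
print(JambaStyleBlock()(x).shape)     # torch.Size([2, 16, 512])
```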
Jamba's hybrid architecture offers several advantages over existing LLMs: it handles very long contexts (up to 256K tokens), it keeps a much smaller memory footprint than a pure Transformer of similar size because only a fraction of its layers maintain a KV cache, and it delivers higher throughput on long sequences while remaining competitive on standard benchmarks.
Jamba's unique capabilities make it well-suited for business applications that require long-context handling and high throughput.
I mean, this is another language model, so it would have similar applications. Enterprises should get excited about the smaller memory footprint, since it is clearly going to bring down their GPU costs, at least for inference, and hopefully for training too, though the paper makes no mention of that.
In LLMs, in-context learning means passing context as part of the prompt, in zero-shot or few-shot form, so the model can answer based on that context alone, without any weight updates; a minimal example follows. Pure Mamba models struggle with this (see below). Jamba's hybrid architecture lets businesses deploy a Mamba-based model that can still capture and preserve long-distance relationships, while saving on compute and training costs.
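For readers unfamiliar with the term, here is what a few-shot, in-context prompt looks like (the reviews are made up for illustration):

```python
# The "learning" happens entirely inside the prompt: the model sees a few labeled
# examples and is expected to continue the pattern, with no weight updates.
prompt = """Classify the sentiment of each review as Positive or Negative.

Review: The battery dies within an hour.
Sentiment: Negative

Review: Setup took two minutes and it just works.
Sentiment: Positive

Review: The screen cracked during the first week.
Sentiment:"""
# A model with working in-context learning should complete this with "Negative".
```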
The combination of Transformer and Mamba layers in Jamba works well because it leverages the strengths of both architectures.
Transformers excel at capturing long-distance relationships within text, which is crucial for tasks like question answering and summarization. However, they can be computationally expensive and memory-intensive, especially for long contexts.
Mamba layers, on the other hand, are more efficient and handle long contexts better, but they are weaker at capturing long-distance relationships and struggle with in-context learning. This is because the SSM has no attention mechanism: the entire history is compressed into a fixed-size state, so the model cannot look back at individual tokens in the prompt. The sketch below illustrates this.
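A tiny (non-selective) linear SSM recurrence makes the point: the whole history is folded into a fixed-size state vector, which is why memory does not grow with context length, but also why there is no attention-style lookup over individual past tokens. The matrices below are random placeholders, not trained parameters.

```python
import numpy as np

def linear_ssm(x, A, B, C):
    """Minimal linear state space recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.
    The state h has a fixed size regardless of how long the input sequence is."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                 # x: (seq_len, d_in)
        h = A @ h + B @ x_t       # update the fixed-size hidden state
        ys.append(C @ h)          # read out from the state only, never from past tokens
    return np.stack(ys)

rng = np.random.default_rng(0)
y = linear_ssm(rng.normal(size=(10, 4)),          # 10 tokens of dimension 4
               A=0.9 * np.eye(8),                 # 8-dimensional state
               B=rng.normal(size=(8, 4)),
               C=rng.normal(size=(2, 8)))
print(y.shape)  # (10, 2)
```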
MoE is known to improve Transformer language models while keeping compute manageable. There is less evidence on how well it works for SSMs, but the paper studies its effect on the hybrid model: MoE improves the performance of the hybrid Attention-Mamba architecture at large scale (7B parameters trained on 50B tokens).
In summary, by combining these two types of layers, Jamba gets the best of both worlds: it captures long-distance relationships effectively while remaining efficient and able to handle long contexts. With MoE, sparsity further improves the hybrid model's performance across all benchmarks while keeping compute manageable.
Jamba represents a significant step forward in LLM development, demonstrating the potential of hybrid architectures to overcome the limitations of existing models. Its ability to handle long contexts, maintain high throughput, and deliver state-of-the-art performance makes it a powerful tool for various business applications.
PS: Jamba supports a context length of 256K tokens.