January 21, 2025
7 mins

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

Explore the Mixture-of-Transformers (MoT) architecture, a novel approach for creating scalable multi-modal foundation models. MoT reduces pretraining costs while maintaining performance in text, image, and speech generation tasks.

Key Takeaways

  • Mixture-of-Transformers (MoT) is a sparse multi-modal architecture that reduces pre-training computational costs.
  • MoT decouples non-embedding parameters of the model by modality.
  • MoT uses global self-attention over the full input sequence to capture cross-modal relationships.
  • In Chameleon, MoT matched the dense baseline performance using only 55.8% of the FLOPs.
  • When extended to speech, MoT achieves comparable speech performance with only 37.2% of the FLOPs.
  • In Transfusion, MoT achieves the dense baseline's performance in the image modality at a third of the FLOPs.
  • System profiling demonstrated significant wall-clock training time reduction using MoT.
  • The ablation studies highlight that the feed-forward layers and attention weights, rather than the layer norms, contribute most of MoT's performance benefits.
  • Combining MoT with MoE-4x further speeds up text training while maintaining image quality.
  • The approach also increases efficiency in multi-objective settings, and offers a path to handle diverse modalities.

Introduction

The field of large language models (LLMs) is rapidly evolving towards multi-modal systems, capable of processing and generating text, images, and speech. This expansion, however, introduces significant computational challenges. Training these multi-modal models requires larger datasets and more resources than text-only LLMs. To address these scaling issues, the paper introduces Mixture-of-Transformers (MoT), a novel sparse architecture designed to reduce pretraining costs while maintaining performance. MoT achieves this by decoupling non-embedding parameters of the model by modality.

Relevant Background Work

Foundation models capable of handling multiple modalities are gaining prominence. Early efforts focused on understanding rather than generating multi-modal content, using late-fusion techniques to merge independently encoded representations. However, the need for models that can also generate content across modalities is clear. This has led to unified models in which data from different modalities are tokenized and processed in the same way: images, for example, can be tokenized into discrete sequences, treated like text, and processed by an autoregressive sequence model. The paper also touches on sparsity approaches such as Mixture-of-Experts (MoE), which activates a subset of experts per token based on a learned router. In multi-modal contexts, MoE has been applied to specific layers or modules. The paper argues that a simple rule-based routing by modality outperforms learned routing.
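To make the contrast concrete, the sketch below compares an MoE-style learned router with the rule-based dispatch MoT relies on. It is a minimal PyTorch illustration; the function and argument names are assumptions, not code from the paper.

```python
import torch

def learned_route(tokens, router_weight):
    """MoE-style learned routing: a router scores each token against every
    expert and the highest-scoring expert handles it (top-1 shown)."""
    # tokens: (num_tokens, dim); router_weight: (dim, num_experts)
    scores = tokens @ router_weight
    return scores.argmax(dim=-1)          # expert index chosen per token

def modality_route(modality_ids):
    """MoT-style rule-based routing: the 'expert' is simply the token's
    modality label (e.g. 0 = text, 1 = image, 2 = speech); nothing is learned."""
    return modality_ids
```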

Method: Mixture-of-Transformers Architecture

Background: Foundation Models for Multi-Modal Generation

Modern multi-modal models often tokenize different data types into discrete sequences, allowing for unified processing with autoregressive sequence models. This approach extends the text-based paradigm to modalities like images and speech. Tokenization enables the application of similar architectures for multiple modalities. However, feature space analysis reveals that modalities cluster separately, even though their inputs are processed as uniform tokens. This clustering highlights the need for specialized processing.
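As a toy illustration of this unified tokenization, the snippet below concatenates made-up text and image token IDs into one sequence and records each token's modality, which is the only signal MoT later needs to select parameters. The IDs and vocabulary offset are invented for the example.

```python
import torch

# Hypothetical example: text tokens from a BPE vocabulary and image tokens
# from a discrete image tokenizer, merged into one shared vocabulary.
text_tokens = torch.tensor([101, 2057, 2293, 4263])     # made-up text IDs
image_tokens = torch.tensor([8192 + 17, 8192 + 993])    # image codes offset into the vocab

tokens = torch.cat([text_tokens, image_tokens])
# A parallel tensor labels each token's modality (0 = text, 1 = image).
modality_ids = torch.cat([torch.zeros(4, dtype=torch.long),
                          torch.ones(2, dtype=torch.long)])
```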

Mixture-of-Transformers Architecture: Modality-Specific Parameter Decoupling

MoT introduces modality-specific weights for non-embedding transformer parameters including feed-forward networks (FFNs), attention matrices, and layer normalization. This allows the model to dynamically process each modality with tailored parameters. A key aspect of MoT is its global self-attention mechanism, which captures cross-modal relationships despite the modality-specific parameter decoupling.
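A minimal PyTorch sketch of one MoT-style block follows. The class name, shapes, and the per-modality loop are illustrative assumptions, not the paper's implementation; it only shows the idea of modality-specific weights combined with global self-attention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoTBlock(nn.Module):
    """Sketch of a Mixture-of-Transformers block: the FFN, attention
    projections, and layer norms are duplicated per modality, while
    self-attention runs globally over the full mixed-modal sequence."""

    def __init__(self, dim, n_heads, n_modalities):
        super().__init__()
        self.n_heads = n_heads  # assumes dim is divisible by n_heads
        self.qkv = nn.ModuleList(nn.Linear(dim, 3 * dim) for _ in range(n_modalities))
        self.proj = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_modalities))
        self.norm1 = nn.ModuleList(nn.LayerNorm(dim) for _ in range(n_modalities))
        self.norm2 = nn.ModuleList(nn.LayerNorm(dim) for _ in range(n_modalities))
        self.ffn = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_modalities))

    def _per_modality(self, modules, x, modality_ids, out_dim):
        # Route each token through its own modality's module, preserving order.
        out = x.new_zeros(*x.shape[:-1], out_dim)
        for m, module in enumerate(modules):
            mask = modality_ids == m
            if mask.any():
                out[mask] = module(x[mask])
        return out

    def forward(self, x, modality_ids):
        # x: (batch, seq, dim); modality_ids: (batch, seq) integer labels.
        B, T, D = x.shape
        h = self._per_modality(self.norm1, x, modality_ids, D)
        q, k, v = self._per_modality(self.qkv, h, modality_ids, 3 * D).chunk(3, dim=-1)
        split = lambda t: t.view(B, T, self.n_heads, D // self.n_heads).transpose(1, 2)
        # Global causal self-attention over all tokens, regardless of modality.
        attn = F.scaled_dot_product_attention(split(q), split(k), split(v), is_causal=True)
        attn = attn.transpose(1, 2).reshape(B, T, D)
        x = x + self._per_modality(self.proj, attn, modality_ids, D)
        h = self._per_modality(self.norm2, x, modality_ids, D)
        return x + self._per_modality(self.ffn, h, modality_ids, D)
```

The Python loop over modalities is only for readability; in practice tokens would be grouped by modality so each parameter set runs as a single batched matrix multiply (see the systems discussion later in this post).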

Experiments

Results Overview

The paper evaluates MoT in three multi-modal experiment settings:

  1. Chameleon: Autoregressive text and image generation
  2. Chameleon+Speech: Adding speech as a third modality
  3. Transfusion: Autoregressive text and diffusion-based image generation

MoT's performance is compared to dense transformer models and MoE-4x models. Crucially, all models are designed to maintain equivalent FLOPs for both training and testing.

Performance in the Chameleon Setting: Autoregressive Objectives for Text and Image Generation

Experiment Setup

The paper uses the same mixed-modal training data and tokenizers as the Chameleon model, consisting of text and image tokens. Evaluation uses validation losses on the Obelisc, MS-COCO, Flickr30k, and Shutterstock datasets. Model sizes range from 37M to 7B parameters, scaling up the number of layers and hidden dimensions. Training configurations control the number of tokens, batch sizes, and GPU count for a fair comparison. The Mixture-of-Experts baseline uses a state-of-the-art routing method called Expert Choice (EC). EC routing has some limitations, however, such as information leakage during validation and uneven token distribution among experts.
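For reference, the snippet below sketches the core of Expert Choice routing under simplifying assumptions (a single flat batch of tokens; names are illustrative, not from the paper's code). Because each expert ranks all tokens jointly, one token's assignment depends on the others in the same sequence, which is the source of the validation-time information leakage mentioned above.

```python
import torch

def expert_choice_route(tokens, router_weight, capacity):
    """Expert Choice routing sketch: each expert selects its top-`capacity`
    tokens by routing score, instead of each token picking an expert.

    tokens: (num_tokens, dim); router_weight: (dim, num_experts).
    Returns a list of token-index tensors, one per expert.
    """
    scores = torch.softmax(tokens @ router_weight, dim=-1)  # (num_tokens, num_experts)
    # For each expert (column), take the indices of its top-scoring tokens.
    _, token_idx = scores.topk(capacity, dim=0)             # (capacity, num_experts)
    return [token_idx[:, e] for e in range(token_idx.size(1))]
```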

Accelerated Pre-Training at 7B Scale

The MoT model demonstrated significant pre-training acceleration, reaching the dense model's final loss in half the time. Analysis showed that MoT required only 45.5% of the training steps to achieve a similar pre-training loss compared to the dense transformer model. In particular, MoT required only 34.8% of training steps to achieve the same loss in image modality.

Performance Across Multiple Model Scales

MoT consistently delivers significant speedups in the image modality across multiple scales from 37M to 7B, outperforming both dense and MoE-4x baselines. While MoE-4x shows diminishing returns as the model size increases, MoT maintains a consistent performance advantage. For the text modality, both MoT and MoE-4x outperform the dense model, with MoT showing slightly better gains.

Extending to a Third Modality: Chameleon Text+Image+Speech Results

Experiment Setup

This setting adds speech, using the Spirit-LM dataset. Speech input is converted to discrete tokens using a variant of the DinoSR tokenizer. The models combine the speech training data with the text-and-image training data at a 1:6 ratio. Evaluation includes losses on the held-out sets of LibriLight (LL60K) and the People's Speech dataset (PPL30K).
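One simple way to realize such a 1:6 mix is to draw each training batch from the speech corpus with probability 1/7. The generator below is a hypothetical sketch of that idea, not the paper's data pipeline.

```python
import random

def mixed_batches(speech_batches, text_image_batches, speech_prob=1 / 7, seed=0):
    """Yield batches so that roughly 1 in 7 comes from speech (a 1:6 ratio).
    Both arguments are assumed to be (infinite) batch iterators."""
    rng = random.Random(seed)
    while True:
        if rng.random() < speech_prob:
            yield next(speech_batches)
        else:
            yield next(text_image_batches)
```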

Performance with Speech Integration at 7B Scale

MoT substantially accelerates pre-training when compared to dense and MoE-4x in the speech modality. MoT required only 22.9% of the training steps to achieve the same pretraining loss as a dense model. MoT maintains efficiency across all three modalities, achieving comparable validation losses to the dense model's final loss, at only 55.8% of the training steps.

Scalability Across Model Sizes

MoT shows consistent acceleration across all three modalities, with notable improvements in speech processing across model sizes from 443M to 1.5B. The MoT models required only 15.1% to 33.6% of the training steps to match the dense model's training loss in speech. In contrast, MoE-4x showed inferior performance, particularly in speech validation.

Multi-Objective Training in the Transfusion Setting: Autoregressive Text and Diffusion-Based Image Generation

Experiment Setup

This setting explores multi-objective training, with an autoregressive objective for text and a diffusion objective for images. The paper uses the same text setup and Llama 2 tokenizer corpus. Images are encoded into latent patches using a VAE. Data is sampled at a 1:1 text-to-image ratio, and models are trained at sizes from 163M to 7B parameters. Text-to-text evaluation uses the Wikipedia and C4 corpora, while text-to-image tasks use held-out Conceptual 12M and MS-COCO benchmarks.
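A hedged sketch of what such a multi-objective loss could look like: next-token cross-entropy on text tokens plus a noise-prediction (diffusion) loss on image latents. The function name and weighting are illustrative assumptions, not the paper's exact formulation.

```python
import torch.nn.functional as F

def transfusion_style_loss(text_logits, text_targets, pred_noise, true_noise,
                           image_loss_weight=1.0):
    """Combine an autoregressive text objective with a diffusion image objective.

    text_logits: (B, T, vocab); text_targets: (B, T) next-token labels.
    pred_noise / true_noise: model-predicted vs. sampled noise on image latents.
    """
    lm_loss = F.cross_entropy(text_logits.reshape(-1, text_logits.size(-1)),
                              text_targets.reshape(-1))
    diffusion_loss = F.mse_loss(pred_noise, true_noise)
    return lm_loss + image_loss_weight * diffusion_loss
```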

Mixture of Transformers Enhances Multi-Objective Training Efficiency

MoT shows significant pre-training acceleration in the image modality, needing only about 30% of the training steps of the dense baseline, and improves CLIP, FID, and CIDEr scores. Text performance, however, does not improve much in the Transfusion setting; the paper suggests the decoupling of objectives may play a role in this. At smaller scales, a 760M-parameter MoT model outperforms a 1.4B-parameter dense model while using roughly half the FLOPs.

Scalability Across Model Sizes

Across multiple scales (163M, 760M, 1.4B), MoT consistently outperforms the baselines in image generation and also generalizes better on captioning tasks. Even with an 8x model-size gap, the 163M MoT model achieves a training loss in the image modality comparable to the 1.4B dense model, showcasing its strength in image processing.

Performance with Fine-tuning

Fine-tuning both MoT and dense models shows that MoT achieves better quality and faithfulness in the generated text and images, indicating that MoT's gains carry over to fine-tuning. The paper also demonstrates image-editing capabilities with the fine-tuned models.

Impact of Modality Untying in Different Transformer Components

The paper investigates the impact of untying parameters by modality within different transformer components. Untying the feed-forward modules yields a considerable performance gain, and further untying the Q, K, and V projections adds to it, while untying the layer-norm parameters has negligible impact.
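The ablation can be pictured as a set of untying switches. The config object below is purely illustrative: the names and defaults are assumptions that reflect the reported findings, not the paper's configuration format.

```python
from dataclasses import dataclass

@dataclass
class UntyingConfig:
    """Which transformer components get modality-specific copies (True)
    versus staying shared across modalities (False)."""
    untie_ffn: bool = True              # largest observed gain
    untie_attention_qkv: bool = True    # additional gain on top of FFN untying
    untie_layer_norm: bool = False      # negligible impact in the ablations

full_mot = UntyingConfig(untie_ffn=True, untie_attention_qkv=True, untie_layer_norm=True)
ffn_only = UntyingConfig(untie_ffn=True, untie_attention_qkv=False, untie_layer_norm=False)
```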

Modality Separation in MoT: Leave-One-Out Analysis

The paper evaluates how separating the modality towers affects performance through leave-one-out (LOO) experiments. The best results come from fully separate towers, and performance degrades in configurations where modalities are mixed, indicating the importance of modality-specific towers.

ML Systems Aspects of Mixture-of-Transformers

Throughput Scaling Properties

The MoT architecture has a lower parameter-to-FLOP ratio than MoE, because its parameter count grows with the number of modalities rather than with a fixed multiple of experts per layer. MoT incurs overhead from grouping tokens by modality, while MoE incurs overhead from its routing computation; both overheads can be reduced with careful engineering.
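To illustrate the token-grouping overhead, the sketch below gathers tokens by modality, applies the corresponding feed-forward network, and scatters the results back into sequence order; the gather/scatter indexing is the extra work a dense block avoids. The function and argument names are illustrative.

```python
import torch

def grouped_ffn(x, modality_ids, ffns):
    """Group tokens by modality, run each group through its own FFN, and
    scatter the results back into sequence order.

    x: (B, T, dim); modality_ids: (B, T); ffns: one dim->dim module per modality.
    Assumes every modality id in the batch has a corresponding FFN.
    """
    flat = x.reshape(-1, x.size(-1))                 # (B*T, dim)
    ids = modality_ids.reshape(-1)
    out = torch.zeros_like(flat)
    for m, ffn in enumerate(ffns):
        idx = (ids == m).nonzero(as_tuple=True)[0]   # token positions for modality m
        if idx.numel():
            out.index_copy_(0, idx, ffn(flat.index_select(0, idx)))
    return out.view_as(x)
```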

Empirical Analysis

The paper investigates MoT's horizontal scaling by varying the GPU count and finds that its benefits grow as more GPUs are used. In terms of wall-clock time, MoT delivers a considerable reduction in GPU training time to reach a given level of model performance.

Business and Research Implications

The implications of this work are significant in the field of multi-modal AI. By demonstrating a sparse architecture that maintains or even enhances performance while significantly reducing computational costs, this paper paves the way for:

More Accessible AI

Smaller computational footprints mean these models can be trained by a wider range of research groups and businesses, democratizing access to advanced AI.

Faster Innovation

Speedups in training times can significantly reduce the development cycle for complex multi-modal systems, leading to faster innovation.

Practical Applications

The scalability of MoT suggests potential for widespread use in resource-constrained environments, such as edge devices or mobile applications.

Better Large Scale Training

The system-profiling studies reveal that MoT leads to practical benefits, including reduction in wall-clock training time for both text and image tasks.

What Makes This Work?

The paper's unique insight lies in modality-specific parameter decoupling while preserving global self-attention. This approach taps into the inherent differences in how modalities are processed by the model. Key observations that drive the performance:

Modality Clustering

Different modalities occupy distinct regions of the feature space, even though their inputs are uniform tokens.

Modality-Specific Processing

Applying unique parameters and processing strategies for each modality allows for better utilization of model capacity.

Global Self-Attention

Maintaining a global self-attention mechanism enables effective cross-modal interaction without significantly increasing parameters. This combines the benefits of sparsity with full cross-modal interaction, allowing MoT to scale efficiently to more modalities and larger models.
