The field of large language models (LLMs) is rapidly evolving towards multi-modal systems capable of processing and generating text, images, and speech. This expansion, however, introduces significant computational challenges: training multi-modal models requires larger datasets and more compute than text-only LLMs. To address these scaling issues, the paper introduces Mixture-of-Transformers (MoT), a novel sparse architecture designed to reduce pretraining costs while maintaining performance. MoT achieves this by decoupling the model's non-embedding parameters by modality.
Foundation models capable of handling multiple modalities are gaining prominence. Early efforts focused on understanding rather than generating multi-modal content, using late-fusion techniques to merge independently encoded representations. However, there is a clear need for models that can also generate content across modalities. This has led to unified models in which data from different modalities are tokenized and processed in the same way: images, for example, can be tokenized into discrete sequences, treated like text, and processed by an autoregressive sequence model. The paper also touches on sparsity approaches, including Mixture-of-Experts (MoE), which activates a subset of experts per token based on a learned router. In multi-modal contexts, MoE has been applied to specific layers or modules. The paper argues that a simple rule-based routing by modality outperforms learned routing.
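To make that contrast concrete, here is a toy sketch of the two routing styles (illustrative function names, not code from the paper):

```python
import torch

def learned_top1_route(x, router_weight):
    # MoE-style: a learned router scores experts for every token and picks the best one.
    scores = x @ router_weight            # (n_tokens, n_experts)
    return scores.argmax(dim=-1)          # expert index is data-dependent and learned

def modality_route(modality_ids):
    # MoT-style: routing is a fixed rule -- a token's modality tag is its "expert" index.
    return modality_ids

x = torch.randn(6, 16)                              # 6 tokens, hidden size 16
router_weight = torch.randn(16, 3)                  # 3 experts
modality_ids = torch.tensor([0, 0, 1, 1, 2, 2])     # e.g. 0=text, 1=image, 2=speech
print(learned_top1_route(x, router_weight))         # varies with the learned weights
print(modality_route(modality_ids))                 # deterministic, nothing to train
```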
Modern multi-modal models often tokenize different data types into discrete sequences, allowing unified processing with autoregressive sequence models and extending the text-based paradigm to modalities such as images and speech. However, feature-space analysis reveals that the modalities cluster separately, even though their inputs are processed as uniform tokens. This clustering suggests that each modality benefits from specialized processing.
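One way to probe such clustering empirically is to compare hidden-state similarity within versus across modalities; the sketch below is a toy illustration of that idea (function name and metric are my own choices, not the paper's analysis):

```python
import torch
import torch.nn.functional as F

def modality_separation(hidden, modality_ids):
    """Compare average cosine similarity of hidden states within each modality
    against similarity across modalities.

    hidden: (n_tokens, d) activations taken from some transformer layer
    modality_ids: (n_tokens,) integer modality tag per token
    """
    h = F.normalize(hidden, dim=-1)
    sims = h @ h.T                                        # pairwise cosine similarities
    same = modality_ids[:, None] == modality_ids[None, :]
    eye = torch.eye(len(h), dtype=torch.bool)
    within = sims[same & ~eye].mean()                     # same-modality pairs (no self-pairs)
    across = sims[~same].mean()                           # cross-modality pairs
    return within.item(), across.item()                   # within >> across suggests separate clusters
```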
MoT introduces modality-specific weights for the non-embedding transformer parameters, including the feed-forward networks (FFNs), attention projection matrices, and layer normalization. Each token is thus processed with parameters tailored to its modality. A key aspect of MoT is its global self-attention mechanism, which captures cross-modal relationships despite the modality-specific parameter decoupling.
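A minimal PyTorch sketch of this design, written to match the description above (class and argument names such as `MoTBlock` and `modality_ids` are hypothetical, not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoTBlock(nn.Module):
    """One Mixture-of-Transformers block: every non-embedding parameter group
    (QKV/output projections, FFN, layer norms) has one copy per modality, while
    self-attention itself runs globally over the full mixed-modal sequence."""

    def __init__(self, d_model: int, n_heads: int, n_modalities: int):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.ModuleList([nn.Linear(d_model, 3 * d_model) for _ in range(n_modalities)])
        self.out = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_modalities)])
        self.ffn = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_modalities)
        ])
        self.norm1 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_modalities)])
        self.norm2 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_modalities)])

    def _route(self, modules, x, modality_ids):
        # Rule-based "routing": each token is processed by the module of its modality.
        pieces = [(modality_ids == m, mod(x[modality_ids == m])) for m, mod in enumerate(modules)]
        out = x.new_zeros(*x.shape[:-1], pieces[0][1].shape[-1])
        for mask, y in pieces:
            out[mask] = y
        return out

    def forward(self, x, modality_ids):
        # x: (batch, seq, d_model); modality_ids: (batch, seq) integer tag per token.
        h = self._route(self.norm1, x, modality_ids)
        q, k, v = self._route(self.qkv, h, modality_ids).chunk(3, dim=-1)

        def heads(t):  # (batch, seq, d) -> (batch, n_heads, seq, d_head)
            b, s, d = t.shape
            return t.view(b, s, self.n_heads, d // self.n_heads).transpose(1, 2)

        # Global causal self-attention over the entire mixed-modal sequence.
        attn = F.scaled_dot_product_attention(heads(q), heads(k), heads(v), is_causal=True)
        attn = attn.transpose(1, 2).reshape(x.shape)
        x = x + self._route(self.out, attn, modality_ids)
        x = x + self._route(self.ffn, self._route(self.norm2, x, modality_ids), modality_ids)
        return x
```

Note that the attention computation itself is shared: only the projections feeding it and consuming its output are modality-specific, which is how cross-modal interaction is preserved.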
The paper evaluates MoT in three multi-modal experiment settings: autoregressive text-and-image generation (the Chameleon setting), the same setting extended with speech as a third modality, and the Transfusion setting, which combines an autoregressive text objective with diffusion-based image generation.
MoT's performance is compared to dense transformer models and MoE-4x models. Crucially, all models are designed to maintain equivalent FLOPs for both training and testing.
In the first setting, the paper uses the same mixed-modal training data and tokenizers as the Chameleon model, consisting of text and image tokens. Evaluation uses validation losses on the Obelisc, MS-COCO, Flickr30k, and Shutterstock datasets. Model architectures range from 37M to 7B parameters, scaled by increasing the number of layers and the hidden dimension. Training configurations match the token budget, batch size, and GPU count across models for a fair comparison. The MoE baseline uses a state-of-the-art routing method, Expert Choice (EC). EC routing has known limitations, however, such as information leakage from future tokens during validation and uneven allocation of experts to tokens (some tokens are processed by several experts, others by none).
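For reference, a simplified sketch of Expert Choice routing (illustrative names and shapes, based on the published EC method rather than the paper's implementation):

```python
import torch
import torch.nn.functional as F

def expert_choice_routing(x, router_weight, capacity):
    """Expert Choice routing: instead of each token choosing experts,
    each expert selects the `capacity` tokens it scores most highly.

    x: (n_tokens, d_model) token representations
    router_weight: (d_model, n_experts) learned router parameters
    """
    affinity = F.softmax(x @ router_weight, dim=-1)            # (n_tokens, n_experts)
    # Top-`capacity` tokens per expert (selection runs over the token dimension).
    gate_weights, token_idx = affinity.topk(capacity, dim=0)   # both (capacity, n_experts)
    # Because selection looks across all tokens in the sequence, assignments can
    # depend on future tokens -- the source of the validation-leakage caveat above --
    # and each token may be picked by several experts or by none.
    return token_idx, gate_weights
```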
The MoT model demonstrates significant pre-training acceleration: it reaches the dense model's final pre-training loss in only 45.5% of the training steps, and matches the dense model's image-modality loss in only 34.8% of the steps.
MoT consistently delivers significant speedups in the image modality across multiple scales from 37M to 7B, outperforming both dense and MoE-4x baselines. While MoE-4x shows diminishing returns as the model size increases, MoT maintains a consistent performance advantage. For the text modality, both MoT and MoE-4x outperform the dense model, with MoT showing slightly better gains.
This setting adds speech as a third modality, using the Spirit-LM speech data. Speech input is converted to discrete tokens with a variant of the DinoSR tokenizer, and the speech data is mixed with the text-and-image training data at a 1:6 ratio. Evaluation includes losses on held-out sets of LibriLight (LL60K) and the People's Speech dataset (PPL30K).
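A toy sketch of sampling from two data sources at a fixed 1:6 ratio (purely illustrative; the paper does not describe its exact sampling mechanics):

```python
import random

def mixed_batches(speech_iter, text_image_iter, weights=(1, 6), seed=0):
    """Yield batches drawn from two sources at a speech : text-and-image ratio of 1:6."""
    rng = random.Random(seed)
    sources = (speech_iter, text_image_iter)
    while True:
        # Pick a source in proportion to its weight, then take its next batch.
        yield next(rng.choices(sources, weights=weights, k=1)[0])
```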
MoT substantially accelerates pre-training compared to dense and MoE-4x models in the speech modality, requiring only 22.9% of the training steps to reach the dense model's pretraining loss. MoT also maintains efficiency across all three modalities, matching the dense model's final validation losses at only 55.8% of the training steps.
MoT shows consistent acceleration across all three modalities, with notable improvements in speech processing across model sizes from 443M to 1.5B. The MoT models required only 15.1% to 33.6% of the training steps to match the dense model's training loss in speech. In contrast, MoE-4x showed inferior performance, particularly in speech validation.
This setting explores multi-objective training, with an autoregressive objective for text and a diffusion objective for images. The paper uses the same text setup and Llama 2 tokenizer corpus as before, while images are encoded into latent patches with a VAE. Data is sampled at a 1:1 text-to-image ratio, and models range from 163M to 7B parameters. Text-to-text evaluation uses the Wikipedia and C4 corpora; text-to-image evaluation uses held-out Conceptual 12M and MS-COCO benchmarks.
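To illustrate the mixed objective, here is a toy sketch of a combined loss in the spirit of this setting (names and the weighting `lam` are assumptions, not the paper's exact recipe):

```python
import torch.nn.functional as F

def mixed_objective_loss(text_logits, text_targets, noise_pred, noise, lam=1.0):
    """Next-token prediction for text tokens plus a diffusion (noise-prediction)
    loss for image latent patches.

    text_logits: (n_text_tokens, vocab) language-model predictions
    text_targets: (n_text_tokens,) next-token ids
    noise_pred / noise: (n_image_patches, d_latent) predicted vs. true noise on VAE latents
    """
    lm_loss = F.cross_entropy(text_logits, text_targets)   # autoregressive text objective
    diffusion_loss = F.mse_loss(noise_pred, noise)          # denoising objective on image latents
    return lm_loss + lam * diffusion_loss
```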
MoT shows significant pre-training acceleration for the image modality, needing only about 30% of the training steps of the dense baseline, and improves CLIP, FID, and CIDEr scores. Text performance, by contrast, does not improve much in the Transfusion setting; the paper suggests the decoupling of objectives may play a role in these results. At smaller scales, a 760M-parameter MoT model outperforms the larger 1.4B-parameter model at half the FLOPs.
Across multiple scales (163M, 760M, 1.4B), MoT consistently outperforms the baselines in image generation and exhibits better generalization on captioning tasks. Even with an 8x difference in model size (163M vs. 1.4B), the 163M MoT model achieves a comparable training loss in the image modality, showcasing its strength in image processing.
Fine-tuning experiments on MoT and dense models show that MoT achieves better quality and faithfulness in generated text and images, indicating that MoT's gains carry over to fine-tuning. The paper also demonstrates image-editing capabilities with the fine-tuned models.
The paper investigates the impact of untying parameters by modality within different transformer components. Untying the feed-forward modules yields the largest performance gain; additionally untying the Q, K, and V projections gives further gains, while untying the layer-norm parameters has negligible impact.
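The ablation axes can be summarized as a small set of switches, sketched below (a hypothetical configuration object, not the paper's code):

```python
from dataclasses import dataclass

@dataclass
class ModalityUntyingConfig:
    """Which transformer components receive modality-specific (untied) copies,
    ordered by the gains reported in the ablation: FFN > Q/K/V projections > layer norms."""
    untie_ffn: bool = True         # largest gain on its own
    untie_qkv: bool = True         # further gain on top of FFN untying
    untie_layernorm: bool = False  # negligible impact in the ablation
```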
The paper also evaluates how separating modality towers affects performance through leave-one-modality-out (LOO) experiments. The best performance comes from fully separate towers, while configurations that mix modalities within a tower perform worse, underscoring the importance of separate towers.
The MoT architecture has a lower parameter-to-FLOP ratio than MoE because its parameter count grows only with the number of modalities, which is typically small, rather than with a separately chosen expert count. MoT incurs overhead from grouping tokens by modality, while MoE incurs overhead from its more complex routing; both overheads can be reduced with careful engineering.
The paper also investigates the horizontal scaling properties of MoT by varying the GPU count. The results show that MoT's benefits grow with GPU count, and in wall-clock terms MoT delivers a considerable reduction in GPU training time for a given level of model performance.
This work has significant implications for the field of multi-modal AI. By demonstrating a sparse architecture that maintains or even enhances performance while significantly reducing computational costs, the paper paves the way for:
Smaller computational footprints mean these models can be trained by a wider range of research groups and businesses, democratizing access to advanced AI.
Speedups in training times can significantly reduce the development cycle for complex multi-modal systems, leading to faster innovation.
The scalability of MoT suggests potential for widespread use in resource-constrained environments, such as edge devices or mobile applications.
The system-profiling studies reveal that MoT leads to practical benefits, including reduction in wall-clock training time for both text and image tasks.
The paper's unique insight lies in modality-specific parameter decoupling while preserving global self-attention. This approach taps into the inherent differences in how modalities are processed by the model. Key observations that drive the performance:
Different modalities occupy distinct regions of the feature space, even though their inputs are uniform tokens.
Applying unique parameters and processing strategies for each modality allows for better utilization of model capacity.
Maintaining a global self-attention mechanism enables effective cross-modal interaction without a significant increase in parameters. This combines the benefits of sparsity with full cross-modal interaction, allowing MoT to scale efficiently to more modalities and larger models.