July 12, 2024
4 mins

Mixture of A Million Experts

This paper from Google DeepMind scales MoE to over a million experts, each a tiny single-neuron network, and uses vector-based retrieval to pick the top-k experts for any given input at runtime.

Key takeaways

  • The paper proposes a new architecture called PEER (Parameter Efficient Expert Retrieval) for efficiently utilizing a massive number of tiny experts in a mixture-of-experts (MoE) model.
  • PEER uses product key retrieval, a technique adapted from product key memory layers, to efficiently route inputs to the appropriate experts, allowing it to scale to over a million experts.
  • Experiments on language modeling tasks demonstrate that PEER outperforms dense feedforward networks (FFW) and conventional coarse-grained MoEs in terms of performance-compute trade-off.
  • PEER is particularly advantageous in lifelong learning scenarios, where it can easily adapt to new data streams by adding new experts while retaining knowledge from previous ones.
  • By enabling the use of a large number of small experts, PEER opens up new possibilities for scaling up transformer models without incurring significant computational costs.

Introduction

This paper is about Mixture of Experts (MoE), a powerful concept for building large language models. A sparse MoE distributes the work so that the model does not have to invoke all of its parameters at runtime; it can get by with a fraction of them without sacrificing quality. MoE works by having many smaller neural networks, called "experts," each focused on a particular task or domain. When you feed the model some text, a router decides which experts are best suited to handle each part of the input. Mixtral 8x7B and, reportedly, GPT-4 are built with MoE as the underlying architecture; a typical setup has 8 experts per layer, of which 2 are activated for each token.

Traditional transformer models, with their dense feedforward networks (FFWs), are like having one super-specialist who tries to handle everything. This can become very computationally expensive, especially as the model grows larger. MoE, on the other hand, is like having a team of smaller, focused experts, which makes the process more efficient.

Here's how MoE works (a minimal code sketch follows this list):

  • Sparse Activation: MoE models activate only a small subset of experts for each input, leading to significant computational savings compared to dense FFWs.
  • Decoupling: They decouple model size from computational cost, allowing for larger models without sacrificing efficiency.
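
To make the routing mechanics concrete, here is a minimal sketch of a sparse MoE layer in PyTorch. The sizes, the GELU experts, the softmax gate, and the top-2 choice are illustrative assumptions, not the configuration of any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Minimal sparse MoE layer: a gate scores the experts and only the top-k run per token."""

    def __init__(self, d_model=512, d_hidden=1024, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts)   # the router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                       # x: (batch, d_model)
        gate_scores = self.gate(x)                               # (batch, num_experts)
        top_scores, top_idx = gate_scores.topk(self.k, dim=-1)   # keep only k experts per token
        weights = F.softmax(top_scores, dim=-1)                  # normalize over the chosen experts
        out = torch.zeros_like(x)
        for b in range(x.size(0)):                               # loops for clarity, not speed
            for slot in range(self.k):
                expert = self.experts[int(top_idx[b, slot])]
                out[b] += weights[b, slot] * expert(x[b])
        return out

layer = SparseMoE()
y = layer(torch.randn(4, 512))   # only 2 of the 8 experts run for each of the 4 inputs
```

Because the gate keeps only k experts per token, compute per token scales with k rather than with the total number of experts, which is exactly the decoupling described above.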

This paper highlights two crucial areas where MoE models become incredibly beneficial:

Scaling

As models grow larger, they typically become more computationally demanding. MoE lets us increase the total parameter count without a proportional increase in per-token compute, opening up new possibilities for model size and performance.

Lifelong Learning

This is the ability of a model to learn continuously from new data without forgetting what it has already learned. The paper notes that MoE is well suited to lifelong learning: as new data streams arrive, new experts can simply be added to the pool (and regularized properly), letting the model adapt and expand its knowledge without overwriting what the existing experts have learned.
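
As a loose illustration of that idea (not the paper's actual recipe), growing the expert pool for a new data stream can look like the following sketch; here the old experts are simply frozen as a crude stand-in for the proper regularization the paper alludes to.

```python
import torch.nn as nn

def grow_expert_pool(experts: nn.ModuleList, num_new: int, d_model: int = 512) -> nn.ModuleList:
    """Toy sketch: append fresh experts for a new data stream while keeping the old ones intact."""
    num_old = len(experts)
    for _ in range(num_new):                       # new capacity for the new distribution
        experts.append(nn.Sequential(nn.Linear(d_model, d_model), nn.GELU()))
    for old_expert in list(experts)[:num_old]:     # crude way to preserve previously learned knowledge
        for p in old_expert.parameters():
            p.requires_grad = False
    return experts

pool = nn.ModuleList(nn.Sequential(nn.Linear(512, 512), nn.GELU()) for _ in range(8))
pool = grow_expert_pool(pool, num_new=4)           # 12 experts; only the 4 new ones will train
```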

However, there's a catch: existing MoE models are limited in how many experts they can handle, because routing cost and training difficulty grow with the expert count. This is where the paper introduces the PEER architecture.

PEER (Parameter Efficient Expert Retrieval) stands apart from previous MoE models. It is designed to handle a vast number of tiny experts (over a million). This is a significant leap forward because it unlocks the potential to scale up MoE models even further while maintaining computational efficiency.

Background

This paper builds on several previous research efforts, drawing inspiration from both traditional MoE architectures and retrieval-augmented models. Here's a quick look at the key concepts and papers that set the stage for PEER:

MoE (Sparse Mixture of Experts)

The mixture-of-experts idea dates back to Jacobs et al. (1991); its modern, sparsely gated form for large neural networks was introduced by Shazeer et al. (2017) and has been extensively studied since then.

Scaling Laws

Recent research, like the work by Kaplan et al. (2020), has established scaling laws for language models, showing that increasing model size and training data leads to performance improvements.

Fine-grained MoE

Krajewski et al. (2024) discovered that the performance of MoE models can be further enhanced by using higher granularity, that is, using more, smaller experts. This finding is particularly relevant to the PEER architecture.

Product Key Memory (PKM)

Lample et al. (2019) proposed PKM, which uses a learned index structure to efficiently retrieve relevant information from a large memory. This technique is adapted in PEER to route inputs to the appropriate experts.

Method

The paper introduces the PEER layer, a novel architecture that allows for a much larger number of experts than previous approaches. This section dives into the technical details of how PEER works:

PEER Overview

Experts

PEER has a large pool of N experts, each represented as a simple neural network.

Product Keys

Each expert is associated with a product key, which is a vector representing its expertise.

Query Network

The PEER layer also has a query network that transforms the input into a query vector.

Retrieval

When a new input arrives, the query network generates a query vector, which is compared to the product keys of all experts. The top k experts with the most similar keys are selected.

Router Scores

The similarity between the query and each expert's key is used to calculate a router score for each expert.

Output

The outputs of the selected experts are then weighted by their router scores and combined to produce the final output.
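
Putting these pieces together, here is a single-head sketch of a PEER-style layer in PyTorch that scores every expert key directly (the efficient product-key retrieval is covered in the next section). The dimensions, the GELU nonlinearity, and the initialization are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PEERHeadNaive(nn.Module):
    """One retrieval head of a PEER-style layer, scoring every expert key directly."""

    def __init__(self, d_model=512, d_key=128, num_experts=1024, k=16):
        super().__init__()
        self.k = k
        self.query_net = nn.Linear(d_model, d_key)                        # maps input to a query vector
        self.keys = nn.Parameter(torch.randn(num_experts, d_key) * 0.02)  # one key per expert
        # Single-neuron experts: a "down" vector u_i and an "up" vector v_i per expert.
        self.U = nn.Parameter(torch.randn(num_experts, d_model) * 0.02)
        self.V = nn.Parameter(torch.randn(num_experts, d_model) * 0.02)

    def forward(self, x):                                  # x: (batch, d_model)
        q = self.query_net(x)                              # (batch, d_key)
        sims = q @ self.keys.t()                           # similarity of the query to every key
        top_sims, top_idx = sims.topk(self.k, dim=-1)      # pick the k most similar experts
        router = F.softmax(top_sims, dim=-1)               # router scores from the similarities
        u = self.U[top_idx]                                # (batch, k, d_model)
        v = self.V[top_idx]                                # (batch, k, d_model)
        h = F.gelu((u * x.unsqueeze(1)).sum(-1))           # each expert: one hidden activation u_i·x
        return (router * h).unsqueeze(-1).mul(v).sum(1)    # weighted sum of v_i back to d_model

layer = PEERHeadNaive()
y = layer(torch.randn(4, 512))   # 16 of the 1024 experts contribute to each output
```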

Product Key Retrieval

The key innovation in PEER is its use of product keys for retrieval. Instead of comparing the query against each of the N expert keys individually, each expert key is the concatenation of two sub-keys drawn from two much smaller sets of size √N, and the query is split into two corresponding halves. Finding the top-k experts then only requires scoring 2·√N sub-keys and combining the best candidates, which makes retrieval efficient enough to scale to over a million experts.
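
Here is a hedged sketch of the product-key trick, following the product key memory idea from Lample et al.; the sizes and variable names are illustrative. Instead of scoring all N = n × n keys, the query is split into two halves, each half is scored against only n sub-keys, and the final top-k is taken over the k × k best candidate combinations.

```python
import torch

def product_key_topk(q, subkeys1, subkeys2, k):
    """Top-k expert indices over N = n*n product keys, touching only 2*n sub-keys.

    q:        (d_key,) query, split into two halves
    subkeys1: (n, d_key // 2) first set of sub-keys
    subkeys2: (n, d_key // 2) second set of sub-keys
    The full key of expert (i, j) is the concatenation [subkeys1[i], subkeys2[j]],
    so its score is the sum of the two half-scores.
    """
    n = subkeys1.size(0)
    q1, q2 = q.chunk(2)                              # split the query into two halves
    s1, i1 = (subkeys1 @ q1).topk(k)                 # best k sub-keys on each half
    s2, i2 = (subkeys2 @ q2).topk(k)
    cand = s1[:, None] + s2[None, :]                 # (k, k) candidate scores
    best = cand.flatten().topk(k).indices            # final top-k among the k*k candidates
    rows, cols = best // k, best % k
    expert_idx = i1[rows] * n + i2[cols]             # flat index into the n*n expert grid
    return cand.flatten()[best], expert_idx

# Example: 1,048,576 experts arranged as a 1024 x 1024 grid of product keys.
n, d_key, k = 1024, 128, 16
scores, idx = product_key_topk(torch.randn(d_key),
                               torch.randn(n, d_key // 2),
                               torch.randn(n, d_key // 2), k)
```

The per-query cost drops from roughly O(N · d_key) to O(√N · d_key + k²), which is what makes a pool of a million experts practical.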

Parameter Efficient Experts and Multi-Head Retrieval

PEER uses single-neuron MLPs as its experts, which means each expert has only one hidden layer with a single neuron. This is what makes the experts “parameter-efficient.”
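
Concretely, a single expert can be sketched as one down-projection vector and one up-projection vector, roughly 2·d_model parameters in total; the GELU nonlinearity below is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def single_neuron_expert(x, u, v):
    """One PEER-style expert: a single hidden activation u·x, scaled onto an output vector v."""
    return F.gelu(torch.dot(u, x)) * v

d_model = 512
y = single_neuron_expert(torch.randn(d_model), torch.randn(d_model), torch.randn(d_model))
```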

The paper explains how PEER achieves a larger capacity without significantly increasing computation by using a technique called multi-head retrieval. This is similar to the multi-head attention mechanism in transformers:

Multiple Query Networks

PEER has h independent query networks, each producing a query vector and retrieving a set of k experts.

Shared Experts

All heads share the same pool of experts and product keys.

Output

The outputs from all h heads are summed up to create the final output.

The key point is that the heads share one pool of single-neuron experts, so hidden neurons are reused across the different combinations each head retrieves, improving knowledge transfer and parameter efficiency.
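
A compact, assumption-laden sketch of the multi-head variant: h query networks retrieve from one shared set of keys and single-neuron experts, and the per-head outputs are summed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadPEER(nn.Module):
    """h query networks retrieve from one shared pool of single-neuron experts; head outputs are summed."""

    def __init__(self, d_model=512, d_key=128, num_experts=1024, k=16, heads=4):
        super().__init__()
        self.k = k
        self.queries = nn.ModuleList(nn.Linear(d_model, d_key) for _ in range(heads))  # one query net per head
        self.keys = nn.Parameter(torch.randn(num_experts, d_key) * 0.02)               # shared expert keys
        self.U = nn.Parameter(torch.randn(num_experts, d_model) * 0.02)                # shared expert weights
        self.V = nn.Parameter(torch.randn(num_experts, d_model) * 0.02)

    def forward(self, x):                                    # x: (batch, d_model)
        out = torch.zeros_like(x)
        for query_net in self.queries:                       # each head retrieves its own top-k experts
            sims = query_net(x) @ self.keys.t()
            top_sims, top_idx = sims.topk(self.k, dim=-1)
            router = F.softmax(top_sims, dim=-1)
            h = F.gelu((self.U[top_idx] * x.unsqueeze(1)).sum(-1))
            out = out + (router * h).unsqueeze(-1).mul(self.V[top_idx]).sum(1)
        return out                                           # the sum over heads is the layer output

layer = MultiHeadPEER()
y = layer(torch.randn(4, 512))   # 4 heads x 16 experts retrieved per token from the shared pool
```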

Why a Large Number of Small Experts?

The paper provides a compelling argument for using a large number of small experts instead of a small number of large experts. The authors explain that:

Active Parameters

The computational cost is primarily determined by the number of active parameters, which are the parameters that are used for each input token.

Granularity

The granularity of an MoE model is the number of active experts. Higher granularity generally leads to better performance.

Parameter Efficiency

PEER uses the smallest possible expert size by setting the number of neurons in each expert to one, so the number of activated neurons per token is the number of retrieval heads multiplied by the number of experts retrieved per head (h × k). This maximizes granularity for a fixed compute budget, leading to better performance at lower computational cost, as the rough calculation below illustrates.
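
Here is a back-of-the-envelope comparison of active parameters per token. The sizes and the roughly 2·d_model cost per retrieved single-neuron expert are assumptions based on the expert structure sketched earlier, not figures from the paper; the point is that the total number of experts never appears in the count.

```python
def active_params_dense_ffw(d_model, d_hidden):
    # A dense FFW uses every parameter for every token: two d_model x d_hidden projections.
    return 2 * d_model * d_hidden

def active_params_peer(d_model, d_key, heads, k):
    # Each retrieved single-neuron expert touches ~2*d_model weights (u_i and v_i);
    # only heads * k experts are active per token, plus the per-head query networks.
    # Note that the total expert count (even a million) does not appear here.
    expert_params = heads * k * 2 * d_model
    query_params = heads * d_model * d_key
    return expert_params + query_params

# Illustrative, hypothetical sizes:
print(active_params_dense_ffw(d_model=2048, d_hidden=8192))       # 33,554,432 active parameters
print(active_params_peer(d_model=2048, d_key=128, heads=8, k=16))  # 2,621,440 active parameters
```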

Experiments

This section showcases the results of various experiments designed to evaluate the performance of PEER:

Pretraining IsoFLOP Analysis

IsoFLOP Curves

The paper uses isoFLOP analysis to compare the performance of PEER with other methods. This involves fixing a computational budget (measured in FLOPs) and varying the model size and number of training tokens. The goal is to determine which model achieves the best performance (lowest perplexity) for a given compute budget.

Baselines

The paper compares PEER with dense FFWs, coarse-grained MoEs, and Product Key Memory (PKM) layers.

Results

The results show that PEER consistently outperforms the other methods, demonstrating its superior compute-performance trade-off.

Evaluation on Language Modeling Datasets

Datasets

The paper evaluates the performance of pretrained PEER models on several popular language modeling datasets, including the Curation Corpus, Lambada, the Pile, Wikitext, and the C4 dataset.

Results

PEER consistently achieves lower perplexities than the other methods, indicating its superior performance.

Ablations

Varying the Number of Experts

The paper investigates the impact of changing the number of total experts. The results show that increasing the number of experts generally leads to better performance.

Varying the Number of Active Experts

The paper studies the effect of changing the number of active experts (granularity). The results show that higher granularity leads to better performance but also increases memory consumption.

Expert Usage and Query Batch Normalization

The paper examines the distribution of expert usage and finds that PEER effectively utilizes a large number of experts. The paper also shows that adding a batch normalization layer to the query network can further improve expert usage and performance.
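
A minimal sketch of that batch-normalization tweak (the exact placement in the paper's implementation may differ): the query vectors are normalized across the batch before they are compared to the product keys, which the paper reports improves expert usage and performance.

```python
import torch
import torch.nn as nn

d_model, d_key = 512, 128
query_net = nn.Sequential(
    nn.Linear(d_model, d_key),
    nn.BatchNorm1d(d_key),      # normalizes queries across the batch before key comparison
)
q = query_net(torch.randn(32, d_model))   # (32, d_key) normalized query vectors
```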

Business Implications

The paper's findings have significant implications for the business world. Here are some potential areas where PEER could be impactful:

Cost-Effective Model Scaling

By enabling the efficient use of massive numbers of experts, PEER allows companies to develop larger and more powerful language models without incurring significant computational costs. This opens up possibilities for creating models with better performance and more sophisticated capabilities.

Faster Training and Inference

PEER's efficiency in both training and inference can accelerate model development cycles, leading to faster time-to-market for new products and services.

Lifelong Learning Applications

The paper highlights the potential of PEER in lifelong learning applications. This could be particularly valuable in industries where models need to adapt to changing data streams, such as in personalized recommendation systems, fraud detection, and medical diagnostics.

Reduced Hardware Costs

PEER's efficient use of resources can significantly reduce the computational requirements for training and deploying large language models. This could potentially lead to lower hardware costs and make it more feasible to run large models on devices with limited resources.

Conclusion: A Leap Forward in Efficiency

This paper marks a significant advance in MoE architecture design: PEER paves the way for efficiently scaling up language models without compromising performance. The paper highlights the importance of both scaling and lifelong learning in the field of artificial intelligence, and it demonstrates that PEER is a powerful tool for achieving both goals.

The combination of product key retrieval and parameter-efficient experts offers a compelling solution to the computational challenges associated with training and deploying extremely large language models. PEER opens the door to new possibilities for model development, allowing researchers and developers to explore the frontiers of language modeling with unprecedented efficiency and scale. The business implications of these advances are far-reaching, with the potential to transform various industries and drive significant innovation.
