
Thursday, 19 March 2026

Mixture of Experts Architecture in Transformer Models

 

Transformer models have proven highly effective for many NLP tasks. While scaling up with larger dimensions and more layers can increase their power, this also significantly increases computational complexity. Mixture of Experts (MoE) architecture offers an elegant solution by introducing sparsity, allowing models to scale efficiently without proportional computational cost increases.

In this post, you will learn about Mixture of Experts architecture in transformer models. In particular, you will learn about:

  • Why MoE architecture is needed for efficient transformer scaling
  • How MoE works and its key components
  • How to implement MoE in transformer models

Kick-start your project with my book Building Transformer Models From Scratch with PyTorch. It provides self-study tutorials with working code.

Let’s get started.

Mixture of Experts Architecture in Transformer Models
Photo by realfish. Some rights reserved.

Overview

This post covers three main areas:

  • Why Mixture of Experts is Needed in Transformers
  • How Mixture of Experts Works
  • Implementation of MoE in Transformer Models

Why Mixture of Experts is Needed in Transformers

The Mixture of Experts (MoE) concept was first introduced in 1991 by Jacobs et al. It uses multiple “expert” models to process input, with a “gate” mechanism selecting which expert to use. MoE experienced a revival with the Switch Transformer and Mixtral models, released in 2021 and 2024 respectively. In transformer models, MoE activates only a subset of parameters for each input, allowing large models to be defined while only a portion is used for each computation.

Consider the Mixtral model architecture:

Mixtral Model Architecture

As covered in the previous post, the MLP block introduces non-linearity to transformer layers. The attention block only shuffles information from the input sequence using linear combinations. The “intelligence” of transformer models primarily resides in the MLP block.

This explains why MLP blocks typically contain the most parameters and computational load in transformer models. Training MLP blocks to perform well across diverse tasks is challenging because different tasks may require contradictory behaviors.

One solution is creating specialized models for each task with a router to select the appropriate model. Alternatively, you can combine multiple models and the router into a single model and train everything together. This is the essence of MoE.

MoE introduces sparsity by having multiple experts of which only a sparse subset is activated at a time. The MoE architecture modifies only the MLP block; all experts in a layer share the same attention block. Each transformer layer has its own independent set of experts, so experts can be mixed and matched across layers. Because these combinations multiply across layers, many effective expert pathways are created without a correspondingly drastic expansion of the parameter count, scaling the model’s capacity while keeping computational costs low.

The key insight is that different inputs benefit from different specialized computations. By having multiple expert networks with a routing mechanism to select which experts to use, the model achieves better performance with fewer computational resources.

How Mixture of Experts Works

MoE architecture consists of three key components:

  1. Expert Networks: Multiple independent neural networks (experts) that process input, similar to MLP blocks in other transformer models.
  2. Router: A mechanism that decides which experts should process each input. It is typically a linear layer followed by a softmax, producing a probability distribution over the N experts. The router output selects the top-k experts through a “gating mechanism.”
  3. Output combination: The top-k experts process the input, and their outputs are combined as a weighted sum using the normalized probabilities from the router.

The basic MoE operation works as follows. For each vector x in the attention block’s output sequence, the router multiplies it with a matrix to produce logits (the gate layer in the figure above). After a softmax transformation, the resulting probabilities are filtered by a top-k operation, producing k indices and k probabilities. The indices activate the corresponding experts (the MLP blocks in the figure), which process the original attention block output. The expert outputs are then combined as a weighted sum using the normalized router probabilities.

Conceptually, the MoE block computes:

MoE⁡(π‘₯)=∑𝑖∈TopK⁡(𝑝)𝑝𝑖⋅Expert𝑖⁢(π‘₯)

The value of π‘˜ is a model hyperparameter. Even π‘˜ =2 has been found sufficient for good performance.

Implementation of MoE in Transformer Models

Below is a PyTorch implementation of a transformer layer with MoE replacing the traditional MLP block:
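The listing below is a minimal sketch of such a layer. The hyperparameters (8 experts, top-k of 2, GELU activation inside each expert) and the use of nn.MultiheadAttention with pre-added residual connections for the attention sublayer are illustrative choices, not requirements of the MoE architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """A feed-forward network, identical in structure to a standard MLP block."""
    def __init__(self, hidden_dim, intermediate_dim):
        super().__init__()
        self.fc1 = nn.Linear(hidden_dim, intermediate_dim)
        self.fc2 = nn.Linear(intermediate_dim, hidden_dim)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))

class MoELayer(nn.Module):
    """Mixture-of-Experts sublayer that replaces the MLP sublayer."""
    def __init__(self, hidden_dim, intermediate_dim, num_experts=8, top_k=2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.router = nn.Linear(hidden_dim, num_experts)
        self.experts = nn.ModuleList(
            [Expert(hidden_dim, intermediate_dim) for _ in range(num_experts)]
        )

    def forward(self, x):
        batch_size, seq_len, hidden_dim = x.shape
        # flatten so that each sequence vector is routed independently
        x = x.reshape(-1, hidden_dim)                       # (batch*seq, hidden_dim)
        routing_logits = self.router(x)                     # (batch*seq, num_experts)
        routing_probs = F.softmax(routing_logits, dim=-1)
        top_k_probs, top_k_indices = routing_probs.topk(self.top_k, dim=-1)
        # renormalize the selected probabilities so they sum to 1 per token
        top_k_probs = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)

        output = torch.zeros_like(x)
        for expert_id, expert in enumerate(self.experts):
            # tokens for which this expert is among the top-k
            token_mask = (top_k_indices == expert_id).any(dim=-1)
            if not token_mask.any():
                continue
            # routing weight this expert received for each token (0 if not selected)
            weight = (top_k_probs * (top_k_indices == expert_id)).sum(dim=-1, keepdim=True)
            output[token_mask] += weight[token_mask] * expert(x[token_mask])

        return output.reshape(batch_size, seq_len, hidden_dim)

class MoeTransformerLayer(nn.Module):
    """Transformer layer with an attention sublayer and an MoE sublayer."""
    def __init__(self, hidden_dim, num_heads, intermediate_dim, num_experts=8, top_k=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.norm2 = nn.LayerNorm(hidden_dim)
        self.moe = MoELayer(hidden_dim, intermediate_dim, num_experts, top_k)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.moe(x))
        return x
```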

A complete MoE transformer model consists of a sequence of transformer layers. Each layer contains an attention sublayer and an MoE sublayer, with the MoE sublayer operating like the MLP sublayer in other transformer models.

In the MoELayer class, the forward() method expects an input tensor of shape (batch_size, seq_len, hidden_dim). Since each sequence vector is processed independently, the input is first reshaped to (batch_size * seq_len, hidden_dim). The router produces routing_logits of shape (batch_size * seq_len, num_experts), indicating each expert’s potential contribution to the output.

The top-π‘˜ operation selects experts and their corresponding probabilities, stored in top_k_probs. In the for-loop, each expert involved will process the vectors that it is involved, based on the token_mask. Then the output from the expert will be scaled by the corresponding export’s weight, and added to the output tensor. After the for-loop, the output is reshaped back to the original (batch_size, seq_len, hidden_dim) shape.

The Expert class is identical to the MLP block from the previous post; the only difference is that the MoE sublayer holds multiple instances of it rather than the transformer layer holding a single one.

You can test the transformer layer with this code:
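The sizes below (batch of 2, sequence length 10, hidden dimension 64) are arbitrary; any values consistent with the layer’s constructor will work.

```python
# quick smoke test of the MoE transformer layer defined above
layer = MoeTransformerLayer(hidden_dim=64, num_heads=4, intermediate_dim=256,
                            num_experts=8, top_k=2)
x = torch.randn(2, 10, 64)      # (batch_size, seq_len, hidden_dim)
y = layer(x)
print(y.shape)                  # torch.Size([2, 10, 64]): output shape matches the input
```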

Shared Experts

The above implementation is the simplest form of MoE. Recently, a new idea, proposed and popularized by the DeepSeek models, adds a few “shared experts” to the MoE architecture; these shared experts are always used, regardless of the input. Mathematically, this makes the MoE compute:

MoE⁡(π‘₯)=Expert∗⁢(π‘₯)+∑𝑖∈TopK⁡(𝑝)𝑝𝑖⋅Expert𝑖⁢(π‘₯)

The extra Expert* term is the shared expert. Trivially, you can use multiple shared experts. In all cases, a shared expert does not go through the router; it processes the input unconditionally.

To implement the shared experts, you can reuse the above code and add extra experts in the MoeTransformerLayer class:
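One way to do this, sketched below, is to redefine MoeTransformerLayer with a ModuleList of shared experts whose outputs are added to the routed MoE output for every token. The number of shared experts (one here) is a hyperparameter, and the placement of the residual connections follows the earlier sketch rather than any specific model.

```python
class MoeTransformerLayer(nn.Module):
    """MoE transformer layer extended with shared experts that are always active."""
    def __init__(self, hidden_dim, num_heads, intermediate_dim,
                 num_experts=8, top_k=2, num_shared=1):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.norm2 = nn.LayerNorm(hidden_dim)
        self.moe = MoELayer(hidden_dim, intermediate_dim, num_experts, top_k)
        # shared experts bypass the router and see every token
        self.shared_experts = nn.ModuleList(
            [Expert(hidden_dim, intermediate_dim) for _ in range(num_shared)]
        )

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + attn_out)
        moe_out = self.moe(x)
        # add the unconditional contribution of each shared expert
        for expert in self.shared_experts:
            moe_out = moe_out + expert(x)
        x = self.norm2(x + moe_out)
        return x
```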

 

Further Readings

Below are some resources that you may find useful:

Summary

In this post, you learned about Mixture of Experts architecture in transformer models. Specifically, you learned about:

  • Why MoE is needed for efficient scaling of transformer models
  • How MoE works with expert models, routers, and gating mechanisms
  • How to implement MoE layers that can replace traditional MLP layers in transformer models
