
Objective

Mixture of Experts (MoE) is a Machine Learning (ML) architecture that processes data by routing inputs through specialized sub-networks called experts, each optimized for different patterns or tasks. A gating network (router) dynamically selects which expert or combination of experts handles each input, activating only the most relevant pathways rather than the entire model. This sparse activation enables MoE models to scale to massive parameter counts while using far less compute per inference than dense models of comparable capacity. Originally introduced in the 1990s, MoE gained prominence with modern transformers like GPT-4 and Mixtral, enabling efficient pretraining at unprecedented scale.
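The routing idea above can be sketched in a few lines. This is a minimal, illustrative top-k gated MoE layer in plain Python, not any particular library's implementation: the experts are tiny linear maps, the gate scores each expert with a dot product, and only the top-k experts are run and combined with renormalized gate weights. All names, sizes, and initializations here are assumptions chosen for clarity.

```python
import math
import random

random.seed(0)

DIM, NUM_EXPERTS, TOP_K = 4, 3, 2  # toy sizes, chosen for illustration

# Each "expert" is a tiny linear layer: a DIM x DIM weight matrix.
experts = [[[random.gauss(0, 0.1) for _ in range(DIM)] for _ in range(DIM)]
           for _ in range(NUM_EXPERTS)]
# Gating network: one row of logit weights per expert.
gate_w = [[random.gauss(0, 0.1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

def matvec(m, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]

def softmax(xs):
    mx = max(xs)
    exps = [math.exp(v - mx) for v in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(x):
    """Route x through the TOP_K experts chosen by the gate."""
    logits = matvec(gate_w, x)
    # Pick the TOP_K highest-scoring experts (sparse activation:
    # the remaining experts are never evaluated for this input).
    top = sorted(range(NUM_EXPERTS), key=lambda i: logits[i], reverse=True)[:TOP_K]
    # Renormalize gate weights over only the selected experts.
    weights = softmax([logits[i] for i in top])
    out = [0.0] * DIM
    for w, i in zip(weights, top):
        y = matvec(experts[i], x)  # only TOP_K of NUM_EXPERTS experts run
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, top

y, chosen = moe_forward([1.0, -0.5, 0.3, 0.8])
```

With `TOP_K = 2` of 3 experts, each input pays roughly two-thirds of the dense compute here; at production scale (e.g. 2 of 8 or more experts) the savings are far larger, which is the efficiency argument made above.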

