The feedforward (FFW) layers in standard transformer architectures incur computational costs and activation memory that grow linearly with hidden layer width. Sparse mixture-of-experts (MoE) architectures address this issue by decoupling model size from computational cost. A recent finding, the fine-grained MoE scaling law, shows that higher expert granularity leads to better performance. However, existing MoE models face computational and optimization challenges that restrict the number of experts they can employ.
In a new paper, Mixture of A Million Experts, a Google DeepMind research team introduces Parameter Efficient Expert Retrieval (PEER), an innovative layer design that leverages the product key technique for sparse retrieval from an extensive pool of tiny experts (over a million). Its impressive performance-compute trade-off unlocks the potential for further scaling transformer models while maintaining computational efficiency.

The team highlights their main contributions as follows:
- Exploration of Extreme MoE Setting: Departing from the conventional focus on a small number of large experts, this work investigates the under-explored scenario of numerous tiny experts.
- Learned Index Structure for Routing: For the first time, the study demonstrates that a learned index structure (Kraska et al., 2018) can efficiently route to over a million experts.
- New Layer Design: By combining product key routing with single-neuron experts (see the sketch after this list), the PEER layer expands layer capacity without significant computational overhead. Empirical results show its superior efficiency compared to dense FFW, coarse-grained MoEs, and Product Key Memory (PKM) layers.
- Comprehensive Ablation Studies: The researchers explore various design choices of PEER, such as the number of experts, active parameters, number of heads, and query batch normalization, focusing on their impact on language modeling tasks.
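
To make the product key routing concrete, the sketch below (illustrative PyTorch, not the paper's code) shows how top-k retrieval over N = n² experts can be performed by scoring only 2n sub-keys: the query is split in half, each half is scored against its own set of n sub-keys, and the k×k best candidate pairs are combined. The function name product_key_topk and the tensor names are assumptions made for this example.

```python
import torch

def product_key_topk(query, sub_keys1, sub_keys2, k):
    """Top-k retrieval over n*n experts by scoring only 2n sub-keys.

    The full key of expert (i, j) is the concatenation of sub_keys1[i] and
    sub_keys2[j], so its score for a query split into halves (q1, q2) is
    q1 . sub_keys1[i] + q2 . sub_keys2[j].
    """
    d = query.shape[-1]
    q1, q2 = query[..., : d // 2], query[..., d // 2 :]

    # Score the two sub-key sets separately: 2n comparisons instead of n^2.
    top1, idx1 = (q1 @ sub_keys1.T).topk(k)   # best k sub-keys for the first half
    top2, idx2 = (q2 @ sub_keys2.T).topk(k)   # best k sub-keys for the second half

    # Only the k x k candidate pairs can contain the overall top-k experts.
    cand = top1[:, None] + top2[None, :]       # (k, k) candidate scores
    scores, flat = cand.flatten().topk(k)
    n = sub_keys2.shape[0]
    expert_idx = idx1[flat // k] * n + idx2[flat % k]   # indices in [0, n*n)
    return scores, expert_idx
```

With n = 1024 sub-keys per half, this addresses 1024² ≈ one million experts while comparing the query against only 2048 sub-key vectors.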

A PEER layer is formally defined as a function consisting of three components: a pool of experts, each sharing the same signature; a corresponding set of product keys; and a query network that maps the input vector to a query vector.
A PEER layer can be inserted into the middle of a transformer backbone or used to replace an FFW layer. Given the state vector from the previous layer, the query network maps it to a query vector, which is then compared with the product keys to compute router scores and retrieve the top-k experts. After the retrieved experts make their predictions, their outputs are linearly combined using softmax-normalized router scores as weights.
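
A minimal, single-head sketch of this forward pass might look as follows, reusing the product_key_topk helper from the earlier sketch; the class name PEERSketch, the GELU activation, and the hyperparameter defaults are assumptions, and features described in the paper such as multi-head retrieval and query batch normalization are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PEERSketch(nn.Module):
    """Single-head PEER-style layer (illustrative sketch, not the official code)."""

    def __init__(self, d_model, n_sub_keys=1024, d_key=128, k=16):
        super().__init__()
        self.n_experts = n_sub_keys ** 2           # 1024^2 ~ one million experts
        self.k = k
        self.query_net = nn.Linear(d_model, d_key)
        self.sub_keys1 = nn.Parameter(torch.randn(n_sub_keys, d_key // 2))
        self.sub_keys2 = nn.Parameter(torch.randn(n_sub_keys, d_key // 2))
        # Each expert is a single neuron: a down vector u_i and an up vector v_i,
        # stored in embedding tables so only the retrieved rows are touched.
        self.down = nn.Embedding(self.n_experts, d_model)
        self.up = nn.Embedding(self.n_experts, d_model)

    def forward(self, x):                          # x: (d_model,) state vector
        q = self.query_net(x)                                      # query vector
        scores, idx = product_key_topk(q, self.sub_keys1, self.sub_keys2, self.k)
        u, v = self.down(idx), self.up(idx)                        # (k, d_model) each
        gates = F.softmax(scores, dim=-1)                          # router weights
        acts = F.gelu(u @ x)                                       # scalar output of each expert neuron
        return (gates * acts) @ v                                  # weighted sum of up projections
```

Storing the expert weights as embedding tables keeps memory access sparse: only k of the roughly one million rows are read per token.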

In their empirical study, the researchers conducted isoFLOP analysis on language modeling tasks, comparing PEER with various baselines. The results demonstrate that PEER layers outperform dense FFWs and coarse-grained MoEs in terms of performance-compute trade-off.
The paper Mixture of A Million Experts is on arXiv.
Author: Hecate He | Editor: Chain Zhang
