The feedforward (FFW) layers in standard transformer architectures experience a linear increase in computational costs and activation memory as the hidden layer width expands. To address this issue, sparse mixture-of-experts (MoE) architectures have emerged, effectively decoupling model size from computational cost. A recent discovery, the fine-grained MoE scaling law, shows that higher granularity leads to better performance. However, existing MoE models are limited by computational and optimization challenges, restricting the number of experts they can employ.
In a new paper, Mixture of A Million Experts, a Google DeepMind research team introduces Parameter Efficient Expert Retrieval (PEER), an innovative layer design that leverages the product key technique for sparse retrieval from an extensive pool of tiny experts (over a million). Its impressive performance-compute trade-off unlocks the potential for further scaling transformer models while maintaining computational efficiency.

The team highlights their main contributions as follows:
- Exploration of Extreme MoE Setting: Departing from the conventional focus on a small number of large experts, this work investigates the under-explored scenario of numerous tiny experts.
- Learned Index Structure for Routing: For the first time, the study demonstrates that a learned index structure (Kraska et al., 2018) can efficiently route to over a million experts.
- New Layer Design: By combining product key routing with single-neuron experts, the PEER layer expands layer capacity without significant computational overheads. Empirical results show its superior efficiency compared to dense FFW, coarse-grained MoEs, and Product Key Memory (PKM) layers.
- Comprehensive Ablation Studies: The researchers explore various design choices of PEER, such as the number of experts, active parameters, number of heads, and query batch normalization, focusing on their impact on language modeling tasks.
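The product key technique at the heart of PEER's routing can be illustrated with a small sketch. Because each full key is the concatenation of one sub-key from each of two sets of size n, the router can score all n² keys with only 2n sub-key comparisons plus a small k×k merge, and the decomposition keeps the retrieval exact. The function name and variable names below are illustrative, not from the paper; this is a minimal NumPy sketch, assuming dot-product scoring.

```python
import numpy as np

def product_key_topk(query, subkeys1, subkeys2, k):
    """Retrieve the top-k of n*n product keys from two sets of n sub-keys.

    A full key's score decomposes additively over its two halves, so the
    global top-k is guaranteed to lie among the k*k combinations of the
    per-half top-k candidates.
    """
    d_half = query.shape[0] // 2
    q1, q2 = query[:d_half], query[d_half:]      # split query into halves
    s1 = subkeys1 @ q1                           # (n,) scores vs. first sub-key set
    s2 = subkeys2 @ q2                           # (n,) scores vs. second sub-key set
    top1 = np.argsort(s1)[-k:]                   # top-k candidates per half
    top2 = np.argsort(s2)[-k:]
    cand = s1[top1][:, None] + s2[top2][None, :] # (k, k) combined candidate scores
    flat = np.argsort(cand.ravel())[-k:]
    i, j = np.unravel_index(flat, cand.shape)
    indices = top1[i] * subkeys2.shape[0] + top2[j]  # flat index into the n*n keys
    return indices, cand[i, j]
```

For a million experts (n = 1024 per sub-key set), this reduces the scoring cost from roughly a million comparisons to a couple of thousand.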

A PEER layer is formally defined as a function consisting of three components: a pool of experts, each sharing the same signature; a corresponding set of product keys; and a query network that maps the input vector to a query vector.
A PEER layer can be inserted into the middle of a transformer backbone or used to replace FFW layers. Given the state vector from the previous layer, a query network maps it to a query vector. This vector is then compared with the product keys to compute the router scores and retrieve the top experts. After the retrieved experts make their predictions, their outputs are linearly combined using softmax-normalized router scores as weights.
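The forward pass described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: names are hypothetical, the query network is reduced to a single linear map, and the router scores all keys densely for clarity (in the actual PEER layer, product keys avoid materializing all N comparisons). Each expert is a single neuron e_i(x) = v_i · relu(u_i · x).

```python
import numpy as np

def peer_layer(x, w_query, keys, u, v, k):
    """Minimal single-token PEER forward pass (names hypothetical).

    x:       (d,)      input state vector from the previous layer
    w_query: (d_q, d)  query network (a linear map here)
    keys:    (N, d_q)  product keys, shown as a dense matrix for clarity
    u, v:    (N, d)    input/output weights of N single-neuron experts
    """
    q = w_query @ x                        # map state to a query vector
    scores = keys @ q                      # router scores against the keys
    top = np.argsort(scores)[-k:]          # retrieve the top-k experts
    g = np.exp(scores[top] - scores[top].max())
    g /= g.sum()                           # softmax-normalize router scores
    h = np.maximum(u[top] @ x, 0.0)        # each retrieved expert's activation
    return (g * h) @ v[top]                # weighted combination of outputs
```

Because only k of the N experts are evaluated per token, the compute cost depends on k rather than on the total expert count, which is what decouples capacity from FLOPs.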

In their empirical study, the researchers conducted isoFLOP analysis on language modeling tasks, comparing PEER with various baselines. The results demonstrate that PEER layers outperform dense FFWs and coarse-grained MoEs in terms of performance-compute trade-off.
The paper Mixture of A Million Experts is on arXiv.
Author: Hecate He | Editor: Chain Zhang
