
Huawei & Peking U’s DiJiang: A Transformer Achieving LLaMA2-7B Performance at 1/50th the Training Cost

A research team from Huawei and Peking University introduces DiJiang, a groundbreaking Frequency Domain Kernelization approach, which facilitates the transition to a linear complexity model with minimal training overhead, achieving performance akin to LLaMA2-7B across various benchmarks, but at just 1/50th of the training cost.

The Transformer architecture has emerged as a pivotal tool in numerous domains, excelling particularly in tasks like speech recognition, machine translation, and document summarization. Yet, its efficacy often hinges on expanding the model’s size to tackle increasingly intricate challenges, thereby imposing substantial computational burdens.

In the pursuit of alleviating the computational strain associated with Transformers, the exploration of linear attention mechanisms has garnered notable traction. Nonetheless, enhancing these mechanisms typically entails extensive retraining, a prohibitive endeavor for large language models brimming with parameters.

In a new paper DiJiang: Efficient Large Language Models through Compact Kernelization, a research team from Huawei Noah’s Ark Lab and Peking University introduces DiJiang, a groundbreaking Frequency Domain Kernelization approach. This innovation facilitates the transition to a linear complexity model with minimal training overhead, achieving performance akin to LLaMA2-7B across various benchmarks, but at just 1/50th of the training cost.

The researchers initially recognized the potential of fast attention approximation techniques to mitigate computational overhead in large-scale models. However, such methods had not been thoroughly validated on large language models. Through a comprehensive examination of existing linear attention schemes, the team pinpointed sampling based on the Monte Carlo method as a primary source of approximation error.
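
To make the error source concrete: earlier linear-attention methods such as Performer estimate the softmax kernel exp(q·k) as an average over randomly drawn projections, so a finite Monte Carlo sample is inherently noisy. The toy sketch below illustrates that estimate in plain NumPy; it is not code from the paper, and the feature count and scaling are arbitrary choices for illustration.

```python
# Toy illustration (not code from the paper): approximating the softmax kernel
# exp(q.k) with Performer-style positive random features drawn by Monte Carlo
# sampling. The estimate is unbiased but noisy for a finite number of samples.
import numpy as np

rng = np.random.default_rng(0)
d, num_features = 16, 256                      # dimensions chosen arbitrarily
q, k = rng.standard_normal((2, d)) * 0.3

exact = np.exp(q @ k)                          # target softmax-kernel value

W = rng.standard_normal((num_features, d))     # Monte Carlo samples w ~ N(0, I)

def phi(x):
    # positive random features: E[phi(q) . phi(k)] = exp(q . k)
    return np.exp(W @ x - x @ x / 2) / np.sqrt(num_features)

estimate = phi(q) @ phi(k)                     # noisy Monte Carlo estimate
print(f"exact={exact:.4f}  monte_carlo={estimate:.4f}")
```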

To address this, they advocate for weighted Quasi-Monte Carlo sampling, specifically introducing Frequency Domain Kernelization. This approach maps the queries and keys of a Transformer to the frequency domain using the Discrete Cosine Transform (DCT), which eliminates the softmax operation in the attention mechanism and reduces the attention computation to linear complexity.
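
To illustrate how such a frequency-domain kernelization yields linear-complexity attention, the sketch below maps queries and keys with a DCT, applies a positive feature map in place of softmax, and computes attention as phi(Q)(phi(K)^T V) with a shared normalizer. This is a minimal reconstruction from the description above, not the authors’ released implementation; the exponential feature map, the per-frequency weights standing in for weighted Quasi-Monte Carlo sampling, and the function names are assumptions.

```python
# A minimal sketch of DCT-based kernelized linear attention, reconstructed from
# the description above; the feature map and weights are assumptions, not the
# authors' released implementation.
import numpy as np
from scipy.fft import dct


def dct_feature_map(x, weights):
    """Map queries/keys (seq_len, d) to the frequency domain with a DCT and
    return a positive feature map; `weights` (d,) stands in for the paper's
    weighted Quasi-Monte Carlo sampling."""
    z = dct(x, type=2, norm="ortho", axis=-1)   # frequency-domain projection
    return np.exp(weights * z)                  # positive features replace softmax


def linear_attention(Q, K, V, weights):
    """Approximate softmax(Q K^T) V as phi(Q) (phi(K)^T V) / (phi(Q) phi(K)^T 1),
    which costs O(n * d^2) instead of O(n^2 * d)."""
    phi_q = dct_feature_map(Q, weights)         # (n, d)
    phi_k = dct_feature_map(K, weights)         # (n, d)
    kv = phi_k.T @ V                            # (d, d_v), aggregated once
    normalizer = phi_q @ phi_k.sum(axis=0)      # (n,)
    return (phi_q @ kv) / normalizer[:, None]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 128, 64
    Q, K, V = rng.standard_normal((3, n, d))
    weights = rng.uniform(0.5, 1.0, size=d)     # placeholder frequency weights
    print(linear_attention(Q, K, V, weights).shape)   # (128, 64)
```

Because phi(K)^T V and the normalizer are aggregated once over the sequence, the cost grows linearly with sequence length rather than quadratically, which is where the training and inference savings come from.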

The team substantiates their proposal both theoretically and empirically. Theoretically, they demonstrate that the frequency domain mapping serves as an approximate equivalent to the original attention mechanism. Empirically, DiJiang achieves performance on par with the original Transformer but at a significantly reduced training cost (less than 1/10th) and faster inference speeds (up to approximately 10x).

In summary, DiJiang heralds a notable stride forward in crafting efficient and scalable Transformer models. Its potential for wider application holds promise for driving advancements across various natural language processing tasks and beyond.

Code is available on the project’s GitHub. The paper DiJiang: Efficient Large Language Models through Compact Kernelization is on arXiv.


Author: Hecate He | Editor: Chain Zhang


