The Transformer architecture has demonstrated remarkable scalability, leading to substantial improvements in accuracy. However, this advancement comes at the cost of exceedingly high computational requirements, which have emerged as a significant obstacle in real-world applications.
Although researchers have actively pursued solutions to reduce the dimensions of Transformer components and prune elements like attention heads, another critical component, the Feed Forward Network (FFN), has remained relatively underexplored.
In a recent paper titled “One Wide Feedforward is All You Need,” a collaborative research team from Equall and Apple delves into the role of the FFN and uncovers a surprising finding: despite consuming a significant portion of the model’s parameters, the FFN is highly redundant. The researchers therefore propose removing the FFN from the decoder layers and sharing a single FFN across the encoder, substantially reducing the parameter count at the cost of only a modest drop in accuracy.
In the Transformer architecture, two main components reign supreme: attention and the FFN. Typically, FFNs occupy roughly two-thirds of the parameter budget, leaving attention with the remaining third. In their study, the researchers explore parameter sharing between the encoder and decoder FFNs, aiming to assess its impact on model accuracy.
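To see where the two-thirds figure comes from, consider a single layer of a standard Transformer-base configuration (d_model = 512, d_ff = 2048). The back-of-the-envelope count below ignores biases and layer norms and is illustrative, not taken from the paper:

```python
d_model, d_ff = 512, 2048  # Transformer-base dimensions

# FFN: two linear maps, d_model -> d_ff -> d_model (biases ignored)
ffn_params = 2 * d_model * d_ff          # 2,097,152

# Self-attention: Q, K, V and output projections, each d_model x d_model
attn_params = 4 * d_model * d_model      # 1,048,576

total = ffn_params + attn_params
print(f"FFN share of layer parameters: {ffn_params / total:.2%}")  # 66.67%
```

With these standard dimensions, the FFN's share works out to exactly two-thirds of the per-layer parameter budget.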
The overarching objective is to strike a balance between model size, latency, and accuracy. The research team’s primary focus revolves around answering the following questions:
- How many parameters can be shared or pruned with minimal to no degradation in accuracy?
- Do the encoder and decoder FFNs exhibit similar effects when shared?
- Can FFN parameters be allocated more efficiently while maintaining the same model size?
To address these questions, the researchers introduce the “One Wide FFN” model, a novel architectural approach that removes the FFN from the decoder layers and shares a single, widened FFN across all encoder layers. They also employ Linear Centered Kernel Alignment (CKA) to assess the similarity between internal representations and Local Neighborhood Similarity to gauge semantic space similarity across different models.
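Linear CKA can be sketched in a few lines of NumPy. The implementation below follows the standard linear-kernel formulation; it is an illustrative sketch, not the authors' code:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two representation
    matrices X (n x d1) and Y (n x d2) computed on the same n inputs.
    Returns a similarity score in [0, 1]."""
    X = X - X.mean(axis=0)  # center each feature dimension
    Y = Y - Y.mean(axis=0)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))
# A representation is maximally similar to itself ...
print(linear_cka(X, X))        # -> 1.0 (up to floating point)
# ... and CKA is invariant to orthogonal transformations
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))
print(linear_cka(X, X @ Q))    # -> ~1.0
```

This invariance to rotations of the feature space is what makes CKA suitable for comparing hidden representations across differently parameterized models.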
The results of their study demonstrate that both model accuracy and the internal representations of the Transformer remain stable under the One Wide FFN architecture, while the parameter count drops significantly, offering promise for more efficient and practical deployment of Transformer models.
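As a back-of-the-envelope illustration of the savings (using assumed Transformer-base dimensions and an assumed 4x widening factor, not figures reported in the paper): a 6-layer encoder / 6-layer decoder model carries 12 independent FFNs, whereas the One Wide FFN variant keeps a single shared encoder FFN, which can then be widened to reinvest part of the freed budget.

```python
d_model, d_ff, enc_layers, dec_layers = 512, 2048, 6, 6

ffn = lambda width: 2 * d_model * width  # params of one FFN, biases ignored

baseline = (enc_layers + dec_layers) * ffn(d_ff)  # 12 separate FFNs
one_wide = ffn(4 * d_ff)  # single shared FFN, 4x wider (illustrative choice)

print(f"baseline FFN params: {baseline:,}")  # 25,165,824
print(f"one wide FFN params: {one_wide:,}")  # 8,388,608
print(f"reduction: {1 - one_wide / baseline:.0%}")
```

Even after quadrupling the width of the shared FFN, this toy accounting still removes roughly two-thirds of the FFN parameters relative to the baseline.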
The paper One Wide Feedforward is All You Need is available on arXiv.
Author: Hecate He | Editor: Chain Zhang