The ACL 2021 Paper Awards were announced this week, with the best paper honours going to a team from ByteDance AI Lab, University of Wisconsin–Madison and Nanjing University. Their paper treats vocabulary construction for machine translation, aka vocabularization, as an optimal transport (OT) problem, and proposes VOLT (Vocabulary Learning via Optimal Transport), a simple and efficient approach that works without trial training.
The performance of neural machine translation (NMT) systems depends heavily on the choice of token vocabulary, so identifying a good vocabulary and finding the optimal tokens is crucial, yet the process typically involves intensive and laborious trial training.
In this paper, the researchers leverage optimal transport and propose VOLT as a novel way to automatically find the optimal vocabulary without trial training. The method achieves improved performance on widely-used vocabularies in diverse scenarios, including WMT-14 English-German and TED multilingual translation.
Most traditional NMT methods are built on word-level vocabularies, and although these models have achieved promising results, they fail when handling rare words under limited vocabulary sizes. Other advanced vocabularization approaches, such as byte-level and character-level methods, can solve the rare-word problem, and can also decrease token sparsity and increase the shared features between similar words. Even popular sub-word approaches, which achieve good results, may still incur high computation costs, as they only consider the frequency of a token while neglecting the size of the vocabulary.
To address these issues and take both entropy and vocabulary size into consideration, the team borrowed the economics concept of marginal utility, proposing the marginal utility of vocabularization (MUV) as the optimization objective. MUV evaluates the benefits (entropy) a corpus can get from an increase of cost (size), with the goal of maximizing MUV in tractable time complexity.
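The idea behind MUV can be illustrated with a toy computation. The sketch below, a rough reading of the paper rather than its exact formulation, treats MUV as the entropy decrease gained per unit of added vocabulary size, with corpus entropy normalized by average token length; the function names and toy tokenizations are illustrative:

```python
import math
from collections import Counter

def corpus_entropy(tokens, avg_len):
    """Corpus entropy normalized by average token length (natural log)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / avg_len

def marginal_utility(tokens_small, tokens_big, size_small, size_big):
    """MUV sketch: negative change in entropy per unit increase in vocab size."""
    def avg_len(toks):
        return sum(len(t) for t in toks) / len(toks)
    h_small = corpus_entropy(tokens_small, avg_len(tokens_small))
    h_big = corpus_entropy(tokens_big, avg_len(tokens_big))
    return -(h_big - h_small) / (size_big - size_small)

# Toy corpus "abab": a character vocabulary {a, b} (size 2) versus a
# vocabulary that adds the merged token "ab" (size 3).
muv = marginal_utility(["a", "b", "a", "b"], ["ab", "ab"], 2, 3)
```

A positive MUV means the extra vocabulary entry buys a real entropy reduction; VOLT searches for the vocabulary maximizing this quantity.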
The team formulates vocabulary construction as a discrete optimization problem that aims to find the vocabulary with the highest MUV. Intuitively, vocabulary construction can be regarded as a process that transports chars (characters) into token candidates. Each transport matrix represents a vocabulary, and the transport matrix decides how many chars are transported to token candidates. Different transport methods bring different costs, and so the goal is to find a transport matrix that minimizes the transfer cost.
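Entropic-regularized OT problems of this kind are commonly solved with Sinkhorn iterations. The generic solver below is a minimal sketch of that standard algorithm, not VOLT's exact formulation: the cost matrix, marginals, and regularization strength are illustrative placeholders.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropic-regularized optimal transport via Sinkhorn iterations.

    cost: (m, n) cost matrix between chars (rows) and token candidates (cols)
    a:    (m,) source marginal, e.g. character frequencies
    b:    (n,) target marginal, e.g. mass budget of token candidates
    Returns the (m, n) transport matrix P = diag(u) K diag(v).
    """
    K = np.exp(-cost / eps)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                # rescale columns toward marginal b
        u = a / (K @ v)                  # rescale rows toward marginal a
    return u[:, None] * K * v[None, :]

# Tiny example: 2 characters, 2 token candidates.
P = sinkhorn(np.array([[0.0, 1.0], [1.0, 0.0]]),
             np.array([0.5, 0.5]), np.array([0.5, 0.5]))
```

Each entry `P[i, j]` says how much mass of character `i` flows to candidate token `j`; in VOLT, such a transport matrix encodes a vocabulary, and minimizing the transport cost selects the best one.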
The team conducted experiments on three datasets (WMT-14 English-German translation, TED bilingual translation, and TED multilingual translation) and identified the main results as:
- Vocabularies searched by VOLT are better than widely-used vocabularies on bilingual MT settings.
- Vocabularies searched by VOLT are on par with heuristically-searched vocabularies on low-resource datasets.
- VOLT works well on multilingual MT settings.
- VOLT is a green vocabularization solution.
- A simple baseline with a VOLT-generated vocabulary achieves SOTA results.
- VOLT beats SentencePiece and WordPiece.
- VOLT works on various architectures.
Overall, the experiments validate VOLT’s ability to effectively find well-performing vocabularies across diverse settings.
The associated code is available on the project GitHub. The paper Vocabulary Learning via Optimal Transport for Neural Machine Translation is on arXiv.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang