
ACL 2021 Best Paper: Finding the Optimal Vocabulary for Machine Translation via an Optimal Transport Approach

A research team from ByteDance AI Lab, University of Wisconsin–Madison and Nanjing University wins the ACL 2021 best paper award. Their proposed Vocabulary Learning via Optimal Transport (VOLT) approach leverages optimal transport to automatically find an optimal vocabulary without trial training.

The ACL 2021 Paper Awards were announced this week, with the best paper honours going to a team from ByteDance AI Lab, University of Wisconsin–Madison and Nanjing University. Their paper treats vocabulary construction for machine translation, aka vocabularization, as an optimal transport (OT) problem, and proposes VOLT (Vocabulary Learning via Optimal Transport), a simple and efficient approach that works without trial training.

The performance of neural machine translation (NMT) systems is highly dependent on the choice of token vocabularies, and so it is crucial to identify a good vocabulary and find the optimal tokens — a process that typically involves intensive and laborious trial training.

In this paper, the researchers leverage optimal transport and propose VOLT as a novel way to automatically find the optimal vocabulary without trial training. The vocabularies it finds outperform widely used vocabularies in diverse scenarios, including WMT-14 English-German and TED multilingual translation.


Most traditional NMT methods are built on word-level vocabularies, and although these models have achieved promising results, they fail to handle rare words when the vocabulary size is limited. More advanced vocabularization approaches, such as byte-level and character-level methods, can solve the rare-word problem, and they also decrease token sparsity and increase the shared features between similar words. Popular sub-word approaches achieve good results, but they consider only the frequency of a token and neglect the size of the vocabulary, so choosing a size still requires trial training and can incur high computation costs.


To address these issues and take both entropy and vocabulary size into consideration, the team borrowed the economics concept of marginal utility, proposing the marginal utility of vocabularization (MUV) as the optimization objective. MUV measures how much benefit (in entropy) a corpus gains from a given increase in cost (vocabulary size). The goal is to find the vocabulary that maximizes MUV within tractable time complexity.
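
As a rough illustration of the MUV idea, the sketch below computes a corpus entropy for two tokenizations of the same text and then takes the negative change in entropy per added vocabulary entry. The function names `corpus_entropy` and `muv` and the toy token lists are hypothetical, and normalizing the entropy by the average token length is an assumption about the exact definition rather than something stated in this article.

```python
import math
from collections import Counter


def corpus_entropy(tokens):
    """Entropy of the token distribution, divided by the average token length.

    Assumption: the length normalization is one plausible way to make
    vocabularies of different granularity comparable.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    avg_len = sum(len(tok) for tok in counts) / len(counts)
    return entropy / avg_len


def muv(tokens_small, tokens_large):
    """Negative entropy change per added vocabulary entry between two tokenizations."""
    size_small = len(set(tokens_small))
    size_large = len(set(tokens_large))
    if size_large <= size_small:
        raise ValueError("the second tokenization must use a larger vocabulary")
    delta_entropy = corpus_entropy(tokens_large) - corpus_entropy(tokens_small)
    return -delta_entropy / (size_large - size_small)


# Toy data: the same eight characters, segmented with and without the token "ab".
char_level = ["a", "b", "a", "b", "a", "b", "a", "b"]
with_merge = ["ab", "ab", "ab", "a", "b"]

score = muv(char_level, with_merge)
print(f"MUV when adding 'ab': {score:.4f}")
```

A vocabulary search in this spirit would compare such scores across candidate sizes and keep the size where the score is highest. The toy numbers here carry no meaning beyond showing the arithmetic.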


The team formulates vocabulary construction as a discrete optimization problem that aims to find the vocabulary with the highest MUV. Intuitively, vocabulary construction can be regarded as a process that transports chars (characters) into token candidates. Each transport matrix represents a vocabulary, and the transport matrix decides how many chars are transported to token candidates. Different transport methods bring different costs, and so the goal is to find a transport matrix that minimizes the transfer cost.
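
To make the transport view more concrete, here is a minimal sketch of entropy-regularized optimal transport solved with generic Sinkhorn iterations in NumPy. It is not the authors' implementation: the `sinkhorn` function, the cost values, and the character and token masses are all hypothetical toy choices, with a high cost standing in for "this token does not contain this character".

```python
import numpy as np


def sinkhorn(cost, row_mass, col_mass, reg=0.1, n_iters=200):
    """Approximate the minimum-cost transport plan with Sinkhorn iterations.

    cost[i, j] is the cost of moving one unit of mass from row i to column j;
    reg is the entropy-regularization strength (smaller means a sharper plan).
    """
    kernel = np.exp(-cost / reg)
    u = np.ones_like(row_mass)
    for _ in range(n_iters):
        v = col_mass / (kernel.T @ u)  # rescale to match column totals
        u = row_mass / (kernel @ v)    # rescale to match row totals
    return u[:, None] * kernel * v[None, :]


# Rows are characters, columns are token candidates (toy, hypothetical values).
chars = ["a", "b"]
tokens = ["a", "b", "ab"]

# Assumed costs: 0 for a token equal to the character, 1 for a longer token
# containing it, 10 for a token that does not contain it.
cost = np.array([[0.0, 10.0, 1.0],
                 [10.0, 0.0, 1.0]])

row_mass = np.array([0.6, 0.4])       # how much of each character the corpus has
col_mass = np.array([0.3, 0.2, 0.5])  # how much character mass each token receives

plan = sinkhorn(cost, row_mass, col_mass)
print(np.round(plan, 3))
```

Each entry of `plan` says how much of a character is routed to a token candidate, which mirrors the description of a transport matrix above. In this toy setup almost no mass flows from a character to a token that does not contain it, because that route is made expensive.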


The team conducted experiments on three datasets (WMT-14 English-German translation, TED bilingual translation, and TED multilingual translation) and identified the following main results:

  1. Vocabularies searched by VOLT are better than widely-used vocabularies on bilingual MT settings.
  2. Vocabularies searched by VOLT are on par with heuristically-searched vocabularies on low-resource datasets.
  3. VOLT works well on multilingual MT settings.
  4. VOLT is a green vocabularization solution.
  5. A simple baseline with a VOLT-generated vocabulary achieves SOTA results.
  6. VOLT beats SentencePiece and WordPiece.
  7. VOLT works on various architectures.

Overall, the experiments validate VOLT’s ability to effectively find well-performing vocabularies across diverse settings.

The associated code is available on the project GitHub. The paper Vocabulary Learning via Optimal Transport for Neural Machine Translation is on arXiv.


Author: Hecate He | Editor: Michael Sarazen, Chain Zhang

