
AMBERT: BERT with Multi-Grained Tokenization Achieves SOTA Results on English and Chinese NLU Tasks

AMBERT (A Multigrained BERT) leverages both fine-grained and coarse-grained tokenizations to achieve SOTA performance on English and Chinese language tasks.

Researchers from ByteDance AI Lab have proposed a novel pretrained language model, AMBERT (A Multigrained BERT), which leverages both fine-grained and coarse-grained tokenizations to achieve SOTA performance on English and Chinese language tasks.

Since its 2018 release, Google’s epoch-making language model BERT (Bidirectional Encoder Representations from Transformers) has exhibited powerful performance across a variety of natural language understanding (NLU) tasks and spawned other, similarly powerful pretrained models. BERT’s pretraining is based on masked language modelling, wherein some tokens in the input text are masked and the model is trained to reconstruct the original sentences. In most cases the tokens are fine-grained, but they can also be coarse-grained. Research has shown that the fine-grained and coarse-grained approaches both have pros and cons, and the new AMBERT model is designed to take advantage of both.
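As a quick illustration of that objective, the Python sketch below masks random tokens and records which ones the model must reconstruct; the 15 percent masking rate and the [MASK] token id are typical BERT-style choices used here as assumptions, not AMBERT’s exact configuration.

```python
import random

MASK_ID = 103      # assumed [MASK] token id (BERT-style vocabulary)
MASK_PROB = 0.15   # typical masking rate; an assumption, not AMBERT-specific

def mask_tokens(token_ids):
    """Replace a random subset of tokens with [MASK]; keep the originals as labels."""
    masked = list(token_ids)
    labels = [-100] * len(token_ids)   # -100 = position ignored by the loss
    for i, tok in enumerate(token_ids):
        if random.random() < MASK_PROB:
            masked[i] = MASK_ID        # hide the token from the model
            labels[i] = tok            # the model is trained to reconstruct it
    return masked, labels

inputs, labels = mask_tokens([2023, 2003, 1037, 7953, 6251])  # example token ids
```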


“Fine-grained” applies to words or sub-words in English and individual hanzi characters in Chinese, while “coarse-grained” refers to phrases in English and compound words in Chinese. As might be expected, fine-grained tokens are basic lexical units that are less complete in meaning but easier to learn, while coarse-grained tokens are lexically more complete but harder to learn.
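To make the distinction concrete, the toy Python sketch below contrasts the two tokenizations of the same English phrase; the phrase list and the greedy merging rule are hypothetical stand-ins for the real word and phrase segmenters such an approach relies on.

```python
# Toy illustration of fine- vs coarse-grained tokenization.
# PHRASES is a hypothetical stand-in for a real phrase lexicon/segmenter.
PHRASES = {"new york", "ice cream"}

def fine_grained(sentence):
    """Fine-grained: split the sentence into individual words."""
    return sentence.lower().split()

def coarse_grained(sentence):
    """Coarse-grained: greedily merge adjacent words that form a known phrase."""
    words = fine_grained(sentence)
    tokens, i = [], 0
    while i < len(words):
        pair = " ".join(words[i:i + 2])
        if pair in PHRASES:
            tokens.append(pair)        # keep the phrase as a single token
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens

print(fine_grained("A New York Times report"))    # ['a', 'new', 'york', 'times', 'report']
print(coarse_grained("A New York Times report"))  # ['a', 'new york', 'times', 'report']
```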

Image: First-layer attention maps of fine-grained BERT models for English and Chinese sentences
Image: First-layer attention maps of coarse-grained BERT models for English and Chinese sentences


Although most pretrained language models use fine-grained tokenizations, coarse-grained tokenizations are also beneficial for the following reasons:

  • They are useful for building a Chinese BERT when sufficient training data is available.
  • They have proven effective for language understanding tasks.
  • They can substantially enhance accuracy on span-selection tasks.

AMBERT has two encoders, one that processes the fine-grained token sequence and another that processes the coarse-grained token sequence. Because universal transformers, which share parameters across layers, have proven powerful within the BERT architecture, AMBERT’s two encoders are likewise designed to share the same parameters at each layer. AMBERT is thus more expressive, as it learns and uses contextualized representations at both the fine-grained and coarse-grained levels, and more efficient, as parameter sharing between the two encoders reduces model size.

Image: AMBERT’s multi-grained representation creation process
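The PyTorch sketch below captures the core idea of two granularity-specific streams passing through a parameter-shared encoder; the layer sizes, the use of nn.TransformerEncoder, and the separate embedding tables are illustrative assumptions rather than the paper’s actual configuration.

```python
import torch
import torch.nn as nn

class MultiGrainedEncoderSketch(nn.Module):
    """Illustrative sketch (not the paper's exact architecture): two token
    streams with their own embeddings pass through one Transformer encoder
    whose weights are reused, mimicking AMBERT's parameter sharing."""

    def __init__(self, fine_vocab, coarse_vocab, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.fine_emb = nn.Embedding(fine_vocab, d_model)      # fine-grained token embeddings
        self.coarse_emb = nn.Embedding(coarse_vocab, d_model)  # coarse-grained token embeddings
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # A single encoder object is applied to both streams, so all of its
        # parameters are shared between the fine- and coarse-grained paths.
        self.shared_encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, fine_ids, coarse_ids):
        fine_repr = self.shared_encoder(self.fine_emb(fine_ids))
        coarse_repr = self.shared_encoder(self.coarse_emb(coarse_ids))
        return fine_repr, coarse_repr  # both granularities are available downstream

model = MultiGrainedEncoderSketch(fine_vocab=30522, coarse_vocab=50000)
fine_ids = torch.randint(0, 30522, (2, 16))    # a batch of fine-grained token ids
coarse_ids = torch.randint(0, 50000, (2, 8))   # a batch of coarse-grained token ids
fine_out, coarse_out = model(fine_ids, coarse_ids)
```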


The researchers conducted experimental comparisons between AMBERT and baselines including fine-grained BERT and coarse-grained BERT. For Chinese, they used a corpus of 25 million documents comprising 57 GB of uncompressed text from Jinri Toutiao, with the Chinese Language Understanding Evaluation (CLUE) benchmark used for evaluation.

Image: Performance on classification tasks in CLUE in terms of accuracy (%)
Image: Performance on reading comprehension tasks in CLUE in terms of F1, EM (Exact Match) and accuracy (%)
Image: State-of-the-art results for Chinese base models in CLUE in terms of accuracy (%)

For English, the researchers used a corpus of 13.9 million documents comprising 47 GB of uncompressed text from Wikipedia and OpenWebText, with the General Language Understanding Evaluation (GLUE) benchmark and SQuAD tasks used for evaluation.

Image: Performance on GLUE tasks in terms of accuracy (%)
Image: Performance on three English reading comprehension tasks in terms of F1, EM (Exact Match) and accuracy (%)
Image: State-of-the-art results of English base models in GLUE
Image: Sample sentence matching tasks in English and Chinese

AMBERT achieved SOTA performance in experiments on NLU tasks, bettering BERT’s average score by about 2.7 percent on the Chinese benchmark CLUE and by over 3.0 percent on a variety of tasks in the English benchmarks GLUE and SQuAD.

The paper AMBERT: A Pre-trained Language Model with Multi-Grained Tokenization is on arXiv.


Analyst: Hecate He | Editor: Michael Sarazen; Yuan Yuan


