Researchers from ByteDance AI Lab have proposed a novel pretrained language model, AMBERT (A Multigrained BERT), which leverages both fine-grained and coarse-grained tokenizations to achieve SOTA performance on English and Chinese language tasks.
Since its 2018 release, Google’s epoch-making language model BERT (Bidirectional Encoder Representations from Transformers) has exhibited powerful performance across a variety of natural language understanding (NLU) tasks and spawned other, similarly powerful pretraining models. BERT’s pretraining is based on masked language modelling, wherein some tokens in the input text are masked and the model is trained to predict them and thus reconstruct the original sentences. In most cases the tokens are fine-grained, but they can also be coarse-grained. Research has shown that the fine-grained and coarse-grained approaches each have pros and cons, and the new AMBERT model is designed to take advantage of both.
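To make the masking step concrete, the sketch below (illustrative only, not the authors’ code; the function name and the simplified always-[MASK] policy are our own assumptions) masks a fraction of token ids and keeps the originals as labels, which is the signal the model learns to recover during pretraining:

```python
# Illustrative sketch of masked language modelling data preparation.
# The 15% all-[MASK] policy is a simplification, not AMBERT's exact recipe.
import torch

def mask_tokens(input_ids: torch.Tensor, mask_token_id: int, mlm_prob: float = 0.15):
    """Randomly mask tokens; labels keep the original ids so the model can recover them."""
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < mlm_prob
    labels[~mask] = -100                 # unmasked positions use PyTorch's default ignore index
    masked_ids = input_ids.clone()
    masked_ids[mask] = mask_token_id     # BERT also keeps or randomizes some tokens; omitted here
    return masked_ids, labels

# Toy example: a "sentence" of token ids from a fine-grained vocabulary
ids = torch.tensor([101, 7592, 2088, 2003, 2307, 102])
masked_ids, labels = mask_tokens(ids, mask_token_id=103)
```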
“Fine-grained” refers to words or sub-words in English and individual hanzi characters in Chinese, while “coarse-grained” refers to phrases in English and compound words in Chinese. As might be expected, fine-grained basic lexical units are less complete but easier to learn, while coarse-grained tokens are lexically more complete but harder to learn.
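As a toy illustration (the sentences and token lists below are our own, not taken from the paper), the same input can be segmented at both granularities:

```python
# Illustrative only: the same inputs tokenized at the two granularities AMBERT uses.
# Chinese: individual characters vs. compound words
zh_fine   = ["今", "天", "天", "气", "很", "好"]   # individual hanzi
zh_coarse = ["今天", "天气", "很", "好"]            # compound words ("today", "weather")

# English: sub-words vs. phrases
en_fine   = ["new", "york", "city", "is", "big"]
en_coarse = ["new york city", "is", "big"]          # "new york city" kept as one phrase
```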
Although most pretrained language models use fine-grained tokenizations, coarse-grained tokenizations are also beneficial for the following reasons:
- Useful for building a Chinese BERT when sufficient training data is available.
- Proven effective for language understanding tasks.
- Can substantially enhance accuracy of span selection tasks.
AMBERT has two encoders, one for processing fine-grained token sequences and another for processing coarse-grained token sequences. Also, because universal transformers with shared parameters across layers have proven powerful in the BERT architecture, AMBERT’s two encoders are designed to share the same parameters at each layer. AMBERT is thus more expressive, learning and utilizing contextualized representations at both the fine-grained and coarse-grained levels, and more efficient, since the shared parameters keep the model size down.
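A minimal PyTorch sketch of this two-stream design is shown below (our own simplification with made-up dimensions, not the released implementation): each granularity gets its own embedding table, but a single Transformer encoder is reused for both streams, so its parameters are shared in the way the architecture describes.

```python
# Sketch only: two token streams, one shared encoder (dimensions are arbitrary).
import torch
import torch.nn as nn

class ToyAMBERT(nn.Module):
    def __init__(self, fine_vocab, coarse_vocab, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.fine_emb = nn.Embedding(fine_vocab, d_model)      # fine-grained vocabulary
        self.coarse_emb = nn.Embedding(coarse_vocab, d_model)  # coarse-grained vocabulary
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(layer, n_layers)  # one set of weights

    def forward(self, fine_ids, coarse_ids):
        # The same encoder processes both streams, so its parameters are shared.
        fine_repr = self.shared_encoder(self.fine_emb(fine_ids))
        coarse_repr = self.shared_encoder(self.coarse_emb(coarse_ids))
        return fine_repr, coarse_repr

model = ToyAMBERT(fine_vocab=30522, coarse_vocab=50000)
fine = torch.randint(0, 30522, (2, 16))
coarse = torch.randint(0, 50000, (2, 8))
f_out, c_out = model(fine, coarse)   # shapes: (2, 16, 256) and (2, 8, 256)
```

Reusing one encoder in this way keeps the parameter count close to that of a single-granularity BERT while still producing contextualized representations for both token sequences.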
The researchers conducted experimental comparisons between AMBERT and baselines including fine-grained BERT and coarse-grained BERT. For Chinese, they used a corpus of 25 million documents comprising 57 GB of uncompressed text from Jinri Toutiao. The benchmark was the Chinese Language Understanding Evaluation (CLUE) dataset.
For English, the researchers used a corpus of 13.9 million documents comprising 47 GB of uncompressed text from Wikipedia and OpenWebText, using the General Language Understanding Evaluation (GLUE) and SQuAD tasks as benchmarks.
AMBERT achieved SOTA performance in experiments on NLU tasks, bettering BERT’s average score by about 2.7 percent on the Chinese benchmark CLUE and by over 3.0 percent on a variety of tasks in the English benchmarks GLUE and SQuAD.
The paper AMBERT: A Pre-trained Language Model with Multi-Grained Tokenization is on arXiv.
Analyst: Hecate He | Editor: Michael Sarazen; Yuan Yuan