A new Google Research study proposes modifying the standard transformer architecture to process byte sequences in natural language processing (NLP). The researchers show that in terms of parameter count, training FLOPs and inference speed, their proposed byte-level models can be competitive with the token-level approach typically employed by contemporary large language models.
Tokenization is the process of splitting sentences or texts into a sequence of tokens. While it is a common data preprocessing procedure for NLP tasks, tokenization can struggle with typos, variants in spelling and capitalization, morphological changes and the out-of-vocabulary tokenization problem.
One way to address these issues is to create token-free models that operate directly on raw text, storing text data as a sequence of bytes so the model can process arbitrary text sequences. This approach, however, introduces a significant computational burden, as byte sequences are much longer than their corresponding word-level token sequences.
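The length gap is easy to see on a small example. The sketch below is only an illustration: it uses whitespace splitting as a crude stand-in for word-level tokenization (an assumption, not what any production tokenizer does), and the helper name `length_comparison` is hypothetical.

```python
def length_comparison(text: str) -> tuple[int, int]:
    """Return (whitespace-separated word count, UTF-8 byte count) for text."""
    n_words = len(text.split())           # crude proxy for word-level tokens
    n_bytes = len(text.encode("utf-8"))   # what a byte-level model would see
    return n_words, n_bytes


if __name__ == "__main__":
    for sample in ("byte-level models read raw text", "naïve café"):
        words, n_bytes = length_comparison(sample)
        print(f"{sample!r}: {words} words, {n_bytes} bytes")
```

Non-ASCII characters widen the gap further, since each one occupies two or more bytes in UTF-8.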
In the paper ByT5: Towards a Token-Free Future With Pre-Trained Byte-to-Byte Models, the Google team proposes ByT5. Rather than using a subword vocabulary like other pretrained large language models, ByT5 operates directly on UTF-8 bytes. The novel architecture eliminates the need for any text preprocessing and can be easily adapted to process byte sequences without adding excessive computational cost.
The proposed ByT5 is based on Google’s recent token-based mT5 (Massively Multilingual Text-to-Text Transfer Transformer), which was trained on a large corpus of unlabelled text data and has achieved state-of-the-art performance across various multilingual NLP tasks. The researchers make mT5 token-free by performing a minimal set of modifications that do not dramatically increase computational cost.
Key changes to the mT5 design include dispensing with the SentencePiece subword vocabulary and feeding UTF-8 bytes directly into the model without any text preprocessing, then embedding these bytes to the model’s hidden size. An additional 3 IDs are also reserved for special tokens: padding, end-of-sentence, and an unused token.
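A byte-level vocabulary of this kind can be sketched in a few lines. The following is a minimal illustration rather than the released ByT5 code: it assumes the three special IDs come first (padding = 0, end-of-sentence = 1, unused = 2) so that every byte value is shifted up by 3, and the function names are hypothetical.

```python
# Assumed layout: the 3 special IDs first, then the 256 possible byte values.
PAD_ID, EOS_ID, UNUSED_ID = 0, 1, 2
NUM_SPECIAL = 3
VOCAB_SIZE = NUM_SPECIAL + 256  # 259 IDs in total


def encode(text: str) -> list[int]:
    """Map text to byte IDs (shifted past the special IDs) and append EOS."""
    return [b + NUM_SPECIAL for b in text.encode("utf-8")] + [EOS_ID]


def decode(ids: list[int]) -> str:
    """Skip special IDs, shift the rest back to raw bytes, decode as UTF-8."""
    raw = bytes(i - NUM_SPECIAL for i in ids if NUM_SPECIAL <= i < VOCAB_SIZE)
    return raw.decode("utf-8")


if __name__ == "__main__":
    ids = encode("héllo")
    print(ids)
    print(decode(ids))
```

Because every possible byte has an ID, no input can fall outside the vocabulary, which is what removes the out-of-vocabulary problem described earlier.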
The team then modifies the pre-training task: instead of adding 100 new tokens for the sentinels, they reuse the final 100 byte IDs. Also, rather than using an average span length of 3 subword tokens, they mask longer byte spans, with the mean mask span length set at 20 bytes.
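The sentinel mechanism can be illustrated with a small, deterministic sketch. Several details here are assumptions for the sake of the example: a 259-ID vocabulary (3 special IDs plus 256 bytes), sentinels assigned from the highest ID downward, and span positions passed in explicitly rather than sampled with a mean length of 20. The names `sentinel_id` and `corrupt_spans` are hypothetical.

```python
VOCAB_SIZE = 259      # assumed: 3 special IDs + 256 byte values
NUM_SENTINELS = 100   # the final 100 byte IDs double as sentinels


def sentinel_id(k: int) -> int:
    """ID of the k-th sentinel, counting down from the top of the vocabulary."""
    if not 0 <= k < NUM_SENTINELS:
        raise ValueError("only 100 sentinels are available")
    return VOCAB_SIZE - 1 - k


def corrupt_spans(ids: list[int], spans: list[tuple[int, int]]):
    """Replace each (start, length) span with a sentinel in the input and
    emit sentinel + original span content in the target.
    Spans must be sorted by start and must not overlap."""
    inputs, targets = [], []
    pos = 0
    for k, (start, length) in enumerate(spans):
        s = sentinel_id(k)
        inputs.extend(ids[pos:start])   # keep the unmasked prefix
        inputs.append(s)                # one sentinel stands in for the span
        targets.append(s)
        targets.extend(ids[start:start + length])
        pos = start + length
    inputs.extend(ids[pos:])            # keep the unmasked tail
    return inputs, targets


if __name__ == "__main__":
    toy_ids = list(range(10, 20))       # placeholder IDs, not real text
    print(corrupt_spans(toy_ids, [(2, 3), (7, 2)]))
```

In actual pre-training the span positions would be sampled at random; fixing them here keeps the example easy to check by hand.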
The team discovered that byte-level models with a “heavier” encoder perform better on both classification and generation tasks, and so set their encoder depth to three times that of the decoder.
Finally, they drop any illegal bytes in the model’s output to keep all byte sequences legal under the UTF-8 standard.
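One straightforward way to drop illegal bytes in Python is to decode with the `errors="ignore"` handler, which silently skips any byte that does not form valid UTF-8. The sketch below shows that approach; the wrapper name `drop_illegal_bytes` is hypothetical, and this is offered as an illustration rather than a description of the released code.

```python
def drop_illegal_bytes(raw: bytes) -> str:
    """Decode model output bytes, discarding anything that is not valid UTF-8."""
    return raw.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    # 0xFF can never appear in valid UTF-8, so it is dropped.
    print(drop_illegal_bytes(b"caf\xc3\xa9\xff!"))
    # A multi-byte character cut off mid-sequence is dropped as well.
    print(drop_illegal_bytes(b"ab\xe2\x82"))
```

This matters for a byte-level decoder because it can emit any byte at any step, including partial multi-byte characters.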
To evaluate how their architecture modifications affect byte-level processing and the associated compute cost trade-offs, the team compares ByT5 against mT5 on a wide range of standard English and multilingual NLP benchmarks.
For cross-lingual understanding on the XTREME benchmark, the researchers compared F1/EM scores on the question answering task. In the most realistic in-language setting, ByT5 beat mT5 across all tasks and model sizes. ByT5 also achieved impressive performance on tasks such as English classification and generation.
The evaluations show that the proposed ByT5 outperforms mT5 at model sizes under one billion parameters, on generative tasks, on multilingual tasks with in-language labels, and in the presence of various types of noise. Overall, the results demonstrate that ByT5 is a competitive byte-level model that can effectively balance computational cost trade-offs.
Google Research has released a set of pretrained byte-level transformer models and all ByT5 code on the project GitHub. The paper ByT5: Towards a Token-Free Future With Pre-Trained Byte-to-Byte Models is on arXiv.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang