A new research initiative is using large language models and high-performance computing (HPC) to extract features from individual protein sequences and to scale protein language models up to larger datasets and model sizes.
Natural language processing (NLP) has improved dramatically in recent years thanks to advances in HPC and to auto-regressive (AR) and auto-encoding (AE) language-modelling techniques. Through transfer learning, a trained language model can extract features that serve as input for a subsequently trained supervised model. Protein research is an excellent use case for transfer learning because the sequence-annotation gap, the gap between the number of known protein sequences and the number that have been annotated, is widening quickly. Closing that gap with AI prediction methods is a key challenge in computational biology and bioinformatics.
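The transfer-learning pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `embed` is a hypothetical stand-in for a frozen pretrained protein language model (here it simply returns amino-acid composition, which is not a learned embedding), and the four training sequences and their labels are made up for the example. Nothing in it is taken from the ProtTrans code.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def embed(sequence):
    """Hypothetical stand-in for a frozen pretrained model.

    Returns a fixed-length feature vector for a sequence. Here it is just
    the amino-acid composition (20 frequencies), NOT a real language-model
    embedding; it only plays the role of the feature extractor.
    """
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts) or 1
    return [c / total for c in counts]


def train_logistic(features, labels, epochs=500, lr=1.0):
    """Fit a small supervised head (logistic regression) on frozen features."""
    dim = len(features[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log loss with respect to z
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict(weights, bias, sequence):
    """Classify a sequence using the frozen features and the trained head."""
    z = sum(w * xi for w, xi in zip(weights, embed(sequence))) + bias
    return 1 if z > 0 else 0


# Toy, made-up training set: label 1 = rich in hydrophobic residues.
train = [("LLVVAAIILL", 1), ("VVLLIIAAFF", 1),
         ("DDEEKKRRDD", 0), ("KKEEDDRRKK", 0)]
X = [embed(seq) for seq, _ in train]  # step 1: extract features once
y = [label for _, label in train]
weights, bias = train_logistic(X, y)  # step 2: train only the small head
```

The point of the design is that the feature extractor is never updated: only the small supervised model is trained on the labelled examples, which is what makes the approach attractive when annotations are scarce.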
A team of researchers from Technical University of Munich (TUM), Med AI Technology (Wu Xi) Ltd, Google AI, NVIDIA and Oak Ridge National Laboratory (ORNL) recently launched the ProtTrans project, which provides pretrained language models for proteins. The ProtTrans models were trained on thousands of GPUs and hundreds of Google TPUs using several Transformer architectures. The project has been open-sourced on GitHub and is backed by many partner companies and research institutions.
The researchers trained two AR language models (Transformer-XL and XLNet) and two AE models (BERT and ALBERT) on 80 billion amino acids from 200 million protein sequences. They also trained a Transformer-XL model on 393 billion amino acids from 2.1 billion protein sequences drawn from the Big Fat Database, the largest protein sequence set used in the project. The large-scale language model training was done on the IBM-built Summit supercomputer, whose 266.7 PFLOPS make it the second-fastest on the planet.
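The difference between the two model families comes down to the training objective. The sketch below builds training examples for each from a protein sequence. It is a simplified illustration, not the ProtTrans pipeline: the function names, the `#` mask symbol and the example sequence are assumptions made for this example, and details such as XLNet's permutation objective or BERT's rule for sometimes replacing a masked token with a random one are not modelled. The 15% masking rate follows the original BERT recipe.

```python
import random

MASK = "#"  # hypothetical mask symbol; not a real amino-acid code


def autoregressive_pairs(sequence):
    """AR objective: predict each residue from the residues before it."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


def autoencoding_example(sequence, mask_rate=0.15, seed=0):
    """AE objective (BERT-style): hide some residues, then predict them
    from the context on both sides."""
    rng = random.Random(seed)
    # Mask at least one residue, but never more than the sequence holds.
    n_masked = min(len(sequence), max(1, round(mask_rate * len(sequence))))
    positions = sorted(rng.sample(range(len(sequence)), n_masked))
    corrupted = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = sequence[pos]  # what the model must recover
        corrupted[pos] = MASK
    return "".join(corrupted), targets


# Example sequence made up for illustration.
for context, target in autoregressive_pairs("MKTAY"):
    print(f"AR: given {context!r} predict {target!r}")

corrupted, targets = autoencoding_example("MKTAYIAKQR")
print(f"AE: given {corrupted!r} predict {targets}")
```

An AR model only ever sees the left-hand context, while an AE model sees the whole corrupted sequence at once and is scored only on the hidden positions.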
Through multiple experiments, the researchers demonstrated that it is feasible to train large language models on proteins and to scale them up to larger models and massive datasets. They also showed that HPC makes it possible to grow protein language models to a scale that narrows the gap between language models and traditional models trained on evolutionary information, for example in addressing sequence-structure and sequence-function gaps or, more generally, sequence-annotation gaps.
The researchers have pledged to share all the pretrained models and associated knowledge, and have invited other interested parties to participate in the project by fixing bugs, proposing new features and improving the documentation. The ProtTrans team plans to regularly update the project with new pretrained protein models to support the bioinformatics community and Covid-19 research.
The paper ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing is on arXiv.
Author: Herin Zhao | Editor: Michael Sarazen & Fangyu Cai