Sequences are central to deep learning. Whether in natural language processing (NLP) or in biological data (RNA sequences), neural networks try to find a representation for sequences of tokens and then classify them or generate new ones following a given logic. There are generally two approaches to this task: the first is Recurrent Neural Networks (RNNs) and their variants (GRU and LSTM); the second is Transformers. The first processes the elements of the sequence recurrently, one step at a time, while the second relies on self-attention between the elements of the sequence.
Each approach has had great success, but neither is particularly suited to long sequences. Experiments show that LSTMs have a difficult time dealing with sequences longer than 5000 steps, while Transformers are poorly suited to them because of their large memory requirements. The memory required by a Transformer grows quadratically with the number of steps, which makes Transformers prohibitively expensive to use in the context of biological data.
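To make the quadratic growth concrete, here is a back-of-the-envelope sketch in Python. It rests on assumptions that are not from the article: a dense attention matrix stored as 32-bit floats, counted for a single head in a single layer, with the hypothetical helper name attention_matrix_bytes. Real models multiply this figure by the number of heads and layers, and memory-saving variants exist, so treat it as an order-of-magnitude illustration only.

```python
def attention_matrix_bytes(num_steps: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for one dense num_steps x num_steps attention matrix.

    Assumption: 4 bytes per value (float32), one head, one layer.
    """
    return num_steps * num_steps * bytes_per_value


# Compare a short text sequence with the sequence lengths quoted above.
for steps in (512, 5_000, 25_000):
    gigabytes = attention_matrix_bytes(steps) / 1e9
    print(f"{steps:>6} steps: {gigabytes:.3f} GB per head per layer")
```

Under these assumptions, a 25,000-step sequence needs 2.5 GB for a single attention matrix, and doubling the sequence length multiplies that figure by four.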
The research team at ReDNA Labs has devised a new lightweight approach called IGLOO, which can handle sequences up to 25,000 steps long. This methodology has already been applied to real-world problems to establish state-of-the-art results. The idea is to use a form of correlation between the different time steps of the sequence in order to find a global representation, which can then be used for classification. The memory requirements of this method are fixed with respect to the number of time steps, and it does not suffer from the vanishing gradient problem that affects the LSTM.
The methodology has been applied to the task of predicting whether an RNA sequence codes for a protein or not. The study, titled “RNAsamba: neural network-based assessment of the protein-coding potential of RNA sequences”, was published in NAR Genomics and Bioinformatics (March 2020).
IGLOO available at: https://arxiv.org/abs/1807.03402
RNAsamba available at: https://academic.oup.com/nargab/article/2/1/lqz024/5701461
The IGLOO paper, which introduces the general methodology, shows how to find a representation for a sequence. The general approach is as follows. After running the input through a one-dimensional convolutional layer, the algorithm collects J groups of N patches. The patches in each group are concatenated and multiplied by a trainable weight, and a bias is added. Intuitively, this amounts to calculating the correlation between non-contiguous sections of the sequence. A large J allows every time step to be covered, although in most cases full coverage is not even necessary for convergence. Ultimately, IGLOO produces a vector of size J, which can then be used in downstream tasks for classification or generation.
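The steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name igloo_representation is hypothetical, the input is assumed to already be the output of the convolutional layer (one feature vector per time step), a "patch" is taken to be the feature vector at one randomly chosen time step, and the weights are random placeholders standing in for trained parameters. Any non-linearity or further layers used in the paper are left out.

```python
import random


def igloo_representation(features, num_groups, patches_per_group, seed=0):
    """Map a (steps x channels) feature map to a vector of size num_groups.

    Assumptions: `features` is the output of a 1D convolution, each patch is
    the feature vector at one randomly drawn time step, and the weights are
    random stand-ins for parameters that would normally be trained.
    """
    rng = random.Random(seed)
    num_steps = len(features)
    channels = len(features[0])
    flat_size = patches_per_group * channels

    representation = []
    for _ in range(num_groups):  # J groups
        # Draw N distinct, generally non-contiguous time steps.
        steps = rng.sample(range(num_steps), patches_per_group)
        # Concatenate the N patches into one flat vector.
        flat = [value for t in steps for value in features[t]]
        # One weight per concatenated value, plus a bias (placeholders).
        weights = [rng.gauss(0.0, 0.1) for _ in range(flat_size)]
        bias = 0.0
        representation.append(sum(w * x for w, x in zip(weights, flat)) + bias)
    return representation


# Toy feature map: 1,000 time steps with 4 channels each.
data_rng = random.Random(1)
features = [[data_rng.random() for _ in range(4)] for _ in range(1000)]
vector = igloo_representation(features, num_groups=16, patches_per_group=4)
print(len(vector))
```

In this sketch the number of weights is J × (N × channels) plus J biases, which depends on J and N but not on the number of time steps. That is one way to read the claim that the method's memory requirements are fixed with respect to sequence length.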
The IGLOO paper shows how this is applied to stress-test tasks such as permuted MNIST, where it achieves state-of-the-art performance (98.4%).
In a different area, natural language modeling has been dominated in the last few years by Transformer-based models, an approach that relies on self-attention. In order to find a representation for the sequence, Transformers calculate how the different tokens in the sequence are related to each other. The IGLOO paper shows that it is possible to replace the self-attention mechanism used in Transformers with an IGLOO block, which in turn requires less memory than the standard approach. Used in the context of natural language, the model achieves a competitive result of 57 PPL on the WikiText-2 dataset. Further optimizations can be applied to get closer to state-of-the-art results.
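For readers who have not seen it spelled out, the self-attention step that an IGLOO block would stand in for can be sketched as follows. This is a deliberately simplified version of scaled dot-product attention: it assumes a single head and uses the token vectors directly as queries, keys and values, whereas real Transformers apply learned projections first. The function name self_attention is illustrative only.

```python
import math


def self_attention(tokens):
    """Simplified single-head scaled dot-product self-attention.

    Assumption: queries, keys and values are the raw token vectors
    (no learned projection matrices).
    Returns (outputs, attention), where attention is steps x steps.
    """
    dim = len(tokens[0])
    outputs, attention = [], []
    for query in tokens:
        # Similarity of this token with every token in the sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        # Numerically stable softmax over the scores.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        attention.append(weights)
        # Each output is a weighted average of all token vectors.
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, tokens))
            for d in range(dim)
        ])
    return outputs, attention


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, attention = self_attention(tokens)
print(len(attention), len(attention[0]))
```

The attention table has one row and one column per token, so its size grows with the square of the sequence length. This is the cost that replacing self-attention is meant to avoid.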
One of the most challenging environments for sequence-based tasks is biology. Indeed, RNA sequences can be several tens of thousands of time steps long. Traditional methods fail there, and an alternative approach is required. ReDNA Labs has collaborated with a team at the University of Campinas (Brazil) to apply the IGLOO approach in a biological context. Together we have built a tool called RNAsamba, which can predict the coding potential of RNA molecules from sequence information. It accurately classifies whether a sequence is mRNA or lncRNA (long non-coding RNA). This task had been tackled in the past using regression, SVMs and RNNs. Our joint study shows that on the reference benchmark we use (from INRIA, France), RNAsamba achieves state-of-the-art classification accuracy on a balanced test dataset. Those results extend not only to the human genome but also to those of the mouse and other organisms.
The main takeaway here is that there is now an alternative to recurrent neural networks and Transformers for dealing with sequences, one that has been successfully tried on real-world industrial tasks. This new approach addresses some of the drawbacks of the existing ones. ReDNA Labs is now working with clients and partners to extend the use of IGLOO to natural language and image generation. As a boutique research and development firm, we are happy to partner with interested parties on new projects and help them solve their neural network needs.
About Vsevolod Sourkov
Vsevolod is director of research at ReDNA Labs. He received his master's degree in stochastic calculus from the prestigious University Paris VI in Paris, France. After working with derivatives on financial markets for more than a decade and building financial software, he is now actively developing new neural network architectures and solving real-world problems for ReDNA Labs clients. Feel free to reach out with any questions or requirements: firstname.lastname@example.org
Views expressed in this article do not represent the opinion of Synced Review or its editors.