Site icon Synced

Baidu Open-Sources RNA Prediction Algorithm for 2019 Novel Coronavirus

In response to the 2019 Novel Coronavirus (2019-nCoV) outbreak, China tech giant Baidu has open-sourced its RNA (Ribonucleic acid) prediction algorithm LinearFold. The tool can significantly accelerate the prediction time of a virus’s RNA secondary structure, affording frontline researchers an opportunity to better understand the virus and develop targeting vaccines in a time of crisis. Baidu’s AI scientists told Synced that they have already applied LinearFold to the 2019-nCoV, reducing prediction time from 55 minutes to 27 seconds.

Knowledge of RNA secondary structures has become essential for researchers seeking to model RNA structures and gain insight into related functional mechanisms. However, current algorithms borrowed from computational linguistics and based on dynamic programming face a challenge: increases in RNA length cause their runtimes to scale in a cubic manner. This can seriously slow such algorithms when facing long RNA sequences and limit their use in genome-wide applications.

In the paper LinearFold: Linear-time Approximate RNA folding by 5′-to-3′ Dynamic Programming and Beam Search, Baidu researchers propose LinearFold as the first approximate algorithm in RNA folding to achieve linear runtime and linear space without imposing constraints such as base-pair distance on the output structure. A significant difference between LinearFold and previous algorithms is that it scans a sequence from left-to-right, for example from 5′ to 3′, instead of from bottom-up. This approach was inspired by incremental parsing for context-free grammar in computational linguistics and enables the model to use an efficient beam pruning heuristic.

The Baidu researchers evaluated their design on a diverse dataset of RNA sequences, with results showing the approach is more efficient and leads to higher average accuracies over all families. For example, for a sequence of approximately 10 000nt (nucleotide) such as the HIV genome, LinearFold takes only 8 sec of runtime, while baselines take about 4 min. For a sequence of 32 753nt, LinearFold takes 26 sec, while widely used systems CONTRAfold and RNAfold take 2 and 1.7 hrs, respectively.

Another notable finding is that LinearFold is more accurate on long-range base pairs, which have been a persistent problem for most current models. This is important as the 2019-nVoV genes have a total length of almost 30,000 base pairs, which is longer for example than distant cousin the 2003 SARS coronavirus.

The release of LinearFold will help institutions such as genetic testing agencies, epidemic prevention centers and other scientific research institutes, and is the latest initiative from the machine learning research community aimed at decoding critical information underlying the 2019-nCoV outbreak. Synced is watching this topic closely and will continue to update readers as new information becomes available.

The paper LinearFold: Linear-time Approximate RNA Folding by 5′-to-3′ Dynamic Programming and Beam Search is available on Bioinformatics, and the source code is available at GitHub.


Journalist: Fangyu Cai | Editor: Michael Sarazen

Exit mobile version