Information provided by Baidu 11/05/20
Scientists are racing to develop a vaccine against COVID-19, a pandemic that has sickened over four million people and caused over 280,000 deaths globally. Among all vaccines under development, the mRNA vaccine has emerged as a promising preventive tool because of its rapid and scalable production. The US biotechnology company Moderna has begun human clinical trials evaluating its mRNA vaccine for COVID-19 and will start phase 2 testing soon. Widespread adoption of mRNA vaccines had previously been restricted by instability, which can cause degradation and low protein expression levels.
Recent findings have shown that optimizing the secondary structure of an mRNA sequence can lead to a more stable and productive mRNA vaccine, but finding a sequence with a robust secondary structure remains difficult because exponentially many mRNA sequences encode the same protein.
While this is a typical bioinformatics problem, Baidu believes that efficient algorithms can improve mRNA vaccine development. We are proud to announce LinearDesign, an efficient algorithm for optimized mRNA sequence design. The algorithm needs only 16 minutes to design an mRNA sequence that is substantially more stable than wildtype sequences or randomly generated designs.
We have launched an easy-to-use LinearDesign webserver for public use so that biotech companies and research institutes can utilize our technology, with the paper also released on arXiv.
“The LinearDesign algorithm, developed by Baidu Research in collaboration with Oregon State University and University of Rochester, can theoretically design the mRNA sequence with the most stable secondary structure, helping many mRNA vaccine companies to optimize their vaccine sequence designs,” says Liang Huang, Distinguished Scientist at Baidu USA.
LinearDesign is our latest anti-pandemic research effort, inspired by our previous project, LinearFold, the world’s fastest algorithm for RNA secondary structure prediction. LinearFold cuts the time needed to analyze SARS-CoV-2, the virus that has caused the COVID-19 pandemic, from 55 minutes to 27 seconds.
Baidu has also signed a strategic partnership with China’s CDC NIVDC (National Institute for Viral Disease Control and Prevention) to support anti-pandemic efforts and long-term public health. Baidu will provide AI and big data technologies, including LinearFold and LinearDesign, for genome analysis and vaccine R&D, while jointly establishing a genome sequencing workstation with the NIVDC’s emergency tech centre.
Vaccine development for COVID-19 is bound to be a long and challenging journey, which makes international collaboration crucially important. We encourage scientists and researchers to work with us and move quickly to bring a safe and efficacious vaccine to patients.
Why is an mRNA vaccine important to prevent the spread of COVID-19?
As one of the most effective ways to prevent diseases, a vaccine stimulates the body’s immune system to recognize and fight pathogens like viruses or bacteria or any associated microorganisms.
The research field is pursuing emerging techniques for more rapid development and large-scale deployment because a wide variety of infectious diseases like COVID-19 are evolving so rapidly that variants may emerge even before vaccines are produced.
The most common type of vaccine is a protein vaccine, but its manufacturing process often takes too long, rendering it less desirable for the current pandemic. A DNA vaccine meanwhile benefits from faster production but suffers from safety issues due to its potential integration into the human genome.
The relative benefits of an mRNA vaccine — which refers to the direct injection of messenger RNA that is translated into proteins in the human body — include safety, rapid and scalable production, and non-infectious and non-integrating properties.
“The reason one might want to use an mRNA is that it should stimulate the immune system in a much more similar way to a real viral infection. And that’s advantageous because then the immune system is going to be recognizing a real viral infection much more easily,” says Dr. David H. Mathews, a professor in the University of Rochester Department of Biochemistry and Biophysics.
Biotech company Moderna recently announced that its coronavirus vaccine candidate will be evaluated further. The company has submitted an investigational new drug application to the US Food and Drug Administration to evaluate its mRNA-1273 vaccine candidate in a more extensive study, if warranted by safety data from an initial study.
Why have we developed LinearDesign?
Despite the promising potential of mRNA vaccines, major hurdles remain for designing an mRNA sequence that achieves high stability and protein production — both of which are critical for vaccines.
It is known that mRNA vaccines may fail because of degradation during storage and transportation and the resulting low protein levels. An mRNA vaccine produces proteins when its mRNA is translated in the body, and the amount of protein it can synthesize is directly related to the immune effect.
In a recent paper published in the Proceedings of the National Academy of Sciences of the United States of America, Moderna’s research team demonstrated that secondary structures and codon optimality can increase mRNA stability and protein expression. The problem can therefore be formulated as finding mRNA sequences that score well on both secondary structure stability and codon optimality among the exponentially many synonymous sequences that encode the same protein. This is undoubtedly difficult.
Each amino acid is translated by a codon, which is three adjacent mRNA nucleotides. For example, the start codon AUG translates into methionine, the first amino acid in any protein sequence. But due to redundancies in the genetic code (4^3 = 64 triplet codons for 21 amino acids), most amino acids can be translated from multiple codons. This causes the mRNA design search space to increase exponentially with protein length. The SARS-CoV-2 spike protein contains 1,273 amino acids (plus the stop codon, which is part of the mRNA but not part of a protein), meaning there are about 10^632 mRNA candidates.
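To make the size of this search space concrete, the sketch below counts the synonymous mRNA sequences for a protein under the standard genetic code. The function name `count_synonymous_mrnas` and the one-letter input format (with `*` standing for the stop signal) are illustrative assumptions, not part of LinearDesign.

```python
from math import log10

# Number of synonymous codons per amino acid in the standard genetic code.
# Keys are one-letter amino acid codes; "*" is the stop signal (UAA, UAG, UGA).
# The 21 entries together account for all 4**3 = 64 codons.
CODON_COUNTS = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4, "*": 3,
}


def count_synonymous_mrnas(protein: str) -> int:
    """Number of distinct mRNA coding sequences that translate into `protein`."""
    total = 1
    for residue in protein:
        total *= CODON_COUNTS[residue]
    return total


def log10_synonymous_mrnas(protein: str) -> float:
    """Same count on a log10 scale, convenient for long proteins."""
    return sum(log10(CODON_COUNTS[residue]) for residue in protein)


if __name__ == "__main__":
    # Methionine, leucine, stop: 1 * 6 * 3 candidate sequences.
    print(count_synonymous_mrnas("ML*"))
```

Because each residue multiplies the count by its number of codons, the total grows exponentially with protein length, which is why enumerating candidates is only feasible for very short peptides.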
To find an optimal mRNA sequence, scientists have traditionally made random changes to a sequence and then seen if they were beneficial. The scientific community is now seeking different approaches to solve the problem. For example, Eterna, a browser-based gaming platform developed by Stanford University, is assembling online gamers to develop a safe mRNA vaccine by solving puzzles. Eterna has been using Baidu’s LinearFold algorithm to accelerate secondary structure analysis.
LinearFold is a successful project that translates a biological challenge into a classical problem in computational linguistics. Inspired by LinearFold, our research team came up with the idea of using computer science to find mRNA sequences that are more stable and productive than the wild type found in nature. That’s how LinearDesign was developed.
“LinearDesign is software that designs a set of sequences that have structure and use easily read codons. Its speed is key in providing a set of good sequences that can be tested by experiment for their ability to work as vaccines,” says Dr. Mathews.
How does LinearDesign work?
Essentially, we use a dynamic programming algorithm to reduce the search space from exponential to polynomial. We first use a Deterministic Finite Automaton (DFA), a directed graph with labelled edges and distinct start and end nodes, to express amino acids and proteins. Shown in the figure below are four examples of DFA representations for amino acids, with each representing one amino acid.
Next, we concatenate them into a single DFA D(p) for a protein sequence p by stitching the end node of each DFA to the start node of the next: D(p) = D(p1) ◦ D(p2) ◦ ··· ◦ D(pm) ◦ D(STOP). The resulting DFA represents all possible mRNA sequences that translate into that protein.
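The construction can be sketched in a few lines of Python. This is a minimal illustration, not the LinearDesign implementation: the node naming scheme (residue index plus the codon prefix read so far) and the helper names `build_protein_dfa` and `enumerate_mrnas` are assumptions made for the example, and the per-residue automaton is a simple codon trie rather than a minimized DFA.

```python
from itertools import product

# Standard genetic code, listed in the conventional U, C, A, G order.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {}  # amino acid (or "*" for stop) -> list of its codons
for (b1, b2, b3), residue in zip(product(BASES, repeat=3), AMINO):
    CODONS.setdefault(residue, []).append(b1 + b2 + b3)


def build_protein_dfa(protein):
    """Return (edges, start, end) for the concatenated DFA D(p).

    A node is (residue index, codon prefix read so far). The third nucleotide
    of every codon leads to the start node of the next residue, which is what
    stitches the per-residue automata together.
    """
    edges = {}  # node -> {nucleotide: next node}
    for i, residue in enumerate(protein):
        for codon in CODONS[residue]:
            for k in range(3):
                src = (i, codon[:k])
                dst = (i, codon[:k + 1]) if k < 2 else (i + 1, "")
                edges.setdefault(src, {})[codon[k]] = dst
    return edges, (0, ""), (len(protein), "")


def enumerate_mrnas(edges, start, end):
    """Yield every nucleotide string accepted by the DFA (depth-first walk)."""
    stack = [(start, "")]
    while stack:
        node, sequence = stack.pop()
        if node == end:
            yield sequence
            continue
        for nucleotide, nxt in edges[node].items():
            stack.append((nxt, sequence + nucleotide))


if __name__ == "__main__":
    dfa = build_protein_dfa("ML*")  # methionine, leucine, stop
    for mrna in sorted(enumerate_mrnas(*dfa)):
        print(mrna)
```

For the three-residue input "ML*" the automaton accepts 1 × 6 × 3 = 18 sequences, one of which is AUGCUGUGA. Enumeration is shown only to confirm what the DFA represents; the point of the representation is that an algorithm can search all of these paths without listing them.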
We then need to find, among all sequences represented by the DFA, the mRNA sequence with the most stable secondary structure. Here we borrow a tool from computational linguistics, the stochastic context-free grammar (SCFG), which is used to represent RNA folding. The mRNA design problem becomes a simple extension of the single-sequence folding problem to the case of multiple inputs. We find the minimum free energy structure (and its corresponding sequence) among all possible structures for all possible sequences. This can be solved by intersecting the SCFG with the protein DFA.
In other words, optimizing the mRNA vaccine sequence design extends the secondary structure calculation (RNA folding) from a single RNA sequence to many RNA sequences at once. After abstracting the candidate sequences as a DFA, we find the sequence with the most stable secondary structure by taking the intersection of the DFA and the SCFG.
The following figure shows an example of how the DFA and SCFG intersect to generate the sequence of “methionine leucine stop” as “AUGCUGUGA”.
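The idea of folding over the DFA instead of over one fixed sequence can be illustrated with a deliberately simplified sketch. It assumes a Nussinov-style objective (maximize the number of base pairs, with at least `min_loop` unpaired nucleotides inside a hairpin) as a stand-in for the free energy model, ignores codon optimality, and uses hypothetical function names; it is not the LinearDesign algorithm itself. `max_pairs` folds a single sequence, while `design_max_pairs` runs the same recurrence over pairs of DFA nodes so that the sequence and the structure are chosen together.

```python
from functools import lru_cache
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {}  # amino acid (or "*" for stop) -> list of its codons
for (b1, b2, b3), residue in zip(product(BASES, repeat=3), AMINO):
    CODONS.setdefault(residue, []).append(b1 + b2 + b3)

CAN_PAIR = {"AU", "UA", "GC", "CG", "GU", "UG"}  # Watson-Crick plus G-U wobble


def max_pairs(seq, min_loop=3):
    """Single-sequence folding: maximum base-pair count for one fixed sequence."""
    @lru_cache(maxsize=None)
    def fold(i, j):  # best pair count within seq[i:j]
        if j - i <= min_loop + 1:
            return 0
        top = fold(i + 1, j)  # seq[i] left unpaired
        for k in range(i + min_loop + 1, j):
            if seq[i] + seq[k] in CAN_PAIR:
                top = max(top, 1 + fold(i + 1, k) + fold(k + 1, j))
        return top
    return fold(0, len(seq))


def design_max_pairs(protein, min_loop=3):
    """Folding over the protein DFA: returns (pair count, mRNA sequence)."""
    edges = {}  # node (residue index, codon prefix) -> {nucleotide: next node}
    for i, residue in enumerate(protein):
        for codon in CODONS[residue]:
            for k in range(3):
                dst = (i, codon[:k + 1]) if k < 2 else (i + 1, "")
                edges.setdefault((i, codon[:k]), {})[codon[k]] = dst
    start, end = (0, ""), (len(protein), "")

    def pos(node):  # nucleotides emitted before reaching this node
        return 3 * node[0] + len(node[1])

    def reach(u, v):  # is there a path from node u to node v?
        if u[0] == v[0]:
            return v[1].startswith(u[1])
        return u[0] < v[0]

    @lru_cache(maxsize=None)
    def best(u, v):
        """Best (pair count, sequence) over all DFA paths from u to v."""
        if u == v:
            return (0, "")
        options = []
        for a, u2 in edges[u].items():
            if not reach(u2, v):
                continue
            score, seq = best(u2, v)
            options.append((score, a + seq))  # nucleotide a left unpaired
            for w in edges:  # a pairs with nucleotide b on a later edge w -> w2
                if not reach(u2, w) or pos(w) - pos(u2) < min_loop:
                    continue
                for b, w2 in edges[w].items():
                    if a + b in CAN_PAIR and reach(w2, v):
                        s1, q1 = best(u2, w)
                        s2, q2 = best(w2, v)
                        options.append((1 + s1 + s2, a + q1 + b + q2))
        return max(options)

    return best(start, end)


if __name__ == "__main__":
    score, mrna = design_max_pairs("ML*")
    # Cross-check against brute force over every synonymous sequence.
    brute = max(max_pairs("".join(c)) for c in product(*(CODONS[r] for r in "ML*")))
    print(score, mrna, score == brute)
```

The brute-force comparison is only practical for toy inputs; the dynamic program never enumerates sequences, which is the property that lets this style of search scale to full-length proteins. A realistic design tool would replace the pair count with a thermodynamic energy model.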
On this basis, our algorithm has also been extended in the following aspects:
(1) Borrowing the LinearFold idea to reduce computational complexity from cubic to linear, greatly reducing the time required to design the mRNA sequence;
(2) Providing the top k suboptimal mRNA sequences as alternatives, rather than a single optimal sequence, so that vaccine companies can select the most suitable vaccine sequence from them;
(3) Simultaneously optimizing secondary structure stability and codon optimality to design an mRNA vaccine sequence with good stability and high protein expression efficiency.
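Extension (3) can be illustrated with a small sketch of how a codon-optimality score might be traded off against folding energy. Several things here are assumptions: the codon usage numbers in `TOY_USAGE` are hypothetical, the candidate energies are made up for the example, and the weighted form in `joint_score` is one plausible way to combine the two terms rather than the objective LinearDesign actually uses. The codon adaptation index itself follows the usual definition as the geometric mean of per-codon relative adaptiveness.

```python
from math import exp, log

# HYPOTHETICAL codon usage frequencies, for illustration only. A real table
# would come from measured codon usage in the target organism.
TOY_USAGE = {
    "AUG": 1.00,                                            # Met
    "CUG": 0.40, "CUC": 0.20, "CUU": 0.13,
    "UUG": 0.13, "CUA": 0.07, "UUA": 0.07,                  # Leu
    "UGA": 0.50, "UAA": 0.30, "UAG": 0.20,                  # stop
}
FAMILIES = [
    ["AUG"],
    ["CUG", "CUC", "CUU", "UUG", "CUA", "UUA"],
    ["UGA", "UAA", "UAG"],
]


def relative_adaptiveness(usage, families):
    """Weight of each codon: its frequency divided by the best in its family."""
    weights = {}
    for family in families:
        top = max(usage[codon] for codon in family)
        for codon in family:
            weights[codon] = usage[codon] / top
    return weights


def cai(mrna, weights):
    """Codon adaptation index: geometric mean of the codon weights."""
    codons = [mrna[i:i + 3] for i in range(0, len(mrna), 3)]
    return exp(sum(log(weights[c]) for c in codons) / len(codons))


def joint_score(energy, mrna, weights, lam):
    """ASSUMED objective (lower is better): energy minus a CAI reward scaled by lam."""
    n_codons = len(mrna) // 3
    return energy - lam * n_codons * log(cai(mrna, weights))


def pick_best(candidates, weights, lam):
    """Choose the (sequence, energy) candidate with the lowest joint score."""
    return min(candidates, key=lambda c: joint_score(c[1], c[0], weights, lam))[0]


if __name__ == "__main__":
    weights = relative_adaptiveness(TOY_USAGE, FAMILIES)
    # Hypothetical folding energies in kcal/mol, not computed by any model.
    candidates = [("AUGCUGUGA", -2.0), ("AUGCUCUGA", -3.0)]
    for lam in (0.0, 5.0):
        print(lam, pick_best(candidates, weights, lam))
```

With `lam = 0` only the energy matters; as `lam` grows the choice shifts toward sequences built from frequently used codons. Sweeping `lam` in this way is what traces out a trade-off curve between stability and codon optimality.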
Our experimental results show that LinearDesign can efficiently design mRNA sequences. For the SARS-CoV-2 spike protein, LinearDesign can finish the mRNA sequence design in 1.6 hours with exact search. With linear-time approximation, the design time is shortened to 16 minutes (b = 1,000) and 78 seconds (b = 100).
We also compared the stability of our designed sequences with wildtype and randomly generated sequences. The wildtype sequence, denoted by a red circle, folds into a structure with a minimum free energy (MFE) change of -967.8 kcal/mol. Most random sequences, shown as the blue and orange clouds, have free energy changes similar to the wildtype (-987.9 kcal/mol and -1063.23 kcal/mol on average, respectively). The sequence designed by LinearDesign with exact search has the lowest MFE, -2,477.70 kcal/mol (lower energy indicates greater stability). The sequence designed with beam size b = 1,000 achieves an MFE of -2,463.8 kcal/mol, only a 0.56% MFE loss relative to the exact-search sequence.
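The 0.56% figure follows directly from the two MFE values quoted above. The short check below reproduces the arithmetic; the function name is an illustrative choice.

```python
def relative_mfe_loss(exact_mfe, approx_mfe):
    """Fraction of folding free energy given up by the approximate design."""
    return (exact_mfe - approx_mfe) / exact_mfe


# MFE values in kcal/mol as reported for exact search and for beam size b = 1,000.
loss = relative_mfe_loss(-2477.70, -2463.8)
print(f"{loss:.2%}")
```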
The results of joint optimization of MFE and CAI (codon adaptation index), shown as the light-blue and magenta curves, are also impressive. The light-blue curve lies at the top-left of the figure, indicating that the sequences on it have both stable secondary structures and high expression levels. In fact, this curve is the accessible boundary of all possible sequences, i.e., no sequence can reach the region beyond the curve (to its top-left). The points on the curve are good candidates for mRNA vaccines. For example, the point with λ = 100 has a free energy change of -2,414.6 kcal/mol and a CAI of 0.823, which is only 2.5% away from the optimal-MFE sequence but with a 0.097 increase in CAI. Shifted slightly to the right of the light-blue curve, the magenta curve is the result of joint optimization using b = 1,000. Its closeness to the light-blue curve shows that the approximation quality is good with b = 1,000.
The figure above shows the secondary structures of the wildtype sequences, our designed sequences with b = 1,000 and b = +∞, as well as designed sequences with an absence of base pairing in the 5’-end leader regions.