Alibaba Cloud developed a gene sequence analysis system for the healthcare industry based on the vector retrieval feature of AnalyticDB for MySQL. The system can query and analyze billions of vector data records in milliseconds, accelerating research and development of therapeutic drugs and related vaccines to treat and prevent COVID-19.
Use cases for gene sequence retrieval technology
The gene sequence analysis technology is used in the following scenarios:
Contact tracing and analysis of COVID-19 (2019-nCoV), helping locate virus hosts and implement effective prevention and control measures. Through gene sequencing, it was found that the RNA sequence of 2019-nCoV is a 96% match with coronaviruses in bats and a 99.7% match with coronaviruses in pangolins. It has been speculated that pangolins and bats are hosts of 2019-nCoV.
Analysis of the process of virus replication and transmission, helping the development of therapeutic drugs and vaccines. The gene sequence analysis technology is used to divide gene sequences by function. This helps researchers understand the function of each module, analyze the process of virus replication and transmission, and identify key nodes to help the development of therapeutic drugs and vaccines.
Retrieval of gene sequences of viruses similar to 2019-nCoV. The gene sequence retrieval technology can also be used to retrieve the gene sequence of viruses similar to 2019-nCoV such as SARS and MERS. This can help researchers learn from the design mechanism of related drug targets, and develop detection kits, vaccines, and related therapeutic drugs in a more efficient manner.
At the start of 2020, an efficient matching algorithm was urgently needed for gene sequences.
A large number of RNA fragments of different viruses were first downloaded from GenBank and Google Scholar and imported into the AnalyticDB for MySQL gene retrieval database.
The 2019-nCoV sequence was uploaded to the gene retrieval system of AnalyticDB for MySQL, which retrieved similar gene fragments within milliseconds. The system returns only gene fragments with a match degree greater than 0.8. Under this condition, the pangolin coronavirus (GD/P1L), bat coronavirus (RaTG13), SARS, and MERS were returned. GD/P1L is the best match, with a match degree of 0.974. It was therefore speculated that 2019-nCoV was transmitted to people through pangolins.
If two RNA fragments are very similar, they may have similar protein expressions and structures. The match degrees between SARS and 2019-nCoV and between MERS and 2019-nCoV are greater than 0.8. This indicates that some research results for SARS or MERS can be used to better understand 2019-nCoV. The system crawls academic papers about each matched virus and uses a text classification algorithm to divide these papers into the testing, vaccine, and medication categories. One of the testing methods for SARS is fluorescence quantitative PCR detection. This method was used to test 2019-nCoV.
Gene vector extraction algorithm
AnalyticDB for MySQL technical personnel convert gene fragments into 1024-dimensional feature vectors. The process of matching two gene fragments is converted into a calculation of the distance between two vectors. This reduces computing overheads and returns results within milliseconds. This process is used for the preliminary screening of gene fragments. Then, the BLAST algorithm for gene similarity calculation is used to generate a precise similarity ranking, which completes the matching calculation of gene sequences more efficiently. The complexity of the matching algorithm is reduced from O(M+N) to O(1). AnalyticDB for MySQL also provides powerful machine learning analysis tools. These tools can convert local and disease-related target gene fragments into feature vectors through the gene-to-vector technology. These vectors can then be used in the research and development of gene medicine, accelerating the process of genetic analysis.
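The two-stage idea described above (cheap vector screening, then a precise ranking of the survivors) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the AnalyticDB for MySQL implementation: the function names and the toy three-dimensional vectors are hypothetical, `difflib.SequenceMatcher` is only a stand-in for a real alignment score such as BLAST, and the linear scan in stage 1 replaces the vector index that a production system would use.

```python
import math
from difflib import SequenceMatcher


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def precise_similarity(seq_a, seq_b):
    """Stand-in for a precise alignment score such as BLAST (assumption)."""
    return SequenceMatcher(None, seq_a, seq_b).ratio()


def two_stage_search(query_seq, query_vec, database, shortlist_size=2):
    """database is a list of (name, sequence, vector) tuples."""
    # Stage 1: vector-distance screening keeps only the nearest candidates.
    shortlist = sorted(
        database, key=lambda row: l2_distance(query_vec, row[2])
    )[:shortlist_size]
    # Stage 2: the expensive comparison runs on the shortlist only.
    ranked = sorted(
        shortlist,
        key=lambda row: precise_similarity(query_seq, row[1]),
        reverse=True,
    )
    return [name for name, _, _ in ranked]


# Hypothetical fragments with made-up 3-dimensional vectors.
database = [
    ("frag_a", "ACGTACGTAC", [0.9, 0.1, 0.0]),
    ("frag_b", "ACGTTCGTAC", [0.8, 0.2, 0.1]),
    ("frag_c", "TTTTGGGGCC", [0.0, 0.1, 0.9]),
]
result = two_stage_search("ACGTACGTAC", [0.9, 0.1, 0.0], database)
print(result)
```

The point of the design is that the precise comparison never sees fragments whose vectors are far from the query, so its cost depends on the shortlist size rather than on the size of the database.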
Word vector technology is already widely and successfully used in fields such as machine translation, reading comprehension, and semantic analysis. Word vectorization uses a distributional semantic approach to express the meaning of a word: the meaning of a word is its context. Think back to tests where you use the words in a word bank to fill in the blanks in a paragraph. In these tests, the context of a blank accurately reflects the missing word. If you choose the correct word, it indicates that you understand the meaning of the missing word. Therefore, a word vector algorithm can generate a vector for each word in a text based on the relationship between the word and its surrounding words. Then, the similarity of word vectors can be calculated to obtain the similarity between words.
Similarly, gene sequences follow certain rules, and each part of a gene sequence expresses different functions and meanings. Therefore, a long gene sequence can be divided into smaller units ("words") for research purposes. These "words" also have a context, because they are interconnected and interact with each other to complete corresponding functions and form expressions. Therefore, biological scientists use the word vector algorithm to vectorize gene sequence units. A high similarity between two gene units indicates that both gene units always appear together and jointly express a corresponding function.
Generally, the gene vector extraction algorithm of AnalyticDB for MySQL involves the following steps:
Define words in an amino acid sequence. In the bioinformatics field, k-mers are used to analyze amino acid sequences. K-mers are obtained after a nucleotide sequence is divided into strings that contain K bases. This is done by iteratively selecting a sequence of K bases in length from a continuous nucleotide sequence. If the length of the nucleotide sequence is L, the following number of k-mers can be obtained: L - K + 1. The following figure shows that if the length of a sequence is 12 and the k-mer length is 8, five 8-mers can be obtained. The formula is as follows: 12 - 8 + 1 = 5. These k-mers are equivalent to the "words" in the amino acid sequence.
Find the context of the amino acid sequence and convert the "words" of the gene sequence into 1024-dimensional vectors. The context plays an important role in word vector algorithms. The gene vector extraction algorithm of AnalyticDB for MySQL selects a window with a length of L from amino acid fragments. The amino acid fragments in this window are considered to be within the same context. For example, if a window with a length of 10 is selected for the nucleotide sequence CTGGATGA, the gene vector extraction algorithm of AnalyticDB for MySQL converts CTGGATGA into the following 5-mers: AACTG, ACTGG, CTGGA, GGATG, and GATGA. For CTGGA, the other four 5-mers compose the context of CTGGA. The gene vector extraction algorithm of AnalyticDB for MySQL uses a word vector space training model to train the existing genetic k-mers, and convert k-mers into 1024-dimensional vectors.
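The two steps above can be sketched in Python as follows. This is a minimal illustration, not the AnalyticDB for MySQL implementation: the function names `kmers` and `window_contexts` and both input sequences are hypothetical. The first call reproduces the L - K + 1 count for a 12-base sequence and 8-mers, and the second pairs each k-mer in one window with the other k-mers in that window, which is the kind of (word, context) input that a word vector trainer consumes. The training itself is not shown.

```python
def kmers(sequence, k):
    """Return every length-k substring, sliding one base at a time."""
    if k <= 0 or k > len(sequence):
        return []
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def window_contexts(window, k):
    """Pair each k-mer in one window with the other k-mers in that window."""
    words = kmers(window, k)
    pairs = []
    for i, center in enumerate(words):
        context = words[:i] + words[i + 1:]
        pairs.append((center, context))
    return pairs


# Step 1: a hypothetical 12-base sequence yields 12 - 8 + 1 = 5 8-mers.
words = kmers("ATGCGTACCTGA", 8)
print(len(words), words)

# Step 2: a hypothetical 6-base window split into 3-mers with their contexts.
pairs = window_contexts("ACGTAC", 3)
for center, context in pairs:
    print(center, context)
```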
Similar to word vector models, k-mer vector models also perform mathematical computations on vectors.
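The two relations discussed next can be written compactly as follows. The notation v(x) for the vector learned for k-mer x is introduced here for illustration only, and the approximate-equality sign means that the L2 distance between the two sides is small.

```latex
\begin{align}
v(\mathrm{ACGAT}) - v(\mathrm{GAT}) &\approx v(\mathrm{AC}) \\
v(\mathrm{AC}) + v(\mathrm{ATC}) &\approx v(\mathrm{ACATC})
\end{align}
```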
The vector subtraction formula indicates that the result of "ACGAT vector minus GAT vector" is very close to the AC vector. The vector addition formula indicates that the result of "AC vector plus ATC vector" is also very close to the ACATC vector. Based on these mathematical characteristics, you can calculate the vector of a long amino acid sequence by adding up the k-mer vectors of its fragments and then normalizing the sum. To improve the accuracy of this approach, you can treat a gene fragment as a text fragment and use doc2vec to convert the whole sequence into a vector for calculation. To verify the performance of the algorithm, the sequence of similarity scores produced by the BLAST algorithm was compared with the sequence of L2 distances between the corresponding gene vectors. The Spearman rank correlation coefficient of the two sequences is 0.839. This shows that converting DNA sequences into vectors is an effective method of preliminary screening for similar gene fragments.
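The composition and verification steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the k-mer vectors are toy two-dimensional values rather than 1024-dimensional ones, the BLAST scores and L2 distances are made-up numbers that do not reproduce the reported 0.839, and the Spearman formula used here assumes that no two values are tied.

```python
import math


def sequence_vector(kmer_vectors):
    """Sum k-mer vectors element-wise, then scale the sum to unit L2 length."""
    total = [sum(column) for column in zip(*kmer_vectors)]
    norm = math.sqrt(sum(x * x for x in total))
    return [x / norm for x in total] if norm else total


def spearman(xs, ys):
    """Spearman rank correlation coefficient, assuming no tied values."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, index in enumerate(order, start=1):
            result[index] = rank
        return result

    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Toy k-mer vectors: the sum is [3, 4], which normalizes to [0.6, 0.8].
vec = sequence_vector([[3.0, 0.0], [0.0, 4.0]])

# Made-up scores for four candidate fragments. BLAST scores are similarities
# and L2 values are distances, so the distances are negated before ranking.
blast_scores = [0.97, 0.91, 0.85, 0.60]
l2_distances = [0.10, 0.30, 0.25, 0.90]
rho = spearman(blast_scores, [-d for d in l2_distances])
print(vec, rho)
```

A coefficient close to 1 means that the vector distance orders the candidates almost the same way as the precise score does, which is what makes the vector stage usable as a preliminary filter.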
Overview of vector retrieval
In general application systems that involve vector retrieval, developers use a vector search engine such as Faiss to store vector data and then use relational databases to store structured data. You must alternate between both systems during queries. This solution requires extra development efforts and does not provide optimal data query performance.
AnalyticDB for MySQL provides a vector retrieval function to support similarity query and analysis for images, text recommendations, voiceprints, and nucleotide sequences. AnalyticDB for MySQL supports the retrieval and analysis of structured and non-structured data. You can use an SQL interface to build systems such as a gene retrieval system or hybrid retrieval system for gene and structured data. In hybrid retrieval scenarios, the optimizer of AnalyticDB for MySQL selects the optimal execution plan based on the data distribution and query conditions to achieve optimal performance while ensuring the recall rate. For example, you can use the following SQL statement to retrieve an RNA nucleotide sequence:
-- Query gene segments that are similar to the submitted sequence vectors within the RNA sequence.
select title,   # The article name.
       length,  # The gene length.
       type,    # mRNA or DNA.
       l2_distance(feature, array[-0.017,-0.032,...]::real) as distance  # The vector distance.
from demo.paper a, demo.dna_feature b
where a.id = b.id
order by distance;  # Sort by vector similarity.
In the preceding SQL statement, the demo.paper table stores the basic information of each uploaded article, and the demo.dna_feature table stores the vectors that correspond to gene sequences of each species. The gene-to-vector model is used to convert genes to vectors such as [-0.017,-0.032,...] and these vectors can be used for retrieval in AnalyticDB for MySQL databases.
The current system also supports hybrid retrieval of structured and non-structured information (nucleotide sequences). For example, you only need to add the condition title like '%COVID-19%' to the WHERE clause of the SQL statement to query gene segments that are similar to 2019-nCoV.
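A client program could assemble the hybrid statement as sketched below. This is a minimal Python illustration, not an official client API: `vector_literal` and `build_query` are hypothetical helper names, the two-element vector stands in for a full 1024-dimensional one, the array literal and cast simply mirror the statement shown earlier, and the `%s` placeholder assumes a driver that uses the DB-API "format" parameter style. Because the statement already has a WHERE clause, the sketch appends the title filter with AND.

```python
def vector_literal(vector):
    """Format floats as the array[...] literal used in the statement above."""
    return "array[" + ",".join(f"{x:.3f}" for x in vector) + "]::real"


def build_query(vector, title_pattern=None):
    """Return (sql, params) for a vector query with an optional title filter."""
    sql = (
        "select title, length, type, "
        f"l2_distance(feature, {vector_literal(vector)}) as distance "
        "from demo.paper a, demo.dna_feature b "
        "where a.id = b.id"
    )
    params = []
    if title_pattern is not None:
        # The pattern is passed as a parameter instead of being pasted into
        # the SQL text, so the driver handles quoting.
        sql += " and title like %s"
        params.append(title_pattern)
    sql += " order by distance"
    return sql, params


sql, params = build_query([-0.017, -0.032], "%COVID-19%")
print(sql)
print(params)
```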
 Mikolov Tomas; et al. (2013). "Efficient Estimation of Word Representations in Vector Space". arXiv:1301.3781.
 Mikolov Tomas, Sutskever Ilya, Chen Kai, Corrado, Greg S. and Dean Jeff (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems. arXiv:1310.4546. Bibcode:2013arXiv1310.4546M.
 Mapleson Daniel, Garcia Accinelli, Gonzalo, Kettleborough George, Wright Jonathan and Clavijo, Bernardo J. (2016). "KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies". Bioinformatics. 33(4): 574-576. doi:10.1093/bioinformatics/btw663. ISSN 1367-4803. PMC 5408915. PMID 27797770.
 Quoc Le and Tomas Mikolov. (2014). Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188-1196.
 The Human Genome HG38,
 Julia Piantadosi, Phil Howlett and John Boland. (2007). "Matching the grade correlation coefficient using a copula with maximum disorder", Journal of Industrial and Management Optimization, 3 (2), 305-312.
 Stephen Woloszynek, Zhengqiao Zhao, Jian Chen and Gail L. Rosen. (2019). "16s rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses", PLoS Computational Biology, 15(2), e1006721.
 James K. Senter, Taylor M. Royalty, Andrew D. Steen and Amir Sadovnik. (2019) "Unaligned Sequence Similarity Search Using Deep Learning.", arXiv e-prints.
 Ng Patrick. (2017) dna2vec: consistent vector representations of variable-length k-mers. arXiv preprint, arXiv:1701.06279.