AnalyticDB for MySQL is a cloud-hosted data warehouse that can process petabytes of data with high concurrency and low latency. A gene retrieval system has been built on the vector retrieval feature of AnalyticDB for MySQL. This system can query and analyze billions of vector data records within milliseconds, which helps improve the efficiency of prevention and control of the 2019 novel coronavirus (2019-nCoV) and of the research and development of therapeutic drugs and vaccines.

Background information

At the end of 2019, COVID-19 broke out in Wuhan. Within a little over two months of the beginning of the outbreak, more than 82,000 people were infected and more than 3,300 people died. As the epidemic spread, it caused more than 800,000 infections and 40,000 deaths across 109 countries. So far, the epidemic has led to shutdowns in over 50 countries and economic losses of hundreds of billions of dollars worldwide. During epidemic prevention and control, Alibaba Cloud has provided an efficient gene sequence retrieval technology to help analyze the gene sequence of 2019-nCoV.

Scenarios and current situation of the gene sequence retrieval technology

The gene sequence retrieval technology is used in the following scenarios:

  • Contact tracing and analysis of COVID-19, helping locate virus hosts and implement effective prevention and control measures.

    Through gene sequencing, it is found that the RNA sequence of 2019-nCoV is a 96% match to coronaviruses in bats and a 99.7% match to coronaviruses in pangolins. It is speculated that pangolins and bats are hosts of 2019-nCoV.

  • Analysis of the process of virus replication and transmission, helping the development of therapeutic drugs and vaccines.

    The gene sequence analysis technology is used to divide gene sequences by function. This helps understand the function of each module, analyze the process of virus replication and transmission, and identify key nodes to help the development of therapeutic drugs and vaccines.

  • Retrieval of gene sequences of viruses similar to 2019-nCoV.

    The gene sequence retrieval technology can also be used to retrieve the gene sequence of viruses similar to 2019-nCoV such as SARS and MERS. This can help learn from the design mechanism of related drug targets, and develop detection kits, vaccines, and related therapeutic drugs in a more efficient manner.

As the epidemic spreads rapidly, existing gene matching algorithms are not efficient enough, and a faster matching algorithm is urgently needed for gene sequence retrieval. AnalyticDB for MySQL technical personnel convert gene fragments into 1024-dimensional feature vectors, so that matching two gene fragments becomes a calculation of the distance between two vectors. This reduces computing overheads and returns results within milliseconds, which makes the process suitable for preliminary screening of gene fragments. The BLAST algorithm [6] for gene similarity calculation is then applied to the screened fragments to generate a precise similarity ranking, which completes the matching of gene sequences more efficiently. The complexity of the matching algorithm is reduced from O(M+N) to O(1). AnalyticDB for MySQL also provides powerful machine learning analysis tools. These tools can convert local and disease-related target gene fragments into feature vectors through the gene-to-vector technology. These vectors can then be used in the research and development of gene medicine, accelerating the process of genetic analysis.

Gene retrieval system of AnalyticDB for MySQL

The RNA sequence of 2019-nCoV can be expressed as a string of nucleotides, which is also called a base sequence. RNA is made up of four nucleotides: adenine (A), cytosine (C), guanine (G), and uracil (U). Sequence databases such as GenBank commonly record U as T (thymine), so the sequences in this topic are written with the letters A, C, G, and T. Each letter represents a base, and these bases are linked together without gaps. Each species has a unique and regular RNA sequence. The gene retrieval system can retrieve genes similar to the ones submitted to the system and analyze RNA sequences of viruses.

To demonstrate how to use AnalyticDB for MySQL to retrieve gene fragments, a large number of viral RNA fragments were downloaded from GenBank, and virus-related papers were collected from GenBank and Google Scholar. Both were imported into the AnalyticDB for MySQL gene retrieval database.

The following figure shows the gene retrieval interface in AnalyticDB for MySQL. The 2019-nCoV sequence is uploaded to the gene retrieval system of AnalyticDB for MySQL. Then, the system retrieves similar gene fragments within milliseconds. The system only returns gene fragments with a match degree greater than 0.8. In this case, the pangolin coronavirus (GD/P1L), bat coronavirus (RaTG13), SARS, and MERS are returned. GD/P1L is the best sequence match with a match degree of 0.974. It is speculated that 2019-nCoV was transmitted to people through pangolins.

If two RNA fragments are very similar, the two RNA sequences may have similar protein expressions and structures. The match degrees between SARS and 2019-nCoV and between MERS and 2019-nCoV are greater than 0.8. This indicates that some research results of SARS or MERS can be used to better understand 2019-nCoV. The system crawls academic papers about each matched virus and uses a text classification algorithm to divide these papers into the testing, vaccine, and medication categories. The following figure shows seven SARS testing methods, four vaccination methods, and ten therapeutic drugs. One of the testing methods for SARS is fluorescence quantitative PCR detection. This method is also used to test 2019-nCoV. The gene vaccine and in vivo induction of immune response methods are under development. Remdesivir and relevant interferons are used to treat COVID-19 patients.

Click the interferon paper link in the preceding figure to go to the interferon paper. The system uses automatic translation software to translate the papers. It also extracts keywords from Chinese file names and translates them. This makes it easier for you to understand the materials.

Architecture

AnalyticDB for MySQL is used in the gene retrieval system to store and analyze the feature vectors produced for gene sequences, together with all structured data such as gene sequence lengths, the names of the academic papers that contain the sequences, and gene types (DNA or RNA). During a query, a gene vector extraction model converts the submitted gene into a vector, and a coarse-ranking retrieval is performed in the AnalyticDB for MySQL vector database. The BLAST algorithm [6] is then applied to the vector matching result set to perform precise ranking and return the most similar gene sequences.
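
The coarse-then-precise flow described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the record layout, the function names, and the toy 2-dimensional vectors are hypothetical, and difflib.SequenceMatcher from the Python standard library stands in for BLAST, which it does not implement.

```python
import math
from difflib import SequenceMatcher


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def coarse_search(query_vec, index, top_k):
    """Stage 1: keep the top_k records whose vectors are closest to the query."""
    ranked = sorted(index, key=lambda rec: l2_distance(query_vec, rec["vector"]))
    return ranked[:top_k]


def precise_score(query_seq, candidate_seq):
    """Stage 2 scorer. ASSUMPTION: a simple string similarity ratio used
    as a stand-in for BLAST; it is not the BLAST algorithm."""
    return SequenceMatcher(None, query_seq, candidate_seq).ratio()


def retrieve(query_seq, query_vec, index, top_k=2):
    """Coarse vector screening followed by precise re-ranking."""
    candidates = coarse_search(query_vec, index, top_k)
    return sorted(
        candidates,
        key=lambda rec: precise_score(query_seq, rec["sequence"]),
        reverse=True,
    )


# Toy index with made-up sequences and vectors (hypothetical data).
index = [
    {"name": "far", "sequence": "TTTTTTTT", "vector": [5.0, 5.0]},
    {"name": "near_a", "sequence": "ACGTACGA", "vector": [0.1, 0.0]},
    {"name": "near_b", "sequence": "ACGTACGT", "vector": [0.2, 0.0]},
]

results = retrieve("ACGTACGT", [0.0, 0.0], index, top_k=2)
names = [rec["name"] for rec in results]
print(names)
```

In this toy data set, the vector stage discards the distant record, and the precise stage then reorders the two remaining candidates. The expensive comparison runs only on the small candidate set.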

The core of the gene retrieval system of AnalyticDB for MySQL is the gene vector extraction model, which converts nucleotide sequences to vectors. The model is trained on sequence samples of various viral RNA so that it can better calculate the similarity of viral RNA. The gene vector extraction model can be easily extended to genes of other species.

Gene vector extraction algorithm

Word vector technology is already widely used in fields such as machine translation, reading comprehension, and semantic analysis with great success. Word vectorization uses a distributional semantic approach to express the meaning of a word: the meaning of a word is its context. Think back to tests in which you use the words in a word bank to fill in the blanks in a paragraph. In these tests, the context of a blank accurately reflects the missing word. If you choose the correct word, it indicates that you understand the meaning of the missing word. Therefore, a word vector algorithm can generate a vector for each word in a text from the relationship between that word and its surrounding words. Then, the similarity of word vectors can be calculated to obtain the similarity between words.
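
As a minimal sketch of what "context" means here, the following Python code lists the (center word, context word) pairs that fall inside a sliding window. Word vector training methods such as those in [1] learn from pairs of this kind. The function name, the window size, and the sample sentence are assumptions made for illustration only.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for tokens at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a token is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical sample sentence, split into words.
sentence = "the virus binds to the receptor".split()
pairs = context_pairs(sentence, window=1)
for center, context in pairs:
    print(center, "->", context)
```

Words that keep appearing with the same neighbors produce similar sets of pairs, which is why their trained vectors end up close to each other.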

Similarly, gene sequences follow certain rules, and each part of a gene sequence expresses different functions and meanings. Therefore, a long gene sequence can be divided into smaller units ("words") for research purposes. These "words" also have a context, because they are interconnected and interact with each other to complete corresponding functions and form expressions. Therefore, biological scientists [10] use the word vector algorithm to vectorize gene sequence units. A high similarity between two gene units indicates that both gene units always appear together and jointly express a corresponding function.

Generally, the gene vector extraction algorithm of AnalyticDB for MySQL involves the following steps:
  1. Define words in a nucleotide sequence

    In bioinformatics, k-mers [3] are used to analyze nucleotide sequences. A k-mer is a substring of K consecutive bases. The k-mers of a nucleotide sequence are obtained by sliding a window of K bases along the sequence one base at a time. If the length of the nucleotide sequence is L, L - K + 1 k-mers are obtained. The following figure shows that if the length of a sequence is 12 and the k-mer length is 8, five 8-mers are obtained: 12 - 8 + 1 = 5. These k-mers are equivalent to the "words" in the nucleotide sequence.

  2. Find the context of each "word" and convert the "words" of the gene sequence into 1024-dimensional vectors.

    The context plays an important role in word vector algorithms. The gene vector extraction algorithm of AnalyticDB for MySQL selects a window of a fixed length from a nucleotide fragment. The k-mers in this window are considered to be within the same context. For example, if a window with a length of 10 covers the nucleotide sequence AACTGGATGA, the gene vector extraction algorithm of AnalyticDB for MySQL converts the sequence into the following 5-mers: AACTG, ACTGG, CTGGA, TGGAT, GGATG, and GATGA. For CTGGA, the other five 5-mers compose the context of CTGGA. The gene vector extraction algorithm of AnalyticDB for MySQL uses a word vector space training model to train the existing genetic k-mers and convert the k-mers into 1024-dimensional vectors.

  3. Similar to word vector models, k-mer vector models also perform mathematical computations on vectors.
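    • Vector subtraction: $v(\mathrm{ACGAT}) - v(\mathrm{GAT}) \approx v(\mathrm{AC})$
    • Vector addition: $v(\mathrm{AC}) + v(\mathrm{ATC}) \approx v(\mathrm{ACATC})$, where $v(x)$ denotes the vector of k-mer $x$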

    The vector subtraction formula indicates that the result of "ACGAT vector minus GAT vector" is very close to the AC vector. The vector addition formula indicates that the result of "AC vector plus ATC vector" is also very close to the ACATC vector. Based on these properties, you can calculate the vector of a long nucleotide sequence by adding up the vectors of the k-mers in the sequence and then normalizing the sum. To improve accuracy, you can instead treat a gene fragment as a text fragment and use doc2vec [4] to convert the whole sequence into a vector. To verify the algorithm, the similarity ranking produced by the BLAST algorithm [6] was compared with the ranking produced by the L2 distance between gene vectors. The Spearman rank correlation coefficient [7] of the two rankings is 0.839. This shows that converting gene sequences into vectors is an effective method of preliminary screening for similar gene fragments.
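
The steps above can be sketched in Python as follows: split a sequence into k-mers, add up the k-mer vectors, and normalize the sum. This is a minimal sketch, not the algorithm used by AnalyticDB for MySQL. The function names are hypothetical, and the 3-dimensional k-mer vectors are made-up values that stand in for the trained 1024-dimensional vectors.

```python
import math


def kmers(sequence, k):
    """Slide a window of k bases along the sequence: yields L - k + 1 k-mers."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def sequence_vector(sequence, k, kmer_vectors):
    """Add up the vectors of all known k-mers, then L2-normalize the sum."""
    dim = len(next(iter(kmer_vectors.values())))
    total = [0.0] * dim
    for kmer in kmers(sequence, k):
        vec = kmer_vectors.get(kmer)
        if vec is None:  # skip k-mers without a trained vector
            continue
        total = [t + v for t, v in zip(total, vec)]
    norm = math.sqrt(sum(t * t for t in total))
    return [t / norm for t in total] if norm > 0 else total


# ASSUMPTION: toy 3-dimensional vectors with made-up values.
toy_vectors = {
    "AACTG": [1.0, 0.0, 0.0],
    "ACTGG": [0.0, 1.0, 0.0],
    "CTGGA": [0.0, 0.0, 1.0],
}

five_mers = kmers("AACTGGATGA", 5)
print(five_mers)  # 10 - 5 + 1 = 6 k-mers

vec = sequence_vector("AACTGGA", 5, toy_vectors)
print(vec)  # unit-length vector for the whole fragment
```

Normalizing the sum keeps vectors of long and short sequences on the same scale, so that distances between them remain comparable.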

Overview of vector retrieval

In general application systems that involve vector retrieval, developers use a vector search engine such as Faiss to store vector data and then use relational databases to store structured data. You must alternate between both systems during queries. This solution requires extra development efforts and does not provide optimal data query performance.

AnalyticDB for MySQL is a cloud-hosted data warehouse that can process petabytes of data with high concurrency and low latency. It can query billions of vector data records within milliseconds and return responses within 100 milliseconds. AnalyticDB for MySQL is fully compatible with the MySQL protocol and the SQL:2003 syntax. It provides a vector retrieval function to support similarity query and analysis for images, text recommendations, voiceprints, and nucleotide sequences. AnalyticDB for MySQL has been widely used in security projects across multiple cities.

AnalyticDB for MySQL supports the retrieval and analysis of structured and unstructured data. You can use an SQL interface to build systems such as a gene retrieval system or a hybrid retrieval system for gene and structured data. In hybrid retrieval scenarios, the optimizer of AnalyticDB for MySQL selects an execution plan based on the data distribution and query conditions to achieve optimal performance while ensuring the recall rate. For example, you can use the following SQL statement to retrieve an RNA nucleotide sequence:

-- Query gene segments that are similar to the submitted sequence vectors within the RNA sequence.
select title,  # The article name.
       length, # The gene length.
       type,   # mRNA or DNA.
       l2_distance(feature, array[-0.017,-0.032,...]::real[]) as distance # The vector distance.
from demo.paper a, demo.dna_feature b
where a.id = b.id
order by distance; # Sort by vector similarity.

In the preceding SQL statement, the demo.paper table stores the basic information of each uploaded article, and the demo.dna_feature table stores the vectors that correspond to gene sequences of each species. The gene-to-vector model is used to convert genes to vectors such as [-0.017,-0.032,...] and these vectors can be used for retrieval in AnalyticDB for MySQL databases.

The current system also supports hybrid retrieval of structured and unstructured information (nucleotide sequences). For example, you only need to add the condition and title like '%COVID-19%' to the WHERE clause of the SQL statement to query gene segments that are similar to 2019-nCoV.
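
The following is a minimal, runnable sketch of such a hybrid query. It uses the sqlite3 module of the Python standard library as a local stand-in for AnalyticDB for MySQL. The l2_distance function registered here is a hypothetical Python emulation, vectors are stored as JSON text, and all rows are toy data. None of this reflects how AnalyticDB for MySQL stores or indexes vectors.

```python
import json
import math
import sqlite3


def l2_distance(feature_json, query_json):
    """Euclidean distance between two vectors stored as JSON arrays (emulation only)."""
    a, b = json.loads(feature_json), json.loads(query_json)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


conn = sqlite3.connect(":memory:")
conn.create_function("l2_distance", 2, l2_distance)
conn.executescript("""
    ATTACH DATABASE ':memory:' AS demo;
    CREATE TABLE demo.paper (id INTEGER PRIMARY KEY, title TEXT, length INTEGER, type TEXT);
    CREATE TABLE demo.dna_feature (id INTEGER PRIMARY KEY, feature TEXT);
""")

# Toy rows with made-up titles, lengths, and 2-dimensional vectors.
conn.executemany(
    "INSERT INTO demo.paper VALUES (?, ?, ?, ?)",
    [
        (1, "COVID-19 genome study", 100, "RNA"),
        (2, "SARS coronavirus analysis", 200, "RNA"),
        (3, "COVID-19 spike protein", 300, "RNA"),
    ],
)
conn.executemany(
    "INSERT INTO demo.dna_feature VALUES (?, ?)",
    [(1, "[0.0, 0.0]"), (2, "[0.1, 0.0]"), (3, "[0.5, 0.5]")],
)

# Structured filter (title) combined with vector ordering (distance).
query_vector = json.dumps([0.1, 0.0])
rows = conn.execute(
    """
    SELECT title, length, type, l2_distance(feature, ?) AS distance
    FROM demo.paper a, demo.dna_feature b
    WHERE a.id = b.id
      AND title LIKE ?
    ORDER BY distance
    """,
    (query_vector, "%COVID-19%"),
).fetchall()

titles = [row[0] for row in rows]
print(titles)
```

The row whose vector is closest to the query is excluded because its title does not match the filter. This is the behavior that a hybrid query is expected to show: the structured condition and the vector ordering are evaluated in one statement.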

Appendix

  • [1] Mikolov Tomas; et al. (2013). "Efficient Estimation of Word Representations in Vector Space". arXiv:1301.3781.
  • [2] Mikolov Tomas, Sutskever Ilya, Chen Kai, Corrado, Greg S. and Dean Jeff (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems. arXiv:1310.4546. Bibcode:2013arXiv1310.4546M.
  • [3] Mapleson Daniel, Garcia Accinelli, Gonzalo, Kettleborough George, Wright Jonathan and Clavijo, Bernardo J. (2016). "KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies". Bioinformatics. 33(4): 574-576. doi:10.1093/bioinformatics/btw663. ISSN 1367-4803. PMC 5408915. PMID 27797770.
  • [4] Quoc Le and Tomas Mikolov. (2014). Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188-1196.
  • [5] The Human Genome HG38, http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chromFa.tar.gz.
  • [7] Julia Piantadosi, Phil Howlett and John Boland. (2007). "Matching the grade correlation coefficient using a copula with maximum disorder", Journal of Industrial and Management Optimization, 3 (2), 305-312.
  • [8] Stephen Woloszynek, Zhengqiao Zhao, Jian Chen and Gail L. Rosen. (2019). "16s rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses", PLoS Computational Biology, 15(2), e1006721.
  • [9] James K. Senter, Taylor M. Royalty, Andrew D. Steen and Amir Sadovnik. (2019) "Unaligned Sequence Similarity Search Using Deep Learning.", arXiv e-prints.
  • [10] Ng Patrick. (2017) dna2vec: consistent vector representations of variable-length k-mers. arXiv preprint, arXiv:1701.06279.