Efficient gene sequence retrieval empowers quick analysis of the pneumonia virus

Scenarios and current situation of the gene sequence retrieval technology

The gene sequence retrieval technology is used in the following scenarios:

Contact tracing and analysis of the pneumonia virus, helping locate virus hosts and implement effective prevention and control measures.
Analysis of virus replication and transmission, helping the development of therapeutic drugs and vaccines.
Retrieval of gene sequences of viruses similar to the pneumonia virus.

As the virus spreads rapidly, an efficient matching algorithm is urgently needed for gene sequence retrieval. In this context, the AnalyticDB for MySQL technical team converted gene fragments into 1024-dimensional feature vectors, so that matching two gene fragments becomes a calculation of the distance between two vectors. This reduces computing overheads and shortens the time required to return results to milliseconds, which makes the approach suitable for preliminary screening of gene fragments. The BLAST algorithm for gene similarity calculation is then applied to the screened candidates to generate a precise similarity ranking, which completes the matching calculation of gene sequences more efficiently. In the screening step, the cost of comparing two fragments of lengths M and N is reduced from O(M+N) to O(1), because the vectors have a fixed dimension. AnalyticDB for MySQL also provides powerful machine learning analysis tools. These tools can convert local and disease-related target gene fragments into feature vectors through the gene-to-vector technology. These vectors can then be used in the research and development of gene drugs to accelerate the process of genetic analysis.

Gene retrieval system of AnalyticDB for MySQL

The RNA sequence of the pneumonia virus can be expressed as a string of nucleotides, which is also called a base sequence. An RNA sequence is made up of four nucleotides: adenine (A), cytosine (C), guanine (G), and uracil (U). Sequence databases such as GenBank conventionally write uracil as T, the letter for thymine in DNA, and the examples in this article follow that convention. Each letter represents a base, and these bases are linked together without gaps. Each species has its unique RNA sequence, but patterns can be found. The gene retrieval system can retrieve genes similar to the ones submitted to the system and analyze the RNA sequence of a specific virus.

To demonstrate how to use AnalyticDB for MySQL to retrieve gene fragments, the AnalyticDB for MySQL technical team imported a large amount of virus RNA fragment data from GenBank, together with virus-related papers from GenBank and Google Scholar, into the AnalyticDB for MySQL gene retrieval database.

Then, the technical team uploaded the pneumonia virus sequence to the gene retrieval system of AnalyticDB for MySQL. Within milliseconds, the system returned the similar gene fragments with a match degree greater than 0.8, including the pneumonia virus carried by pangolins (GD/P1L), the pneumonia virus carried by bats (RaTG13), the SARS virus, and the MERS virus. GD/P1L was the best sequence match, with a match degree of 0.974. It is speculated that the pneumonia virus was transmitted to people through pangolins.

If two RNA fragments are very similar, the two RNA sequences may have similar protein expressions and structures. The match degrees between the pneumonia virus and the SARS virus and between the pneumonia virus and the MERS virus are both greater than 0.8. This indicates that some research results for the SARS or MERS virus can be used to better understand the pneumonia virus. The system obtains academic papers about each matched virus and uses a text classification algorithm to divide these papers into the testing, vaccine, and medication categories. One of the testing methods for SARS is fluorescence quantitative PCR detection, and this method is also used to test for the pneumonia virus. Gene vaccines and methods for in vivo induction of immune responses are under development. Remdesivir and relevant interferons are used to treat patients infected with the pneumonia virus.

Architecture

AnalyticDB for MySQL is used in the gene retrieval system to store and query the feature vectors produced for gene sequences, together with structured data such as the gene sequence length, the name of the academic paper that contains the sequence, and the gene type (DNA or RNA). During the query process, a gene vector extraction model converts the submitted gene into a vector, and a coarse-ranking retrieval is performed in the AnalyticDB for MySQL vector database. The BLAST algorithm is then applied to the vector matching result set to produce a precise ranking and return the most similar gene sequences.
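The two-stage flow described above (coarse ranking by vector distance, then precise ranking of the candidates) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the record layout, the function names, and the two-dimensional toy vectors are hypothetical, and the precise stage uses difflib's string similarity ratio from the Python standard library as a stand-in, not the BLAST algorithm.

```python
import math
from difflib import SequenceMatcher


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def coarse_candidates(query_vec, records, top_n):
    """Stage 1: keep the top_n records whose vectors are closest to the query."""
    ranked = sorted(records, key=lambda r: l2_distance(query_vec, r["vector"]))
    return ranked[:top_n]


def precise_score(seq_a, seq_b):
    """Stage 2 stand-in: a string similarity ratio, NOT a real BLAST score."""
    return SequenceMatcher(None, seq_a, seq_b).ratio()


def retrieve(query_seq, query_vec, records, top_n=2):
    """Screen by vector distance, then re-rank the survivors by sequence similarity."""
    candidates = coarse_candidates(query_vec, records, top_n)
    return sorted(candidates,
                  key=lambda r: precise_score(query_seq, r["sequence"]),
                  reverse=True)


# Hypothetical toy records; real vectors would have 1024 dimensions.
records = [
    {"name": "seq_a", "sequence": "AACTGGATGA", "vector": [0.9, 0.1]},
    {"name": "seq_b", "sequence": "AACTGGTTGA", "vector": [0.8, 0.2]},
    {"name": "seq_c", "sequence": "GGGGCCCCAA", "vector": [0.0, 1.0]},
]

results = retrieve("AACTGGATGA", [0.9, 0.1], records, top_n=2)
print([r["name"] for r in results])
```

The point of the design is that the expensive sequence comparison only runs on the few candidates that survive the cheap vector comparison.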

The core of the gene retrieval system of AnalyticDB for MySQL is the gene vector extraction model, which converts nucleotide sequences into vectors. The model is trained on RNA sequence samples from a variety of viruses so that it can better calculate the RNA similarity between viruses. The gene vector extraction model can be easily extended to genes of other species.

Gene vector extraction algorithm

Word vector technology is already widely and successfully used in fields such as machine translation, reading comprehension, and semantic analysis. Word vectorization uses a distributional semantic approach to express the meaning of a word: the meaning of a word lies in its context. Think back to tests where you have to use the words in a word bank to fill in missing words in a paragraph. In these tests, the context of a word accurately reflects the word itself. If you choose the correct word, you have understood the meaning of the missing word. A word vector algorithm can therefore generate a vector for each word in a text from the relationship between that word and its surrounding words. The similarity of word vectors can then be calculated to obtain the similarity between words.

Similarly, gene sequences follow specific rules, and each part of a gene sequence expresses different functionalities and meanings. A long gene sequence can be divided into smaller units ("words") for research purposes. These "words" also have a context, because they are interconnected and interact with each other to complete corresponding functionalities and form expressions. Biologists use the word vector algorithm to vectorize gene sequence units. A high similarity between two gene units indicates that the two units tend to appear together and jointly express a functionality.

Typically, the gene vector extraction algorithm of AnalyticDB for MySQL involves the following steps:

Define words in a nucleotide sequence.
In the bioinformatics field, k-mers are used to analyze nucleotide sequences. K-mers are the strings of K bases obtained by sliding a window of K bases along a continuous nucleotide sequence, one base at a time. If the length of the nucleotide sequence is L, the number of k-mers that can be obtained is L - K + 1. For example, if the length of a sequence is 12 and the k-mer length is 8, then 12 - 8 + 1 = 5, so five 8-mers are obtained. These k-mers are equivalent to the "words" in the nucleotide sequence.
Find the context in the nucleotide sequence and convert the "words" of the gene sequence into 1024-dimensional vectors.
The context plays an important role in word vector algorithms. The gene vector extraction algorithm of AnalyticDB for MySQL selects a window of fixed length from a nucleotide fragment. The k-mers in this window are considered to be within the same context. For example, if a window with a length of 10 covers the nucleotide sequence AACTGGATGA, the algorithm converts the window into the following 5-mers: AACTG, ACTGG, CTGGA, TGGAT, GGATG, and GATGA. For CTGGA, the other five 5-mers compose its context. The algorithm uses a word vector space training model to train on the existing genetic k-mers and convert the k-mers into 1024-dimensional vectors.
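The two steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not the AnalyticDB for MySQL implementation: the function names kmers and context_pairs are hypothetical, and the window AACTGGATGA is a 10-base example, which yields 10 - 5 + 1 = 6 five-mers.

```python
def kmers(sequence, k):
    """Return every length-k substring of `sequence`, sliding one base at a time."""
    if k <= 0 or k > len(sequence):
        return []
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def context_pairs(words):
    """Pair each k-mer with the other k-mers from the same window (its context)."""
    return [(word, words[:i] + words[i + 1:]) for i, word in enumerate(words)]


window = "AACTGGATGA"      # a 10-base window
words = kmers(window, 5)   # the "words" of the window
print(words)

for target, context in context_pairs(words):
    print(target, "<-", context)
```

Pairs of this kind (a target k-mer and its context k-mers) are the training input that a word vector model consumes; the training itself is outside the scope of this sketch.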
Similar to word vector models, k-mer vector models also perform mathematical computations on vectors.
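- Vector subtraction: $\vec{v}(\mathrm{ACGAT}) - \vec{v}(\mathrm{GAT}) \approx \vec{v}(\mathrm{AC})$
- Vector addition: $\vec{v}(\mathrm{AC}) + \vec{v}(\mathrm{ATC}) \approx \vec{v}(\mathrm{ACATC})$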
The vector subtraction formula indicates that the result of "ACGAT vector minus GAT vector" is very close to the AC vector. The vector addition formula indicates that the result of "AC vector plus ATC vector" is also very close to the ACATC vector. Based on these mathematical characteristics, you can calculate the vector of a long nucleotide sequence by adding up the vectors of the k-mers in the sequence and then normalizing the result. To improve the accuracy of this approach, you can consider a gene fragment as a text fragment and use doc2vec to convert the whole sequence into a vector for calculation. To verify the performance of the algorithm, the team compared the similarity ranking produced by the BLAST algorithm with the ranking produced by the L2 distances between gene vectors. The Spearman rank correlation coefficient of the two rankings is 0.839. This shows that converting nucleotide sequences into vectors is an effective method of preliminary screening for similar gene fragments.
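A minimal sketch of the sum-and-normalize approach and of the rank comparison follows. All names and numbers are assumptions made for illustration: the two-dimensional k-mer vectors stand in for trained 1024-dimensional vectors, the similarity scores and distances are made up rather than measured, and the Spearman helper uses the simple formula that is valid only when there are no tied values.

```python
import math


def sequence_vector(kmer_list, kmer_vectors):
    """Sum the vectors of the known k-mers, then scale the sum to unit length."""
    dim = len(next(iter(kmer_vectors.values())))
    total = [0.0] * dim
    for kmer in kmer_list:
        vec = kmer_vectors.get(kmer)
        if vec is None:  # skip k-mers that the toy model has never seen
            continue
        total = [t + v for t, v in zip(total, vec)]
    norm = math.sqrt(sum(t * t for t in total))
    return [t / norm for t in total] if norm > 0 else total


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ranks(values):
    """Rank positions (0 = smallest); assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical 2-dimensional vectors for three 3-mers.
toy_vectors = {"ACG": [1.0, 0.0], "CGT": [0.0, 1.0], "GTA": [1.0, 1.0]}
print(sequence_vector(["ACG", "CGT", "GTA"], toy_vectors))

# Made-up scores for four candidates: a higher similarity should pair with a
# lower distance, so the distances are negated before the ranks are compared.
similarity_scores = [0.97, 0.91, 0.85, 0.80]
distances = [0.10, 0.30, 0.20, 0.50]
print(spearman(similarity_scores, [-d for d in distances]))
```

A coefficient close to 1 means that the vector distances order the candidates almost the same way as the precise similarity scores do, which is the property that makes the vectors usable for preliminary screening.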

Overview of vector retrieval

In general application systems that involve vector retrieval, developers use a vector search engine such as Faiss to store vector data and then use relational databases to store structured data. You must use the two systems to query different data. This solution requires extra development efforts and does not provide optimal data query performance.

AnalyticDB for MySQL is a cloud-hosted data warehousing service that can process petabytes of data with high concurrency and low latency. It can perform queries on billions of vector data records within milliseconds and return responses within 100 milliseconds. AnalyticDB for MySQL is fully compatible with the MySQL protocol and the SQL:2003 syntax. It provides a vector retrieval feature to support similarity query and analysis for images, text recommendations, voiceprints, and nucleotide sequences. AnalyticDB for MySQL has been widely used in security projects across multiple cities.

AnalyticDB for MySQL supports the retrieval and analysis of structured and non-structured data. You can use an SQL interface to build systems such as the gene retrieval system or hybrid retrieval system for gene and structured data. In hybrid retrieval scenarios, the optimizer of AnalyticDB for MySQL selects the optimal execution plan based on the data distribution and query conditions to achieve optimal performance while ensuring the recall rate. For example, you can use the following SQL statement to retrieve an RNA nucleotide sequence:

-- Query gene segments that are similar to the submitted sequence vector within the RNA sequences.
select title,   -- The article name.
       length,  -- The gene length.
       type,    -- mRNA or DNA.
       l2_distance(feature, array[-0.017,-0.032,...]::real[]) as distance  -- The vector distance.
from demo.paper a, demo.dna_feature b
where a.id = b.id
order by distance;  -- Sort by vector distance, most similar first.

In the preceding SQL statement, the demo.paper table stores the basic information of each uploaded article, and the demo.dna_feature table stores the vectors that correspond to gene sequences of each species. The gene-to-vector model is used to convert genes into vectors such as [-0.017,-0.032,...], and these vectors can be used for retrieval in AnalyticDB for MySQL databases.

The current system also supports hybrid retrieval of structured and non-structured information (nucleotide sequences). For example, to query only gene segments from articles about the pneumonia virus, you only need to add one condition to the existing where clause of the SQL statement: and title like '%COVID-19%'.
