Construction and Practice of a Multivariate Scientific Computing System in the Drug R&D Pipeline
01 Scientific computing is driving the trend in drug R&D
A figure published in Nature Reviews in early 2022 shows that the number of drug R&D projects driven by scientific computing or artificial intelligence grew from 6 in 2010 to 158 in 2021, a more than 25-fold increase over those 11 years. Over the same period, the number of traditional drug R&D projects fell from 705 to 333: traditional R&D remains the dominant model, but it is clearly trending downward.
Traditional drug R&D pipelines involve a large number of wet-lab experiments, and most optimization is guided by scientists' personal experience and experimental results: the optimization path is long, and both cost and cycle time are high. A computation-driven pipeline, by contrast, combines dry and wet approaches and reduces the wet-lab workload. Moreover, many data-driven methods learn from historical or global experimental data and tend toward global optimization when refining compounds, so the optimization process is faster, the cost is lower, and iteration is quicker.
The same review covered 24 biopharmaceutical companies worldwide whose drug research is driven by scientific computing/AI, 15 of which have entered the clinical trial stage. It is reasonable to expect that in the near future more computation-driven drugs will reach the market, benefiting more patients.
02 Characteristics and problems at different stages of the drug R&D pipeline
As an innovative small-molecule drug R&D institution, the Global Health Drug Research and Development Center also uses a variety of computational methods to solve different problems in the early stages of drug R&D.
The general process in the early stage of drug research and development is as follows:
Stage 1: disease biology, that is, selecting the disease to address. Diseases can be roughly divided into exogenous and endogenous. Exogenous diseases are tissue diseases caused by foreign organisms or non-living matter invading the human body, such as harmful microorganisms (bacteria, viruses, malaria parasites) or non-living matter such as dust. Endogenous diseases are tissue diseases caused by human genetic variation or functional disorders, such as tumors, cardio-cerebrovascular diseases, chronic diseases, and rare diseases.
The Global Health Drug Research and Development Center focuses on the public-interest domain of global health. We pay attention not only to exogenous infectious diseases such as tuberculosis, coronavirus infection, malaria, and parasitic infections, but also to endogenous diseases such as intestinal disorders like environmental enteric dysfunction (EED).
Stage 2: target establishment and validation, i.e., identifying proteins or biomarkers strongly associated with the disease. This stage involves a variety of heterogeneous data: researchers need to analyze the disease mechanism and how the disease manifests in biological network pathways, and also draw on multi-omics information such as gene variation and expression.
Stage 3: establishing, screening, or designing molecules that can interact with the target protein, namely hit compounds. The goal is, on the one hand, to screen synthesis libraries for small molecules likely to show activity and, on the other, to design novel active molecules. At this stage a large amount of physical or virtual compound library data is available, reaching billions of entries in sources such as ChemDiv and ZINC. However, active compounds against a given target protein are relatively rare, especially for rare or neglected diseases.
Stage 4: hit-to-lead compound optimization
Stage 5: preclinical candidate selection. In these two stages, researchers must consider not only the interaction between the compound and the target protein but also pharmacokinetics, the synthesis route, and druggability, including distribution, metabolism, and toxicology, so as to optimize and design a truly effective and safe drug after balancing these properties. This is a comprehensive optimization process involving large-scale ADMET data collection and model training, alongside a small amount of experimental data from the R&D pipeline projects themselves.
03 Construction of the multivariate scientific computing system
From disease selection through target establishment, the data are diverse and heterogeneous.
For endogenous diseases, multi-omics analysis is usually carried out. By analyzing metabolomics, genomics, or proteomics data from healthy people and patients, we can identify hub or key genes/proteins strongly associated with the disease as target candidates. Once the protein sequence is obtained, its 3D structure is predicted with a structure prediction model. Among such models, AlphaFold is the innovative deep learning method of recent years, but traditional machine learning and physical modeling methods can also yield candidate target structures.
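To make the hub-gene idea concrete, here is a minimal sketch that ranks the nodes of a toy protein-protein interaction network by centrality. The edge list and gene symbols are illustrative placeholders, not data from our pipeline; a real analysis would start from a database such as STRING or BioGRID.

```python
# Minimal sketch: ranking candidate hub genes/proteins in a PPI network.
import networkx as nx

# Hypothetical interaction edges (gene symbols are placeholders).
edges = [
    ("TP53", "MDM2"), ("TP53", "EP300"), ("TP53", "ATM"),
    ("MDM2", "MDM4"), ("EP300", "CREBBP"), ("ATM", "CHEK2"),
    ("TP53", "CHEK2"), ("TP53", "CREBBP"),
]
g = nx.Graph(edges)

# Hub candidates: nodes scoring high on degree and betweenness centrality.
degree = nx.degree_centrality(g)
between = nx.betweenness_centrality(g)
combined = {n: degree[n] + between[n] for n in g.nodes}

for gene, s in sorted(combined.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{gene}: combined centrality {s:.3f}")
```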
For exogenous diseases, methods to find targets include:
1. Analyze human immune mechanisms, such as the fusion mechanism, by studying human multi-omics information to find key human targets;
2. Directly analyze the pathogen's multi-omics profile and select proteins in key pathways as target candidates;
3. For relatively simple pathogens such as viruses, directly obtain the protein sequences involved in invasion, fusion, or transcription in the human host, predict all the relevant protein structures, and provide them to biologists or chemists for analysis to determine the target.
Structural biologists then need to resolve the actual structure of the selected protein and use it to verify and calibrate the predicted 3D structure for analysis and prediction in subsequent stages.
After the target protein is determined, the next step is to find possible binding pockets on the target, i.e., the sites at which small-molecule compounds can interact with it. There are two main computational approaches to determining whether a compound can interact with the target, that is, whether it has potential activity:
1. Use physical simulation methods such as molecular mechanics or quantum mechanics;
2. Use machine learning or deep learning methods. Both approaches are used to virtually screen compounds with a high likelihood of interacting with the target from known or virtually generated compound libraries, yielding candidate compounds (a minimal machine-learning screening sketch follows this list).
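As a minimal sketch of the machine-learning route, the example below featurizes molecules with Morgan fingerprints (RDKit) and ranks a small "library" with a random forest. The SMILES strings and activity labels are illustrative placeholders; a production screen would train on far more data and score millions of compounds.

```python
# Sketch of ML-based virtual screening with RDKit + scikit-learn.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

def featurize(smiles):
    """Morgan (ECFP4-like) fingerprint as a numpy bit vector."""
    mol = Chem.MolFromSmiles(smiles)
    return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048))

# Toy training set: (SMILES, active-against-target label) - placeholders.
train = [("CCO", 0), ("c1ccccc1O", 1), ("CC(=O)Oc1ccccc1C(=O)O", 1), ("CCCC", 0)]
X = np.array([featurize(s) for s, _ in train])
y = np.array([label for _, label in train])
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# "Screen" an unseen library: rank compounds by predicted activity probability.
library = ["CCN", "c1ccccc1C(=O)O", "CCOC(=O)C"]
probs = model.predict_proba(np.array([featurize(s) for s in library]))[:, 1]
for smi, p in sorted(zip(library, probs), key=lambda t: -t[1]):
    print(f"{smi}: predicted activity {p:.2f}")
```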
Beyond virtual screening of compound libraries, more and more researchers are trying to design hit compounds end-to-end directly from the physical and chemical properties of the pocket. This skips the physics-based or machine-learning library-screening step and uses AI to generate potentially active hit compounds directly. I believe this will become one of the key research directions in the future.
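To illustrate just the generative component, here is a toy, untrained character-level SMILES sampler in PyTorch. It is not pocket-conditioned and not our production method; a real end-to-end design system would condition generation on pocket features and train on a large SMILES corpus.

```python
# Toy unconditional SMILES language model (illustration only, untrained).
import torch
import torch.nn as nn

VOCAB = list("^$CNOc1=()#n")       # '^' = start, '$' = end; tiny toy alphabet
stoi = {ch: i for i, ch in enumerate(VOCAB)}

class SmilesLM(nn.Module):
    def __init__(self, vocab=len(VOCAB), emb=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, x, state=None):
        h, state = self.lstm(self.embed(x), state)
        return self.head(h), state

@torch.no_grad()
def sample(model, max_len=40):
    """Sample one string token by token; gibberish until the model is trained."""
    x, state, out = torch.tensor([[stoi["^"]]]), None, []
    for _ in range(max_len):
        logits, state = model(x, state)
        nxt = torch.multinomial(torch.softmax(logits[0, -1], -1), 1).item()
        if VOCAB[nxt] == "$":
            break
        out.append(VOCAB[nxt])
        x = torch.tensor([[nxt]])
    return "".join(out)

print(sample(SmilesLM()))
```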
After candidate hit compounds are obtained, biology and chemistry experts carry out wet-lab verification, or structural biologists resolve the compound-target co-crystal structure, to confirm whether the results match the predictions and to feed the next stage of compound optimization.
Comparing physical simulation with machine learning: physical simulation is widely used by pharmaceutical companies today. Its advantages are that molecular dynamics (MD) docking pose estimation is relatively accurate, FEP+ affinity prediction is relatively accurate, the binding of a small molecule in a pocket can be viewed intuitively in 3D, and the results are highly interpretable. Its disadvantages are that it demands a great deal of computing power, creating a need for elastic supercomputing, and that, because it rests on physical assumptions, its applicable range is narrow: it cannot handle complex mechanisms such as multi-target effects or protein allostery, nor can it predict compound properties at the higher levels of cells, organoids, or human tissue.
Machine learning methods mainly optimize the parameters of a given mathematical model on known training data, so the trained model has a fixed size and can rapidly screen very large compound libraries. They are also based on empirical or experimental data rather than physical assumptions, so they can model and predict properties arising from complex mechanisms or at higher biological levels. Their disadvantage is a heavy dependence on data quality and on the distribution of the data space: when data are plentiful and of high quality, machine learning or deep learning performs well; otherwise performance can be poor. In addition, generalization is limited to the data space the model has seen, and because machine learning is a black-box method, it is difficult for scientists to determine the basis of its predictions.
Take the virtual screening of 1 million small-molecule compounds as an example: physics-based docking takes about 148,600 seconds, whereas deep learning on a V100 GPU takes only 107 seconds, a speed difference of more than 1,000 times. For higher precision, the binding of a compound to a target protein site can be simulated with molecular dynamics: a 200-nanosecond simulation of a 60,000-90,000-atom system takes about 86,400 seconds (a full day) on a V100 GPU. Physics-based methods clearly demand substantial computing power.
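The quoted speedup follows directly from those two timings:

```python
# Back-of-envelope check of the figures quoted above.
docking_s = 148_600   # physics-based docking, 1 million compounds
dl_s = 107            # deep learning screen of the same library on a V100 GPU
print(f"speedup ~= {docking_s / dl_s:.0f}x")   # ~= 1389x, i.e. more than 1000x
```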
In the discovery and establishment of early hits, researchers can usually obtain only very little experimental data for the target. If a deep learning model is trained directly on such data, it sees only a very limited chemical space, and the resulting model generalizes poorly and predicts unreliably. We therefore adopt active learning: expert experience or physics-based scoring functions are used to calibrate the AI model, and the training set is expanded continuously. After several rounds of iteration, the model is ready for use.
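Below is a minimal sketch of such an active-learning loop, assuming an "oracle" (expert labeling or a physics-based scoring function) that can label queried compounds. The features and oracle here are synthetic stand-ins, not our actual model or data.

```python
# Minimal active-learning loop: train, query the most uncertain points,
# have the oracle label them, and retrain.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
pool = rng.random((10_000, 64))               # unlabeled compound features (synthetic)
oracle = lambda X: X[:, 0] + X[:, 1] > 1.0    # stand-in for expert/physics labeling

labeled = np.zeros(len(pool), dtype=bool)     # start from a small labeled seed set
labeled[rng.choice(len(pool), 50, replace=False)] = True

for round_ in range(5):
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(pool[labeled], oracle(pool[labeled]))
    # Query the points the model is least certain about (probability near 0.5).
    uncertainty = np.abs(model.predict_proba(pool)[:, 1] - 0.5)
    uncertainty[labeled] = np.inf             # never re-query labeled points
    labeled[np.argsort(uncertainty)[:100]] = True
    print(f"round {round_}: labeled {labeled.sum()}, "
          f"pool accuracy {model.score(pool, oracle(pool)):.3f}")
```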
In addition, because many AI models are themselves black boxes, biologists and chemists may not fully trust their outputs. For this reason we developed Ligandformer, a deep learning algorithm based on the self-attention mechanism. The model not only gives a prediction score for a compound's properties or activities but also explains the contribution of each molecular fragment to them, for researchers' reference.
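The sketch below is not the Ligandformer implementation; it only illustrates the general idea of an attention readout whose weights can be read as per-atom (and hence per-fragment) contributions to the predicted property.

```python
# Illustration of attention-based interpretability (not Ligandformer itself).
import torch
import torch.nn as nn

class AttentiveReadout(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.attn = nn.Linear(dim, 1)   # scores each atom embedding
        self.head = nn.Linear(dim, 1)   # property/activity prediction

    def forward(self, atom_feats):      # atom_feats: (n_atoms, dim)
        w = torch.softmax(self.attn(atom_feats), dim=0)  # attention weights
        pooled = (w * atom_feats).sum(dim=0)             # weighted molecule embedding
        return torch.sigmoid(self.head(pooled)), w.squeeze(-1)

model = AttentiveReadout()
atom_feats = torch.randn(12, 64)        # e.g., the output of a graph encoder
score, contrib = model(atom_feats)
print(f"predicted activity {score.item():.2f}")
print("per-atom contribution weights:", contrib.detach().numpy().round(3))
```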
Moving from hit to lead and then to preclinical candidate requires a series of optimizations and transformations of the hit compound. Computationally, the general workflow is to pre-train property models on big data, fine-tune the pre-trained models on experimental data from the actual R&D pipeline, and then use the fine-tuned models to screen large numbers of modified lead-compound structures. Finally, after balancing the various properties, a candidate list is produced for biologists and chemists to review, select from, and verify in the next round of wet-lab experiments.
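A minimal sketch of that pretrain-then-fine-tune pattern follows; the checkpoint path, dimensions, and data are hypothetical placeholders.

```python
# Sketch: fine-tune a pretrained property model on a small in-house set.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(2048, 256), nn.ReLU(), nn.Linear(256, 256))
head = nn.Linear(256, 1)
# encoder.load_state_dict(torch.load("pretrained_encoder.pt"))  # hypothetical checkpoint

for p in encoder.parameters():          # freeze the pretrained encoder;
    p.requires_grad = False             # only the task head is trained here
opt = torch.optim.Adam(head.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()

X = torch.randn(32, 2048)                 # fingerprints of in-house compounds (synthetic)
y = torch.randint(0, 2, (32, 1)).float()  # measured property labels (synthetic)
for epoch in range(20):
    opt.zero_grad()
    loss = loss_fn(head(encoder(X)), y)
    loss.backward()
    opt.step()
# The fine-tuned model then scores large batches of proposed lead modifications.
```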
As described above, computation runs through the entire early stage of drug R&D.
04 Practice of the multivariate scientific computing system on the E-HPC platform
At the start of the COVID-19 outbreak in 2020, the Alibaba Cloud team worked with us to build a public information platform for fighting COVID-19, collecting virus research information from sources worldwide. We also built a prediction service platform, an external service running on the supercomputing platform that scientists can use free of charge. The service has since been upgraded and optimized and is widely used in more than 20 internal and external collaboration projects.
In addition, we have collected and collated a large amount of data from commercial and non-commercial databases worldwide and built a visual structure-property data analysis tool to help scientists conduct research more effectively.
In an earlier hit-discovery project, we needed to better characterize and screen the chemical space of a 400,000-compound library from PubChem. We used an active learning strategy to train the deep learning model and screen the library. After five iterations of active learning, the error rate dropped from 7.98% to less than one in ten thousand. Meanwhile, guided by expert experience, the training set was expanded continuously, yet it grew by only a little more than 1,500 samples. The final total of just over 2,800 training samples is not large, but it gave the model strong enough discrimination to resolve the chemical space of the 400,000-compound library.
We also performed retrospective validation on 37 experimental data points from the project: from the initial model to the fifth, accuracy rose from 75% to 86%.
Together with Peking Union Medical College Hospital, we carried out research on rare diseases, using an internally developed bioinformatics network interaction algorithm to recalibrate the protein-protein interaction network. Using the calibrated network and bioinformatics statistical methods, we identified a new drug candidate for the rare disease ATTR (transthyretin amyloidosis) and at the same time repositioned a drug for lymphocytic leukemia. This work has been published in a medical journal.
In general, molecular-mechanics-based methods apply mainly to tasks with a known target or where a target must be determined, such as early target establishment, hit identification, and hit-to-lead optimization. Machine learning/deep learning methods apply to hit identification, hit-to-lead optimization, and preclinical candidate optimization; they can also be applied in target-unknown scenarios, where only phenotypic data are available and must be modeled in a data-driven way, for example predicting cell-, tissue-, organoid-, or human-level properties in later stages of drug R&D and analyzing druggability.
05 Challenges and opportunities
In the future, we will conduct in-depth research in the following aspects:
First, research on complex therapeutic mechanisms and targets, for example the drug resistance of bacteria and the prediction of protein conformational (allosteric) changes.
Second, mutation prediction for target active sites. For example, the coronavirus keeps mutating, and computational analysis can determine whether a drug remains effective against a mutated site.
Third, molecular design of innovative drugs. More and more researchers are focusing on generating and designing active molecules from protein target pockets, and molecules can also be generated and designed end-to-end from phenotypic data.
So how do we solve, or break through, these problems? First, data are essential. Beyond the physicochemical properties of molecular compounds, we can integrate more horizontal data, such as network information from biological networks and pathways, as well as lower-level data such as electron cloud density.
Such huge, diverse, and heterogeneous data require powerful algorithms that can fuse data across levels and scales and extract the pattern features needed for the final prediction task. None of this is possible without a supercomputing platform, so our demand for supercomputing keeps growing: we need greater data capacity and processing power, and ever faster turnaround.
I believe that with data, algorithms, and supercomputing platforms working together, and with the joint efforts of cross-disciplinary, cross-industry talent, the pharmaceutical research industry will achieve greater breakthroughs.