Pain Points and Directions for AI Adoption in the Pharmaceutical Industry
The NIH's 4D map is the gold standard for the global pharmaceutical industry. Take small molecules as an example: from target identification to the discovery and optimization of lead compounds, and from early discovery through development to the final clinical trial, each step has a very mature methodology, experimental platform, theoretical basis, and regulatory standard.
However, the degree of digitalization in this system is very low; among all industries, the pharmaceutical industry is one of the least digitalized.
Data related to translational medicine and biomarkers, clinical data, regulatory data, medical insurance data, and signal data from clinical and in vitro sampling all have to be handled by different institutions and researchers. Given this, there are only two ways for the pharmaceutical industry to improve the efficiency of the system: first, redefine the whole system; second, strip the noise from historical data, find the signal, and replace outdated methodology with the most advanced methodology.
The figure above shows the drug screening process. The number of experiments required from the first step to the last determines the overall efficiency of the system. The traditional process typically screens about 20,000 molecules to arrive at one, and blind screening typically starts from about 2 million molecules. If 100 molecules could serve as the starting point instead, the investment and time consumed across the whole industry would fall by more than 80%.
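As a back-of-the-envelope illustration of why the size of the starting pool matters so much, the sketch below compares raw experiment counts under the two funnels. The cost unit is an arbitrary placeholder, and note that the >80% figure above refers to overall industry investment and time, not just experiment counts.

```python
# Rough funnel arithmetic: cost scales with the number of molecules that must be
# synthesized and tested. The cost unit is an arbitrary placeholder.
COST_PER_MOLECULE_TESTED = 1.0

def campaign_cost(n_molecules_screened: int) -> float:
    """Total experimental cost, assuming cost is linear in molecules tested."""
    return n_molecules_screened * COST_PER_MOLECULE_TESTED

traditional = campaign_cost(20_000)   # classic funnel: ~20,000 molecules -> 1 candidate
ai_assisted = campaign_cost(100)      # AI-proposed starting set of ~100 molecules

saving = 1 - ai_assisted / traditional
print(f"relative saving in screening experiments: {saving:.1%}")  # 99.5%
```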
This screening process has been in use for many years, but its input-output ratio has declined year after year over the past decade. So we need to ask: how can we break through the existing screening process, and can AI help improve efficiency?
Strictly speaking, AI is not a tool, because tools need people to use them, whereas AI can optimize itself and achieve its goals without human help. Under the disciplinary definition of AI, it needs to be able to think and act like a person, ultimately confirmed by the Turing test and similar methods.
But the biggest difficulty in applying AI to the pharmaceutical industry is defining goals for it. In a pharmaceutical problem, for example, the goal can be to optimize selectivity, overall in vivo efficacy, or the final applicable patient population. Given enough data, AI can actually achieve these goals by its own means.
Therefore, people need to do two things: first, clarify the goal; second, clarify what data and rules need to be fed to the AI. The AI is then responsible for achieving the goal.
Artificial intelligence is itself an interdisciplinary field, and pharmacy likewise involves multi-dimensional information from biochemistry, cell biology, physiology, and more. How to integrate these many huge disciplinary systems efficiently is the biggest challenge we face.
The data covered in the figure above essentially spans the inputs of all calculations in the pharmaceutical industry. QM (quantum mechanics), DFT (density functional theory), Molecular Mechanics, and Molecular Dynamics are purely physics-based methods; DFT and Molecular Mechanics include some experimental parameters for calibration, while QM depends entirely on the atomic composition of the input molecule. They compute at different levels of precision, but precision and accuracy are two completely different statistical parameters. We do not necessarily need the highest precision, but we do need the highest accuracy, so that we can judge the next step more soundly.
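A minimal numerical sketch of that distinction, with purely synthetic numbers: a method can be precise (tight scatter) yet inaccurate (systematically biased), and it is the bias that misleads the next decision.

```python
import statistics

true_binding_energy = -10.0  # kcal/mol, hypothetical ground truth

# Method A: precise but biased (tight scatter, systematic offset)
method_a = [-7.9, -8.0, -8.1, -8.0, -7.95]
# Method B: noisier but unbiased (wider scatter, centered on the truth)
method_b = [-9.2, -10.6, -10.1, -9.7, -10.4]

for name, values in [("A (precise, inaccurate)", method_a),
                     ("B (less precise, accurate)", method_b)]:
    bias = statistics.mean(values) - true_binding_energy
    spread = statistics.stdev(values)
    print(f"{name}: bias={bias:+.2f}, spread={spread:.2f}")
```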
However, each of these methodologies has its limitations. For example, QM works at electronic resolution, which is only feasible for materials and some small solution-phase chemical systems. Extending it to organisms requires more approximations and some sacrifice of accuracy, which is where DFT comes in. Molecular dynamics, in turn, borrows classical mechanics and empirical parameters to approximate what quantum mechanics would give, which extends the scale of calculation to the level of a single protein, while the resolution drops from electrons to atoms.
However, protein-protein interactions must then be calculated, and for higher-order systems such as the roughly 42 million proteins in a single cell, all the computers in the world combined could not handle an MD calculation. The human body ultimately requires physiological-level results; starting from atoms, truly mapping from the molecular level to the human body would require on the order of 42 million x 30 trillion calculations (proteins per cell times cells in the body, roughly 10^21 protein instances). Because of this computing-power limitation, atom-based simulation of biology runs into a wall beyond molecular dynamics, quite apart from the accuracy of the atom-based 3D structures themselves. With the involvement of informatics, we see a glimmer of hope.
Informatics is based on reading signals, which can be divided into two layers. The first layer is the signal of molecular identity: proteins, DNA, small molecules, and so on are all sequences, and a sequence is deterministic, with no noise. The other layer is the macro level: when molecules are put into a system, electrical and fluorescent signals can be observed to obtain various indirect readouts of biological events.
Thanks to informatics, cheminformatics and bioinformatics have made great progress over the past 40 years. Before that, we could only use simple statistical methods to map from micro to macro; since then, multi-omics has been able to sequence the DNA of all species and obtain multi-level data. The computational complexity of QM is roughly O(N^4) to O(N^7), where N reaches at most a few hundred atoms for the largest systems treated at the electronic level. Molecular Mechanics reduces the complexity to roughly O(N^2) to O(N^3), and its largest systems are about 1 million atoms, close to a single virus. In the prediction setting of statistics or machine learning, however, the complexity is close to linear, which amounts to a gain in computational efficiency of 10^6 to 10^7 times. Deep learning is popular today fundamentally because we cannot compute larger biological systems with physical models: by learning from historical data, we trade for the computing power and experimental resources that would otherwise have to be invested in generating that layer of data.
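A rough sketch of where a 10^6 to 10^7 efficiency factor can come from, using the scaling exponents quoted above; the system size and absolute costs are illustrative assumptions only.

```python
# Relative cost of scoring one system under different scaling regimes.
# Exponents follow the text: QM ~ O(N^4..7), molecular mechanics ~ O(N^2..3),
# trained ML inference ~ O(N). Absolute units are arbitrary.
def relative_cost(n_atoms: int, exponent: float) -> float:
    return float(n_atoms) ** exponent

n = 1_000  # a modest protein-ligand pocket, illustrative size
mm_cost = relative_cost(n, 2.5)   # middle of the O(N^2)-O(N^3) range
ml_cost = relative_cost(n, 1.0)   # near-linear inference

print(f"MM / ML cost ratio at N={n}: {mm_cost / ml_cost:.1e}")
# ~3.2e+04 here; at larger N and higher exponents the gap reaches 1e6-1e7.
```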
DNA is static, in the sense that a DNA sequence generally does not change much. Biology, however, is dynamic: measurements of RNA, protein, and metabolism change with a person's age, diet, and physical condition. Moreover, current single-atom process simulations of life can only reach the microsecond scale, while enzyme reactions run on the microsecond-to-millisecond scale, so true process simulation is not yet achievable. With the help of informatics, we can instead achieve black-box, end-to-end simulation.
Physicists keep simplifying physical formulas and their computational complexity so that everything from the smallest drug molecule up to a systems-level view can be simulated at the physical level, each with a different theoretical basis. However, this also means recomputing for each set of experimental conditions and developing separate tools and physical paradigms, which is a rather clumsy approach. We hope to find a universal model with adjustable precision that can solve all these problems with the same model.
Deep learning is our first attempt. As long as there is enough data in a given dimension, a black box can be used to make predictions in that dimension without considering the underlying physical principles. Deep learning has proved very effective in past practice, but it is still not ideal because it is too dependent on data.
The ideal we hope for is a universal, dynamic, multi-scale mathematical formulation that can capture biology at a fundamental level without relying on any data.
The figure above shows concrete numbers. A traditional QM calculation for a single small molecule on one GPU takes several hours to several days (depending on the task), FEP takes about one day, and docking takes a few minutes. In the machine learning setting, thousands to millions of molecules can be evaluated on a CPU in about one minute.
The figure above shows some benchmarks run on Alibaba Cloud. QM takes about half an hour to compute the interaction of a few amino acids. MD takes several hours per nanosecond to simulate the behavior of a large membrane protein, and reaching the microsecond or millisecond scale multiplies that time by 10^3 or 10^6. Once a deep learning model has been trained, prediction is much faster, and million-scale screening can be completed within an hour.
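A back-of-the-envelope restatement of those throughput figures; the exact timings depend on task and hardware, and the specific numbers assumed below are placeholders within the ranges quoted above.

```python
# Orders of magnitude from the text: one QM job takes hours-to-days on a GPU,
# FEP about a day, docking minutes, while a trained ML model scores roughly
# 10^3-10^6 molecules per minute on a CPU.
SECONDS_PER_DAY = 86_400

qm_per_molecule_s = 8 * 3600          # assume ~8 GPU-hours per molecule
ml_molecules_per_minute = 1_000_000   # upper end quoted in the text

molecules_per_day_qm = SECONDS_PER_DAY / qm_per_molecule_s   # ~3
molecules_per_day_ml = ml_molecules_per_minute * 60 * 24     # ~1.4e9

print(f"QM: ~{molecules_per_day_qm:.0f} molecules/day per GPU")
print(f"ML: ~{molecules_per_day_ml:.1e} molecules/day per CPU (upper bound)")

# MD timescale scaling: if 1 ns of a large membrane protein costs 'hours',
# a microsecond costs 1e3x and a millisecond 1e6x that.
ns_cost_hours = 5                      # hypothetical hours per simulated ns
print(f"1 us of MD: ~{ns_cost_hours * 1e3:.0f} hours, "
      f"1 ms: ~{ns_cost_hours * 1e6:.0f} hours")
```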
Machine learning is already widely used in the pharmaceutical field, for example in protein structure prediction, function prediction, gene editing, systems biology, and proteomics closer to the physiological level. The ultimate bottleneck lies in understanding and cleaning biological big data.
The development of AI in the field of medicine can be divided into four stages, as shown in the figure above. By now, fairly complete data-driven methods are in place. Throughout, our hope has been to connect all the data across the whole process and arrive at the most efficient method.
So, at the disciplinary level, which areas are worth further breakthroughs?
We use AI not only in the hope that it can work faster, but also that it can work better and take on challenges that humans cannot overcome. AI can surpass people in two respects:
First, it does not need to rest, and thousands of AI agents can work on the same job at the same time, which is a breakthrough in capacity.
Second, AI's cognition of the world is multidimensional. People can only understand the world in three spatial dimensions plus time, while AI can work in thousands of dimensions, or in one-dimensional and zero-dimensional representations that humans cannot grasp, and thereby arrive at better answers.
There is an interesting phenomenon in the pharmaceutical field: two-dimensional cognition can be completely opposite to one-dimensional cognition. As shown in the figure above, PK (pharmacokinetics) is an important factor affecting physiological indicators, and in different cases the contrast can be huge. From a human perspective, two molecules may look very similar, but AI can recognize much larger differences in dimensions beyond the two-dimensional view.
In addition, an expert can only optimize one problem in one dimension at a time, so a single project goes through endless iterations. With AI's most typical approach, multi-objective optimization, multiple dimensions can be optimized at once. In past practice we have verified that scoring and experimenting on 30 dimensions simultaneously with AI yields a much higher hit rate than human reasoning and experimentation. We therefore believe AI can do better than experts in this area.
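A minimal sketch of what simultaneous multi-objective scoring looks like in code. The property names, weights, and candidates are hypothetical placeholders, not the models described here; a production pipeline would trade off dozens of objectives with Pareto ranking or learned weights rather than a fixed weighted sum.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    # Hypothetical predicted properties, each from its own AI model.
    potency: float        # higher is better (e.g., -log IC50)
    selectivity: float    # higher is better
    solubility: float     # higher is better
    herg_risk: float      # lower is better (safety liability)

def score(c: Candidate) -> float:
    """Weighted aggregate across objectives; weights are illustrative only."""
    return (0.4 * c.potency + 0.3 * c.selectivity
            + 0.2 * c.solubility - 0.1 * c.herg_risk)

candidates = [
    Candidate("mol-001", potency=8.2, selectivity=2.5, solubility=0.7, herg_risk=0.9),
    Candidate("mol-002", potency=7.6, selectivity=3.8, solubility=1.4, herg_risk=0.2),
]
best = max(candidates, key=score)
print(best.name, round(score(best), 2))
```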
New projects generally start with phenotypic screening, predicting potential hypotheses directly from phenotypes, which is a black-box problem, exactly what AI is good at. Historically, most first-in-class (original) new drugs came from phenotypic screening, while most follower drugs came from target-based screening.
There have been many attempts at AI phenotypic screening. For example, at GHDDI we built AI models one by one for 3,000 cell-based assays, and then ran two large-scale validations, retrospective and prospective. In the end we found that only about 5% of the data from the past 30 years could come close to real cell-based results. But that is already a good outcome, at least proving that the usable number keeps growing.
Synthesis has always been the bottleneck for small-molecule drugs. Articles in Science suggest that AI passing a Turing-test-style evaluation means total synthesis routes for natural products are predictable. But the bottleneck of synthesis is not route prediction; it is reaction condition prediction.
AlphaFold has attracted much attention and is considered an epoch-making feat. But we first need to answer three questions:
First, does the pharmaceutical industry actually need the structure? Routine biological discovery can screen directly on cells or on purified protein. All that is known then is sequence and binding affinity; process simulation is not required. The advantage of process simulation, however, is that key sites can be modified.
Second, compared with traditional homology modeling, AlphaFold's predictions are better when a known template exists. The deep learning part of AlphaFold relies on multiple sequence alignment, which uses information from the protein families of all other species to predict the proteins of higher organisms, and this causes problems in many core regions. Traditional homology modeling, by contrast, generally predicts directly from a known model of a closely related species or of the same protein family within the same species. Therefore, in real pharmaceutical work, traditional homology modeling carries higher confidence; for proteins without templates, other methods are needed.
Finally, our approach is to predict biological activity directly from the primary structure, skipping structural biology entirely and thereby avoiding the errors introduced in that step.
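A minimal, hedged sketch of a structure-free pipeline of this kind: featurize the primary sequence and fit a regressor directly to measured activity. The featurization, data, and model below are deliberately simple stand-ins (scikit-learn is assumed to be available), not the actual system described here.

```python
# Toy sequence -> activity regressor: featurize a protein sequence by amino-acid
# composition and fit a plain linear model. Real systems use learned embeddings
# (e.g., protein language models) plus ligand features; this just shows the shape
# of a structure-free pipeline.
from collections import Counter
from sklearn.linear_model import Ridge

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq: str) -> list[float]:
    counts = Counter(seq)
    n = max(len(seq), 1)
    return [counts.get(aa, 0) / n for aa in AMINO_ACIDS]

# Hypothetical training data: (sequence, measured activity)
train = [("MKTAYIAKQR", 6.1), ("GAVLIMFWPS", 4.3), ("KRHDEQNSTY", 7.2)]

X = [composition(s) for s, _ in train]
y = [a for _, a in train]
model = Ridge(alpha=1.0).fit(X, y)

print(model.predict([composition("MKRHAYDEQA")]))
```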
In 2013, I spent two months on 1,024 CPUs to obtain about one microsecond of simulation of a membrane protein with a phospholipid bilayer and a small molecule. At the time it was the largest computable membrane protein system in the world, involving millions of atoms. With comparable hardware on today's supercomputers, that time can be cut to 2-3 days, but that is only about a 30-fold speedup, which means it is still very difficult to compute the dynamic process truly and systematically.
Therefore, we must make full use of data-driven AI models. Through the link below, you can view a continuously updated review that covers how to deal with the data limitation problem and how to build models.
There is a great deal of noise in biological big data, so extracting the signal from the noise and integrating clean data sets is especially important. The industry offers many solutions at the methodological, engineering, and algorithmic levels. For example, with multimodal methods, if data at one scale is scarce, it can be transferred from other scales; with multi-task methods, if data on one target is scarce, data from its family or from all similar binding pockets can be used for transfer learning to make up for the limitation.
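A minimal sketch of the transfer idea under simple assumptions: pretrain on abundant data from related targets, then continue training on the few points available for the target of interest. The data are random placeholders and the linear model is a stand-in for the multi-task deep networks used in practice.

```python
# Transfer-learning sketch: pretrain on plentiful "family" data, then continue
# training on a handful of target-specific points. Features here are random
# placeholders; a real system would use molecular/protein descriptors.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
n_features = 32

# Abundant related-target ("family") data.
X_family = rng.normal(size=(5000, n_features))
y_family = X_family @ rng.normal(size=n_features) + rng.normal(scale=0.1, size=5000)

# Scarce data for the actual target of interest.
X_target = rng.normal(size=(20, n_features))
y_target = X_target @ rng.normal(size=n_features)

model = SGDRegressor(random_state=0)
model.partial_fit(X_family, y_family)      # pretraining pass on family data
for _ in range(10):                        # light fine-tuning on the target
    model.partial_fit(X_target, y_target)

print(model.predict(X_target[:3]))
```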
The most useful AI model is one with strong generalization ability: it must be able to predict the unknown from the known. That is the most meaningful kind of AI. Fundamentally, therefore, transfer learning is the most effective method.
For target-specific prediction, experts only need to feed back a small number of results, from a few to a few dozen data points, for fine-tuning, and then generally only about five rounds of active learning are needed to reach the desired result, which is far better than previous blind screening.
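A schematic of that fine-tune-and-iterate loop. The "oracle" below is a stand-in function for what would be a wet-lab assay, and the model, pool, and batch size are arbitrary assumptions; it only illustrates the shape of roughly five rounds of active learning.

```python
# Active-learning sketch: score a candidate pool, send the top handful to the
# "oracle" (wet-lab assay in practice, a stand-in function here), fold the new
# labels back in, and repeat for ~5 rounds.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n_features = 16

def oracle(x: np.ndarray) -> np.ndarray:
    """Placeholder for the real assay: returns a synthetic 'measured' activity."""
    return x[:, 0] - 0.5 * x[:, 1] + rng.normal(scale=0.05, size=len(x))

pool = rng.normal(size=(2000, n_features))          # unlabeled candidate pool
X_lab = pool[:10]                                   # a few initial labeled points
y_lab = oracle(X_lab)

for round_idx in range(5):                          # ~5 rounds, as in the text
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_lab, y_lab)
    preds = model.predict(pool)
    picks = np.argsort(preds)[-10:]                 # greedily take the top 10
    X_new, y_new = pool[picks], oracle(pool[picks])
    X_lab = np.vstack([X_lab, X_new])
    y_lab = np.concatenate([y_lab, y_new])
    print(f"round {round_idx + 1}: best measured activity so far = {y_lab.max():.2f}")
```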
In addition, there are generally three ways to generate data:
First, existing data: we have collected all the commercial databases and 100+ open-source databases in the world, and ultimately discarded 95% of the data, which is also a re-examination of history.
Second, our own targeted experiments to supplement the data: we need to clarify the chemical and biological distribution of the data and drive the model to its best performance with the fewest data points.
Third, simulated data: QM calculations, for example, are the most accurate. We first sample with the underlying physics, and then reuse that data in place of the computing power already spent, so nothing has to be recomputed, as sketched below.
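A minimal sketch of the compute-once, reuse-forever idea for simulated data, with a placeholder for the expensive QM job and a hypothetical on-disk cache; real pipelines store such results in proper databases and feed them to ML models as training labels.

```python
import json, os

CACHE_PATH = "qm_labels.json"   # hypothetical on-disk store of finished QM results

def expensive_qm_calculation(smiles: str) -> float:
    """Placeholder for a real QM/DFT job that could take GPU-hours."""
    return float(len(smiles))   # dummy value standing in for a computed property

def get_qm_label(smiles: str) -> float:
    """Return a simulated label, paying the compute cost at most once per molecule."""
    cache = {}
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH) as f:
            cache = json.load(f)
    if smiles not in cache:
        cache[smiles] = expensive_qm_calculation(smiles)
        with open(CACHE_PATH, "w") as f:
            json.dump(cache, f)
    return cache[smiles]

labels = {s: get_qm_label(s) for s in ["CCO", "c1ccccc1", "CC(=O)O"]}
print(labels)   # reusable as ML training labels without re-running any QM job
```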
The figure shows a high-level view of our integrated Yuanyi Intelligence solution; see the official website for details. Starting from the target sequence, we can run scoring with dozens of AI models in parallel within a few hours and nominate 10-20 new molecules per round. It generally takes only 2-3 rounds, within roughly 100 molecules, to reach the target compound.
In addition, in terms of computing power, we have built very flexible solutions for training, inference calls, and GPU/CPU allocation, and this has become a mature automation platform.
In June this year, Yuanyi Intelligence released its multi-objective AI model for the first time at the BIO International Convention, providing automated design capabilities for biologics, chemical drugs, and nucleic acid drugs, and it works closely with many CROs, CDMOs, and pharmaceutical companies around the world. The company has received about US $3 million in orders per year since its establishment.