Why are life science companies migrating to the cloud one after another?
The life science industry is entering a golden period of growth. Advances in medicine and people's pursuit of health are becoming a new driving force for the entire life science industry chain. High-performance computing (HPC) plays a very important role in life science research, and as the industry grows rapidly, its move to the cloud has become an irresistible trend.
An industry's urgent demand for cloud computing usually goes hand in hand with its rapid growth, because the cloud is flexible and convenient. Traditional IT, with its long procurement, delivery and deployment process, cannot keep up with the soaring IT demand of a fast-growing industry.
This article starts with the current state of the life science industry and its huge demand for computing power, describes the needs and pain points the industry faces at the infrastructure level, and explains why cloud high-performance computing can greatly help life science enterprises grow quickly.
Life science's demand for computing power: large scale, high performance and rich variety
At present, the two most important scenarios in the life science industry are computer-aided drug design and gene sequencing.
1. Computer-aided drug research and development
Since the start of the 21st century, as diseases have become more complex, the number of drug targets has gradually decreased, and the difficulty and cost of developing new drugs have risen significantly. At the same time, the global success rate of new drug R&D has dropped markedly. Innovative drug R&D is key to a pharmaceutical company's core competitiveness and sustainable development, yet it is a systematic undertaking with high investment, high technical barriers, high risk and long cycles. Pharmaceutical companies have therefore begun to turn to AI, big data and other computing technologies to assist drug R&D.
The whole process of drug research and development
A new drug usually has to pass through drug discovery, preclinical research, clinical trials and regulatory review before it is approved for marketing. In drug discovery stages such as target discovery and compound synthesis, and in preclinical stages such as compound screening, researchers often rely on high-performance computing to accelerate R&D and assist drug design.
For protein structure prediction during target discovery, there are solutions based on molecular dynamics and plane-wave methods, as well as solutions based on AI for Science.
The former is a typical HPC scenario, served by mature software such as VASP and GROMACS, in which results are obtained by numerical simulation. In this approach, the computing resources required grow in proportion to the scale of the simulated problem.
Meanwhile, solutions such as AlphaFold2 have emerged. AlphaFold2 uses AI to model the relationship between protein sequence and structure, learning from known sequences and structures to predict the structure of a protein. Backed by powerful algorithms and computing power, DeepMind has cut the computing time from months to hours. As the number of model parameters grows, so does the demand for computing power.
AI prediction of protein three-dimensional structure
Similarly, in virtual compound screening, pharmaceutical companies usually need to dock millions of molecules against protein structures. Each ligand molecule consumes computing resources to produce a docking score, and the scores are used to select molecules for experimental verification of activity. Docking a vast ligand library against protein structures therefore requires huge computing power. A single machine cannot handle virtual screening at this scale, so an HPC cluster is essential for large-scale virtual screening tasks.
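As a rough illustration of why a single machine struggles here, the sketch below estimates idealized wall-clock time for a docking campaign and splits the ligand library into batches that a scheduler could run as separate jobs. The 30 seconds per ligand, the core counts and the function names are illustrative assumptions, not figures from this article, and the estimate ignores scheduling and I/O overhead.

```python
import math


def docking_wall_hours(n_ligands, secs_per_ligand, n_cores):
    """Idealized wall-clock hours, assuming every docking run is independent
    and each core docks one ligand at a time."""
    rounds = math.ceil(n_ligands / n_cores)
    return rounds * secs_per_ligand / 3600


def split_into_batches(n_ligands, batch_size):
    """Return half-open (start, end) index ranges, one per scheduler job."""
    return [
        (start, min(start + batch_size, n_ligands))
        for start in range(0, n_ligands, batch_size)
    ]


if __name__ == "__main__":
    ligands = 1_000_000      # assumed library size
    secs = 30                # assumed docking time per ligand
    print(round(docking_wall_hours(ligands, secs, 16), 1))    # 520.8 hours on one 16-core machine
    print(round(docking_wall_hours(ligands, secs, 2000), 1))  # 4.2 hours on 2,000 cluster cores
    print(len(split_into_batches(ligands, 5000)))             # 200 batch jobs
```

Under these assumptions the same campaign drops from about three weeks on one workstation to an afternoon on a cluster, which is the gap the paragraph above describes.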
Lead compound discovery process
Across target discovery, compound screening and compound synthesis, different computing models, parameters and software often have different requirements for computing resources. The introduction of AI in particular raises the bar for provisioning a diverse mix of resource types.
2. Gene sequencing
The gene sequencing workflow mainly consists of sample processing (on the sequencer), generation of sequencing files, sequence alignment and result analysis (on computers), and delivery of the resulting data and reports to research and medical institutions. Of these, sequence alignment and analysis are extremely time-consuming and involve a large amount of specialized bioinformatics software. The performance of the computing resources and the optimization of the overall solution are therefore crucial to the efficiency of bioinformatics R&D.
Gene sequencing business process
Take the typical whole genome sequencing (WGS) workflow as an example. It involves reference index construction, read alignment, sorting, duplicate removal, base quality score recalibration (BQSR), variant calling and other steps, so the methods are diverse and the process is complex. Different steps use different software and parameters, such as BWA and GATK, and different bioinformatics tools differ in concurrency and performance. As a result, different tasks place different demands on the type and scale of computing resources: the workflow needs not only elastic computing resources but also diverse instance configurations.
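The need for diverse instance configurations can be sketched as a simple matching problem: each workflow step declares what it needs, and the smallest suitable machine shape is chosen for it. The shape names and all CPU and memory figures below are hypothetical placeholders chosen for illustration; they are not benchmarks of BWA or GATK and not real instance types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    name: str
    vcpus: int
    mem_gib: int


# Hypothetical machine shapes, not real instance types.
CATALOG = [
    Shape("general-16c64g", 16, 64),
    Shape("compute-64c128g", 64, 128),
    Shape("memory-32c256g", 32, 256),
]

# Hypothetical (vCPUs, memory in GiB) needs per WGS step.
STEPS = {
    "read_alignment": (64, 64),
    "sorting": (16, 200),
    "duplicate_removal": (8, 48),
    "bqsr": (8, 32),
    "variant_calling": (32, 96),
}


def pick_shape(vcpus, mem_gib, catalog=CATALOG):
    """Pick the fitting shape with the fewest vCPUs, then the least memory."""
    fits = [s for s in catalog if s.vcpus >= vcpus and s.mem_gib >= mem_gib]
    if not fits:
        raise ValueError(f"no shape offers {vcpus} vCPUs and {mem_gib} GiB")
    return min(fits, key=lambda s: (s.vcpus, s.mem_gib))


def plan(steps=STEPS):
    """Map each step name to the name of the shape chosen for it."""
    return {step: pick_shape(*need).name for step, need in steps.items()}
```

With these placeholder numbers, `plan()` sends alignment to the compute-heavy shape, sorting to the memory-heavy shape and the lighter steps to the general-purpose shape, whereas a fixed cluster would have to run every step on one configuration.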
Second-generation sequencing: WGS workflow
Pain points and challenges faced by life science at the infrastructure level
Historically, most life science enterprises have built their own on-premises IDC machine rooms. In general, this IT infrastructure faces three major problems: fixed resource scale, long construction cycles, and high hardware operation and maintenance costs, as detailed below:
1. Fixed resources, unable to meet the needs of business growth and resource diversity
1.1 The scale of computing power is fixed, affecting the business growth rate
When a traditional IDC is built, its resource scale is usually fixed at planning time, so the task throughput of the whole cluster is fixed as well. New drug R&D and sequencing workloads are cyclical, and different R&D phases and tasks need different amounts of resources. As a result, tasks queue for resources during peak periods while resources sit idle during quiet periods. Handling this pattern requires elastic computing resources.
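A minimal sketch of this peak-and-trough effect follows, assuming demand and capacity are both measured in node-hours per week. The demand series and the capacity value are made-up numbers for illustration only.

```python
def simulate_fixed_cluster(weekly_demand, capacity):
    """Track queued work and idle capacity for a fixed-size cluster.

    weekly_demand: node-hours of new work arriving each week.
    capacity: node-hours the cluster can deliver each week.
    Returns (backlog_per_week, idle_per_week).
    """
    backlog = 0
    backlogs, idles = [], []
    for demand in weekly_demand:
        work = backlog + demand      # unfinished work carries over
        done = min(work, capacity)   # the cluster cannot exceed its size
        backlog = work - done
        backlogs.append(backlog)
        idles.append(capacity - done)
    return backlogs, idles


if __name__ == "__main__":
    backlogs, idles = simulate_fixed_cluster([40, 120, 200, 60], capacity=100)
    print(backlogs)  # [0, 20, 120, 80]
    print(idles)     # [60, 0, 0, 0]
```

In this toy run the cluster is 60% idle in the quiet first week, yet still carries a backlog three weeks later. An elastic cluster would instead size itself to each week's demand.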
1.2 The resource allocation is fixed and cannot meet the demand for resource diversity
The computing resources in an on-premises IDC are planned early and come in a limited set of configurations. Traditional sequencing pipelines therefore often run every step on the same type of resource and cannot adjust the configuration per step, which wastes a large amount of computing capacity. Yet, as noted earlier, different steps call for different resource configurations.
1.3 Fixed storage capacity, unable to meet the growing storage needs of users
Storage volumes keep growing, and operating on-premises storage equipment and purchasing more of it puts enterprises under heavy pressure. Obtaining an efficient, secure, stable, cost-effective and sustainable storage solution is therefore another major challenge for life science enterprises.
Take protein structure research as an example. There are generally three methods for determining protein structure: X-ray crystallography, nuclear magnetic resonance (NMR) and cryo-electron microscopy (cryo-EM). With cryo-EM, the data for a single sample is typically around 10 TB, and an enterprise's local data volume reaches the petabyte (PB) level. Bioinformatics research data likewise includes large amounts of reference library data, sample data and intermediate files; the full-workflow data for sequencing a single human genome can reach 1 TB. Because of the periodicity and special nature of bioinformatics data, the local storage of a typical bioinformatics enterprise also reaches the PB level.
2. Long construction period, affecting business growth
2.1 The delivery cycle is long and cannot meet users' time-sensitive needs
Building a traditional IDC generally involves project approval, bidding, procurement and delivery, which often takes several months and sometimes up to a year. During project approval, the enterprise has to estimate the scale of future business and settle the resource construction plan. For a fast-growing business, such a long construction period becomes a bottleneck.
2.2 Hardware selection iterates slowly and cannot keep up with users' ever-rising resource requirements
With a traditional IDC, it is often difficult for enterprises to obtain hardware based on the latest architectures quickly, even though such hardware can often accelerate the business considerably.
For WGS, developing heterogeneous acceleration schemes based on GPUs or FPGAs also involves a great deal of selection and verification work. When building an on-premises IDC, an enterprise has to track the release schedules of CPU, GPU, FPGA and other products, choose appropriate hardware specifications, and also anticipate how its business architecture will evolve. This is a huge challenge for life science enterprises building their own resources.
3. High operation and maintenance costs
Operating and maintaining an on-premises IDC machine room also requires a great deal of manpower. Beyond managing cluster computing resources, scheduling computing tasks and managing user permissions, the stability of the computing resources themselves matters: hardware failures in particular can seriously delay business progress. If a task is terminated by downtime in the middle of a calculation and no checkpoint exists, it has to be recomputed from scratch. On-premises storage also needs disaster recovery to avoid data loss caused by hardware failure. Managing computing resources, keeping them stable, protecting data and related work all require a dedicated operation and maintenance team, which adds hidden cost.
In short, the infrastructure provided by a traditional IDC suffers from limited and inelastic resources, long delivery cycles, slow hardware iteration and upgrades, and high operation and maintenance costs. More and more life science enterprises are therefore turning to more flexible, stable and cost-effective cloud high-performance computing solutions to accelerate business innovation.
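The cost of losing a run without a checkpoint can be made concrete with a small sketch of checkpoint and resume. This is a generic pattern under stated assumptions, not a feature of any particular scheduler or of the software named in this article; the file layout, function names and JSON state are hypothetical.

```python
import json
import os
import tempfile


def save_atomic(path, state):
    """Write the checkpoint to a temp file, then rename it over the old one,
    so a crash during the write never leaves a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def run_with_checkpoint(path, total_steps, step_fn, every=100):
    """Run step_fn(step, value) for each step, resuming from `path` if present.

    A checkpoint is written every `every` steps and once at the end.
    """
    state = {"step": 0, "value": 0}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)  # resume from the last saved step
    while state["step"] < total_steps:
        state["value"] = step_fn(state["step"], state["value"])
        state["step"] += 1
        if state["step"] % every == 0 or state["step"] == total_steps:
            save_atomic(path, state)
    return state["value"]
```

If a node fails at step 250 of 500 with `every=100`, a rerun with the same `path` resumes at step 200 and repeats only 50 steps instead of all 250. The right checkpoint interval depends on how expensive a step is compared with writing the state.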
Alibaba Cloud provides E-HPC solutions for life science
Alibaba Cloud believes that cloud high-performance computing is currently the best way to build and use HPC. To meet the needs of the life science industry, Alibaba Cloud draws on its worldwide computing power and its industry-leading DPCA architecture to provide high-performance computing public cloud solutions, hybrid cloud solutions, performance optimization solutions for large-memory instances, containerized solutions, pharmaceutical AI solutions and more. Together these cover the needs of different scenarios in the industry and offer the following advantages:
(1) Rich computing power, purchased on demand: Alibaba Cloud operates 27 public cloud regions and 84 availability zones across four continents. Auto scaling on the cloud supports scheduling across data centers, and the type of computing resource used for large-scale parallel jobs can be configured per scheduler queue, covering heterogeneous computing instances of many specifications as well as CPU instances with large memory, high clock frequency and other specifications;
(2) Elastic scaling that cuts cost and raises efficiency: the Alibaba Cloud Elastic High Performance Computing (E-HPC) platform can create and delete compute nodes dynamically, supports flexibly configured scaling policies, and bills according to actual load. Preemptible instances can be priced as low as 10% of the regular price, which lowers customers' costs while improving job quality and speed;
(3) Minimal operation and maintenance, so enterprises can focus on core business development: E-HPC is fully compatible with HPC workloads, builds clusters automatically, provides job performance analysis, locates hotspots across dimensions such as cluster, instance and process, supports visual job reports, and breaks down consumption by user, task, queue and other dimensions;
(4) New technology, with benefits available quickly: at the IaaS level, Alibaba Cloud keeps rolling out the latest computing power. At the SaaS and PaaS levels, hundreds of third-party partners integrate with Alibaba Cloud, so life science enterprises can quickly access the technical services they need. Alibaba Cloud's rich ecosystem and continuously evolving cloud technology help enterprises obtain end-to-end technical services and benefit from the latest technology.
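The elastic scaling idea in point (2) can be illustrated with a minimal queue-driven policy. This is a generic sketch, not E-HPC's actual scaling logic or API; the function name, parameters and limits are all hypothetical.

```python
import math


def desired_nodes(queued_cores, busy_nodes, cores_per_node,
                  min_nodes=0, max_nodes=100):
    """Return how many nodes the cluster should have right now.

    Keep every busy node, add enough nodes to cover the cores requested
    by queued jobs, and clamp the result to [min_nodes, max_nodes].
    Idle nodes are not counted, so they are released on the next cycle.
    """
    needed = busy_nodes + math.ceil(queued_cores / cores_per_node)
    return max(min_nodes, min(max_nodes, needed))


if __name__ == "__main__":
    # 3 busy nodes, jobs asking for 200 more cores, 64 cores per node
    print(desired_nodes(200, 3, 64, max_nodes=10))     # 7
    # quiet period: nothing queued, nothing running
    print(desired_nodes(0, 0, 64, max_nodes=10))       # 0
    # burst larger than the configured ceiling
    print(desired_nodes(10_000, 3, 64, max_nodes=10))  # 10
```

Because the node count follows the queue, cost tracks actual load rather than a size fixed at planning time. The `max_nodes` ceiling is what keeps a burst from producing an unbounded bill.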
Alibaba Cloud high-performance computing has been widely used in many industries, such as industrial simulation (CAD/CAE), chip design (EDA), biomedical materials, energy exploration and public services.
Shenshi Technology uses a cost optimization strategy based on elastic supply, combined with preemptible instance pricing, to obtain massive resources at 30% of the cost. E-HPC's automated operation and maintenance features also reduce Shenshi Technology's operation and maintenance costs and improve its cluster management efficiency.
The life science and medical enterprise "Shengting Medical" moved to the cloud to improve on the data reliability, operation and maintenance costs and efficiency of its traditional IDC cluster, and the efficiency of its gene alignment and analysis improved by 70%. Alibaba Cloud's high-performance computing team also combined Slurm workflow dependencies with auto scaling to cut wasted computing resources and effectively reduce usage costs.
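Slurm workflow dependencies of the kind mentioned above are typically expressed with `sbatch --dependency=afterok:<jobid>`, which holds a job until the listed jobs finish successfully. The sketch below only builds the command lines for a three-step chain as a dry run; the script names and the consecutive job IDs are hypothetical, and it does not reproduce the customer's actual workflow.

```python
def sbatch_command(script, depends_on=None):
    """Build an sbatch command; --parsable makes sbatch print just the job ID."""
    cmd = ["sbatch", "--parsable"]
    if depends_on:
        ids = ":".join(str(job_id) for job_id in depends_on)
        cmd.append(f"--dependency=afterok:{ids}")
    cmd.append(script)
    return cmd


def plan_chain(scripts, first_job_id=1000):
    """Dry run: pretend Slurm hands out consecutive job IDs and chain the
    scripts so each one starts only after the previous one succeeds."""
    plan, previous = [], None
    for offset, script in enumerate(scripts):
        job_id = first_job_id + offset
        deps = [previous] if previous is not None else None
        plan.append((job_id, sbatch_command(script, deps)))
        previous = job_id
    return plan


if __name__ == "__main__":
    for job_id, cmd in plan_chain(["align.sh", "sort.sh", "call_variants.sh"]):
        print(job_id, " ".join(cmd))
```

In a real submission, each job ID would come from the output of the previous `sbatch --parsable` call rather than from a counter. A job held by a dependency does not start until its predecessor succeeds, so nodes for a later step need to be added only once that step becomes runnable. Whether a given autoscaler counts dependency-held jobs as demand depends on how it is configured.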