Cloud supercomputing solutions comprehensively help the life science industry gain benefits and efficiency
01 Overview of the Life Science Industry
Life science studies the phenomena of life and seeks to reveal the laws of life activities and the essence of life. The industry of enterprises that serve scientific research is generally called the science service industry, and the industry of enterprises that serve life science research is called the life science service industry. Life science and technology is an advanced field with molecular genetics at its core. The first question life science must answer is "what is life".
The main fields of life science include medicine, biology, banking, genetics, and related areas. In the market, the most closely related customer groups are hospitals, research and development organizations, and scientific research institutions.
The industry chain is divided into three levels: upstream, midstream, and downstream. Upstream is mainly equipment production and software research and development; among the competitors, Murphy and Huadu are relatively well-known upstream manufacturers. The midstream is dominated by service providers. Downstream are the medical institutions, scientific research sites, pharmaceutical companies, and other users of those services.
The upstream therefore holds the lifeline of the whole industry. The midstream consists of life science service providers that deliver services to end users and charge service fees. The downstream consists of the service users, who determine the market capacity, development prospects, and business model characteristics of each midstream service segment.
Take gene detection as an example. Second-generation gene sequencing is currently the most popular technology for gene detection. It is mainly used to determine and analyze the whole gene sequence from a blood or saliva sample in order to predict the likelihood of developing various diseases.
Gene sequencing products and technologies have moved from laboratory research into clinical use, and gene sequencing is often described as the next technology to change the world. Compared with PCR and FISH, it offers high throughput and produces large volumes of data. Its disadvantages are complex operation, high requirements for the concentration and purity of sample DNA, and huge, complex data sets.
A typical genomics workload is whole genome sequencing. The Human Genome Project (HGP) was completed in 2003 after 13 years of work and brought revolutionary changes to the entire field of gene sequencing. Many large-scale, government-funded sequencing projects followed, such as the 1000 Genomes Project and the 10K Project, greatly advancing research into genetic variation, human evolution, and genetic diseases.
On the computing side, the GATK-based whole genome sequencing pipeline plays a vital role in modern gene sequencing.
A typical genome sequencing workload involves a large number of applications that are used in different ways, and many of them are serial programs. The typical whole genome sequencing pipeline has two main characteristics.
First, execution takes a long time: with a conventional pipeline on general-purpose computing resources, a single human genome sample needs nearly 1000 cores to process. Second, the data volume is large: a single sample produces 1 TB of intermediate data on average.
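To make the data-volume figure concrete, the following minimal sketch estimates how much scratch storage a batch of concurrently processed samples might need. Only the 1 TB per sample average comes from the text; the function name, the `headroom` safety margin, and the use of decimal terabytes are assumptions made for illustration.

```python
# Average intermediate data per sample, as quoted in the text.
INTERMEDIATE_TB_PER_SAMPLE = 1.0


def intermediate_storage_tb(samples_in_flight: int, headroom: float = 0.2) -> float:
    """Estimate scratch capacity (TB) for samples processed at the same time.

    `headroom` is a hypothetical safety margin (0.2 means 20% extra),
    not a figure from the article.
    """
    if samples_in_flight < 0 or headroom < 0:
        raise ValueError("samples_in_flight and headroom must be non-negative")
    return samples_in_flight * INTERMEDIATE_TB_PER_SAMPLE * (1.0 + headroom)


if __name__ == "__main__":
    for n in (1, 10, 50):
        print(f"{n} samples in flight -> about {intermediate_storage_tb(n):.1f} TB of scratch space")
```

Because intermediate data can be deleted once a sample finishes, the number of samples in flight, not the total number of samples, drives this estimate.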
As a result, the main topics of analysis on the computing side have become using a cluster scheduler to improve concurrency, using heterogeneous computing to accelerate execution, deploying different business images with containers, and tiering and backing up hot and cold data.
02 Life Science Industry Analysis
Traditional supercomputing solutions mainly rely on offline supercomputing clusters or self-built computer rooms. They currently face three main problems.
1. Aging resources are difficult to maintain. Once user resources age past their warranty, a great deal of manpower and money must be invested in reusing and maintaining them.
2. Business load has peaks and valleys. Because resources are limited, jobs queue for a long time during peak periods, while resource utilization is low during valley periods.
3. Existing clusters cannot meet the needs of new businesses and technological innovation. They lack scalability, and procurement cycles are relatively long.
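The peak and valley effect can be illustrated with a small simulation of a fixed-capacity cluster. This is a sketch under stated assumptions: the hourly demand figures and the capacity of 100 core-hours per hour are invented for illustration, and unserved work is simply carried over to the next hour.

```python
def simulate_fixed_cluster(hourly_demand, capacity):
    """Return (backlog per hour, utilization per hour) for a fixed-size cluster.

    hourly_demand: core-hours of new work arriving each hour (hypothetical data).
    capacity: core-hours the cluster can deliver per hour.
    """
    backlog = 0
    backlogs, utilization = [], []
    for demand in hourly_demand:
        pending = backlog + demand          # queued work plus new arrivals
        served = min(pending, capacity)     # the cluster cannot exceed its capacity
        backlog = pending - served          # the remainder keeps waiting in the queue
        backlogs.append(backlog)
        utilization.append(served / capacity)
    return backlogs, utilization


if __name__ == "__main__":
    demand = [20, 20, 150, 180, 40, 10]     # valley, peak, valley
    backlogs, utilization = simulate_fixed_cluster(demand, capacity=100)
    for hour, (b, u) in enumerate(zip(backlogs, utilization)):
        print(f"hour {hour}: backlog={b} core-hours, utilization={u:.0%}")
```

In this toy run the cluster sits at 20% utilization in the valley hours, then accumulates a backlog during the peak that takes two further hours to clear.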
With the continuous evolution of genomics and of computing technology, traditional genome sequencing infrastructure can no longer meet the development demands of existing businesses.
The traditional high-performance computing business process is divided into three stages: preprocessing before the job; submission, scheduling, and execution during the job; and visual analysis after the job.
The job is submitted to the scheduler, which schedules and dispatches work to the offline machines. Based on the job's run configuration and the current state of resources, it assigns suitable resources to the computing job.
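The scheduling step described above can be sketched as a first-fit placement: each job requests a number of cores, and the scheduler places it on the first machine with enough free cores or leaves it queued. This is an illustrative toy, not the algorithm of any particular scheduler; the `Node` and `Job` classes and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cores: int


@dataclass
class Job:
    name: str
    cores: int  # cores requested in the job's run configuration


def schedule(jobs, nodes):
    """Place jobs first-fit, in submission order.

    Returns (placements, queued): a dict of job name -> node name, and the
    names of jobs that did not fit. Node.free_cores is updated in place.
    """
    placements, queued = {}, []
    for job in jobs:
        for node in nodes:
            if node.free_cores >= job.cores:
                node.free_cores -= job.cores
                placements[job.name] = node.name
                break
        else:
            # No node had enough free cores, so the job waits in the queue.
            queued.append(job.name)
    return placements, queued


if __name__ == "__main__":
    nodes = [Node("n1", 16), Node("n2", 32)]
    jobs = [Job("align", 16), Job("call", 24), Job("annotate", 16)]
    print(schedule(jobs, nodes))
```

With fixed offline machines, the queued job can only wait until running jobs release cores, which is where the long peak-period queue times come from.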
Traditional industry solutions share the following problems: poor scalability, performance bottlenecks, difficult management and maintenance, and challenges in adopting new technology. The most visible are the performance bottlenecks: insufficient peak computing power and long job queue times, which seriously affect the business.
In terms of management and maintenance, investment costs are relatively high, and unified software control, security assurance, and an integrated solution for construction, operation, and maintenance are lacking.
03 Cloud Supercomputing Solution
Alibaba Cloud's high-performance computing product, E-HPC, is a software service that combines the working habits of high-performance computing with the advantages of cloud computing. Its capabilities include large-scale cluster deployment and inference, flexible use of resources, end-to-end workflow assurance, job scheduling and operation management, multi-tenant security isolation, and performance analysis and tuning.
As infrastructure, HPC meets the requirements of high-performance computing business scenarios and of reliability. Computing, storage, networking, and graphics visualization satisfy users' demands for extreme performance, including low-latency network communication and large-scale parallel file systems.
In terms of linear expansion, Alibaba Cloud's high-performance computing products integrate more than 30 applications from the life science industry for lightweight, convenient use. For the credit reporting industry, they are compatible with many mainstream credit reporting software packages on the market, and they provide a unified portal for life science.
On the PaaS layer, Alibaba Cloud provides cluster computing power, elastic scaling, multi-level caching, business management, and resource lifecycle management services. The bottom layer consists of the computing resources of the Alibaba Cloud platform and the DPCA virtualization technology. Users can choose from a variety of computing instance specifications.
The public cloud solution for high-performance computing moves the workload fully to the cloud and builds E-HPC there, providing resource scheduling, job management, elastic scaling, and other capabilities.
There are two kinds of hybrid cloud solutions for high-performance computing. In the first, the scheduling node stays offline, and when offline resources are insufficient, new nodes are added online. This scenario is built mainly on local infrastructure, with the cloud used to absorb sudden business demand: capacity can be added quickly and released at any time.
In the second, the scheduling node is in the E-HPC cluster and also manages the existing offline computing nodes. This scenario is built mainly on the cloud, with the local computer room as a supplement, and it helps old offline infrastructure transition gradually to the cloud.
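The bursting logic behind a hybrid setup can be sketched as a simple shortfall calculation: use local cores first, and add cloud nodes only for the demand that local capacity cannot cover. This is a hypothetical illustration, not E-HPC's actual scaling policy; the function name, the node size, and the `max_cloud_nodes` cap are all assumptions.

```python
import math


def cloud_nodes_to_add(pending_cores, free_local_cores, cores_per_cloud_node, max_cloud_nodes):
    """Return how many cloud nodes to add so pending work can start.

    pending_cores: cores requested by jobs waiting in the queue.
    free_local_cores: idle cores in the local (offline) cluster.
    cores_per_cloud_node, max_cloud_nodes: hypothetical sizing and budget limits.
    """
    if cores_per_cloud_node <= 0:
        raise ValueError("cores_per_cloud_node must be positive")
    shortfall = pending_cores - free_local_cores
    if shortfall <= 0:
        return 0                                   # local capacity is enough
    needed = math.ceil(shortfall / cores_per_cloud_node)
    return min(needed, max_cloud_nodes)            # never exceed the configured cap


if __name__ == "__main__":
    print(cloud_nodes_to_add(200, 64, 32, 10))     # burst to cover the shortfall
    print(cloud_nodes_to_add(50, 64, 32, 10))      # no burst needed
```

Releasing the cloud nodes once the queue drains is the other half of the pattern, and it is what lets sudden demand be met without permanently enlarging the local cluster.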
The life science big computing solution provides instances with CPU-to-memory ratios of 1:2, 1:4, and 1:8, as well as high-frequency instances. The layer above them is E-HPC resource scheduling and control.
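One way to read these ratios is as vCPUs to GiB of memory, so a 1:4 instance offers 4 GiB per vCPU. Under that assumption, the sketch below picks the smallest ratio that satisfies a job's memory need per core. The function name and the example job sizes are hypothetical.

```python
# GiB of memory per vCPU for the 1:2, 1:4, and 1:8 ratios (interpretation assumed).
RATIOS_GIB_PER_VCPU = (2, 4, 8)


def pick_ratio(vcpus: int, memory_gib: float):
    """Return the leanest CPU-to-memory ratio that fits the job, or None.

    None means the job needs more memory per core than a 1:8 instance offers.
    """
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    needed_per_vcpu = memory_gib / vcpus
    for gib_per_vcpu in RATIOS_GIB_PER_VCPU:
        if gib_per_vcpu >= needed_per_vcpu:
            return f"1:{gib_per_vcpu}"
    return None


if __name__ == "__main__":
    print(pick_ratio(16, 32))    # compute-bound step
    print(pick_ratio(16, 100))   # memory-hungry step
```

Choosing the leanest ratio that fits avoids paying for memory a pipeline step will not use, while memory-heavy steps can be routed to the larger ratios.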
In the large-memory instance performance optimization solution, E-HPC builds on Alibaba Cloud infrastructure to provide a one-stop public cloud HPC service: a fast, flexible, and secure technical computing cloud platform that interoperates with other Alibaba Cloud products. HPC elastic scaling manages MemVerge nodes automatically, adding ECS instances with MemVerge software during business peaks and releasing them during valleys to save costs.
HPC job scheduling runs on large-memory instances with MemVerge software, which improves performance in gene sequencing and EDA chip design scenarios.
E-HPC, MemVerge software, and ECS i4p instances are installed and deployed with one click. MemVerge software is deployed automatically on ECS, which removes the cumbersome, inefficient step of installing MemVerge manually for every i4p deployment.
The pharmaceutical AI solution has five links: data collection, cleaning and labeling, model training, model deployment, and inference. Alibaba Cloud's ECC1G-10G dedicated network line solves the problem of moving collected data to the cloud. OSS object storage supports massive data storage as well as data distribution and archiving. NAS/CPFS parallel file storage provides high throughput and low latency, with up to 100 GB/s of throughput and millions of IOPS, and it supports multiple I/O models and mixed workloads of large and small files.
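For the data collection link, a rough transfer-time estimate shows why line bandwidth matters. The sketch assumes that "1G-10G" refers to link rates of 1 to 10 Gbit/s, which the text does not state explicitly; the `efficiency` factor for protocol overhead and the use of decimal terabytes are also assumptions.

```python
def transfer_seconds(data_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Estimate seconds needed to move `data_tb` terabytes over a dedicated line.

    link_gbps: link rate in Gbit/s (assumed meaning of the 1G-10G range).
    efficiency: hypothetical fraction of the link rate achieved in practice (0 to 1].
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be positive and efficiency in (0, 1]")
    gigabits = data_tb * 1000 * 8            # decimal TB -> GB -> Gbit
    return gigabits / (link_gbps * efficiency)


if __name__ == "__main__":
    for gbps in (1, 10):
        hours = transfer_seconds(1, gbps) / 3600
        print(f"1 TB over {gbps} Gbit/s: about {hours:.2f} hours")
```

At an ideal 1 Gbit/s, one terabyte takes a little over two hours to upload, so batches of many samples benefit noticeably from the higher end of the range.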
04 Key Features and Solution Advantages
One advantage of E-HPC is that HPC clusters can be created quickly on the cloud. Offline, building a cluster requires network planning, software initialization, and account setup. On the cloud, building an HPC cluster takes only half an hour.
HPC application performance is analyzed and optimized layer by layer. Alibaba Cloud provides optimization analysis at every level, covering system, process, function, and instruction levels, the microservice architecture, and the HPC applications themselves.
E-HPC automatic scaling works across data centers. The computing resources of one cluster can sit in different data centers to meet the requirements of large-scale parallel jobs, and the type of computing resource can be configured flexibly for each HPC scheduler queue.
Visualization covers the whole data flow. Before a job runs, you log in to the control node through a web page or SSH. While the job runs, you can monitor and manage resources through performance analysis and process analysis. After the job ends, you can process and analyze the data visually through an Alibaba Cloud cloud desktop.
In terms of computing power, E-HPC offers rich resources, and automatic scaling across data centers meets the requirements of large-scale parallel jobs. It supports heterogeneous computing power in multiple specifications, as well as large-memory, high-frequency, and other CPU instance specifications.
In terms of cost, E-HPC can create and delete computing nodes dynamically and bill elastically according to the actual load. Flexibly configured scaling strategies, support for preemptible instances, and support for cross-zone scaling reduce customers' costs.
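The cost effect of elastic billing can be shown with simple arithmetic: a fixed cluster must be sized for the peak hour and paid for in every hour, while an elastic cluster pays only for the nodes each hour actually needs. The hourly node counts and the price of 1.0 per node-hour below are invented for illustration and are not Alibaba Cloud prices.

```python
def fixed_cost(hourly_nodes_needed, price_per_node_hour):
    """Cost of a cluster sized for the peak hour and kept running all the time."""
    if not hourly_nodes_needed:
        return 0.0
    return max(hourly_nodes_needed) * len(hourly_nodes_needed) * price_per_node_hour


def elastic_cost(hourly_nodes_needed, price_per_node_hour):
    """Cost when nodes are created and deleted to match each hour's load."""
    return sum(hourly_nodes_needed) * price_per_node_hour


if __name__ == "__main__":
    demand = [2, 2, 10, 12, 4, 1]    # hypothetical nodes needed per hour
    price = 1.0                      # hypothetical price per node-hour
    print("fixed:  ", fixed_cost(demand, price))
    print("elastic:", elastic_cost(demand, price))
```

The gap between the two numbers grows with the difference between peak and average load, which is why workloads with pronounced peaks and valleys gain the most from elastic scaling.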
In terms of operation and maintenance, E-HPC is fully compatible with HPC services and builds multi-zone clusters automatically. It provides job performance analysis and locates hot spots by cluster, instance, process, and other dimensions. In terms of new technology, E-HPC provides ecosystem SaaS and PaaS enablement, along with new products such as GPU, FPGA, and Yitian.
Rich computing power, optimal cost, minimal operation and maintenance, and new technology enablement: with these, E-HPC provides all-round assistance to the life science industry and truly delivers inclusive benefits and efficiency.