High-performance computing on the cloud facilitates gene sequencing

01 Introduction to Sunrise Biotech

Sunrise Biotech is a biotechnology company focused on single-cell technology; its founding team began operations in 2018. The company aims to make single-cell technology widely accessible, support clinical diagnosis and drug research and development, and help move precision medicine into the 2.0 era through its self-developed high-throughput single-cell products and a full-chain service that spans experiments and bioinformatics analysis.

Founded in 2018 and based in the medical industrial park of Peking University, the start-up obtained round D financing in January 2022 and has set up local laboratories in Shanghai, Guangzhou and Chengdu. To meet the practical needs of clinicians and scientific researchers, the company has built a full-chain single-cell sequencing product and service solution that runs from sample preservation and dissociation through to bioinformatics analysis. Customers send their samples to the laboratories in Shanghai, Guangzhou or Chengdu.

After the samples are collected, the laboratory carries out sequence processing and signal amplification, amplifying the molecules in cycles and labeling each molecule and cell. The labels help identify which cell and which gene each detected molecule comes from. The prepared samples are then shipped to Beijing for sequencing. The sequencing results are uploaded to Sunrise Biotech's Alibaba Cloud OSS, or downloaded to on-premises storage over a dedicated line, and single-cell analysis is then performed.

02 Single-cell sequencing and pain points

Single-cell sequencing, a technology first introduced in 2009, has become one of the most active areas of basic research in life science. In 2013, single-cell RNA sequencing was rated the technology of the year by Nature Methods. In 2015, single-cell sequencing appeared on the cover of Science Translational Medicine.

As the name implies, single-cell sequencing measures gene expression and other information at the level of individual cells. In multicellular organisms, cells differ from one another.

Traditional sequencing research, by contrast, is limited to the level of organs and tissues. It measures the expression level of a whole cell population, so the final signal value is an average that loses the information about differences between cells. Single-cell sequencing can explain the differences between cells and their functions in their environment at higher resolution, so that individual cells are no longer hidden inside a population average.

This technology has been applied in basic scientific research, clinical diagnosis, new drug research and development, and other fields. As an efficient medical aid, gene sequencing has provided effective help in birth defect prevention, the detection of genetic diseases, tumor drugs, and related areas.

The rapid development of single-cell sequencing technology has also helped Sunrise Biotech's business get started. Since commercial sales began in March 2021, the company has established scientific research collaborations with 100 customers. In a typical single-cell data analysis workflow, the last step of single-cell sequencing is the data analysis phase, which begins with preprocessing of the sequencing data and includes quality control, normalization, data correction, feature selection, cluster analysis, trajectory analysis, differential expression analysis, gene dynamics, metastable analysis, component analysis, and so on.

The first pain point is data volume. The files from a single single-cell sequencing run can exceed 100 GB. As single-cell projects include more and more samples, the data volume often reaches hundreds of GB or even TB.
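To make the first preprocessing steps listed above concrete, here is a minimal sketch in plain Python of quality control followed by library-size normalization on a tiny count matrix. The function names, the thresholds (min_genes, target_sum) and the toy counts are illustrative assumptions, not Sunrise Biotech's actual pipeline; real analyses typically rely on dedicated packages and matrices that are many orders of magnitude larger.

```python
import math

def quality_control(counts, min_genes=2):
    """Keep cells (rows) that express at least `min_genes` genes."""
    return [cell for cell in counts
            if sum(1 for c in cell if c > 0) >= min_genes]

def normalize(counts, target_sum=10_000):
    """Scale each cell to `target_sum` total counts, then apply log1p."""
    normalized = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            # An empty cell has nothing to scale; keep it as zeros.
            normalized.append([0.0] * len(cell))
            continue
        normalized.append([math.log1p(c * target_sum / total) for c in cell])
    return normalized

# Toy matrix: rows are cells, columns are genes (hypothetical values).
raw_counts = [
    [5, 0, 3, 2],     # three expressed genes: passes QC
    [0, 0, 1, 0],     # one expressed gene: filtered out
    [10, 20, 0, 10],  # three expressed genes: passes QC
]

filtered = quality_control(raw_counts)
normalized = normalize(filtered)
```

Even this toy version shows why the real workload is memory-hungry: every step reads the whole matrix, and parameters such as min_genes are usually adjusted and rerun several times.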

Second, single-cell data analysis is complex and requires repeated data reading and parameter adjustment. As a result, an analysis task that processes a massive number of cell samples usually takes hours or even days to complete.

As the number of samples grows, associations and other more complex calculations have to be performed across samples. This consumes a great deal of computing power and raises the requirements placed on it.

The large data volume and the complexity of the analysis lead to low task concurrency and slow data loading. In addition, the bioinformatics industry lacks open-source software that covers the whole process, so a biocomputing project usually requires several software packages working together. As the cost of single-cell detection gradually falls and its applications widen, bioinformatics data will grow exponentially.

The usual workaround in bioinformatics analysis is to reduce the sample parameters, or to run only one relatively large single-cell analysis task at a time. When there are multiple sequencing tasks, however, the single-cell analysis projects can only be queued and executed one after another.

03 E-HPC-based large memory instance solution

To solve these problems, Alibaba Cloud has built a large-memory instance solution based on E-HPC for Sunrise Biotech. The solution consists of three parts. The first part is a large-memory cloud instance combined with memory virtualization software from a partner.

In 2017, Intel launched the Optane SSD. In 2020, Intel released the Optane Persistent Memory 100 series, achieving large-scale commercialization. In 2021, Intel released the third generation of Intel Xeon Scalable processors and the Intel Optane Persistent Memory 200 series. In the same year, Alibaba Cloud developed more powerful instance types in several specifications based on these products.

Among them, the I4P instance provides extremely high-performance local disks, with latency that can be as low as 170 nanoseconds. It is well suited to I/O-intensive applications and helps them break through performance bottlenecks.

Sunrise Biotech's single-cell sequencing analysis tasks are deployed on I4P persistent memory instances, which use third-generation Intel Xeon Scalable processors and second-generation Intel Optane persistent memory. Memory Machine, the first software to virtualize memory hardware, provides fine-grained resource provisioning for capacity, performance, availability and mobility.

On top of the transparent memory service, another industry-first technology is provided: the ZeroIO in-memory snapshot. This technology can capture terabytes of application state within a few seconds and enables data management at memory speed.

The second part is Alibaba Cloud's Compute Nest, a PaaS platform that the cloud vendor opens to enterprise application service providers and their customers for service management. Through it, Alibaba Cloud integrates the Memory Machine large-memory virtualization software with the cloud platform in a standardized way, which accelerates software delivery and deployment, standardizes operation and maintenance management, and greatly improves business efficiency.

The third part is E-HPC, Alibaba Cloud's elastic high-performance computing platform, which automatically manages and schedules the ECS and storage instances of different specifications that underlie Sunrise Biotech's workloads. It also installs and deploys life science software and its runtime environment with one click.

E-HPC automatically scales out at business peaks and releases resources at business troughs, which avoids wasted resources and greatly reduces operation and maintenance costs. In addition, E-HPC can install and deploy HPC software with one click, eliminating the tedious work of installing software on each instance separately.
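The scale-out and release behavior described above can be sketched as a simple sizing rule. This is a hypothetical Python illustration, not the E-HPC API: the function names, the jobs_per_node parameter and the node limits are all assumptions chosen for the example.

```python
import math

def target_node_count(queued_jobs, running_jobs, jobs_per_node=1,
                      min_nodes=0, max_nodes=10):
    """Return how many compute nodes the current job load calls for."""
    demand = queued_jobs + running_jobs
    needed = math.ceil(demand / jobs_per_node)
    # Clamp to the configured cluster limits (hypothetical values).
    return max(min_nodes, min(max_nodes, needed))

def scaling_action(current_nodes, queued_jobs, running_jobs, **limits):
    """Translate the target size into an add / release / hold decision."""
    target = target_node_count(queued_jobs, running_jobs, **limits)
    if target > current_nodes:
        return ("add", target - current_nodes)
    if target < current_nodes:
        return ("release", current_nodes - target)
    return ("hold", 0)

# Peak: one node is busy and four more jobs are waiting.
peak = scaling_action(current_nodes=1, queued_jobs=4, running_jobs=1)

# Trough: five nodes are idle and nothing is queued.
trough = scaling_action(current_nodes=5, queued_jobs=0, running_jobs=0)
```

In this sketch the peak case asks for four extra nodes and the trough case releases all five, which is the pattern that avoids paying for idle capacity.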

Alibaba Cloud's large-memory instance solution helps Sunrise Biotech's business in the following four aspects.

First, fast computation. The E-HPC solution simplifies job writing, job submission, monitoring and job computation. The data loading and export time is reduced from 1000 seconds to 2.5 seconds, and the sample size of a single task is doubled. With a running time almost the same as that of a single task, the number of sequencing tasks running at the same time has increased from 1 to 5, a fivefold improvement in task processing efficiency.

Second, low cost. E-HPC guarantees overall computing power while dynamically creating and deleting compute nodes to avoid wasting resources, and it improves job quality and speed. It also exposes rich cloud-native capabilities: ECS supports preemptible instances, OSS supports cold archive storage, and multiple billing methods are available, so costs can be fine-tuned according to business needs and to the performance and retention period required for data storage.

Third, simpler operation and maintenance. E-HPC automatically manages and schedules the ECS instances of different specifications that underlie Sunrise Biotech's workloads, including the ECS instances running MemVerge software, and it installs and deploys life science HPC software and its runtime environment with one click, greatly reducing operation and maintenance costs. Through the standardized integration of the Memory Machine large-memory virtualization software with the cloud platform, Alibaba Cloud accelerates software delivery and deployment and standardizes operation and maintenance management, which greatly improves business efficiency.

Fourth, ecosystem support. Alibaba Cloud has been deeply involved in the bioinformatics industry for many years and has built up a variety of service solutions and customer resources, which can provide more support for connecting upstream and downstream biotechnology enterprises. Building on Alibaba Cloud, Sunrise Biotech has developed a single-cell analysis platform that serves users directly, giving scientific researchers and drug research and development users the ability to analyze single-cell data.
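As a quick check on the performance figures quoted under the first aspect, the short Python calculation below derives the speedups they imply. The 1000-second and 2.5-second load times and the increase from 1 to 5 concurrent tasks come from the text; the 10-hour task duration is a hypothetical value used only to contrast queued and concurrent execution.

```python
def speedup(before, after):
    """Ratio of the old duration to the new duration."""
    return before / after

def batch_hours(num_tasks, hours_per_task, concurrency):
    """Wall-clock hours to finish all tasks when `concurrency` run at once."""
    batches = -(-num_tasks // concurrency)  # ceiling division
    return batches * hours_per_task

# Data loading and export: 1000 s before, 2.5 s after (from the text).
load_speedup = speedup(1000, 2.5)          # 400.0

# Five tasks of a hypothetical 10 hours each.
queued_hours = batch_hours(5, 10, 1)       # 50: one task at a time
concurrent_hours = batch_hours(5, 10, 5)   # 10: five tasks at once

throughput_gain = queued_hours / concurrent_hours  # 5.0
```

The loading figure corresponds to a 400-fold reduction, while the fivefold efficiency gain comes entirely from running five tasks in the time one used to take.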
