Article | Sun Xiangzheng, Alibaba Cloud high-performance computing technology expert
After the outbreak of the novel coronavirus epidemic, Alibaba Cloud made its high-performance computing resources, including SCC supercomputing clusters, CPU/GPU machines, E-HPC cloud supercomputing, and AI technologies, available free of charge to public scientific research institutions around the world. Recently, many research institutions and universities have run numerical calculations related to drug research and development on Alibaba Cloud E-HPC, with technical support and follow-up from the Alibaba Cloud supercomputing team. This article mainly describes how E-HPC cloud supercomputing helps developers in the drug screening phase process a large number of small-molecule libraries concurrently. It also introduces the Alibaba Cloud solution behind the open computing and results-sharing platform of the Global Health Drug Discovery Institute (GHDDI).
The development cycle of a drug is extremely long. It takes at least 10 years to get from the earliest research and development of a new drug to its market launch.
Time is especially precious during an epidemic. Many scientists therefore try to find treatments for COVID-19 among existing drugs, which avoids many of the subsequent steps, such as the lengthy approval and market launch process.
In the compound discovery stage, the traditional method was to screen candidates through a large number of experiments to find suitable compounds. Today, scientists simulate the interaction between molecular compounds and targets computationally to screen out potentially effective compounds for experiments.
In this process, high-performance computing (HPC), often called "supercomputing", is an indispensable support for modern drug development.
The rise of cloud computing has changed the way scientists obtain computing power and use supercomputing services. For example, the Alibaba Cloud E-HPC cloud supercomputing product allows scientists to build high-performance cluster systems on the cloud to meet the needs of drug developers for computing platforms.
In addition, computing power on the cloud is large and elastic. Scientists can buy it on demand without worrying that limited computing power will slow down development.
What is the relationship between viruses, drug research and development, and high-performance computing? We begin with how viruses replicate and spread in the host, move on to examples of how drugs inhibit them, and finally describe the role of high-performance computing in drug research and development.
Virus and drug development
Viruses are non-cellular entities composed of nucleic acid molecules (DNA or RNA) and proteins, such as the tobacco mosaic virus shown in the following figure. Because a virus is non-cellular, it cannot multiply through cell division. Instead, it enters a host cell, uses the cell's metabolic machinery to synthesize copies of its own components, and assembles them into new virus particles [1]. Coronaviruses (CoVs) are a group of highly homologous, positive-sense single-stranded RNA viruses that share these characteristics and can cause respiratory, intestinal, liver, and nervous system diseases of varying severity. Two new types that emerged in the past 12 years, severe acute respiratory syndrome coronavirus (SARS-CoV) and Middle East respiratory syndrome coronavirus (MERS-CoV) [2], as well as the virus behind the current COVID-19 outbreak, belong to this group.
COVID-19 virus
Molecular structure of a virus protein [4]
After the virus enters the host cell, the viral genome is replicated and transcribed (except in positive-sense RNA viruses, whose genome can be translated directly), viral proteins are synthesized, and the components are assembled into new virus particles. The life cycle is shown in the following figure (a simplified diagram for a non-enveloped virus).
Using drugs to interfere with the virus replication process can effectively limit the damage the virus does to the body. For example, the synthesis of viral proteins requires proteases such as the 3CL protease and the PL protease (PLpro), so inhibiting protease function is one way to inhibit the virus. A structure on the protease that can be recognized or bound by other substances (ligands, drugs) is called a target (biological target). The goal is to find a ligand (small-molecule drug) that binds to a suitable target on the viral protease and changes the protease's three-dimensional structure. This alters the protease's function and blocks viral protein synthesis, so the virus cannot replicate [3].
Drug development and high-performance computing
Drug research and development is a complicated and time-consuming process, and drug screening is only one early step in it. Take the search described above for small molecules that bind to a viral protease. Different research institutions maintain ligand (small-molecule) libraries of different kinds, so the number of libraries is large, and each library contains tens of thousands of ligands or more. Testing them one by one through experiments is not practical. Instead, computer numerical simulation scores the binding of each ligand, and the ligands with high scores and reasonable binding modes are selected as drug candidates for experimental verification, which effectively accelerates drug research. Because the ligand libraries are so large, completing the screening in a limited time is itself a challenge. For example, if a ligand library has 10,000 candidate ligands and each takes 1.5 hours to process on average, processing them serially requires 15,000 hours (625 days). To finish the calculation within the required time, the following are needed:
• a computing platform with strong computing power;
• large-capacity storage for the input data and the calculation results.
In addition, to make sure screening runs efficiently and smoothly, computing services are required, including:
• a cluster software runtime environment that supports software operation and data access across multiple machines;
• support for concurrent multi-task processing across multiple machines.
Beyond the computing platform, drug screening also requires high-performance application software.
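The serial estimate above (10,000 ligands at 1.5 hours each) can be reproduced with a few lines of shell, and the same arithmetic gives a rough idea of what concurrency buys. This is only a sketch: the function name estimate_days is hypothetical, and it assumes every ligand takes the same time and that the work scales perfectly across cores, which real runs will not quite achieve.

```shell
#!/usr/bin/env bash
# Rough wall-clock estimate for a screening run.
# Assumption: uniform per-ligand time and perfect scaling across cores.

# estimate_days LIGANDS HOURS_PER_LIGAND CORES
# Prints the estimated run time in days, with two decimals.
estimate_days() {
    awk -v n="$1" -v h="$2" -v c="$3" \
        'BEGIN { printf "%.2f\n", n * h / c / 24 }'
}

estimate_days 10000 1.5 1     # serial: 625.00
estimate_days 10000 1.5 96    # 96 cores: 6.51
```

Under these idealized assumptions, 96 cores would bring the 625-day serial run down to about six and a half days.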
The simulation calculations used in drug screening include docking and molecular dynamics. Docking takes relatively little time and is often used for the preliminary screening of a large number of ligands; the main software packages include DOCK6, AutoDock Vina, and Glide. Molecular dynamics simulation is time-consuming, with run times that vary from case to case; it is used to further analyze the hits from the docking stage. The main software packages include GROMACS, NAMD, and Amber, which generally benefit noticeably from GPGPU acceleration.
Drug research and development requires high-performance clusters with strong computing capabilities. How can these computing resources and services be obtained? With the rise of cloud computing, obtaining compute servers and services from the cloud has become a new approach. Alibaba Cloud provides a range of products and services for this, such as the cloud supercomputing product E-HPC (Elastic High Performance Computing), shared cluster storage (NAS, CPFS), and databases. E-HPC allows users to build their own high-performance cluster systems on the cloud, configure high-performance servers and large-capacity storage, and use ready-made solutions for multi-node software operation and high-throughput task processing, which directly meets the needs of drug R&D personnel for a computing platform.
E-HPC cloud supercomputing
Alibaba Cloud E-HPC is a cloud-native high-performance computing cluster solution. It integrates Alibaba Cloud computing products (ECS, EGS, bare metal servers, and supercomputing clusters), networks (VPC, RoCE), and storage (NAS, OSS, CPFS), provides high-performance computing job management and account management, and bundles common HPC application software. Users can create their own high-performance computing cluster from the console page and have root permission to manage and configure it.
Alibaba Cloud also provides a variety of compute instance types, with different computing capabilities (1 vCPU, 2 vCPU, 4 vCPU, ... 104 vCPU) and memory ratios (1 vCPU : 2 GB, 1 vCPU : 4 GB, 1 vCPU : 8 GB), optionally equipped with GPU or FPGA accelerator cards. Most CPUs are of the latest Intel architecture. ECS Bare Metal Instance is a compute server product based on next-generation virtualization technology developed by Alibaba Cloud. It combines the elasticity of virtual machines with the performance and features of physical machines, so the full computing performance of the machine is available. Bare metal servers equipped with RoCE high-speed networks that support RDMA make up the Super Computing Cluster (SCC) product, which supports large-scale, high-concurrency applications.
E-HPC high-throughput task solution
The high-performance computing environment provides the basic computing platform. Efficient drug screening also requires a high-throughput task solution. For example, when DOCK6 is used to dock a ligand (small-molecule) library, a large number of small-molecule files are stored in a folder, such as mol2. Every small molecule is processed in the same way and is docked against the same receptor (such as a viral protease).
With serial processing, the code looks as follows. dock.in is the input file of the dock6 command, and its parameter values must be modified to match each small-molecule file name. The code traverses every molecule file in the mol2 folder, generates the corresponding dock.in input file, and then runs the dock6 command to process the file.
for molin in mol2/*; do
    molin_name=$(basename "$molin")
    # Start from the dock.in template and create a per-ligand input file
    cp dock.in "$molin_name.dock.in"
    # Point the ligand input and the output prefix at this molecule (GNU sed syntax)
    sed -i -e "/^ligand_atom_file/cligand_atom_file $molin" "$molin_name.dock.in"
    sed -i -e "/^ligand_outfile_prefix/cligand_outfile_prefix $molin_name" "$molin_name.dock.in"
    # Dock this ligand against the receptor configured in dock.in
    dock6 -i "$molin_name.dock.in" -o "$molin_name.dock.out"
done
Serial execution takes a long time and cannot take advantage of the computing power of a high-performance cluster. How can the work be spread across multiple nodes and cores in the cluster to speed it up? There are many ways to do this. One is to manually divide the mol2 folder into several subfolders, each holding a share of the small-molecule files, and then run the serial loop in each subfolder. This method requires too much manual work, especially when a task fails and must be resubmitted, which can lead to repeated or missed calculations.
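To make the manual approach concrete, a minimal sketch might distribute the files round-robin into subfolders. The function name split_into_parts and the part_N folder names are hypothetical and are not part of E-HPC or DOCK6.

```shell
#!/usr/bin/env bash
# Manual approach: move the files of a folder into N subfolders, round-robin.
# The part_0, part_1, ... folder names are illustrative only.

# split_into_parts SRC_DIR N
split_into_parts() {
    local src=$1 parts=$2 i=0 f
    for f in "$src"/*; do
        [ -f "$f" ] || continue            # skip anything that is not a regular file
        mkdir -p "$src/part_$((i % parts))"
        mv "$f" "$src/part_$((i % parts))/"
        i=$((i + 1))
    done
}

# Usage (not run here): split_into_parts mol2 8
# Each part_N folder then needs its own serial loop and its own job.
```

Even in this small form the drawbacks are visible: the split is fixed in advance, so a slow subfolder cannot hand work to an idle one, and tracking which files failed is left to the user.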
Defining and starting E-HPC high-throughput tasks
E-HPC provides a high-throughput task solution. With it, a large number of small-molecule files can be processed concurrently in three steps.
Write the script task.sh that processes a single small-molecule file, using $molin in place of the file name. Compared with the serial version, this is simply the body of the loop copied over.
molin_name=$(basename "$molin")
cp dock.in "$molin_name.dock.in"
sed -i -e "/^ligand_atom_file/cligand_atom_file $molin" "$molin_name.dock.in"
sed -i -e "/^ligand_outfile_prefix/cligand_outfile_prefix $molin_name" "$molin_name.dock.in"
dock6 -i "$molin_name.dock.in" -o "$molin_name.dock.out"
Run the E-HPC high-throughput task command ehpcarr to submit task.sh. The command returns the job number 2[].manager. The tasks are now being processed concurrently on 96 CPU cores. If a node has fewer than 96 CPU cores, the tasks are automatically spread across multiple nodes. For example, with 12-core instances, the molecule processing tasks run on eight nodes.
$ ehpcarr submit -w 96 ./task.sh molin
2[].manager
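The node count in this example is a ceiling division of the requested worker count by the cores per node. The helper below is illustrative arithmetic only, not the allocation logic of the E-HPC scheduler, and the name nodes_needed is hypothetical.

```shell
#!/usr/bin/env bash
# nodes_needed WORKERS CORES_PER_NODE
# Prints ceil(WORKERS / CORES_PER_NODE) using integer arithmetic.
nodes_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

nodes_needed 96 12    # 8
nodes_needed 96 96    # 1
```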
E-HPC high-throughput task status query
Run the ehpcarr command with the job number to query how the tasks are executing. The query results show the current status of each task (DONE, RUNNING, FAILED, or INIT) and the start and end time of each task. The task execution times can be used to estimate the computing resources needed for the next run. The query results show that:
• the E-HPC job scheduler starts eight nodes for drug screening;
• different tasks are assigned to different compute nodes (task 0 is assigned to compute001, and task 10520 to compute008);
• a single node runs several tasks concurrently (tasks 0 and 111 are processed concurrently on compute001).
$ ehpcarr status 2[].manager
The E-HPC solution builds on the array jobs of the high-performance cluster job scheduler and enhances them:
• it limits the number of concurrent tasks, which avoids the long job queue that one job per task would create and that would affect the jobs of other cluster users;
• it schedules tasks dynamically, making full use of computing resources.
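The two ideas, a cap on concurrent tasks and dynamic scheduling, can be illustrated on a single machine with xargs. This is a local analogue only and not how ehpcarr is implemented; the function name run_bounded is hypothetical, and the -P option is an extension available in GNU and BSD xargs.

```shell
#!/usr/bin/env bash
# Local analogue of bounded, dynamically scheduled task processing.
# This is NOT ehpcarr; it uses xargs on one machine.

# run_bounded MAX_PARALLEL COMMAND [ARGS...]  < list-of-inputs
# Runs COMMAND once per input line, with at most MAX_PARALLEL running at a time.
# A new task starts as soon as any running task finishes.
run_bounded() {
    local max=$1
    shift
    xargs -P "$max" -I{} "$@" {}
}

# Example: "process" six inputs, three at a time (output order may vary).
printf '%s\n' a.mol2 b.mol2 c.mol2 d.mol2 e.mol2 f.mol2 |
    run_bounded 3 echo processed
```

Because tasks are pulled from one shared list, a worker that finishes early picks up the next input immediately, which is the property that a fixed manual split lacks.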
During the novel coronavirus epidemic, sharing resources and research results can greatly accelerate researchers' progress and avoid duplicated work. The Global Health Drug Discovery Institute (GHDDI) is an independently operated, non-profit new drug research and development institution co-founded by the Bill & Melinda Gates Foundation, Tsinghua University, and the Beijing municipal government. GHDDI has built an open sharing platform on Alibaba Cloud. It uses E-HPC to build high-performance computing clusters for the simulation calculations of drug research and development, and creates separate cloud supercomputing accounts for its partners so that they can share the computing resources. To share and publish the computing results of the E-HPC cluster, GHDDI attaches Object Storage Service (OSS) to the cluster and puts the results to be published on OSS. It also creates an ECS server on the cloud to host a web server [4] and places the OSS access links on that web server for browsing and downloading.
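As a rough illustration of the last step, a web page with download links can be generated from a list of published object names. This is a sketch under stated assumptions, not GHDDI's actual setup: the function name make_index, the bucket URL, and the object names are all hypothetical, and names are inserted without HTML escaping.

```shell
#!/usr/bin/env bash
# Build a simple HTML list of download links for published result files.
# The bucket URL and object names below are placeholders.

# make_index BASE_URL  < object-names
make_index() {
    local base=${1%/} name                 # drop a trailing slash from the base URL
    echo '<ul>'
    while IFS= read -r name; do
        [ -n "$name" ] || continue         # ignore empty lines
        printf '  <li><a href="%s/%s">%s</a></li>\n' "$base" "$name" "$name"
    done
    echo '</ul>'
}

printf '%s\n' results/run1.csv results/run2.csv |
    make_index "https://example-bucket.oss-cn-beijing.aliyuncs.com"
```

The output can be saved as a page on the ECS web server, so that visitors browse the list there while the files themselves are served from OSS.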
Drug research and development requires high-performance computing clusters with strong computing capabilities. For example, drug screening requires docking a large number of small molecules. Scientists can use Alibaba Cloud E-HPC to quickly build high-performance clusters on the cloud and obtain high-performance computing instances that meet their computing power requirements. E-HPC also provides a high-throughput task processing solution that lets drug screening run concurrently on multiple compute nodes and cores, reducing the overall execution time. Finally, because E-HPC is a cloud-native supercomputing product, it can be connected with other cloud products, such as Object Storage Service (OSS), to build computing and information publishing platforms quickly and easily.
[1] https://zh.wikipedia.org/wiki/%E7%97%85%E6%AF%92#cite_note-2
[2] Zumla, A., Chan, J., Azhar, E. et al. Coronaviruses - drug discovery and therapeutic options. Nat Rev Drug Discov 15, 327-347 (2016). https://doi.org/10.1038/nrd.2015.37
[3] https://zh.wikipedia.org/wiki/%E9%9D%B6%E7%82%B9_(%E7%94%9F%E7%89%A9%E5%AD%A6)
[4] https://ghddi-ailab.github.io/Targeting2019-nCoV/