By Alibaba Cloud E-HPC Team
When Alibaba Cloud started the public beta of Elastic High Performance Computing (E-HPC) in September 2017, a customer in the simulation business asked to use the service. The customer was already using several cloud computing products for its business and was facing the development challenges common in the traditional manufacturing industry.
After initial discussions with the customer, we found that its major pain points were as follows. The simulation enterprise earned its revenue from the traditional manufacturing industry, so its customers were mainly automakers, aerospace companies, and shipyards. These customers needed different amounts of computing power at different stages. When they had large simulation demands, the simulation enterprise sometimes could not meet production requirements because of the limited capacity of the machines in its own equipment room. In general, the customers' demand for computing power kept changing throughout the year.
When E-HPC was launched, it provided scale-in and scale-out features for compute clusters built on the cloud. This required that newly created compute nodes be able to run the same high-performance software stack as the existing nodes and share the same Portable Operating System Interface (POSIX) account system, so that the cluster job scheduler could dispatch user-submitted jobs to them. At the beginning, the simulation enterprise customer quickly created a cluster based on Elastic Compute Service (ECS). During initial operation, the cluster needed only dozens of cores to run computing instances. After the cluster was scaled out through E-HPC to add compute nodes, it was able to run structural and flow solver jobs on more cores.
When providing simulation services for manufacturing enterprises, the customer could estimate the number of cores and the time required for some computing instances, but not the resources required by more complex ones. The customer therefore hoped that the compute clusters provided by E-HPC could adapt to the number of jobs submitted to its simulation system and to the number of cores those jobs actually required. In other words, the customer wanted to get the most out of every CPU cycle and hoped that E-HPC could add and remove compute nodes automatically. For this reason, the Auto Scale feature was launched to dynamically scale a compute cluster in and out based on the load and policies of the entire high-performance cluster.
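The scaling decision described above can be illustrated with a minimal sketch. This is not the Auto Scale implementation or its API. The names `ScalePolicy`, `target_node_count`, and `scaling_delta` are hypothetical, and the sketch assumes that every compute node offers the same number of cores and that each job declares the cores it requests.

```python
import math
from dataclasses import dataclass


@dataclass
class ScalePolicy:
    """Hypothetical policy, for illustration only."""
    cores_per_node: int  # cores offered by one compute node (assumed uniform)
    min_nodes: int       # never shrink the cluster below this size
    max_nodes: int       # never grow the cluster beyond this size


def target_node_count(requested_cores, policy):
    """Return the node count needed to cover all running and queued jobs.

    requested_cores: list with the number of cores requested by each job.
    """
    total_cores = sum(requested_cores)
    needed = math.ceil(total_cores / policy.cores_per_node)
    # Clamp the result to the limits set by the policy.
    return max(policy.min_nodes, min(policy.max_nodes, needed))


def scaling_delta(current_nodes, requested_cores, policy):
    """Positive result: nodes to add. Negative result: nodes to release."""
    return target_node_count(requested_cores, policy) - current_nodes


policy = ScalePolicy(cores_per_node=16, min_nodes=1, max_nodes=10)
print(scaling_delta(2, [64, 32, 8], policy))  # 104 cores need 7 nodes, so add 5
print(scaling_delta(4, [], policy))           # no jobs, shrink to min_nodes: -3
```

A production policy would likely also consider how long jobs have been queued and how long nodes have been idle before acting, so that short load spikes do not cause repeated scale-out and scale-in.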
Generally, in the simulation workflow, a large number of simulation computing jobs are followed by a rendering stage. The computing results pass through a pipeline on the GPU server cluster before they are displayed on Cloud Desktop to the simulation enterprise's customers. Therefore, E-HPC began to support custom images, allowing the customer to start GPU instances from images with the required rendering software preinstalled to complete follow-up processing. E-HPC also provided a scale-out method that supports spot instances to help the customer complete some stateless training tasks at a lower cost.
In pursuit of extreme performance, the traditional high-performance computing systems on the annual TOP500 list share distinct characteristics in computing, storage, and networking. In computing, the compute nodes of high-performance clusters use high-frequency processors running at 3 GHz to 4 GHz. In storage, traditional enterprise-class disk arrays are generally used. The reliability of the storage system relies on the fault tolerance of the disk arrays and rarely on a multi-copy scheme. In networking, because traditional applications mostly use parallel algorithms based on synchronous communication, a low-latency remote direct memory access (RDMA) network or even a customized network is usually used to achieve a large speed-up ratio.
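For readers less familiar with the term, the speed-up ratio compares the run time of a job on a baseline configuration with its run time on a larger one. The following is a minimal sketch of that calculation together with the related parallel efficiency. The function names are our own, and the run times are illustrative numbers, not measurements from any test described in this article.

```python
def speedup(t_base, t_scaled):
    """Speed-up ratio: how many times faster the scaled run is than the baseline."""
    if t_base <= 0 or t_scaled <= 0:
        raise ValueError("run times must be positive")
    return t_base / t_scaled


def parallel_efficiency(t_base, n_base, t_scaled, n_scaled):
    """Fraction of ideal (linear) scaling achieved; 1.0 means perfectly linear."""
    if n_base <= 0 or n_scaled <= 0:
        raise ValueError("node counts must be positive")
    ideal = n_scaled / n_base  # the speed-up that linear scaling would give
    return speedup(t_base, t_scaled) / ideal


# Illustrative numbers only: 1200 s on 1 node, 160 s on 8 nodes.
print(speedup(1200.0, 160.0))                    # 7.5
print(parallel_efficiency(1200.0, 1, 160.0, 8))  # 0.9375
```

With synchronous communication, every node waits for the slowest message in each step, so network latency directly lowers the efficiency value. This is why low-latency networks matter for a speed-up ratio that stays close to linear.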
At the beginning of 2018, Alibaba Cloud Super Computing Cluster (SCC) began its public beta testing. By providing computing, storage, and network infrastructure that can run super-computing applications, SCC can deliver a nearly linear speed-up ratio for finite element analysis (FEA) software, such as fluid simulation software. With the help of the elasticity provided by E-HPC, the customer quickly completed proof of concept (POC) testing for SCC.
From the preceding figure, we can see that SCC has helped significantly increase both the single-node computing power and the multi-node speed-up ratio during FEA of tens of billions of elements. The customer gave feedback on the test as follows:
After years of simulation service practices, the customer has developed a simulation system that integrates commercial software commonly used in the manufacturing simulation industry, covering collision simulation, fluid dynamics, mechanical structure, and electromagnetic simulation.
The simulation system provides a unified portal that lets different manufacturing enterprises complete the simulation workflow with a consistent experience. The early system structure is shown in the following figure. The early structure is based on a traditional super-computing architecture and integrates computer-aided engineering (CAE) parallel computing, computing resource scheduling, software and hardware resource management, remote graphical desktops, CAE professional applications, and other technologies to provide simulation computing services for users. Owning this infrastructure as a means of production to serve its own customers was costly. Through our discussions, we learned that the customer specialized in simulation, not in operating IT infrastructure, which existed only to support its simulation production system. The customer wanted to migrate the IT infrastructure to the cloud so that it could focus on simulation services.
By migrating the simulation system to the cloud, the customer expected to achieve the following effects:
After gradual verification, the customer has simplified the structure of the simulation system on Alibaba Cloud, as shown in the following figure.
From the preceding figure, we can see that the customer now focuses more on the simulation workflow and uses IT infrastructure simply by calling various open APIs on Alibaba Cloud. When a cluster is required, the customer can create an SCC through an open API. When the computing power is insufficient, the customer can create a compute cluster through an open API. When there are no jobs, the customer can release a compute cluster through an open API. When the customer wants to avoid manual operations, clusters can be automatically scaled in or out through an open API. The customer no longer needs to worry about building equipment rooms, stocking up on hardware, expanding capacity, or maintaining equipment.
As industrial simulation technologies mature and industrial products grow more complex, most industrial simulation is conducted under complex physical environmental conditions. Industrial simulation therefore requires a large amount of computing and high-performance storage resources, as well as a highly automated simulation process that can quickly create and access simulation models and data. Industrial simulation technologies now play a role from the early stages of product R&D and are no longer used only for backend verification after product design is completed. They are also becoming more important downstream in the product lifecycle. For example, they can be used to analyze real-time operational data from machines on the industrial Internet of Things (IoT). As a result, providing the computing resources, training, and environment that industrial simulation requires is becoming increasingly difficult. It is not easy for simulation enterprises to build an environment and train professional simulation engineers. They may have to spend several months studying their software and hardware purchasing needs, and then invest heavily in professional simulation training and application deployment.
Like other enterprise-class IT applications, simulation applications are being greatly changed by cloud computing technologies. On a simulation cloud platform, enterprises can design, improve, and innovate products, quickly verify models, and compare schemes. In the traditional manufacturing industry, the biggest value of cloud computing is that it frees enterprises from purchasing and managing physical computing clusters, which allows them to change the traditional simulation process and focus on simulation services. With cloud computing, enterprises can use software with more flexible pricing and perform modeling anytime and anywhere to solve complex simulation problems. Because multiple design schemes can be simulated at the same time, cloud-based simulation makes product design and engineering simulation easier for the traditional manufacturing industry. By running simulation on Alibaba Cloud, enterprises can quickly acquire elastic resources and complete a full simulation production process in a short time. Whether enterprises want to accelerate product innovation, meet the growing simulation demands of the manufacturing industry, or strengthen global collaboration to increase the return on IT investment, E-HPC can deliver immediate results.
About the Author
Mu Hui, a virtualization platform expert from Alibaba Cloud, specializes in E-HPC and has received several patent certificates and awards in virtualization technology.