Case Study: EDA Cloud Migration for a Leading IC Design Company

1、 Customer project background

1. Customer introduction

In recent years, as new technologies such as 5G, the Internet of Things and artificial intelligence have become mature and commercially available, the development of smart cities, autonomous driving, intelligent manufacturing and other fields has accelerated, and the semiconductor industry has experienced a wave of rapid growth.

The customer in this project is a world-renowned semiconductor design company and one of the few enterprises in the world that has fully mastered full-scenario communication technologies, including 2G/3G/4G/5G, Wi-Fi, Bluetooth, TV FM, and satellite communication. Its products include mobile communication central processors, baseband chips, AI chips, RF front-end chips, RF chips, and other communication, computing, and control chips. Its business covers more than a hundred countries around the world.

2. Project background

As Moore's law continues to drive chip process technology forward, the number of transistors per unit area doubles roughly every 18 to 24 months, and the computing power required for chip R&D and design grows accordingly. At the same time, IC design companies care deeply about R&D efficiency: taping out a chip one day earlier means reaching the market and starting to make a profit one day earlier, while a project that cannot launch on time may miss its best market window and lose the opportunity.

The customer had realized that the traditional offline way of deploying computing power could no longer keep up with rapid business growth, but it remained very cautious about moving design and R&D to the cloud.

3. Main challenges and concerns of the customer

A. Challenges of traditional offline deployment:

① Insufficient computing power and lack of flexibility

The customer has built large offline data centers with thousands of servers, yet these still cannot meet the needs of the R&D department. In particular, once R&D moves from front-end logic design to back-end physical design, the demand for computing power multiplies; if a bug forces tasks to be rerun, jobs queue up badly. However, because of limits on machine room space and power budget, the customer's offline data centers have no room left to expand;

② Long lead time, affecting R&D progress

Before moving to the cloud, the IT department had to go through a series of steps to add computing power, including project approval, procurement, bidding, delivery, and deployment, which took anywhere from three to six months. The pandemic has added supply chain uncertainty, making offline procurement of computing power even harder to control;

③ Heavy operation and maintenance workload, putting the IT department under pressure

With thousands of offline servers to look after, the IT department has to devote a great deal of manpower to basic operation and maintenance, from machine room power, air conditioning, and security to hardware maintenance, leaving front-line operation and maintenance staff overwhelmed;

④ Lack of effective control over the use of computing power

To conserve computing and storage resources, the IT department has to impose quota limits, monitor and report on utilization, and press R&D personnel to release resources promptly. This process is also an unpleasant one for both sides.

B. The customer's concerns about moving design and R&D to the cloud fall mainly into four areas:

① Data security

Developing a chip requires a huge investment, so the customer is extremely sensitive about data security. Moving R&D to the cloud means the data leaves its original physical boundary, and ensuring that data security remains under control is the customer's bottom line;

② Whether performance meets requirements

After long-term tuning, the customer gets the most out of its offline computing power. Being unfamiliar with cloud computing, the customer worries that virtualization, resource overselling, and similar factors may keep the computing resources it actually receives from matching offline performance, which would reduce R&D efficiency;

③ Business usage experience

Whether the workflows and usage habits that R&D personnel have established on the offline cluster can be migrated seamlessly to the cloud is key to successfully rolling out the cloud project;

④ Return on investment (ROI)

According to preliminary calculations by the customer's purchasing department, buying computing power on the cloud, once the costs of dedicated lines and security are included, has no price advantage over buying machines offline.

2、 Alibaba Cloud EDA cloud solution

1. Using the strengths of the public cloud to solve the customer's business challenges

First, compared with the customer's original offline deployment model, the public cloud solution provided by Alibaba Cloud addresses the problems of insufficient computing power, long delivery cycles, heavy operation and maintenance workload, and lack of usage control.

A. In terms of computing power supply

• Elastic computing power supply: backed by Alibaba Cloud's abundant cloud resources and supply chain coordination capabilities, the customer gets a sufficient supply guarantee;

• No machine room space limits and ample resources: Alibaba Cloud has as many as 12 availability zones in Shanghai providing elastic computing and storage resources, so the customer no longer needs to worry about running out of space in its offline machine rooms;

B. In terms of lead time

• Minute-level delivery: Alibaba Cloud can deliver resources within minutes;

• On-demand, out-of-the-box capacity expansion: at peak stages of design work, the customer can expand capacity on demand and use it immediately, skipping the lengthy steps of equipment procurement, delivery, installation, and deployment;

C. In terms of operation and maintenance management

• No infrastructure O&M: cloud services free the IT department from low-value basic O&M work;

• Unified console management: operation and maintenance personnel can manage and schedule all resources from a single console;

• Automated deployment: the runtime environment, application software, scheduler agent, and so on can be deployed automatically with one click;

D. In terms of usage control

• Integrated resource monitoring: Alibaba Cloud provides an integrated resource monitoring platform;

• Usage quota management and monitoring: usage of computing, storage, and other resources is monitored around the clock without interruption;

• Multi-dimensional performance analysis: helps operation and maintenance personnel manage and control resources at a fine-grained level.
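The quota management and monitoring described above can be illustrated with a minimal sketch. It is not Alibaba Cloud's monitoring API: the `Usage` record, the `over_quota` function, the team names, and the 90% warning threshold are all hypothetical and only show the kind of check an IT department might run against usage figures exported from a monitoring platform.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Resource usage of one R&D team (hypothetical record layout)."""
    team: str
    cores_used: int
    cores_quota: int
    storage_used_tb: float
    storage_quota_tb: float


def over_quota(usages, warn_ratio=0.9):
    """Return (team, resource, ratio) for each resource at or above warn_ratio of its quota."""
    alerts = []
    for u in usages:
        checks = (
            ("cores", u.cores_used, u.cores_quota),
            ("storage", u.storage_used_tb, u.storage_quota_tb),
        )
        for resource, used, quota in checks:
            ratio = used / quota
            if ratio >= warn_ratio:
                alerts.append((u.team, resource, round(ratio, 2)))
    return alerts


# Hypothetical sample data: only the back-end team is close to its limits.
sample = [
    Usage("frontend", 800, 1000, 40.0, 100.0),
    Usage("backend", 1900, 2000, 95.0, 100.0),
]
for alert in over_quota(sample):
    print(alert)
```

Running such a check on a schedule gives the IT department a list of teams to contact before resources run out, instead of policing usage by hand.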

2. Dispelling the customer's doubts about the cloud through quantitative analysis and POC testing

In response to the customer's concerns about moving to the cloud, Alibaba Cloud dispelled the doubts of the various stakeholders inside the customer through POC testing, technical discussions, demonstrations, and analysis:

A. In terms of data security

• Data security commitment: Alibaba Cloud formally commits, in the contract, to the security and privacy clauses covering the project, and does not touch customer data;

• Encryption at rest: at the technical level, Alibaba Cloud lets users bring their own keys to the cloud and encrypt data written to disk, so that ownership of the data stays firmly in the customer's hands;

• Security operation audit: Alibaba Cloud provides security operation auditing. The customer can submit a ticket to request an audit of Alibaba Cloud's operation and maintenance logs for the relevant cloud resources;

B. In terms of meeting performance requirements

• Benchmarking of machine specifications and performance parameters: the specifications and performance parameters of the customer's offline machines are used as the benchmark, and bare metal servers with high clock speed, large memory, and local disks are adopted;

• Third-party stress testing: third-party stress-testing tools are used to test the performance of the cloud instances. The measured results show that the computing and storage performance provided by Alibaba Cloud fully matches the offline benchmark, and some test items even outperform it. With encryption at rest enabled, the performance loss of a cloud instance is generally less than 10%;
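As a minimal sketch of how a "performance loss under 10%" claim can be checked, the snippet below compares the runtime of the same job with and without encryption. The function names and the example runtimes are hypothetical. The case study does not say which stress-testing tool or which metric was used, so wall-clock runtime is only an assumed stand-in.

```python
def performance_loss(baseline_seconds, measured_seconds):
    """Relative slowdown of a measured run against a baseline run, as a fraction."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return (measured_seconds - baseline_seconds) / baseline_seconds


def within_budget(baseline_seconds, measured_seconds, budget=0.10):
    """True if the slowdown stays below the budget (10% by default)."""
    return performance_loss(baseline_seconds, measured_seconds) < budget


# Hypothetical runtimes of one job: unencrypted baseline vs. encrypted disk.
loss = performance_loss(baseline_seconds=3600, measured_seconds=3850)
print(f"loss = {loss:.1%}, within budget: {within_budget(3600, 3850)}")
```

A negative result simply means the measured run was faster than the baseline, which matches the observation that some test items outperform the offline benchmark.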

C. In terms of business usage experience

• Use the same scheduler as offline;

• Use the same scheduling policy as offline;

• Preserve the usage habits of R&D personnel as far as possible;

D. In terms of return on investment

• A review and analysis of the total cost of ownership (TCO) of R&D on the cloud shows that Alibaba Cloud's elastic computing services save the hidden costs of offline machine room construction, power, and operation and maintenance, reduce the risk cost of machine failures, avoid the waste of idle machines during business troughs, and reduce the opportunity cost of insufficient computing power during business peaks;

• For corporate finance, purchasing cloud services on demand also turns CAPEX into OPEX, improving the company's cash flow.
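The TCO reasoning above can be made concrete with a small model. This is a sketch under stated assumptions, not the analysis actually performed in the project: the function names, the cost categories as parameters, and every number in the example are hypothetical and expressed in arbitrary currency units.

```python
def offline_tco(hardware, years, facility_per_year, ops_per_year, failure_risk_per_year):
    """Up-front hardware purchase plus recurring hidden costs over the period."""
    return hardware + years * (facility_per_year + ops_per_year + failure_risk_per_year)


def cloud_tco(years, base_nodes, peak_nodes, peak_months_per_year,
              node_month_price, network_per_year):
    """Base nodes are paid for all year; extra nodes only during peak months."""
    base = base_nodes * 12 * node_month_price
    burst = (peak_nodes - base_nodes) * peak_months_per_year * node_month_price
    return years * (base + burst + network_per_year)


# Hypothetical inputs: offline capacity is sized for the peak, cloud capacity follows demand.
offline = offline_tco(hardware=3000, years=3, facility_per_year=300,
                      ops_per_year=200, failure_risk_per_year=50)
cloud = cloud_tco(years=3, base_nodes=40, peak_nodes=100, peak_months_per_year=4,
                  node_month_price=1.5, network_per_year=100)
print(f"offline: {offline}, cloud: {cloud}")
```

The design point is that a purchase-price comparison alone, like the purchasing department's preliminary estimate, leaves out the recurring terms and the idle capacity outside peak months. Which side comes out cheaper depends entirely on the real inputs.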

3. Introduction to the Alibaba Cloud EDA cloud solution

Based on the characteristics and requirements of EDA workloads, Alibaba Cloud customized the following solution for the customer: (see the figure below)

The left side of the figure shows the customer's offline machine room, where the customer runs a high-performance computing cluster with NetApp storage and uses IBM LSF, a scheduler widely used in the industry, for job scheduling. The right side shows the EDA zone in Alibaba Cloud's China East 2 (Shanghai) public cloud region. The two sides are interconnected through two 10 Gbps high-speed dedicated connections.
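To get a feel for what the interconnect means for moving design data between the offline machine room and the cloud, the sketch below estimates transfer time. It assumes each of the two links carries 10 Gbit/s and that about 80% of that is usable in practice. Both figures, like the function name `transfer_hours`, are assumptions for illustration and not measurements from the project.

```python
def transfer_hours(data_tb, links=2, link_gbps=10.0, efficiency=0.8):
    """Hours needed to move data_tb terabytes (decimal) over parallel links."""
    if data_tb < 0 or links <= 0 or link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid transfer parameters")
    bits = data_tb * 1e12 * 8                          # terabytes to bits
    usable_bps = links * link_gbps * 1e9 * efficiency  # aggregate usable bits per second
    return bits / usable_bps / 3600


# Hypothetical example: synchronizing a 50 TB project directory.
print(f"{transfer_hours(50):.1f} hours")
```

An estimate like this helps decide which data should live permanently on cloud storage and which can be synchronized on demand.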

A. Machine room location:

The data center closest to the customer is chosen, keeping data transmission latency at the millisecond level;

B. Compute nodes:

Elastic bare metal servers with high clock speed and large memory are provided on demand. Bare metal servers physically guarantee that the customer has exclusive use of the machine's resources. Overclocking and Remax are turned off at the BIOS level, and the MOC card technology developed in-house by Alibaba Cloud avoids virtualization overhead, so that each bare metal server delivers its full computing performance;

C. Storage

The parallel file system CPFS is used, offering high performance, high scalability, and high reliability. A single cluster can scale to at most 9620 storage nodes and supports up to 2.5 TB/s of throughput;

D. Cluster management

Alibaba Cloud E-HPC is used to manage the elastic bare metal compute nodes in a unified way. E-HPC's scheduler plug-in supports automatic deployment of LSF agents and connects them seamlessly to E-HPC, providing node management, job management, auto scaling, and other capabilities.
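The auto scaling capability mentioned above boils down to a sizing decision based on the scheduler queue. The sketch below shows one simple form of that decision. It is not E-HPC's actual algorithm or API: the function `desired_node_count` and all of its parameters are hypothetical, and in practice the pending slot count would come from the scheduler's own queue statistics.

```python
import math


def desired_node_count(pending_slots, busy_nodes, slots_per_node, min_nodes, max_nodes):
    """Nodes needed to keep running jobs and absorb the pending queue, within cluster limits."""
    if slots_per_node <= 0:
        raise ValueError("slots_per_node must be positive")
    needed = busy_nodes + math.ceil(pending_slots / slots_per_node)
    return max(min_nodes, min(max_nodes, needed))


# Hypothetical queue state: 130 pending job slots, 10 busy nodes, 64 slots per node.
print(desired_node_count(pending_slots=130, busy_nodes=10, slots_per_node=64,
                         min_nodes=2, max_nodes=50))
```

Clamping to `min_nodes` and `max_nodes` keeps a warm baseline for interactive work and caps spending during bursts. A real policy would also wait for a cool-down period before releasing idle nodes.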

3、 Features, highlights and advantages of the Alibaba Cloud EDA cloud solution

Compared with the customer's offline physical machine deployment and with solutions from other vendors, the advantages of the Alibaba Cloud EDA cloud solution show in the following four aspects:

1. Ultimate performance

Alibaba Cloud's bare metal servers with high clock speed and large memory, built specifically for the chip design industry, showed excellent performance in actual measurements and fully meet the demanding computing requirements of EDA software;

2. Visual architecture

CADT (Cloud Architect Design Tools), an industry-first product created by Alibaba Cloud, is introduced to help the customer create and manage cloud architectures. It presents a cloud architecture as a deployable architecture diagram, clearly expresses the deployment relationships among the underlying product components, and reduces the time the customer spends on solution design and evaluation;

3. Flexible deployment

With E-HPC's flexible deployment, elastic resources, and unified operation and maintenance capabilities, managing compute clusters on the cloud is simpler and more efficient;

4. Security and compliance

As an extension of the offline machine room, the cloud environment has no public network egress, which keeps it isolated from attacks originating on the public internet. The cloud security center identifies, analyzes, and warns of security threats in real time, and encryption at rest serves as the last line of defense for data protection.

