Experts from Alibaba Cloud, ByteDance and Huazhong University of Science and Technology on O&M

Only a thorough "operation" strategy today can maintain stability in the future. Not long ago, Alibaba Cloud, together with the Information Storage and Security Professional Committee of the China Computer Industry Association (CCIA), invited experts from Alibaba Cloud, ByteDance and Huazhong University of Science and Technology to discuss the operation and maintenance (O&M) of storage systems in the digital economy era.

1. Reduce latency jitter and avoid dramatic swings in system performance

The essence of operation and maintenance is to keep networks, servers and services in an acceptable state in terms of cost, stability and efficiency at every stage of their life cycle. In the ICT industry, operation and maintenance staff often joke that "operation and maintenance is a lifelong commitment to the applications: once you are in, you never leave". They are the stewards, security guards and firefighters of the IT resources in the data center and the company.

Luo Qingchao, senior technical expert at Alibaba Cloud Intelligence and R&D director for object storage, understands this well. He recalled key Alibaba Cloud customers asking for guarantees on latency jitter, and pointed out that when the latency of cloud storage requests jitters severely, the overall performance of the application swings like a roller coaster.

Request latency on the cloud consists of network latency and storage latency. The cloud service network is very complex, spanning BGP (Border Gateway Protocol) networks, static public networks and the networks inside the data center. Identifying the congestion points that affect latency and scheduling traffic sensibly are both crucial to avoiding congestion.

The storage service must also deal with the latency of media access. A mechanical or solid state disk is itself a complex system: the heavier the load, the higher the latency. In a distributed storage system in particular, a slowdown can also spread from one node to others like an infection. To reduce latency jitter, OSS relies on rapid monitoring, accurate alerting, root cause analysis and optimized scheduling to keep jitter within a reasonable variance range and ensure a good customer experience.
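To make the idea of keeping jitter "within a reasonable variance range" more concrete, here is a minimal Python sketch. It is not how OSS does it; it only illustrates one common approach, which is to keep a sliding window of recent request latencies and flag a sample when it deviates from the window mean by more than k standard deviations. The class name `JitterMonitor`, the window size, the threshold `k` and the sample latencies are all assumptions made for illustration.

```python
from collections import deque
from statistics import mean, pstdev


class JitterMonitor:
    """Flags latency samples outside mean +/- k * stddev of a sliding window.

    Hypothetical helper for illustration only; window, k and min_samples
    are assumed values, not figures from the article.
    """

    def __init__(self, window=100, k=3.0, min_samples=10):
        self.samples = deque(maxlen=window)  # most recent latencies, in ms
        self.k = k
        self.min_samples = min_samples

    def observe(self, latency_ms):
        """Record one latency sample and return True if it counts as jitter."""
        is_jitter = False
        if len(self.samples) >= self.min_samples:
            mu = mean(self.samples)
            sigma = pstdev(self.samples)
            if sigma > 0:
                is_jitter = abs(latency_ms - mu) > self.k * sigma
            else:
                # Zero-variance window: any different value is a deviation.
                is_jitter = latency_ms != mu
        # The sample is always added, so a spike also widens the window's variance.
        self.samples.append(latency_ms)
        return is_jitter


monitor = JitterMonitor(window=50, k=3.0)
latencies = [20, 21, 19, 20, 22, 21, 20, 19, 21, 20, 20, 95, 21]
flags = [monitor.observe(x) for x in latencies]
print([x for x, flagged in zip(latencies, flags) if flagged])  # prints [95]
```

A production system would more likely track percentiles (for example p99) per storage node and feed the flagged samples into alerting and root cause analysis, but the variance check above captures the basic idea.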

Wu Fei, a researcher and doctoral supervisor at Huazhong University of Science and Technology, said with a smile that working at a university means not feeling the pressure that operation and maintenance staff face, but that the difficulty of being on call 24/7, comparable to being a perpetual motion machine, is easy to understand. At present, the reliability requirement for cloud storage is 11 nines. Solid state disks and traditional mechanical disks are the most basic data storage units in cloud storage, and maintaining them is not simple. The storage medium of the former is flash memory. In principle, flash memory is like a door: every time it is opened, it wears a little. With use it inevitably ages, gradually starts to creak, and failures follow. The latter keeps moving like any mechanical machine, but it will eventually stop spinning. In a storage system made up of thousands of solid state disks or hard disks, one can imagine the pressure on operation and maintenance staff to ensure such high reliability.

2. The trend toward intelligent operation and maintenance keeps pace with the times

"To do new infrastructure well, you must first sharpen operation and maintenance." In the process of enterprise digitalization, operation and maintenance is an important part.

Zhang Lei, head of database and storage technology at ByteDance, said that O&M technology has developed by leaps and bounds over the past decade, from traditional manual O&M, to the automated O&M of DevOps, and then to the intelligent O&M of AIOps. The development of the O&M system for ByteDance's cloud database and cloud storage can also be roughly divided into three stages.

In the first stage, before 2016, the overall volume of databases and storage was not very large, and the team's operation and maintenance was still in a "slash-and-burn" state, that is, it was basically done by hand.

The second stage ran from 2017 to 2021. The business grew rapidly, the cloud storage system reached the EB level, the number of database deployments reached the thousands or even tens of thousands, and manual operation and maintenance hit its ceiling. The operation and maintenance team therefore turned to building automated O&M platforms and relied on them to solve operational problems.

The third stage, starting in mid-2021, is the construction of a third-generation operation and maintenance system built on AI and related technologies. It combines the knowledge and experience of operation and maintenance staff with big data and machine learning, and builds them into the O&M system to replace manual work, so that operational efficiency can be addressed at a larger scale.

Across these three stages, the business system as a whole has made two leaps in capability. On the one hand, the culture, organization and capability of operation and maintenance have improved. Put simply, the team went from everyone feeling their way forward in the dark, with individuals doing operation and maintenance on their own, to systematically building a full-time SRE team. On the other hand, the O&M system and the technical systems behind the services have also advanced. For example, from managing dozens of servers in the early days to managing hundreds of thousands today, the technical system has kept evolving to provide support. In short, the culture and organization of O&M and its technical system advance hand in hand.

3. Quickly locate and diagnose the root cause of the problem

As businesses move to the cloud, operation and maintenance is gradually moving to the cloud as well. O&M services such as resource monitoring, terminal management and control, and security support are becoming cloud applications that enterprises can subscribe to on demand.

Zhang Lei said that he usually focuses on the golden metrics of services, especially those related to stability, because for large online services stability may well come first. He also keeps a long-term eye on how the technologies behind the services he depends on are evolving, so as to prepare in advance and make sure the O&M and operations systems do not fall behind when there are major changes in technology or product form.

Luo Qingchao pointed out that Alibaba Cloud OSS, as a service provider, has to meet the SLA (service level agreement) and SLO (service level objective) it has committed to. Specifically, the OSS official website promises an availability SLA of 99.995%, which is industry-leading. As a service provider, it measures the request success rate according to the specified standard and does everything possible to meet this indicator. The SLO is a more detailed service commitment, for example ensuring that the overall bandwidth of customers' requests can reach a stable Tbps level, and that the latency of some typical requests stays at around 100 ms without too much fluctuation.
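As a rough illustration of what a 99.995% availability figure means in practice, the Python sketch below converts an availability percentage into a downtime budget and computes a request success rate. It assumes a 30-day period and a simple succeeded/total ratio; the actual OSS SLA defines its own measurement standard, which is not reproduced here. The function names and the request counts are hypothetical.

```python
def downtime_budget_minutes(availability_pct, period_days=30):
    """Minutes of unavailability allowed over the period at the given availability."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


def success_rate_pct(total_requests, failed_requests):
    """Percentage of requests that succeeded (simple ratio, an assumed definition)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100 * (total_requests - failed_requests) / total_requests


budget = downtime_budget_minutes(99.995)
print(f"{budget:.2f} minutes per 30 days")  # prints 2.16 minutes per 30 days

# Made-up numbers: 400 failures out of 10 million requests.
rate = success_rate_pct(10_000_000, 400)
print(rate >= 99.995)  # prints True
```

Under these assumptions, a 99.995% commitment leaves only about two minutes of unavailability per month, which is why fast detection and recovery matter so much.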

Alibaba Cloud will also soon release an observability service, CloudLens, which will provide customers with operation and maintenance knowledge for mainstream cloud products. For OSS, CloudLens provides functions such as usage analysis, performance monitoring, security analysis, data protection, anomaly detection and access analysis, supporting customers' management capabilities in six dimensions: cost, performance, security, data protection, stability and access analysis.

Wu Fei believes that storage technology is also evolving to support the rapid development of applications: from traditional disk arrays to centralized storage, and now to distributed storage systems that may contain anywhere from tens to tens of thousands of storage servers. Technically, we need to consider how to ensure that thousands of servers run reliably. From the perspective of operation and maintenance, that means having no or few failures, or detecting failures quickly so as to meet targets such as rapid detection, rapid repair and rapid recovery.

In recent years, AI has been developing rapidly. University researchers are also studying how to use AI to predict system failures in advance, hoping to complete data migration before a failure occurs and thereby effectively reduce the pressure on operation and maintenance.

4. Industry, academia, research and application: building a community of common growth

The support that operation and maintenance provides for business systems is inseparable from the groundwork laid by service providers such as Alibaba Cloud and the efforts of product users such as ByteDance. As the main force in research on basic theory and frontier technologies, universities and research institutes have deep reserves of basic technology and a rich foundation of theoretical research in many key frontier areas. Collaborative innovation among industry, universities and research institutes is therefore an area that deserves attention in industrial development.

Wu Fei said that "community of common growth" is an apt name for such a partnership, which includes the innovation chain, the industrial chain and the user chain. It is precisely such alliances that link users and R&D parties together so that they promote each other's development. Put simply, production, learning, research and application are integrated, and all parties grow together to advance the development and deployment of technology.

For example, when universities study the reliability of cloud storage, they may propose a new algorithm. To implement and apply it, they may need to work with enterprises such as ByteDance and Alibaba Cloud to deploy the algorithm on real systems, which in turn promotes industrial development.

Wu Fei also mentioned that innovation across the boundaries of industry, universities and research has become an important part of the career planning of university experts and scholars. Many of them spend time in industry working to put technology into practice and then choose to return to academia. This is called "academic leave". Wu Fei believes that academia and industry will become further integrated in the future.

Zhang Lei believes that the integration of industry, universities and research is an important driving force that carries a technology from its birth to wide application. In recent years, some cloud storage technologies have become settled. First, he hopes that academia and the research community can bring more breakthroughs in infrastructure: breakthroughs in storage media, in the overall cloud storage architecture, or in systems and in operation and maintenance ideas and methods could all bring new vitality to the industry. Second, industry should also keep striving for excellence, boldly try new technologies, methods and ideas, and apply them in suitable scenarios. Large enterprises such as ByteDance operate at a very large scale in terms of technology, servers and storage, so there is a strong technical leverage effect: even an optimization that looks very small can produce great value at that scale. All parties therefore need to support each other.

Luo Qingchao pointed out that for Alibaba Cloud, as a service provider, common growth has two core points: one is to provide foundational services for common operation and maintenance capabilities, and the other is to absorb input and advanced ideas from customers, the industry and academia to help that foundation grow.

As for the combination of industry, education and research mentioned by the other two guests, Luo Qingchao said that two stages in the evolution of common growth may be very important. In the first stage, organizations such as the CCIA provide the soil and ecosystem for common growth. If the CCIA is run well, it can lay a solid foundation for the common growth of operation, maintenance and technology. In the second stage, the community of common growth must deliver results. For example, the CCIA is an organization that builds bridges for communication and incubates standards, white papers and technical innovation ideas that are influential in the industry.

Conclusion: As the role of universities extends from talent cultivation and scientific research to social services, cooperation among enterprises, associations and universities will deepen further. This will undoubtedly help form a virtuous circle of development and promote the commercialization of storage technology achievements. In this process, both users and vendors will benefit greatly.
