This article is from the Alibaba DevOps Practice Guide, written by the Alibaba Cloud Yunxiao Team.
Enterprises are moving toward digitalization, and information systems play an increasingly important role in business. The earliest information systems mainly supported business development and improved operational efficiency. They then gradually became part of the business itself, providing commercial value and cost advantages. With the implementation of the national digitalization policy, the role of information systems has become more prominent: they now lead business development and provide competitive advantages.
As businesses become increasingly dependent on information systems, the stability and security of those systems become increasingly important concerns. The main pain points are detailed in the following sections.
As more users obtain services through information systems, the impact and losses are huge once the systems fail. Large-scale failures occur regularly across the industry and seriously affect the business. In 2020 alone, the following failures occurred:
Frequent failures cause economic losses, disrupt users' access to services, degrade the user experience, and hurt the credibility of enterprises, especially in industries with high security requirements.
As the business expands, the systems supporting it become more complex. This complexity shows up both in the system logic and in the growing R&D teams. For example, system failures can be caused by changes and configurations made by R&D and O&M personnel, equipment failures, misoperations, malicious damage, and program bugs. All of these make faults increasingly difficult to control. Worse, some faults cannot be fully controlled, and some occur even more frequently after control measures are applied.
According to a survey by Gartner Group, at least 40% of companies stopped operating after their information systems experienced major failures, and 1/3 of the remaining companies went bankrupt within two years. The security and stability of systems have become a life-or-death matter for enterprises. The variety of failures and the many potential system threats leave enterprise managers deeply uneasy.
The causes of system failures are complex, so single-point control methods are not enough to resolve them. A systematic solution is required. During the first Double 11 Global Shopping Festival, development and O&M engineers had to stay up all night to solve problems that could occur at any time, and some failures still occurred unexpectedly. By 2020, the number of users and the sales scale of the festival had grown far beyond those of the first festival, and the system had become more complex. Even so, the system guarantee process is smoother, and the number of people needed to provide the guarantee keeps decreasing. All of this is the result of a systematic solution.
Organizational design means setting up dedicated bodies at the organizational level to take charge of system stability and security, including a production safety committee at the highest level and stability leaders in the various R&D departments. The responsibilities of the production safety committee include making decisions about overall stability, formulating production safety rules, coordinating the response to major emergencies, cultivating a security culture, and planning and managing the global management and control system. When a fault occurs, the relevant personnel are responsible for emergency response and coordination. The stability leader of each R&D department is responsible for system risk handling and stability assurance to avoid system faults during R&D and O&M processes.
Nipping risks in the bud is the most effective way to stay safe. Pre-event risk prevention involves analyzing the components of a system, along with the threats and vulnerabilities each component faces, and using the results as input for security governance. Corresponding measures are then formulated to avoid or reduce threats, and vulnerabilities are hardened one by one. For example, configuration changes often cause system failures, so a unified change platform should be built to manage configuration change requests centrally and to control how changes are executed. In addition, the principle of least privilege is used to limit what operators can do, including the operation time, the operated object, and the operation scope. Moreover, for each configuration change, the system can calculate the potential risk from elements such as the operator, the operated object, and the operation type. Once a risk is detected, the current operation is blocked. A high-level risk triggers a cross-confirmation process, while a low-level risk lets the operation proceed directly. This way, risks are managed and controlled in real time, faults caused by human error are avoided, and R&D efficiency is balanced against production safety.
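The change-control flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the platform's actual implementation: the `Grant` and `ChangeRequest` types, the risk weights, and the threshold are all hypothetical. It reads the text as three outcomes: a request outside the operator's least-privilege grant is blocked, a high-risk request inside the grant goes to cross-confirmation, and a low-risk request is allowed.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    BLOCK = "block"
    CROSS_CONFIRM = "cross_confirm"
    ALLOW = "allow"


@dataclass(frozen=True)
class Grant:
    """Least-privilege grant: which objects, operations, and hours are permitted."""
    targets: frozenset
    operations: frozenset
    start_hour: int  # inclusive, 0-23
    end_hour: int    # exclusive


@dataclass(frozen=True)
class ChangeRequest:
    operator: str
    target: str
    operation: str
    hour: int  # hour of day at which the change would run


# Hypothetical weights and threshold, chosen only for illustration.
OPERATION_RISK = {"read": 0, "update": 2, "delete": 4}
TARGET_RISK = {"staging": 0, "production": 3}
HIGH_RISK_THRESHOLD = 5


def evaluate(request, grants):
    """Return a Decision for one configuration change request."""
    grant = grants.get(request.operator)
    in_scope = (
        grant is not None
        and request.target in grant.targets
        and request.operation in grant.operations
        and grant.start_hour <= request.hour < grant.end_hour
    )
    if not in_scope:
        return Decision.BLOCK  # outside the operator's privileges

    # Unknown operations or targets are scored as the riskiest known value.
    score = OPERATION_RISK.get(request.operation, 4) + TARGET_RISK.get(request.target, 3)
    if score >= HIGH_RISK_THRESHOLD:
        return Decision.CROSS_CONFIRM  # a second person must confirm
    return Decision.ALLOW
```

With a grant that covers staging and production during working hours, a staging update is allowed, a production delete requires cross-confirmation, and any request made outside the granted hours is blocked.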
Rapid fault detection is an important way to limit losses. While the system is running, problems can be detected by monitoring business metrics, applications, and cloud resources. Once a fault is found, the system notifies the relevant personnel, who handle it according to a plan formulated in advance. In addition, based on big data and artificial intelligence algorithms, the platform predicts the trend of relevant metrics in real time so that alerts can be sent earlier.
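As a rough illustration of alerting on a predicted trend rather than only on the current value, the sketch below fits a least-squares line to recent samples and extrapolates a few steps ahead. The linear fit is a deliberately simple stand-in for the big data and AI algorithms mentioned above, and the function names, threshold, and horizon are assumptions made for this example.

```python
from statistics import mean


def linear_forecast(samples, steps_ahead):
    """Extrapolate equally spaced samples with a least-squares line."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(samples)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, samples))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    # Evaluate the fitted line steps_ahead samples past the last one.
    return y_bar + slope * ((n - 1 + steps_ahead) - x_bar)


def check_metric(samples, threshold, steps_ahead=3):
    """Classify a metric as 'alert', 'early_warning', or 'ok'."""
    if samples[-1] >= threshold:
        return "alert"  # the threshold is already breached
    if linear_forecast(samples, steps_ahead) >= threshold:
        return "early_warning"  # the trend is expected to breach it soon
    return "ok"
```

For a CPU usage series of 50, 60, 70, 80 with a threshold of 90, the latest value is still below the threshold, but the fitted trend crosses it within three samples, so the check returns an early warning instead of waiting for the breach.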
Even when detailed plans have been formulated in advance, faults are difficult to avoid entirely. Once a failure occurs, quick recovery is the first priority. Depending on the type of fault, the available recovery methods include throttling, interception, circuit breaking, quick recovery, degradation, capacity expansion, traffic switchover, and system restart. Each recovery method requires proper system support as well as routine drills and tests.
After the system has recovered from the fault, the production safety committee also needs to organize relevant personnel to investigate and analyze the cause, formulate a rectification plan, determine who is responsible for the fault, and drive the implementation of the rectification plan to prevent the same fault from happening again.
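Two of the recovery methods listed above, circuit breaking and degradation, are often combined. The following is a minimal sketch of that combination under stated assumptions, not a production-ready component: the `CircuitBreaker` class, its parameters, and the injectable `clock` are hypothetical names introduced for this example.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency and serve a degraded result instead."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable so drills can fake time
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: skip the dependency entirely
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()  # degrade instead of propagating the error
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each dependency call as `breaker.call(fetch_live_price, read_cached_price)`, where both function names are placeholders. Passing a fake clock makes it practical to exercise the open and half-open paths during routine drills without waiting for real timeouts.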
The section above shows that it is difficult for enterprises to rely on a single means to solve system failures. Instead, they need a systematic solution with top-level organizational design, early risk analysis and strategy formulation, continuous monitoring and alerting in system runtime, and daily drills and emergency response afterward.
In traditional industries, China has formulated a series of measures to ensure normal business activities and to keep production processes in line with the prescribed material conditions and working procedures. These measures eliminate or control dangerous and harmful factors, reduce personal injuries and property losses, protect the safety and health of personnel, and prevent damage to facilities and the environment. Production safety is relatively mature and complete in construction, petrochemicals, transportation, aerospace, and other industries, but it remains largely absent in the Internet field. The following figure is an example of the production safety process in the mining industry. We can see that the management requirements for production safety have been implemented in all processes and stages of the operation.
We proposed a business system security engineering solution by referring to the production safety solutions of traditional industries and combining them with the best practices of Alibaba. This solution is a security guide that helps business systems prevent failures. Its goal is to reduce business system failures through prevention, monitoring and early alerting, emergency response, and other means. It ensures the stability, availability, and reliability of the business system and prevents asset losses and user impact caused by system failures.
Because business systems and the causes of their failures are complex, it is difficult to solve problems from only one or two perspectives. Guided by control theory and system theory, the business system security engineering solution uses risk control methods as tools to form its implementation framework, IPDRI (identify, protect, detect, recover, and improve). The solution aims to control pre-event, runtime, and post-event risks to form a closed-loop feedback network.
Identification includes asset analysis, threat identification, and vulnerability identification. Protection means taking preventive measures to keep risks from materializing. Detection means monitoring whether the system and its protection measures are operating normally. Recovery means quickly taking measures to restore the system when a failure occurs. Improvement means finding the cause of the failure and formulating an improvement plan to avoid the recurrence of the same failure.
Against this backdrop, Alibaba Cloud and CAICT drafted the cloud computing-based digital business security engineering standards. They are the first industry standards in China to focus on protecting the continuous and normal operation of systems. The core objective of the standards is to protect business systems from asset loss, reduce the user impact caused by business system failures, and guarantee the availability, stability, and reliability of the systems.
The standards specify various capabilities that enterprises need to ensure the continuous and normal running of their business systems, including organizational design, risk analysis and identification, strategy and control, monitoring and alerting, and emergency response.
All the capabilities are described below:
System security is affected by internal and external factors. For protecting enterprise systems against external factors, there are already sufficient theoretical research and tools. Currently, system failures are mostly caused by internal factors. The information system security engineering solution is a systematic solution to reduce system failures. In the future, theoretical research on this solution and corresponding products and services are expected to develop rapidly.