Security Engineering in Business Systems

Part 20 of this 27-part series discusses some theoretical research and tools to prevent the enterprise system from being affected by external factors.

This article is from the Alibaba DevOps Practice Guide, written by the Alibaba Cloud Yunxiao Team.

Enterprises are moving toward digitalization, and information systems play an increasingly important role in business. The earliest information systems mainly supported business development and improved operational efficiency. They then gradually became part of the business itself, providing commercial value and cost advantages. With the implementation of the national digitalization policy, the role of information systems has become even more prominent: they now lead business development and provide competitive advantages.

Challenges

As businesses become increasingly dependent on information systems, the stability and security of those systems matter more and more. The main pain points are detailed in the following sections.

System Problems Occurring Frequently

As more users obtain services through information systems, the impact and losses are huge once the systems fail. Large-scale failures occur regularly across the industry and seriously affect the businesses involved. In 2020 alone, the following failures occurred:

On February 23, Weimob's database was maliciously deleted. The service stopped for seven days, and the market value evaporated by more than 1 billion yuan.
On May 13, Tesla's service system went down, and its market value evaporated by 280 billion yuan overnight.
On June 3, Apple's iCloud cloud storage server failed, and users could not log in.
On August 27, Cisco employees removed virtual machines, causing Cisco to lose 16 million yuan.
On December 25, Google's services went down globally.

Frequent failures cause economic losses, disrupt service and degrade the user experience, and hurt the credibility of enterprises, especially in industries with high security requirements.

Complex, Difficult to Control Factors

As the business expands, the systems supporting the business become more complex. This complexity is reflected in both the system logic and the growing R&D teams. For example, system failures can be caused by changes and configurations made by R&D and O&M personnel, equipment failures, misoperations, malicious damage, and program bugs. All of these make faults increasingly difficult to control. Worse, some faults cannot be fully controlled, or occur even more frequently after control measures are applied.

Managers Lacking a Sense of Security

According to a survey by Gartner Group, at least 40% of companies ceased operations after their information systems experienced major failures, and one-third of the remaining companies went bankrupt within two years. The security and stability of systems have become a life-or-death matter for enterprises. Frequent failures and the many potential threats to their systems leave enterprise managers feeling deeply insecure.

Solutions

The causes of system failures are complex, so single-point controls cannot resolve the problems; a systematic solution is required. During the first Double 11 Global Shopping Festival, development and O&M engineers had to stay up all night to solve problems that could occur at any time, and some failures still occurred unexpectedly. By 2020, the number of users and the sales volume of the festival had grown far beyond those of the first festival, and the system had become more complex. Nevertheless, the system assurance process is smoother, and the number of staff needed for assurance keeps decreasing. All of this is the result of a systematic solution.

Top-Level Organization Design

Organizational design means setting up dedicated bodies at the organizational level to take charge of system stability and security, including a production safety committee at the highest level and stability leaders in the various R&D departments. The responsibilities of the production safety committee include making decisions about overall stability, formulating production safety rules, coordinating overall emergencies, cultivating a security culture, and planning and managing the global management and control system. When a fault occurs, the relevant personnel are responsible for emergency response and coordination. The stability leader of each R&D department is responsible for system risk handling and stability assurance to avoid system faults during R&D and O&M processes.

Pre-Event Risk Prevention

Nipping risks in the bud is the most effective way to stay safe. Pre-event risk prevention involves analyzing the components of a system and the possible threats to and vulnerabilities of each component, and using the analysis results as the input for security governance. Corresponding measures are then formulated to avoid or reduce threats, and vulnerabilities are hardened individually.

For example, configuration changes often cause system failures, so a unified change platform should be built to centrally manage configuration change requests and control configuration change operations. In addition, the principle of least privilege is used to limit operators' privileges, including the operation time, the operated object, and the operation scope. Moreover, for each configuration change operation, the system can calculate the potential risk from elements such as the operator, the operated object, and the operation type. If a definite risk is detected, the operation is blocked outright. If the risk is high, a cross-confirmation process is initiated. If the risk is low, the operation is allowed directly. This way, risks are managed and controlled in real time, faults caused by human error are avoided, and R&D efficiency is balanced against production safety.
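The risk-based gating of change operations described above could be sketched as follows. This is only an illustration under stated assumptions, not the actual change platform: the names ChangeRequest, Grant, and evaluate, the risk weights, and the score thresholds are all hypothetical, and the three outcomes are read here as three tiers (block, cross-confirm, allow).

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    CROSS_CONFIRM = "cross_confirm"
    BLOCK = "block"


@dataclass(frozen=True)
class ChangeRequest:
    operator: str  # who is making the change
    target: str    # operated object, e.g. a config key or service name
    op_type: str   # operation type, e.g. "update" or "delete"
    hour: int      # hour of day (0-23) at which the change is submitted


@dataclass(frozen=True)
class Grant:
    """Least-privilege grant: permitted objects, operation types, and hours."""
    targets: frozenset
    op_types: frozenset
    start_hour: int
    end_hour: int  # exclusive

    def permits(self, req: ChangeRequest) -> bool:
        return (
            req.target in self.targets
            and req.op_type in self.op_types
            and self.start_hour <= req.hour < self.end_hour
        )


# Hypothetical weights; a real platform would tune these from incident history.
OP_TYPE_RISK = {"read": 0, "update": 2, "delete": 5}
CORE_TARGETS = {"payment-db", "order-service"}


def risk_score(req: ChangeRequest) -> int:
    # Unknown operation types are treated as medium risk.
    score = OP_TYPE_RISK.get(req.op_type, 3)
    if req.target in CORE_TARGETS:
        score += 3
    return score


def evaluate(req: ChangeRequest, grants: dict) -> Decision:
    grant = grants.get(req.operator)
    if grant is None or not grant.permits(req):
        return Decision.BLOCK          # outside the operator's privileges
    score = risk_score(req)
    if score >= 8:
        return Decision.BLOCK          # risk too high to proceed at all
    if score >= 5:
        return Decision.CROSS_CONFIRM  # high risk: require a second reviewer
    return Decision.ALLOW              # low risk: proceed directly
```

With a grant that lets an operator update or delete two objects during working hours, updating a non-core object is allowed, updating a core object needs cross-confirmation, and deleting a core object or acting outside the permitted hours is blocked.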

Real-Time Monitoring during System Runtime

Rapid fault detection is an important way to limit losses. While the system is running, problems can be detected by monitoring business metrics, applications, and cloud resources. Once a fault is found, the system notifies the relevant personnel to handle it according to a plan formulated in advance. In addition, using big data and artificial intelligence algorithms, the platform predicts the trend of relevant metrics in real time so that alerts can be sent even earlier.
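As a rough illustration of sending alerts before a threshold is actually crossed, the sketch below fits a straight line to recent metric samples and extrapolates it a few intervals ahead. It is a deliberately simple stand-in: the article does not say which prediction algorithms the platform uses, and the function names, the evenly spaced samples, and the linear model are all assumptions.

```python
from typing import Optional, Sequence, Tuple


def linear_trend(samples: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for samples taken at t = 0, 1, 2, ..."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(samples))
    var = sum((t - mean_t) ** 2 for t in range(n))
    slope = cov / var
    return slope, mean_y - slope * mean_t


def check_metric(
    samples: Sequence[float], threshold: float, horizon: int
) -> Optional[str]:
    """Return "alert" if the latest sample breaches the threshold,
    "early_warning" if the fitted trend reaches it within `horizon`
    future intervals, and None otherwise."""
    if samples[-1] >= threshold:
        return "alert"
    slope, intercept = linear_trend(samples)
    predicted = intercept + slope * (len(samples) - 1 + horizon)
    if predicted >= threshold:
        return "early_warning"
    return None
```

For instance, CPU usage samples of 50, 55, 60, 65, and 70 percent are still below a 90 percent threshold, but the fitted trend reaches it within five more intervals, so the check returns an early warning instead of waiting for the breach.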

Post-Event Quick Recovery

Even with detailed plans formulated in advance, faults are difficult to avoid entirely. Once a failure occurs, quick recovery is the first priority. Depending on the type of fault, the available recovery methods include throttling, interception, circuit breaking, quick recovery, degradation, capacity expansion, traffic switchover, and system restart. Each recovery method requires proper system support as well as routine drills and tests.
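Two of the recovery methods listed above, circuit breaking and degradation, can be illustrated with a short sketch. This is a minimal, single-threaded example under stated assumptions rather than a production component: the class and function names, the failure threshold, and the reset timeout are hypothetical, and the clock is injectable only so the behavior can be exercised without waiting.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency is failing; call rejected")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def call_with_degradation(breaker, func, fallback):
    """Degradation: serve a fallback value when the call fails or is rejected."""
    try:
        return breaker.call(func)
    except Exception:
        return fallback
```

While the circuit is open, the failing dependency is not called at all and callers receive the fallback (for example, cached data), which gives the dependency time to recover. After the reset timeout, a single trial call decides whether the circuit closes again.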

After the fault is recovered, the production safety committee also needs to organize relevant personnel to investigate and analyze the cause of the fault, formulate a rectification plan, determine the person responsible for the fault, and promote and implement the rectification plan to prevent the same fault from happening again.

The section above shows that it is difficult for enterprises to rely on a single means to solve system failures. Instead, they need a systematic solution with top-level organizational design, early risk analysis and strategy formulation, continuous monitoring and alerting in system runtime, and daily drills and emergency response afterward.

In traditional industries, China has formulated a series of measures to ensure normal business activities and keep production processes in line with the prescribed material conditions and working procedures. These measures eliminate or control dangerous and harmful factors, reduce personal injuries and property losses, ensure personnel safety and health, and protect facilities and the environment from damage. Production safety is relatively mature and complete in construction, petrochemical, transportation, aerospace, and other industries, but it is still largely absent in the Internet field. The following figure is an example of the production safety process in the mining industry. We can see that the management requirements for production safety have been implemented in all processes and stages of the operation.

We propose a business system security engineering solution that draws on production safety solutions in traditional industries and combines them with the best practices of Alibaba. This solution is a security guide to help business systems prevent failures. Its goal is to reduce business system failures through prevention, monitoring and early alerting, emergency response, and other means. It ensures the stability, availability, and reliability of the business system and prevents asset losses and user impacts caused by system failures.

Security Engineering Framework for Business Systems

Because both business systems and the causes of their failures are complex, it is difficult to solve problems from only one or two perspectives. Guided by control theory and system theory, the business system security engineering solution uses risk control methods as tools to form its implementation framework, IPDRI (identify, protect, detect, recover, and improve). The solution aims to control pre-event, runtime, and post-event risks to form a closed-loop feedback network.

Identification includes asset analysis, threat identification, and vulnerability identification. Protection refers to taking preventive measures to avoid the occurrence of risks. Detection means monitoring whether the system and the protection measures are working normally. Recovery means quickly taking measures to restore the system when a failure occurs. Improvement means finding the cause of the failure and formulating an improvement plan to avoid the recurrence of the same failure.

Security Engineering Standards for Business Systems

Against this backdrop, Alibaba Cloud and CAICT drafted the cloud computing-based digital business security engineering standards. They are the first industry standards in China to focus on protecting the continuous and normal operation of systems. The core objective of the standards is to protect business systems from asset loss, reduce the user impact caused by business system failures, and guarantee the availability, stability, and reliability of the systems.

The standards specify various capabilities that enterprises need to ensure the continuous and normal running of their business systems, including organizational design, risk analysis and identification, strategy and control, monitoring and alerting, and emergency response.

All the capabilities are described below:

Organizational design stipulates that enterprises shall set up a top-level production safety committee and a subordinate production safety department, use technical means to improve risk control and ensure business stability, create a production safety culture so that everyone is aware of safety and improves continuously, clearly define the code of conduct, use mechanisms to keep people from making mistakes, and reduce losses. This allows enterprises to quickly promote stability management and significantly reduce global and major failures.
The risk analysis and identification module helps enterprises analyze system vulnerabilities, production safety requirements, and existing system failures to identify the potential risks that affect the production safety of information systems.
The strategy and control module is used to formulate production safety control strategies for the risks detected through analysis. This module prevents risks by reducing or preventing threats and consolidating or eliminating vulnerabilities in advance.
The monitoring and alerting module quickly detects risks through capabilities such as business status monitoring, cloud resource status monitoring, big data risk analysis and alerting, and alert management.
The emergency response module provides enterprises with the response and recovery capabilities that must be available to shorten failure time and quickly recover the system. Disaster recovery drills, traffic switchover, throttling, interception, degradation, restart, and scale-out are included.
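Of the emergency response capabilities in the list above, throttling is one of the simplest to illustrate. The sketch below uses a token bucket, a common throttling technique chosen here for illustration; the standards themselves do not prescribe an algorithm, and the class name, parameters, and injectable clock are assumptions.

```python
import time


class TokenBucket:
    """Admit requests at a sustained `rate` per second with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0) -> bool:
        # Refill according to the time elapsed since the last call, capped at capacity.
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # over the limit: reject or queue the request
```

During an incident, lowering the rate of such a limiter in front of an overloaded service sheds excess traffic while keeping part of the service available.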

Summary

System security is affected by internal and external factors. For protecting enterprise systems against external factors, sufficient theoretical research and tools already exist. Currently, system failures are mostly caused by internal factors. The information system security engineering solution is a systematic solution to reduce system failures. In the future, theoretical research on this solution and the corresponding products and services are expected to develop rapidly.
