How to design a real-time security baseline engine for information security

1. What is security analysis

Over the past decade, big data technology has developed rapidly, and security analysis scenarios and methods have evolved along with it. The commonly used security analysis workflow is roughly divided into three stages:

* Log collection and parsing. Collect traffic logs, threat logs, operation logs, terminal logs, and other data from servers, gateway devices, terminals, databases, and other data sources through various means;

* Real-time security analysis. This stage mainly covers security baselines, correlation analysis, security rules, threat intelligence, and the security knowledge base;

* Security operations and situational awareness. Based on the results of security analysis, this stage provides capabilities such as situational visualization, security operations, multi-device security linkage, and security orchestration.

The security field has three main characteristics:

* Quick response. Security incidents such as vulnerability disclosures and virus outbreaks are often sudden, breaking out in a short period of time, so they must be handled quickly through effective and easy-to-use means, responding to customer needs and to security incidents as soon as they occur;

* Scenario customization. Security analysis differs somewhat from conventional big data analysis: it detects the abnormal rather than the normal, and the field has some unique requirements, so custom development is needed in many areas.

* Limited resources. Compared with conventional Internet big data platforms, there are many restrictions on the resources available for security analysis. Users are usually limited by budget and will compress and optimize the available computing and storage resources as much as possible, which leads to large numbers of components being deployed together on shared hardware; subsequent hardware and cluster expansion is also expensive, and the process is very long.

For real-time security analysis, this translates into five requirements:

* The first is real-time analysis. Security detection has strict latency requirements, and there is a time gap between attackers and defenders, so anomalies must be detected as early as possible. Because the work is driven by security events, the solution must launch quickly, respond in a timely manner, and build up protection capability in the shortest possible time;

* The second is rich analysis semantics. Security detection scenarios are usually complex, and rich security analysis semantics are needed to cover most of them;

* The third is flexible deployment. Customer environments are very diverse, so the engine must support various big data platforms, whether self-built by the customer or purchased cloud platforms, with the broadest possible version compatibility;

* The fourth is minimal resource usage. The engine must support deployments ranging from a single node to clusters of dozens or even hundreds of nodes, and support large scale while occupying as few resources as possible, for example running thousands of analysis rules or security baselines at the same time;

* The last is stable operation. Security products are deployed on the customer side and must run 24×7. The engine needs to provide extremely high operational stability, minimizing manual maintenance and intervention so that customers barely notice it is there.


Traditional security analysis methods are generally based on prior knowledge and use feature-based detection for anomaly detection: analysts examine the data or log characteristics of the monitored object, then manually summarize the detectable features to build a security model, such as security rules or threat intelligence. This approach is relatively simple and reliable and handles known attack methods well, but it cannot effectively detect unknown attack methods that have no corresponding detection features.

The security baseline instead adopts a behavior-based detection method: it uses the monitored object's data and logs to learn behavioral characteristics with various techniques, establishes a security baseline, and uses that baseline for anomaly detection.

To make the usage scenarios of security baselines easier to understand, here are three real scenarios:

Scenario 1: Abnormal logins by a DBA user account, such as an abnormal login location, an abnormal login time, or abnormal database usage behavior. For example, a DBA user usually logs in at certain times and from certain IPs or locations; if one day the account suddenly logs in from somewhere else at an unusual time, this may be an anomaly and an abnormal event should be generated (a minimal sketch of this idea follows the scenarios);

Scenario 2: The number of email attachments sent exceeds the normal value. For example, the security baseline learns the email-sending behavior of a department or the entire company; if the number of attachments sent by a certain user then differs significantly from the learned history, this may be an anomaly;

Scenario 3: Abnormal recent account logins to the VPN service. A security baseline is built by learning the historical VPN login behavior of user accounts; if abnormal account logins are found later, abnormal events are generated.
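
To make Scenario 1 concrete, here is a minimal sketch of the idea (the class, field names, and /24 grouping are invented for illustration, not the engine's implementation): the baseline is simply the set of (hour, subnet) combinations a user has been observed logging in from.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal sketch of a behavior baseline for Scenario 1: learn which
// (hour-of-day, /24 network) pairs a user normally logs in from, then
// flag logins that fall outside the learned set. Purely illustrative.
public class LoginBaseline {
    private final Map<String, Set<String>> learned = new HashMap<>();

    /** Learning phase: record an observed (user, hour, subnet) combination. */
    public void learn(String user, int hour, String ip) {
        learned.computeIfAbsent(user, u -> new HashSet<>()).add(key(hour, ip));
    }

    /** Detection phase: a login not matching any learned combination is anomalous. */
    public boolean isAnomalous(String user, int hour, String ip) {
        Set<String> profile = learned.get(user);
        return profile == null || !profile.contains(key(hour, ip));
    }

    private static String key(int hour, String ip) {
        // Collapse the IP to its /24 prefix so nearby addresses still match.
        String prefix = ip.substring(0, ip.lastIndexOf('.'));
        return hour + "@" + prefix;
    }

    public static void main(String[] args) {
        LoginBaseline baseline = new LoginBaseline();
        baseline.learn("dba01", 9, "10.0.0.5");   // usual morning login from the office
        baseline.learn("dba01", 10, "10.0.0.7");
        System.out.println(baseline.isAnomalous("dba01", 9, "10.0.0.9"));    // false: known hour + subnet
        System.out.println(baseline.isAnomalous("dba01", 3, "203.0.113.9")); // true: unknown hour + subnet
    }
}
```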

2. Choice of computing framework

There are currently two mainstream real-time computing frameworks, Spark and Flink. We originally designed this engine around 2018, when we evaluated three computing frameworks: Storm, Spark, and Flink. After weighing various factors, we chose Flink as the underlying computing framework. The Flink version at the time was about 1.4, which was relatively mature, and compared with the other frameworks its API and its underlying implementation of distributed stream computing fit our usage scenarios better.

Flink's advantages are prominent. It is a distributed computing framework with flexible deployment that adapts to today's common big data platforms. It offers good processing performance, achieving high throughput and low latency, which suits real-time security analysis well. It provides a flexible DataStream API that makes customized requirements easy to implement, and it supports easy-to-use checkpoint and savepoint mechanisms. Moreover, as a very popular computing framework, it has an active community and rich documentation and examples.
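
For reference, the programming model looks roughly like this. This is a toy sketch against the public Flink DataStream API, not the engine's actual code; the sample events are made up.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Toy Flink DataStream job showing the programming model the engine builds on:
// a source, chained transformations, a sink, and a job submission.
public class MinimalJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // In production the source would be a message queue; here, fixed elements.
        DataStream<String> logs = env.fromElements(
                "login user=dba ip=10.0.0.5",
                "login user=guest ip=203.0.113.9");

        logs.filter(line -> line.contains("user=dba"))   // select events of interest
            .map(line -> "candidate: " + line)
            .returns(Types.STRING)                       // help Flink's type extraction for the lambda
            .print();                                    // stand-in for a real sink

        env.execute("security-baseline-sketch");
    }
}
```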

Although Flink has many advantages, when enterprise resources are limited and the rule set reaches thousands of rules, Flink alone runs into many problems in meeting business and performance requirements: it has no large-scale rule semantics or flow optimization, no windows and logic customized for security scenarios, no operators related to security baselines, and no resource protection mechanisms.

3. Engine design

The engine application framework is divided into three layers:

* The bottom layer is the deployment layer, usually a big data cluster;

* The second layer is the security analysis layer, which builds the security baseline engine on top of the Flink DataStream API. Flink is responsible for the underlying distributed computing and event stream delivery, while the specific business calculations are performed by the security baseline engine. The interface the engine exposes to users consists of rules and a DSL: the user sends rules and DSL to the engine through this interface, the engine analyzes and computes over the event stream according to them, and it draws on external data as the rule semantics require, such as knowledge data, threat intelligence, assets, and vulnerabilities;

* The third layer is the application layer, through which users manage and use the engine and which builds concrete security services, such as situation analysis, security operations, and resource monitoring, on top of the engine's results.


The business process of the engine is divided into three parts: the user interface, the engine service, and the engine analysis task. Users perform rule configuration, baseline management, and operation monitoring through the user interface. The engine service provides rule delivery, baseline delivery, status monitoring, and other services in the form of a RESTful API. After the engine service receives a rule delivery request, it analyzes and optimizes the delivered rule set and then generates an analysis task code package. The analysis task code is submitted to the big data cluster to run. While running, the analysis task receives the baseline data delivered by the engine service and applies additions, deletions, and modifications to the runtime baselines. The analysis task also reports its running status to the engine service, which maps the task status into business monitoring information for user query and analysis.

Since most users are not R&D personnel, a security analysis language specifically optimized for security analysis scenarios must be provided. It needs the following properties:

* Easy to use, with a low learning cost, so that even a person without an R&D background can use it after brief study, and consistent with the intuitive thinking of security analysts;
* Rich data types: first, rich basic data types; second, direct support for data commonly used in security analysis, such as IP addresses, various time types, assets, vulnerabilities, threat intelligence, and geographic locations, so that users can work with these data without any customization;
* Rich semantics, especially enhancements and customizations of security analysis semantics;
* Extensibility: although the security analysis semantics provided are relatively comprehensive and cover most security analysis scenarios, some special scenarios cannot be supported directly and must be supported through extension mechanisms.

The security analysis language needs a dedicated compiler to compile and optimize the analysis statements and rules written by security analysts. The compiler must provide several features and optimizations:

* Common-expression optimization. Identical semantic logic across analysis statements is shared to reduce repeated calculation and computational complexity (a sketch follows this list);
* Reference data table optimization. A rule set may contain thousands of analysis statements and rules that reference large amounts of external table data, so table calculations need dedicated optimization, such as hash matching, large-scale IP matching, and large-scale regular-expression and string matching optimization;
* Constant-expression optimization, to improve runtime performance;
* Table reference optimization, consisting of two parts, merging of reference instances and merging of reference semantics, to reduce the resource consumption of referenced tables.
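
As a minimal sketch of the common-expression idea (the canonical-form strings and slot mechanism are invented for illustration), semantically identical expressions can be canonicalized to a key and assigned one shared computation slot:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of common-expression elimination: expressions with the same
// canonical form are assigned one shared slot, so the value is computed once
// per event no matter how many rules reference it. Hypothetical forms.
public class CommonExpressions {
    private final Map<String, Integer> slotByKey = new HashMap<>();

    /** Returns a shared slot id for semantically identical expressions. */
    public int slotFor(String canonicalForm) {
        return slotByKey.computeIfAbsent(canonicalForm, k -> slotByKey.size());
    }

    public static void main(String[] args) {
        CommonExpressions cse = new CommonExpressions();
        // Two rules both test geo(src_ip) == "CN"; they get the same slot.
        int a = cse.slotFor("eq(geo(field:src_ip),\"CN\")");
        int b = cse.slotFor("eq(geo(field:src_ip),\"CN\")");
        int c = cse.slotFor("gt(field:attachments,baseline)");
        System.out.println(a == b); // true: computed once, shared by both rules
        System.out.println(a == c); // false: a distinct expression
    }
}
```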

After the analysis statements and rules are compiled, running subgraphs are generated, with each statement or rule corresponding to one subgraph. All the rules' subgraphs then go through a joint graph optimization, which is divided into four steps:

The first step is graph fusion. All subgraphs in the rule set are fused into one running graph, followed by semantic fusion of graph nodes, time window merging, and optimization of shared public resources;

The second step is data flow optimization, which reduces the scale of the graph and the amount of transmitted data. It mainly performs operations such as moving keys forward, fusing semantically equivalent nodes, balancing network throughput, reducing data skew, merging nodes, and large-scale compression of the node count in very large graphs;

The third step is field pruning, which reduces the size of transmitted events and thereby the network I/O pressure. It mainly includes field derivation and pruning on the graph, and field merging (see the sketch after this list);

The last step is code generation, which generates code from the statement and rule semantics and maps the execution graph onto the Flink DataStream API.
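
A minimal sketch of step three, field pruning (the field names are made up; in the real compiler the referenced-field set would be derived from the rules rather than hard-coded):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Minimal sketch of field pruning: before an event is shipped to the next
// graph node, drop every field the downstream operators never reference.
public class FieldPruner {
    public static Map<String, Object> prune(Map<String, Object> event, Set<String> referenced) {
        Map<String, Object> slim = new HashMap<>();
        for (String field : referenced) {
            if (event.containsKey(field)) {
                slim.put(field, event.get(field));
            }
        }
        return slim;
    }

    public static void main(String[] args) {
        Map<String, Object> event = new HashMap<>();
        event.put("user", "dba01");
        event.put("src_ip", "10.0.0.5");
        event.put("raw_payload", "...large blob, unused downstream...");

        // Downstream rules only reference these two fields.
        Map<String, Object> slim = prune(event, Set.of("user", "src_ip"));
        System.out.println(slim); // raw_payload is gone, cutting network IO
    }
}
```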

A core element of real-time computing is time: different time-processing methods and implementations can produce very different, even completely different, calculation results. In real-time analysis, time mainly affects two features, the time window and the timeline.

In security analysis scenarios, the time window must support not only the usual sliding windows but also natural-time sliding windows, such as per year, per month, or per week, or even longer. It must also support fusion of duplicate data across cascaded windows to reduce the amount of stored data, automatically eliminate repeated calculations to avoid duplicate alarms, merge time timers, and handle out-of-order events correctly to avoid the miscalculations that disorder would cause.

The timeline can be divided into two categories, event occurrence time and processing time, which further extend to time precision; different precisions put very different pressure on processing performance and storage, for example in scenarios that need to sort by time. Since events may arrive out of order in real-time analysis, a delay (allowed lateness) must be supported to resolve most of the calculation inaccuracies caused by disorder. Some calculation scenarios involve converting between system time and event time, so conversion in both directions must be provided. And since the execution graph is fused from a large number of subgraphs, global and local time levels must be managed at the same time to ensure that the timeline on the graph advances correctly.
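
In Flink terms, the delay time for out-of-order events corresponds to a bounded-out-of-orderness watermark strategy. The sketch below assumes a simple LogEvent POJO and an arbitrary five-minute lateness bound; note that the natural-calendar windows mentioned above are not built into Flink and are among the customizations the engine adds.

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class TimeHandling {
    // Hypothetical event type carrying its own occurrence time.
    public static class LogEvent {
        public String user;
        public long occurredAtMillis;
    }

    public static void attachTime(DataStream<LogEvent> events) {
        // Event time + bounded lateness: events up to 5 minutes out of order are
        // still assigned to the correct window before the timeline advances.
        DataStream<LogEvent> timed = events.assignTimestampsAndWatermarks(
                WatermarkStrategy
                        .<LogEvent>forBoundedOutOfOrderness(Duration.ofMinutes(5))
                        .withTimestampAssigner((event, previous) -> event.occurredAtMillis));

        // An ordinary sliding event-time window, shown for shape only; calendar
        // windows (per month, per week) require custom window assigners.
        timed.keyBy(e -> e.user, Types.STRING)
             .window(SlidingEventTimeWindows.of(Time.days(7), Time.days(1)));
    }
}
```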


Security baselines fall into three categories:

The first category is statistical security baselines, including the common baselines of time, frequency, space, range, and multi-level statistics;

The second is the sequence category, such as exponential smoothing (sketched below) and periodic security baselines;

The third is machine learning security baselines, such as baselines built with clustering algorithms or decision trees.
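
As a minimal sketch of a sequence-class baseline (the alpha value and the 3x deviation band are illustrative choices, not the engine's), exponential smoothing tracks a "normal" level and a typical deviation, and flags points far outside the band:

```java
// Minimal sketch of a sequence-class baseline using exponential smoothing:
// the smoothed level tracks "normal", and a point far from it is anomalous.
public class SmoothingBaseline {
    private final double alpha;
    private double level;      // smoothed estimate of the normal value
    private double deviation;  // smoothed estimate of typical deviation
    private boolean primed;

    public SmoothingBaseline(double alpha) {
        this.alpha = alpha;
    }

    /** Learning: fold a new observation into the baseline. */
    public void learn(double x) {
        if (!primed) {
            level = x;
            primed = true;
            return;
        }
        deviation = alpha * Math.abs(x - level) + (1 - alpha) * deviation;
        level = alpha * x + (1 - alpha) * level;
    }

    /** Detection: flag values outside a band around the smoothed level. */
    public boolean isAnomalous(double x) {
        return primed && Math.abs(x - level) > 3 * Math.max(deviation, 1e-9);
    }

    public static void main(String[] args) {
        SmoothingBaseline attachments = new SmoothingBaseline(0.3);
        for (double daily : new double[]{4, 5, 3, 6, 4, 5}) attachments.learn(daily);
        System.out.println(attachments.isAnomalous(5));   // false: near the learned level
        System.out.println(attachments.isAnomalous(120)); // true: far above normal
    }
}
```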

The baseline processing flow is mainly divided into three parts, baseline learning, baseline detection, and baseline routing, interleaved with processes such as event filtering, time windows, baseline noise reduction, and baseline management. The baseline learning process reads the event stream from the message queue and from storage; after event filtering and time window aggregation, the event stream may still contain noisy data, so a noise reduction step is required. Finally, the learning process learns from the input events and generates the corresponding security baseline. After passing through baseline management, the learned baseline is used for prediction and anomaly detection; if abnormal behavior is found, an abnormal event is generated and output to the downstream processing flow for later business use. During use, users may need to modify or delete some learned baselines or create new ones. These additions, deletions, and modifications are handled by the baseline routing function, which routes the baselines edited by the user onto the graph and distributes them accurately to the corresponding graph node instances.

The baseline lifecycle is divided into four phases, learn, ready, close, and expire:

learn is the learning phase, during which the baseline learns from the input event stream;

ready indicates that the current timeline has reached the baseline's learning deadline, but because of the allowed delay the baseline must wait for that delay time; during this period the baseline can continue to learn from late events and can already be used for anomaly detection;

close indicates that the current timeline has passed the delay time; the baseline no longer learns from input events and is used only for anomaly detection;

expire indicates that the current timeline has reached the baseline's expiry time; the baseline must stop being used for anomaly detection and be deleted.
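
The four phases can be read as a small state machine driven by the timeline (for example, the event-time watermark). A sketch, with illustrative timestamps:

```java
// Minimal sketch of the four-phase baseline lifecycle driven by the timeline.
public class BaselineLifecycle {
    public enum Phase { LEARN, READY, CLOSE, EXPIRE }

    private final long learnEnd;   // end of the learning period
    private final long delay;      // extra time during which late events still count
    private final long expireAt;   // after this, the baseline is deleted

    public BaselineLifecycle(long learnEnd, long delay, long expireAt) {
        this.learnEnd = learnEnd;
        this.delay = delay;
        this.expireAt = expireAt;
    }

    /** Map the current timeline position to a lifecycle phase. */
    public Phase phaseAt(long watermark) {
        if (watermark < learnEnd) return Phase.LEARN;           // still learning
        if (watermark < learnEnd + delay) return Phase.READY;   // learns late events, already detects
        if (watermark < expireAt) return Phase.CLOSE;           // detection only
        return Phase.EXPIRE;                                    // stop detecting, delete
    }

    public boolean acceptsLearning(long watermark) {
        Phase p = phaseAt(watermark);
        return p == Phase.LEARN || p == Phase.READY;
    }

    public boolean acceptsDetection(long watermark) {
        Phase p = phaseAt(watermark);
        return p == Phase.READY || p == Phase.CLOSE;
    }
}
```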

Baseline calculation is triggered in two ways:

The first is event-triggered calculation: each incoming event triggers an anomaly detection calculation;

The second is time-triggered calculation: the baseline period registers a timer, and when the timer fires, the related baseline calculation process is triggered.
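
In Flink, these two triggers map naturally onto a KeyedProcessFunction: processElement handles the event-triggered path, and a registered event-time timer handles the time-triggered path. A sketch (events are raw log lines keyed by user; the one-hour period is an assumed parameter):

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Sketch: event-triggered detection in processElement, time-triggered
// baseline maintenance in onTimer.
public class TriggerSketch extends KeyedProcessFunction<String, String, String> {
    private static final long PERIOD_MS = 60L * 60 * 1000; // illustrative one-hour period
    private transient ValueState<Long> nextFire;

    @Override
    public void open(Configuration parameters) {
        nextFire = getRuntimeContext().getState(
                new ValueStateDescriptor<>("next-fire", Long.class));
    }

    @Override
    public void processElement(String event, Context ctx, Collector<String> out) throws Exception {
        // Trigger 1: every incoming event runs anomaly detection immediately.
        out.collect("detect: user=" + ctx.getCurrentKey());

        // Register the periodic event-time timer once per key.
        if (nextFire.value() == null && ctx.timestamp() != null) {
            long fireAt = ctx.timestamp() + PERIOD_MS;
            ctx.timerService().registerEventTimeTimer(fireAt);
            nextFire.update(fireAt);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        // Trigger 2: the watermark passed the timer; run the periodic baseline
        // calculation (e.g. a lifecycle phase transition) and re-arm the timer.
        out.collect("periodic baseline work: user=" + ctx.getCurrentKey());
        long fireAt = timestamp + PERIOD_MS;
        ctx.timerService().registerEventTimeTimer(fireAt);
        nextFire.update(fireAt);
    }
}
```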

The output of the baseline is divided into baseline abnormal-event output and baseline content output:

Baseline abnormal-event output occurs during baseline anomaly detection: when an abnormal event is found, the corresponding event is output;

Baseline content output occurs after baseline learning completes: the baseline itself is output, for baseline editing and for anomaly analysis of the baseline itself.


During use, users may often edit existing baselines or create new security baselines for specific scenarios based on analysis and data. After a baseline is edited, it must be delivered to the baseline engine, which raises the question of how to edit and update baselines online.

First, the baseline must be editable. The analysis language must support baseline-editing semantics, the baseline data structure must be designed to support those semantics, and a visual baseline editing workflow must be provided, including baseline display, modification, and deletion, so that users can edit and deliver baselines directly from the page;

Second, the baseline must be routable. The actual execution graph of the analysis statements and rules after compilation and graph optimization differs greatly from the rules displayed on the page. A routable baseline requires building a global baseline update flow graph at compile time and a set of runtime baseline routing methods, including routing flow construction over the execution graph and support for broadcast and directed routing, so that baseline data is distributed accurately (see the broadcast-state sketch after this list);

Finally, the baseline must be updatable. A clear set of baseline update semantics is needed, defining the baseline operation cycle and calculation method. During a baseline update, an exception may occur at any point and cause the update to fail, so a mechanism must be designed that feeds failure information from any location back to the user for error localization and repair.
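
One plausible way to implement the routable requirement on Flink (a sketch under assumptions; the event and edit formats are invented) is broadcast state: baseline edits flow on a broadcast channel so that every parallel instance receives them and applies the entries relevant to its keys.

```java
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

// Sketch of baseline routing with Flink broadcast state: user baseline edits
// travel on a broadcast side channel to every parallel instance.
public class BaselineRouting {
    static final MapStateDescriptor<String, String> EDITS = new MapStateDescriptor<>(
            "baseline-edits", BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO);

    public static DataStream<String> wire(DataStream<String> events, DataStream<String> edits) {
        BroadcastStream<String> editChannel = edits.broadcast(EDITS);

        return events.connect(editChannel).process(
                new BroadcastProcessFunction<String, String, String>() {
                    @Override
                    public void processElement(String event, ReadOnlyContext ctx, Collector<String> out)
                            throws Exception {
                        // Hypothetical event format "user|payload": prefer an edited baseline.
                        String user = event.split("\\|", 2)[0];
                        String edited = ctx.getBroadcastState(EDITS).get(user);
                        out.collect(edited != null
                                ? "detect with edited baseline: " + user
                                : "detect with learned baseline: " + user);
                    }

                    @Override
                    public void processBroadcastElement(String edit, Context ctx, Collector<String> out)
                            throws Exception {
                        // Hypothetical edit format "user|baseline-json": store or overwrite.
                        String[] parts = edit.split("\\|", 2);
                        ctx.getBroadcastState(EDITS).put(parts[0], parts[1]);
                    }
                });
    }
}
```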

In the baseline learning process, the learning period is usually relatively long, such as the last week or the last month. Long-period learning usually faces a data segmentation problem. For example, suppose the baseline learns from the last week's data but today is Wednesday: the last week's data is split into two parts, where the data up to Tuesday is already in historical data storage while the data from Wednesday onward arrives in real time. This involves fusing historical and real-time data, and there are three cases:

The first is that all the data to be learned is historical, which requires support for historical data learning-range detection and online baseline updates;

The second is that all the data to be learned is real-time, which requires support for automatic baseline learning, automatic baseline detection, and automatic baseline updates;

The third is the most complex case, just illustrated: the fusion of historical and real-time data, which requires support for dividing the boundary between historical and real-time data, baseline fusion, and deduplication.
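
A minimal sketch of the third case (all names and the one-minute seam window are illustrative): the learning window is split at task start, and event ids near the seam are tracked so that overlapping reads from storage and the stream are not learned twice.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of historical/real-time fusion: the learning window is split at the
// task start time; the historical part is bulk-read from storage, the rest
// comes from the live stream, and event ids near the seam are deduplicated.
public class HistoryRealtimeFusion {
    private final long boundaryMillis;              // task start = history/real-time seam
    private final Set<String> seenNearSeam = new HashSet<>();

    public HistoryRealtimeFusion(long boundaryMillis) {
        this.boundaryMillis = boundaryMillis;
    }

    /** Returns true if the event should be learned, false if it is a seam duplicate. */
    public boolean admit(String eventId, long eventTimeMillis) {
        // Only events close to the seam can appear in both sources; track those ids.
        if (Math.abs(eventTimeMillis - boundaryMillis) < 60_000) {
            return seenNearSeam.add(eventId);       // add() returns false for duplicates
        }
        return true;
    }

    public boolean isHistorical(long eventTimeMillis) {
        return eventTimeMillis < boundaryMillis;    // read from storage, not the stream
    }
}
```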

The data used for baseline learning usually contains some noise. The noise may be a genuinely abnormal operation, such as an abnormal user login, or incorrect data introduced during collection. Noise therefore needs to be eliminated to improve baseline accuracy and reduce false positives.

Data noise reduction can be roughly divided by data type into numeric and non-numeric noise reduction, and the two are handled differently. There are four main ways to judge noise:

The first judges against the current cycle's data: a value is compared with the other data in the same cycle to decide whether it is noise (sketched below);

The second compares with the previous cycle: a value is compared with the most recent cycle's data to decide whether it is noise;

The third compares with historical data: a value is compared with all historical data to decide whether it is noise;

Finally, the user can define custom noise judgment logic, such as declaring values above or below a given threshold to be noise.
When reducing noise, relevant data usually has to be saved. For example, using historical data for noise determination requires storing key historical data, and historical data is usually voluminous. To reduce storage, the noise reduction data structures need to be optimized, for example by minimizing the key data kept for noise reduction and by pruning fields.
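
As an example of the first judgment method for numeric data (a sketch; the median-absolute-deviation statistic and 3.0 cutoff are illustrative choices, not the engine's fixed algorithm):

```java
import java.util.Arrays;

// Sketch of numeric noise reduction within one learning cycle using the
// median absolute deviation (MAD), which tolerates outliers better than a
// mean/stddev test.
public class CycleNoiseFilter {
    /** Marks which values of one learning cycle look like noise. */
    public static boolean[] noiseMask(double[] cycle) {
        double median = median(cycle.clone());
        double[] dev = new double[cycle.length];
        for (int i = 0; i < cycle.length; i++) dev[i] = Math.abs(cycle[i] - median);
        double mad = Math.max(median(dev), 1e-9);

        boolean[] noise = new boolean[cycle.length];
        for (int i = 0; i < cycle.length; i++) {
            noise[i] = Math.abs(cycle[i] - median) / mad > 3.0;
        }
        return noise;
    }

    private static double median(double[] values) {
        Arrays.sort(values);
        int mid = values.length / 2;
        return values.length % 2 == 1 ? values[mid] : (values[mid - 1] + values[mid]) / 2.0;
    }

    public static void main(String[] args) {
        // Daily attachment counts with one anomalous spike to be excluded from learning.
        double[] week = {4, 5, 3, 6, 4, 120, 5};
        System.out.println(Arrays.toString(noiseMask(week))); // only the 120 is flagged
    }
}
```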

A very important part of engine operation is how to monitor and protect resources. This involves three aspects:

The first is stability enhancement. Memory usage must be monitored dynamically while baselines run. Hundreds or even thousands of baseline rules may run in the engine at the same time, so the memory usage of each baseline rule must be observable. For rules with abnormal memory usage, memory protection measures are needed, such as deleting some data or isolating the rule so that the operation of other, normal rules is not affected. Resource priority management can be applied when deleting: if a rule has relatively low priority and occupies a lot of resources, its resources can be reduced, even to the point of disabling the rule. The engine also monitors the baseline calculation process; if a slow path that seriously degrades graph processing performance is found, the subgraph corresponding to the slow path is isolated to prevent other analysis flows from being affected;

The second is status monitoring, which consists of two parts. First, the engine reports the status data of all computing nodes in the execution graph, such as CPU, memory, disk, and input/output, to the monitoring service. Second, the monitoring service processes this data and maps the execution graph's runtime information to rule state information, completing the transformation from graph state to business state. For large execution graphs and highly concurrent analysis tasks, the status reporting process itself must be optimized to reduce resource consumption;

The third is flow control. The engine's downstream business may include relatively slow processing steps, such as database writes, so flow control is needed to prevent a fast upstream flow from feeding too much data into a slow downstream flow, which would cause excessive resource consumption and lag. Flow control must support active flow control, passive flow control, and time-window-related flow control, and through user configuration or automatic handling it solves the data loss and system instability caused by mismatched upstream and downstream processing performance.
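
As one concrete shape for active flow control (a sketch; Flink also applies its own backpressure, and the rates here are arbitrary), a token bucket lets the upstream pace itself to a slow sink such as a database writer:

```java
// Sketch of active flow control with a token bucket: the fast upstream asks
// for a token per event before handing it to a slow sink; when the bucket is
// empty the caller waits, smoothing the input rate.
public class TokenBucket {
    private final double ratePerSecond;
    private final double capacity;
    private double tokens;
    private long lastRefillNanos;

    public TokenBucket(double ratePerSecond, double capacity) {
        this.ratePerSecond = ratePerSecond;
        this.capacity = capacity;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    /** Blocks until one token is available, then consumes it. */
    public synchronized void acquire() throws InterruptedException {
        while (true) {
            refill();
            if (tokens >= 1.0) {
                tokens -= 1.0;
                return;
            }
            wait(10); // re-check shortly; simple but adequate for a sketch
        }
    }

    private void refill() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefillNanos) / 1e9 * ratePerSecond);
        lastRefillNanos = now;
    }

    public static void main(String[] args) throws InterruptedException {
        TokenBucket dbWriteLimit = new TokenBucket(1000, 100); // ~1000 writes/s, burst of 100
        for (int i = 0; i < 5; i++) {
            dbWriteLimit.acquire();
            System.out.println("write " + i); // stand-in for the slow database write
        }
    }
}
```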

Users often need to operate on rules during use. These operations cause running tasks to start and stop, and the data must remain consistent across a restart; saved data must not be lost because of it.

Flink itself supports reloading state when a task restarts, but the problem is more complicated in the baseline engine, because users may disable, enable, or modify rules, which changes the rule set and in turn changes the execution graph. For the unchanged rules to load their data correctly from a savepoint after a restart, the graph must have local state stability: local changes to the graph during graph optimization must not affect other subgraphs, code generation must produce stable code for stable subgraphs, and a changed rule must affect only the subgraphs related to it while all unchanged rules remain unaffected.

During baseline learning, a large amount of intermediate data is usually saved. To speed up savepoints and checkpoints, serialization and deserialization of complex data structures must be optimized and incremental state must be supported. The engine service usually provides analysis services to multiple users, so multi-user, multi-task state must also be managed to ensure that each task is accurately associated with its own state data.
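
At the Flink level, two of the knobs behind these requirements are incremental RocksDB checkpoints for large baseline state and a stable uid per rule subgraph, so that unchanged rules still find their state in savepoints when the rule set, and therefore the graph, changes. A sketch with illustrative names and intervals:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Sketch of Flink-level state settings: incremental checkpoints plus a stable
// uid per rule subgraph so savepoint state survives unrelated rule edits.
public class StateConfig {
    public static void configure(StreamExecutionEnvironment env, DataStream<String> events) {
        env.enableCheckpointing(60_000); // checkpoint every 60 s
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true)); // true = incremental checkpoints

        events
            .keyBy(line -> line.split("\\|", 2)[0], Types.STRING) // hypothetical "user|payload" lines
            .map(line -> "detected: " + line)                     // stand-in for one rule's subgraph
            .returns(Types.STRING)
            .uid("baseline-rule-42")  // stable id: unrelated rule edits do not invalidate this state
            .name("baseline-rule-42");
    }
}
```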


4. Practice and prospects

The real-time security analysis capability provided by the engine serves most of the company's big data products, such as the big data and security operations platform, situational awareness, EDR, cloud security, industrial control Internet, and intelligent security. With the deployment of these products, the engine now serves nearly a thousand customers, including central enterprises, governments, banks, and public security, and it also supports common domestic localized systems and various private clouds. Deployment environments range from a single node to clusters of several hundred nodes, and event volumes range from hundreds to millions of EPS. The engine has also participated in and supported hundreds of special security operations by ministries and central enterprises.

With the spread of knowledge and the frequent disclosure of security vulnerabilities, attack methods and security threats emerge in an endless stream, demanding ever higher security analysis capability. The engine must be continuously updated and optimized to improve its attack detection capability, and more and better behavior-learning algorithms and techniques must be combined with security baselines to improve their detection power. At the same time, we hope to give some of the engine's practices back to the community through appropriate channels, so that more people can make use of the good designs and practices.
