
Simple Log Service: Intelligent inspection

Last Updated: May 31, 2025

The Intelligent Anomaly Analysis application of Simple Log Service provides model training and real-time inspection features. These features support automated, intelligent, and adaptive model training and anomaly detection for data such as logs and metrics. This topic describes the background information, workflow, features, terms, scheduling, and use scenarios of the intelligent inspection feature. This topic also provides suggestions on how to use the feature.

Important

The Intelligent Anomaly Analysis application in Simple Log Service is being phased out and will no longer be available on July 15, 2025 (UTC+8).

  • Impact scope

    Intelligent inspection, text analysis, and time series forecasting will no longer be available.

  • Feature replacement

    The preceding features can be fully replaced by the machine learning syntax, Scheduled SQL, and dashboard features of Simple Log Service. Documentation will be provided to help you configure feature-related settings.

Background information

Time series data, such as logs and metrics, accumulates continuously. For example, if a service generates 10 million data entries per day, about 3.6 billion data entries accumulate per year for the service. If you use static rules to inspect these data entries, you may encounter the following issues:

  • Low efficiency: To identify anomalies, you must manually configure various inspection rules based on your business requirements.

  • Low timeliness: Most time series data is time-sensitive. Faults and business changes alter the patterns of metrics over time, so an anomaly that a rule identifies at the current time may be considered normal behavior at a later time.

  • Complex configuration: Time series data comes in various forms, such as spikes, turning points, and periodic changes, and metrics of different types can have different valid threshold ranges. You may need to spend a large amount of time configuring inspection rules that can identify anomalies across all of these cases.

  • Low accuracy: Data streams vary based on business models, and static inspection rules can produce many false positives and false negatives. Tolerance for anomalies also differs by scenario: when you troubleshoot an issue, a large number of accurate alerts helps you locate the cause, whereas during routine alert handling, a small number of important alerts keeps you efficient.

To resolve these issues, Simple Log Service provides the intelligent inspection feature. This feature is integrated with the proprietary AI algorithms of Alibaba Cloud to aggregate, inspect, and generate alerts for streaming data such as logs and metrics. After you enable this feature, you need to specify only the entities and metrics that you want to inspect. You do not need to configure inspection rules. This feature can automatically identify anomalies, adapt to your business changes, and generate fine-grained alerts.

How it works

Simple Log Service uses SQL statements to construct and aggregate metrics, pulls data into algorithm models at regular intervals based on scheduling rules, writes inspection results to a destination logstore (internal-ml-log) in the event format, and sends alert notifications for anomalies. The following figure shows the workflow.

(Figure: intelligent inspection workflow)
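For example, a query statement similar to the following sketch could aggregate raw access logs into per-entity metrics at a one-minute observation granularity. The host and latency fields are assumptions about the source data and are used for illustration only; in the task configuration, the ts, entity, request_count, and avg_latency columns would be mapped to the time, entity, and feature settings that are described later in this topic.

    -- Minimal sketch: aggregate raw logs into per-entity metrics at a
    -- one-minute granularity. "host" and "latency" are assumed field names;
    -- replace them with fields from your own data.
    * |
    SELECT
      __time__ - __time__ % 60 AS ts,   -- time field, aligned to whole minutes
      host AS entity,                   -- the entity to inspect
      COUNT(*) AS request_count,        -- feature 1
      avg(latency) AS avg_latency       -- feature 2
    GROUP BY ts, entity
    LIMIT 10000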

Features

The intelligent inspection feature provides the following capabilities:

  • Configure metrics: You can use SQL statements or query statements to convert log data into metrics and create inspection tasks.

  • Analyze data on a regular basis: You can configure data features based on the entities and metrics that you want to inspect. After an intelligent inspection instance is created, the instance automatically discovers new entities and periodically collects and ingests the data of the entities into the associated algorithm models for analysis. The collection interval can be accurate to the second.

  • Configure parameters and preview model results: After you configure the parameters of an algorithm model, you can preview the results that the model generates and view the time series curve and anomaly score curve of each metric. The intelligent inspection feature also helps you configure the model parameters based on the characteristics of your data.

  • Export data by using multiple methods: Inspection results are stored in the destination logstore that you specify, and anomaly information is sent to you in alert notifications.

Terms

The following terms are related to the intelligent inspection feature.

Task

An intelligent inspection task includes information such as data features, model parameters, and alert policies.

Instance

An intelligent inspection task creates an instance based on the task configuration. At regular intervals, the instance pulls data, runs the algorithm models, and distributes the inspection results.

  • Only one instance can run in a task at a time, regardless of whether the instance runs on schedule or is retried after a failure. You cannot run multiple instances concurrently in a single task.

  • The configuration of a running instance cannot be changed in place. If you modify the configuration of a task, the task creates a new instance and runs the algorithm model of the new instance. The new instance is not related to the previous instance.

  • For information about how different operations affect the scheduling and running of instances, see Scheduling and running scenarios.

Instance ID

Each intelligent inspection instance is identified by a unique ID.

Creation time

Each instance is created at a specific point in time. In most cases, an instance is created based on the scheduling rules that you specify. If historical data needs to be processed, or if the task lags behind and needs to catch up, an instance is created immediately.

Start time

Each intelligent inspection instance starts to run at a specific point in time. If a task is retried, the start time is the most recent time at which the instance starts to run.

End time

Each intelligent inspection instance stops running at a specific point in time. If the task to which an instance belongs is retried, the end time is the most recent time at which the instance stops running.

Status

Each intelligent inspection instance is in a specific state at a specific point in time. Valid values:

  • RUNNING

  • STARTING

  • SUCCEEDED

  • FAILED

Data features

Data features include the following items:

  • Observation granularity: The interval at which Simple Log Service collects and inspects data and at which the algorithm models run. The observation granularity is determined by the task configuration and does not change at run time, even if the previous instance times out, is delayed, or is processing historical data. In most cases, Simple Log Service inspects data streams at regular intervals that are accurate to the second.

  • Time: The field that is used to indicate the time of each sample. You must specify the time parameter. You can specify only one field in the time parameter.

  • Entity: One or more fields that are used to identify the entity you want to inspect.

  • Feature: One or more fields that specify the metrics you want to inspect. You can specify a value range for each metric to help the algorithm model detect anomalies more accurately.

Algorithm configuration

Different algorithms have different configuration items. For information about the configuration items of each algorithm, see Detect anomalies in real time by aggregating metrics by using SQL statements.

Inspection event

Each inspection event includes the following items:

  • Entity information: The data source from which the inspection event is obtained.

  • Configuration: The configuration of the task that generates the inspection event.

  • Anomaly score: The score of the anomaly that is identified in the inspection event. Valid values: 0 to 1. If the score of an anomaly is greater than 0.75, Simple Log Service sends an alert notification.

  • Anomaly type: The type of the anomaly that is identified in the inspection event. Simple Log Service classifies anomalies into the following types: Stab anomalies, Shift anomalies, Variance anomalies, Lack anomalies, and OverThreshold anomalies. A sample query of inspection events is provided after this list.
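For example, a query similar to the following sketch can pull recent high-score events from the destination logstore. The field names score and entity are assumptions used for illustration only; check the actual field names in the events that your task writes to internal-ml-log before you rely on such a query.

    -- Hypothetical sketch: list entities whose anomaly score exceeded the
    -- 0.75 alert threshold. The field names "score" and "entity" are
    -- assumptions; the CAST is needed only if the fields are stored as text.
    * |
    SELECT
      entity,
      COUNT(*) AS anomaly_count,
      max(CAST(score AS double)) AS max_score
    WHERE CAST(score AS double) > 0.75
    GROUP BY entity
    ORDER BY anomaly_count DESC
    LIMIT 100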

Scheduling and use scenarios

The following scenarios describe how an intelligent inspection task is scheduled and run.

  • Start an intelligent inspection task at a historical point in time: After you create the task at the current point in time, the task processes historical data based on the task rules. The algorithm models quickly consume the historical data, train on it, and gradually catch up with the current time. Inspection events are generated after the task creation time or the model learning end time is reached.

  • Modify the scheduling rules of an intelligent inspection task: After you modify the scheduling rules of a task, the task generates an instance based on the new rules. The algorithm models record the point in time up to which historical data has been analyzed and continue to analyze the most recent data.

  • Retry an intelligent inspection instance that fails to run: If an instance fails to run due to issues such as insufficient permissions, an unavailable source or destination logstore, or an invalid configuration, Simple Log Service can automatically retry the instance. If an instance is stuck in the STARTING state, the configuration of the instance may have failed. In this case, Simple Log Service generates an error log and writes the log to the internal-etl-log logstore. You can check the configuration of the instance and restart the instance. After the instance is scheduled and run, Simple Log Service changes the status of the instance to SUCCEEDED or FAILED based on the result. The sketch after this list shows how you might search these error logs.
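A minimal sketch of such a check is shown below, assuming that the task name appears in the error log entries written to internal-etl-log; the task name my-inspection-task is a placeholder, and the exact layout of these error logs is not described in this topic.

    -- Hypothetical sketch: full-text search of the internal-etl-log logstore
    -- for error entries that mention a specific inspection task.
    -- "my-inspection-task" is a placeholder task name.
    "my-inspection-task" and error |
    SELECT
      date_format(from_unixtime(__time__), '%Y-%m-%d %H:%i:%s') AS error_time,
      COUNT(*) AS error_count
    GROUP BY error_time
    ORDER BY error_time DESC
    LIMIT 100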

Recommendations

We recommend that you specify the metrics you want to inspect based on your business requirements. This way, you can improve the efficiency of intelligent inspection. The following rules apply:

  • Specify the format of data that is uploaded to the specified logstore, define fields for the data, and specify the observation granularity. These are the basic operations that you must perform to configure an intelligent inspection task.

  • Obtain the metric data changes of the entities that you specify, understand the stability and periodicity of the metric data, and formulate preliminary expectations for anomalies. These operations help you configure the parameters of an algorithm model.

  • Align the observation granularity to whole seconds, minutes, or hours. This helps you receive accurate alerts in a timely manner. A sketch of such an alignment follows this list.
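For example, the following query, a minimal sketch, aligns each sample to the start of its five-minute bucket; the 300-second bucket size and the pv metric are illustrative choices only.

    -- Align each sample to the start of its five-minute bucket so that the
    -- observation granularity falls on whole minutes.
    * |
    SELECT
      __time__ - __time__ % 300 AS ts,
      COUNT(*) AS pv
    GROUP BY ts
    LIMIT 10000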

Model training

You can use model training to reinforce anomaly detection and improve the accuracy of anomaly alerts. Model training provides the following benefits:

  • If you use only real-time inspection and the accuracy of anomaly detection does not meet your expectations, you can improve the accuracy by running model training tasks.

  • If a gap exists between the anomalies that a real-time inspection task detects and the anomalies that you expect, we recommend that you run a model training task so that anomalies are detected based on your business requirements.

Process

  • Data input: Write the data that is required for a model training task. The data includes labeled metrics and unlabeled metrics, is stored in Simple Log Service, and can be obtained by using SQL statements (see the sketch after this list). Labeled metrics can be used in the algorithm service directly. Unlabeled metrics can be used in the algorithm service only after they are labeled by using the anomaly injection simulation method.

  • Algorithm service: The algorithm service consists of feature engineering and a supervised model. In the algorithm service, a model is trained for each entity, and each model is identified by an entity ID.

  • Result storage and visualization: After a model training task is complete, the system stores the trained models in the cloud, and stores the validation results of the datasets and the events generated by the task in a logstore named internal-ml-log in the log format. You can view the visualization results in the task details.

  • Forecasting task creation: After a model training task is complete, you can obtain the models that are trained for the entities in the task. You can then create a forecasting task to detect anomalies in metrics in real time and label the detection results by using the tools provided by Simple Log Service. This further improves the accuracy of the models.
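The following query is a minimal sketch of how labeled training data might be pulled by using SQL. The fields host, latency, and is_anomaly are assumptions about how your data and labels are stored; replace them with the fields that exist in your logstore.

    -- Hypothetical sketch: pull per-entity metrics together with a label
    -- column for model training. "host", "latency", and "is_anomaly" are
    -- assumed field names; "is_anomaly" is assumed to be 0 or 1.
    * |
    SELECT
      __time__ - __time__ % 60 AS ts,
      host AS entity,
      avg(latency) AS avg_latency,
      max(is_anomaly) AS label   -- 1 if any sample in the minute is labeled anomalous
    GROUP BY ts, entity
    LIMIT 10000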

Algorithm service overview

The algorithm service consists of the following items:

  • Dataset: A dataset is constructed based on a specified time range. Datasets are classified into training sets and validation sets.

    The duration of a training set must be greater than 12 days because a model training task requires a week of historical data for feature engineering. The duration of a validation set must be greater than 3 days because 3 days of data are required to generate a detailed validation report on the fitness, robustness, and performance of the models. A sketch of such a time split is provided at the end of this topic.

  • Feature engineering: Features include interval-based and periodicity-based comparisons, translation, trend, window, and timing features.

  • Model integration: You can integrate multiple tree-based models to construct the final model.
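To illustrate the dataset requirements above, the following query sketch splits recent data into a training range and a validation range by filtering on __time__. The 15-day and 3-day boundaries and the host and latency fields are illustrative assumptions; in practice, you typically set the time range when you configure the task.

    -- Hypothetical sketch: build a training set from roughly the last 15 days
    -- excluding the most recent 3 days, leaving those 3 days as the
    -- validation set. "host" and "latency" are assumed field names.
    -- Training set (about 12 days of data):
    * |
    SELECT
      __time__ - __time__ % 60 AS ts,
      host AS entity,
      avg(latency) AS avg_latency
    WHERE __time__ >= to_unixtime(now()) - 15 * 86400
      AND __time__ <  to_unixtime(now()) - 3 * 86400
    GROUP BY ts, entity
    LIMIT 10000

    -- Validation set (the most recent 3 days): change the WHERE clause to
    -- __time__ >= to_unixtime(now()) - 3 * 86400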