Introduction to intelligent inspection

The intelligent inspection feature of Simple Log Service allows you to inspect data such as logs and metrics and identify anomalies in the data in an automated, intelligent, and adaptive manner. This topic describes the background information, workflow, functionalities, terms, scheduling, and scenarios of the intelligent inspection feature. This topic also provides suggestions on how to use the intelligent inspection feature.

Background information

Time series data, such as logs and metrics, accumulates over time. For example, if a service generates 10 million data entries per day, it accumulates about 3.65 billion data entries per year. If you use fixed, deterministic rules to inspect the data entries, you may encounter the following issues:

Low efficiency: To identify anomalies, you must manually configure various inspection rules based on your business requirements.
Low timeliness: Most time series data is time-sensitive. Faults and changes alter the patterns that metrics exhibit. A value that a specific rule identifies as an anomaly at the current point in time may be considered normal at a later point in time.
Complex configuration: Time series data comes in various forms. For example, time series data can show spike-shaped changes, turning point-like changes, or periodic changes. Time series data in different forms can also have different threshold ranges. You may need to spend a large amount of time to configure inspection rules that are used to identify anomalies in time series data.
Low accuracy: Data streams vary based on business models. Fixed inspection rules may result in a large number of false positives and false negatives. Users also have different degrees of tolerance for anomalies in different scenarios. When you troubleshoot anomalies, a large number of accurate alerts helps you identify issues. When you handle alerts, a small number of important alerts helps you work more efficiently.
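The timeliness and accuracy issues above can be illustrated with a minimal Python sketch. The latency values, the 150 ms threshold, and the function name are hypothetical and are not part of Simple Log Service. The sketch only shows how a hand-tuned static threshold behaves when the normal baseline of a metric shifts.

```python
# Hypothetical request-latency samples in milliseconds. After a business
# change, the normal baseline shifts from about 100 ms to about 180 ms.
before_change = [98, 102, 101, 99, 100, 250, 97]
after_change = [181, 179, 183, 180, 178, 182, 184]

FIXED_THRESHOLD_MS = 150  # assumed value, hand-tuned for the old baseline


def fixed_rule_alerts(samples, threshold=FIXED_THRESHOLD_MS):
    """Return the indexes of the samples that a static threshold rule flags."""
    return [i for i, value in enumerate(samples) if value > threshold]


print(fixed_rule_alerts(before_change))  # [5]
print(fixed_rule_alerts(after_change))   # [0, 1, 2, 3, 4, 5, 6]
```

Before the change, the rule flags only the genuine spike. After the change, every sample is flagged even though the new level is normal, so the rule must be retuned by hand.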

To resolve the preceding issues, Simple Log Service provides the intelligent inspection feature. This feature integrates AI algorithms developed by Alibaba Cloud to aggregate, inspect, and generate alerts for streaming data such as logs and metrics. After you enable the intelligent inspection feature, you only need to specify the metrics that you want to inspect. You do not need to configure inspection rules. This feature can automatically identify anomalies, adapt to your business changes, and generate fine-grained alerts.

Workflow

Simple Log Service extracts or aggregates metrics by using consumer groups or SQL statements. It then collects the data based on scheduling rules and ingests the data into algorithm models on a regular basis. The inspection results are written as events to the destination Logstore named internal-ml-log, and alert notifications are sent to notify you of the identified anomalies. The following figure shows the workflow.

Functionalities

The following table describes the functionalities of the intelligent inspection feature.

Functionality	Description
Configure metrics	For standard metrics, you can directly configure a consumer group to initiate an intelligent inspection task. For non-numeric logs, you can use SQL statements or query statements to convert the logs into metrics. Then, you can initiate an intelligent inspection task.
Analyze data on a regular basis	You can specify data features and configure the entities and metrics that you want to inspect based on your business requirements. After an intelligent inspection instance is created, the instance automatically discovers new entities, and collects and ingests the data of the entities into the associated algorithm models on a regular basis to analyze the data in an intelligent manner. The interval at which an intelligent inspection instance collects and ingests data can be accurate to the second.
Configure parameters and preview the results that are generated by algorithm models	After you configure the parameters of an algorithm model, you can preview the results that are generated by the algorithm model. You can also view the time series curve and anomaly score curve of each metric. The intelligent inspection feature allows you to configure the parameters of an algorithm model based on the features of your data.
Provide inspection results over multiple notification channels	Inspection results are stored in the destination Logstore that you specify. Anomaly information is sent to you by using alert notifications. The time series features that are identified in the inspected data are stored in a dedicated Logstore. You can provide feedback on or label the alerts that are generated.

Terms

The following table describes the terms that are related to the intelligent inspection feature.

Term	Description
job	An intelligent inspection task maps to an intelligent inspection job, which includes information such as data features, algorithm model parameters, and alert policies.
instance	An intelligent inspection job creates an intelligent inspection instance based on the configuration of the job. The instance pulls data, runs algorithm models, and distributes inspection results on a regular basis based on the configuration of the job. Only one instance can run in a job at a time regardless of whether the instance is run on schedule or is retried due to an anomaly. You cannot concurrently run multiple instances in a single job. Hot upgrades are not supported for parameters. If you modify the configuration of a job, the job re-creates an instance to run algorithm models. The new instance is not related to the previous instance. For more information about the impacts of different operations on the scheduling and running of instances, see Scheduling and scenarios.
instance ID	Each intelligent inspection instance is identified by an instance ID, which is unique.
creation time	Each intelligent inspection instance is created at a specific point in time. In most cases, an intelligent inspection instance is created for an intelligent inspection job based on the scheduling rules of the job. If historical data needs to be processed or the delay caused by a timeout of the previous instance needs to be offset, an instance is created immediately.
start time	Each intelligent inspection instance starts to run at a specific point in time. If a job is retried, the start time is the time when the last instance of the job starts to run.
end time	Each intelligent inspection instance stops running at a specific point in time. If a job is retried, the end time is the time when the last instance of the job stops running.
status	Each intelligent inspection instance is in a specific state at a specific point in time. Valid values: RUNNING, STARTING, SUCCEEDED, and FAILED.
data feature	Data features include the following items: Observation granularity: the time interval at which Simple Log Service collects and inspects data. Algorithm models are also run at this interval to analyze data. The observation granularity of a job is determined by the configuration of the job and does not change even if the previous instance times out, is delayed, or runs to process historical data. In most cases, Simple Log Service inspects data streams at regular intervals. The intervals are accurate to the second. Time: the field that is used to indicate the time of each sample. You must specify the time parameter. You can specify only one field in the time parameter. Entity: one or more fields that are used to identify the entity to inspect. Feature: the field that specifies the metric to inspect. You can specify multiple metrics. You can specify a value range for each metric. This way, algorithm models can identify anomalies more accurately.
algorithm configuration	An algorithm configuration includes the following items: Observation length: the number of samples that the algorithm model inspects. Valid values: 200 to 4000. Sensitivity: the sensitivity based on which the algorithm generates anomaly scores. A higher sensitivity leads to a higher anomaly score.
inspection event	An inspection event includes the following items: Entity information: the data source from which the inspection event is obtained. Configuration information: the configuration of the job that generates the inspection event. Anomaly score: the score of the anomaly that is identified in the inspection event. Valid values: 0 to 1. If the score of an anomaly is greater than 0.75, Simple Log Service sends an alert notification. Anomaly type: the type of the anomaly that is identified in the inspection event. Simple Log Service classifies anomalies into the following types: Stab anomalies, Shift anomalies, Variance anomalies, Lack anomalies, and OverThreshold anomalies.
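The numeric limits in the preceding table can be expressed as a minimal Python sketch. The class name, field names, and function names are hypothetical and do not reflect the actual event schema or any Simple Log Service SDK. Only the 200 to 4000 range, the 0 to 1 score range, the 0.75 alert threshold, and the five anomaly types come from the table. Treating the 200 to 4000 range as inclusive is an assumption.

```python
from dataclasses import dataclass

MIN_OBSERVATION_LENGTH = 200   # lower bound from the table (assumed inclusive)
MAX_OBSERVATION_LENGTH = 4000  # upper bound from the table (assumed inclusive)
ALERT_SCORE_THRESHOLD = 0.75   # an alert is sent when the score is greater
ANOMALY_TYPES = {"Stab", "Shift", "Variance", "Lack", "OverThreshold"}


@dataclass
class InspectionEvent:
    """Hypothetical in-memory view of an inspection event."""
    entity: dict
    anomaly_score: float
    anomaly_type: str


def validate_observation_length(length: int) -> int:
    """Return the length unchanged if it is within the documented range."""
    if not MIN_OBSERVATION_LENGTH <= length <= MAX_OBSERVATION_LENGTH:
        raise ValueError("observation length must be between 200 and 4000")
    return length


def should_alert(event: InspectionEvent) -> bool:
    """Return True if the event's score is greater than the alert threshold."""
    if not 0.0 <= event.anomaly_score <= 1.0:
        raise ValueError("anomaly score must be between 0 and 1")
    if event.anomaly_type not in ANOMALY_TYPES:
        raise ValueError(f"unknown anomaly type: {event.anomaly_type}")
    return event.anomaly_score > ALERT_SCORE_THRESHOLD


event = InspectionEvent({"host": "web-01"}, 0.9, "Shift")
print(should_alert(event))  # True
```

A score of exactly 0.75 does not trigger an alert in this sketch, because the table states that the score must be greater than 0.75.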

Scheduling and scenarios

The following table describes the scheduling and common scenarios of an intelligent inspection job.

Scenario	Description
Start an intelligent inspection job at the current point in time	If you start a job at the current point in time, algorithm models cannot pull historical data. The job accumulates 200 samples before an inspection event is generated. The accuracy of anomaly identification improves as more samples are obtained.
Start an intelligent inspection job at a historical point in time	After you create a job and configure the job to start at a historical point in time, algorithm models analyze historical data at a high speed based on the configuration of the job and gradually catch up with the data that is generated at the current point in time. After the creation time of the job arrives, the job starts to generate inspection events.
Modify the scheduling rules of an intelligent inspection job	After you modify the scheduling rules of a job, the job generates an instance based on the new scheduling rules. Algorithm models record the point in time before which all historical data is analyzed and continue to analyze the most recent data.
Retry an intelligent inspection instance that fails to run	If an instance fails to run due to issues such as insufficient permissions, unavailable source Logstores, unavailable destination Logstores, or invalid configurations, Simple Log Service automatically retries the instance. If an instance is stuck in the STARTING state, the configuration of the instance may have failed. Simple Log Service generates an error log and sends the log to the internal-etl-log Logstore. You can verify the configuration of the instance and restart the instance. After the instance is scheduled and run, Simple Log Service changes the status of the instance to SUCCEEDED or FAILED based on the retry result.
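For a job that starts at the current point in time, the 200-sample warm-up translates into a waiting period that depends on the observation granularity. The following Python sketch gives a rough estimate. It assumes that exactly one sample is collected per observation interval, which this topic does not state explicitly, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

WARM_UP_SAMPLES = 200  # samples accumulated before the first inspection event


def earliest_first_event(start: datetime, granularity_seconds: int) -> datetime:
    """Estimate when the first inspection event can be generated.

    Assumes one sample per observation interval (an assumption).
    """
    if granularity_seconds <= 0:
        raise ValueError("granularity must be a positive number of seconds")
    return start + timedelta(seconds=granularity_seconds * WARM_UP_SAMPLES)


start = datetime(2024, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
print(earliest_first_event(start, 60))  # 2024-01-01 03:20:00+00:00
```

Under this assumption, an observation granularity of 60 seconds gives a warm-up of 200 minutes. Starting the job at a historical point in time lets the models work through older data instead of waiting.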

Suggestions

We recommend that you specify the metrics to inspect based on your business requirements. This can help improve the efficiency of intelligent inspection. Take note of the following information when you use the intelligent inspection feature:

Specify the format of the data that is uploaded to the specified Logstore, define the fields that are included in the data, and specify the observation granularity.
Obtain the metric data changes of the entities that you specify, understand the stability and periodicity of the metric data, and formulate preliminary expectations for anomalies. These operations can help you configure the parameters of an algorithm model.
Align the observation granularity to integer seconds, integer minutes, or integer hours. This way, you can receive accurate alerts at the earliest opportunity.
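One way to read the alignment suggestion is that sample timestamps should fall on boundaries of the observation granularity. The following Python sketch shows this interpretation by rounding a Unix timestamp down to the nearest boundary. The function name is hypothetical, and the sketch is not taken from Simple Log Service.

```python
def floor_to_granularity(epoch_seconds: int, granularity_seconds: int) -> int:
    """Round a Unix timestamp down to the nearest granularity boundary."""
    if granularity_seconds <= 0:
        raise ValueError("granularity must be a positive number of seconds")
    return epoch_seconds - epoch_seconds % granularity_seconds


# A sample observed 23 seconds past the minute is assigned to that minute.
print(floor_to_granularity(1700000123, 60))  # 1700000100
```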