The intelligent inspection feature of Log Service allows you to inspect data such as logs and metrics and identify exceptions in the data in an automated, intelligent, and adaptive manner. This topic describes the background information, workflow, functionality, terms, and scheduling and use scenarios of the intelligent inspection feature. This topic also provides suggestions on how to use the intelligent inspection feature.

Background information

Time series data, such as logs and metrics, can pile up over time. For example, if 10 million data records are generated per day for a service, about 3.65 billion data records pile up per year for that service. If you use determinate rules to inspect the data records, you may face the following issues:

  • Low efficiency: To identify exceptions, you must manually configure various inspection rules based on your business requirements.
  • Low timeliness: Most time series data is time-sensitive. Faults and changes affect the patterns in which metric data is displayed. An exception that is identified at the current point in time based on a specific rule may be considered normal at a later point in time.
  • Complex configuration: Time series data comes in various forms. For example, some time series data shows spike-shaped changes, some shows turning point-like changes, and some shows periodic changes. In addition, time series data in different forms can have different threshold ranges. You may need to spend a large amount of time to configure inspection rules that are used to identify exceptions in time series data.
  • Low accuracy: Data streams dynamically change as your business changes. Determinate inspection rules may result in a large number of false or omitted alerts. Different users have different degrees of tolerance for exceptions in different use scenarios. When you troubleshoot exceptions, a larger number of positive alerts can help increase your troubleshooting efficiency. When you handle alerts, a smaller number of important alerts can help increase your handling efficiency.

To resolve the preceding issues, Log Service provides the intelligent inspection feature. This feature is integrated with the proprietary AI algorithms of Alibaba Cloud to aggregate, inspect, and generate alerts for streaming data such as logs and metrics. After you enable this feature, you only need to specify the entities and metrics that you want to inspect. You do not need to configure inspection rules. This feature can automatically identify exceptions, adapt to your business changes, and generate fine-grained alerts.

Workflow

Log Service extracts or aggregates metrics by using consumer groups or SQL statements, ingests data into algorithm models at a determinate interval based on scheduling rules, writes the inspection results as events to the destination Logstore named internal-ml-log, and then sends alerts to notify you of the identified exceptions. The following figure shows the workflow.

[Figure: workflow of the intelligent inspection feature]

Functionality

The following table describes the operations that you can perform by using the intelligent inspection feature.

Operation Description
Configure metrics
  • For standard metric data, you can directly configure a consumer group to initiate an intelligent inspection task.
  • For non-numeric log data, you can use SQL statements or query statements to convert the log data into metric data. Then, you can initiate an intelligent inspection task.
Analyze data at a determinate time interval You can specify data characteristics and configure the entities and metrics that you want to inspect based on your business requirements. After an intelligent inspection instance is created, it automatically discovers new entities and ingests the data of the entities into the algorithm models associated with these entities at a determinate time interval to analyze the data. The determinate time interval can be accurate to the second.
Configure the parameters and preview the results generated by algorithm models After you configure the parameters of an algorithm model, you can preview the results generated by the algorithm model. You can also view the time series curve and exception score curve of each specified metric. The intelligent inspection feature helps you easily configure the parameters of an algorithm model based on the characteristics of your data.
Provide inspection results over multiple notification channels Inspection results are stored in the destination Logstore that you specify. Exception information is sent to you as alerts. Time series characteristics that are identified in the inspected data are stored in a dedicated Logstore and wait for you to label the alerts that are generated.
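
The conversion of log data into metric data can be pictured with a minimal Python sketch. It is not the Log Service API or its SQL syntax. It only illustrates the idea of grouping raw log records by entity and time bucket to produce one metric sample per group. The field names ts, host, and latency_ms and the function logs_to_metrics are hypothetical.

```python
from collections import defaultdict


def logs_to_metrics(logs, granularity_s=60):
    """Aggregate raw log records into one metric sample per entity and time bucket.

    All field names ("ts", "host", "latency_ms") are hypothetical.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (entity, bucket start) -> [sum, count]
    for record in logs:
        # Round the timestamp down to the start of its time bucket.
        bucket = record["ts"] - record["ts"] % granularity_s
        acc = totals[(record["host"], bucket)]
        acc[0] += record["latency_ms"]
        acc[1] += 1
    return {
        key: {"avg_latency_ms": total / count, "count": count}
        for key, (total, count) in sorted(totals.items())
    }


sample_logs = [
    {"ts": 0, "host": "a", "latency_ms": 10},
    {"ts": 30, "host": "a", "latency_ms": 30},
    {"ts": 61, "host": "a", "latency_ms": 5},
]
metrics = logs_to_metrics(sample_logs)
```

In this sketch, host plays the role of the entity, and avg_latency_ms and count play the role of the metrics that an inspection task would read.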

Terms

The following table describes the terms related to the intelligent inspection feature.

Term Description
job An intelligent inspection task corresponds to an intelligent inspection job. A job includes data characteristics, algorithm model parameters, and alert policies.
instance An intelligent inspection job creates an intelligent inspection instance based on the configuration of the job. The instance pulls data, runs algorithm models, and distributes inspection results at a determinate time interval based on the configuration of the job.
  • Only one instance can run in a job at a time. You cannot concurrently run multiple instances in a single job.
  • Hot upgrades are not supported for parameters. If you modify the configuration of a job, the job re-creates an instance to run algorithm models. The new instance is not related to the previous instance.
  • For more information about the impacts of different operations on the scheduling and running of instances, see Scheduling and use scenarios.
instance ID Each intelligent inspection instance is identified by a unique ID.
creation time Each intelligent inspection instance is created at a specific point in time. In most cases, an intelligent inspection instance is created for an intelligent inspection job based on the scheduling rules of the job.
start time Each intelligent inspection instance starts to run at a specific point in time. If the job to which an instance belongs is retried, the start time is the most recent time at which the instance starts to run.
end time Each intelligent inspection instance stops running at a specific point in time. If the job to which an instance belongs is retried, the end time is the most recent time at which the instance stops running.
status Each intelligent inspection instance is in a specific state at a specific point in time. Valid values:
  • RUNNING
  • STARTING
  • SUCCEEDED
  • FAILED
data characteristics Data characteristics include the following items:
  • Observation granularity: the time interval at which Log Service inspects and collects data. This time interval is also the time interval at which algorithm models are run to analyze data. The observation granularity of a job varies based only on the configuration of the job. The observation granularity does not vary even if a previous instance is created to analyze historical data, times out, or lags behind the expected processing speed. In most cases, Log Service inspects data streams at a determinate time interval. The determinate time interval can be accurate to the second.
  • Time: the field that is used to indicate the time of each sample. You must specify the time parameter and you can specify only one field in the time parameter.
  • Entity: one or more fields that are used to identify the entity you want to inspect.
  • Feature: the field that specifies the metric you want to inspect. You can specify multiple metrics. You can also specify a value range for each metric. This way, algorithm models can identify exceptions in a more accurate manner.
algorithm model configuration The configuration of an algorithm model includes the following items:
  • Observation length: the number of samples that the algorithm model inspects. Valid values: 200 to 4000.
  • Sensitivity: the sensitivity based on which the algorithm model generates scores for exceptions. A higher sensitivity indicates a higher score for the same exception.
inspection event An inspection event includes the following items:
  • Entity information: the data source from which the inspection event is obtained.
  • Configuration information: the configuration of the job that generates the inspection event.
  • Exception score: the score of the exception that is identified in the inspection event. Valid values: 0 to 1. If the score of an exception is greater than 0.75, Log Service sends you an alert.
  • Exception type: the type of the exception that is identified in the inspection event. Log Service classifies exceptions into Stab exceptions, Shift exceptions, Variance exceptions, Lack exceptions, and OverThreshold exceptions.
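
The data characteristics and the observation length described in the preceding table can be summarized as a small configuration check. The following Python sketch is illustrative only. The class InspectionJobConfig and its field names are hypothetical and do not correspond to any Log Service SDK. The lower bound of 1 second for the observation granularity is an assumption based on the statement that the interval can be accurate to the second.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class InspectionJobConfig:
    """Hypothetical container for the settings described in the Terms table."""

    granularity_s: int                                   # observation granularity, in seconds
    time_field: str                                      # exactly one time field
    entity_fields: List[str]                             # one or more entity fields
    features: Dict[str, Optional[Tuple[float, float]]]   # metric -> optional value range
    observation_length: int = 200                        # documented valid values: 200 to 4000

    def validate(self) -> List[str]:
        """Return a list of human-readable problems; an empty list means valid."""
        problems = []
        if self.granularity_s < 1:  # assumption: 1 second is the finest granularity
            problems.append("observation granularity must be at least 1 second")
        if not self.time_field:
            problems.append("exactly one time field is required")
        if not self.entity_fields:
            problems.append("at least one entity field is required")
        if not self.features:
            problems.append("at least one feature (metric) is required")
        for name, bounds in self.features.items():
            if bounds is not None and bounds[0] > bounds[1]:
                problems.append(f"feature {name!r}: lower bound exceeds upper bound")
        if not 200 <= self.observation_length <= 4000:
            problems.append("observation length must be between 200 and 4000")
        return problems


good = InspectionJobConfig(60, "ts", ["host"], {"latency_ms": (0.0, 5000.0)})
bad = InspectionJobConfig(60, "ts", ["host"], {"latency_ms": None}, observation_length=100)
```

Here good.validate() returns an empty list, and bad.validate() reports the out-of-range observation length.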

Scheduling and use scenarios

The following table describes the major scheduling and use scenarios of an intelligent inspection job.

Scenario Description
Start an intelligent inspection job at the current point in time If you start a job at the current point in time, algorithm models cannot pull historical data. The job accumulates 200 samples before it generates an inspection event. The accuracy of exception identification increases with the number of samples that are obtained.
Start an intelligent inspection job at a historical point in time After you create a job and configure the job to start at a historical point in time, algorithm models analyze historical data at a high speed based on the configuration of the job and gradually catch up with the data that is obtained at the current point in time. After the creation time of the job arrives, the job starts to generate inspection events.
Modify the scheduling rules of an intelligent inspection job If you modify the scheduling rules of a job, the job re-creates an instance based on the new scheduling rules. Algorithm models record the point in time before which all historical data is analyzed and continue to analyze the most recent data.
Retry an intelligent inspection instance that fails to run If an instance fails to run due to issues, such as insufficient permissions, unavailable source Logstore, unavailable destination Logstore, and invalid configurations, Log Service can automatically retry to run the instance. If an instance is stuck in the STARTING state, the configuration of the instance may have failed. Log Service generates an error log record and sends the record to the internal-etl-log Logstore. You can verify the configuration of the instance and start the instance again. After the instance is scheduled and run, Log Service changes the status of the instance to SUCCEEDED or FAILED based on the retry result.
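
The warm-up behavior in the first scenario, in which a job that starts at the current point in time accumulates 200 samples before it generates an inspection event, can be sketched as follows. This is an illustration under stated assumptions, not the actual scheduler. The class SampleWindow is hypothetical, and the bounded buffer that keeps only the most recent samples is an assumption about how an observation length might be applied.

```python
from collections import deque


class SampleWindow:
    """Hypothetical per-entity buffer that models the 200-sample warm-up."""

    def __init__(self, min_samples=200, observation_length=4000):
        self.min_samples = min_samples
        # Assumption: only the most recent `observation_length` samples are kept.
        self._samples = deque(maxlen=observation_length)

    def add(self, value):
        """Store one sample and report whether enough data has accumulated."""
        self._samples.append(value)
        return len(self._samples) >= self.min_samples


window = SampleWindow()
ready_flags = [window.add(float(i)) for i in range(200)]
```

In this sketch, the first 199 calls to add return False, and the 200th call returns True, which marks the point at which inspection events could start to be generated.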

Suggestions

We recommend that you specify the metrics you want to inspect based on your business requirements. This way, you can improve the efficiency of intelligent inspection. The following suggestions are provided:

  • Specify the format of the data uploaded to the specified Logstore, define the fields contained in the data, and specify the observation granularity. These are the basic operations that you must perform to configure an intelligent inspection job.
  • Obtain the metric data changes of the entities that you specify, understand the stability and periodicity of the metric data, and formulate preliminary expectations for exceptions. These operations help you configure the parameters of an algorithm model.
  • Align the observation granularity to integer seconds, integer minutes, or integer hours. This way, you can receive accurate alerts in a timely manner.
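
The last suggestion can be illustrated with a short Python sketch. It assumes that the observation granularity is expressed as a whole number of seconds and that samples are grouped into buckets that start at multiples of the granularity. The helper names granularity_unit and bucket_start are hypothetical.

```python
def granularity_unit(granularity_s: int) -> str:
    """Return the coarsest unit in which the granularity is a whole number."""
    if granularity_s <= 0:
        raise ValueError("granularity must be a positive number of seconds")
    if granularity_s % 3600 == 0:
        return "hours"
    if granularity_s % 60 == 0:
        return "minutes"
    return "seconds"


def bucket_start(ts: int, granularity_s: int) -> int:
    """Round a Unix timestamp down to the start of its observation bucket."""
    if granularity_s <= 0:
        raise ValueError("granularity must be a positive number of seconds")
    return ts - ts % granularity_s
```

For example, granularity_unit(120) returns "minutes", granularity_unit(90) returns "seconds", and bucket_start(125, 60) returns 120.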