Multivariate pattern identification functions - Simple Log Service

Simple Log Service uses the anomaly detection feature to automatically identify abnormal states in service systems and their root causes. The feature combines machine learning with a metric's current pattern to detect deviations from normal behavior. The multivariate pattern identification functions support multi-dimensional anomaly detection across correlated metrics.

Multivariate pattern recognition function list

Function name	Syntax	Description	Data type of the return value
summarize function	summarize(array(array(double)) data_samples) summarize(array(array(double)) data_samples, array(double) weights)	Identifies and returns a multivariate pattern based on the specified samples and sample weights. The sample weights are optional. Statistical patterns cover a variety of statistics and joint statistics, such as a mean, standard deviation, and covariance matrix.	varchar
merge_summary function	merge_summary(varchar summary1, varchar summary2) merge_summary(varchar summary1, double weight1, varchar summary2, double weight2)	Merges multivariate patterns that are returned by the summarize function. The multivariate patterns can be patterns that are obtained by learning from the same dataset in different stages or patterns that are obtained by learning from two independent datasets. For more information, see summarize function.	varchar
normalize_vector function	normalize_vector(varchar summary, array(double) x_vector)	Normalizes a new sample vector specified by the `x_vector` parameter based on a pattern specified by the summary parameter. Make sure that each component of the vector is mapped to the [0, 1] interval. You can set the summary parameter to a pattern that is returned by the summarize function. For more information, see summarize function.	array(double)
standardize_vector function	standardize_vector(varchar summary, array(double) x_vector)	Standardizes a new sample vector specified by the `x_vector` parameter based on a pattern specified by the summary parameter. Make sure that the components of the vector have a mean of 0 and a standard deviation of 1. You can set the summary parameter to a pattern that is returned by the summarize function. For more information, see summarize function.	array(double)
mah_distance function	mah_distance(varchar summary, array(double) x_vector)	Calculates the Mahalanobis distance for a new sample vector specified by the `x_vector` parameter based on a pattern specified by the summary parameter. You can set the summary parameter to a pattern that is returned by the summarize function. For more information, see summarize function. You can use the Mahalanobis distance to handle the scale differences between different variables in an effective manner. The Mahalanobis distance measures the distance between the sample vector specified by the `x_vector` parameter after the vector is standardized and the centroid of the vector. If the Mahalanobis distance is 1, the distance between the sample vector and the centroid of the vector is equal to the average distance between all vectors and the centroid of the vectors.	double
standard_distance function	standard_distance(varchar summary, double metric_value, int element_index)	Calculates the standardized distance for a metric specified by the `metric_value` parameter based on a pattern specified by the `summary` parameter. You can set the summary parameter to a pattern that is returned by the summarize function. For more information, see summarize function. The standardized distance is similar to the Mahalanobis distance. The Mahalanobis distance measures the standardized distance between a vector that consists of multiple metrics and the centroid of the vector. The standardized distance measures the distance between a metric of a vector and the centroid of the vector. The `element_index` parameter specifies the index of the metric. The `metric_value` parameter specifies the value of the metric.	double
anomaly_level function	anomaly_level(varchar summary, array(double) x_vector) anomaly_level(varchar summary, array(double) x_vector, int element_index)	Calculates the Mahalanobis distance for a new sample vector specified by the `x_vector` parameter based on a pattern specified by the summary parameter and rounds down each distance value to obtain different levels of abnormal probabilities. You can set the summary parameter to a pattern that is returned by the summarize function. For more information, see summarize function. If the return value is 0.1, the probability that the new sample vector is abnormal is 10%. This value indicates first-level anomalies. If the return value is 0.01, the probability that the new sample vector is abnormal is 1%. This value indicates second-level anomalies. If the return value is 0.001, the probability that the new sample vector is abnormal is 0.1%. This value indicates third-level anomalies. If the return value is 0.0001, the probability that the new sample vector is abnormal is 0.01%. This value indicates fourth-level anomalies.An increase in the level of anomalies indicates a decrease in the abnormal probability of the new sample vector and an increase in the suspiciability of the vector. In most cases, you can configure a threshold to filter anomalies based on the anomaly detection results. For example, you can configure a threshold to retain only fourth-level anomalies and higher. If `element_index` is specified, the function calculates the anomaly probability only for the component at the specified index. If it is not specified, the function calculates the anomaly probability for all components.	array(double)

summarize function

The summarize function identifies and returns a multivariate pattern based on the specified samples and sample weights. The sample weights are optional. Statistical patterns cover a variety of statistics and joint statistics, such as a mean, standard deviation, and covariance matrix.

varchar summarize(array(array(double)) data_samples)

varchar summarize(array(array(double)) data_samples, array(double) weights)

Parameter	Description
`data_samples`	A two-dimensional array. The array can be used as a two-dimensional table. Each column specifies a variable. Each row specifies the variable values of a sample.
`weights`	Optional. A one-dimensional array with the same length as the first dimension of `data_samples`. Specifies the weight of each sample. If not specified, all samples have equal weights.

Example

Query statement

* | with data_table as
(
    select 1 as entity_id, 'A' as entity_group, cast(array[1, 2, 3, 0] as array(double)) as features union all
    select 2 as entity_id, 'A' as entity_group, cast(array[4, 5, 6, 0] as array(double)) as features union all
    select 3 as entity_id, 'A' as entity_group, cast(array[7, 8, 9, 0] as array(double)) as features union all
    select 4 as entity_id, 'A' as entity_group, cast(array[10, 11, 1, 0] as array(double)) as features union all
    select 5 as entity_id, 'A' as entity_group, cast(array[13, 14, 15, 0] as array(double)) as features union all
    select 6 as entity_id, 'A' as entity_group, cast(array[16, 17, 18, 0] as array(double)) as features union all
    select 7 as entity_id, 'A' as entity_group, cast(array[19, 20, 21, 0] as array(double)) as features union all
    select 8 as entity_id, 'A' as entity_group, cast(array[22, 23, 1, 0] as array(double)) as features
)
    select entity_group,
        summarize(array_agg(features)) as statistical_summary
    from data_table
    group by entity_group

Return value

entity_group

statistical_summary

{
  "sampleCount": 8,
  "vectorSize": 4,
  "means": [
    11.5,
    12.5,
    9.25,
    0.0
  ],
  "stdDevs": [
    6.87386354243376,
    6.87386354243376,
    7.361215932167728,
    0.0
  ],
  "variances": [
    47.25,
    47.25,
    54.1875,
    0.0
  ],
  "mins": [
    1.0,
    2.0,
    1.0,
    0.0
  ],
  "maxs": [
    22.0,
    23.0,
    21.0,
    0.0
  ],
  "covariance": [
    [
      47.25,
      47.25,
      19.125,
      0.0
    ],
    [
      47.25,
      47.25,
      19.125,
      0.0
    ],
    [
      19.125,
      19.125,
      54.1875,
      0.0
    ],
    [
      0.0,
      0.0,
      0.0,
      0.0
    ]
  ],
  "correlations": [
    [
      1.0,
      1.0,
      0.37796447300922725,
      0.0
    ],
    [
      1.0,
      1.0,
      0.37796447300922725,
      0.0
    ],
    [
      0.37796447300922725,
      0.37796447300922725,
      1.0,
      0.0
    ],
    [
      0.0,
      0.0,
      0.0,
      1.0
    ]
  ],
  "sums": [
    92.0,
    100.0,
    74.0,
    0.0
  ],
  "weightSum": 8.0,
  "sumProducts": [
    [
      1436.0,
      1528.0,
      1004.0,
      0.0
    ],
    [
      1528.0,
      1628.0,
      1078.0,
      0.0
    ],
    [
      1004.0,
      1078.0,
      1118.0,
      0.0
    ],
    [
      0.0,
      0.0,
      0.0,
      0.0
    ]
  ],
  "isSummarized": true
}

Response parameters

Parameter	Description
`sampleCount`	The number of samples.
`vectorSize`	The length of the vector.
`means`	The average value of each component across all vectors.
`stdDevs`	The standard deviation of each component across all vectors.
`variances`	The variance of each component across all vectors.
`mins`	The minimum value of each component across all vectors.
`maxs`	The maximum value of each component across all vectors.
`covariance`	The covariance matrix between the components of all vectors.
`correlations`	The correlation coefficient matrix between the components of all vectors.
`sums`	The sum of each component across all vectors.
`weightSum`	The sum of all sample weights.
`sumProducts`	An intermediate result used when merging statistical patterns.
`isSummarized`	Indicates whether the statistical pattern calculation returned normally. true: The request was successful. false: The request failed.

merge_summary function

You can use the summarize function to merge patterns learned at different stages, such as patterns learned from the same dataset at different times or patterns from two independent datasets.

varchar merge_summary(varchar summary1, varchar summary2)

varchar merge_summary(varchar summary1, double weight1, varchar summary2, double weight2)

Parameter	Description
`summary1`	A multivariate pattern that is returned by the summarize function. For more information, see summarize function.
`weight1`	The overall weight for the summary1 pattern.
`summary2`	The pattern is derived from the summarize function.
`weight2`	Specifies the overall weight for the summary2 pattern.

Example

Query statement

* | with data_table_01 as
(
    select 1 as entity_id, 'A' as entity_group, cast(array[1, 2, 3, 0] as array(double)) as features union all
    select 2 as entity_id, 'A' as entity_group, cast(array[4, 5, 6, 0] as array(double)) as features union all
    select 3 as entity_id, 'A' as entity_group, cast(array[7, 8, 9, 0] as array(double)) as features union all
    select 4 as entity_id, 'A' as entity_group, cast(array[10, 11, 1, 0] as array(double)) as features
),

summaries_01 as
(
    select entity_group,
        summarize(array_agg(features)) as statistical_summary
    from data_table_01
    group by entity_group
),

data_table_02 as
(
    select 5 as entity_id, 'A' as entity_group, cast(array[13, 14, 15, 0] as array(double)) as features union all
    select 6 as entity_id, 'A' as entity_group, cast(array[16, 17, 18, 0] as array(double)) as features union all
    select 7 as entity_id, 'A' as entity_group, cast(array[19, 20, 21, 0] as array(double)) as features union all
    select 8 as entity_id, 'A' as entity_group, cast(array[22, 23, 1, 0] as array(double)) as features
),

summaries_02 as
(
    select entity_group,
        summarize(array_agg(features)) as statistical_summary
    from data_table_02
    group by entity_group
)
select s1.entity_group,
    merge_summary(s1.statistical_summary, s2.statistical_summary) as statistical_summary
from summaries_01 as s1
    join summaries_02 as s2
    on s1.entity_group = s2.entity_group

Query and analysis results

statistical_summary is the aggregation pattern.

entity_group

statistical_summary

{
  "sampleCount": 8,
  "vectorSize": 4,
  "means": [
    11.5,
    12.5,
    9.25,
    0.0
  ],
  "stdDevs": [
    6.87386354243376,
    6.87386354243376,
    7.361215932167728,
    0.0
  ],
  "variances": [
    47.25,
    47.25,
    54.1875,
    0.0
  ],
  "mins": [
    1.0,
    2.0,
    1.0,
    0.0
  ],
  "maxs": [
    22.0,
    23.0,
    21.0,
    0.0
  ],
  "covariance": [
    [
      47.25,
      47.25,
      19.125,
      0.0
    ],
    [
      47.25,
      47.25,
      19.125,
      0.0
    ],
    [
      19.125,
      19.125,
      54.1875,
      0.0
    ],
    [
      0.0,
      0.0,
      0.0,
      0.0
    ]
  ],
  "correlations": [
    [
      1.0,
      1.0,
      0.37796447300922725,
      0.0
    ],
    [
      1.0,
      1.0,
      0.37796447300922725,
      0.0
    ],
    [
      0.37796447300922725,
      0.37796447300922725,
      1.0,
      0.0
    ],
    [
      0.0,
      0.0,
      0.0,
      1.0
    ]
  ],
  "sums": [
    92.0,
    100.0,
    74.0,
    0.0
  ],
  "weightSum": 8.0,
  "sumProducts": [
    [
      1436.0,
      1528.0,
      1004.0,
      0.0
    ],
    [
      1528.0,
      1628.0,
      1078.0,
      0.0
    ],
    [
      1004.0,
      1078.0,
      1118.0,
      0.0
    ],
    [
      0.0,
      0.0,
      0.0,
      0.0
    ]
  ],
  "isSummarized": true
}

Return parameters:

Parameter	Description
`sampleCount`	The number of samples.
`vectorSize`	The length of the vector.
`means`	The average value of each component across all vectors.
`stdDevs`	The standard deviation of each component across all vectors.
`variances`	The variance of each component across all vectors.
`mins`	The minimum value of each component across all vectors.
`maxs`	The maximum value of each component across all vectors.
`covariance`	The covariance matrix between the components of all vectors.
`correlations`	The correlation coefficient matrix between the components of all vectors.
`sums`	The sum of each component across all vectors.
`weightSum`	The sum of all sample weights.
`sumProducts`	An intermediate result used when merging statistical patterns.
`isSummarized`	Indicates whether the statistical pattern calculation returned normally. true: The request was successful. false: The request failed.

normalize_vector function

You can use the multivariate pattern summary obtained from the summarize function to normalize the new sample vector x_vector, which maps each of its components to the interval [0, 1].

array(double) normalize_vector(varchar summary, array(double) x_vector)

Parameter	Description
`summary`	The pattern is derived from the learning process of the summarize function.
`x_vector`	The new sample data.

Example

Query statement

* | with data_table as
(
    select 1 as entity_id, 'A' as entity_group, cast(array[1, 2, 3, 0] as array(double)) as features union all
    select 2 as entity_id, 'A' as entity_group, cast(array[4, 5, 6, 0] as array(double)) as features union all
    select 3 as entity_id, 'A' as entity_group, cast(array[7, 8, 9, 0] as array(double)) as features union all
    select 4 as entity_id, 'A' as entity_group, cast(array[10, 11, 1, 0] as array(double)) as features union all
    select 5 as entity_id, 'A' as entity_group, cast(array[13, 14, 15, 0] as array(double)) as features union all
    select 6 as entity_id, 'A' as entity_group, cast(array[16, 17, 18, 0] as array(double)) as features union all
    select 7 as entity_id, 'A' as entity_group, cast(array[19, 20, 21, 0] as array(double)) as features union all
    select 8 as entity_id, 'A' as entity_group, cast(array[22, 23, 1, 0] as array(double)) as features
),

summaries as
(
    select entity_group,
        summarize(array_agg(features)) as statistical_summary
    from data_table
    group by entity_group
)

select t1.entity_id,
    t1.entity_group,
    normalize_vector(t2.statistical_summary, t1.features) as normalized_features
from data_table as t1
    join summaries as t2
    on t1.entity_group = t2.entity_group

Query and analysis results

The normalized_features parameter indicates the normalization results of the sample vector specified by the x_vector parameter.

entity_id	entity_group	normalized_features
2	A	[0.14285714285714286,0.14285714285714286,0.25,0.5]
4	A	[0.42857142857142857,0.42857142857142857,0.0,0.5]
3	A	[0.2857142857142857,0.2857142857142857,0.4,0.5]
...	...	...

standardize_vector function

Use the multivariate pattern summary from the summarize function to standardize the new sample vector x_vector so that its components have a mean of 0 and a standard deviation of 1.

array(double) standardize_vector(varchar summary, array(double) x_vector)

Parameter	Description
`summary`	The pattern is derived from the learning process of the summarize function.
`x_vector`	The new sample data.

Example

Query statement

* | with data_table as
(
    select 1 as entity_id, 'A' as entity_group, cast(array[1, 2, 3, 0] as array(double)) as features union all
    select 2 as entity_id, 'A' as entity_group, cast(array[4, 5, 6, 0] as array(double)) as features union all
    select 3 as entity_id, 'A' as entity_group, cast(array[7, 8, 9, 0] as array(double)) as features union all
    select 4 as entity_id, 'A' as entity_group, cast(array[10, 11, 1, 0] as array(double)) as features union all
    select 5 as entity_id, 'A' as entity_group, cast(array[13, 14, 15, 0] as array(double)) as features union all
    select 6 as entity_id, 'A' as entity_group, cast(array[16, 17, 18, 0] as array(double)) as features union all
    select 7 as entity_id, 'A' as entity_group, cast(array[19, 20, 21, 0] as array(double)) as features union all
    select 8 as entity_id, 'A' as entity_group, cast(array[22, 23, 1, 0] as array(double)) as features
),

summaries as
(
    select entity_group,
        summarize(array_agg(features)) as statistical_summary
    from data_table
    group by entity_group
)

select t1.entity_id,
    t1.entity_group,
    standardize_vector(t2.statistical_summary, t1.features) as standardized_features
from data_table as t1
    join summaries as t2
    on t1.entity_group = t2.entity_group

Query and analysis results

The standardized_features parameter indicates the standardization results of the sample vector specified by the x_vector parameter.

entity_id	entity_group	standardized_features
2	A	[-1.0910894511799619,-1.0910894511799619,-0.4415031470273609,0.0]
4	A	[-0.21821789023599237,-0.21821789023599237,-1.1207387578386854,0.0]
3	A	[-0.6546536707079771,-0.6546536707079771,-0.03396178054056622,0.0]
...	...	...

mah_distance function

The mah_distance function calculates the Mahalanobis distance for a new sample vector specified by the x_vector parameter based on a pattern specified by the summary parameter. You can set the summary parameter to a pattern that is returned by the summarize function. For more information, see summarize function. The Mahalanobis distance accounts for scale differences between variables and measures the distance between the standardized sample vector and the centroid. A Mahalanobis distance of 1 means the sample vector is at the average distance from the centroid across all vectors.

double mah_distance(varchar summary, array(double) x_vector)

Parameter	Description
`summary`	The pattern is derived from the learning process of the summarize function.
`x_vector`	The new sample data.

Example

Query statement

* | with data_table as
(
    select 1 as entity_id, 'A' as entity_group, cast(array[1, 2, 3, 0] as array(double)) as features union all
    select 2 as entity_id, 'A' as entity_group, cast(array[4, 5, 6, 0] as array(double)) as features union all
    select 3 as entity_id, 'A' as entity_group, cast(array[7, 8, 9, 0] as array(double)) as features union all
    select 4 as entity_id, 'A' as entity_group, cast(array[10, 11, 1, 0] as array(double)) as features union all
    select 5 as entity_id, 'A' as entity_group, cast(array[13, 14, 15, 0] as array(double)) as features union all
    select 6 as entity_id, 'A' as entity_group, cast(array[16, 17, 18, 0] as array(double)) as features union all
    select 7 as entity_id, 'A' as entity_group, cast(array[19, 20, 21, 0] as array(double)) as features union all
    select 8 as entity_id, 'A' as entity_group, cast(array[22, 23, 1, 0] as array(double)) as features
),

summaries as
(
    select entity_group,
        summarize(array_agg(features)) as statistical_summary
    from data_table
    group by entity_group
)

select t1.entity_id,
    t1.entity_group,
    mah_distance(t2.statistical_summary, t1.features) as std_distance
from data_table as t1
    join summaries as t2
    on t1.entity_group = t2.entity_group

Return value

The std_distance parameter indicates the Mahalanobis distance of the sample vector specified by the x_vector parameter.

entity_id	entity_group	std_distance
8	A	2.386927730244857
7	A	1.6809080087793125
1	A	1.5554594371997328
...	...	...

standard_distance function

The standard_distance function calculates the standardized distance for a metric specified by the metric_value parameter based on a pattern specified by the summary parameter. You can set the summary parameter to a pattern that is returned by the summarize function. For more information, see summarize function. Unlike the Mahalanobis distance, which measures the standardized distance between a multi-metric vector and its centroid, the standardized distance measures the distance for a single metric within the vector. The element_index parameter specifies the index of the metric (starting from 0). The metric_value parameter specifies the value of the metric.

double standard_distance(varchar summary, double metric_value, int element_index)

Parameter	Description
`summary`	A pattern learned by the summarize function.
`metric_value`	New sample data
`element_index`	The index of a specified element in the `summary` array. The first element has an index of 0.

Example

Query statement

* | with data_table as
(
    select 1 as entity_id, 'A' as entity_group, cast(array[1, 2, 3, 0] as array(double)) as features union all
    select 2 as entity_id, 'A' as entity_group, cast(array[4, 5, 6, 0] as array(double)) as features union all
    select 3 as entity_id, 'A' as entity_group, cast(array[7, 8, 9, 0] as array(double)) as features union all
    select 4 as entity_id, 'A' as entity_group, cast(array[10, 11, 1, 0] as array(double)) as features union all
    select 5 as entity_id, 'A' as entity_group, cast(array[13, 14, 15, 0] as array(double)) as features union all
    select 6 as entity_id, 'A' as entity_group, cast(array[16, 17, 18, 0] as array(double)) as features union all
    select 7 as entity_id, 'A' as entity_group, cast(array[19, 20, 21, 0] as array(double)) as features union all
    select 8 as entity_id, 'A' as entity_group, cast(array[22, 23, 1, 0] as array(double)) as features
),

summaries as
(
    select entity_group,
        summarize(array_agg(features)) as statistical_summary
    from data_table
    group by entity_group
)

select t1.entity_id,
    t1.entity_group,
    standard_distance(t2.statistical_summary, 30, 1) as std_distance
from data_table as t1
    join summaries as t2
    on t1.entity_group = t2.entity_group

Query and analysis results

std_distance is the standardized distance of the input sample metric_value at the specified index.

entity_id	entity_group	std_distance
8	A	2.386927730244857
7	A	1.6809080087793125
1	A	1.5554594371997328
...	...	...

anomaly_level function

The anomaly_level function calculates the Mahalanobis distance for a new sample vector specified by the x_vector parameter based on a pattern specified by the summary parameter and rounds down each distance value to obtain different anomaly probability levels. You can set the summary parameter to a pattern that is returned by the summarize function. For more information, see summarize function. A return value of 0.1 indicates a 10% abnormal probability (first-level anomaly). A return value of 0.01 indicates a 1% probability (second-level). A return value of 0.001 indicates 0.1% (third-level). A return value of 0.0001 indicates 0.01% (fourth-level). Higher anomaly levels correspond to lower probabilities and greater suspicion of anomalies. You can configure a threshold to filter results, for example, retaining only fourth-level anomalies and above.

If you specify the element_index parameter, the function calculates the anomaly probability only for the component at that index. Otherwise, it calculates the anomaly probability for the entire vector.

double anomaly_level(varchar summary, array(double) x_vector)

double anomaly_level(varchar summary, array(double) x_vector, int element_index)

Parameter	Description
`summary`	The summarize function uses a learning process to generate a pattern.
`x_vector`	The new sample data.
`element_index`	Optional. The element at the specified index in the `x_vector` array.

Example

Query statement

* |
with dummy as
(
    select sequence(1, 1000) as seq_data,
        count(*) as record_count from log
),

sample_data as
(
    select 'G1' as group_id,
        s.seq_num,
         -- Generate 1,000 two-dimensional random vectors that are distributed around the (100, 5000) range. The standard deviation values of two components are 20 and 500.
        inverse_normal_cdf(100, 20, random()) as x1,
        inverse_normal_cdf(5000, 500, rand()) as x2
    from dummy,
        unnest(seq_data) as s(seq_num)
),

data_summary as
(
    select group_id,
        summarize(array_agg(array[x1, x2])) as metric_summary
    from sample_data
    group by group_id
),

new_data as
(
    select 'G1' as group_id, 1001 as object_id, 100.0 as x1, 5000.0 as x2
    union all select 'G1' as group_id, 1002 as object_id, 118.0 as x1, 5450.0 as x2
    union all select 'G1' as group_id, 1003 as object_id, 138.0 as x1, 5950.0 as x2
    union all select 'G1' as group_id, 1004 as object_id, 158.0 as x1, 6450.0 as x2
    union all select 'G1' as group_id, 1005 as object_id, 178.0 as x1, 6950.0 as x2
    union all select 'G1' as group_id, 1006 as object_id, 198.0 as x1, 7450.0 as x2
    union all select 'G1' as group_id, 1007 as object_id, 318.0 as x1, 10000.0 as x2
)

select n.group_id,
    json_extract(s.metric_summary, '$.means') as metric_vector_mean,
    json_extract(s.metric_summary, '$.covariance') as metric_covariance,
    n.object_id,
    n.x1,
    n.x2,
    anomaly_level(s.metric_summary, array[x1, x2]) as anomaly_level
from data_summary as s
    join new_data as n
        on s.group_id = n.group_id
order by n.group_id, n.object_id
limit 100000

Query and analysis results

The anomaly_level parameter indicates the abnormal probability of the sample vector specified by the x_vector parameter.

group_id	object_id	anomaly_level
G1	1007	13.0
G1	1006	5.0
G1	1005	4.0
...	...	...