Platform For AI: LOF Outlier

Last Updated: Mar 21, 2024

The LOF Outlier component of Platform for AI (PAI) identifies samples as outliers based on the Local Outlier Factor (LOF) algorithm. This topic describes how to configure the LOF Outlier component.

Limits

The LOF Outlier component runs only on MaxCompute computing resources.

Configure the component

You can use one of the following methods to configure the LOF Outlier component.

Method 1: Configure the component in the PAI console

Configure the component on the pipeline page of Machine Learning Designer. The following table describes the parameters.

| Tab | Parameter | Description |
| --- | --- | --- |
| Field Setting | featureCols | An array of the names of feature columns. |
| Field Setting | groupCols | An array of the names of group columns. |
| Field Setting | tensorCol | The name of the tensor column. |
| Field Setting | vectorCol | The name of the vector column. |
| Parameter Setting | Prediction Result Column | The name of the prediction result column. |
| Parameter Setting | Distance Measurement Method | The distance metric used to measure the distance between samples. Default value: EUCLIDEAN. Valid values: EUCLIDEAN, COSINE, INNERPRODUCT, CITYBLOCK, JACCARD, PEARSON. |
| Parameter Setting | maxOutlierNumPerGroup | The maximum number of outliers per group. |
| Parameter Setting | maxOutlierRatio | The maximum ratio of outliers that are detected by the LOF algorithm. |
| Parameter Setting | maxSampleNumPerGroup | The maximum number of samples per group. |
| Parameter Setting | numNeighbors | The number of nearest neighbors used to compute the LOF score of each sample. Default value: 5. |
| Parameter Setting | outlierThreshold | The score threshold. A sample whose score exceeds this value is identified as an outlier. |
| Parameter Setting | Column name of detail prediction information | The name of the prediction details column. |
| Parameter Setting | numThreads | The number of threads used by the LOF Outlier component. Default value: 1. |
| Execute Tuning | Number of Workers | The number of worker nodes. The value must be a positive integer from 1 to 9999. This parameter must be used together with the Memory per worker parameter. |
| Execute Tuning | Memory per worker | The memory size of each worker node. Unit: MB. The value must be a positive integer from 1024 to 65536. |
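For intuition about the Distance Measurement Method options, the following minimal sketch shows textbook definitions of three of the listed metrics (EUCLIDEAN, CITYBLOCK, and COSINE). The function names are hypothetical, and the exact formulas the component uses internally are not stated in this topic, so treat this as an illustration rather than the component's implementation. INNERPRODUCT, JACCARD, and PEARSON are omitted because their precise definitions in the component are not documented here.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cityblock(a, b):
    """Sum of absolute coordinate differences (Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus the cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


u, v = (0.0, 0.0), (3.0, 4.0)
print(euclidean(u, v), cityblock(u, v))
print(cosine_distance((1.0, 0.0), (0.0, 1.0)))
```

Magnitude-based metrics such as EUCLIDEAN and CITYBLOCK are sensitive to feature scale, whereas a cosine-style distance depends only on direction, so the choice can change which samples count as neighbors.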

Method 2: Configure the component by using Python code

Use the PyAlink Script component to call Python code that configures the parameters of the LOF Outlier component. For more information, see the PyAlink script documentation.

| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| predictionCol | Yes | The name of the prediction result column. | N/A |
| distanceType | No | The distance metric used to measure the distance between samples. Valid values: EUCLIDEAN, COSINE, INNERPRODUCT, CITYBLOCK, JACCARD, PEARSON. | EUCLIDEAN |
| featureCols | No | An array of the names of feature columns. | All columns |
| groupCols | No | The name of the group column. You can specify multiple columns. | N/A |
| maxOutlierNumPerGroup | No | The maximum number of outliers per group. | N/A |
| maxOutlierRatio | No | The maximum ratio of outliers that are detected by the LOF algorithm. | N/A |
| maxSampleNumPerGroup | No | The maximum number of samples per group. | N/A |
| outlierThreshold | No | The score threshold. A sample whose score exceeds this value is identified as an outlier. | N/A |
| predictionDetailCol | No | The name of the prediction details column. | N/A |
| tensorCol | No | The name of the tensor column. | N/A |
| vectorCol | No | The name of the vector column. | N/A |
| numNeighbors | No | The number of nearest neighbors used to compute the LOF score of each sample. | 5 |
| numThreads | No | The number of threads used by the LOF Outlier component. | 1 |

Sample Python code:

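import pandas as pd
from pyalink.alink import *  # provides BatchOperator, LofOutlierBatchOp, EvalOutlierBatchOp

# Each row is [val, label]. The label marks known outliers with 1;
# this sample contains none.
df = pd.DataFrame([
    [0.73, 0],
    [0.24, 0],
    [0.63, 0],
    [0.55, 0],
    [0.73, 0],
    [0.41, 0]
])

# Convert the DataFrame to a batch operator with an explicit schema.
dataOp = BatchOperator.fromDataframe(df, schemaStr='val double, label int')

# Score each sample with LOF and flag scores above 3.0 as outliers.
outlierOp = LofOutlierBatchOp()\
    .setFeatureCols(["val"])\
    .setOutlierThreshold(3.0)\
    .setPredictionCol("pred")\
    .setPredictionDetailCol("pred_detail")

# Evaluate the predictions against the label column; "1" is the outlier label.
evalOp = EvalOutlierBatchOp()\
    .setLabelCol("label")\
    .setPredictionDetailCol("pred_detail")\
    .setOutlierValueStrings(["1"])

# Chain the operators and collect the evaluation metrics.
metrics = dataOp\
    .link(outlierOp)\
    .link(evalOp)\
    .collectMetrics()

print(metrics)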
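
To show what numNeighbors and outlierThreshold control, the following minimal sketch computes the textbook LOF score in plain Python and then applies a threshold. It is an illustration under stated assumptions, not the component's implementation: the function name `lof_scores`, the Euclidean distance, the handling of duplicate points, and the sample data are all hypothetical choices made for this example.

```python
import math


def lof_scores(points, num_neighbors=5):
    """Textbook Local Outlier Factor for a list of equal-length numeric tuples."""
    n = len(points)
    if n < 2:
        raise ValueError("LOF needs at least two points")
    k = min(num_neighbors, n - 1)  # a point has at most n - 1 neighbors

    # For every point, find its k-distance and the neighbors within that distance.
    k_distance, neighbors = [], []
    for i in range(n):
        ranked = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        kd = ranked[k - 1][0]
        k_distance.append(kd)
        neighbors.append([(d, j) for d, j in ranked if d <= kd])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], d) for d, j in neighbors[i]]
        mean_reach = sum(reach) / len(reach)
        lrd.append(1.0 / mean_reach if mean_reach > 0 else math.inf)

    # LOF: mean density of the neighbors divided by the point's own density.
    scores = []
    for i in range(n):
        if math.isinf(lrd[i]):
            scores.append(1.0)  # assumption: a point buried in duplicates is an inlier
            continue
        mean_neighbor_lrd = sum(lrd[j] for _, j in neighbors[i]) / len(neighbors[i])
        scores.append(mean_neighbor_lrd / lrd[i])
    return scores


# Hypothetical data: five values close together and one far away.
points = [(0.9,), (1.0,), (1.05,), (1.1,), (1.2,), (5.0,)]
scores = lof_scores(points, num_neighbors=3)

outlier_threshold = 3.0  # plays the role of outlierThreshold
flags = [score > outlier_threshold for score in scores]
print(flags)  # only the last point is flagged
```

Scores near 1 mean a sample is about as dense as its neighbors, while scores well above 1 mean it sits in a much sparser region. A larger neighbor count smooths the density estimate, and a higher threshold flags fewer samples.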