The IForest (Isolation Forest) Anomaly Detection component identifies abnormal data points by building each tree on a random subsample of the data. Subsampling reduces computational cost while keeping detection effective.
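To make the subsampling idea concrete, the following is a minimal pure-Python sketch of the standard Isolation Forest scoring scheme for a single numeric feature. It is an illustration under stated assumptions, not the component's actual implementation: the function names (`build_tree`, `path_length`, `iforest_scores`) and the sample values are hypothetical, and the score formula follows the commonly published formulation, in which points that are isolated after fewer random splits receive scores closer to 1.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def c(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(values, depth, max_depth, rng):
    """Recursively split 1-D values at a uniformly random threshold."""
    if depth >= max_depth or len(values) <= 1 or min(values) == max(values):
        return {"size": len(values)}  # leaf
    split = rng.uniform(min(values), max(values))
    left = [v for v in values if v < split]
    right = [v for v in values if v >= split]
    if not left or not right:
        return {"size": len(values)}  # degenerate split, stop here
    return {
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(tree, x, depth=0):
    """Depth at which x lands in a leaf, plus a correction for unsplit points."""
    if "split" not in tree:
        return depth + c(tree["size"])
    child = tree["left"] if x < tree["split"] else tree["right"]
    return path_length(child, x, depth + 1)


def iforest_scores(values, num_trees=100, subsampling_size=256, seed=0):
    """Score each value; assumes at least two input values."""
    rng = random.Random(seed)
    n = min(subsampling_size, len(values))  # rows sampled per tree
    max_depth = math.ceil(math.log2(n))
    trees = [
        build_tree(rng.sample(values, n), 0, max_depth, rng)
        for _ in range(num_trees)
    ]
    scores = []
    for x in values:
        mean_path = sum(path_length(t, x) for t in trees) / num_trees
        scores.append(2.0 ** (-mean_path / c(n)))
    return scores


# Hypothetical data: eight clustered values and one distant value.
sample = [0.50, 0.52, 0.49, 0.51, 0.48, 0.53, 0.50, 0.47, 5.0]
sample_scores = iforest_scores(sample)
print(sample[sample_scores.index(max(sample_scores))])
```

The defaults `num_trees=100` and `subsampling_size=256` mirror the defaults documented for this component. Because each tree sees at most `subsampling_size` rows, the cost of building the forest does not grow with the full dataset size.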
Configure the component
Configure the IForest Anomaly Detection component using one of these methods:
Use Designer UI
Configure the component parameters on the workflow page in Designer.
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Feature Columns | Feature columns for training. Cannot be configured if Vector Column or Tensor Column is set. Note: Feature Columns, Tensor Column, and Vector Column are mutually exclusive. Use only one parameter to specify input features. |
| Fields Setting | Group Columns | Columns for grouping data. |
| Fields Setting | Tensor Column | Name of the tensor column. Cannot be configured if Vector Column or Feature Columns is set. Note: Feature Columns, Tensor Column, and Vector Column are mutually exclusive. Use only one parameter to specify input features. |
| Fields Setting | Vector Column | Name of the vector column. Cannot be configured if Tensor Column or Feature Columns is set. Note: Feature Columns, Tensor Column, and Vector Column are mutually exclusive. Use only one parameter to specify input features. |
| Parameter Settings | Prediction Result Column | Name of the prediction result column. |
| Parameter Settings | Maximum Number of Outliers per Group | Maximum number of outliers to detect in each group. |
| Parameter Settings | Maximum Ratio of Outliers | Maximum ratio of outliers that the algorithm can detect. |
| Parameter Settings | Maximum Number of Samples per Group | Maximum number of samples in each group. |
| Parameter Settings | Number of Trees in the Model | Number of trees in the model. Default: 100. |
| Parameter Settings | Outlier Score Threshold | Data points with scores greater than this threshold are identified as outliers. |
| Parameter Settings | Prediction Details Column | Name of the column that stores prediction details. |
| Parameter Settings | Number of Rows Sampled per Tree | Number of rows to sample for each tree. Must be a positive integer in the range [2, 100000]. Default: 256. |
| Parameter Settings | Number of Threads | Number of threads for the component. Default: 1. |
| Execution Tuning | Number of Workers | Number of workers. Used with Memory per Worker. Must be a positive integer in the range [1, 9999]. |
| Execution Tuning | Memory per Worker (MB) | Memory size of each worker, in MB. Must be a positive integer in the range [1024, 65536]. |
Use Python code
Configure component parameters using the PyAlink Script component, which allows calling Python code. For more information, see PyAlink Script.
| Parameter | Required | Description | Default |
| --- | --- | --- | --- |
| predictionCol | Yes | Name of the prediction result column. | N/A |
| featureCols | No | Names of the feature columns. Array type. | Select All |
| groupCols | No | Names of the group columns. Multiple columns supported. | None |
| maxOutlierNumPerGroup | No | Maximum number of outliers in each group. | None |
| maxOutlierRatio | No | Maximum ratio of outliers that the algorithm can detect. | None |
| maxSampleNumPerGroup | No | Maximum number of samples in each group. | None |
| numTrees | No | Number of trees in the model. | 100 |
| outlierThreshold | No | Data points with scores greater than this threshold are identified as outliers. | None |
| predictionDetailCol | Yes | Name of the column that contains prediction details. | N/A |
| tensorCol | No | Tensor column name. | None |
| vectorCol | No | Name of the vector column. | None |
| subsamplingSize | No | Number of rows sampled for each tree. Must be a positive integer. Range: [2, 100000]. | 256 |
| numThreads | No | Number of threads for the component. | 1 |
Example code:
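from pyalink.alink import *
import pandas as pd

# Sample data: one numeric feature (val) and a label column that is not used for detection.
df = pd.DataFrame([
    [0.73, 0],
    [0.24, 0],
    [0.63, 0],
    [0.55, 0],
    [0.73, 0],
    [0.41, 0]
])
dataOp = BatchOperator.fromDataframe(df, schemaStr='val double, label int')

# Configure the IForest outlier detector.
outlierOp = IForestOutlierBatchOp()\
    .setFeatureCols(["val"])\
    .setOutlierThreshold(3.0)\
    .setPredictionCol("pred")\
    .setPredictionDetailCol("pred_detail")

# Link the input data to the detector, then print the predictions.
dataOp.link(outlierOp).print()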
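The constraints listed in the Python parameter table (required columns, the mutually exclusive input columns, and the `subsamplingSize` range) can be checked before a job is submitted. The helper below is a hypothetical sketch and is not part of PyAlink: the function name `validate_iforest_params` and the plain-dict representation of parameters are assumptions made for illustration only.

```python
def validate_iforest_params(params):
    """Return a list of problems found in a dict of IForest parameters.

    Hypothetical helper: keys follow the parameter names in the table above.
    """
    problems = []

    # featureCols, tensorCol, and vectorCol are mutually exclusive.
    inputs = [k for k in ("featureCols", "tensorCol", "vectorCol") if params.get(k)]
    if len(inputs) > 1:
        problems.append("Set only one of: " + ", ".join(inputs))

    # Both output column names are marked as required in the table.
    for required in ("predictionCol", "predictionDetailCol"):
        if not params.get(required):
            problems.append("Missing required parameter: " + required)

    # subsamplingSize must be an integer in [2, 100000]; the default is 256.
    size = params.get("subsamplingSize", 256)
    if not (isinstance(size, int) and 2 <= size <= 100000):
        problems.append("subsamplingSize must be an integer in [2, 100000]")

    return problems


# Usage: an empty list means no problems were found.
print(validate_iforest_params({
    "predictionCol": "pred",
    "predictionDetailCol": "pred_detail",
    "featureCols": ["val"],
}))
```

Returning a list of messages rather than raising on the first problem lets all configuration issues be reported in a single pass.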