
Platform For AI: Multiclass classification evaluation

Last Updated: Apr 01, 2026

The Multiclass Classification Evaluation component measures how well your model distinguishes between three or more classes. It reports accuracy, recall, F1 score, and a confusion matrix—both per class and as overall averages—so you can identify which classes the model struggles with and guide your next optimization step.

Configure the component

Method 1: Configure on the pipeline page

In Machine Learning Designer in the Platform for AI (PAI) console, add the Multiclass Classification Evaluation component to your pipeline and set the following parameters.

| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Original Classification Result Column | The label column containing the actual class for each sample. Supports up to 1,000 distinct classes. |
| Fields Setting | Predicted Classification Result Column | The column of predicted class labels. Typically set to prediction_result. |
| Fields Setting | Advanced Options | When selected, activates the Prediction Result Probability Column field. |
| Fields Setting | Prediction Result Probability Column | The column used to calculate log loss. Typically set to prediction_detail. Valid only for random forest models; configuring it for other model types may cause an error. |
| Tuning | Cores | Number of CPU cores to allocate. Determined by the system by default. Must be set together with Memory Size per Core. |
| Tuning | Memory Size per Core | Memory allocated per core, in MB. Determined by the system by default. |

Method 2: Use PAI commands

Run PAI commands through the SQL Script component. For details on calling PAI commands from a SQL Script component, see Scenario 4: Execute PAI commands within the SQL script component.

PAI -name MultiClassEvaluation -project algo_public
    -DinputTableName="test_input"
    -DoutputTableName="test_output"
    -DlabelColName="label"
    -DpredictionColName="prediction_result"
    -Dlifecycle=30;

The following table describes all available parameters.

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| inputTableName | Yes | N/A | Name of the input table. |
| inputTablePartitions | No | Full table | Partitions to read from the input table. |
| outputTableName | Yes | N/A | Name of the output table. |
| labelColName | Yes | N/A | Column name for the actual class labels in the input table. |
| predictionColName | Yes | N/A | Column name for the predicted class labels. |
| predictionDetailColName | No | N/A | Column name for the predicted class probabilities. Example value: {"A":0.2,"B":0.3,"C":0.5}. |
| lifecycle | No | N/A | Retention period of the output table, in days. |
| coreNum | No | System-determined | Number of CPU cores to allocate. |
| memSizePerCore | No | System-determined | Memory per core, in MB. |
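The predictionDetailColName format shown above maps each class to a predicted probability. As an illustration of how log loss could be derived from that column (this is a hedged sketch of the standard multiclass log loss formula, not the component's actual implementation; the sample rows are hypothetical):

```python
import json
import math

# Hypothetical rows of (actual label, prediction_detail JSON string),
# mirroring the predictionDetailColName format {"A":0.2,"B":0.3,"C":0.5}.
rows = [
    ("A", '{"A": 0.6, "B": 0.4}'),
    ("B", '{"A": 0.2, "B": 0.8}'),
]

def log_loss(rows):
    """Multiclass log loss: mean of -log(p) for the probability
    assigned to each sample's true class."""
    total = 0.0
    for label, detail in rows:
        probs = json.loads(detail)
        total += -math.log(probs[label])
    return total / len(rows)

print(round(log_loss(rows), 4))  # ~0.367 for these two rows
```

Lower log loss means the model assigns higher probability to the true class, so it penalizes confident wrong predictions more heavily than plain accuracy does.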

Example

This example creates a small dataset with two classes (A and B), runs the evaluation, and examines the output.

Step 1: Create sample data

Add a SQL Script component to the canvas and run the following SQL to generate a test table with 10 rows.

drop table if exists multi_esti_test;
create table multi_esti_test as
select * from
(
  select '0' as id, 'A' as label, 'A' as prediction, '{"A": 0.6, "B": 0.4}' as detail
  union all
  select '1' as id, 'A' as label, 'B' as prediction, '{"A": 0.45, "B": 0.55}' as detail
  union all
  select '2' as id, 'A' as label, 'A' as prediction, '{"A": 0.7, "B": 0.3}' as detail
  union all
  select '3' as id, 'A' as label, 'A' as prediction, '{"A": 0.9, "B": 0.1}' as detail
  union all
  select '4' as id, 'B' as label, 'B' as prediction, '{"A": 0.2, "B": 0.8}' as detail
  union all
  select '5' as id, 'B' as label, 'B' as prediction, '{"A": 0.1, "B": 0.9}' as detail
  union all
  select '6' as id, 'B' as label, 'A' as prediction, '{"A": 0.52, "B": 0.48}' as detail
  union all
  select '7' as id, 'B' as label, 'B' as prediction, '{"A": 0.4, "B": 0.6}' as detail
  union all
  select '8' as id, 'B' as label, 'A' as prediction, '{"A": 0.6, "B": 0.4}' as detail
  union all
  select '9' as id, 'A' as label, 'A' as prediction, '{"A": 0.75, "B": 0.25}' as detail
)tmp;
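With only 10 rows, the headline numbers are easy to verify by hand. A minimal Python sketch (the pairs below are copied from the sample table; this is a sanity check, not part of the pipeline):

```python
# The 10 sample rows above as (label, prediction) pairs, in id order.
pairs = [
    ("A", "A"), ("A", "B"), ("A", "A"), ("A", "A"), ("B", "B"),
    ("B", "B"), ("B", "A"), ("B", "B"), ("B", "A"), ("A", "A"),
]

# Overall accuracy: fraction of rows where the prediction matches the label.
correct = sum(1 for label, pred in pairs if label == pred)
accuracy = correct / len(pairs)
print(accuracy)  # 7 of 10 predictions match, so 0.7
```

This 0.7 is the overall accuracy the component reports in the next step.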

Step 2: Run the evaluation

Add another SQL Script component and run the following PAI command.

drop table if exists ${o1};
PAI -name MultiClassEvaluation -project algo_public
    -DinputTableName="multi_esti_test"
    -DoutputTableName=${o1}
    -DlabelColName="label"
    -DpredictionColName="prediction"
    -Dlifecycle=30;

Step 3: View the results

Right-click the SQL Script component and choose View Data > SQL Script Output.

The output is a JSON object. The key sections are described in Interpret the output below.

Interpret the output

The output JSON contains three logical groups: per-class metrics, overall averages, and distribution statistics.

Per-class metrics

LabelMeasureList reports one set of metrics for each class in LabelList. The table below shows values from the example above.

| Metric | Class A | Class B | Range | Direction | What it means |
| --- | --- | --- | --- | --- | --- |
| Accuracy | 0.70 | 0.70 | [0, 1] | Higher is better | Proportion of all samples correctly classified for this class |
| Precision | 0.67 | 0.75 | [0, 1] | Higher is better | Of all samples predicted as this class, how many actually belong to it |
| Sensitivity (recall) | 0.80 | 0.60 | [0, 1] | Higher is better | Of all samples that actually belong to this class, how many were correctly identified |
| F1 score | 0.73 | 0.67 | [0, 1] | Higher is better | Harmonic mean of precision and recall; useful when both matter equally |
| Specificity | 0.60 | 0.80 | [0, 1] | Higher is better | Proportion of negative samples correctly rejected for this class |
| False positive rate | 0.40 | 0.20 | [0, 1] | Lower is better | Proportion of actual negatives incorrectly predicted as this class |
| False negative rate | 0.20 | 0.40 | [0, 1] | Lower is better | Proportion of actual positives missed for this class |
| False discovery rate | 0.33 | 0.25 | [0, 1] | Lower is better | Proportion of positive predictions that are incorrect |
| Negative predictive value | 0.75 | 0.67 | [0, 1] | Higher is better | Of all samples predicted as negative for this class, how many truly are |
| Kappa | 0.40 | 0.40 | [-1, 1] | Higher is better | Agreement between predictions and actual labels, adjusted for chance (> 0.6 is generally considered good) |
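These per-class values all follow from the example's confusion matrix with standard formulas. A sketch that recomputes the Class A figures (this mirrors the textbook definitions, not the component's internal code):

```python
# Confusion matrix from the example: rows = actual (A, B), cols = predicted (A, B).
cm = [[4, 1],
      [2, 3]]

# Treat class A as the positive class.
tp, fn = cm[0][0], cm[0][1]   # actual A, predicted A / B
fp, tn = cm[1][0], cm[1][1]   # actual B, predicted A / B
n = tp + fn + fp + tn

accuracy    = (tp + tn) / n
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)
f1          = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)

# Kappa: observed agreement (po) vs. agreement expected by chance (pe),
# where pe comes from the row and column marginals.
po = accuracy
pe = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / (n * n)
kappa = (po - pe) / (1 - pe)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(kappa, 2))
# 0.67 0.8 0.73 0.4, matching the Class A column above
```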

Overall averages

OverallMeasures reports three averaging strategies across all classes. Use the one that fits your class distribution:

| Strategy | Key in output | When to use |
| --- | --- | --- |
| Macro-averaged | MacroAveraged | Classes are roughly balanced, or you want minority classes to have equal weight. When classes are imbalanced, use this to avoid majority classes dominating the score. |
| Micro-averaged | MicroAveraged | You have many more samples in some classes and want larger classes to contribute more to the overall score. |
| Label frequency-based micro | LabelFrequencyBasedMicro | Weighted by label frequency; in a balanced dataset this equals micro-averaged. |

For this example, all three strategies produce an overall accuracy of 0.70 and a kappa of 0.40, because the two classes have equal sample counts (5 each).

When your classes are imbalanced, macro-averaged and micro-averaged results will differ. Focus on macro-averaged metrics to give equal weight to underrepresented classes.
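The difference between the two strategies comes down to when you aggregate. A sketch using precision over the example's two classes (per-class counts taken from the confusion matrix; this illustrates the general definitions, not the component's exact code):

```python
# Per-class (TP, FP, FN) counts from the example's confusion matrix.
per_class = {"A": (4, 2, 1), "B": (3, 1, 2)}

# Macro: compute precision per class first, then take the unweighted mean.
macro_p = sum(tp / (tp + fp) for tp, fp, _ in per_class.values()) / len(per_class)

# Micro: pool the counts across classes first, then compute one precision.
tp_sum = sum(tp for tp, _, _ in per_class.values())
fp_sum = sum(fp for _, fp, _ in per_class.values())
micro_p = tp_sum / (tp_sum + fp_sum)

print(round(macro_p, 3), round(micro_p, 3))  # 0.708 0.7
```

Because pooling weights each sample equally, micro-averaging lets large classes dominate; macro-averaging weights each class equally regardless of size.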

Confusion matrix

ConfusionMatrix is a 2D array where ConfusionMatrix[i][j] is the number of samples from actual class i predicted as class j. For this example:

| | Predicted A | Predicted B |
| --- | --- | --- |
| Actual A | 4 (TP) | 1 (FN) |
| Actual B | 2 (FP) | 3 (TN) |

ProportionMatrix shows the same data as row-normalized proportions (each row sums to 1.0).
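Both matrices can be reproduced from the raw (label, prediction) pairs. A minimal sketch of the same indexing convention (rows = actual, columns = predicted):

```python
# Build ConfusionMatrix and ProportionMatrix from the example's pairs.
labels = ["A", "B"]
pairs = [
    ("A", "A"), ("A", "B"), ("A", "A"), ("A", "A"), ("B", "B"),
    ("B", "B"), ("B", "A"), ("B", "B"), ("B", "A"), ("A", "A"),
]

idx = {c: i for i, c in enumerate(labels)}
cm = [[0] * len(labels) for _ in labels]
for actual, predicted in pairs:
    cm[idx[actual]][idx[predicted]] += 1   # row = actual, col = predicted

# Row-normalize so each row sums to 1.0.
proportions = [[v / sum(row) for v in row] for row in cm]

print(cm)           # [[4, 1], [2, 3]]
print(proportions)  # [[0.8, 0.2], [0.4, 0.6]]
```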

Distribution statistics

| Field | Description |
| --- | --- |
| ActualLabelFrequencyList | Sample count per class in the input data: [5, 5] |
| ActualLabelProportionList | Proportion per class in the input data: [0.5, 0.5] |
| PredictedLabelFrequencyList | Sample count per class in the predictions: [6, 4] |
| PredictedLabelProportionList | Proportion per class in the predictions: [0.6, 0.4] |

A significant difference between actual and predicted distributions indicates systematic bias toward certain classes.
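One simple way to quantify such bias is the per-class gap between the predicted and actual proportions. A sketch using the example's labels (the comparison approach here is illustrative, not part of the component's output):

```python
from collections import Counter

labels = ["A", "B"]
actual    = ["A", "A", "A", "A", "B", "B", "B", "B", "B", "A"]
predicted = ["A", "B", "A", "A", "B", "B", "A", "B", "A", "A"]

def proportions(values):
    """Class proportions in a fixed label order."""
    counts = Counter(values)
    return [counts[c] / len(values) for c in labels]

actual_p = proportions(actual)        # [0.5, 0.5]
predicted_p = proportions(predicted)  # [0.6, 0.4]

# Positive gap = the model over-predicts that class.
bias = [p - a for p, a in zip(predicted_p, actual_p)]
print(bias)
```

Here the model over-predicts class A by 10 percentage points, which is consistent with the two false positives in the confusion matrix.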

Appendix

If you run the component on the pipeline page, right-click the Multiclass Classification Evaluation component and select Visual Analysis to view the results in chart form.
