The Clustering Model Evaluation component is used to evaluate clustering models and generate evaluation metrics based on raw data and clustering results.

Limits

The report of this component is available only in the Machine Learning Studio console.

Background information

The Calinski-Harabasz index is also known as the variance ratio criterion (VRC). The following figure shows the formula used to calculate the VRC.

[Figure: Calculation formula of VRC]

| Parameter | Description |
|---|---|
| SSB | The variance between clusters. The following figure shows the formula used to calculate the variance between clusters. [Figure: SSB] In the formula, k is the number of cluster centers, m_i is the center of the ith cluster, and m is the mean of the input data. |
| SSW | The variance within a cluster. The following figure shows the formula used to calculate the variance within a cluster. [Figure: SSW] In the formula, k is the number of cluster centers, x is a data point, c_i is the ith cluster, and m_i is the center of the ith cluster. |
| N | The total number of records. |
| k | The number of cluster centers. |
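
As a rough illustration of how these parameters fit together, the following Python sketch computes the index with the commonly used definition VRC = (SSB / (k - 1)) / (SSW / (N - k)). The figures that carry the exact formulas are not reproduced here, so weighting each cluster by its size in SSB is an assumption based on the standard Calinski-Harabasz definition. The function name `calinski_harabasz` and the sample labels are hypothetical and do not represent the implementation of the component.

```python
from collections import defaultdict


def calinski_harabasz(points, labels):
    """Return (SSB / (k - 1)) / (SSW / (N - k)) for labelled points.

    Assumption: SSB weights each cluster by its size, as in the
    standard definition. This is not the component's source code.
    """
    n = len(points)
    dim = len(points[0])

    # Group the points by cluster label.
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("the VRC requires 2 <= k < N")

    def mean(rows):
        return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    m = mean(points)  # mean of all input data
    ssb = 0.0
    ssw = 0.0
    for members in clusters.values():
        m_i = mean(members)  # center of this cluster
        ssb += len(members) * sq_dist(m_i, m)
        ssw += sum(sq_dist(x, m_i) for x in members)

    if ssw == 0:
        # Every point coincides with its cluster center.
        return float("inf")
    return (ssb / (k - 1)) / (ssw / (n - k))


# The (f0, f3) rows of the test data in the Example section, with a
# hypothetical cluster assignment. The real assignment depends on the model.
points = [(1, 2), (1, 3), (1, 4), (0, 3), (0, 4)]
labels = [0, 0, 1, 2, 2]
print(calinski_harabasz(points, labels))  # about 3.0 for these assumed labels
```

For the assumed labels, SSB is 3.0 and SSW is 1.0, so the ratio is (3.0 / 2) / (1.0 / 2). A larger value indicates clusters that are compact and well separated.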

Configure the component

You can use the following methods to configure the component parameters:

Method 1: Use Machine Learning Designer

Configure the component parameters on the pipeline configuration tab of Machine Learning Designer.
| Tab | Parameter | Description |
|---|---|---|
| Fields Setting | Evaluation Columns | The columns that are selected from the input table for evaluation. The value of this parameter must be consistent with the feature columns in the model. |
| Fields Setting | Input Sparse Format | Specifies whether the input data is in sparse format. Sparse data is represented as key-value pairs. |
| Fields Setting | KV Pair Delimiter | The delimiter that is used to separate key-value pairs. Default value: comma (,). |
| Fields Setting | KV Delimiter | The delimiter that is used to separate keys and values. Default value: colon (:). |
| Tuning | Cores | The number of cores. This parameter must be used together with the Memory Size per Core parameter. The value of this parameter must be a positive integer. |
| Tuning | Memory Size per Core | The memory size of each core. This parameter must be used together with the Cores parameter. Unit: MB. |
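
To make the sparse format more concrete, the following Python sketch parses one sparse record with the default delimiters. The function name `parse_sparse_row` and the sample string are hypothetical. The sketch only illustrates how the two delimiters divide a record and is not how the component reads data.

```python
def parse_sparse_row(text, item_delimiter=",", kv_delimiter=":"):
    """Parse 'key:value,key:value' text into a {key: float} dict."""
    features = {}
    for pair in text.split(item_delimiter):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty segments such as a trailing delimiter
        key, sep, value = pair.partition(kv_delimiter)
        if not sep:
            raise ValueError("missing key-value delimiter in: " + pair)
        features[key.strip()] = float(value)
    return features


row = parse_sparse_row("0:1,3:2.5")
print(row)  # {'0': 1.0, '3': 2.5}
```

If the data uses other delimiters, pass them explicitly, for example `parse_sparse_row("a=1;b=2", ";", "=")`.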

Method 2: Use PAI commands

Configure the component parameters by using a Machine Learning Platform for AI (PAI) command. You can use the SQL Script component to call PAI commands. For more information, see SQL Script. The following sample command calls the component, and the table after the command describes its parameters.
PAI -name cluster_evaluation
    -project algo_public
    -DinputTableName=pai_cluster_evaluation_test_input
    -DselectedColNames=f0,f3
    -DmodelName=pai_kmeans_test_model
    -DoutputTableName=pai_ft_cluster_evaluation_out;
| Parameter | Required | Description | Default value |
|---|---|---|---|
| inputTableName | Yes | The name of the input table. | N/A |
| selectedColNames | No | The names of the columns that are selected from the input table for evaluation. Separate multiple columns with commas (,). The value of this parameter must be consistent with the feature columns in the model. | All columns |
| inputTablePartitions | No | The partitions that are selected from the input table for evaluation. Supported formats: Partition_name=value, or name1=value1/name2=value2 for multi-level partitions. If you specify multiple partitions, separate them with commas (,). | Full table |
| enableSparse | No | Specifies whether the input data is sparse. Valid values: true and false. | false |
| itemDelimiter | No | The delimiter that is used to separate sparse key-value pairs. | , |
| kvDelimiter | No | The delimiter that is used to separate sparse keys and values. | : |
| modelName | Yes | The name of the input clustering model. | N/A |
| outputTableName | Yes | The name of the output table. | N/A |
| lifecycle | No | The lifecycle of the output table. | N/A |
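
The Required column can also be read as a simple validation rule. The following Python sketch assembles the command text from a dictionary and rejects input that omits a required parameter or uses an unknown name. `build_cluster_evaluation_command` is a hypothetical helper written for illustration and is not part of PAI. It only produces a string and does not submit anything.

```python
# Parameter name -> whether it is required, in the order of the table.
PARAMETERS = {
    "inputTableName": True,
    "selectedColNames": False,
    "inputTablePartitions": False,
    "enableSparse": False,
    "itemDelimiter": False,
    "kvDelimiter": False,
    "modelName": True,
    "outputTableName": True,
    "lifecycle": False,
}


def build_cluster_evaluation_command(params):
    """Return the PAI command text for the given parameter dict."""
    missing = [n for n, required in PARAMETERS.items()
               if required and n not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    unknown = [n for n in params if n not in PARAMETERS]
    if unknown:
        raise ValueError("unknown parameters: " + ", ".join(unknown))

    lines = ["PAI -name cluster_evaluation", "    -project algo_public"]
    for name in PARAMETERS:  # keep the order of the parameter table
        if name in params:
            lines.append("    -D{}={}".format(name, params[name]))
    return "\n".join(lines) + ";"


command = build_cluster_evaluation_command({
    "inputTableName": "pai_cluster_evaluation_test_input",
    "selectedColNames": "f0,f3",
    "modelName": "pai_kmeans_test_model",
    "outputTableName": "pai_ft_cluster_evaluation_out",
})
print(command)
```

With these four values, the printed text has the same parameters as the sample command that precedes the table.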

Example

  1. Execute the following SQL statements to generate test data:
    create table if not exists pai_cluster_evaluation_test_input as
    select * from
    (
      select 1 as id, 1 as f0,2 as f3 from dual
      union all
      select 2 as id, 1 as f0,3 as f3 from dual
      union all
      select 3 as id, 1 as f0,4 as f3 from dual
      union all
      select 4 as id, 0 as f0,3 as f3 from dual
      union all
      select 5 as id, 0 as f0,4 as f3 from dual
    )tmp;
  2. Run the following PAI command to build a clustering model. A k-means clustering model is built in this example.
    PAI -name kmeans
        -project algo_public
        -DinputTableName=pai_cluster_evaluation_test_input
        -DselectedColNames=f0,f3
        -DcenterCount=3
        -Dloop=10
        -Daccuracy=0.00001
        -DdistanceType=euclidean
        -DinitCenterMethod=random
        -Dseed=1
        -DmodelName=pai_kmeans_test_model
        -DidxTableName=pai_kmeans_test_idx;
  3. Run the following PAI command to evaluate the model by using the Clustering Model Evaluation component:
    PAI -name cluster_evaluation
        -project algo_public
        -DinputTableName=pai_cluster_evaluation_test_input
        -DselectedColNames=f0,f3
        -DmodelName=pai_kmeans_test_model
        -DoutputTableName=pai_ft_cluster_evaluation_out;
  4. View the output evaluation table pai_ft_cluster_evaluation_out and the following visualized graph.
    [Figure: Statistical results]
    The following table describes the fields displayed in the graph.
    | Field | Description |
    |---|---|
    | count | The total number of returned entries. |
    | centerCount | The number of cluster centers. |
    | calinhara | The VRC. |