Gaussian mixture models (GMMs) are probabilistic models that represent a population as a mixture of K Gaussian subpopulations. You can use the GMM Training component to train a GMM clustering model. This topic describes how to configure the GMM Training component.
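The mixture idea above can be sketched in a few lines of code. This is an illustrative one-dimensional example with two components, not part of the component's API; the function name and the example weights, means, and standard deviations are made up for illustration.

```python
import numpy as np

def gmm_pdf(x, weights, means, stds):
    """Evaluate a GMM density: a weighted sum of K Gaussian densities."""
    x = np.asarray(x, dtype=float)
    density = np.zeros_like(x)
    for w, mu, sigma in zip(weights, means, stds):
        # Each component contributes weight w times a normal density N(mu, sigma^2).
        density += w * np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return density

# Two well-separated components, mirroring the two clusters in the samples below.
p = gmm_pdf([0.0, 10.0], weights=[0.5, 0.5], means=[0.0, 10.0], stds=[1.0, 1.0])
```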

Limits

The supported computing engines are MaxCompute and Apache Flink.

Configure the component in the PAI console

You can configure parameters for the GMM Training component in the Machine Learning Platform for AI (PAI) console.
Field Setting
  • vectorCol: The name of the vector column.
Parameter Setting
  • epsilon: The convergence threshold. The algorithm converges when the distance between the centers generated by two consecutive iterations is less than this value. Default value: 1.0E-4.
  • k: The number of Gaussians. Default value: 2.
  • maxIter: The maximum number of iterations. Default value: 100.
  • randomSeed: The random seed used by the algorithm. Default value: 0.
Execution Tuning
  • Number of Workers: The number of workers. This parameter must be used together with the Memory per worker, unit MB parameter. The value must be a positive integer. Valid values: [1,9999]. For more information, see the "Appendix: How to estimate the resource usage" section of this topic.
  • Memory per worker, unit MB: The memory size of each worker. Valid values: 1024 to 64 × 1024. Unit: MB. For more information, see the "Appendix: How to estimate the resource usage" section of this topic.

Configure the component by coding

You can also use the following Python or Java code to configure the component.
  • Python
    from pyalink.alink import *
    import pandas as pd

    # Run PyAlink in a local environment.
    useLocalEnv(1)

    df_data = pd.DataFrame([
        ["-0.6264538 0.1836433"],
        ["-0.8356286 1.5952808"],
        ["0.3295078 -0.8204684"],
        ["0.4874291 0.7383247"],
        ["0.5757814 -0.3053884"],
        ["1.5117812 0.3898432"],
        ["-0.6212406 -2.2146999"],
        ["11.1249309 9.9550664"],
        ["9.9838097 10.9438362"],
        ["10.8212212 10.5939013"],
        ["10.9189774 10.7821363"],
        ["10.0745650 8.0106483"],
        ["10.6198257 9.9438713"],
        ["9.8442045 8.5292476"],
        ["9.5218499 10.4179416"],
    ])
    
    data = BatchOperator.fromDataframe(df_data, schemaStr='features string')
    dataStream = StreamOperator.fromDataframe(df_data, schemaStr='features string')
    
    # Train a GMM on the batch source.
    gmm = GmmTrainBatchOp() \
        .setVectorCol("features") \
        .setEpsilon(0.)
    
    model = gmm.linkFrom(data)
    
    # Batch prediction: assign each sample to a cluster.
    predictor = GmmPredictBatchOp() \
        .setPredictionCol("cluster_id") \
        .setVectorCol("features") \
        .setPredictionDetailCol("cluster_detail")
    
    predictor.linkFrom(model, data).print()
    
    # Streaming prediction with the trained model.
    predictorStream = GmmPredictStreamOp(model) \
        .setPredictionCol("cluster_id") \
        .setVectorCol("features") \
        .setPredictionDetailCol("cluster_detail")
    
    predictorStream.linkFrom(dataStream).print()
    
    StreamOperator.execute()
  • Java
    import org.apache.flink.types.Row;
    
    import com.alibaba.alink.operator.batch.BatchOperator;
    import com.alibaba.alink.operator.batch.clustering.GmmPredictBatchOp;
    import com.alibaba.alink.operator.batch.clustering.GmmTrainBatchOp;
    import com.alibaba.alink.operator.batch.source.MemSourceBatchOp;
    import com.alibaba.alink.operator.stream.StreamOperator;
    import com.alibaba.alink.operator.stream.clustering.GmmPredictStreamOp;
    import com.alibaba.alink.operator.stream.source.MemSourceStreamOp;
    import org.junit.Test;
    
    import java.util.Arrays;
    import java.util.List;
    
    public class GmmTrainBatchOpTest {
        @Test
        public void testGmmTrainBatchOp() throws Exception {
            List <Row> df_data = Arrays.asList(
                Row.of("-0.6264538 0.1836433"),
                Row.of("-0.8356286 1.5952808"),
                Row.of("0.3295078 -0.8204684"),
                Row.of("0.4874291 0.7383247"),
                Row.of("0.5757814 -0.3053884"),
                Row.of("1.5117812 0.3898432"),
                Row.of("-0.6212406 -2.2146999"),
                Row.of("11.1249309 9.9550664"),
                Row.of("9.9838097 10.9438362"),
                Row.of("10.8212212 10.5939013"),
                Row.of("10.9189774 10.7821363"),
                Row.of("10.0745650 8.0106483"),
                Row.of("10.6198257 9.9438713"),
                Row.of("9.8442045 8.5292476"),
                Row.of("9.5218499 10.4179416")
            );
            BatchOperator <?> data = new MemSourceBatchOp(df_data, "features string");
            StreamOperator <?> dataStream = new MemSourceStreamOp(df_data, "features string");
            BatchOperator <?> gmm = new GmmTrainBatchOp()
                .setVectorCol("features")
                .setEpsilon(0.);
            BatchOperator <?> model = gmm.linkFrom(data);
            BatchOperator <?> predictor = new GmmPredictBatchOp()
                .setPredictionCol("cluster_id")
                .setVectorCol("features")
                .setPredictionDetailCol("cluster_detail");
            predictor.linkFrom(model, data).print();
            StreamOperator <?> predictorStream = new GmmPredictStreamOp(model)
                .setPredictionCol("cluster_id")
                .setVectorCol("features")
                .setPredictionDetailCol("cluster_detail");
            predictorStream.linkFrom(dataStream).print();
            StreamOperator.execute();
        }
    }

Appendix: How to estimate the resource usage

You can refer to the following questions to estimate the resource usage.
  • How do I estimate the appropriate memory size for each worker?

    If the number of Gaussians is K and the number of vector dimensions is M, the appropriate memory size for each worker can be calculated by using the following formula: M × M × K × 8 × 2 × 12/1024/1024 (unit: MB). In most cases, the memory size of each worker is set to 8 GB.

  • How do I estimate the appropriate worker quantity?

    We recommend that you configure the number of workers based on the input data size. For example, if the input data size is X GB, we recommend that you use 5 × X workers. If resources are insufficient, you can reduce the number of workers. A larger number of workers incurs higher cross-worker communication overhead. Therefore, as you add workers, the distributed training job first speeds up and then slows down beyond a certain number of workers. You can tune this parameter to find the optimal number.

  • How do I estimate the maximum amount of data that can be supported by the algorithm?

    We recommend that you set the number of vector dimensions to less than 200.
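The two sizing rules above can be sketched as simple helper functions. This is an illustrative sketch: the function names are not part of the component, and the bodies simply restate the recommendations above (memory per worker: M × M × K × 8 × 2 × 12 / 1024 / 1024 MB; worker count: 5 × X for X GB of input, clamped to the valid range [1, 9999]).

```python
import math

def estimate_worker_memory_mb(m: int, k: int) -> float:
    """Memory per worker in MB for M-dimensional vectors and K Gaussians."""
    return m * m * k * 8 * 2 * 12 / 1024 / 1024

def recommended_workers(data_size_gb: float) -> int:
    """Recommended worker count for X GB of input, clamped to [1, 9999]."""
    return min(max(1, math.ceil(5 * data_size_gb)), 9999)

# Example: 200-dimensional vectors (the recommended upper bound) and 2 Gaussians.
memory_mb = estimate_worker_memory_mb(200, 2)   # ~14.6 MB; the common 8 GB setting is ample
workers = recommended_workers(2.0)              # 10 workers for 2 GB of input
```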