
Platform for AI: GMM Training

Last Updated: Feb 19, 2024

A Gaussian mixture model (GMM) is a probabilistic model that represents an overall population as a mixture of K Gaussian subpopulations. You can use the GMM Training component to train a GMM. This topic describes how to configure the GMM Training component.

Limits

You can use the GMM Training component based only on one of the following computing resources: MaxCompute, Realtime Compute for Apache Flink, or Deep Learning Containers (DLC) of Platform for AI (PAI).

Configure the component in the PAI console

You can configure the parameters of the GMM Training component in the PAI console.

The following parameters are grouped by the tab on which they appear.

Field Setting tab

  • vectorCol: The name of the vector column.

Parameter Setting tab

  • epsilon: The convergence threshold. When the distance between the cluster centers generated in two consecutive iterations is less than this value, the algorithm is considered to have converged. Default value: 1.0E-4.

  • k: The number of Gaussians. Default value: 2.

  • maxIter: The maximum number of iterations. Default value: 100.

  • randomSeed: The random seed given to the method. Default value: 0.

Execution Tuning tab

  • Number of Workers: The number of workers. This parameter must be used together with the Memory per worker, unit MB parameter. The value must be a positive integer. Valid values: [1,9999]. For more information, see the "Appendix: How to estimate the resource usage" section of this topic.

  • Memory per worker, unit MB: The memory size of each worker. Valid values: 1024 to 64 × 1024. Unit: MB. For more information, see the "Appendix: How to estimate the resource usage" section of this topic.
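PAI does not expose the component's training code, but the parameters above correspond to the usual expectation-maximization (EM) loop for fitting a GMM. The following minimal sketch fits a one-dimensional mixture and reuses the component's parameter names (k, epsilon, maxIter, randomSeed); it is an illustration of the algorithm, not PAI's implementation.

```python
# Minimal 1-D GMM training via EM (illustrative sketch, not PAI's backend).
# Parameter names mirror the component's settings: k, epsilon, maxIter, randomSeed.
import math
import random

def train_gmm_1d(data, k=2, epsilon=1.0e-4, maxIter=100, randomSeed=0):
    rnd = random.Random(randomSeed)
    # Farthest-point initialization keeps the initial centers spread out.
    means = [rnd.choice(data)]
    while len(means) < k:
        means.append(max(data, key=lambda x: min(abs(x - m) for m in means)))
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(maxIter):
        # E-step: responsibility of each Gaussian for each data point.
        resp = []
        for x in data:
            p = [w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
                 for w, m, v in zip(weights, means, variances)]
            s = sum(p)
            resp.append([pi / s for pi in p])
        # M-step: re-estimate weights, means, and variances.
        new_means = []
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            mj = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(
                sum(r[j] * (x - mj) ** 2 for r, x in zip(resp, data)) / nj, 1e-6)
            new_means.append(mj)
        # Convergence check: stop once the centers move less than epsilon,
        # matching the epsilon parameter's description above.
        shift = max(abs(a - b) for a, b in zip(new_means, means))
        means = new_means
        if shift < epsilon:
            break
    return means

# Toy "vector column": two well-separated 1-D clusters around 0 and 10.
rnd = random.Random(0)
data = [rnd.gauss(0, 1) for _ in range(200)] + [rnd.gauss(10, 1) for _ in range(200)]
print(sorted(round(m) for m in train_gmm_1d(data)))  # recovers centers near 0 and 10
```

With well-separated clusters, the center shift drops below epsilon long before maxIter is reached, which is why the default epsilon of 1.0E-4 usually terminates training early.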

Appendix: How to estimate the resource usage

Refer to the following guidelines to estimate resource usage.

  • How do I estimate the appropriate memory size for each worker?

    If the number of Gaussians is K and the number of vector dimensions is M, the appropriate memory size for each worker can be calculated by using the following formula: M × M × K × 8 × 2 × 12/1024/1024 (unit: MB). In most cases, the memory size of each worker is set to 8 GB.
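The formula above can be evaluated directly. The helper below is ours for illustration, not a PAI API:

```python
# Evaluates the memory formula from this appendix:
# M x M x K x 8 x 2 x 12 / 1024 / 1024 (MB), where K is the number of
# Gaussians and M is the number of vector dimensions.
# The function name is illustrative, not part of any PAI SDK.
def worker_memory_mb(k, m):
    return m * m * k * 8 * 2 * 12 / 1024 / 1024

# Example: k = 2 Gaussians, M = 200 dimensions (the recommended upper bound).
print(round(worker_memory_mb(2, 200), 2))  # 14.65
```

Note that even at the recommended dimension cap of 200, the formula yields under 15 MB per worker, so the common 8 GB setting leaves ample headroom for the input data and runtime overhead.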

  • How do I estimate the appropriate worker quantity?

We recommend that you configure the number of workers based on the input data size. For example, if the input data size is X GB, we recommend that you use 5 × X workers. If resources are insufficient, you can reduce the number of workers. A larger number of workers incurs higher overheads for cross-worker communication. Therefore, as you add workers, the distributed training task first speeds up but then slows down beyond a certain number of workers. Tune this parameter to find the optimal number.
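The rule of thumb above can be sketched as a small helper (illustrative, not a PAI API) that also clips the result to the component's valid range of [1,9999]:

```python
# Rule of thumb from this appendix: about 5 workers per GB of input data,
# clipped to the Number of Workers parameter's valid range [1, 9999].
# The function name is illustrative, not part of any PAI SDK.
def recommended_workers(input_gb):
    return min(max(int(5 * input_gb), 1), 9999)

print(recommended_workers(10))   # 10 GB of input -> 50 workers
print(recommended_workers(0.1))  # small inputs still get at least 1 worker
```

Treat the result as a starting point: as noted above, communication overhead means the optimal count may be lower, so benchmark around this value.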

  • How do I estimate the maximum amount of data that can be supported by the algorithm?

    We recommend that you set the number of vector dimensions to less than 200.