
Platform for AI: GMM Training

Last Updated: Feb 19, 2024

A Gaussian mixture model (GMM) is a probabilistic model that represents an overall population as a mixture of K Gaussian subpopulations. You can use the GMM Training component to train a GMM. This topic describes how to configure the GMM Training component.

Limits

You can use the GMM Training component based only on one of the following computing resources: MaxCompute, Realtime Compute for Apache Flink, or Deep Learning Containers (DLC) of Platform for AI (PAI).

Configure the component in the PAI console

You can configure the parameters of the GMM Training component in the PAI console.

The following parameters are grouped by the tab on which they appear.

Field Setting tab

  • vectorCol: The name of the vector column.

Parameter Setting tab

  • epsilon: The convergence threshold. When the distance between the cluster centers generated in two consecutive iterations is less than this value, the algorithm is considered to have converged. Default value: 1.0E-4.

  • k: The number of Gaussians. Default value: 2.

  • maxIter: The maximum number of iterations. Default value: 100.

  • randomSeed: The random seed given to the method. Default value: 0.

Execution Tuning tab

  • Number of Workers: The number of workers. This parameter must be used together with the Memory per worker, unit MB parameter. The value must be a positive integer. Valid values: [1,9999]. For more information, see the "Appendix: How to estimate the resource usage" section of this topic.

  • Memory per worker, unit MB: The memory size of each worker. Valid values: 1024 to 64 × 1024. Unit: MB. For more information, see the "Appendix: How to estimate the resource usage" section of this topic.
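PAI does not expose the component's training code, but the parameters above correspond to the usual expectation-maximization (EM) loop for fitting a GMM. The following minimal sketch fits a one-dimensional mixture and reuses the component's parameter names (k, epsilon, maxIter, randomSeed); it is an illustration of the algorithm, not PAI's implementation.

```python
# Minimal 1-D GMM training via EM (illustrative sketch, not PAI's backend).
# Parameter names mirror the component's settings: k, epsilon, maxIter, randomSeed.
import math
import random

def train_gmm_1d(data, k=2, epsilon=1.0e-4, maxIter=100, randomSeed=0):
    rnd = random.Random(randomSeed)
    # Farthest-point initialization keeps the initial centers spread out.
    means = [rnd.choice(data)]
    while len(means) < k:
        means.append(max(data, key=lambda x: min(abs(x - m) for m in means)))
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(maxIter):
        # E-step: responsibility of each Gaussian for each data point.
        resp = []
        for x in data:
            p = [w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
                 for w, m, v in zip(weights, means, variances)]
            s = sum(p)
            resp.append([pi / s for pi in p])
        # M-step: re-estimate weights, means, and variances.
        new_means = []
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            mj = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(
                sum(r[j] * (x - mj) ** 2 for r, x in zip(resp, data)) / nj, 1e-6)
            new_means.append(mj)
        # Convergence check: stop once the centers move less than epsilon,
        # matching the epsilon parameter's description above.
        shift = max(abs(a - b) for a, b in zip(new_means, means))
        means = new_means
        if shift < epsilon:
            break
    return means

# Toy "vector column": two well-separated 1-D clusters around 0 and 10.
rnd = random.Random(0)
data = [rnd.gauss(0, 1) for _ in range(200)] + [rnd.gauss(10, 1) for _ in range(200)]
print(sorted(round(m) for m in train_gmm_1d(data)))  # recovers centers near 0 and 10
```

With well-separated clusters, the center shift drops below epsilon long before maxIter is reached, which is why the default epsilon of 1.0E-4 usually terminates training early.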

Appendix: How to estimate the resource usage

Refer to the following guidelines to estimate resource usage.

  • How do I estimate the appropriate memory size for each worker?

    If the number of Gaussians is K and the number of vector dimensions is M, the appropriate memory size for each worker can be calculated by using the following formula: M × M × K × 8 × 2 × 12/1024/1024 (unit: MB). In most cases, the memory size of each worker is set to 8 GB.
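The formula above can be evaluated directly. The helper below is ours for illustration, not a PAI API:

```python
# Evaluates the memory formula from this appendix:
# M x M x K x 8 x 2 x 12 / 1024 / 1024 (MB), where K is the number of
# Gaussians and M is the number of vector dimensions.
# The function name is illustrative, not part of any PAI SDK.
def worker_memory_mb(k, m):
    return m * m * k * 8 * 2 * 12 / 1024 / 1024

# Example: k = 2 Gaussians, M = 200 dimensions (the recommended upper bound).
print(round(worker_memory_mb(2, 200), 2))  # 14.65
```

Note that even at the recommended dimension cap of 200, the formula yields under 15 MB per worker, so the common 8 GB setting leaves ample headroom for the input data and runtime overhead.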

  • How do I estimate the appropriate worker quantity?

We recommend that you configure the number of workers based on the input data size. For example, if the input data size is X GB, we recommend that you use 5 × X workers. If resources are insufficient, you can reduce the number of workers. A larger number of workers incurs higher overheads for cross-worker communication. Therefore, as you add workers, the distributed training task first speeds up but then slows down beyond a certain number of workers. Tune this parameter to find the optimal number.
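The rule of thumb above can be sketched as a small helper (illustrative, not a PAI API) that also clips the result to the component's valid range of [1,9999]:

```python
# Rule of thumb from this appendix: about 5 workers per GB of input data,
# clipped to the Number of Workers parameter's valid range [1, 9999].
# The function name is illustrative, not part of any PAI SDK.
def recommended_workers(input_gb):
    return min(max(int(5 * input_gb), 1), 9999)

print(recommended_workers(10))   # 10 GB of input -> 50 workers
print(recommended_workers(0.1))  # small inputs still get at least 1 worker
```

Treat the result as a starting point: as noted above, communication overhead means the optimal count may be lower, so benchmark around this value.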

  • How do I estimate the maximum amount of data that can be supported by the algorithm?

    We recommend that you set the number of vector dimensions to less than 200.