Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a density-based clustering algorithm. A cluster is defined as a maximal set of density-connected points. The algorithm treats high-density regions as clusters and can detect clusters of arbitrary shapes in spatial databases that contain noise. You can use a model trained by the DBSCAN Training component with the DBSCAN Prediction component to predict the clusters to which new points belong. This topic describes how to configure the DBSCAN Prediction component.
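The prediction step can be illustrated outside of PAI. The following sketch is not PAI's implementation or API; it uses scikit-learn to show the common heuristic for assigning new points to DBSCAN clusters: a new point joins the cluster of its nearest core sample if that core sample lies within eps, and is labeled as noise (-1) otherwise. The function name `dbscan_predict` is hypothetical.

```python
# Illustrative sketch only (not the PAI component's implementation).
# Assign new points to clusters found by a fitted scikit-learn DBSCAN model.
import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_predict(model, X_new):
    """Hypothetical helper: label new points using a fitted DBSCAN model."""
    labels = np.full(len(X_new), -1)                          # default: noise
    core_points = model.components_                           # core sample coordinates
    core_labels = model.labels_[model.core_sample_indices_]   # their cluster labels
    for i, x in enumerate(X_new):
        dists = np.linalg.norm(core_points - x, axis=1)
        j = np.argmin(dists)
        if dists[j] <= model.eps:                             # within eps of a core point
            labels[i] = core_labels[j]
    return labels

# Two well-separated groups of training points.
X_train = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
model = DBSCAN(eps=0.5, min_samples=2).fit(X_train)

# A point near the first group joins its cluster; a far point is noise.
print(dbscan_predict(model, np.array([[0.05, 0.05], [9.0, 9.0]])))
```

This mirrors what the component's predictionCol output represents: a cluster ID per input row, with noise points marked separately.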
Limits
You can use the DBSCAN Prediction component only with the MaxCompute and Flink computing resources of PAI.
Configure the component in the PAI console
You can configure parameters for the DBSCAN Prediction component in the Machine Learning Platform for AI (PAI) console.
| Tab | Parameter | Description |
| --- | --- | --- |
| Field Setting | reservedCols | The columns that are reserved by the algorithm. |
| Parameter Setting | predictionCol | The name of the prediction column. |
| Parameter Setting | predictionDetailCol | The name of the prediction details column. |
| Parameter Setting | numThreads | The number of threads used by the component. Default value: 1. |
| Execution Tuning | Number of Workers | The number of workers. This parameter must be used together with the Memory per worker, unit MB parameter. The value must be a positive integer. Valid values: [1,9999]. For more information, see the "Appendix: How to estimate resource usage" section of this topic. |
| Execution Tuning | Memory per worker, unit MB | The memory size of each worker. Valid values: 1024 to 65536 (64 × 1024). Unit: MB. For more information, see the "Appendix: How to estimate resource usage" section of this topic. |
Appendix: How to estimate resource usage
- How do I estimate the memory to be used by each node?
The memory used by each node is approximately the model size times 30.
For example, if the input model size is 1 GB, the memory of each node can be set to 30 GB.
- How do I estimate the number of nodes that I need?
As you increase the number of nodes, a distributed task first speeds up, but eventually slows down because communication overhead grows with the node count. If adding nodes makes the task slower, stop increasing the node quantity and use the current value.