K-means clustering randomly selects K objects as the initial centroids of the clusters, computes the distance between each remaining object and every centroid, assigns each object to the nearest cluster, and then recalculates the centroid of each cluster. The algorithm assumes that the objects to be clustered are vectors in space. It aims to minimize the sum of the mean squared error (MSE) within each cluster, and repeats the assignment and recalculation steps until the criterion function converges.
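The iteration described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used by the component: the function names, the squared Euclidean distance, and the stopping rule based on the change in total cost are all choices made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, tol=1e-9, seed=0):
    """Illustrative K-means: random initial centroids, then assign/update loops."""
    rng = random.Random(seed)
    # Randomly select K objects as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = []
    prev_cost = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
        # Criterion function: total squared error of the current assignment.
        cost = sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))
        if prev_cost is not None and abs(prev_cost - cost) < tol:
            break
        prev_cost = cost
    return centroids, labels
```

For example, `kmeans([(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)], 2)` separates the two pairs of nearby points into different clusters.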
Considerations
When you use the K-means Clustering component, take note of the following items:
- If the cosine distance is used, some clusters may be empty. In this case, the number of clusters is less than K. This can happen because some of the K initial centroids may be parallel vectors: if the centroids are traversed in sequence, samples are not assigned to the centroids that are parallel vectors. To avoid this issue, we recommend that you specify the K centroids in an external centroid table.
- If the input table contains NULL or empty values, the system reports the following error:
  Algo Job Failed-System Error-Null feature value found
  We recommend that you use the default values for imputation.
- If data in the sparse format is used as an input and the largest column ID exceeds 2000000, the system reports the following error:
  Algo Job Failed-System Error-Feature count can't be more than 2000000
  We recommend that you renumber the columns from 0 or 1.
- If a write operation fails because the centroid model is too large, the system reports the following error:
  Algo Job Failed-System Error-kIOError:Write failed for message: comparison_measure
  We recommend that you renumber the columns whose data is in the sparse format from 0 or 1. If the value of col*centerCount is greater than 270000000, run commands to remove the modelName parameter, and then perform clustering again.
- If the name of a column in the input table contains SQL keywords, the system reports the following error:
  FAILED: Failed Task createCenterTable:kOtherError:ODPS-0130161:[1,558] Parse exception - invalid token ',', expect ")"
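The cosine consideration above can be illustrated with a short Python sketch. It assumes that the cosine distance is defined as 1 minus the cosine similarity and that ties are resolved in favor of the centroid that is visited first; both are assumptions made for this example, and the function names are hypothetical.

```python
import math


def cosine_distance(x, c):
    """Assumed definition: 1 - cosine similarity of x and c."""
    dot = sum(a * b for a, b in zip(x, c))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in c))
    return 1.0 - dot / norm


def assign(sample, centroids):
    """Traverse centroids in sequence; a later centroid wins only if strictly closer."""
    best_index = 0
    best_distance = cosine_distance(sample, centroids[0])
    for index in range(1, len(centroids)):
        distance = cosine_distance(sample, centroids[index])
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index


# Centroids 0 and 1 are parallel vectors: (2, 2) is a multiple of (1, 1).
centroids = [(1.0, 1.0), (2.0, 2.0), (1.0, 0.0)]
samples = [(3.0, 4.0), (1.0, 2.0), (5.0, 5.0), (4.0, 1.0), (2.0, 0.1)]
assignments = [assign(s, centroids) for s in samples]
```

Because parallel centroids are at the same cosine distance from every sample, the second of the two never wins the comparison, so cluster 1 stays empty in this sketch.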
Configure the component
You can configure the component by using one of the following methods:
- Use the Machine Learning Platform for AI console
  | Tab | Parameter | Description |
  | --- | --- | --- |
  | Fields Setting | Feature Columns | The feature columns. Columns of the DOUBLE and INT types are supported. |
  | Fields Setting | Appending Columns | The input columns that are appended to the clustering result table. Separate the column names with commas (,). |
  | Fields Setting | Input in Sparse Matrix | Specifies whether the input data is in the sparse format. Data in the sparse format is presented by using key-value pairs. |
  | Fields Setting | KV Pair Delimiter | The delimiter that is used to separate key-value pairs. Commas (,) are used by default. |
  | Fields Setting | KV Delimiter | The delimiter that is used to separate keys and values. Colons (:) are used by default. |
  | Parameters Setting | Number of Clusters | Valid values: 1 to 1000. |
  | Parameters Setting | Distance Measurement Method | The method that is used to measure distances. Valid values: Euclidean, Cosine, and Cityblock. |
  | Parameters Setting | Centroid Initialization Method | The method that is used to initialize centroids. Valid values: Random, First K, Uniform, K-Means++, and Use the initial centroid table. |
  | Parameters Setting | Maximum Iterations | Valid values: 1 to 1000. |
  | Parameters Setting | Convergence Criterion | The threshold to terminate iterations. |
  | Parameters Setting | Initial Random Seed | By default, the current time is used. If this parameter uses a fixed value, the clustering result is stable. |
  | Tuning | Core Quantity | The number of cores. By default, the system determines the value. |
  | Tuning | Memory Size per Core | The memory size of each core. By default, the system determines the value. Unit: MB. |
- Use commands
  pai -name kmeans
      -project algo_public
      -DinputTableName=pai_kmeans_test_input
      -DselectedColNames=f0,f1
      -DappendColNames=f0,f1
      -DcenterCount=3
      -Dloop=10
      -Daccuracy=0.01
      -DdistanceType=euclidean
      -DinitCenterMethod=random
      -Dseed=1
      -DmodelName=pai_kmeans_test_output_model_
      -DidxTableName=pai_kmeans_test_output_idx
      -DclusterCountTableName=pai_kmeans_test_output_couter
      -DcenterTableName=pai_kmeans_test_output_center;
  | Parameter | Required | Description | Default value |
  | --- | --- | --- | --- |
  | inputTableName | Yes | The name of the input table. | N/A |
  | selectedColNames | No | The columns that are selected from the input table for training. Separate the column names with commas (,). Columns of the INT and DOUBLE types are supported. If the input data is in the sparse format, columns of the STRING type are supported. | All columns |
  | inputTablePartitions | No | The partitions that are selected from the input table for training. Specify this parameter in one of the following formats: Partition_name=value, or name1=value1/name2=value2 for multi-level partitions. Note: If you specify multiple partitions, separate them with commas (,). | All partitions |
  | appendColNames | No | The input columns that are appended to the clustering result table. Separate the column names with commas (,). | N/A |
  | enableSparse | No | Specifies whether the input data is in the sparse format. Valid values: true and false. | false |
  | itemDelimiter | No | The delimiter that is used to separate key-value pairs. | , |
  | kvDelimiter | No | The delimiter that is used to separate keys and values in key-value pairs. | : |
  | centerCount | Yes | The number of clustering centroids. Valid values: 1 to 1000. | 10 |
  | distanceType | No | The method that is used to measure distances. Valid values: euclidean (the Euclidean distance, calculated as $d(x, c) = (x - c)(x - c)'$), cosine (the cosine distance), and cityblock (the city block distance, also called the Manhattan distance, calculated as $d(x, c) = \lvert x - c \rvert$). | euclidean |
  | initCenterMethod | No | The method that is used to initialize centroids. Valid values: random (K initial centroids are randomly sampled from the input table; the initial random seed is specified by the seed parameter), topk (the first K rows in the input table are used as the initial centroids), uniform (K initial centroids are calculated from the minimum value to the maximum value, which ensures that the initial centroids are evenly distributed), kmpp (K initial centroids are obtained by using the k-means++ algorithm), and external (the initial centroids are read from the table that is specified by the initCenterTableName parameter). | random |
  | initCenterTableName | No | The name of the table that lists initial centroids. This parameter is valid only if the initCenterMethod parameter is set to external. | N/A |
  | loop | No | The maximum number of iterations. Valid values: 1 to 1000. | 100 |
  | accuracy | No | The condition to terminate the algorithm. The algorithm is terminated if the difference in the objective between two iterations is less than the value of this parameter. | 0.1 |
  | seed | No | The initial random seed. | Current time |
  | modelName | No | The name of the output model. | N/A |
  | idxTableName | Yes | The name of the clustering result table, which includes the ID of the cluster to which each record belongs after the clustering. | N/A |
  | idxTablePartition | No | The partition in the clustering result table. | N/A |
  | clusterCountTableName | No | The clustering statistics table, which counts the number of points in each cluster. | N/A |
  | centerTableName | No | The clustering centroid table. | N/A |
  | coreNum | No | The number of cores. This parameter must be used with the memSizePerCore parameter. Valid values: 1 to 9999. | Determined by the system |
  | memSizePerCore | No | The memory size of each core. Valid values: 1024 to 64 x 1024. Unit: MB. | Determined by the system |
  | lifecycle | No | The lifecycle of the output table. Unit: days. | N/A |
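The three values of distanceType can be sketched in Python as follows. The Euclidean function follows the formula given for euclidean (a squared distance), and the city block function sums the absolute coordinate differences. The formula for cosine is not given in this topic, so the sketch assumes the common definition of 1 minus the cosine similarity; the component may use a different definition. The function and dictionary names are hypothetical.

```python
import math


def euclidean(x, c):
    """(x - c)(x - c)': the sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def cosine(x, c):
    """Assumed definition: 1 - (x . c) / (|x| |c|)."""
    dot = sum(a * b for a, b in zip(x, c))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in c))
    return 1.0 - dot / norm


def cityblock(x, c):
    """City block (Manhattan) distance: the sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, c))


# Lookup keyed by the distanceType values listed in the parameter table.
DISTANCES = {"euclidean": euclidean, "cosine": cosine, "cityblock": cityblock}


def nearest_centroid(sample, centroids, distance_type="euclidean"):
    """Return the index of the centroid closest to the sample."""
    measure = DISTANCES[distance_type]
    return min(range(len(centroids)), key=lambda j: measure(sample, centroids[j]))
```

For example, `nearest_centroid((0, 3), [(1, 2), (1, 3), (1, 4)])` returns 1, because (1, 3) is the closest of the three centroids under the Euclidean measure.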
Output
The output data of the K-means Clustering component includes the clustering result table, clustering statistics table, and clustering centroid table. Output format:
- Clustering result table

  | Column | Description |
  | --- | --- |
  | appendColNames | The names of the appended columns. |
  | cluster_index | The cluster to which each sample is assigned in the training table. |
  | distance | The distance from each sample to the cluster centroid in the training table. |

- Clustering statistics table

  | Column | Description |
  | --- | --- |
  | cluster_index | The ID of the cluster. |
  | cluster_count | The number of samples in each cluster. |

- Clustering centroid table

  | Column | Description |
  | --- | --- |
  | cluster_index | The ID of the cluster. |
  | selectedColNames | The columns that are selected from the training table for training. |
Example
Input data in the dense format:
- Generate test data. The input table is required in both cases. The initial centroid table is required only if you use the initial centroid table.
  - Initial centroid table
    create table pai_kmeans_test_init_center as
    select * from
    (
      select 1 as f0, 2 as f1 from dual
      union all
      select 1 as f0, 3 as f1 from dual
      union all
      select 1 as f0, 4 as f1 from dual
    ) tmp;
  - Input table
    create table pai_kmeans_test_input as
    select * from
    (
      select 'id1' as id, 1 as f0, 2 as f1 from dual
      union all
      select 'id2' as id, 1 as f0, 3 as f1 from dual
      union all
      select 'id3' as id, 1 as f0, 4 as f1 from dual
      union all
      select 'id4' as id, 0 as f0, 3 as f1 from dual
      union all
      select 'id5' as id, 0 as f0, 4 as f1 from dual
    ) tmp;
- Run PAI commands to submit the parameters of the K-means Clustering component.
  - Use the initial centroid table
    drop table if exists pai_kmeans_test_output_idx;
    yes
    drop table if exists pai_kmeans_test_output_couter;
    yes
    drop table if exists pai_kmeans_test_output_center;
    yes
    drop offlinemodel if exists pai_kmeans_test_output_model_;
    yes
    pai -name kmeans
        -project algo_public
        -DinputTableName=pai_kmeans_test_input
        -DinitCenterTableName=pai_kmeans_test_init_center
        -DselectedColNames=f0,f1
        -DappendColNames=f0,f1
        -DcenterCount=3
        -Dloop=10
        -Daccuracy=0.01
        -DdistanceType=euclidean
        -DinitCenterMethod=external
        -Dseed=1
        -DmodelName=pai_kmeans_test_output_model_
        -DidxTableName=pai_kmeans_test_output_idx
        -DclusterCountTableName=pai_kmeans_test_output_couter
        -DcenterTableName=pai_kmeans_test_output_center;
  - Use the initial centroids that are randomly selected
    drop table if exists pai_kmeans_test_output_idx;
    yes
    drop table if exists pai_kmeans_test_output_couter;
    yes
    drop table if exists pai_kmeans_test_output_center;
    yes
    drop offlinemodel if exists pai_kmeans_test_output_model_;
    yes
    pai -name kmeans
        -project algo_public
        -DinputTableName=pai_kmeans_test_input
        -DselectedColNames=f0,f1
        -DappendColNames=f0,f1
        -DcenterCount=3
        -Dloop=10
        -Daccuracy=0.01
        -DdistanceType=euclidean
        -DinitCenterMethod=random
        -Dseed=1
        -DmodelName=pai_kmeans_test_output_model_
        -DidxTableName=pai_kmeans_test_output_idx
        -DclusterCountTableName=pai_kmeans_test_output_couter
        -DcenterTableName=pai_kmeans_test_output_center;
- View the clustering result table, clustering statistics table, and clustering centroid table.
  - Clustering result table specified by idxTableName
    +------------+------------+---------------+------------+
    | f0         | f1         | cluster_index | distance   |
    +------------+------------+---------------+------------+
    | 1          | 2          | 0             | 0.0        |
    | 1          | 3          | 1             | 0.5        |
    | 1          | 4          | 2             | 0.5        |
    | 0          | 3          | 1             | 0.5        |
    | 0          | 4          | 2             | 0.5        |
    +------------+------------+---------------+------------+
  - Clustering statistics table specified by clusterCountTableName
    +---------------+---------------+
    | cluster_index | cluster_count |
    +---------------+---------------+
    | 0             | 1             |
    | 1             | 2             |
    | 2             | 2             |
    +---------------+---------------+
  - Clustering centroid table specified by centerTableName
    +---------------+------------+------------+
    | cluster_index | f0         | f1         |
    +---------------+------------+------------+
    | 0             | 1.0        | 2.0        |
    | 1             | 0.5        | 3.0        |
    | 2             | 0.5        | 4.0        |
    +---------------+------------+------------+
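The three output tables of the dense example are related in a simple way, which the following Python sketch makes explicit. Starting from the f0, f1, and cluster_index values of the clustering result table, it recomputes the cluster counts, the centroids as per-cluster means, and the distance of each sample to its centroid. The function name is hypothetical, and the use of the non-squared Euclidean distance for the distance column is an inference from the values shown in this example rather than a documented behavior.

```python
import math

# (f0, f1, cluster_index) rows copied from the clustering result table above.
result_rows = [(1, 2, 0), (1, 3, 1), (1, 4, 2), (0, 3, 1), (0, 4, 2)]


def summarize(rows):
    """Rebuild the statistics table, centroid table, and distance column."""
    members = {}
    for f0, f1, cluster in rows:
        members.setdefault(cluster, []).append((f0, f1))
    # Statistics table: the number of samples in each cluster.
    counts = {cluster: len(points) for cluster, points in members.items()}
    # Centroid table: the mean of each feature over the members of a cluster.
    centroids = {
        cluster: tuple(sum(values) / len(points) for values in zip(*points))
        for cluster, points in members.items()
    }
    # Distance column: Euclidean distance from each sample to its centroid.
    distances = [math.dist((f0, f1), centroids[cluster]) for f0, f1, cluster in rows]
    return counts, centroids, distances


counts, centroids, distances = summarize(result_rows)
```

The recomputed counts, centroids, and distances agree with the statistics table, the centroid table, and the distance column shown above.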
Input data in the sparse format:
- Generate test data.
  create table pai_kmeans_test_sparse_input as
  select * from
  (
    select 1 as id, "s1" as id_s, "0:0.1,1:0.2" as kvs0, "2:0.3,3:0.4" as kvs1 from dual
    union all
    select 2 as id, "s2" as id_s, "0:1.1,2:1.2" as kvs0, "4:1.3,5:1.4" as kvs1 from dual
    union all
    select 3 as id, "s3" as id_s, "0:2.1,3:2.2" as kvs0, "6:2.3,7:2.4" as kvs1 from dual
    union all
    select 4 as id, "s4" as id_s, "0:3.1,4:3.2" as kvs0, "8:3.3,9:3.4" as kvs1 from dual
    union all
    select 5 as id, "s5" as id_s, "0:5.1,5:5.2" as kvs0, "10:5.3,6:5.4" as kvs1 from dual
  ) tmp;
  If input data is in the sparse format, 0 is used to impute the columns with missing values. If multiple columns are used as an input, these columns are merged. For example, if kvs0 and kvs1 are used as an input, the first row contains the following data:
  0:0.1,1:0.2,2:0.3,3:0.4,4:0,5:0,6:0,7:0,8:0,9:0,10:0
  In this example, the sparse matrix is numbered from 0 and has five rows and 11 columns. If a column in kvs contains 123456789:0.1, the sparse matrix has five rows and 123456789 columns. This matrix consumes large amounts of CPU and memory resources. If kvs contains columns that are incorrectly numbered, we recommend that you renumber the columns to reduce the size of the matrix.
- Run the following PAI command to submit the parameters of the K-means Clustering component:
  pai -name kmeans
      -project algo_public
      -DinputTableName=pai_kmeans_test_sparse_input
      -DenableSparse=true
      -DselectedColNames=kvs0,kvs1
      -DappendColNames=id,id_s
      -DitemDelimiter=,
      -DkvDelimiter=:
      -DcenterCount=3
      -Dloop=100
      -Daccuracy=0.01
      -DdistanceType=euclidean
      -DinitCenterMethod=topk
      -Dseed=1
      -DmodelName=pai_kmeans_test_input_sparse_output_model
      -DidxTableName=pai_kmeans_test_sparse_output_idx
      -DclusterCountTableName=pai_kmeans_test_sparse_output_couter
      -DcenterTableName=pai_kmeans_test_sparse_output_center;
- View the clustering result table, clustering statistics table, and clustering centroid table.
  - Clustering result table specified by idxTableName
    +------------+------------+---------------+--------------------+
    | id         | id_s       | cluster_index | distance           |
    +------------+------------+---------------+--------------------+
    | 4          | s4         | 0             | 2.90215437218629   |
    | 5          | s5         | 1             | 0.0                |
    | 1          | s1         | 2             | 0.7088723439378913 |
    | 2          | s2         | 2             | 1.1683321445547923 |
    | 3          | s3         | 0             | 2.0548722588034516 |
    +------------+------------+---------------+--------------------+
  - Clustering statistics table specified by clusterCountTableName
    +---------------+---------------+
    | cluster_index | cluster_count |
    +---------------+---------------+
    | 0             | 2             |
    | 1             | 1             |
    | 2             | 2             |
    +---------------+---------------+
  - Clustering centroid table specified by centerTableName
    +---------------+---------------------------------------+--------------------------------+
    | cluster_index | kvs0                                  | kvs1                           |
    +---------------+---------------------------------------+--------------------------------+
    | 0             | 0:2.6,1:0,2:0,3:1.1,4:1.6,5:0         | 6:1.15,7:1.2,8:1.65,9:1.7,10:0 |
    | 1             | 0:5.1,1:0,2:0,3:0,4:0,5:5.2           | 6:5.4,7:0,8:0,9:0,10:5.3       |
    | 2             | 0:0.6,1:0.1,2:0.75,3:0.2,4:0.65,5:0.7 | 6:0,7:0,8:0,9:0,10:0           |
    +---------------+---------------------------------------+--------------------------------+
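The handling of sparse input described in this example (merging several key-value columns, imputing missing columns with 0, and renumbering column IDs to keep the matrix small) can be sketched in Python as follows. The helper names are hypothetical and the code only illustrates the data layout; it is not how the component parses its input.

```python
# kvs0 and kvs1 values of the five rows in pai_kmeans_test_sparse_input.
EXAMPLE = [
    ("0:0.1,1:0.2", "2:0.3,3:0.4"),
    ("0:1.1,2:1.2", "4:1.3,5:1.4"),
    ("0:2.1,3:2.2", "6:2.3,7:2.4"),
    ("0:3.1,4:3.2", "8:3.3,9:3.4"),
    ("0:5.1,5:5.2", "10:5.3,6:5.4"),
]


def parse_kv(cell, item_delimiter=",", kv_delimiter=":"):
    """Parse one key-value cell such as '0:0.1,1:0.2' into {column_id: value}."""
    pairs = {}
    for item in cell.split(item_delimiter):
        if item:
            key, value = item.split(kv_delimiter)
            pairs[int(key)] = float(value)
    return pairs


def merge_row(cells):
    """Merge the key-value cells of all selected columns of one row."""
    merged = {}
    for cell in cells:
        merged.update(parse_kv(cell))
    return merged


def to_dense(row, width):
    """Expand a sparse row to a dense list, using 0 for missing columns."""
    return [row.get(column, 0.0) for column in range(width)]


def renumber(rows, start=0):
    """Map the column IDs that occur in the data to consecutive IDs from start."""
    ids = sorted({column for row in rows for column in row})
    mapping = {old: new for new, old in enumerate(ids, start)}
    return [{mapping[c]: v for c, v in row.items()} for row in rows], mapping


rows = [merge_row(cells) for cells in EXAMPLE]
width = max(max(row) for row in rows) + 1  # column IDs are numbered from 0
first_row = to_dense(rows[0], width)
```

In this sketch, `width` is 11 and `first_row` corresponds to the merged first row shown in the example. Calling `renumber` on a row that contains the column ID 123456789 maps that ID to a small consecutive number, which is the effect that the renumbering recommendation aims for.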