K-means clustering algorithm (K-means) - PolarDB

K-means is an iterative clustering algorithm built into PolarDB for MySQL. Use it to group rows in a table into K distinct clusters based on numeric feature columns — directly in your database, without exporting data to an external ML platform.

How it works

K-means partitions data through the following steps:

1. Choose the number of clusters K and randomly select K rows as the initial centroids.
2. Calculate the distance from each row to every centroid.
3. Assign each row to its nearest centroid.
4. Recalculate each centroid as the mean of all rows assigned to it.
5. Repeat steps 2–4 until the centroids stabilize (assignments no longer change) or the algorithm reaches its iteration limit.
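
The iteration described above can be sketched in plain Python. This is an illustrative sketch only, not PolarDB's internal implementation: the function name kmeans, the max_iter and seed parameters, the use of Euclidean distance, and the handling of empty clusters are all assumptions made for this example.

```python
import math
import random


def kmeans(rows, k, max_iter=100, seed=0):
    """Cluster `rows` (a list of equal-length numeric tuples) into k groups.

    Returns (centroids, labels), where labels[i] is the index of the
    centroid assigned to rows[i]. Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    # Step 1: pick k distinct rows as the initial centroids.
    centroids = [tuple(r) for r in rng.sample(rows, k)]
    labels = None

    for _ in range(max_iter):
        # Steps 2-3: measure the (assumed Euclidean) distance to every
        # centroid and assign each row to the nearest one.
        new_labels = [
            min(range(k), key=lambda c: math.dist(row, centroids[c]))
            for row in rows
        ]
        # Stop once assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels

        # Step 4: recompute each centroid as the mean of its assigned rows.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # assumption: keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))

    return centroids, labels


# Two well-separated groups of made-up (dx1, dx2) values.
rows = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.2), (7.9, 8.1), (8.3, 7.7)]
centroids, labels = kmeans(rows, k=2)
```

For this sample, the first three rows end up sharing one label and the last three share the other. Which group receives label 0 depends on the random initialization.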

The algorithm uses the columns you specify in x_cols as features and outputs a cluster label for each row.

Use cases

K-means works well when you need to discover natural groupings in numeric data. Common applications include:

Document classification: Represent documents as vectors using term frequency, then cluster the vectors to identify groups of similar documents by topic or content.
Customer segmentation: Group customers by purchase history, interests, or activity data to identify segments for targeted campaigns. For example, segment telecom subscribers by usage and payment behavior, such as top-ups, text message sending, and website browsing.
Fraud detection: Cluster historical transaction records to identify patterns associated with fraudulent claims, then apply the model to flag new records that match known fraud clusters. This approach is used in automobile insurance, medical insurance, and other insurance fraud detection.
IT alert clustering: Cluster alerts generated by network, storage, or database infrastructure to categorize alerts, estimate mean time to repair, and predict cascading failures before they occur.
Call record analysis: Combine call detail records (CDRs) with customer profiles to cluster subscribers by usage behavior, helping predict service needs and reduce churn.
Crime pattern analysis: Cluster crime events by type and location to identify hotspots and assist investigation prioritization.

Parameters

Set the following parameter in the model_parameter option of your CREATE MODEL statement.

Parameter	Description	Default
`n_clusters`	Number of clusters (K).	`4`

Examples

The following examples use the db4ai.testdata1 table, which contains numeric columns dx1 and dx2.

Create a K-means model

/*polar4ai*/CREATE MODEL test_kmeans WITH
(model_class = 'kmeans', x_cols = 'dx1,dx2',
 y_cols='', model_parameter=(n_clusters=2))
 AS (SELECT * FROM db4ai.testdata1);

This statement trains a K-means model named test_kmeans that clusters rows into 2 groups based on the dx1 and dx2 columns.

Run predictions

After the model is trained, use PREDICT to get the cluster assignment for each row:

/*polar4ai*/SELECT dx1, dx2 FROM
PREDICT(MODEL test_kmeans,
SELECT * FROM db4ai.testdata1 LIMIT 10)
WITH (x_cols = 'dx1,dx2',
      y_cols='');

Each row in the result includes its original feature values (dx1, dx2) and a cluster label indicating which group it was assigned to.

Usage notes

Columns specified in x_cols must contain floating-point or integer data.