Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is an unsupervised, density-based clustering algorithm. DBSCAN uses a radius threshold and a point-count threshold to identify the core points in a region and their neighbors. It then groups data points into clusters based on density-reachability and density-connectivity.
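
The clustering rule described above can be sketched in a few lines of Python. This is a minimal, single-machine illustration and not the implementation behind the component. It assumes Euclidean distance (this page does not name the metric) and follows the wording of this page for the two thresholds: a neighbor is a point at a distance less than eps, and a core point has more than min_points other points in its neighborhood. The names region_query, dbscan, and NOISE are made up for the example.

```python
import math
from collections import deque

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of the other points closer than eps to points[i].

    Assumption: Euclidean distance; the page does not state the metric.
    """
    return [j for j, q in enumerate(points)
            if j != i and math.dist(points[i], q) < eps]


def dbscan(points, eps, min_points):
    """Label each point with a cluster index (0, 1, ...) or NOISE.

    Assumption: a point is a core point when it has MORE THAN min_points
    other points within eps, mirroring the parameter description on this
    page. Textbook DBSCAN often counts the point itself and uses >=.
    """
    labels = [None] * len(points)
    neighbors = [region_query(points, i, eps) for i in range(len(points))]
    is_core = [len(n) > min_points for n in neighbors]
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None or not is_core[i]:
            continue
        # Start a new cluster at an unlabeled core point and grow it by
        # following neighbors of core points (density-reachability).
        labels[i] = cluster
        queue = deque([i])
        while queue:
            p = queue.popleft()
            for q in neighbors[p]:
                if labels[q] is None:
                    labels[q] = cluster
                    if is_core[q]:  # only core points expand the cluster
                        queue.append(q)
        cluster += 1
    return [NOISE if label is None else label for label in labels]
```

Calling dbscan on two tight groups of four points plus one distant point, with eps=0.5 and min_points=2, labels the groups 0 and 1 and marks the distant point as noise.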

Limits

  • This DBSCAN component can be used only in Machine Learning Studio.
  • In the input table of the DBSCAN component, the first column is the sample ID column. It must be named mid, and its values must be integers that start from 0. The second and later columns are vector dimension columns and must be named f1, f2, and so on.
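
Before you submit a table, you may want to check it against the naming rules above. The following sketch is a hypothetical pre-check and not part of the component. The function name check_dbscan_input and its two arguments (the ordered column names and the values of the first column) are assumptions made for illustration, and "start from 0" is read as "the smallest ID is 0".

```python
import re


def check_dbscan_input(columns, ids):
    """Return a list of problems found in the column names and ID values.

    columns: ordered list of column names of the input table.
    ids:     values of the first (sample ID) column.
    Both arguments are hypothetical inputs for this sketch.
    """
    problems = []
    if not columns or columns[0] != "mid":
        problems.append("first column must be named 'mid'")
    for pos, name in enumerate(columns[1:], start=1):
        # Assumption: dimension columns follow the pattern f1, f2, f3, ...
        if not re.fullmatch(r"f[1-9]\d*", name):
            problems.append(f"column {pos} is named {name!r}, expected f1, f2, ...")
    if any(not isinstance(v, int) or isinstance(v, bool) for v in ids):
        problems.append("IDs must be integers")
    elif ids and min(ids) != 0:
        problems.append("IDs must start at 0")
    return problems
```

An empty list means that the header and IDs pass these checks. The sketch does not verify column data types or whether the IDs are contiguous, because this page does not state such rules.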

DBSCAN

You can configure the component by using one of the following methods:
  • Machine Learning Platform for AI (PAI) console
    Tab | Parameter | Description
    Parameters Setting | Input Data Type | The type of the input data. Valid values: Adjacency List and Vector.
    Parameters Setting | Data Vector Dimension | The vector dimension of the input data. This parameter is required only if the Input Data Type parameter is set to Vector. Note: If the Input Table Format parameter is set to Multiple Columns, the value of this parameter must equal the number of vector dimension columns selected in the Data Columns (First Select ID Column) parameter.
    Parameters Setting | Neighborhood Point Distance Threshold | If the distance between two points is less than the threshold, the points are considered to be neighbors. This parameter is required only if the Input Data Type parameter is set to Vector.
    Parameters Setting | Core Object Density Threshold | If the number of other points in the neighborhood of a point is greater than this threshold, the point is a core object.
    Parameters Setting | Input Table Format | This parameter is required only if the Input Data Type parameter is set to Vector. Valid values: Multiple Columns (multiple columns are used to represent a vector) and Two Columns (a single column is used to represent a vector, with the dimensions separated by commas).
    Fields Information | Data Columns (First Select ID Column) | This parameter is required only if the Input Table Format parameter is set to Multiple Columns.
    Tuning | Servers | The number of servers.
    Tuning | Workers (>1) | The number of workers.
    Tuning | CPUs per Server | The number of CPUs for each server.
    Tuning | CPUs per Worker | The number of CPUs for each worker.
    Tuning | Memory Size per Worker | The memory size of each worker. Unit: MB.
    Tuning | Memory Size per Server | The memory size of each server. Unit: MB.
  • PAI commands
    DBSCAN accepts either a neighbor table or vectors as input. A vector can be stored in a single column (the two-column format) or spread across multiple columns. Sample commands:
    • Use a neighbor table as an input
      pai -name ps_dbscan
      -DinputTable="hxdb_neighbor_data_order"
      -DinputType="1"
      -DoutputTable="hxtmp2"
      -DminPoints="4"
      -DserverNum="1"
      -DserverCpu="300"
      -DserverMemory="3000"
      -DworkerNum="2"
      -DworkerCpu="800"
      -DworkerMemory="2000"
    • Use a vector represented by multiple columns as an input
      pai -name ps_dbscan
      -DinputTable="hxdb_multicols_data"
      -DinputType="0"
      -DoutputTable="hxtmp"
      -DdataType="DenseMultiCols"
      -DpointDim="12"
      -Deps="4"
      -DminPoints="20"
      -DselectedColIds="all"
      -DserverNum="2"
      -DserverCpu="300"
      -DserverMemory="3000"
      -DworkerNum="10"
      -DworkerCpu="800"
      -DworkerMemory="2000"
    • Use a vector represented by two columns as an input
      pai -name ps_dbscan
      -DinputTable="hxdb_sample_60w"
      -DinputType="0"
      -DoutputTable="hxtmp1"
      -DdataType="Dense2Cols"
      -DpointDim="2"
      -Deps="0.01"
      -DminPoints="10"
      -DselectedColIds="all"
      -DserverNum="2"
      -DserverCpu="300"
      -DserverMemory="3000"
      -DworkerNum="10"
      -DworkerCpu="800"
      -DworkerMemory="2000"
    Parameter | Required | Description | Default value
    inputTable | Yes | The name of the input table. | N/A
    outputTable | Yes | The name of the output table. | N/A
    inputType | No | The type of the input data. Valid values: 0 (a vector is used as an input) and 1 (a neighbor table is used as an input). | 0
    pointDim | No | The vector dimension of the input data. This parameter is required only if the inputType parameter is set to 0. Note: If the dataType parameter is set to DenseMultiCols, the value of the pointDim parameter must equal the number of vector dimension columns selected by the selectedColIds parameter. | 10
    eps | No | The threshold of the distance between two neighbors. If the distance between two points is less than the threshold, the points are considered to be neighbors. This parameter is required only if the inputType parameter is set to 0. | 1.0
    minPoints | No | The density threshold for a core object. If the number of other points in the neighborhood of a point is greater than this threshold, the point is a core object. | 10
    dataType | No | The format of the input table. This parameter is required only if the inputType parameter is set to 0. Valid values: DenseMultiCols (multiple columns are used to represent a vector) and Dense2Cols (a single column is used to represent a vector, with the dimensions separated by commas). | Dense2Cols
    selectedColIds | No | The columns that contain the data. This parameter is required only if the dataType parameter is set to DenseMultiCols. You can set the parameter to all or to a value in the format of 0,1,3. Note: The ID column must be the first column. | all
    serverNum | Yes | The number of servers. | 5
    workerNum | Yes | The number of workers. | 30
    serverCpu | Yes | The number of CPUs for each server. | 8
    workerCpu | Yes | The number of CPUs for each worker. | 8
    workerMemory | Yes | The memory size of each worker. Unit: MB. | 10000
    serverMemory | Yes | The memory size of each server. Unit: MB. | 10000
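
If you generate these commands from a script, a small helper can catch misspelled parameter names and invalid values before a job is submitted. The following sketch only assembles the command text in the same layout as the sample commands above. It does not submit anything, and the helper name build_dbscan_command is hypothetical. The parameter names and valid values come from the table above. Parameters that you do not pass are left out so that the component applies its own defaults.

```python
# Optional parameters of ps_dbscan, in the order used by the sample commands.
ORDER = ["inputType", "dataType", "pointDim", "eps", "minPoints",
         "selectedColIds", "serverNum", "serverCpu", "serverMemory",
         "workerNum", "workerCpu", "workerMemory"]

# Enumerated values taken from the parameter table.
VALID_VALUES = {
    "inputType": {"0", "1"},
    "dataType": {"DenseMultiCols", "Dense2Cols"},
}


def build_dbscan_command(input_table, output_table, **params):
    """Return the text of a `pai -name ps_dbscan` command (hypothetical helper)."""
    unknown = sorted(set(params) - set(ORDER))
    if unknown:
        raise ValueError(f"unknown parameters: {unknown}")
    for name, valid in VALID_VALUES.items():
        if name in params and str(params[name]) not in valid:
            raise ValueError(f"{name} must be one of {sorted(valid)}")
    lines = ["pai -name ps_dbscan",
             f'-DinputTable="{input_table}"',
             f'-DoutputTable="{output_table}"']
    # Emit only the parameters that were passed, one per line.
    lines += [f'-D{name}="{params[name]}"' for name in ORDER if name in params]
    return "\n".join(lines)
```

For example, build_dbscan_command("hxdb_neighbor_data_order", "hxtmp2", inputType=1, minPoints=4) returns the first five lines of the neighbor table sample, with the parameters in the helper's own order.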

Input data

The input can be a neighbor table or a table of vectors. A vector can be stored in a single column (the two-column format) or spread across multiple columns. Examples:
  • Use a neighbor table
    +-------------+------------+
    | mid(bigint) | f1(string) |
    +-------------+------------+
    | 0           | 2,3,0      |
    | 1           | 1,2,3,4    |
    | 2           | 2,1,5      |
    | 3           | 1,3        |
    | 4           | 1,4        |
    | 5           | 2,5,1,0    |
    +-------------+------------+
    Note: The neighbor list of a point must include the point itself. For example, the neighbors of point 0 must include point 0.
  • Use a two-dimensional vector that is represented by multiple columns
    +--------------+------------+------------+
    | mid(bigint)  | f1(double) | f2(double) |
    +--------------+------------+------------+
    | 0            | 0.0        | 0.3        |
    | 1            | 0.0        | 1.0        |
    | 2            | 0.0        | 0.1        |
    | 3            | 1.0        | 0.0        |
    | 4            | 0.0        | 0.2        |
    | 5            | 1.0        | 0.2        |
    +--------------+------------+------------+
    The first column lists the sample IDs. The second and third columns list the values of each dimension of the vector.
  • Use a two-dimensional vector that is represented by two columns
    +--------------+------------+
    | mid(bigint)  | f1(string) |
    +--------------+------------+
    | 0            | 0.0,0.3    |
    | 1            | 0.0,1.0    |
    | 2            | 0.0,0.1    |
    | 3            | 1.0,0.0    |
    | 4            | 0.0,0.2    |
    | 5            | 1.0,0.2    |
    +--------------+------------+
    The first column lists the sample IDs. The second column lists the values of each dimension of the vector. Separate the values of different dimensions with commas (,).
Note: You can use the Add ID Column component under Data Preprocessing to add an ID column to the samples.
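
The three layouts above are related, which the following sketch illustrates. It starts from the rows of the multiple-column example, joins the dimensions into the two-column layout, and derives a neighbor table for a chosen radius. This is an illustration under assumptions and not the component's preprocessing: Euclidean distance is assumed, eps=0.25 is an arbitrary value, and the function names are hypothetical. The resulting neighbor table is computed from the vector example and is not meant to reproduce the neighbor table example above, which shows different data.

```python
import math

# Rows of the multiple-column example: (mid, f1, f2).
rows = [(0, 0.0, 0.3), (1, 0.0, 1.0), (2, 0.0, 0.1),
        (3, 1.0, 0.0), (4, 0.0, 0.2), (5, 1.0, 0.2)]


def to_two_columns(rows):
    """Join the dimension values of each row into one comma-separated string."""
    return [(mid, ",".join(str(v) for v in dims)) for mid, *dims in rows]


def to_neighbor_table(rows, eps):
    """For each point, list the IDs closer than eps, including the point itself.

    Assumption: Euclidean distance; the page does not state the metric.
    """
    table = []
    for mid, *dims in rows:
        near = [other for other, *odims in rows
                if other == mid or math.dist(dims, odims) < eps]
        table.append((mid, ",".join(str(n) for n in near)))
    return table


two_cols = to_two_columns(rows)                   # e.g. (0, "0.0,0.3")
neighbor_rows = to_neighbor_table(rows, eps=0.25)  # e.g. (0, "0,2,4")
```

Each neighbor list contains the point itself, as the note on the neighbor table requires.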