Density-based spatial clustering of applications with noise (DBSCAN) is an unsupervised clustering algorithm that groups data points by density. DBSCAN uses a specified neighborhood radius and a point-count threshold to identify the core points in a region and their neighbors. It then builds clusters from points that are density-reachable from, or density-connected to, one another.
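
The clustering rule described above can be sketched in a few lines of Python. This is only an illustration of the general DBSCAN idea, not the implementation used by the component. The function name dbscan, the NOISE label, and the use of Euclidean distance are assumptions made for this example. The comparisons follow the parameter descriptions on this page: two points are neighbors if their distance is less than the radius, and a point is a core object if its neighborhood (which includes the point itself) contains more points than the threshold.

```python
import math
from collections import deque

NOISE = -1  # hypothetical label for points that belong to no cluster


def dbscan(points, eps, min_points):
    """Return one label per point: a cluster index, or NOISE."""
    n = len(points)
    # Neighborhood of i: every point closer than eps, including i itself.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) < eps]
        for i in range(n)
    ]
    # Core object: the neighborhood holds more than min_points points.
    is_core = [len(nb) > min_points for nb in neighbors]

    labels = [NOISE] * n
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != NOISE:
            continue
        # Grow a new cluster from an unlabeled core point.
        labels[i] = cluster
        queue = deque([i])
        while queue:
            p = queue.popleft()
            if not is_core[p]:
                continue  # border points join a cluster but do not expand it
            for q in neighbors[p]:
                if labels[q] == NOISE:
                    labels[q] = cluster
                    queue.append(q)
        cluster += 1
    return labels


# The six two-dimensional sample points from the "Input data" section.
points = [(0.0, 0.3), (0.0, 1.0), (0.0, 0.1), (1.0, 0.0), (0.0, 0.2), (1.0, 0.2)]
labels = dbscan(points, eps=0.25, min_points=2)
print(labels)  # [0, -1, 0, -1, 0, -1]
```

With these illustrative thresholds, points 0, 2, and 4 form one cluster, and the remaining points are left unclustered because none of their neighborhoods is dense enough.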

DBSCAN

You can configure the component by using one of the following methods:
  • Use the Machine Learning Platform for AI console
    Tab | Parameter | Description
    Parameters Setting | Input Data Type | The type of the input data. Valid values: Adjacency List and Vector.
    Parameters Setting | Data Vector Dimension | The vector dimension of the input data. This parameter is required only if the Input Data Type parameter is set to Vector. Note: If the Input Table Format parameter is set to Multiple Columns, the value of the Data Vector Dimension parameter must be the same as the number of data columns that you select.
    Parameters Setting | Neighborhood Point Distance Threshold | If the distance between two points is less than the threshold, the points are neighbors of each other. This parameter is required only if the Input Data Type parameter is set to Vector.
    Parameters Setting | Core Object Density Threshold | If the number of points in the neighborhood of a point is greater than the threshold specified by this parameter, the point is a core object.
    Parameters Setting | Input Table Format | The format of the input table. This parameter is required only if the Input Data Type parameter is set to Vector. Valid values: Multiple Columns (multiple columns are used to represent a vector) and Two Columns (a single column is used to represent a vector, with the dimensions separated by commas (,)).
    Fields Information | Data Columns (First Select ID Column) | This parameter is required only if the Input Table Format parameter is set to Multiple Columns.
    Tuning | Number of Servers | The number of servers.
    Tuning | Workers (>1) | The number of workers.
    Tuning | Number of CPUs per Server | The number of CPUs for a server.
    Tuning | Number of CPUs per Worker | The number of CPUs for a worker.
    Tuning | Memory Size per Worker | The memory size of each worker. Unit: MB.
    Tuning | Memory Size per Server | The memory size of each server. Unit: MB.
  • Use commands
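    DBSCAN accepts either a neighbor table or vectors as the input. A vector can be stored in multiple columns (one column per dimension) or in two columns (an ID column and a single column that holds the comma-separated vector). Sample commands: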
    • Use a neighbor table as an input
      pai -name ps_dbscan
      -DinputTable=hxdb_neighbor_data_order
      -DinputType="1"
      -DoutputTable="hxtmp2"
      -DminPoints="4"
      -DserverNum="1"
      -DserverCpu="300"
      -DserverMemory="3000"
      -DworkerNum="2"
      -DworkerCpu="800"
      -DworkerMemory="2000"
    • Use a vector represented by multiple columns as an input
      pai -name ps_dbscan
      -DinputTable=hxdb_multicols_data
      -DinputType="0"
      -DoutputTable="hxtmp"
      -DdataType="DenseMultiCols"
      -DpointDim="12"
      -Deps="4"
      -DminPoints="20"
      -DselectedColIds="all"
      -DserverNum="2"
      -DserverCpu="300"
      -DserverMemory="3000"
      -DworkerNum="10"
      -DworkerCpu="800"
      -DworkerMemory="2000"
    • Use a vector represented by two columns as an input
      pai -name ps_dbscan
      -DinputTable="hxdb_sample_60w"
      -DinputType="0"
      -DoutputTable="hxtmp1"
      -DdataType="Dense2Cols"
      -DpointDim="2"
      -Deps="0.01"
      -DminPoints="10"
      -DselectedColIds="all"
      -DserverNum="2"
      -DserverCpu="300"
      -DserverMemory="3000"
      -DworkerNum="10"
      -DworkerCpu="800"
      -DworkerMemory="2000"
    Parameter | Required | Description | Default value
    inputTable | Yes | The name of the input table. | N/A
    outputTable | Yes | The name of the output table. | N/A
    inputType | No | The type of the input data. Valid values: 0 (a vector is used as an input) and 1 (a neighbor table is used as an input). | 0
    pointDim | No | The vector dimension of the input data. This parameter is required only if the inputType parameter is set to 0. Note: If the dataType parameter is set to DenseMultiCols, the value of the pointDim parameter must be the same as the number of columns specified by the selectedColIds parameter. | 10
    eps | No | The threshold of the distance between two neighbors. If the distance between two points is less than the threshold, the points are neighbors of each other. This parameter is required only if the inputType parameter is set to 0. | 1.0
    minPoints | No | The density threshold for a core object. If the number of points in the neighborhood of a point is greater than the threshold specified by this parameter, the point is a core object. | 10
    dataType | No | The format of the input table. This parameter is required only if the inputType parameter is set to 0. Valid values: DenseMultiCols (multiple columns are used to represent a vector) and Dense2Cols (a single column is used to represent a vector, with the dimensions separated by commas (,)). | Dense2Cols
    selectedColIds | No | The columns where the data is located. This parameter is required only if the dataType parameter is set to DenseMultiCols. You can set the parameter to all or to a value in the format of 0,1,3. Note: The ID column is the first column. | all
    serverNum | Yes | The number of servers. | 5
    workerNum | Yes | The number of workers. | 30
    serverCpu | Yes | The number of CPUs for a server. | 8
    workerCpu | Yes | The number of CPUs for a worker. | 8
    workerMemory | Yes | The memory size of each worker. Unit: MB. | 10000
    serverMemory | Yes | The memory size of each server. Unit: MB. | 10000
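
If you generate these commands from a script, a small helper can check the parameters before you submit a job. The following Python sketch is a hypothetical convenience, not part of the platform: the names build_dbscan_command, REQUIRED, and VALID_VALUES are invented for this example. It only assembles text in the same layout as the sample commands, using the required flags and valid values listed in the parameter table.

```python
# Parameters marked "Yes" in the Required column of the parameter table.
REQUIRED = (
    "inputTable", "outputTable", "serverNum", "workerNum",
    "serverCpu", "workerCpu", "workerMemory", "serverMemory",
)

# Valid values taken from the parameter table.
VALID_VALUES = {
    "inputType": {"0", "1"},
    "dataType": {"DenseMultiCols", "Dense2Cols"},
}


def build_dbscan_command(params):
    """Render a ps_dbscan command as text, one -D option per line."""
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    for name, allowed in VALID_VALUES.items():
        if name in params and str(params[name]) not in allowed:
            raise ValueError(f"invalid value for {name}: {params[name]!r}")
    lines = ["pai -name ps_dbscan"]
    # Quote every value, as the sample commands do for most options.
    lines += [f'-D{name}="{value}"' for name, value in params.items()]
    return "\n".join(lines)


# Same settings as the neighbor-table sample command.
command = build_dbscan_command({
    "inputTable": "hxdb_neighbor_data_order",
    "inputType": 1,
    "outputTable": "hxtmp2",
    "minPoints": 4,
    "serverNum": 1,
    "serverCpu": 300,
    "serverMemory": 3000,
    "workerNum": 2,
    "workerCpu": 800,
    "workerMemory": 2000,
})
print(command)
```

The helper does not run anything. It returns the command text so that you can review it or pass it to whatever client you use to submit PAI commands.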

Input data

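DBSCAN accepts either a neighbor table or vectors as the input. A vector can be stored in multiple columns (one column per dimension) or in two columns (an ID column and a single column that holds the comma-separated vector). Examples: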
  • Use a neighbor table
    +-------------+------------+
    | mid(bigint) | f1(string) |
    +-------------+------------+
    | 0           | 2,3,0      |
    | 1           | 1,2,3,4    |
    | 2           | 2,1,5      |
    | 3           | 1,3        |
    | 4           | 1,4        |
    | 5           | 2,5,1,0    |
    +-------------+------------+
    Note: The neighbor list of each point must include the point itself. For example, the neighbors of point 0 must include point 0.
  • Use a two-dimensional vector that is represented by multiple columns
    +--------------+------------+------------+
    | mid(bigint)  | f1(double) | f2(double) |
    +--------------+------------+------------+
    | 0            | 0.0        | 0.3        |
    | 1            | 0.0        | 1.0        |
    | 2            | 0.0        | 0.1        |
    | 3            | 1.0        | 0.0        |
    | 4            | 0.0        | 0.2        |
    | 5            | 1.0        | 0.2        |
    +--------------+------------+------------+
    The first column lists the sample IDs. The second and third columns list the values of each dimension of the vector.
  • Use a two-dimensional vector that is represented by two columns
    +--------------+------------+
    | mid(bigint)  | f1(string) |
    +--------------+------------+
    | 0            | 0.0,0.3    |
    | 1            | 0.0,1.0    |
    | 2            | 0.0,0.1    |
    | 3            | 1.0,0.0    |
    | 4            | 0.0,0.2    |
    | 5            | 1.0,0.2    |
    +--------------+------------+
    The first column lists the sample IDs. The second column lists the values of each dimension of the vector. Separate the values of different dimensions with commas (,).
Note: You can use the Add ID Column component under Data Preprocessing to add an ID column to the samples.
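
The three input layouts above carry the same kind of information, and it can help to see how one maps to another. The following Python sketch converts multi-column rows to the two-column layout and derives a neighbor table from them. The function names to_two_columns and to_neighbor_table are hypothetical, and the sketch assumes Euclidean distance and the "less than the threshold" rule from the parameter descriptions. It is meant for small local experiments, not as a description of how the component processes tables.

```python
import math


def to_two_columns(rows):
    """Convert (id, v1, v2, ...) rows to (id, "v1,v2,...") rows."""
    return [(rid, ",".join(str(v) for v in values)) for rid, *values in rows]


def to_neighbor_table(rows, eps):
    """Build (id, "n1,n2,...") rows from (id, v1, v2, ...) rows.

    Every point is listed in its own neighbor list, because its distance
    to itself is 0, which is below any positive eps.
    """
    table = []
    for rid, *vec in rows:
        near = [oid for oid, *other in rows if math.dist(vec, other) < eps]
        table.append((rid, ",".join(str(n) for n in near)))
    return table


# The multi-column sample table: ID, f1, f2.
rows = [
    (0, 0.0, 0.3),
    (1, 0.0, 1.0),
    (2, 0.0, 0.1),
    (3, 1.0, 0.0),
    (4, 0.0, 0.2),
    (5, 1.0, 0.2),
]

two_cols = to_two_columns(rows)
neighbor_rows = to_neighbor_table(rows, eps=0.25)
print(two_cols[0])       # (0, '0.0,0.3')
print(neighbor_rows[0])  # (0, '0,2,4')
```

The neighbor table produced here depends on the chosen eps value of 0.25, so it differs from the neighbor table shown in the first example, which is an independent sample.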