DBSCAN

Limits

This DBSCAN component can be used only in Machine Learning Studio.
In the input table of the DBSCAN component, the first column must be the sample ID column. It must be named mid and contain integer values that start from 0. The second and subsequent columns are the vector dimension columns and must be named f1, f2, and so on.
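
The naming rules above can be checked before a table is passed to the component. The following Python sketch is only an illustration: the function names check_columns and ids_start_at_zero are hypothetical and are not part of PAI. Whether the IDs must also be consecutive is not stated here, so the sketch only checks that the smallest ID is 0.

```python
import re

# Vector columns are expected to be named f1, f2, f3, ...
VECTOR_COLUMN = re.compile(r"f[1-9][0-9]*")

def check_columns(column_names):
    """Return a list of naming problems; an empty list means the names look valid."""
    problems = []
    if not column_names or column_names[0] != "mid":
        problems.append("the first column must be named 'mid'")
    if len(column_names) < 2:
        problems.append("at least one vector column is required")
    for name in column_names[1:]:
        if VECTOR_COLUMN.fullmatch(name) is None:
            problems.append(f"unexpected vector column name: {name}")
    return problems

def ids_start_at_zero(sample_ids):
    """Return True if every ID is an integer and the smallest ID is 0."""
    ids = list(sample_ids)
    return bool(ids) and all(isinstance(i, int) for i in ids) and min(ids) == 0

print(check_columns(["mid", "f1", "f2"]))  # prints []
print(ids_start_at_zero([0, 1, 2]))        # prints True
```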

You can configure the component by using one of the following methods:

Machine Learning Platform for AI (PAI) console


Tab	Parameter	Description
Parameters Setting	Input Data Type	The type of the input data. Valid values: Adjacency List and Vector.
	Data Vector Dimension	The vector dimension of the input data. This parameter is required only if the Input Data Type parameter is set to Vector. Note If the Input Table Format parameter is set to Multiple Columns, the value of the Data Vector Dimension parameter must match the number of vector columns selected in the Data Columns (First Select ID Column) parameter.
	Neighborhood Point Distance Threshold	If the distance between two points is less than the threshold, the points are considered to be neighbors. This parameter is required only if the Input Data Type parameter is set to Vector.
	Core Object Density Threshold	If the number of other points in the neighborhood of a point is greater than the threshold specified by this parameter, the point is a core object.
	Input Table Format	The format of the input table. This parameter is required only if the Input Data Type parameter is set to Vector. Valid values: Multiple Columns: Each dimension of the vector is stored in its own column. Two Columns: The table contains the ID column and a single column that stores the whole vector. Separate the dimensions of the vector with commas (,).
Fields Information	Data Columns (First Select ID Column)	This parameter is required only if the Input Table Format parameter is set to Multiple Columns.
Tuning	Servers	The number of servers.
	Workers (>1)	The number of workers.
	CPUs per Server	The number of CPUs for each server.
	CPUs per Worker	The number of CPUs for each worker.
	Memory Size per Worker	The memory size of each worker. Unit: MB.
	Memory Size per Server	The memory size of each server. Unit: MB.

PAI commands

DBSCAN accepts either a neighbor table or vectors as input. A vector can be stored across multiple columns (one column per dimension) or in a single string column next to the ID column. Sample commands:

Use a neighbor table as an input

pai -name ps_dbscan
-DinputTable=hxdb_neighbor_data_order
-DinputType="1"
-DoutputTable="hxtmp2"
-DminPoints="4"
-DserverNum="1"
-DserverCpu="300"
-DserverMemory="3000"
-DworkerNum="2"
-DworkerCpu="800"
-DworkerMemory="2000"

Use a vector represented by multiple columns as an input

pai -name ps_dbscan
-DinputTable=hxdb_multicols_data
-DinputType="0"
-DoutputTable="hxtmp"
-DdataType="DenseMultiCols"
-DpointDim="12"
-Deps="4"
-DminPoints="20"
-DselectedColIds="all"
-DserverNum="2"
-DserverCpu="300"
-DserverMemory="3000"
-DworkerNum="10"
-DworkerCpu="800"
-DworkerMemory="2000"

Use a vector represented by two columns as an input

pai -name ps_dbscan
-DinputTable="hxdb_sample_60w"
-DinputType="0"
-DoutputTable="hxtmp1"
-DdataType="Dense2Cols"
-DpointDim="2"
-Deps="0.01"
-DminPoints="10"
-DselectedColIds="all"
-DserverNum="2"
-DserverCpu="300"
-DserverMemory="3000"
-DworkerNum="10"
-DworkerCpu="800"
-DworkerMemory="2000"


Parameter	Required	Description	Default value
inputTable	Yes	The name of the input table.	N/A
outputTable	Yes	The name of the output table.	N/A
inputType	No	The type of the input data. Valid values: 0: A vector is used as an input. 1: A neighbor table is used as an input.	0
pointDim	No	The vector dimension of the input data. This parameter is required only if the inputType parameter is set to 0. Note If the dataType parameter is set to DenseMultiCols, the value of the pointDim parameter must match the number of vector columns selected by the selectedColIds parameter.	10
eps	No	The threshold of the distance between two neighbors. If the distance between two points is less than the threshold, the points are considered to be neighbors. This parameter is required only if the inputType parameter is set to 0.	1.0
minPoints	No	The density threshold for a core object. If the number of other points in the neighborhood of a point is greater than the threshold specified by this parameter, the point is a core object.	10
dataType	No	The format of the input table. This parameter is required only if the inputType parameter is set to 0. Valid values: DenseMultiCols: Each dimension of the vector is stored in its own column. Dense2Cols: The table contains the ID column and a single column that stores the whole vector. Separate the dimensions of the vector with commas (,).	Dense2Cols
selectedColIds	No	The columns that contain the data. This parameter is required only if the dataType parameter is set to DenseMultiCols. You can set the parameter to all or to a comma-separated list of column IDs, such as 0,1,3. Note The ID column must be the first column.	all
serverNum	Yes	The number of servers.	5
workerNum	Yes	The number of workers.	30
serverCpu	Yes	The number of CPUs for each server.	8
workerCpu	Yes	The number of CPUs for each worker.	8
workerMemory	Yes	The memory size of each worker. Unit: MB.	10000
serverMemory	Yes	The memory size of each server. Unit: MB.	10000
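
The following Python sketch illustrates how the eps and minPoints descriptions above combine. It is a literal reading of the two descriptions, not the implementation used by ps_dbscan. The function name core_objects is hypothetical, and the use of Euclidean distance is an assumption because the distance metric is not specified here.

```python
import math

def core_objects(points, eps, min_points):
    """Return the IDs of the core objects.

    A point is a neighbor of another point if their distance is less than eps.
    A point is a core object if the number of other points in its
    neighborhood is greater than min_points.
    """
    core = []
    for mid, vec in points.items():
        others = sum(
            1
            for other, other_vec in points.items()
            # Assumption: Euclidean distance.
            if other != mid and math.dist(vec, other_vec) < eps
        )
        if others > min_points:
            core.append(mid)
    return core

# Six two-dimensional sample points keyed by sample ID.
points = {
    0: (0.0, 0.3),
    1: (0.0, 1.0),
    2: (0.0, 0.1),
    3: (1.0, 0.0),
    4: (0.0, 0.2),
    5: (1.0, 0.2),
}
print(core_objects(points, eps=0.25, min_points=1))  # prints [0, 2, 4]
```

With these values, points 0, 2, and 4 each have two other points within the distance threshold, points 3 and 5 have one each, and point 1 has none.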

Input data

DBSCAN accepts either a neighbor table or vectors as input. A vector can be stored across multiple columns (one column per dimension) or in a single string column next to the ID column. Examples:

Use a neighbor table

+-------------+------------+
| mid(bigint) | f1(string) |
+-------------+------------+
| 0           | 2,3,0      |
| 1           | 1,2,3,4    |
| 2           | 2,1,5      |
| 3           | 1,3        |
| 4           | 1,4        |
| 5           | 2,5,1,0    |
+-------------+------------+

Note The neighbor list of each point must include the point itself. For example, the neighbor list of point 0 must include 0.
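
The self-inclusion rule in the note can be verified with a short check. The following Python sketch uses a hypothetical helper named points_missing_self and the rows of the sample neighbor table above.

```python
def points_missing_self(neighbor_rows):
    """Return the IDs whose neighbor list does not contain the point itself.

    neighbor_rows is an iterable of (mid, neighbors) pairs, where neighbors
    is a comma-separated string of sample IDs.
    """
    missing = []
    for mid, neighbors in neighbor_rows:
        ids = {int(token) for token in neighbors.split(",") if token.strip()}
        if mid not in ids:
            missing.append(mid)
    return missing

rows = [
    (0, "2,3,0"),
    (1, "1,2,3,4"),
    (2, "2,1,5"),
    (3, "1,3"),
    (4, "1,4"),
    (5, "2,5,1,0"),
]
print(points_missing_self(rows))  # prints []
```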

Use a two-dimensional vector that is represented by multiple columns

+--------------+------------+------------+
| mid(bigint)  | f1(double) | f2(double) |
+--------------+------------+------------+
| 0            | 0.0        | 0.3        |
| 1            | 0.0        | 1.0        |
| 2            | 0.0        | 0.1        |
| 3            | 1.0        | 0.0        |
| 4            | 0.0        | 0.2        |
| 5            | 1.0        | 0.2        |
+--------------+------------+------------+

The first column lists the sample IDs. The second and third columns list the values of each dimension of the vector.

Use a two-dimensional vector that is represented by two columns

+--------------+------------+
| mid(bigint)  | f1(string) |
+--------------+------------+
| 0            | 0.0,0.3    |
| 1            | 0.0,1.0    |
| 2            | 0.0,0.1    |
| 3            | 1.0,0.0    |
| 4            | 0.0,0.2    |
| 5            | 1.0,0.2    |
+--------------+------------+

The first column lists the sample IDs. The second column lists the values of each dimension of the vector. Separate the values of different dimensions with commas (,).
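
The two vector formats carry the same data. As an illustration, the following Python sketch converts rows in the multiple-column layout into the two-column layout by joining the dimension values with commas. The helper name to_two_columns is hypothetical.

```python
def to_two_columns(rows):
    """Convert (mid, f1, f2, ...) rows into (mid, "f1,f2,...") rows."""
    return [(mid, ",".join(str(value) for value in values)) for mid, *values in rows]

multi_column_rows = [
    (0, 0.0, 0.3),
    (1, 0.0, 1.0),
    (2, 0.0, 0.1),
    (3, 1.0, 0.0),
    (4, 0.0, 0.2),
    (5, 1.0, 0.2),
]
for row in to_two_columns(multi_column_rows):
    print(row)
# The first line printed is (0, '0.0,0.3')
```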

Note You can use the Add ID Column component under Data Preprocessing to add an ID column to the samples.