Platform For AI: Label Propagation Clustering

Last Updated: Mar 20, 2024

A label propagation algorithm (LPA) is a semi-supervised machine learning algorithm in which the label (community) of a vertex depends on the labels of its neighboring vertices. The degree of dependence is determined by the similarity between vertices. Labels are updated through iterative propagation until they become stable. The Label Propagation Clustering component outputs the group of each vertex after all vertices in the graph have converged.

Algorithm description

  • Graph clustering divides a graph into subgraphs based on the topology of the graph, so that the vertices within a subgraph have more links to each other than to vertices in other subgraphs.

  • This algorithm initializes each vertex with a unique label, then iterates through the vertices and assigns each vertex the label that appears most frequently among its neighboring vertices. The algorithm stops when every vertex already has the label that appears most frequently among its neighboring vertices, or when the maximum number of iterations is reached.
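
The steps above can be illustrated with a minimal Python sketch. It is not the implementation used by the component: it ignores vertex and edge weights, visits vertices in sorted order, and breaks ties by keeping the current label or otherwise taking the largest candidate label. The function name `label_propagation` and the tie-breaking rule are assumptions made only for this illustration.

```python
from collections import Counter


def label_propagation(edges, max_iter=30):
    """Unweighted label propagation sketch (illustrative only)."""
    # Build an undirected adjacency list from the edge pairs.
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    # Every vertex starts with its own unique label.
    labels = {v: v for v in neighbors}

    for _ in range(max_iter):
        changed = False
        # Fixed visiting order so that the run is reproducible.
        for v in sorted(neighbors):
            counts = Counter(labels[n] for n in neighbors[v])
            best = max(counts.values())
            candidates = [label for label, c in counts.items() if c == best]
            # Assumed tie-break: keep the current label if it is among the
            # most frequent ones, otherwise take the largest candidate.
            new_label = labels[v] if labels[v] in candidates else max(candidates)
            if new_label != labels[v]:
                labels[v] = new_label
                changed = True
        # Stop once a full pass leaves every label unchanged.
        if not changed:
            break
    return labels


# Two triangles (1-2-3 and 4-5-6) joined by the single edge 3-4.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
print(label_propagation(edges))
# prints {1: 3, 2: 3, 3: 3, 4: 6, 5: 6, 6: 6}
```

Each triangle ends up with a single shared label. A different visiting order or tie-breaking rule can produce different label values, or even a different partition, which is a known property of label propagation.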

Configure the component

Method 1: Configure the component on the pipeline page

You can add the Label Propagation Clustering component on the pipeline page of Machine Learning Designer in the Platform for AI (PAI) console. The following table describes the parameters.

| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Vertex Table: Vertex Column | The vertex column in the vertex table. |
| Fields Setting | Vertex Table: Weight Column | The vertex weight column in the vertex table. |
| Fields Setting | Edge Table: Source Vertex Column | The start vertex column in the edge table. |
| Fields Setting | Edge Table: Target Vertex Column | The end vertex column in the edge table. |
| Fields Setting | Edge Table: Weight Column | The edge weight column in the edge table. |
| Parameters Setting | Maximum Iterations | The maximum number of iterations. Default value: 30. |
| Tuning | Workers | The number of workers that run the job in parallel. The degree of parallelism and the framework communication costs both increase with the value of this parameter. |
| Tuning | Memory Size per Worker (MB) | The maximum amount of memory that a single worker can use. Unit: MB. Default value: 4096. If the memory used exceeds this value, the OutOfMemory error is reported. |

Method 2: Configure the component by using PAI commands

You can configure the Label Propagation Clustering component by using PAI commands. You can use the SQL Script component to run PAI commands. For more information, see Scenario 4: Execute PAI commands within the SQL script component in the "SQL Script" topic.

PAI -name LabelPropagationClustering
    -project algo_public
    -DinputEdgeTableName=LabelPropagationClustering_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DinputVertexTableName=LabelPropagationClustering_func_test_node
    -DvertexCol=node
    -DoutputTableName=LabelPropagationClustering_func_test_result
    -DhasEdgeWeight=true
    -DedgeWeightCol=edge_weight
    -DhasVertexWeight=true
    -DvertexWeightCol=node_weight
    -DrandSelect=true
    -DmaxIter=100;

| Parameter | Required | Default value | Description |
| --- | --- | --- | --- |
| inputEdgeTableName | Yes | No default value | The name of the input edge table. |
| inputEdgeTablePartitions | No | Full table | The partitions in the input edge table. |
| fromVertexCol | Yes | No default value | The start vertex column in the input edge table. |
| toVertexCol | Yes | No default value | The end vertex column in the input edge table. |
| inputVertexTableName | Yes | No default value | The name of the input vertex table. |
| inputVertexTablePartitions | No | Full table | The partitions in the input vertex table. |
| vertexCol | Yes | No default value | The vertex column in the input vertex table. |
| outputTableName | Yes | No default value | The name of the output table. |
| outputTablePartitions | No | No default value | The partitions in the output table. |
| lifecycle | No | No default value | The lifecycle of the output table. |
| workerNum | No | No default value | The number of workers that run the job in parallel. The degree of parallelism and the framework communication costs both increase with the value of this parameter. |
| workerMem | No | 4096 | The maximum amount of memory that a single worker can use. Unit: MB. If the memory used exceeds this value, the OutOfMemory error is reported. |
| splitSize | No | 64 | The data split size. Unit: MB. |
| hasEdgeWeight | No | false | Specifies whether the edges in the input edge table have weights. |
| edgeWeightCol | No | No default value | The edge weight column in the input edge table. |
| hasVertexWeight | No | false | Specifies whether the vertices in the input vertex table have weights. |
| vertexWeightCol | No | No default value | The vertex weight column in the input vertex table. |
| randSelect | No | false | Specifies whether to randomly select the maximum label value. |
| maxIter | No | 30 | The maximum number of iterations. |
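
When the command is generated by a script, it can help to check the required parameters before submitting it. The following Python sketch assembles the command text from a dictionary and rejects a dictionary that lacks any parameter marked as required in the table. The helper `build_pai_command` is hypothetical and is not part of PAI. It only produces a string that you would paste into the SQL Script component.

```python
# Parameters marked "Yes" in the Required column of the table.
REQUIRED = (
    "inputEdgeTableName",
    "fromVertexCol",
    "toVertexCol",
    "inputVertexTableName",
    "vertexCol",
    "outputTableName",
)


def build_pai_command(params, project="algo_public"):
    """Hypothetical helper: render a LabelPropagationClustering command."""
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    lines = ["PAI -name LabelPropagationClustering", f"    -project {project}"]
    # Each entry becomes one -Dname=value line, in insertion order.
    lines += [f"    -D{name}={value}" for name, value in params.items()]
    return "\n".join(lines) + ";"


print(build_pai_command({
    "inputEdgeTableName": "LabelPropagationClustering_func_test_edge",
    "fromVertexCol": "flow_out_id",
    "toVertexCol": "flow_in_id",
    "inputVertexTableName": "LabelPropagationClustering_func_test_node",
    "vertexCol": "node",
    "outputTableName": "LabelPropagationClustering_func_test_result",
    "maxIter": 100,
}))
```

The sketch does not validate parameter values or optional parameters. It only guards against omitting a required one.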

Example

  1. Add the SQL Script component to the canvas as a node and execute the following SQL statements to generate training data.

    drop table if exists LabelPropagationClustering_func_test_edge;
    create table LabelPropagationClustering_func_test_edge as
    select * from
    (
        select '1' as flow_out_id,'2' as flow_in_id,0.7 as edge_weight
        union all
        select '1' as flow_out_id,'3' as flow_in_id,0.7 as edge_weight
        union all
        select '1' as flow_out_id,'4' as flow_in_id,0.6 as edge_weight
        union all
        select '2' as flow_out_id,'3' as flow_in_id,0.7 as edge_weight
        union all
        select '2' as flow_out_id,'4' as flow_in_id,0.6 as edge_weight
        union all
        select '3' as flow_out_id,'4' as flow_in_id,0.6 as edge_weight
        union all
        select '4' as flow_out_id,'6' as flow_in_id,0.3 as edge_weight
        union all
        select '5' as flow_out_id,'6' as flow_in_id,0.6 as edge_weight
        union all
        select '5' as flow_out_id,'7' as flow_in_id,0.7 as edge_weight
        union all
        select '5' as flow_out_id,'8' as flow_in_id,0.7 as edge_weight
        union all
        select '6' as flow_out_id,'7' as flow_in_id,0.6 as edge_weight
        union all
        select '6' as flow_out_id,'8' as flow_in_id,0.6 as edge_weight
        union all
        select '7' as flow_out_id,'8' as flow_in_id,0.7 as edge_weight
    )tmp
    ;
    drop table if exists LabelPropagationClustering_func_test_node;
    create table LabelPropagationClustering_func_test_node as
    select * from
    (
        select '1' as node,0.7 as node_weight
        union all
        select '2' as node,0.7 as node_weight
        union all
        select '3' as node,0.7 as node_weight
        union all
        select '4' as node,0.5 as node_weight
        union all
        select '5' as node,0.7 as node_weight
        union all
        select '6' as node,0.5 as node_weight
        union all
        select '7' as node,0.7 as node_weight
        union all
        select '8' as node,0.7 as node_weight
    )tmp;

    Figure: Data structure

  2. Add another SQL Script component to the canvas as a node and run the following PAI commands to train the model.

    drop table if exists ${o1};
    PAI -name LabelPropagationClustering
        -project algo_public
        -DinputEdgeTableName=LabelPropagationClustering_func_test_edge
        -DfromVertexCol=flow_out_id
        -DtoVertexCol=flow_in_id
        -DinputVertexTableName=LabelPropagationClustering_func_test_node
        -DvertexCol=node
        -DoutputTableName=${o1}
        -DhasEdgeWeight=true
        -DedgeWeightCol=edge_weight
        -DhasVertexWeight=true
        -DvertexWeightCol=node_weight
        -DrandSelect=true
        -DmaxIter=100;
  3. Right-click the SQL Script component and choose View Data > SQL Script Output to view the training results.

    | node | group_id |
    | ---- | -------- |
    | 1    | 3        |
    | 3    | 3        |
    | 5    | 7        |
    | 7    | 7        |
    | 2    | 3        |
    | 4    | 3        |
    | 6    | 7        |
    | 8    | 7        |
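
To read the result as communities, the rows can be grouped by group_id and compared with the edge list from step 1. The following Python sketch copies the output rows and the edges from this example into plain Python structures (the variable names are chosen for illustration) and counts how many edges fall inside a community and how many cross between communities.

```python
from collections import defaultdict

# Output rows from step 3: node -> group_id.
result = {"1": "3", "3": "3", "5": "7", "7": "7",
          "2": "3", "4": "3", "6": "7", "8": "7"}

# Edges from step 1: (flow_out_id, flow_in_id).
edges = [("1", "2"), ("1", "3"), ("1", "4"), ("2", "3"), ("2", "4"),
         ("3", "4"), ("4", "6"), ("5", "6"), ("5", "7"), ("5", "8"),
         ("6", "7"), ("6", "8"), ("7", "8")]

# Group the vertices by their assigned label.
members = defaultdict(list)
for node, group_id in result.items():
    members[group_id].append(node)
communities = {g: sorted(nodes) for g, nodes in members.items()}

# An edge is intra-community when both endpoints share a label.
intra = sum(1 for u, v in edges if result[u] == result[v])
inter = len(edges) - intra

print(communities)   # {'3': ['1', '2', '3', '4'], '7': ['5', '6', '7', '8']}
print(intra, inter)  # 12 1
```

Vertices 1 to 4 form one community and vertices 5 to 8 form the other. Only the edge between vertices 4 and 6 connects the two communities, which matches the goal of graph clustering described in the algorithm description.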