
Graph analysis

Last Updated: Aug 17, 2018



The network analysis column provides analysis algorithms that are based on the graph data structure. The following figure shows an example of an analysis workflow built with the network analysis components of the platform.

[Figure: example analysis workflow built with the network analysis components]

The algorithm components in the network analysis column share the following running parameters:

  • Process count: The workerNum parameter specifies the number of worker nodes that execute the job concurrently. A larger value increases the concurrency level but also increases the framework communication cost.
  • Work memory: The workerMem parameter specifies the maximum memory size that a single worker can use. The default value is 4096 MB. An OutOfMemory exception is thrown if the memory usage of a single worker exceeds this maximum.

k-Core

Function overview

The K-core of a graph is the subgraph that remains after all nodes whose degree is less than or equal to K are removed. If a node belongs to the K-core but is removed from the (K+1)-core, the coreness of this node is K. Consequently, a node whose degree is 1 always has a coreness of 0. The coreness of the graph is the maximum coreness of its nodes.
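The removal rule above can be sketched in a few lines of Python. This is only an illustration and not the platform's implementation: the function name `k_core` is hypothetical, edges are treated as undirected, and the removal is assumed to repeat until no node with degree less than or equal to K remains.

```python
from collections import defaultdict

def k_core(edges, k):
    """Return {node: sorted neighbors} for the subgraph left after peeling."""
    adj = defaultdict(set)
    for u, v in edges:              # treat every edge as undirected
        adj[u].add(v)
        adj[v].add(u)
    adj = dict(adj)
    changed = True
    while changed:                  # assumption: repeat until nothing is removed
        changed = False
        for node in [n for n, nbrs in adj.items() if len(nbrs) <= k]:
            for nbr in adj.pop(node):
                adj[nbr].discard(node)
            changed = True
    return {n: sorted(nbrs) for n, nbrs in adj.items()}

# Same edges as the test data in the Example section of this algorithm.
edges = [("1", "2"), ("1", "3"), ("1", "4"), ("2", "3"), ("2", "4"),
         ("3", "4"), ("3", "5"), ("3", "6"), ("5", "6")]
core = k_core(edges, 2)
for node, nbrs in sorted(core.items()):
    print(node, nbrs)
```

With K set to 2, nodes 5 and 6 are peeled off and nodes 1 to 4 remain, each linked to the other three.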

Parameter settings

K: Coreness value, required, default value is 3.

PAI command

    PAI -name KCore
        -project algo_public
        -DinputEdgeTableName=KCore_func_test_edge
        -DfromVertexCol=flow_out_id
        -DtoVertexCol=flow_in_id
        -DoutputTableName=KCore_func_test_result
        -Dk=2;

Algorithm parameters

| Parameter key | Description | Required/Optional | Default value |
|---|---|---|---|
| inputEdgeTableName | Name of the input edge table | Required | NA |
| inputEdgeTablePartitions | Partitions in the input edge table | Optional | Entire table |
| fromVertexCol | Start column in the edge table | Required | NA |
| toVertexCol | End column in the edge table | Required | NA |
| outputTableName | Name of the output table | Required | NA |
| outputTablePartitions | Partitions in the output table | Optional | NA |
| lifecycle | Life cycle of the output table | Optional | NA |
| workerNum | Worker count | Optional | Not set |
| workerMem | Worker memory size | Optional | 4096 |
| splitSize | Data split size | Optional | 64 |
| k | Coreness value | Required | 3 |

Example

Test data

SQL statement for data generation:

    drop table if exists KCore_func_test_edge;
    create table KCore_func_test_edge as
    select * from
    (
      select '1' as flow_out_id,'2' as flow_in_id from dual
      union all
      select '1' as flow_out_id,'3' as flow_in_id from dual
      union all
      select '1' as flow_out_id,'4' as flow_in_id from dual
      union all
      select '2' as flow_out_id,'3' as flow_in_id from dual
      union all
      select '2' as flow_out_id,'4' as flow_in_id from dual
      union all
      select '3' as flow_out_id,'4' as flow_in_id from dual
      union all
      select '3' as flow_out_id,'5' as flow_in_id from dual
      union all
      select '3' as flow_out_id,'6' as flow_in_id from dual
      union all
      select '5' as flow_out_id,'6' as flow_in_id from dual
    )tmp;

Structure of the graph corresponding to the data:

[Figure: graph structure of the k-Core test data]

Running result

Set K to 2.
The result is as follows:

    +-------+-------+
    | node1 | node2 |
    +-------+-------+
    | 1     | 2     |
    | 1     | 3     |
    | 1     | 4     |
    | 2     | 1     |
    | 2     | 3     |
    | 2     | 4     |
    | 3     | 1     |
    | 3     | 2     |
    | 3     | 4     |
    | 4     | 1     |
    | 4     | 2     |
    | 4     | 3     |
    +-------+-------+

Single-source shortest path

Function overview

The single-source shortest path (SSSP) component is based on the Dijkstra algorithm. Given a start node, it outputs the shortest paths from that node to all other nodes that can be reached from it.
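As a rough illustration, the following Python sketch runs Dijkstra's algorithm and also counts the number of distinct shortest paths to each node, which appears to be what the distance_cnt column in the sample output reports. The function name `sssp` is hypothetical, and treating edges as directed from the start column to the end column is an assumption (for the test data in this section, an undirected reading gives the same result).

```python
import heapq
from collections import defaultdict

def sssp(edges, start):
    """Return {node: (distance, shortest_path_count)} for reachable nodes."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))       # assumption: edges are directed
    dist, paths = {start: 0.0}, {start: 1}
    done, heap = set(), [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue                # stale heap entry
        done.add(u)
        for v, w in adj[u]:
            nd = d + w
            if v not in dist or nd < dist[v]:
                dist[v], paths[v] = nd, paths[u]
                heapq.heappush(heap, (nd, v))
            elif nd == dist[v]:
                paths[v] += paths[u]    # another path of equal length
    paths[start] = 0                # the sample output lists 0 for the start node
    return {n: (dist[n], paths[n]) for n in dist}

# Same edges as the test data in the Example section of this algorithm.
edges = [("a", "b", 1.0), ("b", "c", 2.0), ("c", "d", 1.0), ("b", "e", 2.0),
         ("e", "d", 1.0), ("c", "e", 1.0), ("f", "g", 3.0), ("a", "d", 4.0)]
result = sssp(edges, "a")
for node in sorted(result):
    print("a", node, *result[node])
```

Nodes f and g cannot be reached from a, so they do not appear in the result.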

Parameter settings

Start node ID: (Required) The ID of the start node from which the shortest paths are calculated.

PAI command

    PAI -name SSSP
        -project algo_public
        -DinputEdgeTableName=SSSP_func_test_edge
        -DfromVertexCol=flow_out_id
        -DtoVertexCol=flow_in_id
        -DoutputTableName=SSSP_func_test_result
        -DhasEdgeWeight=true
        -DedgeWeightCol=edge_weight
        -DstartVertex=a;

Algorithm parameters

| Parameter key | Description | Required/Optional | Default value |
|---|---|---|---|
| inputEdgeTableName | Name of the input edge table | Required | NA |
| inputEdgeTablePartitions | Partitions in the input edge table | Optional | Entire table |
| fromVertexCol | Start column in the input edge table | Required | NA |
| toVertexCol | End column in the input edge table | Required | NA |
| outputTableName | Name of the output table | Required | NA |
| outputTablePartitions | Partitions in the output table | Optional | NA |
| lifecycle | Life cycle of the output table | Optional | NA |
| workerNum | Worker count | Optional | Not set |
| workerMem | Worker memory size | Optional | 4096 |
| splitSize | Data split size | Optional | 64 |
| startVertex | Start node ID | Required | NA |
| hasEdgeWeight | Indicates whether the edges in the input edge table have weights | Optional | false |
| edgeWeightCol | Edge weight column in the input edge table | Optional | NA |

Example

Test data

SQL statement for data generation:

    drop table if exists SSSP_func_test_edge;
    create table SSSP_func_test_edge as
    select
      flow_out_id,flow_in_id,edge_weight
    from
    (
      select "a" as flow_out_id,"b" as flow_in_id,1.0 as edge_weight from dual
      union all
      select "b" as flow_out_id,"c" as flow_in_id,2.0 as edge_weight from dual
      union all
      select "c" as flow_out_id,"d" as flow_in_id,1.0 as edge_weight from dual
      union all
      select "b" as flow_out_id,"e" as flow_in_id,2.0 as edge_weight from dual
      union all
      select "e" as flow_out_id,"d" as flow_in_id,1.0 as edge_weight from dual
      union all
      select "c" as flow_out_id,"e" as flow_in_id,1.0 as edge_weight from dual
      union all
      select "f" as flow_out_id,"g" as flow_in_id,3.0 as edge_weight from dual
      union all
      select "a" as flow_out_id,"d" as flow_in_id,4.0 as edge_weight from dual
    ) tmp;

Structure of the graph corresponding to the data:

[Figure: graph structure of the SSSP test data]

Running result

    +------------+-----------+----------+--------------+
    | start_node | dest_node | distance | distance_cnt |
    +------------+-----------+----------+--------------+
    | a          | b         | 1.0      | 1            |
    | a          | c         | 3.0      | 1            |
    | a          | d         | 4.0      | 3            |
    | a          | a         | 0.0      | 0            |
    | a          | e         | 3.0      | 1            |
    +------------+-----------+----------+--------------+

PageRank

Function overview

PageRank is an algorithm used by Google Search to rank websites based on the structure of the links that point to a page. The underlying assumptions are as follows:

  • More important or higher-quality websites are likely to receive more links from other websites.

  • Besides the number of links to a web page, the weight of the linking web page and the number of its outgoing links also count in the calculation.

  • In the social network of a user, the edge weight is an important factor in addition to the influence of the user.

For example, a Sina Weibo user is more likely to influence followers who are family members, friends, classmates, or colleagues, and less likely to influence followers with a weak relationship, such as strangers. In social networks, the edge weight is equivalent to an index of the relationship strength between two users.

The PageRank formula with the link weight is as follows:
[Formula image: weighted PageRank formula]

  • W(i): The weight of node i.
  • C(Ai): The link weight.
  • d: The damping factor.
  • W(A): The node weight after the algorithm iteration is stable, which is the influence index of each user.
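The formula itself is only available as an image, so the following Python sketch is a guess at its shape rather than a transcription. It assumes the update W(A) = (1 - d) / N + d * Σ W(i) * C(Ai), where N is the number of nodes, d = 0.85, and the link weight C(Ai) is used as given without normalizing by the outgoing weight of node i. These assumptions were chosen because they reproduce the sample output in the Example section; the function name `weighted_pagerank` is hypothetical.

```python
def weighted_pagerank(edges, d=0.85, max_iter=30):
    """Iterate W(A) = (1 - d) / N + d * sum(W(i) * C(Ai)) over incoming links."""
    nodes = sorted({n for u, v, _ in edges for n in (u, v)})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        new = {node: (1.0 - d) / n for node in nodes}
        for u, v, w in edges:
            new[v] += d * rank[u] * w   # assumption: raw, unnormalized link weight
        rank = new
    return rank

# Same edges as the test data in the Example section of this algorithm.
edges = [("a", "b", 1.0), ("a", "c", 1.0), ("b", "c", 1.0),
         ("b", "d", 1.0), ("c", "d", 1.0)]
rank = weighted_pagerank(edges)
for node, weight in rank.items():
    print(node, round(weight, 5))
```

Because the test graph has no cycles, the values stop changing after a handful of iterations, well before the iteration limit is reached.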

Parameter settings

Maximum number of iterations: (Optional) The upper limit on the number of iterations that the algorithm runs. The default value is 30.

PAI command

    PAI -name PageRankWithWeight
        -project algo_public
        -DinputEdgeTableName=PageRankWithWeight_func_test_edge
        -DfromVertexCol=flow_out_id
        -DtoVertexCol=flow_in_id
        -DoutputTableName=PageRankWithWeight_func_test_result
        -DhasEdgeWeight=true
        -DedgeWeightCol=weight
        -DmaxIter=100;

Algorithm parameters

| Parameter key | Description | Required/Optional | Default value |
|---|---|---|---|
| inputEdgeTableName | Name of the input edge table | Required | NA |
| inputEdgeTablePartitions | Partitions in the input edge table | Optional | Entire table |
| fromVertexCol | Start column in the input edge table | Required | NA |
| toVertexCol | End column in the input edge table | Required | NA |
| outputTableName | Name of the output table | Required | NA |
| outputTablePartitions | Partitions in the output table | Optional | NA |
| lifecycle | Life cycle of the output table | Optional | NA |
| workerNum | Worker count | Optional | Not set |
| workerMem | Worker memory size | Optional | 4096 |
| splitSize | Data split size | Optional | 64 |
| hasEdgeWeight | Indicates whether the edges in the input edge table have weights | Optional | false |
| edgeWeightCol | Edge weight column in the input edge table | Optional | NA |
| maxIter | Maximum number of iterations | Optional | 30 |

Example

Test data

SQL statement for data generation:

    drop table if exists PageRankWithWeight_func_test_edge;
    create table PageRankWithWeight_func_test_edge as
    select * from
    (
      select 'a' as flow_out_id,'b' as flow_in_id,1.0 as weight from dual
      union all
      select 'a' as flow_out_id,'c' as flow_in_id,1.0 as weight from dual
      union all
      select 'b' as flow_out_id,'c' as flow_in_id,1.0 as weight from dual
      union all
      select 'b' as flow_out_id,'d' as flow_in_id,1.0 as weight from dual
      union all
      select 'c' as flow_out_id,'d' as flow_in_id,1.0 as weight from dual
    )tmp;

Structure of the graph corresponding to the data:
[Figure: graph structure of the PageRank test data]

Running result

    +------+---------+
    | node | weight  |
    +------+---------+
    | a    | 0.0375  |
    | b    | 0.06938 |
    | c    | 0.12834 |
    | d    | 0.20556 |
    +------+---------+

Label propagation clustering

Function overview

Graph clustering divides a graph into subgraphs based on its topology so that the nodes within a subgraph are linked more densely than the nodes in different subgraphs. The label propagation algorithm (LPA) is a graph-based semi-supervised machine learning algorithm. The label (community) of a node depends on the labels of its neighboring nodes. The degree of dependence is determined by the similarity between nodes, and the labels become stable through iterative propagation updates.
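A minimal Python sketch of the idea follows. It is a simplified variant and not the platform's algorithm: node weights are ignored, nodes are updated one at a time in a fixed order, each node adopts the label with the largest total edge weight among its neighbors, and ties go to the smallest label (instead of the random choice that randSelect enables). The function name `label_propagation` is hypothetical.

```python
from collections import defaultdict

def label_propagation(edges, max_iter=30):
    """Return {node: community label} after iterative propagation."""
    adj = defaultdict(dict)
    for u, v, w in edges:               # undirected, weighted edges
        adj[u][v] = w
        adj[v][u] = w
    labels = {n: n for n in adj}        # each node starts as its own community
    for _ in range(max_iter):
        changed = False
        for node in sorted(adj):
            score = defaultdict(float)
            for nbr, w in adj[node].items():
                score[labels[nbr]] += w
            # Highest total weight wins; ties go to the smallest label.
            best = min(score, key=lambda lab: (-score[lab], lab))
            if best != labels[node]:
                labels[node], changed = best, True
        if not changed:
            break
    return labels

# Same edges as the test data in the Example section of this algorithm.
edges = [("1", "2", 0.7), ("1", "3", 0.7), ("1", "4", 0.6), ("2", "3", 0.7),
         ("2", "4", 0.6), ("3", "4", 0.6), ("4", "6", 0.3), ("5", "6", 0.6),
         ("5", "7", 0.7), ("5", "8", 0.7), ("6", "7", 0.6), ("6", "8", 0.6),
         ("7", "8", 0.7)]
labels = label_propagation(edges)
print(labels)
```

On this data the sketch separates nodes 1 to 4 from nodes 5 to 8, which is the same partition as the sample output, although the label chosen to name each group differs because it depends on the tie-breaking rule.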

Parameter settings

Maximum number of iterations: (Optional) The default value is 30.

PAI command

    PAI -name LabelPropagationClustering
        -project algo_public
        -DinputEdgeTableName=LabelPropagationClustering_func_test_edge
        -DfromVertexCol=flow_out_id
        -DtoVertexCol=flow_in_id
        -DinputVertexTableName=LabelPropagationClustering_func_test_node
        -DvertexCol=node
        -DoutputTableName=LabelPropagationClustering_func_test_result
        -DhasEdgeWeight=true
        -DedgeWeightCol=edge_weight
        -DhasVertexWeight=true
        -DvertexWeightCol=node_weight
        -DrandSelect=true
        -DmaxIter=100;

Algorithm parameters

| Parameter key | Description | Required/Optional | Default value |
|---|---|---|---|
| inputEdgeTableName | Name of the input edge table | Required | NA |
| inputEdgeTablePartitions | Partitions in the input edge table | Optional | Entire table |
| fromVertexCol | Start column in the input edge table | Required | NA |
| toVertexCol | End column in the input edge table | Required | NA |
| inputVertexTableName | Name of the input vertex table | Required | NA |
| inputVertexTablePartitions | Partitions in the input vertex table | Optional | Entire table |
| vertexCol | Vertex column in the input vertex table | Required | NA |
| outputTableName | Name of the output table | Required | NA |
| outputTablePartitions | Partitions in the output table | Optional | NA |
| lifecycle | Life cycle of the output table | Optional | NA |
| workerNum | Worker count | Optional | Not set |
| workerMem | Worker memory size | Optional | 4096 |
| splitSize | Data split size | Optional | 64 |
| hasEdgeWeight | Indicates whether the edges in the input edge table have weights | Optional | false |
| edgeWeightCol | Edge weight column in the input edge table | Optional | NA |
| hasVertexWeight | Indicates whether the vertexes in the input vertex table have weights | Optional | false |
| vertexWeightCol | Vertex weight column in the input vertex table | Optional | NA |
| randSelect | Indicates whether the maximum label is randomly selected | Optional | false |
| maxIter | Maximum number of iterations | Optional | 30 |

Example

Test data

SQL statement for data generation:

    drop table if exists LabelPropagationClustering_func_test_edge;
    create table LabelPropagationClustering_func_test_edge as
    select * from
    (
      select '1' as flow_out_id,'2' as flow_in_id,0.7 as edge_weight from dual
      union all
      select '1' as flow_out_id,'3' as flow_in_id,0.7 as edge_weight from dual
      union all
      select '1' as flow_out_id,'4' as flow_in_id,0.6 as edge_weight from dual
      union all
      select '2' as flow_out_id,'3' as flow_in_id,0.7 as edge_weight from dual
      union all
      select '2' as flow_out_id,'4' as flow_in_id,0.6 as edge_weight from dual
      union all
      select '3' as flow_out_id,'4' as flow_in_id,0.6 as edge_weight from dual
      union all
      select '4' as flow_out_id,'6' as flow_in_id,0.3 as edge_weight from dual
      union all
      select '5' as flow_out_id,'6' as flow_in_id,0.6 as edge_weight from dual
      union all
      select '5' as flow_out_id,'7' as flow_in_id,0.7 as edge_weight from dual
      union all
      select '5' as flow_out_id,'8' as flow_in_id,0.7 as edge_weight from dual
      union all
      select '6' as flow_out_id,'7' as flow_in_id,0.6 as edge_weight from dual
      union all
      select '6' as flow_out_id,'8' as flow_in_id,0.6 as edge_weight from dual
      union all
      select '7' as flow_out_id,'8' as flow_in_id,0.7 as edge_weight from dual
    )tmp;

    drop table if exists LabelPropagationClustering_func_test_node;
    create table LabelPropagationClustering_func_test_node as
    select * from
    (
      select '1' as node,0.7 as node_weight from dual
      union all
      select '2' as node,0.7 as node_weight from dual
      union all
      select '3' as node,0.7 as node_weight from dual
      union all
      select '4' as node,0.5 as node_weight from dual
      union all
      select '5' as node,0.7 as node_weight from dual
      union all
      select '6' as node,0.5 as node_weight from dual
      union all
      select '7' as node,0.7 as node_weight from dual
      union all
      select '8' as node,0.7 as node_weight from dual
    )tmp;

Structure of the graph corresponding to the data:

[Figure: graph structure of the label propagation clustering test data]

Running result

    +------+----------+
    | node | group_id |
    +------+----------+
    | 1    | 1        |
    | 2    | 1        |
    | 3    | 1        |
    | 4    | 1        |
    | 5    | 5        |
    | 6    | 5        |
    | 7    | 5        |
    | 8    | 5        |
    +------+----------+

Label propagation classification

Function overview

Label propagation classification is a semi-supervised classification algorithm that uses the label information of labeled nodes to predict that of unlabeled nodes.

During algorithm execution, the labels of each node are propagated to its neighboring nodes based on the similarity between the nodes. In each propagation step, a node updates its label based on the labels of its neighboring nodes so that it becomes more similar to them. The higher the similarity between two nodes, the greater the influence of the neighboring node and the more easily its labels are propagated. During label propagation, the labels of the labeled nodes remain unchanged and serve as the sources that are propagated to the unlabeled nodes.

After the iterations end, similar nodes have similar label probability distributions, so they can be assigned to the same category, which completes label propagation.
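The propagation step can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the platform's implementation: labels are assumed to flow along edges from the start column to the end column, each unlabeled node takes the edge-weighted average of the label distributions of its incoming neighbors, labeled nodes stay fixed, and the damping factor is left out. The function name `propagate_labels` is hypothetical.

```python
from collections import defaultdict

def propagate_labels(edges, seeds, epsilon=1e-6, max_iter=30):
    """edges: (src, dst, weight); seeds: {node: label}.
    Returns {node: {label: probability}}."""
    incoming = defaultdict(list)
    for u, v, w in edges:
        incoming[v].append((u, w))      # assumption: labels flow src -> dst
    nodes = {n for u, v, _ in edges for n in (u, v)}
    dist = {n: ({seeds[n]: 1.0} if n in seeds else {}) for n in nodes}
    for _ in range(max_iter):
        delta = 0.0
        for n in sorted(nodes - set(seeds)):    # labeled nodes stay unchanged
            acc = defaultdict(float)
            for u, w in incoming[n]:
                for lab, p in dist[u].items():
                    acc[lab] += w * p
            total = sum(acc.values())
            new = {lab: s / total for lab, s in acc.items()} if total else {}
            for lab in set(new) | set(dist[n]):
                delta = max(delta, abs(new.get(lab, 0.0) - dist[n].get(lab, 0.0)))
            dist[n] = new
        if delta < epsilon:             # converged
            break
    return dist

# Same data as the test tables in the Example section of this algorithm.
edges = [("a", "b", 0.2), ("a", "c", 0.8), ("b", "c", 1.0), ("d", "b", 1.0)]
seeds = {"a": "X", "d": "Y"}
dist = propagate_labels(edges, seeds)
for node in sorted(dist):
    print(node, {lab: round(p, 5) for lab, p in sorted(dist[node].items())})
```

On the test data this sketch yields the same distributions as the sample output: node b leans toward Y because its link from d is five times as heavy as its link from a.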

Parameter settings

  • Damping factor: The default value is 0.8.
  • Convergence factor: The default value is 0.000001.

PAI command

    PAI -name LabelPropagationClassification
        -project algo_public
        -DinputEdgeTableName=LabelPropagationClassification_func_test_edge
        -DfromVertexCol=flow_out_id
        -DtoVertexCol=flow_in_id
        -DinputVertexTableName=LabelPropagationClassification_func_test_node
        -DvertexCol=node
        -DvertexLabelCol=label
        -DoutputTableName=LabelPropagationClassification_func_test_result
        -DhasEdgeWeight=true
        -DedgeWeightCol=edge_weight
        -DhasVertexWeight=true
        -DvertexWeightCol=label_weight
        -Dalpha=0.8
        -Depsilon=0.000001;

Algorithm parameters

| Parameter key | Description | Required/Optional | Default value |
|---|---|---|---|
| inputEdgeTableName | Name of the input edge table | Required | NA |
| inputEdgeTablePartitions | Partitions in the input edge table | Optional | Entire table |
| fromVertexCol | Start column in the input edge table | Required | NA |
| toVertexCol | End column in the input edge table | Required | NA |
| inputVertexTableName | Name of the input vertex table | Required | NA |
| inputVertexTablePartitions | Partitions in the input vertex table | Optional | Entire table |
| vertexCol | Vertex column in the input vertex table | Required | NA |
| vertexLabelCol | Vertex label column in the input vertex table | Required | NA |
| outputTableName | Name of the output table | Required | NA |
| outputTablePartitions | Partitions in the output table | Optional | NA |
| lifecycle | Life cycle of the output table | Optional | NA |
| workerNum | Worker count | Optional | Not set |
| workerMem | Worker memory size | Optional | 4096 |
| splitSize | Data split size | Optional | 64 |
| hasEdgeWeight | Indicates whether the edges in the input edge table have weights | Optional | false |
| edgeWeightCol | Edge weight column in the input edge table | Optional | NA |
| hasVertexWeight | Indicates whether the vertexes in the input vertex table have weights | Optional | false |
| vertexWeightCol | Vertex weight column in the input vertex table | Optional | NA |
| alpha | Damping factor | Optional | 0.8 |
| epsilon | Convergence factor | Optional | 0.000001 |
| maxIter | Maximum number of iterations | Optional | 30 |

Example

Test data

SQL statement for data generation:

    drop table if exists LabelPropagationClassification_func_test_edge;
    create table LabelPropagationClassification_func_test_edge as
    select * from
    (
      select 'a' as flow_out_id, 'b' as flow_in_id, 0.2 as edge_weight from dual
      union all
      select 'a' as flow_out_id, 'c' as flow_in_id, 0.8 as edge_weight from dual
      union all
      select 'b' as flow_out_id, 'c' as flow_in_id, 1.0 as edge_weight from dual
      union all
      select 'd' as flow_out_id, 'b' as flow_in_id, 1.0 as edge_weight from dual
    )tmp;

    drop table if exists LabelPropagationClassification_func_test_node;
    create table LabelPropagationClassification_func_test_node as
    select * from
    (
      select 'a' as node,'X' as label, 1.0 as label_weight from dual
      union all
      select 'd' as node,'Y' as label, 1.0 as label_weight from dual
    )tmp;

Structure of the graph corresponding to the data:
[Figure: graph structure of the label propagation classification test data]

Running result

    +------+-----+---------+
    | node | tag | weight  |
    +------+-----+---------+
    | a    | X   | 1.0     |
    | b    | X   | 0.16667 |
    | b    | Y   | 0.83333 |
    | c    | X   | 0.53704 |
    | c    | Y   | 0.46296 |
    | d    | Y   | 1.0     |
    +------+-----+---------+

Modularity

Function overview

Modularity is a measure of the community structure of a network. It evaluates how tightly the nodes within each community are connected compared with the connections between communities. A value larger than 0.3 indicates an obvious community structure.
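For illustration, the Python sketch below computes the standard unweighted modularity, Q = Σ over communities of (internal edges / m) - (total degree / 2m)², where m is the number of edges. Whether the component uses exactly this definition is an assumption; it does reproduce the sample value in the Example section when the edges of the label propagation clustering test data are grouped as in that algorithm's sample output. The function name `modularity` is hypothetical.

```python
from collections import defaultdict

def modularity(edges, group):
    """edges: (u, v) pairs; group: {node: community ID}. Edge weights are ignored."""
    m = len(edges)
    internal = defaultdict(int)     # edges with both ends in the community
    degree = defaultdict(int)       # sum of node degrees per community
    for u, v in edges:
        degree[group[u]] += 1
        degree[group[v]] += 1
        if group[u] == group[v]:
            internal[group[u]] += 1
    return sum(internal[g] / m - (degree[g] / (2 * m)) ** 2 for g in degree)

# Edges of the label propagation clustering test data, grouped as in its sample output.
edges = [("1", "2"), ("1", "3"), ("1", "4"), ("2", "3"), ("2", "4"), ("3", "4"),
         ("4", "6"), ("5", "6"), ("5", "7"), ("5", "8"), ("6", "7"), ("6", "8"),
         ("7", "8")]
group = {n: "1" for n in "1234"}
group.update({n: "5" for n in "5678"})
q = modularity(edges, group)
print(round(q, 7))
```

Each community holds 6 of the 13 edges and half of the total degree, so Q = 2 × (6/13 - 1/4) = 11/26.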

PAI command

    PAI -name Modularity
        -project algo_public
        -DinputEdgeTableName=Modularity_func_test_edge
        -DfromVertexCol=flow_out_id
        -DfromGroupCol=group_out_id
        -DtoVertexCol=flow_in_id
        -DtoGroupCol=group_in_id
        -DoutputTableName=Modularity_func_test_result;

Algorithm parameters

| Parameter key | Description | Required/Optional | Default value |
|---|---|---|---|
| inputEdgeTableName | Name of the input edge table | Required | NA |
| inputEdgeTablePartitions | Partitions in the input edge table | Optional | Entire table |
| fromVertexCol | Start column in the input edge table | Required | NA |
| fromGroupCol | Group column of the start vertex in the input edge table | Required | NA |
| toVertexCol | End column in the input edge table | Required | NA |
| toGroupCol | Group column of the end vertex in the input edge table | Required | NA |
| outputTableName | Name of the output table | Required | NA |
| outputTablePartitions | Partitions in the output table | Optional | NA |
| lifecycle | Life cycle of the output table | Optional | NA |
| workerNum | Worker count | Optional | Not set |
| workerMem | Worker memory size | Optional | 4096 |
| splitSize | Data split size | Optional | 64 |

Example

Test data

Same as the data in Label propagation clustering.

Running result

    +-----------+
    | val       |
    +-----------+
    | 0.4230769 |
    +-----------+

Maximum connected subgraphs

Function overview

In an undirected graph G, vertex A is connected to vertex B if a path exists between them. G can be divided into several subgraphs such that each vertex is connected to every other vertex in the same subgraph but to no vertex in any other subgraph. These subgraphs are the maximum connected subgraphs (connected components) of G.
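One common way to find these subgraphs is a union-find structure, sketched below in Python. This is an illustration and not necessarily how the component works. Using the largest node ID in each subgraph as its group ID is an assumption taken from the sample output; the function name `connected_components` is hypothetical.

```python
def connected_components(edges):
    """Return {node: group ID}, where the group ID is the largest node ID in the group."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[min(ru, rv)] = max(ru, rv)   # the larger ID becomes the root
    return {n: find(n) for n in list(parent)}

# Same edges as the test data in the Example section of this algorithm.
edges = [("1", "2"), ("2", "3"), ("3", "4"), ("1", "4"), ("a", "b"), ("b", "c")]
groups = connected_components(edges)
for node in sorted(groups):
    print(node, groups[node])
```

Because a merge always keeps the larger root, the root of every group is its largest member, which gives group IDs 4 and c for the test data.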

Parameter settings

None

PAI command

    PAI -name MaximalConnectedComponent
        -project algo_public
        -DinputEdgeTableName=MaximalConnectedComponent_func_test_edge
        -DfromVertexCol=flow_out_id
        -DtoVertexCol=flow_in_id
        -DoutputTableName=MaximalConnectedComponent_func_test_result;

Algorithm parameters

| Parameter key | Description | Required/Optional | Default value |
|---|---|---|---|
| inputEdgeTableName | Name of the input edge table | Required | NA |
| inputEdgeTablePartitions | Partitions in the input edge table | Optional | Entire table |
| fromVertexCol | Start column in the input edge table | Required | NA |
| toVertexCol | End column in the input edge table | Required | NA |
| outputTableName | Name of the output table | Required | NA |
| outputTablePartitions | Partitions in the output table | Optional | NA |
| lifecycle | Life cycle of the output table | Optional | NA |
| workerNum | Worker count | Optional | Not set |
| workerMem | Worker memory size | Optional | 4096 |
| splitSize | Data split size | Optional | 64 |

Example

Test data

SQL statement for data generation:

    drop table if exists MaximalConnectedComponent_func_test_edge;
    create table MaximalConnectedComponent_func_test_edge as
    select * from
    (
      select '1' as flow_out_id,'2' as flow_in_id from dual
      union all
      select '2' as flow_out_id,'3' as flow_in_id from dual
      union all
      select '3' as flow_out_id,'4' as flow_in_id from dual
      union all
      select '1' as flow_out_id,'4' as flow_in_id from dual
      union all
      select 'a' as flow_out_id,'b' as flow_in_id from dual
      union all
      select 'b' as flow_out_id,'c' as flow_in_id from dual
    )tmp;

    drop table if exists MaximalConnectedComponent_func_test_result;
    create table MaximalConnectedComponent_func_test_result
    (
      node string,
      grp_id string
    );

Structure of the graph corresponding to the data:
[Figure: graph structure of the maximum connected subgraph test data]

Running result

    +------+--------+
    | node | grp_id |
    +------+--------+
    | 1    | 4      |
    | 2    | 4      |
    | 3    | 4      |
    | 4    | 4      |
    | a    | c      |
    | b    | c      |
    | c    | c      |
    +------+--------+

Node clustering coefficient

Function overview

This coefficient measures the density of the neighborhood around each node in an undirected graph G. A star network has a density of 0, and a fully meshed network has a density of 1.
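The Python sketch below shows one reading of this coefficient that matches the node_cnt, edge_cnt, and density columns of the sample output: for each node, count its neighbors and the edges among them, then divide by the number of possible neighbor pairs. The sampling controlled by maxEdgeCnt and the log_density column are not covered, and the function name `node_density` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def node_density(edges):
    """Return {node: (neighbor count, edges among neighbors, density)}."""
    adj = defaultdict(set)
    for u, v in edges:              # undirected graph
        adj[u].add(v)
        adj[v].add(u)
    result = {}
    for node, nbrs in adj.items():
        n = len(nbrs)
        links = sum(1 for a, b in combinations(sorted(nbrs), 2) if b in adj[a])
        possible = n * (n - 1) // 2
        result[node] = (n, links, links / possible if possible else 0.0)
    return result

# Same edges as the test data in the Example section of this algorithm.
edges = [("1", "2"), ("1", "3"), ("1", "4"), ("1", "5"), ("1", "6"), ("2", "3"),
         ("3", "4"), ("4", "5"), ("5", "6"), ("5", "7"), ("6", "7")]
density = node_density(edges)
for node in sorted(density):
    n, links, value = density[node]
    print(node, n, links, round(value, 5))
```

Node 1 has five neighbors with four edges among the ten possible pairs, so its density is 0.4.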

Parameter settings

maxEdgeCnt: If the node degree is greater than the value of this parameter, sampling is performed. This parameter is optional, and the default value is 500.

PAI command

    PAI -name NodeDensity
        -project algo_public
        -DinputEdgeTableName=NodeDensity_func_test_edge
        -DfromVertexCol=flow_out_id
        -DtoVertexCol=flow_in_id
        -DoutputTableName=NodeDensity_func_test_result
        -DmaxEdgeCnt=500;

Algorithm parameters

| Parameter key | Description | Required/Optional | Default value |
|---|---|---|---|
| inputEdgeTableName | Name of the input edge table | Required | NA |
| inputEdgeTablePartitions | Partitions in the input edge table | Optional | Entire table |
| fromVertexCol | Start column in the input edge table | Required | NA |
| toVertexCol | End column in the input edge table | Required | NA |
| outputTableName | Name of the output table | Required | NA |
| outputTablePartitions | Partitions in the output table | Optional | NA |
| lifecycle | Life cycle of the output table | Optional | NA |
| maxEdgeCnt | If the node degree is greater than the value of this parameter, sampling is performed. | Optional | 500 |
| workerNum | Worker count | Optional | Not set |
| workerMem | Worker memory size | Optional | 4096 |
| splitSize | Data split size | Optional | 64 |

Example

Test data

SQL statement for data generation:

    drop table if exists NodeDensity_func_test_edge;
    create table NodeDensity_func_test_edge as
    select * from
    (
      select '1' as flow_out_id, '2' as flow_in_id from dual
      union all
      select '1' as flow_out_id, '3' as flow_in_id from dual
      union all
      select '1' as flow_out_id, '4' as flow_in_id from dual
      union all
      select '1' as flow_out_id, '5' as flow_in_id from dual
      union all
      select '1' as flow_out_id, '6' as flow_in_id from dual
      union all
      select '2' as flow_out_id, '3' as flow_in_id from dual
      union all
      select '3' as flow_out_id, '4' as flow_in_id from dual
      union all
      select '4' as flow_out_id, '5' as flow_in_id from dual
      union all
      select '5' as flow_out_id, '6' as flow_in_id from dual
      union all
      select '5' as flow_out_id, '7' as flow_in_id from dual
      union all
      select '6' as flow_out_id, '7' as flow_in_id from dual
    )tmp;

    drop table if exists NodeDensity_func_test_result;
    create table NodeDensity_func_test_result
    (
      node string,
      node_cnt bigint,
      edge_cnt bigint,
      density double,
      log_density double
    );

Structure of the graph corresponding to the data:
[Figure: graph structure of the node clustering coefficient test data]

Running result

Each row lists node, node_cnt, edge_cnt, density, and log_density:

    1,5,4,0.4,1.45657
    2,2,1,1.0,1.24696
    3,3,2,0.66667,1.35204
    4,3,2,0.66667,1.35204
    5,4,3,0.5,1.41189
    6,3,2,0.66667,1.35204
    7,2,1,1.0,1.24696

Edge clustering coefficient

Function overview

This coefficient measures the density of the neighborhood around each edge in an undirected graph G.
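The sample output in the Example section is consistent with the following reading, sketched here in Python as an assumption rather than a documented definition: for each edge, count the triangles that contain it (the common neighbors of its two endpoints) and divide by the smaller of the two endpoint degrees. The function name `edge_density` is hypothetical.

```python
from collections import defaultdict

def edge_density(edges):
    """Return rows of (node1, node2, degree1, degree2, triangle count, density)."""
    adj = defaultdict(set)
    for u, v in edges:              # undirected graph
        adj[u].add(v)
        adj[v].add(u)
    rows = []
    for u, v in edges:
        triangles = len(adj[u] & adj[v])        # common neighbors close a triangle
        du, dv = len(adj[u]), len(adj[v])
        rows.append((u, v, du, dv, triangles, triangles / min(du, dv)))
    return rows

# Same edges as the test data in the Example section of this algorithm.
edges = [("1", "2"), ("1", "3"), ("1", "5"), ("1", "7"), ("2", "5"), ("2", "4"),
         ("2", "3"), ("3", "5"), ("3", "4"), ("4", "5"), ("4", "8"), ("5", "6"),
         ("5", "7"), ("5", "8"), ("7", "6"), ("6", "8")]
rows = {(u, v): rest for u, v, *rest in edge_density(edges)}
for (u, v), (du, dv, tri, dens) in rows.items():
    print(u, v, du, dv, tri, round(dens, 5))
```

The rows come out in input order and orientation, so an edge may appear as 1,3 here where the sample output lists it as 3,1.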

Parameter settings

None

PAI command

    PAI -name EdgeDensity
        -project algo_public
        -DinputEdgeTableName=EdgeDensity_func_test_edge
        -DfromVertexCol=flow_out_id
        -DtoVertexCol=flow_in_id
        -DoutputTableName=EdgeDensity_func_test_result;

Algorithm parameters

| Parameter key | Description | Required/Optional | Default value |
|---|---|---|---|
| inputEdgeTableName | Name of the input edge table | Required | NA |
| inputEdgeTablePartitions | Partitions in the input edge table | Optional | Entire table |
| fromVertexCol | Start column in the input edge table | Required | NA |
| toVertexCol | End column in the input edge table | Required | NA |
| outputTableName | Name of the output table | Required | NA |
| outputTablePartitions | Partitions in the output table | Optional | NA |
| lifecycle | Life cycle of the output table | Optional | NA |
| workerNum | Worker count | Optional | Not set |
| workerMem | Worker memory size | Optional | 4096 |
| splitSize | Data split size | Optional | 64 |

Example

Test data

SQL statement for data generation:

    drop table if exists EdgeDensity_func_test_edge;
    create table EdgeDensity_func_test_edge as
    select * from
    (
      select '1' as flow_out_id,'2' as flow_in_id from dual
      union all
      select '1' as flow_out_id,'3' as flow_in_id from dual
      union all
      select '1' as flow_out_id,'5' as flow_in_id from dual
      union all
      select '1' as flow_out_id,'7' as flow_in_id from dual
      union all
      select '2' as flow_out_id,'5' as flow_in_id from dual
      union all
      select '2' as flow_out_id,'4' as flow_in_id from dual
      union all
      select '2' as flow_out_id,'3' as flow_in_id from dual
      union all
      select '3' as flow_out_id,'5' as flow_in_id from dual
      union all
      select '3' as flow_out_id,'4' as flow_in_id from dual
      union all
      select '4' as flow_out_id,'5' as flow_in_id from dual
      union all
      select '4' as flow_out_id,'8' as flow_in_id from dual
      union all
      select '5' as flow_out_id,'6' as flow_in_id from dual
      union all
      select '5' as flow_out_id,'7' as flow_in_id from dual
      union all
      select '5' as flow_out_id,'8' as flow_in_id from dual
      union all
      select '7' as flow_out_id,'6' as flow_in_id from dual
      union all
      select '6' as flow_out_id,'8' as flow_in_id from dual
    )tmp;

    drop table if exists EdgeDensity_func_test_result;
    create table EdgeDensity_func_test_result
    (
      node1 string,
      node2 string,
      node1_edge_cnt bigint,
      node2_edge_cnt bigint,
      triangle_cnt bigint,
      density double
    );

Structure of the graph corresponding to the data:
[Figure: graph structure of the edge clustering coefficient test data]

Running result

Each row lists node1, node2, node1_edge_cnt, node2_edge_cnt, triangle_cnt, and density:

    1,2,4,4,2,0.5
    2,3,4,4,3,0.75
    2,5,4,7,3,0.75
    3,1,4,4,2,0.5
    3,4,4,4,2,0.5
    4,2,4,4,2,0.5
    4,5,4,7,3,0.75
    5,1,7,4,3,0.75
    5,3,7,4,3,0.75
    5,6,7,3,2,0.66667
    5,8,7,3,2,0.66667
    6,7,3,3,1,0.33333
    7,1,3,4,1,0.33333
    7,5,3,7,2,0.66667
    8,4,3,4,1,0.33333
    8,6,3,3,1,0.33333

Triangle count

Function overview

Outputs all triangles in an undirected graph G.
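A straightforward way to enumerate triangles is to intersect the neighbor sets of the two endpoints of every edge, as in the Python sketch below. It is only an illustration: the sampling controlled by maxEdgeCnt is not modeled, and the function name `list_triangles` is hypothetical.

```python
from collections import defaultdict

def list_triangles(edges):
    """Return the sorted list of triangles, each as a sorted 3-tuple of nodes."""
    adj = defaultdict(set)
    for u, v in edges:              # undirected graph
        adj[u].add(v)
        adj[v].add(u)
    found = set()
    for u, v in edges:
        for w in adj[u] & adj[v]:   # w is linked to both ends of the edge
            found.add(tuple(sorted((u, v, w))))
    return sorted(found)

# Same edges as the test data in the Example section of this algorithm.
edges = [("1", "2"), ("1", "3"), ("1", "4"), ("1", "5"), ("1", "6"), ("2", "3"),
         ("3", "4"), ("4", "5"), ("5", "6"), ("5", "7"), ("6", "7")]
triangles = list_triangles(edges)
for tri in triangles:
    print(",".join(tri))
```

Each triangle is found once per edge, so storing sorted tuples in a set removes the duplicates.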

Parameter settings

maxEdgeCnt: If the node degree is greater than the value of this parameter, sampling is performed. This parameter is optional, and the default value is 500.

PAI command

    PAI -name TriangleCount
        -project algo_public
        -DinputEdgeTableName=TriangleCount_func_test_edge
        -DfromVertexCol=flow_out_id
        -DtoVertexCol=flow_in_id
        -DoutputTableName=TriangleCount_func_test_result;

Algorithm parameters

| Parameter key | Description | Required/Optional | Default value |
|---|---|---|---|
| inputEdgeTableName | Name of the input edge table | Required | NA |
| inputEdgeTablePartitions | Partitions in the input edge table | Optional | Entire table |
| fromVertexCol | Start column in the input edge table | Required | NA |
| toVertexCol | End column in the input edge table | Required | NA |
| outputTableName | Name of the output table | Required | NA |
| outputTablePartitions | Partitions in the output table | Optional | NA |
| lifecycle | Life cycle of the output table | Optional | NA |
| maxEdgeCnt | If the node degree is greater than the value of this parameter, sampling is performed. | Optional | 500 |
| workerNum | Worker count | Optional | Not set |
| workerMem | Worker memory size | Optional | 4096 |
| splitSize | Data split size | Optional | 64 |

Example

Test data

SQL statement for data generation:

    drop table if exists TriangleCount_func_test_edge;
    create table TriangleCount_func_test_edge as
    select * from
    (
      select '1' as flow_out_id,'2' as flow_in_id from dual
      union all
      select '1' as flow_out_id,'3' as flow_in_id from dual
      union all
      select '1' as flow_out_id,'4' as flow_in_id from dual
      union all
      select '1' as flow_out_id,'5' as flow_in_id from dual
      union all
      select '1' as flow_out_id,'6' as flow_in_id from dual
      union all
      select '2' as flow_out_id,'3' as flow_in_id from dual
      union all
      select '3' as flow_out_id,'4' as flow_in_id from dual
      union all
      select '4' as flow_out_id,'5' as flow_in_id from dual
      union all
      select '5' as flow_out_id,'6' as flow_in_id from dual
      union all
      select '5' as flow_out_id,'7' as flow_in_id from dual
      union all
      select '6' as flow_out_id,'7' as flow_in_id from dual
    )tmp;

    drop table if exists TriangleCount_func_test_result;
    create table TriangleCount_func_test_result
    (
      node1 string,
      node2 string,
      node3 string
    );

Structure of the graph corresponding to the data:
[Figure: graph structure of the triangle count test data]

Running result

Each row lists node1, node2, and node3 of one triangle:

    1,2,3
    1,3,4
    1,4,5
    1,5,6
    5,6,7

Decision tree depth

Function overview

Outputs the depth and tree ID of each node in a network composed of multiple trees.
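As a sketch of how such an output can be produced, the Python code below runs a breadth-first search from every root, where a root is assumed to be a node that never appears in the end column. The tree ID is taken to be the root's ID, and a node reachable along several paths keeps the smallest depth; both choices are assumptions that match the sample output. The function name `tree_depth` is hypothetical.

```python
from collections import defaultdict, deque

def tree_depth(edges):
    """Return {node: (root ID, depth)} using breadth-first search from each root."""
    children = defaultdict(list)
    nodes, has_parent = set(), set()
    for u, v in edges:              # edges point from parent to child
        children[u].append(v)
        nodes.update((u, v))
        has_parent.add(v)
    roots = sorted(nodes - has_parent)
    result = {}
    queue = deque((r, r, 0) for r in roots)
    while queue:
        node, root, depth = queue.popleft()
        if node in result:
            continue                # already reached at a smaller or equal depth
        result[node] = (root, depth)
        for child in children[node]:
            queue.append((child, root, depth + 1))
    return result

# Same edges as the test data in the Example section of this algorithm.
edges = [("0", "1"), ("0", "2"), ("1", "3"), ("1", "4"), ("2", "4"), ("2", "5"),
         ("4", "6"), ("a", "b"), ("a", "c"), ("c", "d"), ("c", "e")]
depths = tree_depth(edges)
for node in sorted(depths):
    root, depth = depths[node]
    print(f"{node},{root},{depth}")
```

Node 4 has two parents in the test data (1 and 2), both at depth 1, so it is reported once at depth 2.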

Parameter settings

None

PAI command

    PAI -name TreeDepth
        -project algo_public
        -DinputEdgeTableName=TreeDepth_func_test_edge
        -DfromVertexCol=flow_out_id
        -DtoVertexCol=flow_in_id
        -DoutputTableName=TreeDepth_func_test_result;

Algorithm parameters

| Parameter key | Description | Required/Optional | Default value |
|---|---|---|---|
| inputEdgeTableName | Name of the input edge table | Required | NA |
| inputEdgeTablePartitions | Partitions in the input edge table | Optional | Entire table |
| fromVertexCol | Start column in the input edge table | Required | NA |
| toVertexCol | End column in the input edge table | Required | NA |
| outputTableName | Name of the output table | Required | NA |
| outputTablePartitions | Partitions in the output table | Optional | NA |
| lifecycle | Life cycle of the output table | Optional | NA |
| workerNum | Worker count | Optional | Not set |
| workerMem | Worker memory size | Optional | 4096 |
| splitSize | Data split size | Optional | 64 |

Example

Test data

SQL statement for data generation:

    drop table if exists TreeDepth_func_test_edge;
    create table TreeDepth_func_test_edge as
    select * from
    (
      select '0' as flow_out_id, '1' as flow_in_id from dual
      union all
      select '0' as flow_out_id, '2' as flow_in_id from dual
      union all
      select '1' as flow_out_id, '3' as flow_in_id from dual
      union all
      select '1' as flow_out_id, '4' as flow_in_id from dual
      union all
      select '2' as flow_out_id, '4' as flow_in_id from dual
      union all
      select '2' as flow_out_id, '5' as flow_in_id from dual
      union all
      select '4' as flow_out_id, '6' as flow_in_id from dual
      union all
      select 'a' as flow_out_id, 'b' as flow_in_id from dual
      union all
      select 'a' as flow_out_id, 'c' as flow_in_id from dual
      union all
      select 'c' as flow_out_id, 'd' as flow_in_id from dual
      union all
      select 'c' as flow_out_id, 'e' as flow_in_id from dual
    )tmp;

    drop table if exists TreeDepth_func_test_result;
    create table TreeDepth_func_test_result
    (
      node string,
      root string,
      depth bigint
    );

Structure of the graph corresponding to the data:
[Figure: graph structure of the tree depth test data]

Running result

Each row lists node, root, and depth:

    0,0,0
    1,0,1
    2,0,1
    3,0,2
    4,0,2
    5,0,2
    6,0,3
    a,a,0
    b,a,1
    c,a,1
    d,a,2
    e,a,2