PageRank is an algorithm that ranks web pages based on the links that point to them. This topic describes the PageRank component provided by Machine Learning Studio.
Background information
The basic ideas behind PageRank are as follows:
- The more web pages link to a web page, the higher the importance or quality of that web page.
- In addition to the number of inbound links, the weight of each linking web page and the number of its outbound links are considered during page ranking.
- For a social network of users, the edge weight is an important factor in addition to the influence of the users.
For example, a Sina Weibo user is more likely to have influence on their family, friends, classmates, and colleagues than they have on followers with a weaker relationship. In a social network, the edge weight is equivalent to the user-to-user relationship strength index.
The PageRank formula with the link weight:

- W(i): indicates the weight of Node i.
- C(Ai): indicates the link weight.
- d: indicates the damping coefficient.
- W(A): indicates the influence index of each user and represents the node weight after the algorithm iteration becomes stable.
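The symbols above suggest a formula of the following shape. This is a reconstruction rather than the original figure: N (the total number of nodes) and the (1 - d)/N term are assumptions, chosen because they reproduce the sample output in the Examples section when d is assumed to be 0.85. The sum runs over every node i that has a link to node A.

```latex
W(A) = \frac{1 - d}{N} + d \sum_{i} W(i) \cdot C(Ai)
```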
You can configure the component by using one of the following methods:
Machine Learning Platform for AI console
Tab | Parameter | Description |
---|---|---|
Fields Settings | Source Vertex Column | The start vertex column in the edge table. |
Fields Settings | Target Vertex Column | The end vertex column in the edge table. |
Fields Settings | Edge Weight Column | The edge weight column in the edge table. |
Parameters Settings | Maximum Iterations | The maximum number of iterations that the algorithm runs. Default value: 30. This parameter is optional. |
Parameters Settings | Damping Coefficient | The probability that a user continues browsing. |
Tuning | Workers | The number of workers for parallel job execution. The parallelism level and framework communication costs increase with the value of this parameter. |
Tuning | Memory Size Per Worker (MB) | The maximum size of memory that a single worker can use. By default, the system allocates 4,096 MB for each worker. If the used memory size exceeds the value of this parameter, the OutOfMemory exception is reported. |
PAI command
PAI -name PageRankWithWeight
-project algo_public
-DinputEdgeTableName=PageRankWithWeight_func_test_edge
-DfromVertexCol=flow_out_id
-DtoVertexCol=flow_in_id
-DoutputTableName=PageRankWithWeight_func_test_result
-DhasEdgeWeight=true
-DedgeWeightCol=weight
-DmaxIter=100;
Parameter | Required | Description | Default value |
---|---|---|---|
inputEdgeTableName | Yes | The name of the input edge table. | No default value |
inputEdgeTablePartitions | No | The partitions in the input edge table. | Full table |
fromVertexCol | Yes | The start vertex column in the input edge table. | No default value |
toVertexCol | Yes | The end vertex column in the input edge table. | No default value |
outputTableName | Yes | The name of the output table. | No default value |
outputTablePartitions | No | The partitions in the output table. | No default value |
lifecycle | No | The lifecycle of the output table. | No default value |
workerNum | No | The number of workers for parallel job execution. The parallelism level and framework communication costs increase with the value of this parameter. | Not configured |
workerMem | No | The maximum size of memory, in MB, that a single worker can use. If the used memory size exceeds the value of this parameter, the OutOfMemory exception is reported. | 4096 |
splitSize | No | The data split size. | 64 |
hasEdgeWeight | No | Specifies whether the edges in the input edge table have weights. | false |
edgeWeightCol | No | The edge weight column in the input edge table. | No default value |
maxIter | No | The maximum number of iterations. | 30 |
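The parameter table can be turned into a small helper that assembles the command text and checks the required parameters. This is a minimal sketch: `build_pai_command` and `REQUIRED` are hypothetical names introduced for illustration and are not part of any PAI SDK. The helper only formats a string in the `-Dkey=value` form used above; it does not submit the job.

```python
# Hypothetical helper: formats a PAI command string from a parameter dict.
# The required names are taken from the "Required" column of the table.
REQUIRED = ("inputEdgeTableName", "fromVertexCol", "toVertexCol", "outputTableName")


def build_pai_command(params, name="PageRankWithWeight", project="algo_public"):
    """Return the PAI command text for the given parameters."""
    missing = [key for key in REQUIRED if key not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    lines = [f"PAI -name {name}", f"-project {project}"]
    # Every component parameter is passed as -Dkey=value.
    lines += [f"-D{key}={value}" for key, value in params.items()]
    return "\n".join(lines) + ";"


command = build_pai_command({
    "inputEdgeTableName": "PageRankWithWeight_func_test_edge",
    "fromVertexCol": "flow_out_id",
    "toVertexCol": "flow_in_id",
    "outputTableName": "PageRankWithWeight_func_test_result",
    "hasEdgeWeight": "true",
    "edgeWeightCol": "weight",
    "maxIter": 100,
})
print(command)
```

Optional parameters that are left out of the dictionary are simply omitted from the command, so the component falls back to the defaults listed in the table.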
Examples
- Generate training data.
drop table if exists PageRankWithWeight_func_test_edge;
create table PageRankWithWeight_func_test_edge as
select * from
(
  select 'a' as flow_out_id, 'b' as flow_in_id, 1.0 as weight from dual
  union all
  select 'a' as flow_out_id, 'c' as flow_in_id, 1.0 as weight from dual
  union all
  select 'b' as flow_out_id, 'c' as flow_in_id, 1.0 as weight from dual
  union all
  select 'b' as flow_out_id, 'd' as flow_in_id, 1.0 as weight from dual
  union all
  select 'c' as flow_out_id, 'd' as flow_in_id, 1.0 as weight from dual
) tmp;
The following figure shows the structure of the PageRank graph.
- View training results.
+------+------------+
| node | weight     |
+------+------------+
| a    | 0.0375     |
| b    | 0.06938    |
| c    | 0.12834    |
| d    | 0.20556    |
+------+------------+
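The sample result can be cross-checked with a short Python sketch. It is not the component's implementation. It assumes a damping coefficient of 0.85, a (1 - d)/N base term where N is the number of nodes, and edge weights that are used as given without normalization by out-degree. Under these assumptions the computed values agree with the table above to within rounding. The function name `weighted_pagerank` is hypothetical.

```python
from collections import defaultdict


def weighted_pagerank(edges, d=0.85, max_iter=30):
    """Iterate W(A) = (1 - d)/N + d * sum(W(i) * C(Ai)) over incoming edges.

    edges: list of (source, target, weight) tuples.
    d and the (1 - d)/N term are assumptions, not documented defaults.
    """
    nodes = sorted({v for src, dst, _ in edges for v in (src, dst)})
    n = len(nodes)
    incoming = defaultdict(list)
    for src, dst, weight in edges:
        incoming[dst].append((src, weight))

    rank = {v: 1.0 / n for v in nodes}  # uniform starting values
    for _ in range(max_iter):
        # The comprehension reads the previous ranks, then rebinds `rank`.
        rank = {
            v: (1.0 - d) / n + d * sum(rank[s] * w for s, w in incoming[v])
            for v in nodes
        }
    return rank


# Same edges as PageRankWithWeight_func_test_edge.
edges = [
    ("a", "b", 1.0),
    ("a", "c", 1.0),
    ("b", "c", 1.0),
    ("b", "d", 1.0),
    ("c", "d", 1.0),
]
ranks = weighted_pagerank(edges)
for node in sorted(ranks):
    print(node, round(ranks[node], 5))
```

Because the sample graph has no cycles, the values stop changing after four iterations, well within the default limit of 30.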