Platform For AI: Page Rank

Last Updated: Mar 20, 2024

The PageRank algorithm measures the importance of a web page by analyzing the number and quality of the hyperlinks that point to it. A web page with more inbound links ranks higher, and the weights of the link sources also affect its final PageRank score. The Page Rank component calculates the weight of each node.

Description

The PageRank algorithm analyzes the links to a web page to evaluate the relative importance of the web page. The PageRank algorithm works based on the following core principles:

  • A larger number of links from other web pages to a web page indicates higher importance or quality of the web page.

  • The PageRank algorithm counts the links from other web pages to a web page and takes the weights of those linking pages into account. The weight that a linking page contributes is calculated based on its PageRank score and the number of links from it to other web pages.

The PageRank algorithm can also be applied to social networks. In a social network, the influence of a user is determined by the user's personal attributes and the quality of the user's social connections. For example, the influence of a Sina Weibo user on a follower depends on how close their relationship is. In most cases, a Sina Weibo user has more influence on family members, classmates, and colleagues. The edge weight reflects the closeness of the relationship between two users and serves as the relationship strength index.

The PageRank formula that includes the link weight [image: PageRank formula] uses the following symbols:

  • W(i): the weight of Node i.

  • C(Ai): the link weight.

  • d: the damping coefficient.

  • W(A): the weight of Node A after the iteration becomes stable. In a social network, this is the influence index of the user.
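As a hedged reading of the symbols above, one common form of weighted PageRank can be written as shown below. Two points are assumptions rather than definitions from this page: N is the total number of nodes, and C(Ai) is read as the weight of the link from Node i to Node A divided by the total weight of all outbound links of Node i. The exact formula that the component uses may differ, for example in how it treats nodes without outbound links.

$$
W(A) = \frac{1 - d}{N} + d \sum_{i \to A} W(i)\, C(A_i)
$$

Under this reading, a node passes on more of its weight through links that have a larger edge weight, and a larger d makes the link structure count for more relative to the uniform term.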

Configure the component

Method 1: Configure the component on the pipeline page

Configure the parameters of the Page Rank component on the pipeline page of Machine Learning Designer in the Platform for AI (PAI) console. The following table describes the parameters.

| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Source Vertex Column | The start vertex column in the edge table. |
| Fields Setting | Target Vertex Column | The end vertex column in the edge table. |
| Fields Setting | Edge Weight Column | The edge weight column in the edge table. |
| Parameters Setting | Maximum Iterations | The maximum number of iterations that the algorithm runs before it stops. Default value: 30. |
| Parameters Setting | Damping Coefficient | The probability that a user continues browsing by following a link from the current page. |
| Tuning | Workers | The number of nodes for parallel job execution. The degree of parallelism and framework communication costs increase with the value of this parameter. |
| Tuning | Memory Size per Worker (MB) | The maximum size of memory that can be used by a job. Unit: MB. Default value: 4096. If the size of the used memory exceeds the value of this parameter, the OutOfMemory error is reported. |

Method 2: Use PAI commands

Configure the parameters of the Page Rank component by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see Scenario 4: Execute PAI commands within the SQL script component.

PAI -name PageRankWithWeight
    -project algo_public
    -DinputEdgeTableName=PageRankWithWeight_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DoutputTableName=PageRankWithWeight_func_test_result
    -DhasEdgeWeight=true
    -DedgeWeightCol=weight
    -DmaxIter=100;

| Parameter | Required | Default value | Description |
| --- | --- | --- | --- |
| inputEdgeTableName | Yes | No default value | The name of the input edge table. |
| inputEdgeTablePartitions | No | Full table | The partitions in the input edge table. |
| fromVertexCol | Yes | No default value | The start vertex column in the input edge table. |
| toVertexCol | Yes | No default value | The end vertex column in the input edge table. |
| outputTableName | Yes | No default value | The name of the output table. |
| outputTablePartitions | No | No default value | The partitions in the output table. |
| lifecycle | No | No default value | The lifecycle of the output table. |
| workerNum | No | No default value | The number of nodes for parallel job execution. The degree of parallelism and framework communication costs increase with the value of this parameter. |
| workerMem | No | 4096 | The maximum size of memory that can be used by a job. Unit: MB. If the size of the used memory exceeds the value of this parameter, the OutOfMemory error is reported. |
| splitSize | No | 64 | The data split size. Unit: MB. |
| hasEdgeWeight | No | false | Specifies whether the edges in the input edge table have weights. |
| edgeWeightCol | No | No default value | The edge weight column in the input edge table. |
| maxIter | No | 30 | The maximum number of iterations. |
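If the command text is generated from a script, the parameters can be assembled programmatically. The following is a minimal sketch that only builds the command string. `build_pai_command` is a hypothetical helper written for illustration, not part of PAI, and it assumes that every parameter is passed in the `-Dname=value` form.

```python
def build_pai_command(params, name="PageRankWithWeight", project="algo_public"):
    """Assemble a PAI command string from a dict of -D parameters (illustrative helper)."""
    # Parameters marked as required in the parameter table.
    required = ("inputEdgeTableName", "fromVertexCol", "toVertexCol", "outputTableName")
    missing = [key for key in required if key not in params]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")

    lines = [f"PAI -name {name}", f"    -project {project}"]
    for key, value in params.items():
        # Write booleans in lowercase (true/false).
        if isinstance(value, bool):
            value = str(value).lower()
        lines.append(f"    -D{key}={value}")
    return "\n".join(lines) + ";"


command = build_pai_command({
    "inputEdgeTableName": "PageRankWithWeight_func_test_edge",
    "fromVertexCol": "flow_out_id",
    "toVertexCol": "flow_in_id",
    "outputTableName": "PageRankWithWeight_func_test_result",
    "hasEdgeWeight": True,
    "edgeWeightCol": "weight",
    "maxIter": 100,
})
print(command)
```

The helper checks only that the required parameters are present. It does not validate table names or column names, and the resulting string still has to be run through the SQL Script component.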

Example

  1. Add the SQL Script component as a node to the canvas and execute the following SQL statements to generate the input edge table.

    drop table if exists PageRankWithWeight_func_test_edge;
    create table PageRankWithWeight_func_test_edge as
    select * from
    (
        select 'a' as flow_out_id,'b' as flow_in_id,1.0 as weight
        union all
        select 'a' as flow_out_id,'c' as flow_in_id,1.0 as weight
        union all
        select 'b' as flow_out_id,'c' as flow_in_id,1.0 as weight
        union all
        select 'b' as flow_out_id,'d' as flow_in_id,1.0 as weight
        union all
        select 'c' as flow_out_id,'d' as flow_in_id,1.0 as weight
    )tmp;

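    Data structure (figure): the edge table describes a directed graph with four nodes (a, b, c, d) and five edges (a→b, a→c, b→c, b→d, c→d), each with a weight of 1.0.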

  2. Add the SQL Script component as a node to the canvas and run the following PAI commands to compute the PageRank weights.

    drop table if exists ${o1};
    PAI -name PageRankWithWeight
        -project algo_public
        -DinputEdgeTableName=PageRankWithWeight_func_test_edge
        -DfromVertexCol=flow_out_id
        -DtoVertexCol=flow_in_id
        -DoutputTableName=${o1}
        -DhasEdgeWeight=true
        -DedgeWeightCol=weight
        -DmaxIter=100;
  3. Right-click the SQL Script component and choose View Data > SQL Script Output to view the results.

    | node | weight     |
    | ---- | ---------- |
    | a    | 0.12841452 |
    | b    | 0.18299069 |
    | c    | 0.26076174 |
    | d    | 0.42783305 |
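The results can be cross-checked offline with a small power-iteration sketch in Python. It is not the component's implementation and it rests on several assumptions: a damping coefficient of 0.85, weights normalized to sum to 1, outbound edge weights normalized per source node, and the weight of nodes without outbound links (node d here) spread evenly over all nodes. The function name `weighted_pagerank` and its arguments are illustrative.

```python
from collections import defaultdict


def weighted_pagerank(edges, damping=0.85, max_iter=100, tol=1e-10):
    """Power iteration for weighted PageRank over (source, target, weight) edges.

    Assumes positive edge weights.
    """
    nodes = sorted({v for source, target, _ in edges for v in (source, target)})
    n = len(nodes)

    # Total outbound weight per node, used to normalize each edge weight.
    out_weight = defaultdict(float)
    for source, _, weight in edges:
        out_weight[source] += weight

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Weight held by nodes without outbound links is spread evenly (assumption).
        dangling = sum(rank[node] for node in nodes if out_weight[node] == 0.0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for source, target, weight in edges:
            new_rank[target] += damping * rank[source] * weight / out_weight[source]
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Same edges as PageRankWithWeight_func_test_edge.
edges = [
    ("a", "b", 1.0),
    ("a", "c", 1.0),
    ("b", "c", 1.0),
    ("b", "d", 1.0),
    ("c", "d", 1.0),
]
ranks = weighted_pagerank(edges)
for node, weight in ranks.items():
    print(node, round(weight, 8))
```

Under these assumptions the sketch agrees with the weights in the table above to about six decimal places, which suggests, but does not confirm, that the component uses a comparable normalization.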