Platform for AI: PMI

Last Updated: Apr 03, 2024

The PMI algorithm component of Platform for AI (PAI) counts the co-occurrences of words across a set of documents and calculates the pointwise mutual information (PMI) of each word pair. This topic describes how to configure the PMI algorithm component.

Background information

In information theory, mutual information (MI) is the amount of information that one random variable contains about another. Equivalently, it is the reduction in uncertainty about one random variable when the other is known.

PMI quantifies the relevance between two words. Definition: PMI(x,y) = ln(p(x,y) / (p(x)p(y))) = ln(#(x,y) * D / (#x * #y)). In the definition, #(x,y) indicates the number of times the pair (x,y) is counted, #x and #y indicate the counts of x and y accumulated over those pairs, and D indicates the total number of pairs. Each time x and y appear in the same window, the counts are updated as follows: #x += 1, #y += 1, and #(x,y) += 1. For more information about PMI, see PMI.
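The count-based form of the definition can be evaluated directly. The following Python snippet is a minimal sketch, not the component's implementation. The function name pmi and its argument names are illustrative assumptions, and the counts passed in are hypothetical.

```python
import math


def pmi(pair_count: int, x_count: int, y_count: int, total_pairs: int) -> float:
    """Return ln(#(x,y) * D / (#x * #y)) for one word pair."""
    return math.log(pair_count * total_pairs / (x_count * y_count))


# Hypothetical counts: the pair was counted once, each word was counted
# twice, and 32 pairs were counted in total.
value = pmi(pair_count=1, x_count=2, y_count=2, total_pairs=32)
print(value)  # ln(8), roughly 2.079
```

A positive value means the pair occurs more often than its individual counts would suggest, and a negative value means it occurs less often.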

Configure the component

You can use one of the following methods to configure the PMI component:

Method 1: Configure the component in the PAI console

You can configure the parameters of the PMI component on the pipeline page of Machine Learning Designer.

| Tab | Parameter | Description |
|---|---|---|
| Fields Setting | Columns of Documents with Words Separated with Spaces | N/A |
| Parameters Setting | Minimum Frequency of Words | Words that appear fewer times than this value are filtered out. Default value: 5. |
| Parameters Setting | Window Size | The window size. For example, a value of 5 indicates the five adjacent words to the right of the current word. Words that appear in the window are considered related to the current word. |
| Tuning | Computing Cores | The number of cores used for calculation. By default, the system determines the value. |
| Tuning | Memory Size per Core (Unit: MB) | The memory size of each core. By default, the system determines the value. |
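To illustrate the Window Size parameter, the following Python sketch lists the word pairs that a window of size 2 produces for one document. The helper name window_pairs is an assumption made for illustration. The sketch only mirrors the description above (each word is paired with the adjacent words to its right) and is not the component's code.

```python
from typing import List, Tuple


def window_pairs(words: List[str], window_size: int) -> List[Tuple[str, str]]:
    """Pair each word with up to `window_size` words to its right."""
    pairs = []
    for i, current in enumerate(words):
        for neighbor in words[i + 1 : i + 1 + window_size]:
            pairs.append((current, neighbor))
    return pairs


pairs = window_pairs("w1 w3 w5 w6 w9".split(), window_size=2)
print(pairs)
# [('w1', 'w3'), ('w1', 'w5'), ('w3', 'w5'), ('w3', 'w6'),
#  ('w5', 'w6'), ('w5', 'w9'), ('w6', 'w9')]
```

With a window size of 2, w1 is paired with w3 and w5 but not with w6, which lies outside the window.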

Method 2: Configure the parameters by using PAI commands

You can use the SQL Script component to run PAI commands. For more information, see SQL Script. The following command shows the syntax, and the table after it describes the parameters.

PAI -name PointwiseMutualInformation    
    -project algo_public    
    -DinputTableName=maple_test_pmi_basic_input    
    -DdocColName=doc    
    -DoutputTableName=maple_test_pmi_basic_output    
    -DminCount=0    
    -DwindowSize=2    
    -DcoreNum=1    
    -DmemSizePerCore=110;

| Parameter | Required | Description | Default value |
|---|---|---|---|
| inputTableName | Yes | The name of the input table. | N/A |
| outputTableName | Yes | The name of the output table. | N/A |
| docColName | Yes | The name of the document column after word segmentation, in which words are separated with spaces. | N/A |
| windowSize | No | The window size. For example, a value of 5 indicates the five adjacent words to the right of the current word. Words that appear in the window are considered related to the current word. | All content in a row |
| minCount | No | The minimum frequency of words for truncation. Words that appear fewer times than this value are filtered out. | 5 |
| inputTablePartitions | No | The partitions selected from the input table for training, in the partition_name=value format. For multi-level partitions, use the name1=value1/name2=value2 format. Separate multiple partitions with commas (,). | All partitions |
| lifecycle | No | The lifecycle of the output table. | N/A |
| coreNum | No | The number of cores used for calculation. Valid values: [1,9999]. | Determined by the system |
| memSizePerCore | No | The memory size of each core. Unit: MB. Valid values: [1024,65536]. | Determined by the system |
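The inputTablePartitions format can be easier to read with a concrete value. The following Python sketch builds such a value under the assumption that name1=value1/name2=value2 describes the levels of one multi-level partition and that commas separate different partitions. The helper format_partitions and the partition columns ds and hh are hypothetical and are not part of PAI.

```python
from typing import Dict, List


def format_partitions(partitions: List[Dict[str, str]]) -> str:
    """Join partition levels with '/' and separate partitions with ','."""
    specs = []
    for partition in partitions:
        levels = [f"{name}={value}" for name, value in partition.items()]
        specs.append("/".join(levels))
    return ",".join(specs)


# Two hypothetical two-level partitions of the input table.
value = format_partitions([
    {"ds": "20240401", "hh": "00"},
    {"ds": "20240402", "hh": "00"},
])
print(f"-DinputTablePartitions={value}")
# -DinputTablePartitions=ds=20240401/hh=00,ds=20240402/hh=00
```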

Sample command

  • Input

    Create a table named maple_test_pmi_basic_input by using the ODPS SQL node. For more information, see Develop a MaxCompute SQL task. Sample command:

    create table maple_test_pmi_basic_input as
    select * from
    (
        select "w1 w2 w3 w4 w5 w6 w7 w8 w8 w9" as doc
        union all
        select "w1 w3 w5 w6 w9" as doc
        union all
        select "w0" as doc
        union all
        select "w0 w0" as doc
        union all
        select "w9 w1 w9 w1 w9" as doc
    ) tmp;

    Sample data in the maple_test_pmi_basic_input table after you run the command:

    doc

    w1 w2 w3 w4 w5 w6 w7 w8 w8 w9

    w1 w3 w5 w6 w9

    w0

    w0 w0

    w9 w1 w9 w1 w9

  • Run the PAI command

    You can use an SQL Script component or an ODPS SQL node to run the following PAI command.

    PAI -name PointwiseMutualInformation    
        -project algo_public    
        -DinputTableName=maple_test_pmi_basic_input    
        -DdocColName=doc    
        -DoutputTableName=maple_test_pmi_basic_output    
        -DminCount=0    
        -DwindowSize=2    
        -DcoreNum=1    
        -DmemSizePerCore=110;
  • Output

    Sample output table maple_test_pmi_basic_output:

    | word1 | word2 | word1_count | word2_count | co_occurrences_count | pmi |
    |---|---|---|---|---|---|
    | w0 | w0 | 2 | 2 | 1 | 2.0794415416798357 |
    | w1 | w1 | 10 | 10 | 1 | -1.1394342831883648 |
    | w1 | w2 | 10 | 3 | 1 | 0.06453852113757116 |
    | w1 | w3 | 10 | 7 | 2 | -0.08961215868968704 |
    | w1 | w5 | 10 | 8 | 1 | -0.916290731874155 |
    | w1 | w9 | 10 | 12 | 4 | 0.06453852113757116 |
    | w2 | w3 | 3 | 7 | 1 | 0.4212134650763035 |
    | w2 | w4 | 3 | 4 | 1 | 0.9808292530117262 |
    | w3 | w4 | 7 | 4 | 1 | 0.13353139262452257 |
    | w3 | w5 | 7 | 8 | 2 | 0.13353139262452257 |
    | w3 | w6 | 7 | 7 | 1 | -0.42608439531090014 |
    | w4 | w5 | 4 | 8 | 1 | 0.0 |
    | w4 | w6 | 4 | 7 | 1 | 0.13353139262452257 |
    | w5 | w6 | 8 | 7 | 2 | 0.13353139262452257 |
    | w5 | w7 | 8 | 4 | 1 | 0.0 |
    | w5 | w9 | 8 | 12 | 1 | -1.0986122886681098 |
    | w6 | w7 | 7 | 4 | 1 | 0.13353139262452257 |
    | w6 | w8 | 7 | 7 | 1 | -0.42608439531090014 |
    | w6 | w9 | 7 | 12 | 1 | -0.9650808960435872 |
    | w7 | w8 | 4 | 7 | 2 | 0.8266785731844679 |
    | w8 | w8 | 7 | 7 | 1 | -0.42608439531090014 |
    | w8 | w9 | 7 | 12 | 2 | -0.2719337154836418 |
    | w9 | w9 | 12 | 12 | 2 | -0.8109302162163288 |
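The sample output can be cross-checked locally. The following Python sketch applies the counting rule from the definition (#x += 1, #y += 1, and #(x,y) += 1 for every pair in a window) to the five sample documents with a window size of 2. It is an illustrative re-implementation, not the component's code. Treating (x,y) and (y,x) as the same pair is an assumption inferred from the sample output, and minCount filtering is not modeled because the sample uses minCount=0.

```python
import math
from collections import Counter

docs = [
    "w1 w2 w3 w4 w5 w6 w7 w8 w8 w9",
    "w1 w3 w5 w6 w9",
    "w0",
    "w0 w0",
    "w9 w1 w9 w1 w9",
]
window_size = 2

word_count = Counter()  # #x: times a word takes part in a pair
pair_count = Counter()  # #(x,y): keyed by the sorted pair (assumption)
total_pairs = 0         # D: total number of pairs

for doc in docs:
    words = doc.split()
    for i, x in enumerate(words):
        # Pair the current word with the words to its right in the window.
        for y in words[i + 1 : i + 1 + window_size]:
            word_count[x] += 1
            word_count[y] += 1
            pair_count[tuple(sorted((x, y)))] += 1
            total_pairs += 1

rows = []
for (word1, word2), count in sorted(pair_count.items()):
    score = math.log(count * total_pairs / (word_count[word1] * word_count[word2]))
    rows.append((word1, word2, word_count[word1], word_count[word2], count, score))

for row in rows:
    print(*row)
```

Under these assumptions, the sketch counts 32 pairs in total and yields 23 distinct word pairs, and the counts and PMI values agree with the table above up to floating-point rounding.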

References