This topic describes the Pointwise Mutual Information (PMI) component provided by Machine Learning Studio.

Mutual information (MI) is a measure from information theory. It quantifies how much information one random variable carries about another, or equivalently, how much the uncertainty about one random variable is reduced by knowing the other.

This algorithm counts the co-occurrences of all words across the input documents and calculates the pointwise mutual information (PMI) of each word pair. PMI is defined as PMI(x,y) = ln(p(x,y)/(p(x)p(y))) = ln(#(x,y)D/(#x#y)). In the definition, #(x,y) indicates the number of times the pair (x,y) is counted, #x and #y indicate how many times x and y are counted as members of a pair, and D indicates the total number of pairs. Each time x and y appear in the same window, the counters are updated as follows: #x+=1; #y+=1; #(x,y)+=1. For more information about PMI, see PMI.
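
As a minimal sketch of the count-based form of the definition, the following Python function evaluates ln(#(x,y)D/(#x#y)) directly. The function name, parameter names, and the sample counts are illustrative assumptions and are not part of the component.

```python
import math

def pmi_from_counts(pair_count, x_count, y_count, total_pairs):
    """Return ln(#(x,y) * D / (#x * #y)) using the natural logarithm."""
    if min(pair_count, x_count, y_count, total_pairs) <= 0:
        raise ValueError("all counts must be positive")
    return math.log(pair_count * total_pairs / (x_count * y_count))

# Hypothetical counts: #(x,y)=1, #x=2, #y=4, D=8.
value = pmi_from_counts(pair_count=1, x_count=2, y_count=4, total_pairs=8)
print(value)  # 0.0, because 1 * 8 / (2 * 4) = 1 and ln(1) = 0
```

A PMI of 0 means the pair occurs exactly as often as expected if the two words were independent. Positive values indicate that the words co-occur more often than that, and negative values indicate less often.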

Configure the component

You can configure the component by using one of the following methods:
  • Machine Learning Platform for AI console
    Tab | Parameter | Description
    Fields Setting | Columns of Documents with Words Separated with Spaces | None
    Parameters Setting | Minimum Frequency of Words | Words that appear fewer times than this value are filtered out. Default value: 5.
    Parameters Setting | Window Size | The window size. For example, the value 5 refers to the five words to the right of the current word. Words that appear in the window are considered related to the current word.
    Tuning | Computing Cores | The number of cores used for calculation. The value is automatically allocated.
    Tuning | Memory Size per Core (Unit: MB) | The size of memory required by each core. The value is automatically allocated.
  • PAI command
    PAI -name PointwiseMutualInformation    
        -project algo_public    
        -DinputTableName=maple_test_pmi_basic_input    
        -DdocColName=doc    
        -DoutputTableName=maple_test_pmi_basic_output    
        -DminCount=0    
        -DwindowSize=2    
        -DcoreNum=1    
        -DmemSizePerCore=110;
    Parameter | Required | Description | Default value
    inputTableName | Yes | The input table. | No default value
    outputTableName | Yes | The output table. | No default value
    docColName | Yes | The name of the document column after word segmentation, where words are separated with spaces. | No default value
    windowSize | No | The window size. For example, the value 5 refers to the five words to the right of the current word. Words that appear in the window are considered related to the current word. | All content in a row
    minCount | No | The minimum word frequency used for truncation. Words that appear fewer times than this value are filtered out. | 5
    inputTablePartitions | No | The partitions of the input table to use, in the format of Partition_name=value. To specify multi-level partitions, use the following format: name1=value1/name2=value2. To specify multiple partitions, separate them with commas (,). | Full table
    lifecycle | No | The lifecycle of the output table. | No default value
    coreNum | No | The number of cores. Valid values: [1,9999]. | Automatically allocated
    memSizePerCore | No | The memory size of each core. Unit: MB. Valid values: [1024,65536]. | Automatically allocated
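
To illustrate the windowSize parameter, the following Python sketch lists the word pairs that a right-hand window produces for a single row. It is not the component's implementation. The helper name window_pairs is hypothetical, and treating window_size=None as "all content in a row" is an assumption based on the documented default.

```python
def window_pairs(words, window_size=None):
    """Pair each word with up to window_size words to its right.

    window_size=None is assumed to mean that the window covers the rest of the row.
    """
    pairs = []
    for i, left in enumerate(words):
        if window_size is None:
            end = len(words)
        else:
            end = min(len(words), i + 1 + window_size)
        for right in words[i + 1:end]:
            pairs.append((left, right))
    return pairs

words = "w1 w2 w3 w4".split()
print(window_pairs(words, window_size=2))
# [('w1', 'w2'), ('w1', 'w3'), ('w2', 'w3'), ('w2', 'w4'), ('w3', 'w4')]
print(len(window_pairs(words)))  # 6 pairs when the window covers the whole row
```

A smaller window keeps only close neighbors and reduces the number of pairs to count. A larger window relates words that are farther apart in the same row.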

Examples

  • Generate data.
    create table maple_test_pmi_basic_input as
    select * from
    (  
        select "w1 w2 w3 w4 w5 w6 w7 w8 w8 w9" as doc from dual
        union all  
        select "w1 w3 w5 w6 w9" as doc from dual  
        union all  
        select "w0" as doc from dual  
        union all  
        select "w0 w0" as doc from dual  
        union all  
        select "w9 w1 w9 w1 w9" as doc from dual
    )tmp;
    doc:string
    w1 w2 w3 w4 w5 w6 w7 w8 w8 w9
    w1 w3 w5 w6 w9
    w0
    w0 w0
    w9 w1 w9 w1 w9
  • Run the PAI command.
    PAI -name PointwiseMutualInformation    
        -project algo_public    
        -DinputTableName=maple_test_pmi_basic_input    
        -DdocColName=doc    
        -DoutputTableName=maple_test_pmi_basic_output    
        -DminCount=0    
        -DwindowSize=2    
        -DcoreNum=1    
        -DmemSizePerCore=110;
  • View the output.
    word1 word2 word1_count word2_count co_occurrences_count pmi
    w0 w0 2 2 1 2.0794415416798357
    w1 w1 10 10 1 -1.1394342831883648
    w1 w2 10 3 1 0.06453852113757116
    w1 w3 10 7 2 -0.08961215868968704
    w1 w5 10 8 1 -0.916290731874155
    w1 w9 10 12 4 0.06453852113757116
    w2 w3 3 7 1 0.4212134650763035
    w2 w4 3 4 1 0.9808292530117262
    w3 w4 7 4 1 0.13353139262452257
    w3 w5 7 8 2 0.13353139262452257
    w3 w6 7 7 1 -0.42608439531090014
    w4 w5 4 8 1 0
    w4 w6 4 7 1 0.13353139262452257
    w5 w6 8 7 2 0.13353139262452257
    w5 w7 8 4 1 0
    w5 w9 8 12 1 -1.0986122886681098
    w6 w7 7 4 1 0.13353139262452257
    w6 w8 7 7 1 -0.42608439531090014
    w6 w9 7 12 1 -0.9650808960435872
    w7 w8 4 7 2 0.8266785731844679
    w8 w8 7 7 1 -0.42608439531090014
    w8 w9 7 12 2 -0.2719337154836418
    w9 w9 12 12 2 -0.8109302162163288
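
The output above can be cross-checked with a short Python sketch that applies the counting rule #x+=1; #y+=1; #(x,y)+=1 to the five sample documents with a window of two words to the right. This is an illustration under stated assumptions, not the component's implementation: the function name pmi_table is hypothetical, pairs are treated as unordered, and minCount filtering is omitted because the example sets minCount=0.

```python
import math
from collections import Counter

docs = [
    "w1 w2 w3 w4 w5 w6 w7 w8 w8 w9",
    "w1 w3 w5 w6 w9",
    "w0",
    "w0 w0",
    "w9 w1 w9 w1 w9",
]

def pmi_table(docs, window_size):
    word_count = Counter()  # #x: times a word is counted as a member of a pair
    pair_count = Counter()  # #(x,y): counts of unordered pairs
    total_pairs = 0         # D: total number of pairs
    for doc in docs:
        words = doc.split()
        for i, left in enumerate(words):
            # Pair the current word with the next window_size words.
            for right in words[i + 1:i + 1 + window_size]:
                word_count[left] += 1
                word_count[right] += 1
                pair_count[tuple(sorted((left, right)))] += 1
                total_pairs += 1
    rows = []
    for (x, y), n in sorted(pair_count.items()):
        pmi = math.log(n * total_pairs / (word_count[x] * word_count[y]))
        rows.append((x, y, word_count[x], word_count[y], n, pmi))
    return rows

rows = pmi_table(docs, window_size=2)
for row in rows:
    print(*row)
```

With these inputs the sketch counts 32 pairs in total and yields 23 rows whose counts match the table, with PMI values that agree up to floating-point rounding. For example, the row for w0 w0 gives ln(1 × 32 / (2 × 2)) = ln(8) ≈ 2.0794.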