Platform for AI: Document Similarity

Last Updated: Feb 08, 2024

Document similarity measures how similar two articles or sentences are. It is calculated in the same way as string similarity, with words as the unit of comparison. Words in a document or sentence must be separated by spaces. This topic describes how to configure the Document Similarity algorithm component provided by Platform for AI (PAI).

Background information

Document similarity is calculated in the same manner that string similarity is calculated. Document similarity supports the following calculation methods: Levenshtein Distance (Levenshtein), Longest Common SubString (LCS), String Subsequence Kernel (SSK), Cosine, and Simhash_Hamming.

  • The Levenshtein method supports the calculation of distance and similarity.

    • The distance is expressed as the levenshtein parameter.

    • The similarity is calculated by using the following formula: Similarity = 1 - Distance. The similarity is expressed as the levenshtein_sim parameter.

  • The LCS method supports the calculation of distance and similarity.

    • The distance is expressed as the lcs parameter.

    • The similarity is calculated by using the following formula: Similarity = 1 - Distance. The similarity is expressed as the lcs_sim parameter.

  • The SSK method supports similarity calculation and is expressed as the ssk parameter.

  • The Cosine method supports similarity calculation and is expressed as the cosine parameter.

  • In the Simhash_Hamming method, the SimHash algorithm maps each original document to a 64-bit binary fingerprint. The Hamming distance is the number of bit positions at which the two fingerprints differ. The Simhash_Hamming method supports distance and similarity calculation.

    • The distance is expressed as the simhash_hamming parameter.

    • The similarity is calculated by using the following formula: Similarity = 1 - Distance/64.0. The similarity is expressed as the simhash_hamming_sim parameter.
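
The Simhash_Hamming formula can be illustrated with a minimal Python sketch. It assumes an unweighted SimHash over space-separated words and an MD5-derived 64-bit word hash. This topic does not specify the hash function or the weighting that PAI uses, so the fingerprints and scores here are illustrative and will not necessarily match the component's output. Only the 64-bit fingerprint width and the formula Similarity = 1 - Distance/64.0 come from the text above.

```python
import hashlib

FINGERPRINT_BITS = 64  # fingerprint width stated in this topic


def word_hash(word: str) -> int:
    """64-bit hash of a word (MD5-based; an assumption, not PAI's hash)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(doc: str) -> int:
    """Unweighted SimHash fingerprint over space-separated words."""
    counts = [0] * FINGERPRINT_BITS
    for word in doc.split():
        h = word_hash(word)
        for bit in range(FINGERPRINT_BITS):
            # Each word votes +1 or -1 on every bit position.
            counts[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions at which two fingerprints differ."""
    return bin(a ^ b).count("1")


def simhash_hamming_sim(doc_a: str, doc_b: str) -> float:
    """Similarity = 1 - Distance/64.0, as given in this topic."""
    distance = hamming_distance(simhash(doc_a), simhash(doc_b))
    return 1 - distance / 64.0
```

Identical documents produce identical fingerprints, so their distance is 0 and their similarity is 1.0. The largest possible distance is 64, which gives a similarity of 0.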

Limits

The Document Similarity component can run only on MaxCompute computing resources.

Configure the component

You can use one of the following methods to configure the Document Similarity component.

Method 1: Configure the component in the PAI console

You can configure the parameters of the Document Similarity component on the pipeline page of Machine Learning Designer. The following table describes the parameters.

| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | First Column for Similarity Calculation | The default value is the name of the first string column in the table. |
| Fields Setting | Second Column for Similarity Calculation | The default value is the name of the second string column in the table. |
| Fields Setting | Columns Appended to Output Table | The names of the columns appended to the output table. |
| Fields Setting | Similarity Column in Output Table | The name of the similarity column in the output table. Default value: output. Note: The column name can be up to 128 characters in length and can contain letters, digits, and underscores (_). It must start with a letter. |
| Parameters Setting | Similarity Calculation Method | The method that is used for similarity calculation. Valid values: levenshtein, levenshtein_sim (default), lcs, lcs_sim, ssk, cosine, simhash_hamming, and simhash_hamming_sim. |
| Parameters Setting | Substring Length (Available in SSK and Cosine) | This parameter takes effect only when the Similarity Calculation Method parameter is set to ssk or cosine. Valid values: (0,100]. Default value: 2. |
| Parameters Setting | Matching Word Pair Weight (Available in SSK) | This parameter takes effect only when the Similarity Calculation Method parameter is set to ssk. Valid values: (0,1). Default value: 0.5. |
| Tuning | Computing Cores | The number of cores used for calculation. By default, the system determines the value. |
| Tuning | Memory Size per Core (Unit: MB) | The memory size of each core. By default, the system determines the value. |

Method 2: Configure the parameters by using PAI commands

Configure the component parameters by using PAI commands. The following section describes the parameters. You can use SQL scripts to call PAI commands. For more information, see SQL Script.

PAI -name doc_similarity    
    -project algo_public    
    -DinputTableName="pai_test_doc_similarity"    
    -DoutputTableName="pai_test_doc_similarity_output"    
    -DinputSelectedColName1="col0"    
    -DinputSelectedColName2="col1";

| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTableName | Yes | The name of the input table. | N/A |
| outputTableName | Yes | The name of the output table. | N/A |
| inputSelectedColName1 | No | The first column that is used for similarity calculation. | The name of the first string column in the table |
| inputSelectedColName2 | No | The second column that is used for similarity calculation. | The name of the second string column in the table |
| inputAppendColNames | No | The columns appended to the output table. | No column appended |
| inputTablePartitions | No | The partitions that are selected from the input table. | Full table |
| outputColName | No | The name of the similarity column in the output table. Note: The column name can be up to 128 characters in length and can contain letters, digits, and underscores (_). It must start with a letter. | output |
| method | No | The method that is used for similarity calculation. Valid values: levenshtein, levenshtein_sim, lcs, lcs_sim, ssk, cosine, simhash_hamming, and simhash_hamming_sim. | levenshtein_sim |
| lambda | No | The weight of a matched word pair. The SSK method supports this parameter. Valid values: (0,1). | 0.5 |
| k | No | The length of the substring. The SSK and Cosine methods support this parameter. Valid values: (0,100]. | 2 |
| lifecycle | No | The lifecycle of the output table. | N/A |
| coreNum | No | The number of cores that are used for calculation. | Automatically allocated |
| memSizePerCore | No | The memory size of each core. Unit: MB. | Automatically allocated |
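
The value constraints listed for method, k, lambda, and outputColName can be checked before a command is submitted. The following Python sketch encodes only the rules stated in this topic. `validate_params` and its argument names are hypothetical helpers for illustration and are not part of PAI or MaxCompute.

```python
import re

# Valid values of the method parameter, as listed in this topic.
VALID_METHODS = {
    "levenshtein", "levenshtein_sim", "lcs", "lcs_sim",
    "ssk", "cosine", "simhash_hamming", "simhash_hamming_sim",
}

# Up to 128 characters: letters, digits, underscores; starts with a letter.
COLUMN_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,127}")


def validate_params(method="levenshtein_sim", k=2, lambda_=0.5,
                    output_col_name="output"):
    """Hypothetical helper: return a list of problems (empty if none)."""
    problems = []
    if method not in VALID_METHODS:
        problems.append(f"unknown method: {method}")
    if not 0 < k <= 100:
        problems.append("k must be in (0,100]")
    if not 0 < lambda_ < 1:
        problems.append("lambda must be in (0,1)")
    if not COLUMN_NAME.fullmatch(output_col_name):
        problems.append("invalid outputColName")
    return problems


# The defaults from the table pass; an out-of-range k is reported.
print(validate_params())
print(validate_params(method="ssk", k=101))
```

The defaults used in the function signature are the default values listed in the table.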

Examples

  • Input

    Use an ODPS SQL node to create a table named pai_doc_similarity_input. For more information, see Develop a MaxCompute SQL task. Sample command:

    drop table if exists pai_doc_similarity_input;
    create table pai_doc_similarity_input as
    select * from 
    (
    select 0 as id, "Beijing Shanghai" as col0, "Beijing Shanghai" as col1 from dual
    union all
    select 1 as id, "Beijing Shanghai" as col0, "Beijing Shanghai Shenzhen" as col1 from dual
    ) tmp;

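    The command creates the following input table, pai_doc_similarity_input:

    | id | col0 | col1 |
    | --- | --- | --- |
    | 1 | Beijing Shanghai | Beijing Shanghai Shenzhen |
    | 0 | Beijing Shanghai | Beijing Shanghai |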

  • PAI command

    You can use an SQL script component or an ODPS SQL node to run the following PAI commands.

    drop table if exists pai_doc_similarity_output;
    PAI -name doc_similarity    
        -project algo_public    
        -DinputTableName=pai_doc_similarity_input    
        -DoutputTableName=pai_doc_similarity_output    
        -DinputSelectedColName1=col0    
        -DinputSelectedColName2=col1    
        -Dmethod=levenshtein_sim    
        -DinputAppendColNames=id,col0,col1;
  • Output

    The following table is the output table named pai_doc_similarity_output.

    | id | col0 | col1 | output |
    | --- | --- | --- | --- |
    | 1 | Beijing Shanghai | Beijing Shanghai Shenzhen | 0.6666666666666667 |
    | 0 | Beijing Shanghai | Beijing Shanghai | 1.0 |
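
The output values for levenshtein_sim are consistent with a word-level edit distance that is normalized by the word count of the longer document. The following Python sketch works through that reading. The normalization is an assumption inferred from the example output, because this topic states only Similarity = 1 - Distance. The function names are illustrative.

```python
def levenshtein_words(a: str, b: str) -> int:
    """Edit distance between two documents, counted in whole words."""
    x, y = a.split(), b.split()
    prev = list(range(len(y) + 1))
    for i, wx in enumerate(x, start=1):
        curr = [i]
        for j, wy in enumerate(y, start=1):
            cost = 0 if wx == wy else 1
            curr.append(min(prev[j] + 1,          # delete a word
                            curr[j - 1] + 1,      # insert a word
                            prev[j - 1] + cost))  # substitute a word
        prev = curr
    return prev[-1]


def levenshtein_sim(a: str, b: str) -> float:
    """1 - distance / longer word count (assumed normalization)."""
    longest = max(len(a.split()), len(b.split()))
    if longest == 0:
        return 1.0
    return 1 - levenshtein_words(a, b) / longest


print(levenshtein_sim("Beijing Shanghai", "Beijing Shanghai Shenzhen"))
print(levenshtein_sim("Beijing Shanghai", "Beijing Shanghai"))
```

For row 1, one word (Shenzhen) must be inserted, so the distance is 1 over 3 words and the similarity is about 0.667. For row 0, the distance is 0 and the similarity is 1.0.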

FAQ

  • Similarity calculation is based on the result of word segmentation. Words are separated by spaces. Each word serves as a unit of similarity calculation. If the input is a string as a whole, use the string similarity method.

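  • In the method parameter, levenshtein, lcs, and simhash_hamming are used to calculate the distance. levenshtein_sim, lcs_sim, ssk, cosine, and simhash_hamming_sim are used to calculate the similarity. For levenshtein and lcs, the distance is calculated by using the following formula: Distance = 1.0 - Similarity. For simhash_hamming, the relationship given in the Background information section applies: Similarity = 1 - Distance/64.0.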

  • If you set the method parameter to cosine or ssk, the k parameter is available, which indicates that k words are used as a combination for similarity calculation. If the value of k is greater than the number of words, the similarity output is 0 even if the two strings are the same. In this case, change the value of k to a value less than or equal to the minimum number of words.
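
The effect of k described in the last item can be reproduced with a minimal Python sketch. It assumes that the cosine method compares counts of k consecutive words. This topic does not describe the exact vectorization that PAI uses, so the scores here are illustrative.

```python
import math
from collections import Counter


def word_kgrams(doc: str, k: int) -> Counter:
    """Count every run of k consecutive space-separated words."""
    words = doc.split()
    return Counter(tuple(words[i:i + k]) for i in range(len(words) - k + 1))


def cosine_sim(doc_a: str, doc_b: str, k: int = 2) -> float:
    """Cosine similarity of k-word combination counts (assumed scheme)."""
    a, b = word_kgrams(doc_a, k), word_kgrams(doc_b, k)
    if not a or not b:
        # k exceeds the word count, so there is nothing to compare.
        return 0.0
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)


# Identical two-word documents: k=2 yields one shared combination,
# but k=3 yields none, so the similarity drops to 0.
print(cosine_sim("Beijing Shanghai", "Beijing Shanghai", k=2))
print(cosine_sim("Beijing Shanghai", "Beijing Shanghai", k=3))
```

With k=3, neither two-word document contains a three-word combination, so both count vectors are empty and the sketch returns 0, which matches the behavior described above.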
