This topic describes the Document Similarity component provided by Machine Learning Studio.

Document similarity builds on string similarity and calculates the similarity between articles or between sentences. Words in a document or sentence are separated by spaces, and each word is treated as one unit of calculation. The similarity is calculated the same way as string similarity. The component supports the following calculation methods: Levenshtein Distance (Levenshtein), Longest Common Substring (LCS), String Subsequence Kernel (SSK), Cosine, and Simhash_Hamming.
  • The Levenshtein method supports distance and similarity calculation.
    • The distance is expressed as the levenshtein parameter.
    • The similarity is calculated based on the following formula: Similarity = 1 - Distance. The similarity is expressed as the levenshtein_sim parameter.
  • The LCS method supports distance and similarity calculation.
    • The distance is expressed as the lcs parameter.
    • The similarity is calculated based on the following formula: Similarity = 1 - Distance. The similarity is expressed as the lcs_sim parameter.
  • The SSK method supports similarity calculation and is expressed as the ssk parameter.
  • The Cosine method supports similarity calculation and is expressed as the cosine parameter.
  • In Simhash_Hamming, the SimHash algorithm maps each original document to a 64-bit binary fingerprint. The Hamming distance is the number of bit positions at which two fingerprints differ. The Simhash_Hamming method supports distance and similarity calculation.
    • The distance is expressed as the simhash_hamming parameter.
    • The similarity is calculated based on the following formula: Similarity = 1 - Distance/64.0. The similarity is expressed as the simhash_hamming_sim parameter.
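
The distance and similarity formulas listed above can be illustrated with a minimal Python sketch. It is not the component's implementation. The function names are hypothetical, and dividing the Levenshtein distance by the larger word count is an assumption made here so that "Similarity = 1 - Distance" stays within [0, 1]. The Hamming part takes two ready-made 64-bit fingerprints as input, because the hash function that the component uses to build them is not described in this topic.

```python
def levenshtein_words(a: str, b: str) -> int:
    """Word-level edit distance; words are split on whitespace."""
    x, y = a.split(), b.split()
    prev = list(range(len(y) + 1))
    for i, wx in enumerate(x, start=1):
        curr = [i]
        for j, wy in enumerate(y, start=1):
            cost = 0 if wx == wy else 1
            curr.append(min(prev[j] + 1,          # delete wx
                            curr[j - 1] + 1,      # insert wy
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def levenshtein_sim(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / larger word count."""
    longest = max(len(a.split()), len(b.split()))
    if longest == 0:
        return 1.0  # two empty inputs are treated as identical
    return 1.0 - levenshtein_words(a, b) / longest


def hamming_sim(fp1: int, fp2: int) -> float:
    """Similarity = 1 - Distance/64.0 for two 64-bit fingerprints."""
    distance = bin((fp1 ^ fp2) & 0xFFFFFFFFFFFFFFFF).count("1")
    return 1.0 - distance / 64.0
```

Under these assumptions, two inputs that differ by one word out of three have a distance of 1, and two fingerprints that differ in two bits have a similarity of 1 - 2/64.0.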

Configure the component

You can configure the component by using one of the following methods:
  • Machine Learning Platform for AI console
    Tab | Parameter | Description
    Fields Setting | First Column for Similarity Calculation | The default value is the name of the first string column in the table.
    Fields Setting | Second Column for Similarity Calculation | The default value is the name of the second string column in the table.
    Fields Setting | Columns Appended Output Table | The names of columns appended to the output table.
    Fields Setting | Similarity Column in Output Table | The name of the similarity column in the output table. Default value: output. Note: The column name can be up to 128 characters in length and can contain letters, digits, and underscores (_). It must start with a letter.
    Parameters Setting | Similarity Calculation Method | The similarity calculation method. Valid values: levenshtein, levenshtein_sim, lcs, lcs_sim, ssk, cosine, simhash_hamming, and simhash_hamming_sim. Default value: levenshtein_sim.
    Parameters Setting | Substring Length (Available in SSK and Cosine) | This parameter appears only when the Similarity Calculation Method parameter is set to ssk or cosine. Valid values: (0,100). Default value: 2.
    Parameters Setting | Matching Word Pair Weight (Available in SSK and Cosine) | This parameter appears only when the Similarity Calculation Method parameter is set to ssk. Valid values: (0,1). Default value: 0.5.
    Tuning | Computing Cores | The number of cores used for calculation. The value is automatically allocated.
    Tuning | Memory Size per Core (Unit: MB) | The size of the memory required by each core. The value is automatically allocated.
  • PAI command
    PAI -name doc_similarity    
        -project algo_public    
        -DinputTableName="pai_test_doc_similarity"    
        -DoutputTableName="pai_test_doc_similarity_output"    
        -DinputSelectedColName1="col0"    
        -DinputSelectedColName2="col1"
    Parameter | Required | Description | Default value
    inputTableName | Yes | The name of the input table. | No default value
    outputTableName | Yes | The name of the output table. | No default value
    inputSelectedColName1 | No | The name of the first column for similarity calculation. | The name of the first string column in the table
    inputSelectedColName2 | No | The name of the second column for similarity calculation. | The name of the second string column in the table
    inputAppendColNames | No | The names of the columns appended to the output table. | No column appended by default
    inputTablePartitions | No | The partitions that are selected from the input table. | Full table
    outputColName | No | The name of the similarity column in the output table. Note: The column name can be up to 128 characters in length and can contain letters, digits, and underscores (_). It must start with a letter. | output
    method | No | The similarity calculation method. Valid values: levenshtein, levenshtein_sim, lcs, lcs_sim, ssk, cosine, simhash_hamming, and simhash_hamming_sim. | levenshtein_sim
    lambda | No | The weight of a matched word pair. The SSK method supports this parameter. Valid values: (0,1). | 0.5
    k | No | The length of the substring. The SSK and Cosine methods support this parameter. Valid values: (0,100). | 2
    lifecycle | No | The lifecycle of the output table. | No default value
    coreNum | No | The number of cores used for calculation. | Automatically allocated
    memSizePerCore | No | The size of the memory required by each core. Unit: MB. | Automatically allocated
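
When the PAI command is generated by a script, the constraints in the parameter table can be checked before the command is submitted. The following Python sketch shows one way to do that. The helper name build_doc_similarity_command and its keyword arguments are hypothetical and are not part of PAI. The sketch only assembles the command text and checks the documented value ranges; it does not submit anything.

```python
import re

# Column-name rule from the note: starts with a letter, then letters,
# digits, or underscores, at most 128 characters in total.
_COLUMN_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,127}")

_METHODS = {
    "levenshtein", "levenshtein_sim", "lcs", "lcs_sim",
    "ssk", "cosine", "simhash_hamming", "simhash_hamming_sim",
}


def build_doc_similarity_command(input_table, output_table,
                                 method="levenshtein_sim",
                                 output_col="output", k=None, lam=None):
    """Hypothetical helper: return the PAI command as a string."""
    if method not in _METHODS:
        raise ValueError(f"unknown method: {method}")
    if not _COLUMN_NAME.fullmatch(output_col):
        raise ValueError(f"invalid output column name: {output_col}")

    args = {
        "inputTableName": input_table,
        "outputTableName": output_table,
        "method": method,
        "outputColName": output_col,
    }
    if k is not None:
        # k applies to the SSK and Cosine methods; valid values: (0,100).
        if method not in ("ssk", "cosine") or not 0 < k < 100:
            raise ValueError("k requires ssk or cosine and 0 < k < 100")
        args["k"] = k
    if lam is not None:
        # lambda applies to the SSK method only; valid values: (0,1).
        if method != "ssk" or not 0 < lam < 1:
            raise ValueError("lambda requires ssk and 0 < lambda < 1")
        args["lambda"] = lam

    lines = ["PAI -name doc_similarity", "    -project algo_public"]
    lines += [f"    -D{name}={value}" for name, value in args.items()]
    return "\n".join(lines) + ";"
```

For example, build_doc_similarity_command("pai_test_doc_similarity", "pai_test_doc_similarity_output", method="cosine", k=3) returns a command that contains -Dmethod=cosine and -Dk=3, whereas passing lam together with method="cosine" raises a ValueError.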

Examples

  • Generate data.
    drop table if exists pai_doc_similarity_input;
    create table pai_doc_similarity_input as
    select * from 
    (
    select 0 as id, "Beijing Shanghai" as col0, "Beijing Shanghai" as col1 from dual
    union all
    select 1 as id, "Beijing Shanghai" as col0, "Beijing Shanghai Hong Kong (China)" as col1 from dual
    ) tmp;
    The following table is the input table pai_doc_similarity_input.
    id | col0 | col1
    1 | Beijing Shanghai | Beijing Shanghai Hong Kong (China)
    0 | Beijing Shanghai | Beijing Shanghai
  • Run the PAI command.
    drop table if exists pai_doc_similarity_output;
    PAI -name doc_similarity    
        -project algo_public    
        -DinputTableName=pai_doc_similarity_input    
        -DoutputTableName=pai_doc_similarity_output    
        -DinputSelectedColName1=col0    
        -DinputSelectedColName2=col1    
        -Dmethod=levenshtein_sim    
        -DinputAppendColNames=id,col0,col1;
  • Generate the output.
    The following table is the output table pai_doc_similarity_output.
    id | col0 | col1 | output
    1 | Beijing Shanghai | Beijing Shanghai Hong Kong (China) | 0.6666666666666667
    0 | Beijing Shanghai | Beijing Shanghai | 1.0

FAQ

  • Similarity calculation is based on the result of word segmentation. Words are separated by spaces, and each word serves as a unit of similarity calculation. If the input is a string as a whole, use the string similarity method.
  • In the method parameter, levenshtein, lcs, and simhash_hamming calculate the distance, whereas levenshtein_sim, lcs_sim, ssk, cosine, and simhash_hamming_sim calculate the similarity. For levenshtein and lcs, the distance is calculated based on the following formula: Distance = 1.0 - Similarity. For simhash_hamming, the relationship is Similarity = 1 - Distance/64.0.
  • When the method parameter is set to cosine or ssk, the k parameter is available. It specifies that k words are used as a combination for similarity calculation. If the value of k is greater than the number of words, the similarity output is 0 even if the two strings are the same. In this case, change the value of k to a value less than or equal to the minimum number of words.
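
The effect of k described in the last item can be reproduced with a small Python sketch. It assumes that a combination is a run of k adjacent words and that the cosine is taken over the counts of those combinations. This topic does not confirm either assumption, so the code is an illustration of the behavior rather than the component's algorithm. The function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def word_kgrams(text: str, k: int) -> Counter:
    """Count every run of k adjacent words; empty if k > word count."""
    words = text.split()
    return Counter(tuple(words[i:i + k]) for i in range(len(words) - k + 1))


def cosine_kgram_sim(a: str, b: str, k: int = 2) -> float:
    """Cosine similarity over k-word combination counts (assumed model)."""
    va, vb = word_kgrams(a, k), word_kgrams(b, k)
    if not va or not vb:
        # No combination can be formed, so the similarity falls back to 0.
        return 0.0
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = sqrt(sum(c * c for c in va.values()))
    norm_b = sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b)
```

With two identical two-word inputs, k=2 yields one shared combination and a similarity of 1.0. With k=3, no combination can be formed from two words, so the sketch returns 0.0 although the inputs are the same.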