This topic describes the String Similarity component provided by Machine Learning Studio.

String similarity calculation is a basic operation in machine learning. It is typically used in fields such as information retrieval, natural language processing, and bioinformatics. The component supports five calculation methods: Levenshtein (Levenshtein distance), Longest Common Substring (LCS), String Subsequence Kernel (SSK), Cosine, and SimHash_Hamming. The strings to compare are stored in two columns of the input table, and the component calculates the similarity between the value in one column and the value in the other.
  • Levenshtein supports the calculation of distance and similarity.
    • Distance is returned when the method is set to levenshtein.
    • Similarity = 1 - Distance. Similarity is returned when the method is set to levenshtein_sim.
  • LCS supports the calculation of distance and similarity.
    • Distance is returned when the method is set to lcs.
    • Similarity = 1 - Distance. Similarity is returned when the method is set to lcs_sim.
  • SSK supports the calculation of similarity, which is returned when the method is set to ssk.
  • Cosine supports the calculation of similarity, which is returned when the method is set to cosine.
  • In the SimHash_Hamming method, the SimHash algorithm maps each original text to a 64-bit binary fingerprint. The Hamming distance is the number of bit positions at which the two fingerprints differ, and the similarity is derived from that distance.
    • Distance is returned when the method is set to simhash_hamming.
    • Similarity = 1 - Distance/64.0. Similarity is returned when the method is set to simhash_hamming_sim.
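
To make two of the methods in the list above concrete, the following Python sketch computes a raw Levenshtein edit distance and a SimHash_Hamming similarity. It is an illustration only, not the component's implementation: the function names are hypothetical, and the use of character k-grams and of BLAKE2b as the per-gram hash are assumptions that this topic does not specify. Only the formula Similarity = 1 - Distance/64.0 is taken from the description above.

```python
import hashlib


def levenshtein_distance(a: str, b: str) -> int:
    """Raw edit distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def simhash64(text: str, k: int = 2) -> int:
    """64-bit SimHash over character k-grams (tokenization and hash are assumptions)."""
    grams = [text[i:i + k] for i in range(max(len(text) - k + 1, 1))]
    votes = [0] * 64
    for gram in grams:
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(64):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    # A fingerprint bit is 1 when more grams voted 1 than 0 at that position.
    return sum(1 << bit for bit in range(64) if votes[bit] > 0)


def hamming_distance(x: int, y: int) -> int:
    """Number of bit positions at which two fingerprints differ."""
    return bin(x ^ y).count("1")


def simhash_hamming_sim(x: int, y: int) -> float:
    """Similarity = 1 - Distance/64.0, as stated for simhash_hamming_sim."""
    return 1 - hamming_distance(x, y) / 64.0
```

For example, `simhash_hamming_sim(simhash64("hello world"), simhash64("hello word"))` returns a value between 0 and 1, and identical inputs always yield 1.0. The exact values produced by the component may differ because its tokenization and hash function are not documented here.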

You can configure the component by using the Machine Learning Platform for AI console or a PAI command.

Configure the component

  • Machine Learning Platform for AI console
    Tab | Parameter | Description
    Fields Setting | Columns Appended to Output Table | The columns appended to the specified output table.
    Fields Setting | First Column for Similarity Calculation | The default value is the first STRING column in the input table.
    Fields Setting | Second Column for Similarity Calculation | The default value is the second STRING column in the input table.
    Fields Setting | Similarity Columns in Output Table | The similarity column in the specified output table.
    Parameters Setting | Similarity Calculation Method | The similarity calculation method. Valid values: levenshtein, levenshtein_sim, lcs, lcs_sim, ssk, cosine, simhash_hamming, and simhash_hamming_sim. Default value: levenshtein_sim.
    Parameters Setting | Substring Length | This parameter is required only when Similarity Calculation Method is set to ssk, cosine, simhash_hamming, or simhash_hamming_sim. Valid values: (0,100). Default value: 2.
    Parameters Setting | Weight of Matching String | This parameter is required only when Similarity Calculation Method is set to ssk. Valid values: (0,1). Default value: 0.5.
    Tuning | Computing Cores | Automatically allocated.
    Tuning | Memory Size per Core (Unit: MB) | Automatically allocated.
  • PAI command
    PAI -name string_similarity
        -project algo_public
        -DinputTableName="pai_test_string_similarity"
        -DoutputTableName="pai_test_string_similarity_output"
        -DinputSelectedColName1="col0"
        -DinputSelectedColName2="col1";
    Parameter | Required | Description | Default value
    inputTableName | Yes | The name of the input table. | No default value
    outputTableName | Yes | The name of the output table. | No default value
    inputSelectedColName1 | No | The first column for similarity calculation. | The first STRING column in the input table
    inputSelectedColName2 | No | The second column for similarity calculation. | The second STRING column in the input table
    inputAppendColNames | No | The columns appended to the output table. | No default value
    inputTablePartitions | No | The partitions in the input table. | Full table
    outputColName | No | The name of the similarity column in the output table. A column name cannot contain special characters. It can contain only letters, digits, or underscores (_). A name must start with a letter and can be up to 128 bytes in length. | output
    method | No | The similarity calculation method. Valid values: levenshtein, levenshtein_sim, lcs, lcs_sim, ssk, cosine, simhash_hamming, and simhash_hamming_sim. | levenshtein_sim
    lambda | No | This parameter is required only when Method is set to ssk. Valid values: (0,1). | 0.5
    k | No | This parameter is required only when Method is set to ssk, cosine, simhash_hamming, or simhash_hamming_sim. Valid values: (0,100). | 2
    lifecycle | No | The lifecycle of the output table. The value must be a positive integer. | No default value
    coreNum | No | The number of cores involved in computing. | Automatically allocated
    memSizePerCore | No | The memory for each core. | Automatically allocated
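
    The parameter rules above can be checked before a command is submitted. The following Python sketch assembles the PAI command text and enforces the documented ranges for k and lambda. The helper name build_string_similarity_command is hypothetical, and the -Dmethod, -Dk, and -Dlambda flags are assumed to follow the same -D<parameter> pattern as the flags in the sample command; quoting the numeric values is likewise an assumption.

```python
# Methods for which the k parameter applies, per the parameter table.
K_METHODS = {"ssk", "cosine", "simhash_hamming", "simhash_hamming_sim"}
METHODS = {"levenshtein", "levenshtein_sim", "lcs", "lcs_sim"} | K_METHODS


def build_string_similarity_command(input_table, output_table,
                                    method="levenshtein_sim",
                                    col1=None, col2=None, k=2, lam=0.5):
    """Return PAI command text (hypothetical helper; it does not run the command)."""
    if method not in METHODS:
        raise ValueError(f"unsupported method: {method}")
    args = [
        "PAI -name string_similarity",
        "-project algo_public",
        f'-DinputTableName="{input_table}"',
        f'-DoutputTableName="{output_table}"',
    ]
    # Column names are optional; the component falls back to its defaults.
    if col1:
        args.append(f'-DinputSelectedColName1="{col1}"')
    if col2:
        args.append(f'-DinputSelectedColName2="{col2}"')
    args.append(f'-Dmethod="{method}"')
    if method in K_METHODS:
        if not 0 < k < 100:  # valid values: (0,100)
            raise ValueError("k must be in the open interval (0,100)")
        args.append(f'-Dk="{k}"')
    if method == "ssk":
        if not 0 < lam < 1:  # valid values: (0,1)
            raise ValueError("lambda must be in the open interval (0,1)")
        args.append(f'-Dlambda="{lam}"')
    return "\n    ".join(args) + ";"


if __name__ == "__main__":
    print(build_string_similarity_command(
        "pai_test_string_similarity",
        "pai_test_string_similarity_output",
        method="ssk", col1="col0", col2="col1"))
```

    The helper emits k and lambda only for the methods that use them, so a command for levenshtein_sim stays as short as the sample command.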