This topic describes the String Similarity - Top N component that is provided by Machine Learning Studio.

The String Similarity - Top N component calculates the similarity between the strings in the input table and the strings in the mapping table, and returns the top N mapping-table records that best match each input row.
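The idea can be illustrated with a minimal Python sketch. It is not the component's implementation. It assumes that levenshtein_sim is defined as 1 minus the edit distance divided by the length of the longer string, which this topic does not state. The function names (levenshtein_distance, levenshtein_sim, top_n_matches) are hypothetical and exist only for this example.

```python
import heapq


def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) by dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_sim(a: str, b: str) -> float:
    """Assumed definition: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


def top_n_matches(left_rows, map_rows, n=10):
    """For each left-table string, keep the n most similar mapping-table strings."""
    matches = []
    for left in left_rows:
        scored = ((levenshtein_sim(left, candidate), candidate)
                  for candidate in map_rows)
        for score, candidate in heapq.nlargest(n, scored, key=lambda pair: pair[0]):
            matches.append((left, candidate, score))
    return matches
```

Each left-table string is compared with every mapping-table string, so the cost grows with the product of the two table sizes.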

You can configure the component by using the Machine Learning Platform for AI console or running commands.

Configure the component

  • Use the Machine Learning Platform for AI console
    | Tab | Parameter | Description |
    | --- | --- | --- |
    | Fields Setting | Columns Appended to Output Table from Input Table | The names of the input-table columns to append to the output table. |
    | Fields Setting | Columns Appended to Output Table from Mapping Table | The names of the mapping-table columns to append to the output table. |
    | Fields Setting | Columns from Left Table for Similarity Calculation | The names of the left-table columns that are used for similarity calculation. |
    | Fields Setting | Columns from Mapping Table for Similarity Calculation | The names of the mapping-table columns that are used for similarity calculation. The similarity between each row in the left table and every string in the mapping table is calculated, and the top N results are returned. |
    | Fields Setting | Similarity Columns in Output Table | The name of the similarity column in the output table. The name can contain only letters, digits, and underscores (_). It must start with a letter and can be up to 128 bytes in length. Default value: output. |
    | Parameters Setting | Number of Highest Similarity Scores | The value of N, that is, the number of highest-similarity results to return. The value must be a positive integer. Default value: 10. |
    | Parameters Setting | Similarity Calculation Method | The method that is used for similarity calculation. Valid values: levenshtein_sim, lcs_sim, ssk, cosine, and simhash_hamming_sim. Default value: levenshtein_sim. |
    | Parameters Setting | Substring Length | The substring length. This parameter applies only when the Similarity Calculation Method parameter is set to ssk, cosine, or simhash_hamming_sim. Valid values: (0,100). Default value: 2. |
    | Parameters Setting | Weight of Matching String | The weight of the matching string. This parameter applies only when the Similarity Calculation Method parameter is set to ssk. Valid values: (0,1). Default value: 0.5. |
    | Tuning | Computing Cores | The number of cores that are used for calculation. By default, the cores are allocated by the system. |
    | Tuning | Memory Size per Core (Unit: MB) | The memory size of each core, in MB. By default, the memory is allocated by the system. |
  • Use commands
    PAI -name string_similarity_topn
        -project algo_public
        -DinputTableName="pai_test_string_similarity_topn"
        -DoutputTableName="pai_test_string_similarity_topn_output"
        -DmapTableName="pai_test_string_similarity_map_topn"
        -DinputSelectedColName="col0"
        -DmapSelectedColName="col1";
    | Parameter | Required | Description | Default value |
    | --- | --- | --- | --- |
    | inputTableName | Yes | The name of the input table. | N/A |
    | mapTableName | Yes | The name of the mapping table. | N/A |
    | outputTableName | Yes | The name of the output table. | N/A |
    | inputSelectedColName1 | No | The names of the left-table columns that are used for similarity calculation. | The name of the first column of the STRING type in the left table |
    | inputSelectedColName2 | No | The names of the mapping-table columns that are used for similarity calculation. | The name of the first column of the STRING type in the mapping table |
    | inputAppendColNames | No | The names of the input-table columns to append to the output table. | N/A |
    | inputAppendRenameColNames | No | The aliases of the input-table columns that are appended to the output table. | N/A |
    | mapSelectedColName | Yes | The names of the mapping-table columns that are used for similarity calculation. | N/A |
    | mapAppendColNames | No | The names of the mapping-table columns to append to the output table. | N/A |
    | mapAppendRenameColNames | No | The aliases of the mapping-table columns that are appended to the output table. | N/A |
    | inputTablePartitions | No | The names of the partitions in the input table. | All partitions |
    | mapTablePartitions | No | The names of the partitions in the mapping table. | All partitions |
    | outputColName | No | The name of the similarity column in the output table. The name can contain only letters, digits, and underscores (_). It must start with a letter and can be up to 128 bytes in length. | output |
    | method | No | The method that is used for similarity calculation. Valid values: levenshtein_sim, lcs_sim, ssk, cosine, and simhash_hamming_sim. | levenshtein_sim |
    | lambda | No | The weight of the matching string. This parameter applies only when method is set to ssk. Valid values: (0,1). | 0.5 |
    | k | No | The substring length. This parameter applies only when method is set to ssk, cosine, or simhash_hamming_sim. Valid values: (0,100). | 2 |
    | lifecycle | No | The lifecycle of the output table. The value must be a positive integer. | N/A |
    | coreNum | No | The number of cores that are used for calculation. | Allocated by the system |
    | memSizePerCore | No | The memory size of each core, in MB. | Allocated by the system |
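
The constraints listed for method, k, lambda, and outputColName can be checked before a command is submitted. The following Python sketch shows one way to do that. The check_params helper is hypothetical and is not part of PAI. It assumes that the ranges (0,100) and (0,1) exclude both endpoints and that "letters" means ASCII letters.

```python
import re

# Method names come from the parameter table; the helper itself is illustrative.
METHODS = {"levenshtein_sim", "lcs_sim", "ssk", "cosine", "simhash_hamming_sim"}
METHODS_USING_K = {"ssk", "cosine", "simhash_hamming_sim"}

# Assumption: "letters" means ASCII letters. The name must start with a letter.
COLUMN_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def check_params(method="levenshtein_sim", k=2, lambda_=0.5,
                 output_col_name="output"):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if method not in METHODS:
        problems.append(f"unknown method: {method!r}")
    # k is only checked for the methods that use it.
    if method in METHODS_USING_K and not 0 < k < 100:
        problems.append("k must be in (0,100)")
    # lambda is only checked for ssk.
    if method == "ssk" and not 0 < lambda_ < 1:
        problems.append("lambda must be in (0,1)")
    name_ok = COLUMN_NAME.fullmatch(output_col_name) is not None
    if not name_ok or len(output_col_name.encode("utf-8")) > 128:
        problems.append("outputColName is not a valid column name")
    return problems
```

For example, check_params(method="ssk", lambda_=1.0) reports a problem with lambda, while the same lambda value is ignored when method is levenshtein_sim.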