Platform for AI: String Similarity

Last Updated: Feb 08, 2024

String similarity calculation is a basic machine learning operation that is commonly used in fields such as information retrieval, natural language processing, and bioinformatics. This topic describes how to configure the String Similarity algorithm component in Platform for AI (PAI).

Background information

The component supports five calculation methods: Levenshtein distance (Levenshtein), longest common substring (LCS), string subsequence kernel (SSK), cosine similarity (Cosine), and SimHash_Hamming. The two strings to compare are read from two columns of the same input table, and the result is written to a similarity column in the output table.

  • The Levenshtein method supports distance and similarity calculation.

    • To output the distance, set the calculation method to levenshtein.

    • The similarity is calculated by using the following formula: Similarity = 1 - Distance. To output the similarity, set the calculation method to levenshtein_sim.

  • The LCS method supports distance and similarity calculation.

    • To output the distance, set the calculation method to lcs.

    • The similarity is calculated by using the following formula: Similarity = 1 - Distance. To output the similarity, set the calculation method to lcs_sim.

  • The SSK method supports similarity calculation. To use it, set the calculation method to ssk.

  • The Cosine method supports similarity calculation. To use it, set the calculation method to cosine.

  • In SimHash_Hamming, the SimHash algorithm maps each original document to a 64-bit binary fingerprint. The Hamming distance is the number of bit positions at which the two fingerprints differ. The SimHash_Hamming method supports distance and similarity calculation.

    • To output the distance, set the calculation method to simhash_hamming.

    • The similarity is calculated by using the following formula: Similarity = 1 - Distance/64.0. To output the similarity, set the calculation method to simhash_hamming_sim.
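
Two of the measures above can be illustrated with a short Python sketch. This is not PAI's implementation: the function names are hypothetical, the fingerprints are passed in as plain integers instead of being produced by SimHash, and only the formula Similarity = 1 - Distance/64.0 is taken from this topic.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def hamming_distance(fp1: int, fp2: int) -> int:
    """Number of bit positions at which two 64-bit fingerprints differ."""
    mask = (1 << 64) - 1
    return bin((fp1 ^ fp2) & mask).count("1")


def simhash_hamming_sim(fp1: int, fp2: int) -> float:
    """Similarity = 1 - Distance/64.0, as stated in this topic."""
    return 1 - hamming_distance(fp1, fp2) / 64.0


print(levenshtein_distance("kitten", "sitting"))  # 3
print(hamming_distance(0b1011, 0b0010))           # 2
print(simhash_hamming_sim(0b1011, 0b0010))        # 0.96875
```

The Levenshtein function keeps only two rows of the dynamic programming table, so memory use grows with the length of one string rather than with the product of both lengths.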

Configure the component

You can use one of the following methods to configure the parameters of the String Similarity component.

Method 1: Configure the component in the PAI console

You can configure the parameters of the String Similarity component in Machine Learning Designer. The following table describes the parameters.

| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Columns Appended Output Table | The columns appended to the specified output table. |
| Fields Setting | First Column for Similarity Calculation | The default value is the first STRING column in the input table. |
| Fields Setting | Second Column for Similarity Calculation | The default value is the second STRING column in the input table. |
| Fields Setting | Similarity Columns in Output Table | The similarity column in the specified output table. |
| Parameters Setting | Similarity Calculation Method | The method that is used for similarity calculation. Valid values: levenshtein, levenshtein_sim, lcs, lcs_sim, ssk, cosine, simhash_hamming, and simhash_hamming_sim. Default value: levenshtein_sim. |
| Parameters Setting | Substring Length | This parameter is required only when the Similarity Calculation Method parameter is set to ssk, cosine, simhash_hamming, or simhash_hamming_sim. Valid values: (0,100). Default value: 2. |
| Parameters Setting | Weight of Matching String | This parameter is required only when the Similarity Calculation Method parameter is set to ssk, cosine, or simhash_hamming_sim. Valid values: (0,1). Default value: 0.5. |
| Execution Tuning | Number of Computing Cores | The number of computing cores. By default, the system determines the value. |
| Execution Tuning | Memory Size per Core (MB) | The memory size of each core. By default, the system determines the value. |
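
This topic does not state how the Substring Length parameter enters the cosine calculation. One common reading, shown here purely as an assumption and not as PAI's implementation, is that each string is turned into a vector of counts of its length-k substrings and the cosine of the two vectors is returned. The function names `kgram_counts` and `kgram_cosine` are hypothetical.

```python
import math
from collections import Counter


def kgram_counts(s: str, k: int) -> Counter:
    """Count every contiguous substring of length k in s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def kgram_cosine(a: str, b: str, k: int = 2) -> float:
    """Cosine similarity of the k-gram count vectors of a and b.

    Assumption: this is one plausible use of the substring length,
    not a description of what the PAI component computes.
    """
    va, vb = kgram_counts(a, k), kgram_counts(b, k)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm = math.sqrt(sum(c * c for c in va.values())
                     * sum(c * c for c in vb.values()))
    # A string shorter than k has no k-grams, so the similarity is 0.
    return dot / norm if norm else 0.0


print(kgram_cosine("abcd", "abcd"))  # 1.0
print(kgram_cosine("abcd", "wxyz"))  # 0.0
```

Under this reading, a larger k makes the comparison stricter, because longer substrings must match exactly before they contribute to the score.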

Method 2: Configure the component by using PAI commands

You can configure the component by running a PAI command. You can use SQL scripts to call PAI commands. For more information, see SQL Script. The following sample command is followed by a description of the parameters.

PAI -name string_similarity
    -project algo_public
    -DinputTableName="pai_test_string_similarity"
    -DoutputTableName="pai_test_string_similarity_output"
    -DinputSelectedColName1="col0"
    -DinputSelectedColName2="col1";

| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTableName | Yes | The name of the input table. | N/A |
| outputTableName | Yes | The name of the output table. | N/A |
| inputSelectedColName1 | No | The first column for similarity calculation. | The first STRING column in the input table |
| inputSelectedColName2 | No | The second column for similarity calculation. | The second STRING column in the input table |
| inputAppendColNames | No | The columns appended to the output table. | N/A |
| inputTablePartitions | No | The partitions in the input table. | All partitions |
| outputColName | No | The name of the similarity column in the output table. The name can contain only letters, digits, and underscores (_), must start with a letter, and can be up to 128 bytes in length. | output |
| method | No | The method that is used for similarity calculation. Valid values: levenshtein, levenshtein_sim, lcs, lcs_sim, ssk, cosine, simhash_hamming, and simhash_hamming_sim. | levenshtein_sim |
| lambda | No | The weight of the matching string. This parameter is required only when the method parameter is set to ssk. Valid values: (0,1). | 0.5 |
| k | No | The substring length. This parameter is required only when the method parameter is set to ssk, cosine, simhash_hamming, or simhash_hamming_sim. Valid values: (0,100). | 2 |
| lifecycle | No | The lifecycle of the output table. The value must be a positive integer. | N/A |
| coreNum | No | The number of cores that are used in computing. | Automatically allocated |
| memSizePerCore | No | The memory size of each core, in MB. | Automatically allocated |
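
When commands are generated by a script, the constraints described above can be checked before the command is submitted. The following Python sketch is a hypothetical helper, not part of PAI. It assumes that the valid ranges (0,100) and (0,1) are open intervals and that every value, including numbers, can be passed in double quotes as in the sample command.

```python
import re

VALID_METHODS = {
    "levenshtein", "levenshtein_sim", "lcs", "lcs_sim",
    "ssk", "cosine", "simhash_hamming", "simhash_hamming_sim",
}


def build_string_similarity_command(input_table, output_table, *, col1=None,
                                    col2=None, method=None, k=None, lam=None,
                                    output_col=None):
    """Return a PAI command string after checking the documented constraints."""
    if method is not None and method not in VALID_METHODS:
        raise ValueError(f"unsupported method: {method!r}")
    if k is not None and not 0 < k < 100:
        raise ValueError("k must lie in the open interval (0, 100)")
    if lam is not None and not 0 < lam < 1:
        raise ValueError("lambda must lie in the open interval (0, 1)")
    if output_col is not None:
        # Letters, digits, underscores; starts with a letter; at most 128 bytes.
        well_formed = re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", output_col)
        if not well_formed or len(output_col.encode("utf-8")) > 128:
            raise ValueError(f"invalid outputColName: {output_col!r}")

    options = [
        ("inputTableName", input_table),
        ("outputTableName", output_table),
        ("inputSelectedColName1", col1),
        ("inputSelectedColName2", col2),
        ("outputColName", output_col),
        ("method", method),
        ("lambda", lam),
        ("k", k),
    ]
    lines = ["PAI -name string_similarity", "    -project algo_public"]
    # Omitted options fall back to the defaults listed in the table.
    lines += [f'    -D{name}="{value}"'
              for name, value in options if value is not None]
    return "\n".join(lines) + ";"


print(build_string_similarity_command(
    "pai_test_string_similarity", "pai_test_string_similarity_output",
    col1="col0", col2="col1"))
```

Called with only the table and column names, the helper reproduces the sample command shown earlier in this section. Passing an unsupported method or an out-of-range k raises a ValueError instead of producing a command.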

References

  • For information about Machine Learning Designer, see Overview of Machine Learning Designer.

  • You can also use the String Similarity - top N component to calculate string similarity and obtain the top N data records that best match the mapping table. For information about how to use this component, see String Similarity - top N.