String similarity calculation is a fundamental operation in machine learning. It assesses the similarity or difference between two strings. This calculation is widely used in fields such as information retrieval, natural language processing, and bioinformatics. It uses different algorithms and metrics, such as Levenshtein Distance and Cosine Similarity, to identify, match, or cluster similar text data.
Algorithm description
The String Similarity component supports five similarity calculation methods: Levenshtein (Levenshtein Distance), LCS (Longest Common Substring), SSK (String Subsequence Kernel), Cosine, and Simhash_Hamming. The component supports pairwise calculation.
- The Levenshtein method supports distance and similarity calculations.
  - Distance is selected with the levenshtein value of the method parameter.
  - Similarity = 1 - Distance. Similarity is selected with the levenshtein_sim value.
- The LCS method supports distance and similarity calculations.
  - Distance is selected with the lcs value of the method parameter.
  - Similarity = 1 - Distance. Similarity is selected with the lcs_sim value.
- The SSK method supports similarity calculation. It is selected with the ssk value.
- The Cosine method supports similarity calculation. It is selected with the cosine value.
- The Simhash_Hamming method uses the SimHash algorithm to map the original text to a 64-bit binary fingerprint. The Hamming distance then counts the bit positions at which the two fingerprints differ. This method supports both distance and similarity calculations.
  - Distance is selected with the simhash_hamming value of the method parameter.
  - Similarity = 1 - Distance/64.0. Similarity is selected with the simhash_hamming_sim value.
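The distance and similarity definitions above can be illustrated with a short Python sketch. It is not the component's implementation. The function names are hypothetical, and the SimHash fingerprints are taken as ready-made 64-bit integers because the hashing and tokenization used by the component are not described here. The Levenshtein similarity assumes that the distance is normalized by the length of the longer string so that `1 - Distance` stays in [0, 1]; the normalization actually used by the component may differ.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    # prev[j] is the distance between the previous prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep a match
            ))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    # ASSUMPTION: the distance is normalized by the longer string's length.
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


def hamming_distance_64(fp_a: int, fp_b: int) -> int:
    """Number of bit positions at which two 64-bit fingerprints differ."""
    mask = (1 << 64) - 1
    return bin((fp_a ^ fp_b) & mask).count("1")


def simhash_hamming_similarity(fp_a: int, fp_b: int) -> float:
    # Similarity = 1 - Distance/64.0, as stated in the list above.
    return 1.0 - hamming_distance_64(fp_a, fp_b) / 64.0
```

For example, `levenshtein_distance("kitten", "sitting")` evaluates to 3, and two fingerprints that differ in 2 of their 64 bits have a Simhash_Hamming similarity of 1 - 2/64.0 = 0.96875.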
Component configuration
Method 1: Use the GUI
Add the String Similarity component to the Designer workflow. Then, configure the parameters in the right-side pane.
| Parameter type | Parameter | Description |
| --- | --- | --- |
| Fields setting | Columns to append to output table | The columns to append to the output table. |
| Fields setting | First column for similarity calculation | The default value is the name of the first column of the STRING type in the table. |
| Fields setting | Second column for similarity calculation | The default value is the name of the second column of the STRING type in the table. |
| Fields setting | Similarity column in output table | The name of the similarity column in the output table. |
| Parameters setting | Similarity calculation method | The method for similarity calculation. Valid values: levenshtein, levenshtein_sim, lcs, lcs_sim, ssk, cosine, simhash_hamming, and simhash_hamming_sim. Default value: levenshtein_sim. |
| Parameters setting | Substring length | This parameter is required only when the Similarity calculation method parameter is set to ssk, cosine, simhash_hamming, or simhash_hamming_sim. Valid values: (0,100). Default value: 2. |
| Parameters setting | Weight of matching string | This parameter is required only when the Similarity calculation method parameter is set to ssk, simhash_hamming, or simhash_hamming_sim. Valid values: (0,1). Default value: 0.5. |
| Execution tuning | Number of cores for computing | By default, the value is assigned by the system. |
| Execution tuning | Memory size per core (MB) | By default, the value is assigned by the system. |
Method 2: Use PAI commands
You can configure the String Similarity component by using PAI commands, which you can run from the SQL Script component. For more information, see SQL Script.
PAI -name string_similarity
-project algo_public
-DinputTableName="pai_test_string_similarity"
-DoutputTableName="pai_test_string_similarity_output"
-DinputSelectedColName1="col0"
-DinputSelectedColName2="col1";
| Parameter | Required | Default value | Description |
| --- | --- | --- | --- |
| inputTableName | Yes | None | The name of the input table. |
| outputTableName | Yes | None | The name of the output table. |
| inputSelectedColName1 | No | The name of the first column of the STRING type in the table | The name of the first column for the similarity calculation. |
| inputSelectedColName2 | No | The name of the second column of the STRING type in the table | The name of the second column for the similarity calculation. |
| inputAppendColNames | No | None | The columns to append to the output table. |
| inputTablePartitions | No | All partitions | The partitions of the input table. |
| outputColName | No | output | The name of the similarity column in the output table. The name can contain only letters (a-z, A-Z), digits, and underscores (_). It must start with a letter and be no more than 128 bytes in length. |
| method | No | levenshtein_sim | The method for similarity calculation. Valid values: levenshtein, levenshtein_sim, lcs, lcs_sim, ssk, cosine, simhash_hamming, and simhash_hamming_sim. |
| lambda | No | 0.5 | The weight of the matching string. This parameter is required only when the method parameter is set to ssk. Valid values: (0,1). |
| k | No | 2 | The substring length. This parameter is required only when the method parameter is set to ssk, cosine, simhash_hamming, or simhash_hamming_sim. Valid values: (0,100). |
| lifecycle | No | None | The lifecycle of the output table. The value must be a positive integer. |
| coreNum | No | Assigned by the system | The number of cores for computing. |
| memSizePerCore | No | Assigned by the system | The memory size per core. Unit: MB. |
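The constraints in the parameter table can be checked before a command is submitted. The following Python sketch is only an illustration: `build_string_similarity_command` is a hypothetical helper, not part of PAI. It assembles the command text from the documented parameter names and validates the value ranges listed above. It assumes that every value can be passed as a quoted string, and it follows the PAI command table in emitting lambda only for ssk.

```python
import re

# Method values named in the algorithm description.
VALID_METHODS = {
    "levenshtein", "levenshtein_sim", "lcs", "lcs_sim",
    "ssk", "cosine", "simhash_hamming", "simhash_hamming_sim",
}
# Methods for which the table says k and lambda apply.
K_METHODS = {"ssk", "cosine", "simhash_hamming", "simhash_hamming_sim"}
LAMBDA_METHODS = {"ssk"}


def build_string_similarity_command(input_table, output_table,
                                    col1=None, col2=None,
                                    method="levenshtein_sim",
                                    output_col="output", k=2, lam=0.5):
    """Hypothetical helper: validate parameters and render the PAI command."""
    if method not in VALID_METHODS:
        raise ValueError(f"unknown method: {method!r}")
    # outputColName: letters, digits, underscores; starts with a letter;
    # at most 128 bytes.
    if (re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", output_col) is None
            or len(output_col.encode("utf-8")) > 128):
        raise ValueError(f"invalid outputColName: {output_col!r}")

    params = {"inputTableName": input_table, "outputTableName": output_table}
    if col1 is not None:
        params["inputSelectedColName1"] = col1
    if col2 is not None:
        params["inputSelectedColName2"] = col2
    params["outputColName"] = output_col
    params["method"] = method
    if method in K_METHODS:
        if not 0 < k < 100:  # open interval (0,100)
            raise ValueError("k must be in (0,100)")
        params["k"] = k
    if method in LAMBDA_METHODS:
        if not 0 < lam < 1:  # open interval (0,1)
            raise ValueError("lambda must be in (0,1)")
        params["lambda"] = lam

    lines = ["PAI -name string_similarity", "    -project algo_public"]
    lines += [f'    -D{name}="{value}"' for name, value in params.items()]
    return "\n".join(lines) + ";"
```

Calling the helper with `method="ssk"`, `k=3`, and `lam=0.4` yields a command that includes `-Dmethod="ssk"`, `-Dk="3"`, and `-Dlambda="0.4"`. With the default method, the k and lambda arguments are omitted.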
References
- For more information about Designer, see Designer overview.
- You can also use the String Similarity-Top N component to calculate string similarity and retrieve the top N most similar data records. For more information about this component, see String Similarity-Top N.