String Similarity

String similarity calculation is a fundamental operation in machine learning. It assesses the similarity or difference between two strings. This calculation is widely used in fields such as information retrieval, natural language processing, and bioinformatics. It uses different algorithms and metrics, such as Levenshtein Distance and Cosine Similarity, to identify, match, or cluster similar text data.

Algorithm description

The String Similarity component supports five similarity calculation methods: Levenshtein (Levenshtein Distance), LCS (Longest Common Substring), SSK (String Subsequence Kernel), Cosine, and Simhash_Hamming. The component supports pairwise calculation.

The Levenshtein method supports distance and similarity calculations.
- Distance is represented by the levenshtein parameter.
- Similarity = 1 - Distance. Similarity is represented by the levenshtein_sim parameter.
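As a rough illustration of the Levenshtein method, the following Python sketch computes the edit distance with the standard dynamic-programming recurrence. The function names are hypothetical. Dividing the distance by the length of the longer string before applying Similarity = 1 - Distance is an assumption made here so that the result falls in [0, 1]; the component's exact normalization is not stated in this topic.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Similarity = 1 - normalized distance.

    Assumption: the distance is normalized by the longer string length.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest
```

For example, `levenshtein_distance("kitten", "sitting")` is 3, so under the normalization assumed above the similarity is 1 - 3/7.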
The LCS method supports distance and similarity calculations.
- Distance is represented by the lcs parameter.
- Similarity = 1 - Distance. Similarity is represented by the lcs_sim parameter.
The SSK method supports similarity calculation. It is represented by the ssk parameter.
The Cosine method supports similarity calculation. It is represented by the cosine parameter.
The Simhash_Hamming method uses the SimHash algorithm to map the original text to a 64-bit binary fingerprint. The Hamming distance between two fingerprints, which is the number of bit positions at which they differ, is then calculated. This method supports both distance and similarity calculations.
- Distance is represented by the simhash_hamming parameter.
- Similarity = 1 - Distance/64.0. Similarity is represented by the simhash_hamming_sim parameter.
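The relationship between the fingerprint, the Hamming distance, and the similarity can be sketched in Python as follows. The `toy_simhash` function is a hypothetical stand-in: hashing each k-character substring with MD5 and taking a per-bit majority vote is an assumption for illustration, and the component's actual hash function and weighting are not described in this topic. Only the final formula, Similarity = 1 - Distance/64.0, comes from the description above.

```python
import hashlib

MASK_64 = (1 << 64) - 1  # fingerprints are 64 bits wide


def toy_simhash(text: str, k: int = 2) -> int:
    """Hypothetical 64-bit fingerprint built from k-character substrings."""
    votes = [0] * 64
    # If the text is shorter than k, use the whole text as a single substring.
    substrings = [text[i:i + k] for i in range(max(len(text) - k + 1, 1))]
    for s in substrings:
        digest = hashlib.md5(s.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for bit in range(64):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    # A bit is set in the fingerprint when more substrings voted 1 than 0.
    return sum(1 << bit for bit in range(64) if votes[bit] > 0)


def hamming_distance(fp_a: int, fp_b: int) -> int:
    """Number of bit positions at which two 64-bit fingerprints differ."""
    return bin((fp_a ^ fp_b) & MASK_64).count("1")


def simhash_hamming_similarity(fp_a: int, fp_b: int) -> float:
    """Similarity = 1 - Distance/64.0, as stated in the description."""
    return 1.0 - hamming_distance(fp_a, fp_b) / 64.0
```

Identical texts produce identical fingerprints, so their distance is 0 and their similarity is 1.0. Fingerprints that differ in every bit have a distance of 64 and a similarity of 0.0.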

Component Configuration

Method 1: Use the GUI

Add the String Similarity component to the Designer workflow. Then, configure the parameters in the right-side pane.

Parameter type	Parameter	Description
Fields setting	Columns to append to output table	The columns to append to the output table.
	First column for similarity calculation	The first column used in the similarity calculation. The default value is the name of the first column of the STRING type in the table.
	Second column for similarity calculation	The second column used in the similarity calculation. The default value is the name of the second column of the STRING type in the table.
	Similarity column in output table	The name of the similarity column in the output table.
Parameters setting	Similarity calculation method	The method for similarity calculation. Valid values: levenshtein, levenshtein_sim, lcs, lcs_sim, ssk, cosine, simhash_hamming, and simhash_hamming_sim. Default value: levenshtein_sim.
	Substring length	This parameter is required only when the Similarity calculation method parameter is set to ssk, cosine, simhash_hamming, or simhash_hamming_sim. Valid values: (0,100). Default value: 2.
	Weight of matching string	This parameter is required only when the Similarity calculation method parameter is set to ssk, simhash_hamming, or simhash_hamming_sim. Valid values: (0,1). Default value: 0.5.
Execution tuning	Number of cores for computing	By default, the system assigns the number of cores.
	Memory size per core (MB)	By default, the system assigns the memory size.

Method 2: Use PAI commands

You can configure the String Similarity component by using PAI commands, which you can invoke from the SQL Script component. For more information, see SQL Script.

PAI -name string_similarity
    -project algo_public
    -DinputTableName="pai_test_string_similarity"
    -DoutputTableName="pai_test_string_similarity_output"
    -DinputSelectedColName1="col0"
    -DinputSelectedColName2="col1";

Parameter	Required	Default value	Description
inputTableName	Yes	None	The name of the input table.
outputTableName	Yes	None	The name of the output table.
inputSelectedColName1	No	The name of the first column of the STRING type in the table	The name of the first column for the similarity calculation.
inputSelectedColName2	No	The name of the second column of the STRING type in the table	The name of the second column for the similarity calculation.
inputAppendColNames	No	None	The columns to append to the output table.
inputTablePartitions	No	All partitions	The partitions of the input table.
outputColName	No	output	The name of the similarity column in the output table. The name cannot contain special characters. It can contain only letters (a-z, A-Z), digits, and underscores (_). It must start with a letter and be no more than 128 bytes in length.
method	No	levenshtein_sim	The method for similarity calculation. Valid values: levenshtein, levenshtein_sim, lcs, lcs_sim, ssk, cosine, simhash_hamming, and simhash_hamming_sim.
lambda	No	0.5	The weight of the matching string. This parameter is required only when the method parameter is set to ssk. Valid values: (0,1).
k	No	2	The substring length. This parameter is required only when the method parameter is set to ssk, cosine, simhash_hamming, or simhash_hamming_sim. Valid values: (0,100).
lifecycle	No	None	The lifecycle of the output table. The value must be a positive integer.
coreNum	No	System-assigned	The number of cores for computing.
memSizePerCore	No	System-assigned	The memory size per core. Unit: MB.
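The constraints in the preceding table can be checked before a command is submitted. The following Python sketch assembles a PAI command string and validates the method, outputColName, k, and lambda values against the rules listed above. The `build_string_similarity_command` function and its argument names are hypothetical helpers, not part of PAI, and quoting numeric values in the same style as the sample command is an assumption.

```python
import re

VALID_METHODS = {
    "levenshtein", "levenshtein_sim", "lcs", "lcs_sim",
    "ssk", "cosine", "simhash_hamming", "simhash_hamming_sim",
}
# Methods for which the table says k applies.
METHODS_USING_K = {"ssk", "cosine", "simhash_hamming", "simhash_hamming_sim"}
# Letters, digits, and underscores only; must start with a letter.
OUTPUT_COL_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def build_string_similarity_command(input_table, output_table,
                                    method="levenshtein_sim",
                                    output_col="output", k=2, lambda_=0.5):
    """Return a PAI command string (hypothetical helper, for illustration)."""
    if method not in VALID_METHODS:
        raise ValueError(f"unsupported method: {method}")
    if (not OUTPUT_COL_PATTERN.fullmatch(output_col)
            or len(output_col.encode("utf-8")) > 128):
        raise ValueError(f"invalid outputColName: {output_col}")

    params = {
        "inputTableName": input_table,
        "outputTableName": output_table,
        "outputColName": output_col,
        "method": method,
    }
    if method in METHODS_USING_K:
        if not 0 < k < 100:  # valid values: (0,100)
            raise ValueError("k must be in (0,100)")
        params["k"] = k
    if method == "ssk":
        if not 0 < lambda_ < 1:  # valid values: (0,1)
            raise ValueError("lambda must be in (0,1)")
        params["lambda"] = lambda_

    lines = ["PAI -name string_similarity", "    -project algo_public"]
    lines += [f'    -D{name}="{value}"' for name, value in params.items()]
    return "\n".join(lines) + ";"
```

For example, `build_string_similarity_command("pai_test_string_similarity", "pai_test_string_similarity_output", method="ssk")` returns a command that includes the k and lambda parameters, while the default levenshtein_sim method omits both.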

References

For more information about Designer, see Designer overview.
You can also use the String Similarity-Top N component to calculate string similarity and retrieve the top N most similar data records. For more information about this component, see String Similarity-Top N.