This topic describes the String Similarity component provided by Machine Learning Studio.
String similarity calculation is a basic operation in machine learning. It is widely
used in fields such as information retrieval, natural language processing, and
bioinformatics. The component supports five calculation methods: Levenshtein (Levenshtein
distance), Longest Common Substring (LCS), String Subsequence Kernel (SSK), Cosine, and
SimHash_Hamming. The input data is provided in two columns, and the component calculates
the similarity between the two column values in each row.
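To make the two-column input concrete, the following minimal Python sketch computes the Levenshtein distance for each row of a small in-memory table. It only illustrates the general technique and is not the component's implementation. The `levenshtein` function and the sample `rows` are hypothetical names and data introduced for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits that turn a into b."""
    # prev[j] is the distance between the already processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or keep a match
            ))
        prev = curr
    return prev[-1]


# Hypothetical input table with two string columns (col0, col1).
rows = [("kitten", "sitting"), ("flaw", "lawn"), ("same", "same")]
distances = [levenshtein(col0, col1) for col0, col1 in rows]
print(distances)  # [3, 2, 0]
```

Each output value depends only on the two strings in the same row, which is why the component needs exactly two input columns.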
- Levenshtein supports the calculation of distance and similarity.
  - To calculate the distance, set the method to levenshtein.
  - To calculate the similarity, set the method to levenshtein_sim. Similarity = 1 - Distance.
- LCS supports the calculation of distance and similarity.
  - To calculate the distance, set the method to lcs.
  - To calculate the similarity, set the method to lcs_sim. Similarity = 1 - Distance.
- SSK supports the calculation of similarity. To calculate the similarity, set the method to ssk.
- Cosine supports the calculation of similarity. To calculate the similarity, set the method to cosine.
- SimHash_Hamming supports the calculation of distance and similarity. The SimHash algorithm
  maps the original text to a 64-bit binary fingerprint. The Hamming distance is the number of
  bit positions at which the two fingerprints differ.
  - To calculate the distance, set the method to simhash_hamming.
  - To calculate the similarity, set the method to simhash_hamming_sim. Similarity = 1 - Distance/64.0.
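The SimHash_Hamming method described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the component's actual implementation. The tokenization into overlapping substrings of length k and the use of BLAKE2b as the 64-bit token hash are assumptions made for this example, and all function names are hypothetical. Only the 64-bit fingerprint size and the formula Similarity = 1 - Distance/64.0 come from the description above.

```python
import hashlib


def kgrams(text: str, k: int = 2) -> list[str]:
    """Split text into overlapping substrings of length k (assumed tokenization)."""
    if len(text) < k:
        return [text] if text else []
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def simhash64(text: str, k: int = 2) -> int:
    """Map text to a 64-bit fingerprint by a per-bit majority vote over token hashes."""
    votes = [0] * 64
    for token in kgrams(text, k):
        # Assumption: an 8-byte BLAKE2b digest serves as the 64-bit token hash.
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(64):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(64):
        if votes[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Count the bit positions at which two fingerprints differ."""
    return bin(a ^ b).count("1")


def simhash_hamming_sim(s1: str, s2: str, k: int = 2) -> float:
    """Similarity = 1 - Distance/64.0, as stated for simhash_hamming_sim."""
    return 1 - hamming(simhash64(s1, k), simhash64(s2, k)) / 64.0
```

Identical strings produce identical fingerprints, so their distance is 0 and their similarity is 1.0. Two fingerprints that differ in all 64 bits have a similarity of 0.0.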
You can configure the component by using the Machine Learning Platform for AI console or a PAI command.
Configure the component
- Machine Learning Platform for AI console
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Columns Appended to Output Table | The columns appended to the specified output table. |
| Fields Setting | First Column for Similarity Calculation | The default value is the first STRING column in the input table. |
| Fields Setting | Second Column for Similarity Calculation | The default value is the second STRING column in the input table. |
| Fields Setting | Similarity Columns in Output Table | The similarity column in the specified output table. |
| Parameters Setting | Similarity Calculation Method | The similarity calculation method. Valid values: levenshtein, levenshtein_sim, lcs, lcs_sim, ssk, cosine, simhash_hamming, and simhash_hamming_sim. |
| Parameters Setting | Substring Length | This parameter is required only when Similarity Calculation Method is set to ssk, cosine, simhash_hamming, or simhash_hamming_sim. Valid values: (0,100). Default value: 2. |
| Parameters Setting | Weight of Matching String | This parameter is required only when Similarity Calculation Method is set to ssk. Valid values: (0,1). Default value: 0.5. |
| Tuning | Computing Cores | Automatically allocated. |
| Tuning | Memory Size per Core (Unit: MB) | Automatically allocated. |

- PAI command
PAI -name string_similarity
    -project algo_public
    -DinputTableName="pai_test_string_similarity"
    -DoutputTableName="pai_test_string_similarity_output"
    -DinputSelectedColName1="col0"
    -DinputSelectedColName2="col1";
| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTableName | Yes | The name of the input table. | No default value |
| outputTableName | Yes | The name of the output table. | No default value |
| inputSelectedColName1 | No | The first column for similarity calculation. | The first STRING column in the input table |
| inputSelectedColName2 | No | The second column for similarity calculation. | The second STRING column in the input table |
| inputAppendColNames | No | The columns appended to the output table. | No default value |
| inputTablePartitions | No | The partitions in the input table. | Full table |
| outputColName | No | The name of the similarity column in the output table. The name can contain only letters, digits, and underscores (_). It must start with a letter and can be up to 128 bytes in length. | output |
| method | No | The similarity calculation method. Valid values: levenshtein, levenshtein_sim, lcs, lcs_sim, ssk, cosine, simhash_hamming, and simhash_hamming_sim. | levenshtein_sim |
| lambda | No | The weight of the matching string. This parameter is required only when method is set to ssk. Valid values: (0,1). | 0.5 |
| k | No | The substring length. This parameter is required only when method is set to ssk, cosine, simhash_hamming, or simhash_hamming_sim. Valid values: (0,100). | 2 |
| lifecycle | No | The lifecycle of the output table. The value must be a positive integer. | No default value |
| coreNum | No | The number of cores involved in computing. | Automatically allocated |
| memSizePerCore | No | The memory for each core. Unit: MB. | Automatically allocated |
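The parameter rules above can be checked before a command is submitted. The following Python sketch assembles a PAI command string and validates method, k, and lambda against the documented valid values. The helper `build_pai_command` is hypothetical and is not part of any PAI SDK. Passing the method, k, and lambda values as quoted `-D` options is an assumption that follows the style of the sample command.

```python
METHODS = {
    "levenshtein", "levenshtein_sim", "lcs", "lcs_sim",
    "ssk", "cosine", "simhash_hamming", "simhash_hamming_sim",
}
# Methods for which the k parameter (substring length) applies.
USES_K = {"ssk", "cosine", "simhash_hamming", "simhash_hamming_sim"}


def build_pai_command(input_table, output_table, col1, col2,
                      method="levenshtein_sim", k=2, lam=0.5):
    """Build a string_similarity PAI command (hypothetical helper)."""
    if method not in METHODS:
        raise ValueError(f"unsupported method: {method}")
    params = {
        "inputTableName": input_table,
        "outputTableName": output_table,
        "inputSelectedColName1": col1,
        "inputSelectedColName2": col2,
        "method": method,
    }
    if method in USES_K:
        if not 0 < k < 100:  # documented valid values: (0,100)
            raise ValueError("k must be in the open interval (0,100)")
        params["k"] = k
    if method == "ssk":
        if not 0 < lam < 1:  # documented valid values: (0,1)
            raise ValueError("lambda must be in the open interval (0,1)")
        params["lambda"] = lam
    # Assumption: every value is passed as a quoted -D option.
    args = " ".join(f'-D{name}="{value}"' for name, value in params.items())
    return f"PAI -name string_similarity -project algo_public {args};"


print(build_pai_command("pai_test_string_similarity",
                        "pai_test_string_similarity_output",
                        "col0", "col1", method="ssk", k=3, lam=0.4))
```

The helper emits k and lambda only for the methods that use them, so a command for levenshtein_sim or lcs_sim stays as short as the sample command.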