This topic describes the Document Similarity component provided by Machine Learning Studio.
Document similarity is based on string similarity and calculates the similarity between
articles or between sentences. Words in documents or sentences are separated by spaces. The
document similarity is calculated the same way as the string similarity. The document
similarity supports the following calculation methods: Levenshtein Distance (Levenshtein),
Longest Common SubString (LCS), String Subsequence Kernel (SSK), Cosine, and Simhash_Hamming.
- The Levenshtein method supports distance and similarity calculation.
  - The distance is expressed as the levenshtein parameter.
  - The similarity is calculated based on the following formula: Similarity = 1 - Distance. The similarity is expressed as the levenshtein_sim parameter.
- The LCS method supports distance and similarity calculation.
  - The distance is expressed as the lcs parameter.
  - The similarity is calculated based on the following formula: Similarity = 1 - Distance. The similarity is expressed as the lcs_sim parameter.
- The SSK method supports similarity calculation and is expressed as the ssk parameter.
- The Cosine method supports similarity calculation and is expressed as the cosine parameter.
- In the Simhash_Hamming method, the SimHash algorithm maps each original document to a 64-bit
binary fingerprint. The Hamming distance is the number of bit positions at which the two
binary fingerprints differ. The Simhash_Hamming method supports distance
and similarity calculation.
  - The distance is expressed as the simhash_hamming parameter.
  - The similarity is calculated based on the following formula: Similarity = 1 - Distance/64.0. The similarity is expressed as the simhash_hamming_sim parameter.
Note
- For more information about SimHash, see Similarity Estimation Techniques from Rounding Algorithms.
- For more information about the Hamming distance, see Wikipedia.
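The Hamming step and the similarity formula above can be sketched in Python as follows. This is a minimal illustration, not the component's implementation: the function names and the sample fingerprints are hypothetical, and the SimHash step that produces the fingerprints is omitted.

```python
FINGERPRINT_BITS = 64  # fingerprint width stated for Simhash_Hamming


def hamming_distance(fp_a: int, fp_b: int) -> int:
    """Count the bit positions at which two 64-bit fingerprints differ."""
    mask = (1 << FINGERPRINT_BITS) - 1
    return bin((fp_a ^ fp_b) & mask).count("1")


def hamming_similarity(fp_a: int, fp_b: int) -> float:
    """Similarity = 1 - Distance/64.0, as in the formula above."""
    return 1 - hamming_distance(fp_a, fp_b) / 64.0


# Hypothetical fingerprints that differ in exactly 4 bit positions.
fp_a = 0x0123456789ABCDEF
fp_b = fp_a ^ 0b1111

print(hamming_distance(fp_a, fp_b))    # 4
print(hamming_similarity(fp_a, fp_b))  # 0.9375
```

Identical fingerprints give a distance of 0 and a similarity of 1.0. Fingerprints that differ in all 64 bits give a similarity of 0.0.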
Configure the component
You can configure the component by using one of the following methods:
- Machine Learning Platform For AI console
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | First Column for Similarity Calculation | The default value is the name of the first string column in the table. |
| Fields Setting | Second Column for Similarity Calculation | The default value is the name of the second string column in the table. |
| Fields Setting | Columns Appended Output Table | The names of columns appended to the output table. |
| Fields Setting | Similarity Column in Output Table | The name of the similarity column in the output table. Default value: output. Note: The column name can be up to 128 characters in length and can contain letters, digits, and underscores (_). It must start with a letter. |
| Parameters Setting | Similarity Calculation Method | The similarity calculation method. Valid values: levenshtein, levenshtein_sim, lcs, lcs_sim, ssk, cosine, simhash_hamming, and simhash_hamming_sim. |
| Parameters Setting | Substring Length (Available in SSK and Cosine) | This parameter appears only when the Similarity Calculation Method parameter is set to ssk or cosine. Valid values: (0,100). Default value: 2. |
| Parameters Setting | Matching Word Pair Weight (Available in SSK and Cosine) | This parameter appears only when the Similarity Calculation Method parameter is set to ssk. Valid values: (0,1). Default value: 0.5. |
| Tuning | Computing Cores | The number of cores used for calculation. The value is automatically allocated. |
| Tuning | Memory Size per Core (Unit: MB) | The size of the memory required by each core. The value is automatically allocated. |
- PAI command
PAI -name doc_similarity
    -project algo_public
    -DinputTableName="pai_test_doc_similarity"
    -DoutputTableName="pai_test_doc_similarity_output"
    -DinputSelectedColName1="col0"
    -DinputSelectedColName2="col1"
| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTableName | Yes | The name of the input table. | No default value |
| outputTableName | Yes | The name of the output table. | No default value |
| inputSelectedColName1 | No | The name of the first column for similarity calculation. | The name of the first string column in the table |
| inputSelectedColName2 | No | The name of the second column for similarity calculation. | The name of the second string column in the table |
| inputAppendColNames | No | The name of the column appended to the output table. | No column appended by default |
| inputTablePartitions | No | The partitions that are selected from the input table. | Full table |
| outputColName | No | The name of the similarity column in the output table. Note: The column name can be up to 128 characters in length and can contain letters, digits, and underscores (_). It must start with a letter. | output |
| method | No | The similarity calculation method. Valid values: levenshtein, levenshtein_sim, lcs, lcs_sim, ssk, cosine, simhash_hamming, and simhash_hamming_sim. | levenshtein_sim |
| lambda | No | The weight of a matched word pair. The SSK method supports this parameter. Valid values: (0,1). | 0.5 |
| k | No | The length of the substring. The SSK and Cosine methods support this parameter. Valid values: (0,100). | 2 |
| lifecycle | No | The lifecycle of the output table. | No default value |
| coreNum | No | The number of the cores used for calculation. | Automatically allocated |
| memSizePerCore | No | The size of the memory required by each core. Unit: MB. | Automatically allocated |
Examples
- Generate data.
drop table if exists pai_doc_similarity_input;
create table pai_doc_similarity_input as
select * from
(
  select 0 as id, "Beijing Shanghai" as col0, "Beijing Shanghai" as col1 from dual
  union all
  select 1 as id, "Beijing Shanghai" as col0, "Beijing Shanghai Shenzhen" as col1 from dual
) tmp;
The following table is the input table pai_doc_similarity_input.
| id | col0 | col1 |
| --- | --- | --- |
| 1 | Beijing Shanghai | Beijing Shanghai Shenzhen |
| 0 | Beijing Shanghai | Beijing Shanghai |
- Run the PAI command.
drop table if exists pai_doc_similarity_output;
PAI -name doc_similarity
    -project algo_public
    -DinputTableName=pai_doc_similarity_input
    -DoutputTableName=pai_doc_similarity_output
    -DinputSelectedColName1=col0
    -DinputSelectedColName2=col1
    -Dmethod=levenshtein_sim
    -DinputAppendColNames=id,col0,col1;
- Generate the output.
The following table is the output table pai_doc_similarity_output.
| id | col0 | col1 | output |
| --- | --- | --- | --- |
| 1 | Beijing Shanghai | Beijing Shanghai Shenzhen | 0.6666666666666667 |
| 0 | Beijing Shanghai | Beijing Shanghai | 1.0 |
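The values in the output column can be reproduced with a short Python sketch that treats each space-separated word as one edit unit. The component's exact normalization is not stated in this topic. The sketch assumes that the edit distance is divided by the word count of the longer document before it is subtracted from 1, which matches the two output rows above. The function names are hypothetical.

```python
def word_levenshtein(a: str, b: str) -> int:
    """Edit distance in which the unit is a space-separated word."""
    x, y = a.split(), b.split()
    prev = list(range(len(y) + 1))
    for i, wx in enumerate(x, start=1):
        curr = [i]
        for j, wy in enumerate(y, start=1):
            cost = 0 if wx == wy else 1
            curr.append(min(prev[j] + 1,          # delete a word
                            curr[j - 1] + 1,      # insert a word
                            prev[j - 1] + cost))  # substitute a word
        prev = curr
    return prev[-1]


def word_levenshtein_sim(a: str, b: str) -> float:
    """Similarity = 1 - normalized distance.

    ASSUMPTION: the distance is normalized by the longer word count.
    """
    longest = max(len(a.split()), len(b.split()))
    if longest == 0:
        return 1.0
    return 1 - word_levenshtein(a, b) / longest


print(word_levenshtein_sim("Beijing Shanghai", "Beijing Shanghai Shenzhen"))  # 0.6666666666666667
print(word_levenshtein_sim("Beijing Shanghai", "Beijing Shanghai"))           # 1.0
```

In the first row, one word (Shenzhen) must be inserted and the longer document has three words, so the similarity is 1 - 1/3.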
FAQ
- Similarity calculation is based on the result of word segmentation. Words are separated by spaces, and each word serves as a unit of similarity calculation. If the input is a string as a whole, use the string similarity method.
- In the method parameter, levenshtein, lcs, and simhash_hamming are used to calculate the distance. levenshtein_sim, lcs_sim, ssk, cosine, and simhash_hamming_sim are used to calculate the similarity. The distance is calculated based on the following formula: Distance = 1.0 - Similarity.
- When the method parameter is set to cosine or ssk, the k parameter is available. It specifies that k words are combined as one unit for similarity calculation. If the value of k is greater than the number of words, the similarity output is 0 even if the two strings are the same. In this case, change the value of k to a value less than or equal to the minimum number of words.
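The effect of k can be illustrated with a small Python sketch. It assumes that the cosine method counts contiguous groups of k words and compares the two count vectors. This topic does not describe the component's internals, so the function names and the grouping rule are illustrative assumptions only.

```python
from collections import Counter
from math import sqrt


def k_word_counts(text: str, k: int) -> Counter:
    """Count contiguous groups of k space-separated words."""
    words = text.split()
    # The range is empty when k exceeds the number of words.
    return Counter(tuple(words[i:i + k]) for i in range(len(words) - k + 1))


def cosine_sim(a: str, b: str, k: int = 2) -> float:
    """Cosine similarity of the k-word count vectors (ASSUMED scheme)."""
    ca, cb = k_word_counts(a, k), k_word_counts(b, k)
    dot = sum(ca[g] * cb[g] for g in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


print(cosine_sim("Beijing Shanghai", "Beijing Shanghai", k=2))  # 1.0
print(cosine_sim("Beijing Shanghai", "Beijing Shanghai", k=3))  # 0.0
```

With k=3, neither two-word document yields any three-word group, so the count vectors are empty and the sketch returns 0 even though the documents are identical.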