Calculates string similarity between input strings and mapping table entries, then returns the top N matches for each input string.
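The top N matching described above can be illustrated with a minimal Python sketch. It is not the component's implementation: it assumes the levenshtein_sim method normalizes edit distance as 1 - distance / max(len(a), len(b)), and the function names (levenshtein_distance, levenshtein_sim, top_n_matches) are hypothetical.

```python
import heapq


def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def levenshtein_sim(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / max(len(a), len(b))."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


def top_n_matches(inputs, mapping, n=10):
    """For each input string, return the n most similar mapping strings
    as (similarity, mapping_string) pairs, best match first."""
    result = {}
    for text in inputs:
        scored = ((levenshtein_sim(text, cand), cand) for cand in mapping)
        result[text] = heapq.nlargest(n, scored)
    return result
```

For example, top_n_matches(["apple"], ["apple", "apply", "banana"], n=2) ranks "apple" first and "apply" second. Every input string is compared against every mapping string, which is why the cost grows with the product of the two table sizes.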
Configuration
Configure the component on the Designer workflow page or using PAI commands.
Configure via GUI
Configure parameters on the Designer workflow page.
| Tab | Parameter | Description |
|---|---|---|
| Field settings | Columns to append from input table | Columns from the input table to include in the output. |
| Field settings | Columns to append from mapping table | Columns from the mapping table to include in the output. |
| Field settings | Left table column for similarity calculation | Column from the left (input) table to use for similarity calculation. |
| Field settings | Mapping table column for similarity calculation | Column from the mapping table to use for similarity calculation. The component calculates the similarity between each row in the left table and every string in the mapping table, then returns the top N results. |
| Field settings | Similarity column name in output table | Name of the similarity column in the output table. Must contain only letters (a-z, A-Z), digits, and underscores (_), start with a letter, and be no more than 128 bytes long. Default: output. |
| Parameter settings | Number of top similarity values | Number of top similarity values to return for each input string. Must be a positive integer. Default: 10. |
| Parameter settings | Similarity calculation method | Similarity calculation method. Valid values include levenshtein_sim, ssk, cosine, and simhash_hamming_sim. |
| Parameter settings | Substring length | Required only when Similarity calculation method is set to ssk, cosine, or simhash_hamming_sim. Value range: (0, 100). Default: 2. |
| Parameter settings | Matching string weight | Required only when Similarity calculation method is set to ssk or simhash_hamming_sim. Value range: (0, 1). Default: 0.5. |
| Execution tuning | Number of cores | Number of CPU cores. Automatically allocated by default. |
| Execution tuning | Memory per core (MB) | Memory per core, in MB. Automatically allocated by default. |
Configure via PAI command
Configure parameters using PAI commands. Use the SQL Script component to run PAI commands. For more information, see SQL Script.
PAI -name string_similarity_topn
    -project algo_public
    -DinputTableName="pai_test_string_similarity_topn"
    -DoutputTableName="pai_test_string_similarity_topn_output"
    -DmapTableName="pai_test_string_similarity_map_topn"
    -DinputSelectedColName="col0"
    -DmapSelectedColName="col1"
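When the command is generated by a script rather than typed by hand, a small helper can assemble it from a parameter dictionary. The sketch below is only an illustration: build_pai_command is a hypothetical function, not part of any PAI SDK, and it assumes every parameter is passed as -Dname="value" exactly as in the command above.

```python
def build_pai_command(params: dict) -> str:
    """Render a string_similarity_topn PAI command from a parameter dict.

    Hypothetical helper: keys are used as -D parameter names verbatim and
    values are wrapped in double quotes, mirroring the documented example.
    """
    lines = ["PAI -name string_similarity_topn", "    -project algo_public"]
    for name, value in params.items():
        if value is None:
            continue  # leave unset optional parameters out so defaults apply
        lines.append(f'    -D{name}="{value}"')
    return "\n".join(lines)


command = build_pai_command({
    "inputTableName": "pai_test_string_similarity_topn",
    "outputTableName": "pai_test_string_similarity_topn_output",
    "mapTableName": "pai_test_string_similarity_map_topn",
    "inputSelectedColName": "col0",
    "mapSelectedColName": "col1",
})
print(command)
```

Skipping None values keeps optional parameters out of the command, so the component falls back to its own defaults instead of receiving an empty string.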
| Parameter name | Required | Description | Default value |
|---|---|---|---|
| inputTableName | Yes | Name of the input table. | None |
| mapTableName | Yes | Name of the mapping table. | None |
| outputTableName | Yes | Name of the output table. | None |
| inputSelectedColName1 | No | Name of the column from the left table to use for similarity calculation. | First STRING column in the table |
| inputSelectedColName2 | No | Name of the column from the mapping table to use for similarity calculation. | First STRING column in the table |
| inputAppendColNames | No | Names of columns from the input table to include in the output table. | None |
| inputAppendRenameColNames | No | Aliases for columns from the input table to include in the output table. | None |
| mapSelectedColName | Yes | Name of the column from the mapping table to use for similarity calculation. | None |
| mapAppendColNames | No | Names of columns from the mapping table to include in the output table. | None |
| mapAppendRenameColNames | No | Aliases for columns from the mapping table to include in the output table. | None |
| inputTablePartitions | No | Names of partitions in the input table. | All partitions |
| mapTablePartitions | No | Names of partitions in the mapping table. | All partitions |
| outputColName | No | Name of the similarity column in the output table. Must contain only letters (a-z, A-Z), digits, and underscores (_), start with a letter, and be no more than 128 bytes long. | output |
| method | No | Similarity calculation method. Valid values include levenshtein_sim, ssk, cosine, and simhash_hamming_sim. | levenshtein_sim |
| lambda | No | Matching string weight. Required only when method is set to ssk or simhash_hamming_sim. Value range: (0, 1). | 0.5 |
| k | No | Substring length. Required only when method is set to ssk, cosine, or simhash_hamming_sim. Value range: (0, 100). | 2 |
| lifecycle | No | Number of days to retain the output table. Must be a positive integer. | None |
| coreNum | No | Number of CPU cores to allocate for calculation. | System-assigned |
| memSizePerCore | No | Amount of memory to allocate per CPU core, in MB. | Automatically assigned |
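The constraints listed for outputColName, lambda, k, and lifecycle can be checked before a job is submitted. The following sketch is an assumption-laden illustration: validate_params is a hypothetical function, and it only encodes the rules stated in the table (it does not check the full list of valid method values).

```python
import re

# Letters, digits, and underscores only; the first character must be a letter.
_COLUMN_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")
# Methods that the parameter table ties to lambda and k.
_NEEDS_LAMBDA = {"ssk", "simhash_hamming_sim"}
_NEEDS_K = {"ssk", "cosine", "simhash_hamming_sim"}


def validate_params(method="levenshtein_sim", output_col_name="output",
                    lambda_=0.5, k=2, lifecycle=None):
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    name_ok = _COLUMN_NAME.fullmatch(output_col_name) is not None
    if not name_ok or len(output_col_name.encode("utf-8")) > 128:
        errors.append("outputColName must start with a letter, use only "
                      "letters, digits, and underscores, and fit in 128 bytes")
    # lambda is only meaningful for ssk and simhash_hamming_sim; open range (0, 1).
    if method in _NEEDS_LAMBDA and not 0 < lambda_ < 1:
        errors.append("lambda must be in the open range (0, 1)")
    # k is only meaningful for ssk, cosine, and simhash_hamming_sim; open range (0, 100).
    if method in _NEEDS_K and not 0 < k < 100:
        errors.append("k must be in the open range (0, 100)")
    if lifecycle is not None and not (isinstance(lifecycle, int) and lifecycle > 0):
        errors.append("lifecycle must be a positive integer")
    return errors
```

Returning a list of messages instead of raising on the first problem lets a caller report every invalid parameter at once.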
Resource usage
The computational cost of this component scales with M × N. To find the closest strings for N records within a set of M records, the algorithm calculates the distance between every pair of records, which takes M × N calculations. The required resources are therefore directly proportional to M × N.
To find the nearest records for N records within a set of M records, the required number of workers is (M × N) / (1024 × 1024 × 32), capped at 1,000. Each worker is allocated N/8 MB of memory, with a minimum of 4 GB and a maximum of 64 GB. Under the billing model, one computing unit (CU) provides 4 GB of memory, so the maximum CU request for this algorithm is 1,000 × 64 / 4 = 16,000 CUs.
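The sizing rules above can be turned into a quick estimate. This is a sketch under stated assumptions, not the scheduler's actual logic: estimate_resources is a hypothetical function, the worker count is rounded up with a minimum of one worker, and 1 GB is taken as 1,024 MB.

```python
import math

MB_PER_GB = 1024  # assumption: binary units


def estimate_resources(m: int, n: int) -> dict:
    """Estimate workers, memory per worker, and CUs for M x N comparisons."""
    # Workers: (M x N) / (1024 x 1024 x 32), capped at 1,000.
    # Assumption: fractional results round up, and at least one worker runs.
    raw_workers = (m * n) / (1024 * 1024 * 32)
    workers = min(1000, max(1, math.ceil(raw_workers)))

    # Memory per worker: N / 8 MB, clamped to the 4 GB to 64 GB range.
    mem_gb = min(64.0, max(4.0, (n / 8) / MB_PER_GB))

    # One CU provides 4 GB of memory.
    cus = workers * mem_gb / 4
    return {"workers": workers, "mem_gb_per_worker": mem_gb, "cus": cus}
```

With very large inputs both caps apply, so the estimate reaches 1,000 workers at 64 GB each, which is the 16,000 CU ceiling stated above. Small jobs fall back to a single worker with the 4 GB minimum.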
Reference
- String Similarity: calculates string similarity for applications in information retrieval, natural language processing, and bioinformatics.