Configure the String Similarity - top N component - Platform For AI

Calculates string similarity between input strings and mapping table entries, then returns the top N matches for each input string.

Configuration

Configure the component on the Designer workflow page or using PAI commands.

Configure via GUI

Configure parameters on the Designer workflow page.

Tab	Parameter	Description
Field settings	Columns to append from input table	Columns from the input table to include in the output.
	Columns to append from mapping table	Columns from the mapping table to include in the output.
	Left table column for similarity calculation	Column from the left table (the input table) to use for similarity calculation.
	Mapping table column for similarity calculation	Column from the mapping table to use for similarity calculation. The component calculates the similarity between each row in the left table and every string in the mapping table, then returns the top N results for that row.
	Similarity column name in output table	Name for the similarity column in the output table. Must contain only letters (a-z, A-Z), digits, and underscores (_), start with a letter, and have a maximum length of 128 bytes. Default: output.
Parameter settings	Number of top similarity values	Number of top similarity values to return for each input string. Must be a positive integer. Default: 10.
	Similarity calculation method	Similarity calculation method. Valid values: levenshtein_sim (default), lcs_sim, ssk, cosine, and simhash_hamming_sim.
	Substring length	Required only when Similarity calculation method is set to ssk, cosine, or simhash_hamming_sim. Value range: (0, 100). Default: 2.
	Matching string weight	Required only when Similarity calculation method is set to ssk or simhash_hamming_sim. Value range: (0, 1). Default: 0.5.
Execution tuning	Number of cores	Automatically allocated by default.
	Memory per core (MB)	Automatically allocated by default.
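
The top N behavior described in the table can be illustrated with a minimal Python sketch. This is not the component's implementation. It assumes that levenshtein_sim is defined as 1 - distance / max(len(a), len(b)), and the function names (levenshtein_distance, levenshtein_sim, top_n_matches) are hypothetical helpers introduced only for this example.

```python
import heapq


def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_sim(a: str, b: str) -> float:
    """Assumed definition: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


def top_n_matches(inputs, mapping, n=10):
    """For each input string, return the n most similar mapping strings."""
    results = {}
    for s in inputs:
        scored = [(m, levenshtein_sim(s, m)) for m in mapping]
        # Compare every input against every mapping entry, keep the best n.
        results[s] = heapq.nlargest(n, scored, key=lambda pair: pair[1])
    return results


if __name__ == "__main__":
    for source, matches in top_n_matches(["abc"], ["abc", "abd", "xyz"], n=2).items():
        print(source, matches)
```

Because every input string is compared with every mapping string, the cost of this sketch grows with the product of the two table sizes, which matches the resource model described under Resource usage.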

Configure via PAI command

Configure parameters using PAI commands. Use the SQL Script component to run PAI commands. For more information, see SQL Script.

PAI -name string_similarity_topn
    -project algo_public
    -DinputTableName="pai_test_string_similarity_topn"
    -DoutputTableName="pai_test_string_similarity_topn_output"
    -DmapTableName="pai_test_string_similarity_map_topn"
    -DinputSelectedColName="col0"
    -DmapSelectedColName="col1"

Parameter name	Required	Description	Default value
inputTableName	Yes	Name of the input table.	None
mapTableName	Yes	Name of the mapping table.	None
outputTableName	Yes	Name of the output table.	None
inputSelectedColName1	No	Name of the column from the left table to use for similarity calculation.	First STRING column in the table
inputSelectedColName2	No	Name of the column from the mapping table to use for similarity calculation.	First STRING column in the table
inputAppendColNames	No	Names of columns from the input table to include in the output table.	None
inputAppendRenameColNames	No	Aliases for columns from the input table to include in the output table.	None
mapSelectedColName	Yes	Name of the column from the mapping table to use for similarity calculation.	None
mapAppendColNames	No	Names of columns from the mapping table to include in the output table.	None
mapAppendRenameColNames	No	Aliases for columns from the mapping table to include in the output table.	None
inputTablePartitions	No	Names of partitions in the input table.	All partitions
mapTablePartitions	No	Names of partitions in the mapping table.	All partitions
outputColName	No	Name of the similarity column in the output table. Must contain only letters (a-z, A-Z), digits, or underscores (_), start with a letter, and be no more than 128 bytes long.	output
method	No	Similarity calculation method. Valid values: levenshtein_sim, lcs_sim, ssk, cosine, and simhash_hamming_sim.	levenshtein_sim
lambda	No	Weight of matching strings. Required only when method is set to ssk or simhash_hamming_sim. Value range: (0, 1).	0.5
k	No	Substring length. Required only when method is set to ssk, cosine, or simhash_hamming_sim. Value range: (0, 100).	2
lifecycle	No	Number of days to retain the output table. Must be a positive integer.	None
coreNum	No	Number of CPU cores to allocate for calculation.	System-assigned
memSizePerCore	No	Amount of memory to allocate per CPU core, in MB.	Automatically assigned
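
The constraints in the table can be checked before a command is submitted. The following Python sketch is only an illustration: validate_params and its argument names are hypothetical (lam stands in for lambda, which is a reserved word in Python), and it assumes that the ranges (0, 1) and (0, 100) exclude their endpoints.

```python
import re

VALID_METHODS = {"levenshtein_sim", "lcs_sim", "ssk", "cosine", "simhash_hamming_sim"}
METHODS_USING_K = {"ssk", "cosine", "simhash_hamming_sim"}
METHODS_USING_LAMBDA = {"ssk", "simhash_hamming_sim"}

# Letters, digits, and underscores only; must start with a letter.
COLUMN_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def validate_params(output_col_name="output", method="levenshtein_sim", k=2, lam=0.5):
    """Return a list of human-readable problems; an empty list means no issue found."""
    errors = []
    if (not COLUMN_NAME.fullmatch(output_col_name)
            or len(output_col_name.encode("utf-8")) > 128):
        errors.append("outputColName must start with a letter, contain only "
                      "letters, digits, or underscores, and be at most 128 bytes")
    if method not in VALID_METHODS:
        errors.append(f"method must be one of {sorted(VALID_METHODS)}")
    # k and lambda matter only for the methods that use them.
    if method in METHODS_USING_K and not (0 < k < 100):
        errors.append("k must be in the range (0, 100)")
    if method in METHODS_USING_LAMBDA and not (0 < lam < 1):
        errors.append("lambda must be in the range (0, 1)")
    return errors


if __name__ == "__main__":
    print(validate_params(output_col_name="1st_score", method="ssk", k=0))
```

With the defaults from the table the function returns an empty list. An out-of-range k is reported only when the selected method actually uses it.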

Resource usage

The computational complexity of this component is M × N. To find the closest strings for N records within a set of M records, the algorithm calculates the distance between every pair of records, which requires M × N calculations. Resource requirements are therefore directly proportional to M × N.

For a job that matches N records against a set of M records, the required number of workers is (M × N) / (1024 × 1024 × 32), up to a maximum of 1,000. Memory per worker is N/8 MB, with a minimum of 4 GB and a maximum of 64 GB. Under the billing model, one computing unit (CU) provides 4 GB of memory, so the maximum CU request for this algorithm is 1,000 × 64 / 4 = 16,000 CUs.
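
The sizing rules above can be turned into a rough estimate. The Python sketch below is an approximation under stated assumptions, not the scheduler's actual logic: estimate_resources is a hypothetical helper, the worker count is assumed to round up with a minimum of one worker, and 1 GB is taken as 1,024 MB.

```python
import math


def estimate_resources(m: int, n: int):
    """Estimate (workers, memory per worker in MB, CUs) for N records against M records."""
    # Workers: (M x N) / (1024 x 1024 x 32), capped at 1,000.
    # Rounding up and using at least one worker are assumptions.
    workers = min(max(math.ceil(m * n / (1024 * 1024 * 32)), 1), 1000)
    # Memory per worker: N/8 MB, kept between 4 GB and 64 GB.
    mem_mb = min(max(n / 8, 4 * 1024), 64 * 1024)
    # One CU provides 4 GB of memory.
    cus = workers * (mem_mb / 1024) / 4
    return workers, mem_mb, cus


if __name__ == "__main__":
    print(estimate_resources(1_000_000, 1_000_000))
```

For one million records matched against one million records, both limits are reached, which gives 1,000 workers at 64 GB each and the 16,000 CU maximum stated above.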

Reference

Designer overview
String Similarity - Calculates string similarity for applications in information retrieval, natural language processing, and bioinformatics