Platform for AI: String similarity - Top N

Last Updated: Mar 11, 2026

Calculates string similarity between input strings and mapping table entries, then returns the top N matches for each input string.

Configuration

Configure the component on the Designer workflow page or using PAI commands.

Configure via GUI

Configure parameters on the Designer workflow page.

| Tab | Parameter | Description |
| --- | --- | --- |
| Field settings | Columns to append from input table | Columns from the input table to include in the output. |
| Field settings | Columns to append from mapping table | Columns from the mapping table to include in the output. |
| Field settings | Left table column for similarity calculation | Column from the left table to use for similarity calculation. |
| Field settings | Mapping table column for similarity calculation | Column from the mapping table to use for similarity calculation. The component calculates similarity between each row in the left table and all strings in the mapping table, then returns the top N results. |
| Field settings | Similarity column name in output table | Name for the similarity column in the output table. Must contain only letters (a-z, A-Z), digits, and underscores (_), start with a letter, and have a maximum length of 128 bytes. Default: output. |
| Parameter settings | Number of top similarity values | Number of top similarity values to return for each input string. Must be a positive integer. Default: 10. |
| Parameter settings | Similarity calculation method | Similarity calculation method. Valid values: levenshtein_sim (default), lcs_sim, ssk, cosine, simhash_hamming_sim. |
| Parameter settings | Substring length | Required only when Similarity calculation method is set to ssk, cosine, or simhash_hamming_sim. Value range: (0, 100). Default: 2. |
| Parameter settings | Matching string weight | Required only when Similarity calculation method is set to ssk or simhash_hamming_sim. Value range: (0, 1). Default: 0.5. |
| Execution tuning | Number of cores | Automatically allocated by default. |
| Execution tuning | Memory per core (MB) | Automatically allocated by default. |
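
To make the top N behavior concrete, the following Python sketch scores each input string against every mapping string and keeps the best matches. It is illustrative only and is not the component's implementation: the function names are hypothetical, and the normalization `1 - distance / max(len)` is an assumed definition of levenshtein_sim that this page does not confirm.

```python
import heapq


def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_sim(a: str, b: str) -> float:
    # Assumed normalization: 1 - distance / longest length (1.0 for two empty strings).
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def top_n_matches(inputs, mapping, n=10):
    """For each input string, return up to n (similarity, mapping string) pairs, best first."""
    results = {}
    for s in inputs:
        scored = ((levenshtein_sim(s, m), m) for m in mapping)
        results[s] = heapq.nlargest(n, scored)
    return results


matches = top_n_matches(["apple"], ["apple", "apply", "angle", "banana"], n=2)
```

Because every input string is compared with every mapping string, the work grows with the product of the two table sizes, which is why the number of top values and the table sizes matter for resource planning.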

Configure via PAI command

Configure parameters using PAI commands. Use the SQL Script component to run PAI commands. For more information, see SQL Script.

PAI -name string_similarity_topn
    -project algo_public
    -DinputTableName="pai_test_string_similarity_topn"
    -DoutputTableName="pai_test_string_similarity_topn_output"
    -DmapTableName="pai_test_string_similarity_map_topn"
    -DinputSelectedColName="col0"
    -DmapSelectedColName="col1";
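
When the command is generated by a script, it can help to assemble it from a dictionary of parameters. The sketch below is a hypothetical helper (`build_pai_command` is not part of any PAI SDK) that renders the same layout as the command above. The trailing semicolon is an assumption about how statements are terminated in the SQL Script component.

```python
def build_pai_command(name: str, project: str, params: dict) -> str:
    """Render a PAI command; each entry in params becomes a -Dkey="value" line."""
    lines = [f"PAI -name {name}", f"    -project {project}"]
    for key, value in params.items():
        lines.append(f'    -D{key}="{value}"')
    # Assumption: the statement is terminated with a semicolon.
    return "\n".join(lines) + ";"


command = build_pai_command(
    "string_similarity_topn",
    "algo_public",
    {
        "inputTableName": "pai_test_string_similarity_topn",
        "outputTableName": "pai_test_string_similarity_topn_output",
        "mapTableName": "pai_test_string_similarity_map_topn",
        "inputSelectedColName": "col0",
        "mapSelectedColName": "col1",
        "method": "lcs_sim",  # optional; one of the documented method values
    },
)
print(command)
```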

| Parameter name | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTableName | Yes | Name of the input table. | None |
| mapTableName | Yes | Name of the mapping table. | None |
| outputTableName | Yes | Name of the output table. | None |
| inputSelectedColName1 | No | Name of the column from the left table to use for similarity calculation. | First STRING column in the table |
| inputSelectedColName2 | No | Name of the column from the mapping table to use for similarity calculation. | First STRING column in the table |
| inputAppendColNames | No | Names of columns from the input table to include in the output table. | None |
| inputAppendRenameColNames | No | Aliases for columns from the input table to include in the output table. | None |
| mapSelectedColName | Yes | Name of the column from the mapping table to use for similarity calculation. | None |
| mapAppendColNames | No | Names of columns from the mapping table to include in the output table. | None |
| mapAppendRenameColNames | No | Aliases for columns from the mapping table to include in the output table. | None |
| inputTablePartitions | No | Names of partitions in the input table. | All partitions |
| mapTablePartitions | No | Names of partitions in the mapping table. | All partitions |
| outputColName | No | Name of the similarity column in the output table. Must contain only letters (a-z, A-Z), digits, and underscores (_), start with a letter, and be no more than 128 bytes long. | output |
| method | No | Similarity calculation method. Valid values: levenshtein_sim, lcs_sim, ssk, cosine, simhash_hamming_sim. | levenshtein_sim |
| lambda | No | Matching string weight. Required only when method is set to ssk or simhash_hamming_sim. Value range: (0, 1). | 0.5 |
| k | No | Substring length. Required only when method is set to ssk, cosine, or simhash_hamming_sim. Value range: (0, 100). | 2 |
| lifecycle | No | Number of days to retain the output table. Must be a positive integer. | None |
| coreNum | No | Number of CPU cores to allocate for calculation. | System-assigned |
| memSizePerCore | No | Amount of memory to allocate per CPU core. | Automatically assigned |
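
The constraints in the parameter descriptions can be checked before a command is submitted. The following sketch is a hypothetical client-side validator, not something the component provides. It assumes that the value ranges (0, 1) and (0, 100) are open intervals and that the 128-byte limit is measured on the UTF-8 encoding of the name.

```python
import re

VALID_METHODS = {"levenshtein_sim", "lcs_sim", "ssk", "cosine", "simhash_hamming_sim"}
# Letters, digits, and underscores only, starting with a letter.
_COL_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def validate_params(output_col_name="output", method="levenshtein_sim",
                    lam=0.5, k=2, lifecycle=None):
    """Return a list of human-readable problems; an empty list means no issue was found."""
    errors = []
    if not _COL_NAME.fullmatch(output_col_name):
        errors.append("outputColName must start with a letter and use only letters, digits, and _")
    elif len(output_col_name.encode("utf-8")) > 128:
        errors.append("outputColName must be at most 128 bytes")
    if method not in VALID_METHODS:
        errors.append(f"method must be one of {sorted(VALID_METHODS)}")
    # lambda only applies to ssk and simhash_hamming_sim.
    if method in {"ssk", "simhash_hamming_sim"} and not 0 < lam < 1:
        errors.append("lambda must be in (0, 1)")
    # k applies to ssk, cosine, and simhash_hamming_sim.
    if method in {"ssk", "cosine", "simhash_hamming_sim"} and not 0 < k < 100:
        errors.append("k must be in (0, 100)")
    if lifecycle is not None and (not isinstance(lifecycle, int) or lifecycle <= 0):
        errors.append("lifecycle must be a positive integer")
    return errors
```

For example, `validate_params(method="ssk", lam=1.5)` reports the out-of-range lambda, while the same lambda is ignored when method is levenshtein_sim because the parameter does not apply to that method.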

Resource usage

The cost of this component scales with M × N. To find the closest strings for N records within a set of M records, the algorithm calculates the distance between every pair of records, which takes M × N calculations, so the required resources are directly proportional to M × N.

The required number of workers is (M × N) / (1024 × 1024 × 32), up to a maximum of 1,000. Memory per worker is N/8 MB, with a minimum of 4 GB and a maximum of 64 GB. Under the billing model, one computing unit (CU) provides 4 GB of memory, so the maximum CU request for this algorithm is 1,000 × 64 / 4 = 16,000 CUs.
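
The sizing rules above can be turned into a rough estimator. This is a sketch under stated assumptions, not an official calculator: the function name is hypothetical, the worker count is rounded up with a minimum of one worker, and 1 GB is taken as 1,024 MB.

```python
import math


def estimate_resources(m: int, n: int) -> dict:
    """Estimate workers, memory per worker, and CUs for M x N comparisons."""
    # Workers: (M x N) / (1024 x 1024 x 32), rounded up (assumption), capped at 1,000.
    workers = min(max(math.ceil(m * n / (1024 * 1024 * 32)), 1), 1000)
    # Memory per worker: N / 8 MB, clamped to the range 4 GB to 64 GB.
    memory_mb = min(max(n / 8, 4 * 1024), 64 * 1024)
    # One CU provides 4 GB of memory.
    cus = workers * (memory_mb / 1024) / 4
    return {"workers": workers, "memory_per_worker_mb": memory_mb, "cus": cus}


small = estimate_resources(1_000, 1_000)
large = estimate_resources(1_000_000, 1_000_000)
```

With these assumptions, two tables of 1,000 records each need a single worker at the 4 GB minimum, while two tables of 1,000,000 records each reach both caps and produce the 16,000 CU maximum stated above.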

Reference