
Platform For AI: String similarity

Last Updated: Mar 05, 2026

String similarity calculation is a fundamental operation in machine learning that measures how alike or how different two strings are. It is widely used in information retrieval, natural language processing, and bioinformatics. Depending on the task, different algorithms and metrics, such as Levenshtein distance and cosine similarity, are used to identify, match, or cluster similar text data.

Algorithm description

The String Similarity component supports five similarity calculation methods: Levenshtein (Levenshtein Distance), LCS (Longest Common Substring), SSK (String Subsequence Kernel), Cosine, and Simhash_Hamming. The component supports pairwise calculation.

  • The Levenshtein method supports distance and similarity calculations.

    • Distance is represented by the levenshtein parameter.

    • Similarity = 1 - Distance. Similarity is represented by the levenshtein_sim parameter.

  • The LCS method supports distance and similarity calculations.

    • Distance is represented by the lcs parameter.

    • Similarity = 1 - Distance. Similarity is represented by the lcs_sim parameter.

  • The SSK method supports similarity calculation. It is represented by the ssk parameter.

  • The Cosine method supports similarity calculation. It is represented by the cosine parameter.

  • The Simhash_Hamming method uses the SimHash algorithm to map the original text to a 64-bit binary fingerprint. The Hamming distance, which is the number of bit positions at which the two fingerprints differ, is then calculated. This method supports both distance and similarity calculations.

    • Distance is represented by the simhash_hamming parameter.

    • Similarity = 1 - Distance/64.0. Similarity is represented by the simhash_hamming_sim parameter.
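
The relationship between distance and similarity described above can be sketched in Python. This is an illustrative sketch, not the component's implementation. Two points are assumptions: the Levenshtein distance is divided by the length of the longer string before it is subtracted from 1 (this page states only "Similarity = 1 - Distance"), and the 64-bit fingerprints are passed in as plain integers because the hash function that the component uses to build them is not documented here. All function names are hypothetical.

```python
MASK64 = (1 << 64) - 1  # keep only the low 64 bits of a fingerprint


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_sim(a: str, b: str) -> float:
    """1 - normalized distance. Normalizing by the longer length is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


def hamming_distance(fp_a: int, fp_b: int) -> int:
    """Number of bit positions at which two 64-bit fingerprints differ."""
    return bin((fp_a ^ fp_b) & MASK64).count("1")


def simhash_hamming_sim(fp_a: int, fp_b: int) -> float:
    """Similarity = 1 - Distance/64.0, as stated for simhash_hamming_sim."""
    return 1.0 - hamming_distance(fp_a, fp_b) / 64.0


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))           # 3
    print(hamming_distance(0b1011, 0b0010))           # 2
    print(simhash_hamming_sim(0b1011, 0b0010))        # 0.96875
```

For example, two fingerprints that differ in 2 of 64 bits have a simhash_hamming value of 2 and a simhash_hamming_sim value of 1 - 2/64.0 = 0.96875.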

Component configuration

Method 1: Use the GUI

Add the String Similarity component to the Designer workflow. Then, configure the parameters in the right-side pane.

| Parameter type | Parameter | Description |
|---|---|---|
| Fields setting | Columns to append to output table | The columns to append to the output table. |
| Fields setting | First column for similarity calculation | The default value is the name of the first column of the STRING type in the table. |
| Fields setting | Second column for similarity calculation | The default value is the name of the second column of the STRING type in the table. |
| Fields setting | Similarity column in output table | The name of the similarity column in the output table. |
| Parameters setting | Similarity calculation method | The method for similarity calculation. Valid values: levenshtein, levenshtein_sim, lcs, lcs_sim, ssk, cosine, simhash_hamming, and simhash_hamming_sim. Default value: levenshtein_sim. |
| Parameters setting | Substring length | This parameter is required only when the Similarity calculation method parameter is set to ssk, cosine, simhash_hamming, or simhash_hamming_sim. Valid values: (0,100). Default value: 2. |
| Parameters setting | Weight of matching string | This parameter is required only when the Similarity calculation method parameter is set to ssk, simhash_hamming, or simhash_hamming_sim. Valid values: (0,1). Default value: 0.5. |
| Execution tuning | Number of cores for computing | By default, the value is assigned by the system. |
| Execution tuning | Memory size per core (MB) | By default, the value is assigned by the system. |

Method 2: Use PAI commands

You can also configure the String Similarity component by using PAI commands, which you invoke from the SQL Script component. For more information, see SQL Script.

PAI -name string_similarity
    -project algo_public
    -DinputTableName="pai_test_string_similarity"
    -DoutputTableName="pai_test_string_similarity_output"
    -DinputSelectedColName1="col0"
    -DinputSelectedColName2="col1";

| Parameter | Required | Default value | Description |
|---|---|---|---|
| inputTableName | Yes | None | The name of the input table. |
| outputTableName | Yes | None | The name of the output table. |
| inputSelectedColName1 | No | The name of the first column of the STRING type in the table | The name of the first column for the similarity calculation. |
| inputSelectedColName2 | No | The name of the second column of the STRING type in the table | The name of the second column for the similarity calculation. |
| inputAppendColNames | No | None | The columns to append to the output table. |
| inputTablePartitions | No | All partitions | The partitions of the input table. |
| outputColName | No | output | The name of the similarity column in the output table. The name can contain only letters (a-z, A-Z), digits, and underscores (_). It must start with a letter and be no more than 128 bytes in length. |
| method | No | levenshtein_sim | The method for similarity calculation. Valid values: levenshtein, levenshtein_sim, lcs, lcs_sim, ssk, cosine, simhash_hamming, and simhash_hamming_sim. |
| lambda | No | 0.5 | This parameter is required only when the method parameter is set to ssk. Valid values: (0,1). |
| k | No | 2 | This parameter is required only when the method parameter is set to ssk, cosine, simhash_hamming, or simhash_hamming_sim. Valid values: (0,100). |
| lifecycle | No | None | The lifecycle of the output table. The value must be a positive integer. |
| coreNum | No | System-assigned | The number of cores for computing. |
| memSizePerCore | No | System-assigned | The memory size per core, in MB. |
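
The constraints listed for outputColName, k, and lambda can be checked before a command is submitted. The following Python sketch shows one way to do that and to assemble the command text. It is a hypothetical helper, not part of PAI: the function name build_pai_command and its arguments are made up for illustration, and passing method, k, and lambda as quoted -D options is an assumption based on the format of the sample command above. The helper does not check which methods actually use k or lambda.

```python
import re
from typing import Optional

# Valid values of the method parameter, as listed on this page.
VALID_METHODS = {
    "levenshtein", "levenshtein_sim", "lcs", "lcs_sim",
    "ssk", "cosine", "simhash_hamming", "simhash_hamming_sim",
}

# Letters, digits, and underscores only; must start with a letter.
_COLUMN_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def build_pai_command(input_table: str,
                      output_table: str,
                      method: str = "levenshtein_sim",
                      output_col: str = "output",
                      k: Optional[int] = None,
                      lam: Optional[float] = None) -> str:
    """Validate the documented constraints and return the PAI command text."""
    if method not in VALID_METHODS:
        raise ValueError(f"unsupported method: {method}")
    if (not _COLUMN_NAME.fullmatch(output_col)
            or len(output_col.encode("utf-8")) > 128):
        raise ValueError(f"invalid outputColName: {output_col}")
    if k is not None and not 0 < k < 100:        # open interval (0,100)
        raise ValueError("k must be in (0,100)")
    if lam is not None and not 0 < lam < 1:      # open interval (0,1)
        raise ValueError("lambda must be in (0,1)")

    params = {
        "inputTableName": input_table,
        "outputTableName": output_table,
        "outputColName": output_col,
        "method": method,
    }
    if k is not None:
        params["k"] = k
    if lam is not None:
        params["lambda"] = lam

    lines = ["PAI -name string_similarity", "    -project algo_public"]
    # Quoting every value mirrors the sample command; this is an assumption.
    lines += [f'    -D{name}="{value}"' for name, value in params.items()]
    return "\n".join(lines) + ";"


if __name__ == "__main__":
    print(build_pai_command("pai_test_string_similarity",
                            "pai_test_string_similarity_output",
                            method="ssk", k=2, lam=0.5))
```

Omitted optional parameters are left out of the generated command so that the component's own defaults apply.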

References

  • For more information about Designer, see Designer overview.

  • You can also use the String Similarity-Top N component to calculate string similarity and retrieve the top N most similar data records. For more information about this component, see String Similarity-Top N.