Platform for AI: String Similarity - top N

Last Updated: Feb 22, 2024

The String Similarity - top N component calculates the similarity between strings in an input table and strings in a mapping table, and returns the top N mapping-table records that best match each input string. This topic describes how to configure the String Similarity - top N component in Platform for AI (PAI).

Configure the component

You can use one of the following methods to configure the String Similarity - top N component:

Method 1: Configure the component in the PAI console

You can configure the parameters of the String Similarity - top N component in Machine Learning Designer. The following table describes the parameters.

| Tab | Parameter | Description |
|---|---|---|
| Fields Setting | Columns from the Input Table Appended to the Output Table | The names of the input table columns that you want to append to the output table. |
| Fields Setting | Columns from the Mapping Table Appended to the Output Table | The names of the mapping table columns that you want to append to the output table. |
| Fields Setting | Columns from Left Table for Similarity Calculation | The names of the left table columns that are used for similarity calculation. |
| Fields Setting | Columns from the Mapping Table for Similarity Calculation | The names of the mapping table columns that are used for similarity calculation. The component calculates the similarity between each row in the left table and every string in the mapping table, and returns the top N results. |
| Fields Setting | Similarity Column in Output Table | The name of the similarity column in the output table. The name can be up to 128 characters in length and can contain only letters, digits, and underscores (_). The name must start with a letter. Default value: output. |
| Parameters Setting | Number of Similarity Maximums in the End | The value of N, which is the number of highest similarity values to return. The value must be a positive integer. Default value: 10. |
| Parameters Setting | Similarity Calculation Methods | The method that is used for similarity calculation. Valid values: levenshtein_sim (default), lcs_sim, ssk, cosine, and simhash_hamming_sim. |
| Parameters Setting | Length of Substring | This parameter is used only if you set the Similarity Calculation Methods parameter to ssk, cosine, or simhash_hamming_sim. Valid values: (0,100). Default value: 2. |
| Parameters Setting | Weight of Matching String | This parameter is used only if you set the Similarity Calculation Methods parameter to ssk, cosine, or simhash_hamming_sim. Valid values: (0,1). Default value: 0.5. |
| Tuning | Number of Computing Cores | The number of computing cores. By default, the system determines the value. |
| Tuning | Memory Size per Core (MB) | The memory size of each core, in MB. By default, the system determines the value. |
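To make the behavior of the component more concrete, the following Python sketch ranks the strings of a small mapping list against one input string and keeps the N highest scores. It is only an illustration under stated assumptions, not the PAI implementation: the function names are hypothetical, and the normalization 1 - distance / max(length) is an assumed definition of levenshtein_sim that this topic does not confirm.

```python
import heapq


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance, computed one row at a time."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_sim(a: str, b: str) -> float:
    """ASSUMPTION: similarity = 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def top_n_matches(query: str, mapping: list, n: int = 10) -> list:
    """Return up to n (similarity, candidate) pairs, best match first."""
    scored = ((levenshtein_sim(query, candidate), candidate)
              for candidate in mapping)
    return heapq.nlargest(n, scored, key=lambda pair: pair[0])
```

For example, top_n_matches("apple", ["apple", "apply", "banana"], n=2) keeps "apple" and "apply" and drops "banana". Every input string is compared with every mapping string, so the work grows with the product of the two table sizes.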

Method 2: Configure the component by using PAI commands

You can use the SQL Script component to run PAI commands. For more information, see SQL Script. The following sample command calls the component, and the table after the command describes the parameters.

PAI -name string_similarity_topn
    -project algo_public
    -DinputTableName="pai_test_string_similarity_topn"
    -DoutputTableName="pai_test_string_similarity_topn_output"
    -DmapTableName="pai_test_string_similarity_map_topn"
    -DinputSelectedColName="col0"
    -DmapSelectedColName="col1";

| Parameter | Required | Description | Default value |
|---|---|---|---|
| inputTableName | Yes | The name of the input table. | N/A |
| mapTableName | Yes | The name of the mapping table. | N/A |
| outputTableName | Yes | The name of the output table. | N/A |
| inputSelectedColName1 | No | The names of the left table columns that are used for similarity calculation. | Name of the first STRING column in the left table |
| inputSelectedColName2 | No | The names of the mapping table columns that are used for similarity calculation. | Name of the first STRING column in the mapping table |
| inputAppendColNames | No | The names of the input table columns that you want to append to the output table. | N/A |
| inputAppendRenameColNames | No | The aliases of the input table columns that you want to append to the output table. | N/A |
| mapSelectedColName | Yes | The names of the mapping table columns that are used for similarity calculation. | N/A |
| mapAppendColNames | No | The names of the mapping table columns that you want to append to the output table. | N/A |
| mapAppendRenameColNames | No | The aliases of the mapping table columns that you want to append to the output table. | N/A |
| inputTablePartitions | No | The names of the partitions in the input table. | All partitions |
| mapTablePartitions | No | The names of the partitions in the mapping table. | All partitions |
| outputColName | No | The name of the similarity column in the output table. The name can be up to 128 characters in length and can contain only letters, digits, and underscores (_). The name must start with a letter. | output |
| method | No | The method that is used for similarity calculation. Valid values: levenshtein_sim, lcs_sim, ssk, cosine, and simhash_hamming_sim. | levenshtein_sim |
| lambda | No | This parameter is used only if you set the method parameter to ssk, cosine, or simhash_hamming_sim. Valid values: (0,1). | 0.5 |
| k | No | This parameter is used only if you set the method parameter to ssk, cosine, or simhash_hamming_sim. Valid values: (0,100). | 2 |
| lifecycle | No | The lifecycle of the output table. The value must be a positive integer. | N/A |
| coreNum | No | The number of cores that are used. | Specified by the system |
| memSizePerCore | No | The memory size of each core. | Specified by the system |
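If you generate PAI commands from a script, it can help to check values against the documented constraints before you submit the command. The following Python sketch assembles a command string from a few of the parameters described above. The helper name build_topn_command and its argument names are hypothetical, quoting every value mirrors the sample command, and treating k and lambda as flags to emit only for ssk, cosine, and simhash_hamming_sim is an assumption based on the parameter descriptions.

```python
import re

# Naming rule from the parameter table: a letter first, then letters,
# digits, or underscores, 128 characters at most.
_COLUMN_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,127}")

_METHODS = ("levenshtein_sim", "lcs_sim", "ssk", "cosine",
            "simhash_hamming_sim")
_METHODS_USING_K_AND_LAMBDA = ("ssk", "cosine", "simhash_hamming_sim")


def build_topn_command(input_table, map_table, output_table, map_column,
                       output_column="output", method="levenshtein_sim",
                       k=2, lam=0.5):
    """Hypothetical helper: returns a PAI command as a string."""
    if not _COLUMN_NAME.fullmatch(output_column):
        raise ValueError("invalid outputColName: %r" % output_column)
    if method not in _METHODS:
        raise ValueError("unknown method: %r" % method)

    flags = [
        ("inputTableName", input_table),
        ("outputTableName", output_table),
        ("mapTableName", map_table),
        ("mapSelectedColName", map_column),
        ("outputColName", output_column),
        ("method", method),
    ]
    # ASSUMPTION: k and lambda are only worth sending for these methods.
    if method in _METHODS_USING_K_AND_LAMBDA:
        if not 0 < k < 100:
            raise ValueError("k must be in the open interval (0,100)")
        if not 0 < lam < 1:
            raise ValueError("lambda must be in the open interval (0,1)")
        flags += [("k", k), ("lambda", lam)]

    lines = ["PAI -name string_similarity_topn", "    -project algo_public"]
    lines += ['    -D{}="{}"'.format(name, value) for name, value in flags]
    return "\n".join(lines) + ";"
```

Calling build_topn_command("pai_test_string_similarity_topn", "pai_test_string_similarity_map_topn", "pai_test_string_similarity_topn_output", "col1") produces a command in the same layout as the sample, with the default method and output column name spelled out explicitly.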

Resource usage and cost estimates

The algorithm that the String Similarity - top N component uses has a time complexity of O(M × N), where M is the total number of data records and N is the number of data records for which you want to find the best matching strings. To measure the similarity of samples, the component calculates the distance between pairs of samples M × N times. The amount of resources that the algorithm consumes is therefore proportional to the product of M and N.

The String Similarity - top N component can apply for up to 1,000 worker nodes, each with 4 GB to 64 GB of memory. The required number of worker nodes is calculated by using the following formula: M × N/(1024 × 1024 × 32). The memory of each worker node is calculated by using the following formula: N/8 MB. Example: If 1 CU provides 4 GB of memory, this component can consume up to 16,000 CUs, which is calculated by using the following formula: 1000 × 64/4. For more information, see Billing example of Designer (formerly known as Machine Learning Studio).
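The arithmetic above can be sketched in a few lines of Python. The function names are hypothetical, and rounding the worker count up to at least one node is an assumption, because this topic does not state how fractional results are handled. The per-node memory formula is left out because its units are not spelled out precisely enough here to model.

```python
MAX_WORKERS = 1000            # upper limit on worker nodes stated above
MAX_MEMORY_GB_PER_WORKER = 64  # upper limit on memory per worker node


def estimate_workers(m: int, n: int) -> int:
    """Worker nodes from M x N / (1024 x 1024 x 32), capped at 1,000.

    ASSUMPTION: the result is rounded up and is never less than 1.
    """
    pairs_per_worker = 1024 * 1024 * 32
    needed = (m * n + pairs_per_worker - 1) // pairs_per_worker  # ceiling
    return max(1, min(needed, MAX_WORKERS))


def max_compute_units(gb_per_cu: float = 4) -> float:
    """Worst-case CU consumption: 1000 workers x 64 GB / GB per CU."""
    return MAX_WORKERS * MAX_MEMORY_GB_PER_WORKER / gb_per_cu
```

Under these assumptions, estimate_workers(100_000, 10_000) gives 30 worker nodes, estimate_workers(1_000_000, 100_000) hits the cap of 1,000, and max_compute_units(4) reproduces the 16,000 CU figure from the example.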

References

  • For more information about Machine Learning Designer, see Overview of Machine Learning Designer.

  • You can use the String Similarity component to calculate string similarity in industries such as information retrieval, natural language processing, and bioinformatics. For more information about how to use this component, see String Similarity.