This topic describes the Semantic Vector Distance component provided by Machine Learning Designer.

This component finds the extension words or sentences of specified words or sentences based on precomputed semantic vectors, such as the word vectors generated by the Word2Vec component. The extension words or sentences of an item are the items whose vectors are closest to the vector of that item. For example, you can generate a list of the words that are most similar to a given word based on the word vectors returned by the Word2Vec component.
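The idea of "closest vectors" can be illustrated with a minimal Python sketch. It is not the component's implementation: the function name `nearest`, the tiny three-dimensional vectors, and the choice to exclude the query word from its own results are all assumptions made for illustration.

```python
import math

# Hypothetical 3-dimensional word vectors. Real Word2Vec output
# usually has far more dimensions.
vectors = {
    "hello": [0.9, 0.1, 0.0],
    "hi": [0.8, 0.2, 0.0],
    "man": [0.0, 0.9, 0.3],
    "woman": [0.1, 0.8, 0.4],
}


def nearest(word, vectors, top_n=5):
    """Return up to top_n (other_word, distance) pairs, nearest first."""
    query = vectors[word]
    scored = [
        (other, math.dist(query, vec))  # Euclidean distance
        for other, vec in vectors.items()
        if other != word  # assumption: a word is not its own neighbor
    ]
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_n]


# List the two words closest to "hello".
print(nearest("hello", vectors, top_n=2))
```

With these sample vectors, "hi" is returned first because its vector lies closest to that of "hello".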

Configure the component

You can configure the component by using Machine Learning Designer or running a Machine Learning Platform for AI command.
  • Configure the component in Machine Learning Designer
    | Tab | Parameter | Description |
    | --- | --- | --- |
    | Fields Setting | ID Column | The name of the ID column. This parameter is empty by default, which indicates that all vectors in the input table are used for calculation. The ID column contains the ID list imported by using the second input port. Each ID occupies a cell. Examples: 1, 2, 4, 6, 8. |
    | Fields Setting | Vector Columns | The names of the columns that contain vectors. Example: f1,f2. |
    | Parameters Setting | Number of Closest Vectors to Output | The number of closest vectors to return for each vector. Default value: 5. |
    | Parameters Setting | Distance Calculation Mode | The method that is used to calculate the distance between vectors. Valid values: euclidean, cosine, and manhattan. Default value: euclidean. |
    | Parameters Setting | Distance Threshold | The threshold for the distance between vectors. A pair of vectors is included in the output only if the distance between the two vectors is less than this value. Default value: +∞. |
    | Tuning | Computing Cores | The number of cores used for calculation. By default, the system allocates this value automatically. |
    | Tuning | Memory Size per Core (Unit: MB) | The memory size of each core. By default, the system allocates this value automatically. |
  • Configure the component by using a Machine Learning Platform for AI command
    PAI -name SemanticVectorDistance 
        -project algo_public    
        -DinputTableName="test_input"    
        -DoutputTableName="test_output"    
        -DidColName="word"    
        -DvectorColNames="f0,f1,f2,f3,f4,f5"    
        -Dlifecycle=30
    | Parameter | Required | Description | Default value |
    | --- | --- | --- | --- |
    | inputTableName | Yes | The name of the input table. | None |
    | inputTablePartitions | No | The partitions selected from the input table for calculation. | All partitions |
    | outputTableName | Yes | The name of the output table. | None |
    | idTableName | No | The name of the vector ID table for vector calculation. The table contains only a single column, and each row stores a vector ID. This parameter is empty by default, which indicates that all vectors in the input table are used for calculation. | None |
    | idTablePartitions | No | The partitions selected from the ID table for calculation. By default, all partitions are selected. | None |
    | idColName | Yes | The name of the ID column. | 3 |
    | vectorColNames | No | The names of the columns that contain vectors. Example: f1,f2. | None |
    | topN | No | The number of closest vectors to return for each vector. Valid values: [1,+∞). | 5 |
    | distanceType | No | The method that is used to calculate the distance between vectors. Valid values: euclidean, cosine, and manhattan. | euclidean |
    | distanceThreshold | No | The threshold for the distance between vectors. A pair of vectors is included in the output only if the distance between the two vectors is less than this value. Valid values: (0,+∞). | +∞ |
    | lifecycle | No | The lifecycle of the output table. The value must be a positive integer. | None |
    | coreNum | No | The number of cores used for calculation. The value must be a positive integer. | Determined by the system |
    | memSizePerCore | No | The memory size of each core. The value must be a positive integer. | Determined by the system |
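To make the roles of distanceType, topN, and distanceThreshold more concrete, the following Python sketch shows one way these parameters could interact. It is an illustration under stated assumptions, not the component's actual code: the function `closest` and its helpers are hypothetical, cosine distance is assumed to be 1 minus cosine similarity, and a vector is assumed not to be returned as its own neighbor.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine(a, b):
    # Assumption: cosine distance = 1 - cosine similarity.
    # Both vectors must be non-zero, otherwise this divides by zero.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


DISTANCES = {"euclidean": euclidean, "cosine": cosine, "manhattan": manhattan}


def closest(query_id, table, distance_type="euclidean", top_n=5,
            distance_threshold=math.inf):
    """Return rows shaped like (original_id, near_id, distance, rank)."""
    dist = DISTANCES[distance_type]
    query = table[query_id]
    candidates = []
    for other_id, vec in table.items():
        if other_id == query_id:
            continue  # assumption: skip the query vector itself
        d = dist(query, vec)
        if d < distance_threshold:  # keep only pairs below the threshold
            candidates.append((other_id, d))
    candidates.sort(key=lambda pair: pair[1])
    return [(query_id, near_id, d, rank)
            for rank, (near_id, d) in enumerate(candidates[:top_n], start=1)]


# Hypothetical input: ID -> vector (columns f0, f1).
table = {"a": [1, 0], "b": [4, 4], "c": [2, 0]}
print(closest("a", table))
print(closest("a", table, distance_type="manhattan", distance_threshold=2.0))
```

In this sketch the threshold is applied before the top-N cut, so a small distanceThreshold can yield fewer than topN rows for a given ID.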

Example

The output table contains the following four columns: original_id, near_id, distance, and rank.
| original_id | near_id | distance | rank |
| --- | --- | --- | --- |
| hello | hi | 0.2 | 1 |
| hello | xxx | xx | 2 |
| Man | Woman | 0.3 | 1 |
| Man | xx | xx | 2 |
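Once the output table is available, a common next step is to collect the neighbors of each original_id in rank order. The Python sketch below shows one possible way to do this on rows that have already been read into memory. The function name `expansions` is hypothetical, and the rank-2 rows are invented placeholders because the example above does not give concrete values for them.

```python
from collections import defaultdict

# Rows shaped like the output table: (original_id, near_id, distance, rank).
rows = [
    ("hello", "hey", 0.4, 2),   # hypothetical row
    ("hello", "hi", 0.2, 1),    # from the example above
    ("Man", "Woman", 0.3, 1),   # from the example above
    ("Man", "Boy", 0.5, 2),     # hypothetical row
]


def expansions(rows):
    """Map each original_id to its near_id values, ordered by rank."""
    grouped = defaultdict(list)
    # Sort by (original_id, rank) so neighbors are appended in rank order.
    for original_id, near_id, _distance, _rank in sorted(
            rows, key=lambda r: (r[0], r[3])):
        grouped[original_id].append(near_id)
    return dict(grouped)


print(expansions(rows))
# {'Man': ['Woman', 'Boy'], 'hello': ['hi', 'hey']}
```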