This topic describes the Semantic Vector Distance component provided by Machine Learning Studio.

Based on precomputed semantic vectors, such as the word vectors produced by the Word2Vec component, you can compute the extension words or sentences of specified words or sentences. The extension of an item is the set of items whose vectors are closest to the vector of that item. For example, you can generate a list of the words that are most similar to a given word.

Configure the component

You can configure the component by using one of the following methods:
  • Machine Learning Platform for AI console
    Tab | Parameter | Description
    Fields Setting | ID Column | The name of the ID column.
    Fields Setting | Vector Columns | The columns that contain the vectors. Example: f1 or f2.
    Parameters Setting | Number of Closest Vectors to Output | Default value: 5.
    Parameters Setting | Distance Calculation Mode | The following calculation modes are supported: euclidean, cosine, and manhattan. Default value: euclidean.
    Parameters Setting | Distance Threshold | A pair of vectors is included in the output only when the distance between the two vectors is less than this value. Default value: +∞.
    Tuning | Computing Cores | The number of cores used for calculation. The value is automatically allocated.
    Tuning | Memory Size per Core (Unit: MB) | The size of memory required by each core. The value is automatically allocated.
  • PAI command
    PAI -name SemanticVectorDistance 
        -project algo_public    
        -DinputTableName="test_input"    
        -DoutputTableName="test_output"    
        -DidColName="word"    
        -DvectorColNames="f0,f1,f2,f3,f4,f5"    
        -Dlifecycle=30
    Parameter | Required | Description | Default value
    inputTableName | Yes | The name of the input table. | No default value
    inputTablePartitions | No | The partitions that are selected from the input table for calculation. | All partitions of the input table
    outputTableName | Yes | The name of the output table. | No default value
    idTableName | No | The name of the vector ID table for vector calculation. The table contains only one column, and each row stores a vector ID. This parameter is empty by default, which indicates that all vectors in the input table are used for calculation. | No default value
    idTablePartitions | No | The partitions that are selected from the ID table for calculation. By default, all partitions are selected for calculation. | No default value
    idColName | Yes | The name of the ID column. | 3
    vectorColNames | No | A comma-separated list of vector column names, such as f1,f2. | No default value
    topN | No | The number of the closest vectors in the output. Valid values: [1,+∞). | 5
    distanceType | No | The method that is used to calculate the distance between vectors. Valid values: euclidean, cosine, and manhattan. | euclidean
    distanceThreshold | No | The threshold for the distance between vectors. A pair of vectors is included in the output only when the distance between the two vectors is less than this value. Valid values: (0,+∞). | +∞
    lifecycle | No | The lifecycle of the output table. Valid values: positive integers. | No default value
    coreNum | No | The number of cores used for calculation. Valid values: positive integers. | Automatically calculated
    memSizePerCore | No | The size of memory required by each core. Unit: MB. Valid values: positive integers. | Automatically calculated
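
    To illustrate how topN, distanceType, and distanceThreshold might interact, the following Python sketch computes ranked neighbors for a small set of in-memory vectors. It is not the implementation of the component. The function names and the toy vectors are hypothetical, and the sketch assumes that cosine distance means 1 minus cosine similarity, that euclidean distance is not squared, and that a vector is never reported as its own neighbor. The exact formulas used by the component are not stated in this topic.

```python
import math


def euclidean(a, b):
    # Straight-line distance between two vectors of equal length.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine(a, b):
    # Assumption: cosine distance = 1 - cosine similarity.
    # Raises ZeroDivisionError if either vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


DISTANCES = {"euclidean": euclidean, "cosine": cosine, "manhattan": manhattan}


def nearest_rows(vectors, top_n=5, distance_type="euclidean",
                 distance_threshold=math.inf):
    """Return (original_id, near_id, distance, rank) tuples.

    `vectors` maps an ID to a list of floats. For each ID, neighbors whose
    distance is below `distance_threshold` are kept, sorted by distance,
    and truncated to the `top_n` closest.
    """
    dist = DISTANCES[distance_type]
    rows = []
    for original_id, original_vec in vectors.items():
        candidates = []
        for near_id, near_vec in vectors.items():
            if near_id == original_id:
                continue  # assumption: skip the vector itself
            d = dist(original_vec, near_vec)
            if d < distance_threshold:
                candidates.append((d, near_id))
        candidates.sort()
        for rank, (d, near_id) in enumerate(candidates[:top_n], start=1):
            rows.append((original_id, near_id, d, rank))
    return rows


# Hypothetical two-dimensional vectors keyed by ID.
toy_vectors = {"a": [0.0, 0.0], "b": [3.0, 4.0], "c": [0.0, 1.0]}
for row in nearest_rows(toy_vectors, top_n=1):
    print(row)
```

    The sketch compares every pair of vectors, which is adequate only for small inputs. Lowering distance_threshold removes distant pairs before ranking, so an ID can end up with fewer than top_n rows or with none at all.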

Example

The output table contains the following four columns: original_id, near_id, distance, and rank.
original_id near_id distance rank
hello hi 0.2 1
hello xxx xx 2
Man Woman 0.3 1
Man xx xx 2
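
After the output table is downloaded, its rows can be grouped by original_id to obtain the list of extension words for each word. The following Python sketch shows one way to do this. The rows are hypothetical sample data in the four-column layout described above, and the function name is illustrative only.

```python
from collections import defaultdict

# Hypothetical rows in the layout (original_id, near_id, distance, rank).
rows = [
    ("hello", "hi", 0.2, 1),
    ("Man", "Woman", 0.3, 1),
    ("hello", "hey", 0.5, 2),
]


def extension_lists(rows):
    """Map each original_id to its near_id values, ordered by rank."""
    grouped = defaultdict(list)
    # Sort by ID, then by rank, so each list runs from closest to farthest.
    for original_id, near_id, _distance, _rank in sorted(
        rows, key=lambda r: (r[0], r[3])
    ):
        grouped[original_id].append(near_id)
    return dict(grouped)


print(extension_lists(rows))
```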