This topic describes the Word2Vec component provided by Machine Learning Studio.

The Word2Vec component trains a neural network that maps each word to a vector in a K-dimensional space. Operations on these vectors, such as comparing the distance between two of them, reflect semantic relationships between the corresponding words. The input is a word column or a vocabulary, and the output is a vector table and a vocabulary.
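
As a rough illustration of what an operation on word vectors can look like, the following sketch compares vectors by cosine similarity. The helper names (cosine_similarity, most_similar) and the three-dimensional toy vectors are invented for this example; they are not output of the component and not part of any PAI API.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def most_similar(word, vectors):
    """Return the other word whose vector is closest to `word` by cosine."""
    candidates = (w for w in vectors if w != word)
    return max(candidates, key=lambda w: cosine_similarity(vectors[word], vectors[w]))


# Toy vectors, invented for illustration only (a real vector table would
# have layerSize dimensions per word).
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.1, 0.2, 0.95],
}

print(most_similar("king", toy_vectors))
```

With real output, the same comparison would be run on rows read from the output vector table.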

Configure the component

  • Machine Learning Platform for AI console
    Tab | Parameter | Description
    Fields Setting | Word Column | The word column used for training.
    Parameters Setting | Word Feature Dimension | The number of dimensions of the word feature. Valid values: 0 to 1000. Default value: 100.
    Parameters Setting | Language Model | The language model used for training. Valid values: Skip-gram and Cbow. Default value: Skip-gram.
    Parameters Setting | Window Size of Words | The window size of words. Valid values: positive integers. Default value: 5.
    Parameters Setting | Random Window | Specifies whether to use a random window. This option is selected by default.
    Parameters Setting | Minimum Frequency of Words | The minimum frequency of words for truncation. Valid values: positive integers. Default value: 5.
    Parameters Setting | Hierarchical Softmax | Specifies whether to use hierarchical softmax. This option is selected by default.
    Parameters Setting | Negative Sampling | The window size of negative sampling. Default value: 0.
    Parameters Setting | Downsampling Threshold | The threshold for downsampling. Default value: 0.
    Parameters Setting | Initial Learning Rate | The initial learning rate. Valid values: greater than 0. Default value: 0.025.
    Parameters Setting | Iterations | The number of training iterations. Valid values: greater than or equal to 1. Default value: 1.
    Tuning | Cores | The number of cores used for calculation. The value is automatically allocated.
    Tuning | Memory Size per Core | The size of memory required by each core. The value is automatically allocated.
  • PAI command
    pai -name Word2Vec
        -project algo_public
        -DinputTableName=w2v_input
        -DwordColName=word
        -DoutputTableName=w2v_output;
    Parameter | Required | Description | Default value
    inputTableName | Yes | The name of the input vocabulary table. | No default value
    inputTablePartitions | No | The names of the partitions used for word segmentation. Specify each partition in the partition_name=value format. To specify a multi-level partition, use the name1=value1/name2=value2 format. To specify multiple partitions, separate them with commas (,). | No default value
    wordColName | Yes | The name of the word column. Each row in the word column contains a single word. The </s> tag indicates a line feed. | No default value
    inVocabularyTableName | No | The output of the wordcount operation on the input vocabulary. | The wordcount operation that the system performs on the output table
    inVocabularyPartitions | No | The names of the partitions in the output of the wordcount operation on the input vocabulary. | All partitions of the table specified by inVocabularyTableName
    layerSize | No | The number of dimensions of the word feature. Valid values: 0 to 1000. | 100
    cbow | No | The language model used for training. Valid values: 0 (skip-gram model) and 1 (CBOW model). | 0
    window | No | The window size of words. Valid values: positive integers. | 5
    minCount | No | The minimum frequency of words for truncation. Valid values: positive integers. | 5
    hs | No | Specifies whether to use hierarchical softmax. Valid values: 0 (hierarchical softmax is not used) and 1 (hierarchical softmax is used). | 1
    negative | No | The window size of negative sampling. Valid values: positive integers. | 0
    sample | No | The threshold for downsampling. Valid values: 1e-5 to 1e-3. | 0
    alpha | No | The initial learning rate. Valid values: greater than 0. | 0.025
    iterTrain | No | The number of training iterations. Valid values: greater than or equal to 1. | 1
    randomWindow | No | Specifies how the word window size is determined. Valid values: 0 and 1. The value 1 indicates a random value from 1 to 5, and the value 0 indicates that the size is specified by the window parameter. | 1
    outVocabularyTableName | No | The name of the output vocabulary. | No default value
    outputTableName | Yes | The name of the output vector table. | No default value
    lifecycle | No | The lifecycle of the output table. Valid values: positive integers. | No default value
    coreNum | No | The number of cores. This parameter takes effect only when memSizePerCore is also configured. Valid values: positive integers. | Automatically allocated
    memSizePerCore | No | The size of the memory required by each core. This parameter takes effect only when coreNum is also configured. Valid values: positive integers. | Automatically allocated
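
When the command is generated by a script, it can help to check optional parameters against the documented ranges before submitting the job. The sketch below assembles a command string in the same shape as the example above. The function build_word2vec_command is hypothetical and not part of any PAI SDK; it checks only the ranges listed in the parameter table and does not submit anything.

```python
def build_word2vec_command(input_table, word_col, output_table, **options):
    """Assemble a `pai -name Word2Vec` command string (hypothetical helper)."""
    # Range checks copied from the parameter table; parameters that are not
    # listed here are passed through unchecked.
    checks = {
        "layerSize": lambda v: 0 <= v <= 1000,
        "cbow": lambda v: v in (0, 1),
        "window": lambda v: v >= 1,
        "minCount": lambda v: v >= 1,
        "hs": lambda v: v in (0, 1),
        "alpha": lambda v: v > 0,
        "iterTrain": lambda v: v >= 1,
        "randomWindow": lambda v: v in (0, 1),
    }
    for name, value in options.items():
        check = checks.get(name)
        if check is not None and not check(value):
            raise ValueError(f"{name}={value} is outside the documented range")

    # Design choice of this sketch: reject a lone coreNum or memSizePerCore,
    # because the table says each takes effect only when both are configured.
    if ("coreNum" in options) != ("memSizePerCore" in options):
        raise ValueError("coreNum and memSizePerCore must be set together")

    params = {
        "inputTableName": input_table,
        "wordColName": word_col,
        "outputTableName": output_table,
    }
    params.update(options)

    lines = ["pai -name Word2Vec", "    -project algo_public"]
    lines += [f"    -D{name}={value}" for name, value in params.items()]
    return "\n".join(lines) + ";"


# Example: train a CBOW model with a fixed window of 5.
print(build_word2vec_command("w2v_input", "word", "w2v_output",
                             cbow=1, window=5, randomWindow=0))
```

Rejecting a lone coreNum or memSizePerCore is stricter than the documented behavior, where the setting simply does not take effect; the sketch raises an error so that the mistake is visible before the job runs.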

FAQ

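What do I do if the error message "Vocab size is zero! vocab_size: 0" is reported? This message indicates that the dictionary is empty, which can happen when no word in the input reaches the minimum frequency set by minCount. To resolve the issue, set minCount to a smaller value.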
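
The following sketch shows one way an empty dictionary can arise on a small corpus. It assumes that a word is kept when its frequency is at least minCount; the exact comparison used by the component is not stated here, and build_vocabulary is a hypothetical helper, not the component's implementation.

```python
from collections import Counter


def build_vocabulary(words, min_count=5):
    """Count words and keep those that occur at least min_count times.

    Assumption: the threshold is inclusive (count >= min_count).
    """
    counts = Counter(words)
    return {word: count for word, count in counts.items() if count >= min_count}


# A tiny corpus, one word per row, as in the word column.
corpus = ["deep", "learning", "deep", "model", "learning", "deep"]

# With the default threshold of 5, no word survives truncation.
print(len(build_vocabulary(corpus, min_count=5)))

# Lowering the threshold keeps the frequent words.
print(build_vocabulary(corpus, min_count=2))
```

Running a similar count on the input table before training gives a quick check of whether the chosen minCount leaves any words in the dictionary.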