This topic describes the Doc2Vec component provided by Machine Learning Studio.

You can use the Doc2Vec component to map articles to vectors. The input is a vocabulary. The output is a document vector table, a word vector table, or a vocabulary.

You can configure the component by using one of the following methods:

Machine Learning Platform for AI console

| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Document ID Column | The name of the document ID column that is used for training. |
| Fields Setting | Document Content | The words used for training. Separate the words with spaces. |
| Parameters Setting | Dimensions of Word Features | The number of dimensions of the word feature. Valid values: 0 to 1000. Default value: 100. |
| Parameters Setting | Language Model | The language model used for training. Valid values: Skip-gram Model and CBOW Model. Default value: Skip-gram Model. |
| Parameters Setting | Window Size of Words | The window size of words. Valid values: any positive integer. Default value: 5. |
| Parameters Setting | Minimum Frequency of Words | The minimum frequency of words for truncation. Valid values: any positive integer. Default value: 5. |
| Parameters Setting | Hierarchical Softmax | Specifies whether hierarchical softmax is used. Hierarchical softmax is used by default. |
| Parameters Setting | Negative Sampling | The window size of negative sampling. Valid values: any positive integer. Default value: 5. |
| Parameters Setting | Downsampling Threshold | The threshold for downsampling. Valid values: 1e-5 to 1e-3. Default value: 1e-3. |
| Parameters Setting | Initial Learning Rate | The initial learning rate. The value must be greater than 0. Default value: 0.025. |
| Parameters Setting | Training Iterations | The number of training iterations. The value must be greater than or equal to 1. Default value: 1. |
| Parameters Setting | Use Random Window | Specifies how the size of the word window is determined. Valid values: A Random Value Between 1 to 5 and Specified by the Window Parameter. Default value: Specified by the Window Parameter. |
| Tuning | Computing Cores | The number of cores used for calculation. By default, the value is automatically allocated. |
| Tuning | Memory Size per Core (Unit: MB) | The size of memory required by each core. By default, the value is automatically allocated. |
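
The Document Content column expects each document as one string of words separated by spaces. The following is a minimal Python sketch of how such rows could be prepared, assuming the documents are already segmented into word lists. The function name `to_input_rows` and the sample data are hypothetical. The column names `docid` and `text_seg` are borrowed from the PAI command example in this topic.

```python
def to_input_rows(docs):
    """Turn {doc_id: [word, ...]} into (docid, text_seg) rows.

    Words are joined with single spaces, which is the separator that the
    Document Content column expects. Empty tokens are dropped so that no
    double spaces appear in the output.
    """
    rows = []
    for doc_id, words in docs.items():
        cleaned = [w.strip() for w in words if w and w.strip()]
        rows.append((str(doc_id), " ".join(cleaned)))
    return rows


# Hypothetical, already-segmented sample documents.
sample_docs = {
    "doc_1": ["machine", "learning", "platform"],
    "doc_2": ["document", " vector ", "", "table"],
}

rows = to_input_rows(sample_docs)
for docid, text_seg in rows:
    print(f"{docid}\t{text_seg}")
```

How the resulting rows are loaded into the input table depends on your environment and is not covered by this sketch.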

PAI command

PAI -name pai_doc2vec
    -project algo_public
    -DinputTableName=d2v_input
    -DdocIdColName=docid
    -DdocColName=text_seg
    -DoutputWordTableName=d2v_word_output
    -DoutputDocTableName=d2v_doc_output;
| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTableName | Yes | The name of the input vocabulary. | No default value |
| inputTablePartitions | No | The partitions in the input vocabulary that are used for word segmentation. Specify each partition in the partition_name=value format. To specify a multi-level partition, use the name1=value1/name2=value2 format. Separate multiple partitions with commas (,). | No default value |
| docIdColName | Yes | The name of the document ID column used for training. | No default value |
| docColName | Yes | The words used for training. Separate the words with spaces. | No default value |
| layerSize | No | The number of dimensions of the word feature. Valid values: 0 to 1000. | 100 |
| cbow | No | The language model used for training. Valid values: 0 and 1. The value 0 indicates the skip-gram model, and the value 1 indicates the CBOW model. | 0 |
| window | No | The window size of words. Valid values: any positive integer. | 5 |
| minCount | No | The minimum frequency of words for truncation. Valid values: any positive integer. | 5 |
| hs | No | Specifies whether to use hierarchical softmax. Valid values: 0 and 1. The value 0 indicates that hierarchical softmax is not used, and the value 1 indicates that it is used. | 1 |
| negative | No | The window size of negative sampling. Valid values: any positive integer. | 5 |
| sample | No | The threshold for downsampling. Valid values: 1e-5 to 1e-3. | 1e-3 |
| alpha | No | The initial learning rate. The value must be greater than 0. | 0.025 |
| iterTrain | No | The number of training iterations. The value must be greater than or equal to 1. | 1 |
| randomWindow | No | Specifies how the size of the word window is determined. Valid values: 0 and 1. The value 1 indicates a random value from 1 to 5, and the value 0 indicates that the size is specified by the window parameter. | 1 |
| outVocabularyTableName | No | The name of the output vocabulary. | No default value |
| outputWordTableName | Yes | The name of the output word vector table. | No default value |
| outputDocTableName | Yes | The name of the output document vector table. | No default value |
| lifecycle | No | The lifecycle of the output table. Valid values: any positive integer. | No default value |
| coreNum | No | The number of cores. This parameter and the memSizePerCore parameter take effect only when both are configured. Valid values: any positive integer. | Automatically allocated |
| memSizePerCore | No | The size of the memory required by each core. Unit: MB. This parameter and the coreNum parameter take effect only when both are configured. Valid values: any positive integer. | Automatically allocated |
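
The value ranges in the parameter list can be checked before a command is submitted. The following Python sketch illustrates one way to do this, under stated assumptions: `build_doc2vec_command` is a hypothetical helper and not part of PAI, it only renders a command string in the same layout as the sample command in this topic, and it does not submit anything.

```python
# Hypothetical helper: not part of PAI. It only renders a command string
# and checks the value ranges listed in the parameter table.
REQUIRED = ("inputTableName", "docIdColName", "docColName",
            "outputWordTableName", "outputDocTableName")
POSITIVE_INT = ("window", "minCount", "negative", "iterTrain",
                "lifecycle", "coreNum", "memSizePerCore")
ZERO_OR_ONE = ("cbow", "hs", "randomWindow")


def build_doc2vec_command(params):
    """Validate params against the documented ranges and render a PAI command."""
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    for name in POSITIVE_INT:
        if name in params and (not isinstance(params[name], int) or params[name] < 1):
            raise ValueError(f"{name} must be a positive integer")
    for name in ZERO_OR_ONE:
        if name in params and params[name] not in (0, 1):
            raise ValueError(f"{name} must be 0 or 1")
    if "layerSize" in params and not 0 <= params["layerSize"] <= 1000:
        raise ValueError("layerSize must be in the range 0 to 1000")
    if "sample" in params and not 1e-5 <= params["sample"] <= 1e-3:
        raise ValueError("sample must be in the range 1e-5 to 1e-3")
    if "alpha" in params and not params["alpha"] > 0:
        raise ValueError("alpha must be greater than 0")
    # coreNum and memSizePerCore take effect only when both are configured.
    if ("coreNum" in params) != ("memSizePerCore" in params):
        raise ValueError("set coreNum and memSizePerCore together")
    lines = ["PAI -name pai_doc2vec", "    -project algo_public"]
    lines += [f"    -D{name}={value}" for name, value in params.items()]
    return "\n".join(lines) + ";"


command = build_doc2vec_command({
    "inputTableName": "d2v_input",
    "docIdColName": "docid",
    "docColName": "text_seg",
    "outputWordTableName": "d2v_word_output",
    "outputDocTableName": "d2v_doc_output",
    "layerSize": 200,
    "window": 8,
    "iterTrain": 3,
})
print(command)
```

The sketch passes parameter names through unchanged and does not reject names that are missing from the parameter list, so a misspelled optional parameter is not detected. Parameters that are omitted fall back to the default values listed above.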