This topic describes the Doc2Vec component provided by Machine Learning Studio.
You can use the Doc2Vec component to map articles to vectors. The input is a vocabulary. The output is a document vector table, a word vector table, or a vocabulary.
You can configure the component by using one of the following methods:
Machine Learning Platform for AI console
Tab | Parameter | Description |
---|---|---|
Fields Setting | Document ID Column | The name of the document ID column used for training. |
Fields Setting | Document Content | The words used for training. Separate the words with spaces. |
Parameters Setting | Dimensions of Word Features | The number of dimensions of the word feature. Valid values: 0 to 1000. Default value: 100. |
Parameters Setting | Language Model | The language model used for training. Valid values: Skip-gram Model and CBOW Model. Default value: Skip-gram Model. |
Parameters Setting | Window Size of Words | The window size of words. Valid values: any positive integer. Default value: 5. |
Parameters Setting | Minimum Frequency of Words | The minimum frequency of words for truncation. Valid values: any positive integer. Default value: 5. |
Parameters Setting | Hierarchical Softmax | Specifies whether hierarchical softmax is used. Hierarchical softmax is used by default. |
Parameters Setting | Negative Sampling | The window size of negative sampling. Valid values: any positive integer. Default value: 5. |
Parameters Setting | Downsampling Threshold | The threshold for downsampling. Valid values: 1e-5 to 1e-3. Default value: 1e-3. |
Parameters Setting | Initial Learning Rate | The initial learning rate. The value must be greater than 0. Default value: 0.025. |
Parameters Setting | Training Iterations | The number of training iterations. The value must be greater than or equal to 1. Default value: 1. |
Parameters Setting | Use Random Window | Specifies how the size of the word window is determined. Valid values: A Random Value Between 1 to 5 and Specified by the Window Parameter. Default value: Specified by the Window Parameter. |
Tuning | Computing Cores | The number of cores used for calculation. The value is automatically allocated. |
Tuning | Memory Size per Core (Unit: MB) | The size of memory required by each core. The value is automatically allocated. |
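The interaction between Window Size of Words and Use Random Window can be illustrated with a minimal Python sketch. This is not the component's implementation. The function names `effective_window` and `context_words` are hypothetical, and the sketch assumes that the window counts words on each side of the target word, as in common word2vec-style models.

```python
import random


def effective_window(window, random_window, rng):
    """Return the window size to use for one target word.

    Assumption: when the random mode is on, the size is drawn uniformly
    from 1 to 5; otherwise the configured window value is used.
    """
    if random_window:
        return rng.randint(1, 5)  # inclusive on both ends
    return window


def context_words(words, index, size):
    """Return up to `size` words on each side of words[index]."""
    left = words[max(0, index - size):index]
    right = words[index + 1:index + 1 + size]
    return left + right


rng = random.Random(0)  # seeded only to make the sketch repeatable
words = "the quick brown fox jumps".split()
size = effective_window(window=2, random_window=False, rng=rng)
print(context_words(words, 2, size))
```

With the random mode turned off, the context of a word depends only on the configured window size. With the random mode turned on, the context size varies from word to word within the range of 1 to 5.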
PAI command
PAI -name pai_doc2vec
-project algo_public
-DinputTableName=d2v_input
-DdocIdColName=docid
-DdocColName=text_seg
-DoutputWordTableName=d2v_word_output
-DoutputDocTableName=d2v_doc_output;
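If you generate the command from a script, a small helper along the following lines may be convenient. This is a sketch, not part of PAI: `build_doc2vec_command` is a hypothetical function name, and the helper only assembles a command string in the same shape as the preceding example. It does not submit the command.

```python
# Parameters marked as required in the parameter table of this topic.
REQUIRED = (
    "inputTableName",
    "docIdColName",
    "docColName",
    "outputWordTableName",
    "outputDocTableName",
)


def build_doc2vec_command(params, project="algo_public"):
    """Assemble a pai_doc2vec command string from a dict of parameters."""
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    lines = ["PAI -name pai_doc2vec", "-project " + project]
    for name, value in params.items():
        lines.append("-D{}={}".format(name, value))
    return "\n".join(lines) + ";"


command = build_doc2vec_command({
    "inputTableName": "d2v_input",
    "docIdColName": "docid",
    "docColName": "text_seg",
    "outputWordTableName": "d2v_word_output",
    "outputDocTableName": "d2v_doc_output",
})
print(command)
```

Optional parameters such as `layerSize` or `window` can be added to the same dictionary. Parameters that are omitted fall back to the defaults listed in the following table.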
Parameter | Required | Description | Default value |
---|---|---|---|
inputTableName | Yes | The name of the input vocabulary. | No default value |
inputTablePartitions | No | The partitions in the input vocabulary that are used for training. Specify each partition in the partition_name=value format. To specify multi-level partitions, use the name1=value1/name2=value2 format. If you specify multiple partitions, separate them with commas (,). | No default value |
docIdColName | Yes | The name of the document column used for training. | No default value |
docColName | Yes | The words used for training. Separate these words with spaces. | No default value |
layerSize | No | The number of dimensions of the word feature. Valid values: 0 to 1000. | 100 |
cbow | No | The language model used for training. Valid values: 0 and 1. The value 0 indicates the skip-gram model, and the value 1 indicates the CBOW model. | 0 |
window | No | The window size of words. Valid values: any positive integer. | 5 |
minCount | No | The minimum frequency of words for truncation. Valid values: any positive integer. | 5 |
hs | No | Specifies whether to use hierarchical softmax. Valid values: 0 and 1. The value 0 indicates that hierarchical softmax is not used. The value 1 indicates that hierarchical softmax is used. | 1 |
negative | No | The window size of negative sampling. Valid values: any positive integer. | 5 |
sample | No | The threshold for downsampling. Valid values: 1e-5 to 1e-3. | 1e-3 |
alpha | No | The initial learning rate. The value must be greater than 0. | 0.025 |
iterTrain | No | The number of training iterations. The value must be greater than or equal to 1. | 1 |
randomWindow | No | Specifies how the size of the word window is determined. Valid values: 0 and 1. The value 1 indicates a random value from 1 to 5, and the value 0 indicates that the value is specified by the window parameter. | 1 |
outVocabularyTableName | No | The name of the output vocabulary. | No default value |
outputWordTableName | Yes | The name of the output word vector table. | No default value |
outputDocTableName | Yes | The name of the output document vector table. | No default value |
lifecycle | No | The lifecycle of the output table. Valid values: any positive integer. | No default value |
coreNum | No | The number of cores. This parameter and the memSizePerCore parameter take effect only when they are both configured. Valid values: any positive integer. | Automatically allocated |
memSizePerCore | No | The size of the memory required by each core. This parameter and the coreNum parameter take effect only when they are both configured. Valid values: any positive integer. | Automatically allocated |
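The value ranges in the preceding table can be checked before a command is submitted. The following Python sketch is an assumption-based illustration: `validate_doc2vec_params` is a hypothetical helper, and it only mirrors the ranges and defaults documented in this topic. The component may apply additional checks that are not listed here.

```python
def _is_positive_int(value):
    # bool is a subclass of int in Python, so it is excluded explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value > 0


def validate_doc2vec_params(params):
    """Return a list of problems found in a dict of optional parameters."""
    problems = []
    if not 0 <= params.get("layerSize", 100) <= 1000:
        problems.append("layerSize must be in the range of 0 to 1000")
    for flag in ("cbow", "hs", "randomWindow"):
        if params.get(flag, 0) not in (0, 1):
            problems.append(flag + " must be 0 or 1")
    for name in ("window", "minCount", "negative"):
        if not _is_positive_int(params.get(name, 5)):
            problems.append(name + " must be a positive integer")
    if not 1e-5 <= params.get("sample", 1e-3) <= 1e-3:
        problems.append("sample must be in the range of 1e-5 to 1e-3")
    if not params.get("alpha", 0.025) > 0:
        problems.append("alpha must be greater than 0")
    if not params.get("iterTrain", 1) >= 1:
        problems.append("iterTrain must be greater than or equal to 1")
    # coreNum and memSizePerCore take effect only when both are configured.
    if ("coreNum" in params) != ("memSizePerCore" in params):
        problems.append("coreNum and memSizePerCore must be set together")
    return problems


print(validate_doc2vec_params({"layerSize": 200, "window": 8}))
print(validate_doc2vec_params({"sample": 0.1, "coreNum": 4}))
```

An empty list means that no documented constraint is violated. Each entry in a non-empty list names the parameter to correct.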