Doc2Vec algorithm component - Platform for AI

Doc2Vec is a machine learning algorithm that generates document vectors. During training, it treats each document ID as a special word and learns a document vector for that ID alongside the ordinary word vectors. The algorithm therefore maps articles to vectors, so that the semantic relationship between two documents can be compared by their distance in vector space. The input is a vocabulary table. The outputs are a document vector table, a word vector table, and optionally a vocabulary table.
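To illustrate what comparing documents by distance in vector space can look like, the following Python sketch computes cosine similarity between document vectors. The vectors, the document IDs, and the helper names are made up for illustration, and cosine similarity is only one common choice of measure; this topic does not prescribe a specific one.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def most_similar(doc_id, vectors):
    """Return (other_id, similarity) for the document closest to doc_id."""
    target = vectors[doc_id]
    scored = [
        (other_id, cosine_similarity(target, vec))
        for other_id, vec in vectors.items()
        if other_id != doc_id
    ]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical 4-dimensional document vectors, shortened for readability.
doc_vectors = {
    "doc_a": [0.9, 0.1, 0.0, 0.2],
    "doc_b": [0.8, 0.2, 0.1, 0.3],
    "doc_c": [-0.1, 0.9, 0.7, 0.0],
}

nearest_id, similarity = most_similar("doc_a", doc_vectors)
```

A similarity close to 1 means the two vectors point in nearly the same direction, and a similarity close to 0 means they are nearly orthogonal.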

Limits

The Doc2Vec component runs only on the computing resources of MaxCompute.

Configure the component

You can use one of the following methods to configure the Doc2Vec component:

Method 1: Configure the component in the PAI console

You can configure the parameters of the Doc2Vec component on the pipeline page of Machine Learning Designer. The following table describes the parameters.

Tab	Parameter	Description
Fields Setting	Document ID Column	The name of the document ID column that is used for training.
Fields Setting	Document Content	The name of the column that contains the words used for training. Separate the words with spaces.
Parameters Setting	Dimensions of Word Features	The number of dimensions of the word vectors. Valid values: 0 to 1000. Default value: 100.
Parameters Setting	Language Model	The language model used for training. Valid values: Skip-gram Model (default) and CBOW Model.
Parameters Setting	Word Window Size	The window size of words. The value must be a positive integer. Default value: 5.
Parameters Setting	Minimum Frequency of Words	The minimum frequency of words for truncation. The value must be a positive integer. Default value: 5.
Parameters Setting	Hierarchical Softmax	Specifies whether to use hierarchical softmax. By default, Hierarchical Softmax is selected.
Parameters Setting	Negative Sampling	The window size of negative sampling. The value must be a positive integer, or 0 to disable negative sampling. Default value: 5.
Parameters Setting	Downsampling Threshold	The threshold for downsampling. Valid values: 1e-5 to 1e-3. Default value: 1e-3. A value of 0 disables downsampling.
Parameters Setting	Initial Learning Rate	The initial learning rate. The value must be greater than 0. Default value: 0.025.
Parameters Setting	Training Iterations	The number of training iterations. The value must be greater than or equal to 1. Default value: 1.
Parameters Setting	Use Random Window	The mode that is used to determine the size of the word window. Valid values: A Random Value Between 1 to 5 and Specified by the Window Parameter. Default value: Specified by the Window Parameter.
Tuning	Number of Computing Cores	The number of computing cores. By default, the system determines the value.
Tuning	Memory Size per Core (MB)	The memory size of each core. By default, the system determines the value.
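The Document Content column expects each document as a single string of space-separated words. The following Python sketch shows one way to prepare such rows from documents that are already segmented into words. The column names docid and text_seg are borrowed from the example command in this topic, the helper name to_input_rows is hypothetical, and loading the rows into a MaxCompute table is not shown.

```python
def to_input_rows(tokenized_docs):
    """Turn {doc_id: [word, ...]} into rows with space-separated text.

    Empty tokens and surrounding whitespace are dropped so that each
    document becomes one string with single spaces between words.
    """
    rows = []
    for doc_id, tokens in tokenized_docs.items():
        cleaned = [token.strip() for token in tokens if token.strip()]
        rows.append({"docid": doc_id, "text_seg": " ".join(cleaned)})
    return rows


# Hypothetical pre-segmented documents.
tokenized_docs = {
    "d1": ["machine", " learning ", "", "platform"],
    "d2": ["document", "vectors"],
}

rows = to_input_rows(tokenized_docs)
```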

Method 2: Configure the component by using PAI commands

You can also configure the component by running a PAI command, for example from an SQL script. For more information about how to call PAI commands from SQL scripts, see SQL Script. The following example shows a command, and the table that follows describes the parameters.

PAI -name pai_doc2vec
    -project algo_public
    -DinputTableName="d2v_input"
    -DdocIdColName="docid"
    -DdocColName="text_seg"
    -DoutputWordTableName="d2v_word_output"
    -DoutputDocTableName="d2v_doc_output";
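If the command is generated by a script rather than typed by hand, a small helper can assemble it from a parameter dictionary. The following Python sketch only builds the command text and does not submit it. The function name build_doc2vec_command is hypothetical, and quoting every value, including numeric ones, is an assumption based on the style of the example above.

```python
REQUIRED_PARAMS = (
    "inputTableName",
    "docIdColName",
    "docColName",
    "outputWordTableName",
    "outputDocTableName",
)


def build_doc2vec_command(params, project="algo_public"):
    """Render a pai_doc2vec call as PAI command text (it is not executed)."""
    missing = [name for name in REQUIRED_PARAMS if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    lines = ["PAI -name pai_doc2vec", f"    -project {project}"]
    for name, value in params.items():
        # Assumption: every value is passed as a double-quoted string.
        lines.append(f'    -D{name}="{value}"')
    return "\n".join(lines) + ";"


command = build_doc2vec_command(
    {
        "inputTableName": "d2v_input",
        "docIdColName": "docid",
        "docColName": "text_seg",
        "outputWordTableName": "d2v_word_output",
        "outputDocTableName": "d2v_doc_output",
    }
)
```

With the parameters shown, the generated text matches the example command line for line.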

Parameter	Required	Description	Default value
inputTableName	Yes	The name of the input vocabulary table.	N/A
inputTablePartitions	No	The partitions of the input vocabulary table that are used for training. Format: `partition_name=value`. Specify multi-level partitions in the following format: `name1=value1/name2=value2`. Separate multiple partitions with commas (,).	N/A
docIdColName	Yes	The name of the document ID column that is used for training.	N/A
docColName	Yes	The name of the column that contains the words used for training. Separate the words with spaces.	N/A
layerSize	No	The number of dimensions of the word vectors. Valid values: 0 to 1000.	100
cbow	No	The language model used for training. Valid values: 0 and 1. A value of 0 indicates the skip-gram model, and a value of 1 indicates the CBOW model.	0
window	No	The window size of words. The value must be a positive integer.	5
minCount	No	The minimum frequency of words for truncation. The value must be a positive integer.	5
hs	No	Specifies whether to use hierarchical softmax. Valid values: 0 and 1. A value of 0 indicates that hierarchical softmax is not used, and a value of 1 indicates that hierarchical softmax is used.	1
negative	No	The window size for negative sampling. The value must be a positive integer, or 0 to disable negative sampling.	5
sample	No	The threshold for downsampling. Valid values: 1e-5 to 1e-3. A value of 0 disables downsampling.	1e-3
alpha	No	The initial learning rate. The value must be greater than 0.	0.025
iterTrain	No	The number of training iterations. The value must be greater than or equal to 1.	1
randomWindow	No	The mode that is used to determine the size of the word window. Valid values: 0 and 1. A value of 0 indicates that the size is specified by the window parameter, and a value of 1 indicates a random value from 1 to 5.	1
outVocabularyTableName	No	The name of the output vocabulary table.	N/A
outputWordTableName	Yes	The name of the output word vector table.	N/A
outputDocTableName	Yes	The name of the output document vector table.	N/A
lifecycle	No	The lifecycle of the output table. The value must be a positive integer.	N/A
coreNum	No	The number of cores. This parameter takes effect only when memSizePerCore is also configured. The value must be a positive integer.	Automatically allocated
memSizePerCore	No	The memory size of each core. Unit: MB. This parameter takes effect only when coreNum is also configured. The value must be a positive integer.	Automatically allocated
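Before a command is submitted, the constraints in the preceding table can be checked locally. The following Python sketch is one possible check and is not part of PAI. The function name check_doc2vec_params is hypothetical, and treating layerSize and iterTrain as integers is an assumption, because the table states only their ranges.

```python
def check_doc2vec_params(params):
    """Return a list of problems found in optional pai_doc2vec parameters."""
    problems = []

    def is_int(value):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)

    def is_number(value):
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    for name in ("cbow", "hs", "randomWindow"):
        if name in params and not (is_int(params[name]) and params[name] in (0, 1)):
            problems.append(f"{name} must be 0 or 1")

    positive_ints = ("window", "minCount", "iterTrain", "lifecycle",
                     "coreNum", "memSizePerCore")
    for name in positive_ints:
        if name in params and not (is_int(params[name]) and params[name] >= 1):
            problems.append(f"{name} must be a positive integer")

    if "layerSize" in params:
        value = params["layerSize"]
        if not (is_int(value) and 0 <= value <= 1000):
            problems.append("layerSize must be an integer from 0 to 1000")

    if "negative" in params:
        value = params["negative"]
        if not (is_int(value) and value >= 0):
            problems.append("negative must be a positive integer or 0")

    if "sample" in params:
        value = params["sample"]
        if not (is_number(value) and (value == 0 or 1e-5 <= value <= 1e-3)):
            problems.append("sample must be 0 or between 1e-5 and 1e-3")

    if "alpha" in params:
        value = params["alpha"]
        if not (is_number(value) and value > 0):
            problems.append("alpha must be greater than 0")

    # coreNum and memSizePerCore take effect only when both are configured.
    if ("coreNum" in params) != ("memSizePerCore" in params):
        problems.append("coreNum and memSizePerCore must be set together")

    return problems


ok = check_doc2vec_params({"layerSize": 100, "cbow": 0, "window": 5})
bad = check_doc2vec_params({"cbow": 2, "alpha": 0, "coreNum": 4})
```

The function returns an empty list when nothing violates the table, so the caller can decide whether to reject the configuration or only log the problems.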

References

For information about Machine Learning Designer, see Overview of Machine Learning Designer.