Platform For AI:Doc2Vec

Last Updated:Feb 06, 2024

You can use the Doc2Vec component to map documents, such as articles, to vectors. During training, the component treats each document ID as a word in that document. It represents each document as a vector and uses the document ID as context when it learns the word vectors. The input is a vocabulary table. The outputs are a word vector table, a document vector table, and an optional vocabulary table. This topic describes how to configure the Doc2Vec component provided by Platform for AI (PAI).
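
To make the input format concrete, the following MaxCompute SQL sketch creates a small input table with one row per document: a document ID column and a content column whose words are already segmented and separated by spaces. The table name d2v_input and the column names docid and text_seg match the sample PAI command in this topic. The STRING type of the ID column and the sample rows are assumptions for illustration only.

```sql
-- Illustrative input table: one row per document.
-- Assumption: the document ID is stored as STRING.
CREATE TABLE IF NOT EXISTS d2v_input (
    docid    STRING,  -- document ID
    text_seg STRING   -- segmented document content, words separated by spaces
);

-- Made-up sample rows.
INSERT INTO TABLE d2v_input VALUES
    ('doc_1', 'machine learning models map text to vectors'),
    ('doc_2', 'document vectors capture the topic of an article');
```

With a sample this small, the default minimum word frequency of 5 would discard every word, so real training needs a much larger corpus or a lower minimum frequency.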

Limits

The Doc2Vec component supports only MaxCompute computing resources.

Configure the component

You can use one of the following methods to configure the Doc2Vec component:

Method 1: Configure the component in the PAI console

You can configure the parameters of the Doc2Vec component on the pipeline page of Machine Learning Designer. The following table describes the parameters.

| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Document ID Column | The name of the column that contains the document IDs used for training. |
| Fields Setting | Document Content | The name of the column that contains the document content used for training. The words in this column must be separated by spaces. |
| Parameters Setting | Dimensions of Word Features | The number of dimensions of the word vectors. Valid values: 0 to 1000. Default value: 100. |
| Parameters Setting | Language Model | The language model used for training. Valid values: Skip-gram Model (default) and CBOW Model. |
| Parameters Setting | Word Window Size | The size of the word window. The value must be a positive integer. Default value: 5. |
| Parameters Setting | Minimum Frequency of Words | The minimum word frequency. Words that occur less often are truncated. The value must be a positive integer. Default value: 5. |
| Parameters Setting | Hierarchical Softmax | Specifies whether to use hierarchical softmax. By default, Hierarchical Softmax is selected. |
| Parameters Setting | Negative Sampling | The number of negative samples. The value must be 0 or a positive integer. Default value: 5. A value of 0 disables negative sampling. |
| Parameters Setting | Downsampling Threshold | The threshold for downsampling. Valid values: 1e-5 to 1e-3. Default value: 1e-3. A value of 0 disables downsampling. |
| Parameters Setting | Initial Learning Rate | The initial learning rate. The value must be greater than 0. Default value: 0.025. |
| Parameters Setting | Training Iterations | The number of training iterations. The value must be greater than or equal to 1. Default value: 1. |
| Parameters Setting | Use Random Window | Specifies how the size of the word window is determined. Valid values: A Random Value Between 1 to 5 and Specified by the Window Parameter. Default value: Specified by the Window Parameter. |
| Tuning | Number of Computing Cores | The number of computing cores. By default, the system determines the value. |
| Tuning | Memory Size per Core (MB) | The memory size of each core, in MB. By default, the system determines the value. |

Method 2: Configure the component by using PAI commands

You can configure the component by using a PAI command, which you can call from an SQL script. For more information, see SQL Script. The following sample command is followed by a table that describes the parameters.

PAI -name pai_doc2vec
    -project algo_public
    -DinputTableName="d2v_input"
    -DdocIdColName="docid"
    -DdocColName="text_seg"
    -DoutputWordTableName="d2v_word_output"
    -DoutputDocTableName="d2v_doc_output";

| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTableName | Yes | The name of the input vocabulary table. | N/A |
| inputTablePartitions | No | The partitions of the input vocabulary table that are used for training. Format: partition_name=value. To specify multi-level partitions, use the format name1=value1/name2=value2. Separate multiple partitions with commas (,). | N/A |
| docIdColName | Yes | The name of the column that contains the document IDs used for training. | N/A |
| docColName | Yes | The name of the column that contains the document content used for training. The words in this column must be separated by spaces. | N/A |
| layerSize | No | The number of dimensions of the word vectors. Valid values: 0 to 1000. | 100 |
| cbow | No | The language model used for training. Valid values: 0 and 1. A value of 0 indicates the skip-gram model, and a value of 1 indicates the CBOW model. | 0 |
| window | No | The size of the word window. The value must be a positive integer. | 5 |
| minCount | No | The minimum word frequency. Words that occur less often are truncated. The value must be a positive integer. | 5 |
| hs | No | Specifies whether to use hierarchical softmax. Valid values: 0 and 1. A value of 0 indicates that hierarchical softmax is not used, and a value of 1 indicates that hierarchical softmax is used. | 1 |
| negative | No | The number of negative samples. The value must be 0 or a positive integer. A value of 0 disables negative sampling. | 5 |
| sample | No | The threshold for downsampling. Valid values: 1e-5 to 1e-3. A value of 0 disables downsampling. | 1e-3 |
| alpha | No | The initial learning rate. The value must be greater than 0. | 0.025 |
| iterTrain | No | The number of training iterations. The value must be greater than or equal to 1. | 1 |
| randomWindow | No | Specifies how the size of the word window is determined. Valid values: 0 and 1. A value of 0 indicates that the size is specified by the window parameter, and a value of 1 indicates a random value from 1 to 5. | 1 |
| outVocabularyTableName | No | The name of the output vocabulary table. | N/A |
| outputWordTableName | Yes | The name of the output word vector table. | N/A |
| outputDocTableName | Yes | The name of the output document vector table. | N/A |
| lifecycle | No | The lifecycle of the output tables. The value must be a positive integer. | N/A |
| coreNum | No | The number of cores. This parameter takes effect only when memSizePerCore is also configured. The value must be a positive integer. | Automatically allocated |
| memSizePerCore | No | The memory size of each core. This parameter takes effect only when coreNum is also configured. The value must be a positive integer. | Automatically allocated |
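
As an illustration of how the optional parameters combine, the following sketch extends the sample command with explicit values for most of them. All values are chosen only for illustration and are not tuning recommendations. The output table name d2v_vocab_output is hypothetical, and the memory value assumes that memSizePerCore uses the same unit (MB) as the console setting.

```sql
-- Sketch only: every value below is illustrative.
-- cbow=1 selects the CBOW model.
-- hs=0 turns off hierarchical softmax; negative=5 keeps negative sampling enabled.
-- randomWindow=0 uses the fixed window size given by the window parameter.
-- coreNum and memSizePerCore are set together because each takes effect only with the other.
PAI -name pai_doc2vec
    -project algo_public
    -DinputTableName="d2v_input"
    -DdocIdColName="docid"
    -DdocColName="text_seg"
    -DlayerSize="200"
    -Dcbow="1"
    -Dwindow="8"
    -DminCount="2"
    -Dhs="0"
    -Dnegative="5"
    -Dsample="1e-4"
    -Dalpha="0.025"
    -DiterTrain="5"
    -DrandomWindow="0"
    -DoutVocabularyTableName="d2v_vocab_output"
    -DoutputWordTableName="d2v_word_output"
    -DoutputDocTableName="d2v_doc_output"
    -Dlifecycle="7"
    -DcoreNum="4"
    -DmemSizePerCore="4096";

-- Inspect a few rows of the resulting document vector table.
SELECT * FROM d2v_doc_output LIMIT 10;
```

Parameters that are omitted fall back to the defaults listed in the table, so in practice you only need to specify the values that you want to change.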

References

For information about Machine Learning Designer, see Overview of Machine Learning Designer.