Bidirectional Encoder Representations from Transformers (BERT) embedding takes the original text as input and, after feature extraction, outputs a sequence of vectors. The vector obtained by passing the CLS vector through a Dense layer can be used as the vector of the whole sentence. This topic describes the BERT Embedding component provided by Machine Learning Studio.

The BERT Embedding component takes the original text as input and outputs the following vectors produced by BERT:
  • pool_output: C' in the figure, which is the vector after a sentence is encoded.
  • first_token_output: C in the figure.
  • all_hidden_outputs: [C, T1, T2, ..., TN, TSEP] in the figure.
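The relationship between the three outputs can be illustrated with a minimal sketch. It assumes the Dense layer is followed by a tanh activation, as in the standard BERT pooler; this topic only states that the CLS vector is processed by a Dense layer, so the activation, the function names, and the toy weights below are illustrative assumptions rather than the component's actual implementation.

```python
import math


def dense_tanh(vector, weights, bias):
    """Apply a dense layer followed by tanh: tanh(W @ vector + b).

    The tanh activation is an assumption borrowed from the standard
    BERT pooler; the component's exact Dense layer is not documented here.
    """
    return [
        math.tanh(sum(w * x for w, x in zip(row, vector)) + b)
        for row, b in zip(weights, bias)
    ]


def select_outputs(hidden_states, weights, bias):
    """Derive the three output types from per-token hidden vectors.

    hidden_states is [C, T1, ..., TN, TSEP]: one vector per token,
    with the CLS vector C at index 0.
    """
    first_token_output = hidden_states[0]                        # C
    pool_output = dense_tanh(first_token_output, weights, bias)  # C'
    return {
        "all_hidden_outputs": hidden_states,
        "first_token_output": first_token_output,
        "pool_output": pool_output,
    }


# Toy data: 3 tokens, hidden size 2, identity dense weights, zero bias.
hidden = [[0.0, 1.0], [0.5, -0.5], [2.0, 2.0]]
outputs = select_outputs(
    hidden, weights=[[1.0, 0.0], [0.0, 1.0]], bias=[0.0, 0.0]
)
print(outputs["first_token_output"])
print(outputs["pool_output"])
```

In this sketch, first_token_output is simply the first row of all_hidden_outputs, while pool_output is a transformed copy of it, which is why the two vectors generally differ.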
This component has the following features:
  • The command is simple and requires only four parameters in specific cases.
  • End-to-end (E2E) output is supported for MaxCompute tables: you import the original text and the component writes the vectors. You only need to specify the output table name.
  • Columns in the input table can be appended to the output table.

Billing

This component calls GPU computing resources. The billing rule for this component is different from that for other text analysis components. The price for other text analysis components is CNY 1.7 per computing hour. The price for this component is CNY 12 per computing hour in Beijing clusters (P100 GPU computing processor) and CNY 8.4 per computing hour in Shanghai clusters (M40 GPU computing processor).
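As a rough illustration of the price difference, the following sketch multiplies the hourly prices quoted above by a number of computing hours. How computing hours are metered for a job (for example, across several workers or GPUs) is not stated in this topic, so the function takes the billed hours as an input; the dictionary keys and the function name are hypothetical.

```python
# Prices in CNY per computing hour, taken from the billing description.
PRICE_PER_HOUR = {
    "beijing": 12.0,               # P100 GPU clusters
    "shanghai": 8.4,               # M40 GPU clusters
    "other_text_components": 1.7,  # non-GPU text analysis components
}


def estimate_cost(cluster, computing_hours):
    """Return the estimated cost in CNY for the given billed computing hours."""
    if computing_hours < 0:
        raise ValueError("computing_hours must be non-negative")
    return PRICE_PER_HOUR[cluster] * computing_hours


# Example: 2.5 billed computing hours in each cluster type.
for name in PRICE_PER_HOUR:
    print(name, round(estimate_cost(name, 2.5), 2))
```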

Configure the component

You can configure the component by using one of the following methods:
  • Machine Learning Platform for AI console
    | Tab | Parameter | Description |
    | --- | --- | --- |
    | Fields Setting | First Sequence Column | The column in the input table that holds the first text sequence. |
    | Fields Setting | Second Sequence Column | The column in the input table that holds the second text sequence. |
    | Fields Setting | Appended Columns | The columns from the input table to append to the output table. |
    | Parameters Setting | batchSize | Default value: 256. |
    | Parameters Setting | sequenceLength | Default value: 128. |
    | Parameters Setting | Output Schema | Valid values: pool_output, first_token_output, and all_hidden_outputs. |
    | Parameters Setting | Model | Valid values: pai-bert-base-zh, pai-bert-small-zh, and pai-bert-large-zh. |
    | Tuning | Workers | Valid values: 1, 2, 3, and 4. |
    | Tuning | GPUs per Worker | Valid values: 1 and 2. |
    | Tuning | CPUs per Worker | Default value: 1. |
  • PAI command
    PAI -name ez_bert_feat_ext
        -DinputTable=odps://{project}/tables/{Table name}
        -DoutputTable=odps://{project}/tables/{Table name}
        -DfirstSequence=col0
        -DsecondSequence=''
        -DappendCols=col1,col2,col0
        -DoutputSchema=pool_output,first_token_output,all_hidden_outputs
        -DsequenceLength=128
        -DmodelName=pai-bert-base-zh
        -DbatchSize=100
        -DworkerCount=1
        -DworkerCPU=1
        -DworkerGPU=1
        -Dbuckets=oss://atp-modelzoo/tmp/?role_arn=${role_arn}&host=oss-cn-hangzhou.aliyuncs.com
    | Parameter | Required | Description | Default value |
    | --- | --- | --- | --- |
    | inputTable | Yes | The name of the input table for feature extraction. The value must be a string in the format of project.table. | No default value |
    | outputTable | Yes | The name of the output feature table. The value must be a string in the format of project.table. | No default value |
    | firstSequence | Yes | The column that corresponds to the first text sequence in the input table. The value must be a string. | No default value |
    | secondSequence | No | The column that corresponds to the second text sequence in the input table. The value must be a string. | Empty |
    | appendCols | No | The names of the columns added to the output table from the input table. The value must be a string. | Empty |
    | outputSchema | No | The features required in the output. The value must be a string. | 'pool_output', 'pool_output,first_token_output,all_hidden_outputs' (multiple features supported) |
    | sequenceLength | No | The maximum length of a sequence. The value must be an integer in the range of [1,512]. | 128 |
    | modelName | No | The name of the pre-trained model. The value must be a string. | pai-bert-base-zh |
    | batchSize | No | The batch size for feature extraction. The value must be an integer. | 256 |
    | workerCount | No | The number of workers. The value must be an integer. | 1 |
    | workerGPU | No | The number of GPUs for each worker. The value must be an integer. | 1 |
    | workerCPU | No | The number of CPUs for each worker. The value must be an integer. | 1 |
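The parameter rules above can be checked before a job is submitted. The following is a minimal sketch, not an official client: it only assembles the command text for ez_bert_feat_ext and validates the constraints documented in this topic (required parameters, the sequenceLength range, and the outputSchema values). The helper name build_pai_command and the project and table names in the example are hypothetical.

```python
REQUIRED = ("inputTable", "outputTable", "firstSequence")
VALID_SCHEMAS = {"pool_output", "first_token_output", "all_hidden_outputs"}


def build_pai_command(params):
    """Validate params and return the PAI command text as a string.

    Only the constraints stated in the parameter table are checked;
    the command is not submitted anywhere.
    """
    missing = [name for name in REQUIRED if not params.get(name)]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))

    if "sequenceLength" in params:
        if not 1 <= int(params["sequenceLength"]) <= 512:
            raise ValueError("sequenceLength must be in the range [1, 512]")

    if "outputSchema" in params:
        unknown = set(params["outputSchema"].split(",")) - VALID_SCHEMAS
        if unknown:
            raise ValueError("unknown outputSchema values: "
                             + ", ".join(sorted(unknown)))

    lines = ["PAI -name ez_bert_feat_ext"]
    lines += [f"    -D{name}={value}" for name, value in params.items()]
    return "\n".join(lines)


# Hypothetical project and table names, for illustration only.
example_cmd = build_pai_command({
    "inputTable": "odps://my_project/tables/text_input",
    "outputTable": "odps://my_project/tables/text_vectors",
    "firstSequence": "col0",
    "outputSchema": "pool_output,first_token_output",
    "sequenceLength": 128,
    "modelName": "pai-bert-base-zh",
})
print(example_cmd)
```

Validating locally in this way catches a missing required parameter or an out-of-range sequenceLength before any GPU computing hours are consumed.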