Machine Learning Platform for AI (PAI) provides you with a variety of trained natural language processing (NLP) models, such as the Bidirectional Encoder Representations from Transformers (BERT)-based text vectorization model.

NLP is a subfield of artificial intelligence (AI) and linguistics. You can use NLP to extract information from natural language text. NLP applies to the following scenarios:
  • Text classification: news labeling, sentiment analysis, text anti-spam, and classification of commodity reviews.
  • Text matching: Q&A matching, similarity matching for sentences, natural language inference, and conversation retrieval.
  • Sequence labeling: named entity recognition (NER) and sentiment word extraction.
  • Feature extraction: The extracted text features can be used to process text or applied to other fields, such as computer vision.
Model Hub of PAI allows you to deploy the preceding services. It also provides the BERT-based feature extraction model.

Go to Model Hub

To go to Model Hub, perform the following steps:
  1. Log on to the PAI console.
  2. In the left-side navigation pane, click Model Management and Optimization.
  3. On the Model Management page, click the Model Hub tab.

BERT-based text vectorization model

  • Overview
    You can fine-tune the trained model that uses BERT. The vectors that BERT generates are also useful on their own. For example, when you use BERT to extract features, you enter a text sequence and a vector sequence is returned. After the CLS vector is processed by the Dense layer, the resulting vector can be used as the vector of the whole sentence.
    [Figure: Feature extraction]
    When you enter a sentence, the sentence is automatically split into subtokens in the following format and then vectorized: [CLS, tok1, tok2, ..., tokN, SEP]. The following vector types can be returned:
    • pool_output: the vector of the encoded sentence. The vector corresponds to C' in the figure.
    • first_token_output: the vector of the first token. The vector corresponds to C in the figure.
    • all_hidden_outputs: The vectors correspond to [C, T1, T2, ..., TN, TSEP] in the figure.
  • Input format
    The input data must be in the JSON format. It contains the following fields:
    • id: the ID of the text.
    • first_sequence: The value of the field is the first text string.
    • second_sequence: The value of the field is the second text string. The value can be empty.
    • sequence_length: the length of the text strings, which cannot exceed 512.
    {
        "id": "The ID of the text",
        "first_sequence": "The first text string",
        "second_sequence": "The second text string, which can be empty",
        "sequence_length": 128
    }
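
The input fields above can be assembled programmatically. The following is a minimal Python sketch, not part of the PAI SDK: the helper name build_request and the lower bound of 1 for sequence_length are assumptions made for illustration. Only the field names and the upper bound of 512 come from the description above.

```python
import json

MAX_SEQUENCE_LENGTH = 512  # upper bound stated in the field description


def build_request(text_id, first_sequence, second_sequence="", sequence_length=128):
    """Return the JSON request body as a string (hypothetical helper)."""
    # The lower bound of 1 is an assumption; the document only states the maximum.
    if not 1 <= sequence_length <= MAX_SEQUENCE_LENGTH:
        raise ValueError("sequence_length must be between 1 and 512")
    payload = {
        "id": str(text_id),
        "first_sequence": first_sequence,
        "second_sequence": second_sequence,  # may be an empty string
        "sequence_length": sequence_length,
    }
    # ensure_ascii=False keeps non-ASCII text readable in the serialized body.
    return json.dumps(payload, ensure_ascii=False)


body = build_request("1667", "How can I increase the credit limit?")
print(body)
```

Validating sequence_length before sending the request surfaces an out-of-range value locally instead of relying on the service to reject it.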
  • Output format
    The output format is JSON. The output data contains the following fields.
    Field | Description | Shape | Type
    pool_output | A 768-dimensional vector whose values are separated by commas (,). The vector represents the encoded sentence and corresponds to C' in the figure. | [] | STRING
    first_token_output | A 768-dimensional vector whose values are separated by commas (,). The vector corresponds to C in the figure. | [] | STRING
    all_hidden_outputs | A sequence of sequence_length vectors, each of which is 768-dimensional. The values in a vector are separated by commas (,) and the vectors are separated by semicolons (;). The vectors correspond to [C, T1, T2, ..., TN, TSEP] in the figure. | [] | STRING
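
Because every output field is returned as a delimited string, a client typically converts the strings back into numeric vectors. The following Python sketch shows one way to do this. The helper names parse_vector and parse_output are hypothetical, and the sample response uses 3-dimensional toy vectors for brevity, whereas a real response is described above as 768-dimensional.

```python
import json


def parse_vector(text):
    """Split a comma-separated string into a list of floats."""
    return [float(value) for value in text.split(",")]


def parse_output(raw_json):
    """Decode the vector fields of a response body (hypothetical helper)."""
    data = json.loads(raw_json)
    return {
        "id": data["id"],
        "pool_output": parse_vector(data["pool_output"]),
        "first_token_output": parse_vector(data["first_token_output"]),
        # One vector per token position; positions are separated by semicolons.
        "all_hidden_outputs": [
            parse_vector(part) for part in data["all_hidden_outputs"].split(";")
        ],
    }


# Toy response with 3-dimensional vectors and 2 positions, for illustration only.
sample = json.dumps({
    "id": "demo",
    "pool_output": "0.1,0.2,0.3",
    "first_token_output": "0.4,0.5,0.6",
    "all_hidden_outputs": "0.4,0.5,0.6;0.7,0.8,0.9",
})
result = parse_output(sample)
print(len(result["pool_output"]), len(result["all_hidden_outputs"]))
```

The parsed all_hidden_outputs value is a list of lists, so it can be passed directly to a numeric library if matrix operations are needed.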
  • Test data
    // Input.
    {
        "id": "1667",
        "first_sequence": "How can I increase the credit limit of Ant Credit Pay to be used in Double 11?",
        "second_sequence": "",
        "sequence_length": 128
    }
    
    // Output.
    {
        "id": "1667",
        "pool_output": "0.999340713024,...,0.836870908737",
        "first_token_output": "0.789340713024,...,0.536870908737",
        "all_hidden_outputs": "0.999340713024,...,0.836870908737;... ;0.899340713024,...,0.936870908737"
    }