Machine Learning Platform for AI provides Natural Language Processing (NLP) models such as BERT-based text vectorization.

NLP is a branch of Artificial Intelligence (AI) and linguistics. You can use NLP to extract information from natural language texts. NLP applies to the following scenarios:
  • Text classification: news labeling, sentiment analysis, spam email identification, and classification of commodity reviews.
  • Text matching: Q&A matching, similarity matching for sentences, natural language inferences, and conversation retrieval.
  • Sequence labeling: named entity recognition (NER) and sentiment word extraction.
  • Feature extraction: The extracted features can be used to process texts or applied to other fields, such as computer vision.
Model Hub of Machine Learning Platform for AI allows you to deploy the preceding services. It also provides the BERT-based feature extraction model.

BERT-based text vectorization model

  • Overview
    You can fine-tune the pretrained BERT model. The vectors that BERT generates are also valuable as features in their own right. When you use BERT for feature extraction, you enter a text sequence and a vector sequence is returned. After the CLS vector is processed in the Dense layer, the generated vector can be used as the vector of the whole sentence.
    Figure: Feature extraction
    When you input a sentence, it is automatically split into subtokens in the format [CLS, tok1, tok2, ..., tokN, SEP], and each subtoken is vectorized. The following vector types can be returned:
    • pool_output: The vectors of the encoded sentence. The vectors correspond to C' in the figure.
    • first_token_output: The vectors correspond to C in the figure.
    • all_hidden_outputs: The vectors correspond to [C, T1, T2, ..., TN, TSEP] in the figure.
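    The relationship between the three vector types can be sketched in a few lines of Python. This is only an illustration: the tanh activation follows the original BERT pooler and is an assumption here, the weights are random placeholders, and the helper names (dense_tanh, split_outputs) are hypothetical. A small dimension is used instead of 768 to keep the demo fast.

```python
import math
import random


def dense_tanh(vec, weights, bias):
    """Dense layer followed by tanh (the activation is an assumption)."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, vec)) + b)
        for row, b in zip(weights, bias)
    ]


def split_outputs(hidden_states, weights, bias):
    """Derive the three vector types from the per-token hidden states."""
    first_token_output = hidden_states[0]  # C: vector of the CLS token
    pool_output = dense_tanh(first_token_output, weights, bias)  # C'
    return {
        "pool_output": pool_output,
        "first_token_output": first_token_output,
        "all_hidden_outputs": hidden_states,  # [C, T1, ..., TN, TSEP]
    }


# Demo with placeholder values: 5 tokens, 4 dimensions instead of 768.
rng = random.Random(0)
dim, tokens = 4, 5
hidden = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(tokens)]
weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
bias = [0.0] * dim

outputs = split_outputs(hidden, weights, bias)
print(len(outputs["pool_output"]), len(outputs["all_hidden_outputs"]))
```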
  • Input format
    The input data must be in the JSON format. It contains the following fields:
    • first_sequence: The value of the field is the first text string.
    • second_sequence: The value of the field is the second text string. The value can be empty.
    • sequence_length: the length of the text strings, which cannot exceed 512.
    • output_schema: the vector types to be returned. If you select more than one vector type, separate them with commas (,). Available options are pool_output, first_token_output, and all_hidden_outputs.
    {
        "first_sequence": "the first text string",
        "second_sequence": "the second text string that can be empty",
        "sequence_length": 128,
        "output_schema": "the vector types to be returned"
    }
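    As a minimal sketch, the request body described above could be assembled and checked in Python before it is sent. The function name build_request and its defaults are hypothetical; only the field names, the limit of 512, and the three output options come from this document.

```python
import json

VALID_OUTPUTS = ("pool_output", "first_token_output", "all_hidden_outputs")
MAX_SEQUENCE_LENGTH = 512  # upper bound stated for sequence_length


def build_request(first_sequence, second_sequence="", sequence_length=128,
                  output_schema=("pool_output",)):
    """Return the JSON request body as a string.

    output_schema is a list or tuple of vector type names; it is joined
    with commas as the input format requires.
    """
    if not 1 <= sequence_length <= MAX_SEQUENCE_LENGTH:
        raise ValueError("sequence_length must be between 1 and 512")
    unknown = [name for name in output_schema if name not in VALID_OUTPUTS]
    if unknown:
        raise ValueError("unknown output types: " + ", ".join(unknown))
    payload = {
        "first_sequence": first_sequence,
        "second_sequence": second_sequence,  # may be an empty string
        "sequence_length": sequence_length,
        "output_schema": ",".join(output_schema),
    }
    return json.dumps(payload, ensure_ascii=False)


body = build_request(
    "the first text string",
    output_schema=["pool_output", "first_token_output"],
)
print(body)
```

    Validating on the client side is a design choice for this sketch; the service may apply its own checks.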
  • Output format
    The output format is JSON. The output data contains the following fields.
    Field | Description | Shape | Type
    pool_output | A 768-dimensional vector whose values are separated with commas (,). The vector represents the encoded sentence and corresponds to C' in the figure. | [] | STRING
    first_token_output | A 768-dimensional vector whose values are separated with commas (,). The vector corresponds to C in the figure. | [] | STRING
    all_hidden_outputs | A sequence of sequence_length vectors, each with 768 dimensions. Values within a vector are separated with commas (,) and vectors are separated with semicolons (;). The vectors correspond to [C, T1, T2, ..., TN, TSEP] in the figure. | [] | STRING
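    Because every field is returned as a STRING, a client usually has to convert the values back into numbers. The following sketch shows one way to do this; the function names are hypothetical and the sample response uses made-up 3-dimensional vectors rather than real 768-dimensional model output.

```python
import json


def parse_vector(text):
    """Convert a comma-separated string into a list of floats."""
    return [float(value) for value in text.split(",")]


def parse_response(body):
    """Parse the JSON response body into lists of floats.

    Only the fields present in the response are converted.
    """
    data = json.loads(body)
    parsed = {}
    for name in ("pool_output", "first_token_output"):
        if name in data:
            parsed[name] = parse_vector(data[name])
    if "all_hidden_outputs" in data:
        # Semicolons separate the per-token vectors.
        parsed["all_hidden_outputs"] = [
            parse_vector(chunk)
            for chunk in data["all_hidden_outputs"].split(";")
        ]
    return parsed


# Made-up sample with 3-dimensional vectors, for illustration only.
sample = json.dumps({
    "pool_output": "0.1,0.2,0.3",
    "all_hidden_outputs": "0.1,0.2,0.3;0.4,0.5,0.6",
})
vectors = parse_response(sample)
print(vectors["pool_output"], len(vectors["all_hidden_outputs"]))
```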
  • Test data
    # Input.
    {
        "first_sequence": "How can I increase the credit limit of Ant Credit Pay to be used in Double 11?",
        "second_sequence": "",
        "sequence_length": 128,
        "output_schema": "pool_output,first_token_output,all_hidden_outputs"
    }
    
    # Output.
    {
        "pool_output": "0.999340713024,...,0.836870908737",
        "first_token_output": "0.789340713024,...,0.536870908737",
        "all_hidden_outputs": "0.999340713024,...,0.836870908737;... ;0.899340713024,...,0.936870908737"
    }