
Platform for AI: Text Summarization

Last Updated: Jan 19, 2024

Text summarization is the process of extracting the key information from long or repetitive texts. For example, a headline is a summary of a news article. You can use the Text Summarization Training component of Platform for AI (PAI) to train models that generate headlines summarizing the main points of news articles. This topic describes how to configure the Text Summarization Training component.

Limits

The Text Summarization Training component can use only Deep Learning Containers (DLC) computing resources.

Model architecture

The model uses the standard Transformer architecture, which consists of an encoder and a decoder. The encoder converts the input text into hidden representations, and the decoder generates the output text from those representations. During training, the inputs are the original news articles and the outputs are the headlines.
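The encoder-decoder generation loop can be sketched in pure Python with stub components. This is a hypothetical toy that mirrors only the control flow, not PAI's actual Transformer implementation: the "encoder" and "decoder" below are illustrative stand-ins.

```python
# Toy sketch of encoder-decoder generation: the encoder turns the source
# text into a "memory", and the decoder emits the output one token at a
# time, conditioned on that memory and the tokens generated so far.
# The scoring rule here is a hypothetical stand-in for a real Transformer.

EOS = "</s>"

def encode(source_tokens):
    """Stub encoder: the 'memory' is just source token counts."""
    memory = {}
    for tok in source_tokens:
        memory[tok] = memory.get(tok, 0) + 1
    return memory

def decode_step(memory, prefix):
    """Stub decoder: pick the most frequent source token not yet emitted."""
    remaining = {t: c for t, c in memory.items() if t not in prefix}
    if not remaining:
        return EOS
    return max(remaining, key=lambda t: (remaining[t], t))

def generate(source_tokens, max_len=5):
    """Autoregressive loop: decode token by token until EOS or max_len."""
    memory = encode(source_tokens)
    prefix = []
    while len(prefix) < max_len:
        tok = decode_step(memory, prefix)
        if tok == EOS:
            break
        prefix.append(tok)
    return prefix

print(generate(["pai", "trains", "pai", "models", "pai"]))
# → ['pai', 'trains', 'models']
```

A real decoder scores every vocabulary token with attention over the encoder states; the loop structure, however, is the same.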

Usage notes

You can connect the input port of the Text Summarization Training component to the Sentence Splitting component to split a text into rows, each of which contains only one sentence.
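The effect of the sentence-splitting step can be approximated in a few lines of Python. This is a simplified sketch; the actual Sentence Splitting component's segmentation rules may differ.

```python
import re

def split_sentences(text):
    """Split text into one sentence per row, breaking after Chinese or
    Western sentence-ending punctuation. Simplified stand-in for the
    Sentence Splitting component."""
    # Lookbehind keeps the terminator attached to its sentence; empty
    # trailing fragments are dropped.
    parts = re.split(r"(?<=[。！？.!?])\s*", text)
    return [p for p in parts if p]

for row in split_sentences("今天天气很好。我们去公园!Then we went home."):
    print(row)
```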

Configure the component in the PAI console

You can configure parameters for the Text Summarization Training component in Machine Learning Designer.

  • Input ports

    | Input port (from left to right) | Data type | Recommended upstream component | Required |
    | --- | --- | --- | --- |
    | Training data | OSS | Read File Data | Yes |
    | Validation data | OSS | Read File Data | Yes |

  • Component parameters

    | Tab | Parameter | Description |
    | --- | --- | --- |
    | Fields Setting | Input Schema | The text columns of the input file. Default value: title_tokens:str:1,content_tokens:str:1. |
    | | TextColumn | The name of the column that contains the original text in the input table. Default value: content_tokens. |
    | | SummaryColumn | The name of the column that contains the summary in the input table. Default value: title_tokens. |
    | | OSS Directory for Alink Model | The Object Storage Service (OSS) directory that stores the generated text summarization model. |
    | Parameters Setting | Pretrained Model | The name of the pre-trained model. Default value: alibaba-pai/mt5-title-generation-zh. |
    | | batchSize | The number of samples processed per batch. Type: INT. Default value: 8. If the model is trained on multiple servers that have multiple GPUs, this parameter specifies the number of samples processed by each GPU per batch. |
    | | sequenceLength | The maximum length of a sequence that the model can process. Type: INT. Valid values: 1 to 512. Default value: 512. |
    | | numEpochs | The number of epochs for model training. Type: INT. Default value: 3. |
    | | LearningRate | The learning rate during model training. Type: FLOAT. Default value: 3e-5. |
    | | Save Checkpoint Steps | The number of training steps between evaluations; the system evaluates the model and saves the optimal model after each interval. Default value: 150. |
    | | The model language | Valid values: zh (Chinese) and en (English). |
    | | Whether to copy text from input while decoding | Specifies whether to copy text from the input table to the output table. Valid values: false (default) and true. |
    | | The Minimal Length of the Predicted Sequence | The minimum length of the output text. Type: INT. Default value: 12. |
    | | The Maximal Length of the Predicted Sequence | The maximum length of the output text. Type: INT. Default value: 32. |
    | | The Minimal Non-Repeated N-gram Size | The minimum size of n-grams that must not repeat in the output. Type: INT. Default value: 2. For example, if you set this parameter to 1, the output text does not contain repeated strings such as "天天". |
    | | The Number of Beam Search Scope | The beam width that beam search uses to select the best candidate sequences. Type: INT. Default value: 5. A greater value indicates a longer search time. |
    | | The Number of Returned Candidate Sequences | The number of top candidate sequences that the model returns. Type: INT. Default value: 5. |
    | Execution Tuning | GPU Machine type | The GPU-accelerated instance type of the computing resource. Default value: gn5-c8g1.2xlarge. |

  • Output ports

    | Output port | Data type | Recommended downstream component | Required |
    | --- | --- | --- | --- |
    | output model | The OSS path of the output model, which is the same as the OSS Directory for Alink Model (ModelSavePath) parameter that you set on the Fields Setting tab. The output model is stored in this path in the SavedModel format. | Text Summarization Prediction | No |
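The beam-search parameters above (beam width and number of returned candidate sequences) can be illustrated with a toy search over a hypothetical next-token probability table. The table and vocabulary below are invented for illustration; a real model conditions on the whole prefix rather than one-step history.

```python
import math

# Hypothetical next-token probabilities: next_probs[prev][token].
next_probs = {
    "<s>":  {"big": 0.6, "new": 0.4},
    "big":  {"news": 0.7, "deal": 0.3},
    "new":  {"deal": 0.8, "news": 0.2},
    "news": {"</s>": 1.0},
    "deal": {"</s>": 1.0},
}

def beam_search(beam_width, num_return, max_len=4):
    beams = [(0.0, ["<s>"])]           # (cumulative log-prob, tokens)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == "</s>":      # finished beams carry over as-is
                candidates.append((score, seq))
                continue
            for tok, p in next_probs[seq[-1]].items():
                candidates.append((score + math.log(p), seq + [tok]))
        # Keep only the best `beam_width` candidates at each step.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams[:num_return]

for score, seq in beam_search(beam_width=3, num_return=2):
    print(round(math.exp(score), 2), " ".join(seq))
```

A wider beam explores more candidates per step (hence the longer search time noted above), and `num_return` simply truncates the final ranked list.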

Examples

The following figure shows a sample workflow in which the Text Summarization Training component is used. In this example, the components are configured and the pipeline is run in the following manner:

  1. Prepare a training dataset (cn_train.txt) and an evaluation dataset (cn_dev.txt), and upload them to an OSS bucket. The training and evaluation datasets used in this example are tab-delimited TXT files.

    You can also upload CSV files to MaxCompute by running the Tunnel commands on a MaxCompute client. For more information about how to install and configure a MaxCompute client, see MaxCompute client (odpscmd). For more information about Tunnel commands, see Tunnel commands.

  2. Use the Read File Data - 1 and Read File Data - 2 components to read the training dataset and the evaluation dataset. Set the OSS Data Path parameter of the Read File Data component to the OSS path in which the training dataset and the evaluation dataset are stored.

  3. Configure the training dataset and evaluation dataset as the input files of the Text Summarization Training-1 component and set the other parameters. For more information, see the "Configure the component in the PAI console" section of this topic.

  4. Click the Run icon to run the pipeline. After the pipeline finishes running, you can view the output in the OSS path specified by the ModelSavePath parameter of Text Summarization Training-1.
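The tab-delimited format from step 1 can be sketched as follows, with the headline and article columns corresponding to title_tokens and content_tokens. The example rows and the temporary path are illustrative; the real cn_train.txt contains Chinese news data.

```python
import csv
import os
import tempfile

# Each row holds a headline and the article body, separated by a tab
# (illustrative rows only).
rows = [
    ("PAI releases new component", "Platform for AI today announced ..."),
    ("Weather improves nationwide", "Forecasters said on Monday that ..."),
]

# Write the file the way it would be prepared before uploading to OSS.
path = os.path.join(tempfile.mkdtemp(), "cn_train.txt")
with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(rows)

# Read the file back the way a training job would consume it.
with open(path, encoding="utf-8", newline="") as f:
    for title, content in csv.reader(f, delimiter="\t"):
        print(title, "->", content[:20])
```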

References