Platform for AI: Text Summarization

Last Updated: Feb 19, 2024

The Text Summarization component uses the TextRank model to automatically extract an abstract from a document. An abstract is a short, coherent text that accurately reflects the main idea of the document. This topic describes how to configure the Text Summarization component provided by Platform for AI (PAI).

Limits

The Text Summarization component can run only on MaxCompute computing resources.

Usage notes

You can use a Sentence Splitting component as an upstream component to split the text into rows. Each row contains only one sentence.
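The row layout that the Text Summarization component expects can be sketched in Python as follows. This is only an illustration: the regular expression is a naive stand-in for the Sentence Splitting component, whose actual splitting rules are not described in this topic, and split_into_rows is a hypothetical helper name.

```python
import re


def split_into_rows(doc_id, text):
    """Return one (doc_id, sentence) row per sentence.

    Naive assumption: a sentence ends with '.', '!' or '?' followed by
    whitespace. The real Sentence Splitting component may behave differently.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty fragments, for example when the input text is empty.
    return [(doc_id, part) for part in parts if part]


rows = split_into_rows("1000897", "First one. Second one! Third?")
for row in rows:
    print(row)
```

Every row keeps the document ID of the original text, so the downstream component can group sentences by document.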

Configure the component

You can use one of the following methods to configure the Text Summarization component.

Method 1: Configure the component in the PAI console

You can configure the parameters of the Text Summarization component in Machine Learning Designer. The following table describes the parameters.

| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Column of Marked Document IDs | The name of the document ID column. |
| Fields Setting | Sentence Column | The sentence column. You can specify only one column. |
| Parameters Setting | Output First N Key Sentences | The number of top-ranked key sentences to output. Default value: 3. |
| Parameters Setting | Sentence Similarity Calculation Method | The method used to calculate sentence similarities. Valid values: lcs_sim, levenshtein_sim, ssk, and cosine. |
| Parameters Setting | Weight of Matching String | The weight of a matched string. This parameter takes effect only if you set the Sentence Similarity Calculation Method parameter to ssk. Default value: 0.5. |
| Parameters Setting | Length of Substring | The length of a substring. This parameter takes effect only if you set the Sentence Similarity Calculation Method parameter to ssk or cosine. Default value: 2. |
| Parameters Setting | Damping Coefficient | The damping coefficient. Default value: 0.85. |
| Parameters Setting | Maximum Iterations | The maximum number of iterations. Default value: 100. |
| Parameters Setting | Convergence Coefficient | The convergence coefficient. Default value: 0.000001. |
| Tuning | Number of Cores | The number of cores used for calculation. By default, the system determines the value. |
| Tuning | Memory Size per Core | The memory size of each core. By default, the system determines the value. |
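To give a feel for what a sentence similarity method computes, the following Python sketch implements an LCS-based score and a Levenshtein-based score. The normalization (dividing by the length of the longer string) and the function names are assumptions made for this illustration; this topic does not document the exact formulas that the component uses.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def levenshtein_distance(a, b):
    """Minimum number of single-character edits that turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def lcs_similarity(a, b):
    """Assumed normalization: LCS length divided by the longer length."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


def levenshtein_similarity(a, b):
    """Assumed normalization: 1 minus distance divided by the longer length."""
    longest = max(len(a), len(b))
    return 1.0 - levenshtein_distance(a, b) / longest if longest else 1.0
```

Both scores fall between 0 and 1, and identical sentences score 1, which makes them usable as edge weights between sentences in a TextRank graph.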

Method 2: Configure the component by using PAI commands

You can use SQL scripts to call PAI commands. For more information, see SQL Script. The following code shows a sample command, and the table after it describes the parameters.

PAI -name TextSummarization
    -project algo_public
    -DinputTableName="test_input"
    -DoutputTableName="test_output"
    -DdocIdCol="doc_id"
    -DsentenceCol="sentence"
    -DtopN=2
    -Dlifecycle=30;
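If you generate such commands from a script, the mapping from parameters to -D options can be sketched in Python as follows. build_pai_command is a hypothetical helper written for this illustration and is not part of PAI or MaxCompute. It only formats a string and does not submit anything.

```python
def build_pai_command(name, project, params):
    """Format a PAI command string from a dict of parameters.

    String values are wrapped in double quotes and other values are
    written as-is, mirroring the sample command in this topic.
    """
    lines = [f"PAI -name {name}", f"    -project {project}"]
    for key, value in params.items():
        rendered = f'"{value}"' if isinstance(value, str) else str(value)
        lines.append(f"    -D{key}={rendered}")
    return "\n".join(lines) + ";"


command = build_pai_command(
    "TextSummarization",
    "algo_public",
    {
        "inputTableName": "test_input",
        "outputTableName": "test_output",
        "docIdCol": "doc_id",
        "sentenceCol": "sentence",
        "topN": 2,
        "lifecycle": 30,
    },
)
print(command)
```

The resulting string can then be pasted into an SQL script or passed to whatever client you use to run PAI commands.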

| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTableName | Yes | The name of the input table. | N/A |
| inputTablePartitions | No | The partitions selected from the input table for computing. | All partitions |
| outputTableName | Yes | The name of the output table. | N/A |
| docIdCol | Yes | The name of the document ID column. | N/A |
| sentenceCol | Yes | The sentence column. You can specify only one column. | N/A |
| topN | No | The number of top-ranked key sentences to output. | 3 |
| similarityType | No | The method used to calculate sentence similarities. Valid values: lcs_sim, levenshtein_sim, ssk, and cosine. | lcs_sim |
| lambda | No | The weight of a matched string. This parameter takes effect only if you set the similarityType parameter to ssk. | 0.5 |
| k | No | The length of a substring. This parameter takes effect only if you set the similarityType parameter to ssk or cosine. | 2 |
| dampingFactor | No | The damping coefficient. | 0.85 |
| maxIter | No | The maximum number of iterations. | 100 |
| epsilon | No | The convergence coefficient. | 0.000001 |
| lifecycle | No | The lifecycle of the input and output tables. | N/A |
| coreNum | No | The number of cores used for calculation. | Automatically allocated |
| memSizePerCore | No | The memory size of each core. | Automatically allocated |
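The roles of dampingFactor, maxIter, epsilon, and topN can be illustrated with a minimal TextRank-style iteration in Python. This sketch follows the weighted update rule from the original TextRank formulation and takes a precomputed sentence similarity matrix as input. It is not the component's implementation, and details such as the initial scores, the convergence test, and returning key sentences in document order are assumptions.

```python
def textrank_scores(similarity, damping_factor=0.85, max_iter=100, epsilon=1e-6):
    """Iteratively score sentences from a square similarity matrix."""
    n = len(similarity)
    # Total outgoing edge weight of each sentence, ignoring self-similarity.
    out_weight = [
        sum(similarity[j][k] for k in range(n) if k != j) for j in range(n)
    ]
    scores = [1.0] * n
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            incoming = sum(
                similarity[j][i] / out_weight[j] * scores[j]
                for j in range(n)
                if j != i and out_weight[j] > 0
            )
            new_scores.append((1 - damping_factor) + damping_factor * incoming)
        # Stop early once no score moves by more than epsilon.
        delta = max((abs(a - b) for a, b in zip(new_scores, scores)), default=0.0)
        scores = new_scores
        if delta < epsilon:
            break
    return scores


def top_sentences(sentences, scores, top_n=3):
    """Pick the top_n highest-scoring sentences, kept in document order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:top_n])]


# Sentence 0 is similar to both other sentences, so it should rank highest.
similarity = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
scores = textrank_scores(similarity)
print(top_sentences(["s0", "s1", "s2"], scores, top_n=1))
```

A larger damping factor gives more weight to similarity-based votes from other sentences, and a smaller epsilon or a larger maximum number of iterations trades running time for more stable scores.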

Examples

  1. Prepare the input table test_input. The following section provides an example.

    You can use the MaxCompute client to create a table and use Tunnel commands to upload data. For information about how to install and configure the MaxCompute client, see MaxCompute client (odpscmd). For more information about Tunnel commands, see Tunnel commands.

    | doc_id | sentence |
    | --- | --- |
    | 1000897 | Since the outbreak of the Covid-19 pandemic, the issue of consuming wild animals has been prominent. The issue brings huge risks to public health security, causing widespread concern in society. Public security, forestry, and market regulation departments across the country carried out relevant special actions to crack down on the illegal hunting, selling and consumption of wild animals, achieving remarkable results. During the process of cracking down illegal activities related to wild animals, law enforcement departments realized that the huge consumption of wild animals, huge profits of poaching, and the difficulty and high costs of identification are important reasons for the persistence of poaching of wild animals. |

    Parameters:

    • doc_id: the document ID column.

    • sentence: the sentence column.

  2. Use the Sentence Splitting component to split the text in the sentence column into rows. Each row contains only one sentence. The following table provides an example of the output table, which is named test_output. For more information, see Sentence Splitting.

    | doc_id | sentence |
    | --- | --- |
    | 1000897 | Since the outbreak of the Covid-19 pandemic, the issue of consuming wild animals has been prominent. |
    | 1000897 | The issue brings huge risks to public health security, causing widespread concern in society. |
    | 1000897 | Public security, forestry, and market regulation departments across the country carried out relevant special actions to crack down on the illegal hunting, selling and consumption of wild animals, achieving remarkable results. |
    | 1000897 | During the process of cracking down illegal activities related to wild animals, law enforcement departments realized that the huge consumption of wild animals, huge profits of poaching, and the difficulty and high costs of identification are important reasons for the persistence of poaching of wild animals. |

  3. Run the following PAI command to generate a text summary.

    You can use an SQL script or an ODPS SQL node component to run the following PAI command.

    PAI -name TextSummarization
        -project algo_public
        -DinputTableName="test_output"
        -DoutputTableName="test_output1"
        -DdocIdCol="doc_id"
        -DsentenceCol="sentence"
        -DtopN=2
        -Dlifecycle=30;

    The output table contains the doc_id and abstract columns.

    | doc_id | abstract |
    | --- | --- |
    | 1000897 | Since the outbreak of the Covid-19 pandemic, the issue of consuming wild animals has been prominent. Public security, forestry, and market regulation departments across the country carried out relevant special actions to crack down on the illegal hunting, selling and consumption of wild animals, achieving remarkable results. |
