Platform for AI: Configure the Text Summarization component

The Text Summarization component uses an automatic summarization algorithm based on the TextRank model to extract key sentences from a document. This process generates a concise and coherent summary that accurately captures the main idea of the original document. This topic describes how to configure the Text Summarization component.

Limits

The supported computing engine is MaxCompute.

Usage notes

The component expects one sentence per row as input. Add a Sentence Splitting component upstream to split the text into one sentence per row.

Component configuration

You can configure the component parameters in one of the following ways.

Method 1: Use the GUI

You can configure the component parameters on the Designer workflow page.

Tab	Parameter	Description
Fields Setting	Column for document ID	Enter the name of the column that contains document IDs.
Fields Setting	Sentence column	Specify one column.
Parameters Setting	Number of key sentences to output	The default value is 3.
Parameters Setting	Sentence similarity calculation method	The method to calculate sentence similarity. Valid values: lcs_sim, levenshtein_sim, ssk, and cosine.
Parameters Setting	Weight of matching string	This parameter is active when Sentence similarity calculation method is set to ssk. The default value is 0.5.
Parameters Setting	Length of substring	This parameter is active when Sentence similarity calculation method is set to ssk or cosine. The default value is 2.
Parameters Setting	Damping factor	The default value is 0.85.
Parameters Setting	Maximum iterations	The default value is 100.
Parameters Setting	Convergence coefficient	The default value is 0.000001.
Execution tuning	Number of cores	Automatically allocated.
Execution tuning	Memory per core	Automatically allocated.
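
To give a feel for what a sentence similarity method such as lcs_sim measures, the following Python sketch computes a similarity based on the longest common subsequence (LCS) of two strings. It is an illustration only, not the component's implementation: the function names are hypothetical, and the normalization (LCS length divided by the length of the longer string) is an assumption that may differ from the one the component uses.

```python
def lcs_length(a: str, b: str) -> int:
    """Return the length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)  # DP row for the previous character of a
    for ch_a in a:
        curr = [0]  # column 0: LCS against an empty prefix of b is 0
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Assumed normalization: LCS length over the longer string's length."""
    longest = max(len(a), len(b))
    # Two empty strings are treated as identical (an arbitrary choice here).
    return lcs_length(a, b) / longest if longest else 1.0
```

Under this assumed normalization, identical sentences score 1.0 and sentences with no characters in common score 0.0.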

Method 2: Use PAI commands

You can use PAI commands to configure the component parameters. To do this, use the SQL Script component to call PAI commands. For more information, see SQL Script.

PAI -name TextSummarization
    -project algo_public
    -DinputTableName="test_input"
    -DoutputTableName="test_output"
    -DdocIdCol="doc_id"
    -DsentenceCol="sentence"
    -DtopN=2
    -Dlifecycle=30;

Parameter	Required	Description	Default value
inputTableName	Yes	The input table name.	None
inputTablePartitions	No	The partitions in the input table to use for computation.	All partitions of the input table
outputTableName	Yes	The output table name.	None
docIdCol	Yes	The name of the column that contains document IDs.	None
sentenceCol	Yes	The sentence column. You can specify only one column.	None
topN	No	The number of key sentences to output.	3
similarityType	No	The method to calculate sentence similarity. Valid values: lcs_sim, levenshtein_sim, ssk, and cosine.	lcs_sim
lambda	No	The weight of a matching string. This parameter is available when `similarityType` is set to ssk.	0.5
k	No	The length of a substring. This parameter is available when `similarityType` is set to ssk or cosine.	2
dampingFactor	No	The damping factor.	0.85
maxIter	No	The maximum number of iterations.	100
epsilon	No	The convergence coefficient.	0.000001
lifecycle	No	The lifecycle of the output table.	None
coreNum	No	The number of cores for computation.	Automatically allocated by the system
memSizePerCore	No	The memory required for each core.	Automatically allocated by the system
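
The following Python sketch shows how dampingFactor, maxIter, epsilon, and topN typically interact in a TextRank-style ranking. It is a minimal illustration under stated assumptions, not the component's actual implementation: all function names are hypothetical, the word-overlap similarity is a simple stand-in for the methods listed above, and joining the selected sentences in their original order is assumed.

```python
from typing import Callable, List


def word_overlap(a: str, b: str) -> float:
    """Stand-in similarity (Jaccard overlap of words), not a PAI method."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def textrank_scores(
    sentences: List[str],
    similarity: Callable[[str, str], float],
    damping_factor: float = 0.85,
    max_iter: int = 100,
    epsilon: float = 0.000001,
) -> List[float]:
    """Iterate weighted PageRank over the sentence similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    # Edge weights: pairwise similarity, with no self-loops.
    w = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            rank = sum(w[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping_factor) / n + damping_factor * rank)
        delta = max(abs(x - y) for x, y in zip(new_scores, scores))
        scores = new_scores
        if delta < epsilon:  # stop once the scores have converged
            break
    return scores


def summarize(sentences: List[str], top_n: int = 3, **kwargs) -> str:
    """Pick the top_n highest-scoring sentences, joined in original order."""
    scores = textrank_scores(sentences, word_overlap, **kwargs)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_n]
    return " ".join(sentences[i] for i in sorted(top))
```

In this sketch, a larger damping factor gives more weight to scores passed along from similar sentences, while maxIter and epsilon bound how long the iteration runs.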

Example

Prepare the input table `test_input`. The following table shows sample data.

You can use the MaxCompute client to create a table and use Tunnel commands to upload data. For more information about how to install and configure the MaxCompute client, see Connect using the local client (odpscmd). For more information about Tunnel commands, see Tunnel commands.

doc_id	sentence
1000897	Since the COVID-19 outbreak, the consumption of wild animals has become a prominent issue. This poses a great risk to public health and has drawn widespread social concern. Public security, forestry, and market regulation departments across the country have launched special campaigns to combat the illegal hunting, trafficking, and consumption of wild animals, achieving notable success. While cracking down on these illegal activities, law enforcement found that a large consumer base, enormous poaching profits, and the difficulty and high cost of identification are key reasons the illegal wildlife trade continues to thrive.

Where:

doc_id: The document ID column.
sentence: The sentence column.

Use the Sentence Splitting component to split the text in the `sentence` column into one sentence per row. The output table is named `test_output`. The following table shows the content. For more information, see Sentence Splitting.

doc_id	sentence
1000897	Since the COVID-19 outbreak, the consumption of wild animals has become a prominent issue.
1000897	This poses a great risk to public health and has drawn widespread social concern.
1000897	Public security, forestry, and market regulation departments across the country have launched special campaigns to combat the illegal hunting, trafficking, and consumption of wild animals, achieving notable success.
1000897	While cracking down on these illegal activities, law enforcement found that a large consumer base, enormous poaching profits, and the difficulty and high cost of identification are key reasons the illegal wildlife trade continues to thrive.
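
For readers who want to see the shape of this transformation outside Designer, the following Python sketch turns one document row into one row per sentence. It is only an approximation under a simple assumption (sentences end with ".", "!", or "?" followed by whitespace); the function name is hypothetical and the Sentence Splitting component may use different rules.

```python
import re
from typing import List, Tuple


def split_into_rows(doc_id: str, text: str) -> List[Tuple[str, str]]:
    """Return (doc_id, sentence) rows, one sentence per row."""
    # Split at whitespace that follows sentence-ending punctuation,
    # keeping the punctuation attached to the sentence before it.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [(doc_id, part) for part in parts if part]
```

Each returned tuple corresponds to one row of the split table, with the document ID repeated for every sentence.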

Run the following PAI command to generate a text summary.

You can use an SQL Script component or an ODPS SQL Node component to run the following PAI command.

PAI -name TextSummarization
    -project algo_public
    -DinputTableName="test_output"
    -DoutputTableName="test_output1"
    -DdocIdCol="doc_id"
    -DsentenceCol="sentence"
    -DtopN=2
    -Dlifecycle=30;

The output table has two columns: doc_id and abstract.

doc_id	abstract
1000897	Since the COVID-19 outbreak, the consumption of wild animals has become a prominent issue. Public security, forestry, and market regulation departments across the country have launched special campaigns to combat the illegal hunting, trafficking, and consumption of wild animals, achieving notable success.

References

The Sentence Splitting component preprocesses data by splitting a text segment into one sentence per row. For more information, see Sentence Splitting.
For more information about Designer, see Designer overview.