The Text Summarization component uses an automatic summarization algorithm based on the TextRank model to extract key sentences from a document. This process generates a concise and coherent summary that accurately captures the main idea of the original document. This topic describes how to configure the Text Summarization component.
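As an illustration of the TextRank idea, the following minimal Python sketch ranks sentences on a similarity graph and keeps the highest-scoring ones. It is not the component's implementation: the word-overlap similarity and the function names (`tokenize`, `overlap_similarity`, `textrank`) are assumptions made for this example, and the arguments only loosely mirror the component's parameters (number of key sentences, damping factor, maximum iterations, convergence coefficient).

```python
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def overlap_similarity(a, b):
    """Shared-word count normalized by log sentence lengths (assumed measure)."""
    wa, wb = tokenize(a), tokenize(b)
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # avoids log(1) + log(1) == 0 in the denominator
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def textrank(sentences, top_n=3, damping_factor=0.85, max_iter=100, epsilon=1e-6):
    """Return the top_n highest-ranked sentences in their original order."""
    n = len(sentences)
    if n == 0:
        return []
    # Symmetric weighted graph: weights[i][j] is the similarity of sentences i and j.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w = overlap_similarity(sentences[i], sentences[j])
            weights[i][j] = weights[j][i] = w
    out_sum = [sum(row) for row in weights]

    scores = [1.0] * n
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            # Each neighbor j passes on a share of its score proportional to the edge weight.
            rank = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping_factor) + damping_factor * rank)
        delta = max(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < epsilon:  # stop once the scores no longer change noticeably
            break

    # Pick the top_n scores, then restore document order for readability.
    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(best)]
```

Calling `textrank(sentences, top_n=2)` on a list of sentences from one document returns two of those sentences, in the order in which they appear in the document.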
Limits
The supported computing engine is MaxCompute.
Usage notes
Add a Sentence Splitting component upstream to split the text into one sentence per row.
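To show the row shape that this requirement produces, here is a small Python sketch that turns `(doc_id, text)` rows into `(doc_id, sentence)` rows. The function `split_rows` and its punctuation rule are assumptions for illustration only; the Sentence Splitting component applies its own rules, which are not described in this topic.

```python
import re


def split_rows(rows):
    """Turn (doc_id, text) rows into (doc_id, sentence) rows, one sentence each."""
    out = []
    for doc_id, text in rows:
        # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            if sentence:
                out.append((doc_id, sentence))
    return out
```

Every output row keeps the document ID of the row it came from, which is what lets a downstream step group sentences by document.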
Component configuration
You can configure the component parameters in one of the following ways.
Method 1: Use the GUI
You can configure the component parameters on the Designer workflow page.
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Column for document ID | Enter the name of the column that contains document IDs. |
| | Sentence column | Specify one column. |
| Parameters Setting | Number of key sentences to output | The default value is 3. |
| | Sentence similarity calculation method | The method used to calculate sentence similarity. This topic refers to the values lcs_sim, ssk, and cosine. |
| | Weight of matching string | This parameter takes effect only when Sentence similarity calculation method is set to ssk. The default value is 0.5. |
| | Length of substring | This parameter takes effect only when Sentence similarity calculation method is set to ssk or cosine. The default value is 2. |
| | Damping factor | The default value is 0.85. |
| | Maximum iterations | The default value is 100. |
| | Convergence coefficient | The default value is 0.000001. |
| Execution tuning | Number of cores | Automatically allocated. |
| | Memory per core | Automatically allocated. |
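To make the similarity-related parameters more concrete, the Python sketch below shows one plausible reading of an LCS-based similarity and of a cosine similarity over substrings of a fixed length. The exact formulas used by the component are not stated in this topic, so the normalization in `lcs_similarity` and the use of the substring length `k` in `kgram_cosine` are assumptions, and all function names are hypothetical.

```python
import math
from collections import Counter


def lcs_length(a, b):
    """Length of the longest common subsequence of two strings (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_similarity(a, b):
    """LCS length divided by the longer string's length (assumed normalization)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def kgram_cosine(a, b, k=2):
    """Cosine similarity between counts of length-k substrings (assumed role of k)."""
    ca = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    cb = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    dot = sum(ca[g] * cb[g] for g in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0
```

Both functions return a value between 0 and 1, so either could serve as an edge weight between two sentences in a ranking graph.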
Method 2: Use PAI commands
You can use PAI commands to configure the component parameters. To do this, use the SQL Script component to call PAI commands. For more information, see SQL Script.
PAI -name TextSummarization
-project algo_public
-DinputTableName="test_input"
-DoutputTableName="test_output"
-DdocIdCol="doc_id"
-DsentenceCol="sentence"
-DtopN=2
-Dlifecycle=30;
| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTableName | Yes | The input table name. | None |
| inputTablePartitions | No | The partitions in the input table to use for computation. | All partitions of the input table |
| outputTableName | Yes | The output table name. | None |
| docIdCol | Yes | The name of the column that contains document IDs. | None |
| sentenceCol | Yes | The sentence column. You can specify only one column. | None |
| topN | No | The number of key sentences to output. | 3 |
| similarityType | No | The method used to calculate sentence similarity. This topic refers to the values lcs_sim, ssk, and cosine. | lcs_sim |
| lambda | No | The weight of a matching string. This parameter takes effect only when `similarityType` is set to ssk. | 0.5 |
| k | No | The length of a substring. This parameter takes effect only when `similarityType` is set to ssk or cosine. | 2 |
| dampingFactor | No | The damping factor. | 0.85 |
| maxIter | No | The maximum number of iterations. | 100 |
| epsilon | No | The convergence coefficient. | 0.000001 |
| lifecycle | No | The lifecycle of the output table. | None |
| coreNum | No | The number of cores for computation. | Automatically allocated by the system |
| memSizePerCore | No | The memory required for each core. | Automatically allocated by the system |
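If the command is generated from a script, each parameter maps to one `-D` flag. The Python helper below is a hypothetical convenience (`build_pai_command` is not part of PAI or MaxCompute) that only renders the command text in the same style as the sample command, quoting string values and leaving numbers bare. It does not submit anything.

```python
def build_pai_command(name, project, params):
    """Render a PAI command string; string values are double-quoted, numbers are not."""
    parts = [f"PAI -name {name}", f"-project {project}"]
    for key, value in params.items():
        if isinstance(value, str):
            parts.append(f'-D{key}="{value}"')
        else:
            parts.append(f"-D{key}={value}")
    # One flag per line, terminated by a semicolon as in the sample command.
    return "\n".join(parts) + ";"


command = build_pai_command(
    "TextSummarization",
    "algo_public",
    {
        "inputTableName": "test_input",
        "outputTableName": "test_output",
        "docIdCol": "doc_id",
        "sentenceCol": "sentence",
        "topN": 2,
        "lifecycle": 30,
    },
)
print(command)
```

The resulting text can then be pasted into an SQL Script component.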
Example
- Prepare the input table `test_input`.
  You can use the MaxCompute client to create a table and use Tunnel commands to upload data. For more information about how to install and configure the MaxCompute client, see Connect using the local client (odpscmd). For more information about Tunnel commands, see Tunnel commands.
  The following table shows sample data.

  | doc_id | sentence |
  | --- | --- |
  | 1000897 | Since the COVID-19 outbreak, the consumption of wild animals has become a prominent issue. This poses a great risk to public health and has drawn widespread social concern. Public security, forestry, and market regulation departments across the country have launched special campaigns to combat the illegal hunting, trafficking, and consumption of wild animals, achieving notable success. While cracking down on these illegal activities, law enforcement found that a large consumer base, enormous poaching profits, and the difficulty and high cost of identification are key reasons the illegal wildlife trade continues to thrive. |
  Where:
  - doc_id: The document ID column.
  - sentence: The sentence column.
- Use the Sentence Splitting component to split the text in the `sentence` column into one sentence per row. The output table is named `test_output`. The following table shows its content. For more information, see Sentence Splitting.

  | doc_id | sentence |
  | --- | --- |
  | 1000897 | Since the COVID-19 outbreak, the consumption of wild animals has become a prominent issue. |
  | 1000897 | This poses a great risk to public health and has drawn widespread social concern. |
  | 1000897 | Public security, forestry, and market regulation departments across the country have launched special campaigns to combat the illegal hunting, trafficking, and consumption of wild animals, achieving notable success. |
  | 1000897 | While cracking down on these illegal activities, law enforcement found that a large consumer base, enormous poaching profits, and the difficulty and high cost of identification are key reasons the illegal wildlife trade continues to thrive. |
- Run the following PAI command to generate a text summary.
  You can use an SQL Script component or an ODPS SQL Node component to run the command.
  PAI -name TextSummarization
  -project algo_public
  -DinputTableName="test_output"
  -DoutputTableName="test_output1"
  -DdocIdCol="doc_id"
  -DsentenceCol="sentence"
  -DtopN=2
  -Dlifecycle=30;
  The output table `test_output1` has two columns: doc_id and abstract.

  | doc_id | abstract |
  | --- | --- |
  | 1000897 | Since the COVID-19 outbreak, the consumption of wild animals has become a prominent issue. Public security, forestry, and market regulation departments across the country have launched special campaigns to combat the illegal hunting, trafficking, and consumption of wild animals, achieving notable success. |
References
- The Sentence Splitting component preprocesses data by splitting a text segment into one sentence per row. For more information, see Sentence Splitting.
- For more information about Designer, see Designer overview.