This topic describes the Text Summarization component provided by Machine Learning Studio.
The Text Summarization component automatically generates abstracts. An abstract is a short, coherent text that accurately reflects the main idea of a document.
The component uses a TextRank-based algorithm to extract sentences from a document to generate an abstract. For more information, see TextRank: Bringing Order into Texts.
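The ranking idea behind TextRank can be illustrated with a minimal Python sketch. This is not the component's implementation. The function names `kgram_cosine` and `textrank_summary` are hypothetical, the similarity measure (cosine over character k-gram counts) is an assumed stand-in for the component's cosine option, and returning the selected sentences in their original order is also an assumption. The default arguments mirror the defaults documented for the component (top 3 sentences, damping coefficient 0.85, 100 iterations, convergence coefficient 0.000001).

```python
import math
from collections import Counter


def kgram_cosine(a, b, k=2):
    """Cosine similarity between character k-gram count vectors (assumed definition)."""
    grams_a = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    grams_b = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    dot = sum(count * grams_b[gram] for gram, count in grams_a.items())
    norm = (math.sqrt(sum(c * c for c in grams_a.values()))
            * math.sqrt(sum(c * c for c in grams_b.values())))
    return dot / norm if norm else 0.0


def textrank_summary(sentences, top_n=3, damping=0.85, max_iter=100, epsilon=1e-6):
    """Rank sentences with a weighted PageRank iteration and keep the top_n."""
    n = len(sentences)
    if n == 0:
        return []
    # Build a weighted sentence graph; a sentence is not linked to itself.
    weights = [[kgram_cosine(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            # Each neighbor j passes on its score in proportion to the edge weight.
            rank = sum(weights[j][i] / out_weight[j] * scores[j]
                       for j in range(n) if out_weight[j] > 0)
            new_scores.append((1 - damping) + damping * rank)
        delta = max(abs(new - old) for new, old in zip(new_scores, scores))
        scores = new_scores
        if delta < epsilon:  # stop once the scores have converged
            break
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    # Assumption: emit the selected sentences in their original document order.
    return [sentences[i] for i in sorted(top)]
```

Sentences that resemble many other sentences accumulate a high score, so the abstract tends to consist of the sentences most representative of the document as a whole.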
Configure the component
You can configure the component by using one of the following methods:
- Machine Learning Platform for AI console
Tab | Parameter | Description |
---|---|---|
Fields Settings | Column of Marked Document IDs | The name of the document ID column. |
Fields Settings | Sentence Column | The sentence column. You can specify only one column. |
Parameters Setting | Output First N Key Sentences | Default value: 3. |
Parameters Setting | Sentence Similarity Calculation Method | The method used to calculate sentence similarities. Valid values: lcs_sim, levenshtein_sim, ssk, and cosine. |
Parameters Setting | Weight of Matching String | This parameter takes effect only when Sentence Similarity Calculation Method is set to ssk. Default value: 0.5. |
Parameters Setting | Substring Length | This parameter takes effect only when Sentence Similarity Calculation Method is set to ssk or cosine. Default value: 2. |
Parameters Setting | Damping Coefficient | Default value: 0.85. |
Parameters Setting | Maximum Iterations | Default value: 100. |
Parameters Setting | Convergence Coefficient | Default value: 0.000001. |
Tuning | Cores | Automatically allocated. |
Tuning | Memory Size per Core | Automatically allocated. |

- PAI command
PAI -name TextSummarization -project algo_public -DinputTableName="test_input" -DoutputTableName="test_output" -DdocIdCol="doc_id" -DsentenceCol="sentence" -DtopN=2 -Dlifecycle=30;
Parameter | Required | Description | Default value |
---|---|---|---|
inputTableName | Yes | The name of the input table. | No default value |
inputTablePartitions | No | The partitions selected from the input table for computing. | Full table |
outputTableName | Yes | The name of the output table. | No default value |
docIdCol | Yes | The name of the document ID column. | No default value |
sentenceCol | Yes | The sentence column. You can specify only one column. | No default value |
topN | No | The number of top-ranked key sentences to output. | 3 |
similarityType | No | The method used to calculate sentence similarities. Valid values: lcs_sim, levenshtein_sim, ssk, and cosine. | lcs_sim |
lambda | No | The weight of a matched string. This parameter takes effect only when similarityType is set to ssk. | 0.5 |
k | No | The length of a substring. This parameter takes effect only when similarityType is set to ssk or cosine. | 2 |
dampingFactor | No | The damping coefficient. | 0.85 |
maxIter | No | The maximum number of iterations. | 100 |
epsilon | No | The convergence coefficient. | 0.000001 |
lifecycle | No | The lifecycle of the input and output tables. | No default value |
coreNum | No | The number of cores involved in computing. | Automatically allocated |
memSizePerCore | No | The memory size for each core. | Automatically allocated |
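When the command is generated from a script, a small helper can assemble it and catch parameter combinations that would have no effect. The following Python sketch is illustrative only: `build_text_summarization_command` is a hypothetical helper, not part of any PAI SDK. It assumes that string values are wrapped in double quotes and numeric values are written bare, as in the sample command above. Rejecting lambda or k when similarityType does not use them is a design choice of this sketch; the documentation states only that these parameters do not take effect in that case.

```python
def build_text_summarization_command(input_table, output_table,
                                     doc_id_col, sentence_col, options=None):
    """Assemble a PAI TextSummarization command string (hypothetical helper)."""
    options = dict(options or {})
    similarity = options.get("similarityType")
    # lambda is documented to take effect only with the ssk method.
    if "lambda" in options and similarity != "ssk":
        raise ValueError("lambda takes effect only when similarityType is ssk")
    # k is documented to take effect only with the ssk or cosine method.
    if "k" in options and similarity not in ("ssk", "cosine"):
        raise ValueError("k takes effect only when similarityType is ssk or cosine")

    # Required parameters come first, followed by the optional ones.
    params = {
        "inputTableName": input_table,
        "outputTableName": output_table,
        "docIdCol": doc_id_col,
        "sentenceCol": sentence_col,
    }
    params.update(options)

    parts = ["PAI -name TextSummarization -project algo_public"]
    for name, value in params.items():
        # Assumption: quote strings, leave numbers bare, as in the sample command.
        rendered = f'"{value}"' if isinstance(value, str) else str(value)
        parts.append(f"-D{name}={rendered}")
    return " ".join(parts) + ";"


command = build_text_summarization_command(
    "test_input", "test_output", "doc_id", "sentence",
    {"topN": 2, "lifecycle": 30},
)
print(command)
```

With these arguments, the helper reproduces the sample command shown earlier in this topic.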
Example
The output table contains the doc_id and abstract columns.
doc_id | abstract |
---|---|
1000894 | In 2008, the Shanghai Stock Exchange published disclosure guidelines for the corporate social responsibility (CSR) of listed companies. Three types of companies were urged to disclose their CSR reports, and other qualified listed companies were encouraged to voluntarily disclose their CSR reports. In 2012, a total of 379 listed companies, which made up 40% of all listed companies, disclosed CSR reports. Among those companies, 305 were mandated to disclose CSR reports and 74 voluntarily disclosed CSR reports. According to Hu Ruyin, the Shanghai Stock Exchange will explore how to expand the scope of CSR report disclosure, revise and refine the guidelines on disclosure of the CSR reports, and encourage more organizations to promote CSR product innovation. |