This topic describes the Text Summarization component provided by Machine Learning Studio.

The Text Summarization component automatically extracts an abstract from a document. An abstract is a short, coherent text that accurately reflects the main idea of the document.

The component uses a TextRank-based algorithm to extract sentences from a document to generate an abstract. For more information, see TextRank: Bringing Order into Texts.
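
To illustrate the idea, the following is a minimal Python sketch of weighted TextRank sentence ranking. It is not the component's implementation. The function names (`lcs_similarity`, `textrank`, `summarize`) are hypothetical, and the normalization used in the LCS-based similarity is an assumption made for this sketch. The `damping_factor`, `max_iter`, and `epsilon` arguments mirror the damping coefficient, maximum iterations, and convergence coefficient that the component exposes.

```python
from typing import Callable, List, Sequence


def lcs_similarity(a: str, b: str) -> float:
    """LCS length divided by the longer string's length (assumed normalization)."""
    if not a or not b:
        return 0.0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1] / max(len(a), len(b))


def textrank(
    sentences: Sequence[str],
    similarity: Callable[[str, str], float] = lcs_similarity,
    damping_factor: float = 0.85,
    max_iter: int = 100,
    epsilon: float = 1e-6,
) -> List[float]:
    """Return one TextRank score per sentence."""
    n = len(sentences)
    if n == 0:
        return []
    # Build a symmetric similarity graph without self-loops.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            weights[i][j] = weights[j][i] = similarity(sentences[i], sentences[j])
    out_sums = [sum(row) for row in weights]

    scores = [1.0] * n
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            # Each neighbor j passes on its score in proportion to the edge weight.
            rank = sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n)
                if out_sums[j] > 0
            )
            new_scores.append((1 - damping_factor) + damping_factor * rank)
        delta = max(abs(x - y) for x, y in zip(new_scores, scores))
        scores = new_scores
        if delta < epsilon:  # Stop once the scores have converged.
            break
    return scores


def summarize(sentences: Sequence[str], top_n: int = 3, **kwargs) -> List[str]:
    """Pick the top_n highest-scoring sentences and keep their original order."""
    scores = textrank(sentences, **kwargs)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(top)]
```

For example, `summarize(sentences, top_n=2)` returns the two sentences that are most similar to the rest of the document, in the order in which they appear.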

Configure the component

You can configure the component by using one of the following methods:
  • Machine Learning Platform for AI console
    | Tab | Parameter | Description |
    | --- | --- | --- |
    | Fields Settings | Column of Marked Document IDs | The name of the document ID column. |
    | Fields Settings | Sentence Column | The sentence column. You can specify only one column. |
    | Parameters Setting | Output First N Key Sentences | The number of top-ranked key sentences to output. Default value: 3. |
    | Parameters Setting | Sentence Similarity Calculation Method | The method used to calculate sentence similarities. Valid values: lcs_sim, levenshtein_sim, ssk, and cosine. |
    | Parameters Setting | Weight of Matching String | This parameter takes effect only when Sentence Similarity Calculation Method is set to ssk. Default value: 0.5. |
    | Parameters Setting | Substring Length | This parameter takes effect only when Sentence Similarity Calculation Method is set to ssk or cosine. Default value: 2. |
    | Parameters Setting | Damping Coefficient | Default value: 0.85. |
    | Parameters Setting | Maximum Iterations | Default value: 100. |
    | Parameters Setting | Convergence Coefficient | Default value: 0.000001. |
    | Tuning | Cores | Automatically allocated. |
    | Tuning | Memory Size per Core | Automatically allocated. |
  • PAI command
    PAI -name TextSummarization    
        -project algo_public    
        -DinputTableName="test_input"    
        -DoutputTableName="test_output"    
        -DdocIdCol="doc_id"    
        -DsentenceCol="sentence"    
        -DtopN=2    
        -Dlifecycle=30;
    | Parameter | Required | Description | Default value |
    | --- | --- | --- | --- |
    | inputTableName | Yes | The name of the input table. | No default value |
    | inputTablePartitions | No | The partitions selected from the input table for computing. | Full table |
    | outputTableName | Yes | The name of the output table. | No default value |
    | docIdCol | Yes | The name of the document ID column. | No default value |
    | sentenceCol | Yes | The sentence column. You can specify only one column. | No default value |
    | topN | No | The number of top-ranked key sentences to output. | 3 |
    | similarityType | No | The method used to calculate sentence similarities. Valid values: lcs_sim, levenshtein_sim, ssk, and cosine. | lcs_sim |
    | lambda | No | The weight of a matched string. This parameter takes effect only when similarityType is set to ssk. | 0.5 |
    | k | No | The length of a substring. This parameter takes effect only when similarityType is set to ssk or cosine. | 2 |
    | dampingFactor | No | The damping coefficient. | 0.85 |
    | maxIter | No | The maximum number of iterations. | 100 |
    | epsilon | No | The convergence coefficient. | 0.000001 |
    | lifecycle | No | The lifecycle of the input and output tables. | No default value |
    | coreNum | No | The number of cores involved in computing. | Automatically allocated |
    | memSizePerCore | No | The memory size for each core. | Automatically allocated |
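
When the PAI command is generated from a script, the parameter rules above can be checked before the command is submitted. The following Python sketch shows one way to do this. The helper `build_text_summarization_command` is hypothetical and is not part of any PAI SDK. It only assembles the command text and does not submit it. Rejecting `lambda` or `k` when they would have no effect is a design choice of this sketch, not documented behavior of the component.

```python
from typing import Dict, Union

ParamValue = Union[str, int, float]

# Parameters marked as required in the parameter table.
REQUIRED = ("inputTableName", "outputTableName", "docIdCol", "sentenceCol")


def build_text_summarization_command(params: Dict[str, ParamValue]) -> str:
    """Assemble a TextSummarization PAI command string from a parameter dict."""
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError(f"missing required parameters: {', '.join(missing)}")

    # lcs_sim is the documented default for similarityType.
    similarity = params.get("similarityType", "lcs_sim")
    if "lambda" in params and similarity != "ssk":
        raise ValueError("lambda takes effect only when similarityType is ssk")
    if "k" in params and similarity not in ("ssk", "cosine"):
        raise ValueError("k takes effect only when similarityType is ssk or cosine")

    lines = ["PAI -name TextSummarization", "    -project algo_public"]
    for name, value in params.items():
        # Quote string values and leave numbers bare, as in the sample command.
        rendered = f'"{value}"' if isinstance(value, str) else str(value)
        lines.append(f"    -D{name}={rendered}")
    return "\n".join(lines) + ";"


command = build_text_summarization_command(
    {
        "inputTableName": "test_input",
        "outputTableName": "test_output",
        "docIdCol": "doc_id",
        "sentenceCol": "sentence",
        "topN": 2,
        "lifecycle": 30,
    }
)
print(command)
```

With these values, the helper reproduces the sample command shown earlier in this section.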

Example

The output table contains the doc_id and abstract columns.
| doc_id | abstract |
| --- | --- |
| 1000894 | In 2008, the Shanghai Stock Exchange published disclosure guidelines for the corporate social responsibility (CSR) of listed companies. Three types of companies were urged to disclose their CSR reports, and other qualified listed companies were encouraged to voluntarily disclose their CSR reports. In 2012, a total of 379 listed companies, which made up 40% of all listed companies, disclosed CSR reports. Among those companies, 305 were mandated to disclose CSR reports and 74 voluntarily disclosed CSR reports. According to Hu Ruyin, the Shanghai Stock Exchange will explore how to expand the scope of CSR report disclosure, revise and refine the guidelines on CSR report disclosure, and encourage more organizations to promote CSR product innovation. |