This topic describes the Sentence Splitting component provided by Machine Learning Studio.

This component splits the text of a document into sentences at punctuation marks. It is typically used to preprocess text before text summarization. Each row of the output contains exactly one sentence.
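The splitting behavior described above can be illustrated with a minimal Python sketch. This is not the component's actual implementation; the function name `split_sentences` and the regular expression are assumptions made for illustration. The sketch treats every delimiter character as a sentence boundary, so abbreviations and decimal numbers (for example, 3.14) would also be split.

```python
import re

# Default delimiter set named in this topic: period, exclamation point, question mark.
DEFAULT_DELIMITERS = ".!?"


def split_sentences(text, delimiters=DEFAULT_DELIMITERS):
    """Split text into sentences, keeping each delimiter with its sentence.

    Hypothetical helper for illustration only.
    """
    escaped = re.escape(delimiters)
    # A sentence is a run of non-delimiter characters followed by
    # zero or more delimiter characters.
    pattern = re.compile(r"[^{0}]+[{0}]*".format(escaped))
    sentences = []
    for match in pattern.finditer(text):
        sentence = match.group().strip()
        if sentence:
            sentences.append(sentence)
    return sentences


if __name__ == "__main__":
    for s in split_sentences("Hello world. How are you? Fine!"):
        print(s)
```

Passing a different `delimiters` string mimics a custom sentence delimiter set.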

Configure the component

You can configure the component by using one of the following methods:
  • Machine Learning Platform for AI console
    Tab             | Parameter                      | Description
    Fields Settings | Column of Marked Document IDs  | The name of the document ID column.
                    | Marked Document Content Column | The name of the document content column.
                    | Sentence Delimiter Set         | Default delimiters: periods (.), exclamation points (!), and question marks (?).
    Tuning          | Cores                          | Automatically allocated.
                    | Memory Size per Core           | Automatically allocated.
  • PAI command
    PAI -name SplitSentences    
        -project algo_public    
        -DinputTableName="test_input"    
        -DoutputTableName="test_output"    
        -DdocIdCol="doc_id"    
        -DdocContent="content"    
        -Dlifecycle=30
    Parameter            | Required | Description                                                              | Default value
    inputTableName       | Yes      | The name of the input table.                                             | No default value
    inputTablePartitions | No       | The partitions selected from the input table for computing.              | Full table
    outputTableName      | Yes      | The name of the output table.                                            | No default value
    docIdCol             | Yes      | The name of the document ID column.                                      | No default value
    docContent           | Yes      | The name of the document content column. You can specify only one column. | No default value
    delimiter            | No       | The delimiter used to separate sentences.                                | Period (.), exclamation point (!), or question mark (?)
    lifecycle            | No       | The lifecycle of the input and output tables.                            | No default value
    coreNum              | No       | The number of cores involved in computing.                               | Automatically allocated
    memSizePerCore       | No       | The memory size for each core.                                           | Automatically allocated
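When the command is generated by a script, checking the required parameters before submission can catch mistakes early. The following Python sketch only assembles the command text in the layout of the sample above; it does not submit anything. The function name `build_split_sentences_command` is hypothetical, and the quoting rule (strings quoted, numbers bare) is an assumption inferred from the sample command.

```python
# Parameter names taken from the parameter table in this topic.
REQUIRED = ("inputTableName", "outputTableName", "docIdCol", "docContent")
OPTIONAL = ("inputTablePartitions", "delimiter", "lifecycle",
            "coreNum", "memSizePerCore")


def build_split_sentences_command(params, project="algo_public"):
    """Return the PAI command text for SplitSentences (hypothetical helper)."""
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    unknown = [name for name in params if name not in REQUIRED + OPTIONAL]
    if unknown:
        raise ValueError("unknown parameters: " + ", ".join(unknown))

    lines = ["PAI -name SplitSentences", "    -project " + project]
    for name in REQUIRED + OPTIONAL:
        if name in params:
            value = params[name]
            # Assumption: quote strings and leave numbers bare, as in the sample.
            if isinstance(value, str):
                rendered = '"{}"'.format(value)
            else:
                rendered = str(value)
            lines.append("    -D{}={}".format(name, rendered))
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_split_sentences_command({
        "inputTableName": "test_input",
        "outputTableName": "test_output",
        "docIdCol": "doc_id",
        "docContent": "content",
        "lifecycle": 30,
    }))
```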

Example

The output table contains the doc_id and sentence columns.
doc_id  | sentence
1000894 | In 2008, the Shanghai Stock Exchange published disclosure guidelines for the corporate social responsibility (CSR) of listed companies; three types of companies were urged to disclose their CSR reports, and other qualified listed companies were encouraged to voluntarily disclose their CSR reports.
1000894 | In 2012, a total of 379 listed companies, accounting for 40% of all listed companies, disclosed CSR reports; among those companies, 305 were mandated to disclose CSR reports and 74 voluntarily disclosed CSR reports.
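The relationship between input rows and output rows can be sketched in Python: each (document ID, content) pair expands into one output row per sentence, and every output row keeps the ID of its source document. The function `explode_documents` and the sample rows are hypothetical, and the splitting rule is a simplification rather than the component's actual logic.

```python
import re


def explode_documents(rows, delimiters=".!?"):
    """Yield (doc_id, sentence) pairs, one per sentence, in document order.

    Hypothetical helper for illustration only.
    """
    escaped = re.escape(delimiters)
    # Non-delimiter characters followed by zero or more delimiter characters.
    pattern = re.compile(r"[^{0}]+[{0}]*".format(escaped))
    for doc_id, content in rows:
        for match in pattern.finditer(content):
            sentence = match.group().strip()
            if sentence:
                yield doc_id, sentence


if __name__ == "__main__":
    # Made-up input rows: (doc_id, content).
    sample_rows = [
        (1, "First sentence. Second sentence!"),
        (2, "Only one sentence here?"),
    ]
    for doc_id, sentence in explode_documents(sample_rows):
        print(doc_id, sentence)
```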