This topic describes the Sentence Splitting component provided by Machine Learning Studio.

The component splits the text of a document into sentences at punctuation marks and writes each sentence to its own row. It is typically used to preprocess text before text summarization.
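
The splitting behavior described above can be approximated locally with a short Python sketch. This is an illustration only, not the component's actual implementation: the function name split_sentences is hypothetical, and the sketch assumes that each delimiter stays attached to the sentence it ends and that surrounding whitespace is trimmed.

```python
import re

# Mirrors the documented default delimiters: period, exclamation point, question mark.
DEFAULT_DELIMITERS = ".!?"


def split_sentences(doc_id, content, delimiters=DEFAULT_DELIMITERS):
    """Return (doc_id, sentence) rows, one sentence per row.

    Hypothetical helper for illustration; the real component may treat
    whitespace and consecutive delimiters differently.
    """
    if not delimiters:
        raise ValueError("at least one delimiter is required")
    # Escape the delimiters so they are safe inside a character class.
    cls = re.escape(delimiters)
    # A sentence is a run of non-delimiter characters plus any trailing delimiters.
    pattern = re.compile(rf"[^{cls}]+[{cls}]*")
    rows = []
    for match in pattern.finditer(content):
        sentence = match.group().strip()
        if sentence:
            rows.append((doc_id, sentence))
    return rows
```

For example, split_sentences(1, "Hello world. How are you? Fine!") yields three rows that share the document ID 1. A purely punctuation-based split such as this one also breaks on abbreviations like "e.g.", so it is only a rough model.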

Configure the component

You can use one of the following methods to configure the Sentence Splitting component.

Method 1: Configure the component on the pipeline page

You can configure the parameters of the Sentence Splitting component on the pipeline page of Machine Learning Designer in Machine Learning Platform for AI (PAI). Machine Learning Designer was formerly known as Machine Learning Studio. The following table describes the parameters.
Tab | Parameter | Description
Fields Setting | Column of Marked Document IDs | The name of the document ID column.
Fields Setting | Marked Document Content Column | The name of the document column.
Fields Setting | Sentence Delimiter Set | The delimiters used to separate sentences. The default delimiters are periods (.), exclamation points (!), and question marks (?).
Tuning | Cores | The number of cores. By default, the system determines the value.
Tuning | Memory Size per Core | The memory size of each core. By default, the system determines the value.

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
PAI -name SplitSentences    
    -project algo_public    
    -DinputTableName="test_input"    
    -DoutputTableName="test_output"    
    -DdocIdCol="doc_id"    
    -DdocContent="content"    
    -Dlifecycle=30
Parameter | Required | Description | Default value
inputTableName | Yes | The name of the input table. | No default value
inputTablePartitions | No | The partitions selected from the input table for computing. | All partitions
outputTableName | Yes | The name of the output table. | No default value
docIdCol | Yes | The name of the document ID column. | No default value
docContent | Yes | The name of the document content column. You can specify only one column. | No default value
delimiter | No | The delimiters used to separate sentences. | Period (.), exclamation point (!), and question mark (?)
lifecycle | No | The lifecycle of the output table, in days. | No default value
coreNum | No | The number of cores used for calculation. | Determined by the system
memSizePerCore | No | The memory size of each core. | Determined by the system
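
If you generate the command text from a script, a small helper can check the required parameters before the command is submitted. The following Python sketch is one possible approach, not part of PAI: the function name build_split_sentences_command is hypothetical, and only the parameter names listed in the preceding table are taken from this topic. The sketch only assembles the command string and does not submit it.

```python
# Parameter names are taken from the parameter table in this topic.
REQUIRED = ("inputTableName", "outputTableName", "docIdCol", "docContent")
OPTIONAL = ("inputTablePartitions", "delimiter", "lifecycle", "coreNum", "memSizePerCore")


def build_split_sentences_command(**params):
    """Assemble a SplitSentences PAI command string (hypothetical helper)."""
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    unknown = [name for name in params if name not in REQUIRED + OPTIONAL]
    if unknown:
        raise ValueError("unknown parameters: " + ", ".join(unknown))

    lines = ["PAI -name SplitSentences", "    -project algo_public"]
    for name in REQUIRED + OPTIONAL:
        if name in params:
            value = params[name]
            # Quote string values and leave numbers bare, as in the sample command.
            rendered = f'"{value}"' if isinstance(value, str) else str(value)
            lines.append(f"    -D{name}={rendered}")
    return "\n".join(lines)


command = build_split_sentences_command(
    inputTableName="test_input",
    outputTableName="test_output",
    docIdCol="doc_id",
    docContent="content",
    lifecycle=30,
)
print(command)
```

With these arguments, the helper reproduces the sample command shown earlier in this section. The sketch does not escape double quotation marks inside values, so a delimiter set that contains a double quotation mark needs extra handling.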

Example

The output table contains the doc_id and sentence columns.
doc_id | sentence
1000894 | In 2008, the Shanghai Stock Exchange published disclosure guidelines on the corporate social responsibility (CSR) of listed companies. Three types of companies were urged to disclose their CSR reports, and other qualified listed companies were encouraged to voluntarily disclose their CSR reports.
1000894 | In 2012, a total of 379 listed companies, which made up 40% of all listed companies, disclosed CSR reports. Among those companies, 305 were mandated to disclose CSR reports and 74 voluntarily disclosed CSR reports.
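
Because several output rows can share one doc_id, a downstream step such as text summarization usually needs to collect the rows of each document again. The following Python sketch shows one way to do this after the rows are read from the output table. The function name group_sentences is hypothetical, and the sketch assumes that the rows arrive as (doc_id, sentence) pairs in their original order.

```python
from collections import defaultdict


def group_sentences(rows):
    """Group (doc_id, sentence) rows into {doc_id: [sentence, ...]}.

    Hypothetical helper; sentences keep the order in which the rows arrive.
    """
    grouped = defaultdict(list)
    for doc_id, sentence in rows:
        grouped[doc_id].append(sentence)
    return dict(grouped)


# Placeholder rows that follow the doc_id and sentence column layout.
sample_rows = [
    (1000894, "First sentence."),
    (1000894, "Second sentence."),
    (1000895, "Another document."),
]
grouped = group_sentences(sample_rows)
```

In this sketch, grouped[1000894] holds the two sentences of that document in order, which is the form most summarization steps expect as input.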