This topic describes the Sentence Splitting component provided by Machine Learning Studio.
The component splits the text of a document into sentences at punctuation marks. It is used to preprocess text before text summarization. Each row of the output contains only one sentence.
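The splitting behavior described above can be approximated with a short Python sketch. This is not the component's implementation. The function name `split_sentences` is hypothetical, and the sketch assumes that each delimiter stays attached to the sentence it ends and that surrounding whitespace is trimmed.

```python
import re

# Default delimiters listed in this topic: period, exclamation point, question mark.
DEFAULT_DELIMITERS = ".!?"


def split_sentences(doc_id, content, delimiters=DEFAULT_DELIMITERS):
    """Return a list of (doc_id, sentence) rows, one sentence per row.

    Hypothetical helper for illustration only. Assumption: the trailing
    delimiter is kept with its sentence.
    """
    escaped = re.escape(delimiters)
    # A sentence is a run of non-delimiter characters followed by any
    # delimiters that directly terminate it.
    pattern = re.compile("[^{0}]+[{0}]*".format(escaped))
    rows = []
    for match in pattern.finditer(content):
        sentence = match.group().strip()
        if sentence:  # skip whitespace-only fragments
            rows.append((doc_id, sentence))
    return rows


rows = split_sentences("1", "Hello world. How are you? Fine!")
for row in rows:
    print(row)
```

A naive split like this also breaks at periods inside numbers or abbreviations (for example, "3.14"), so it only illustrates the row-per-sentence output shape.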
Configure the component
You can configure the component by using one of the following methods:
- Machine Learning Platform for AI console
Tab | Parameter | Description |
---|---|---|
Fields Settings | Column of Marked Document IDs | The name of the document ID column. |
Fields Settings | Marked Document Content Column | The name of the document column. |
Fields Settings | Sentence Delimiter Set | Default delimiters: periods (.), exclamation points (!), and question marks (?). |
Tuning | Cores | Automatically allocated. |
Tuning | Memory Size per Core | Automatically allocated. |

- PAI command
PAI -name SplitSentences -project algo_public -DinputTableName="test_input" -DoutputTableName="test_output" -DdocIdCol="doc_id" -DdocContent="content" -Dlifecycle=30
Parameter | Required | Description | Default value |
---|---|---|---|
inputTableName | Yes | The name of the input table. | No default value |
inputTablePartitions | No | The partitions selected from the input table for computing. | Full table |
outputTableName | Yes | The name of the output table. | No default value |
docIdCol | Yes | The name of the document ID column. | No default value |
docContent | Yes | The name of the document content column. You can specify only one column. | No default value |
delimiter | No | The delimiter used to separate sentences. | Period (.), exclamation point (!), or question mark (?) |
lifecycle | No | The lifecycle of the input and output tables. | No default value |
coreNum | No | The number of cores involved in computing. | Automatically allocated |
memSizePerCore | No | The memory size for each core. | Automatically allocated |
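If the command is generated from a script, the required parameters listed above can be checked before the command string is assembled. The following Python sketch is only an illustration: `build_split_sentences_command` is a hypothetical helper, not part of any PAI SDK, and it only formats a string in the same shape as the sample command in this topic.

```python
# Parameters marked as required in this topic.
REQUIRED_PARAMS = ("inputTableName", "outputTableName", "docIdCol", "docContent")


def build_split_sentences_command(params):
    """Assemble a PAI SplitSentences command string from a dict of parameters.

    Hypothetical helper for illustration only. String values are quoted and
    integer values are written bare, mirroring the sample command.
    """
    missing = [name for name in REQUIRED_PARAMS if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))

    parts = ["PAI", "-name SplitSentences", "-project algo_public"]
    for name, value in params.items():
        if isinstance(value, int):
            parts.append("-D{}={}".format(name, value))
        else:
            parts.append('-D{}="{}"'.format(name, value))
    return " ".join(parts)


command = build_split_sentences_command({
    "inputTableName": "test_input",
    "outputTableName": "test_output",
    "docIdCol": "doc_id",
    "docContent": "content",
    "lifecycle": 30,
})
print(command)
```

Optional parameters such as `delimiter`, `coreNum`, and `memSizePerCore` can be added to the same dictionary. The sketch does not validate their values.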
Example
The output table contains the doc_id and sentence columns.
doc_id | sentence |
---|---|
1000894 | In 2008, the Shanghai Stock Exchange published disclosure guidelines for the corporate social responsibility (CSR) of listed companies. Three types of companies were urged to disclose their CSR reports, and other qualified listed companies were encouraged to voluntarily disclose their CSR reports. |
1000894 | In 2012, a total of 379 listed companies, which made up 40% of all listed companies, disclosed CSR reports. Among those companies, 305 were mandated to disclose CSR reports and 74 voluntarily disclosed CSR reports. |