Configure Word Splitting - Platform For AI

This topic describes the Split Word component provided by Designer.

The Split Word component uses the Alibaba Word Segmenter (AliWS) to tokenize content in a specified column. The resulting tokens are separated by spaces. If you configure part-of-speech (POS) tagging or semantic tagging, the output includes the tokens, POS tags, and semantic tags. POS tags are separated by a forward slash (/). Semantic tags are separated by a vertical bar (|).
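The separators described above can be handled with a short parser. The following is a minimal sketch, assuming each space-separated item has the layout token/POS|SEMANTIC with both tags optional. That layout, the function names, and the sample tags are illustrative assumptions, not documented AliWS output.

```python
def _split_tag(item, sep):
    """Split a trailing tag off an item; return (item, None) if there is none."""
    head, found, tag = item.rpartition(sep)
    # Require a non-empty head so a bare "/" or "|" token is left intact.
    if found and head:
        return head, tag
    return item, None


def parse_tagged_token(item):
    """Return (token, pos_tag, semantic_tag) for one item (assumed layout)."""
    item, semantic = _split_tag(item, "|")  # semantic tag follows a vertical bar
    token, pos = _split_tag(item, "/")      # POS tag follows a forward slash
    return token, pos, semantic


def parse_output(text):
    """Parse a space-separated output string into a list of tuples."""
    return [parse_tagged_token(item) for item in text.split()]


# Hypothetical sample; the tags "n", "v", and "topic" are made up.
print(parse_output("weather/n|topic is/v nice"))
```

Splitting from the right keeps a token that itself contains a separator character mostly intact, but real output may need stricter handling.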

The Split Word component supports only the TAOBAO_CHN and INTERNET_CHN tokenizers.

You can configure the Split Word component in Designer using the GUI or PAI commands.

Component configuration

You can configure the Split Word component in the following ways.

Method 1: Use the GUI

You can configure the component on the workflow page of Designer.

Tab	Parameter	Description
Fields Setting	Column Name	The column to tokenize.
Parameters Setting	Recognition Options	The content types to detect. Valid values: Detect simple entities, Detect person names, Detect organization names, Detect phone numbers, Detect time, Detect date, and Detect numbers and letters. Default values: Detect simple entities, Detect phone numbers, Detect time, Detect date, and Detect numbers and letters.
	Merge Options	The content types to merge. Valid values: Merge Chinese numbers, Merge Arabic numerals, Merge Chinese dates, and Merge Chinese times. Default value: Merge Arabic numerals.
	Filter	The tokenizer to use. Valid values: TAOBAO_CHN and INTERNET_CHN. Default value: TAOBAO_CHN.
	Pos Tagger	Specifies whether to perform part-of-speech tagging. By default, this feature is enabled.
	Semantic Tagger	Specifies whether to perform semantic tagging. By default, this feature is disabled.
	Filter tokens that are numbers	Specifies whether to filter out tokens that are numbers. By default, this feature is disabled.
	Filter tokens that are all-English words	Specifies whether to filter out tokens that consist of only English letters. By default, this feature is disabled.
	Filter tokens that are punctuation marks	Specifies whether to filter out tokens that are punctuation marks. By default, this feature is disabled.
Execution Tuning	Number of cores	The number of cores to use. By default, the system allocates this value automatically.
Execution Tuning	Memory per core	The memory size per core. By default, the system allocates this value automatically.

Method 2: Use a PAI command

You can configure the component with a PAI command and run the command through the SQL Script component. For more information, see SQL Script.

pai -name split_word_model
    -project algo_public
    -DoutputModelName=aliws_model
    -DcolName=content
    -Dtokenizer=TAOBAO_CHN
    -DenableDfa=true
    -DenablePersonNameTagger=false
    -DenableOrgnizationTagger=false
    -DenablePosTagger=false
    -DenableTelephoneRetrievalUnit=true
    -DenableTimeRetrievalUnit=true
    -DenableDateRetrievalUnit=true
    -DenableNumberLetterRetrievalUnit=true
    -DenableChnNumMerge=false
    -DenableNumMerge=true
    -DenableChnTimeMerge=false
    -DenableChnDateMerge=false
    -DenableSemanticTagger=true

Parameter Name	Required	Description	Default Value
inputTableName	Yes	The name of the input table.	None
inputTablePartitions	No	The partitions in the input table to tokenize. The format is `partition_name=value`. For multi-level partitions, use the format `name1=value1/name2=value2`. Separate multiple partitions with commas (,).	All partitions
selectedColNames	Yes	The columns in the input table to tokenize. Separate multiple column names with commas (,).	None
dictTableName	No	The name of a custom dictionary table. The table must contain exactly one column, with one word per row.	None
tokenizer	No	The tokenizer type. Valid values: TAOBAO_CHN and INTERNET_CHN.	TAOBAO_CHN
enableDfa	No	Specifies whether to detect simple entities. Valid values: True or False.	True
enablePersonNameTagger	No	Specifies whether to detect person names. Valid values: True or False.	False
enableOrgnizationTagger	No	Specifies whether to detect organization names. Valid values: True or False.	False
enablePosTagger	No	Specifies whether to perform part-of-speech tagging. Valid values: True or False.	False
enableTelephoneRetrievalUnit	No	Specifies whether to detect phone numbers. Valid values: True or False.	True
enableTimeRetrievalUnit	No	Specifies whether to detect time. Valid values: True or False.	True
enableDateRetrievalUnit	No	Specifies whether to detect dates. Valid values: True or False.	True
enableNumberLetterRetrievalUnit	No	Specifies whether to detect numbers and letters. Valid values: True or False.	True
enableChnNumMerge	No	Specifies whether to merge Chinese numbers into a retrieval unit. Valid values: True or False.	False
enableNumMerge	No	Specifies whether to merge standard numbers into a retrieval unit. Valid values: True or False.	True
enableChnTimeMerge	No	Specifies whether to merge Chinese time expressions into a semantic unit. Valid values: True or False.	False
enableChnDateMerge	No	Specifies whether to merge Chinese date expressions into a semantic unit. Valid values: True or False.	False
enableSemanticTagger	No	Specifies whether to perform semantic tagging. Valid values: True or False.	False
outputTableName	Yes	The name of the output table.	None
outputTablePartition	No	The partition name of the output table.	None
coreNum	No	The number of workers. This parameter takes effect only when the memSizePerCore parameter is also set. The value must be a positive integer in the range of [1,9999].	Automatically allocated by the system
memSizePerCore	No	The memory size per core, in MB. The value must be a positive integer in the range of [1024,64×1024].	Automatically allocated by the system
lifecycle	No	The lifecycle of the output table. The value must be a positive integer.	None
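When many parameters are set, assembling the command by hand is error-prone. The following is a minimal sketch of generating the command text from a dictionary, including the partition format described for inputTablePartitions. The helper names `format_partitions` and `build_pai_command` and the partition names `ds` and `region` are hypothetical; only the parameter names come from the table above.

```python
def format_partitions(partitions):
    """Join partition specs as name1=value1/name2=value2, comma-separated.

    partitions: a list of dicts, one per partition, with keys ordered by
    partition level.
    """
    return ",".join(
        "/".join(f"{name}={value}" for name, value in spec.items())
        for spec in partitions
    )


def build_pai_command(name, project, params):
    """Render a PAI command string with one -D option per parameter."""
    lines = [f"pai -name {name}", f"    -project {project}"]
    for key, value in params.items():
        if isinstance(value, bool):
            # Render booleans in lowercase, as in the command shown earlier.
            value = "true" if value else "false"
        lines.append(f"    -D{key}={value}")
    return "\n".join(lines)


# Hypothetical partition names, for illustration only.
partitions = format_partitions([
    {"ds": "20240101", "region": "cn"},
    {"ds": "20240102", "region": "cn"},
])

print(build_pai_command("split_word", "algo_public", {
    "inputTableName": "pai_aliws_test",
    "inputTablePartitions": partitions,
    "selectedColNames": "content",
    "outputTableName": "doc_test_split_word",
    "enablePosTagger": True,
}))
```

The sketch only formats text. It does not validate parameter names or values, and it does not submit the command.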

If the input is a standard table, do not set the coreNum and memSizePerCore parameters. The Split Word component automatically calculates the values.

If resources are limited, you can use the following code to calculate the number of workers and the memory per worker.

def CalcCoreNumAndMem(row, col, kOneCoreDataSize=1024):
    """Calculate the number of workers and the memory per worker.

    Args:
        row: The number of rows in the input table.
        col: The number of columns in the input table.
        kOneCoreDataSize: The data volume processed by a single worker, in MB.
            This must be a positive integer. The default value is 1024.

    Returns:
        A tuple (coreNum, memSizePerCore), with memSizePerCore in MB.

    Example:
        coreNum, memSizePerCore = CalcCoreNumAndMem(1000, 99, kOneCoreDataSize=2048)
    """
    kMBytes = 1024.0 * 1024.0
    # Estimate the data volume in MB, treating each cell as roughly 1000 bytes,
    # then divide by the volume one worker handles. Use at least one worker.
    coreNum = max(1, int(row * col * 1000 / kMBytes / kOneCoreDataSize))
    # Memory per worker: twice the data volume per worker, and at least 1024 MB.
    memSizePerCore = max(1024, int(kOneCoreDataSize * 2))
    return coreNum, memSizePerCore
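
To make the arithmetic concrete, the following sketch repeats the same calculation under a hypothetical name, `calc_core_num_and_mem`, and adds a clamping step. The clamping helper is an assumption added for illustration: it limits the results to the ranges documented for coreNum ([1,9999]) and memSizePerCore ([1024,64×1024]) and is not part of the original code.

```python
def calc_core_num_and_mem(row, col, one_core_data_size=1024):
    """Same calculation as the function above, restated so this block runs alone."""
    bytes_per_mb = 1024.0 * 1024.0
    core_num = max(1, int(row * col * 1000 / bytes_per_mb / one_core_data_size))
    mem_size_per_core = max(1024, int(one_core_data_size * 2))
    return core_num, mem_size_per_core


def clamp_to_documented_ranges(core_num, mem_size_per_core):
    """Limit both values to the ranges listed in the parameter table (assumed helper)."""
    core_num = min(max(core_num, 1), 9999)
    mem_size_per_core = min(max(mem_size_per_core, 1024), 64 * 1024)
    return core_num, mem_size_per_core


# Small table: the estimated volume is far below one worker's share,
# so the worker count falls back to the minimum of 1.
print(calc_core_num_and_mem(1000, 99, one_core_data_size=2048))  # (1, 4096)

# Larger table: 10,000,000 rows x 10 columns with the default share.
print(calc_core_num_and_mem(10_000_000, 10))  # (93, 2048)

# Out-of-range values are pulled back to the documented limits.
print(clamp_to_documented_ranges(20000, 131072))  # (9999, 65536)
```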

Example

Generate data

create table pai_aliws_test
as select
    1 as id,
    'Today is a good day. The weather is nice and sunny.' as content;

PAI command

pai -name split_word
    -project algo_public
    -DinputTableName=pai_aliws_test
    -DselectedColNames=content
    -DoutputTableName=doc_test_split_word

Input description

The input table contains two columns: an ID column and a content column.

+------------+-----------------------------------------------------+
| id         | content                                             |
+------------+-----------------------------------------------------+
| 1          | Today is a good day. The weather is nice and sunny. |
+------------+-----------------------------------------------------+

Output description

- The component tokenizes the specified column and leaves the other columns unchanged.
- If you use a custom dictionary, the system tokenizes text based on both the dictionary and the context. The tokenization may not strictly follow the custom dictionary.