Platform for AI: Split Word

Last Updated: Mar 05, 2026

This topic describes the Split Word component provided by Designer.

The Split Word component uses the Alibaba Word Segmenter (AliWS) to tokenize the content of a specified column. The resulting tokens are separated by spaces. If you enable part-of-speech (POS) tagging or semantic tagging, the output contains the tokens together with their POS tags and semantic tags. A forward slash (/) is the delimiter for POS tags, and a vertical bar (|) is the delimiter for semantic tags.
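As a rough illustration of the delimiters described above, the following Python sketch splits one tokenized string back into tokens, POS tags, and semantic tags. The function name parse_tokens, the sample string, the tag values, and the assumed item layout token/POS|semantic are all hypothetical. Check the actual output of your job before relying on this layout.

```python
def parse_tokens(segmented):
    """Parse space-separated items of the ASSUMED form token[/pos][|semantic]."""
    result = []
    for item in segmented.split():
        # Assumption: the semantic tag follows the first vertical bar.
        head, _, semantic = item.partition("|")
        # Assumption: the POS tag follows the last forward slash.
        token, slash, pos = head.rpartition("/")
        if not slash:
            # No slash found: the whole item is the token and there is no POS tag.
            token, pos = head, ""
        result.append({
            "token": token,
            "pos": pos or None,
            "semantic": semantic or None,
        })
    return result


# Hypothetical sample string, not real component output.
for entry in parse_tokens("good/a day/n|time sunny"):
    print(entry)
```

The sketch does not handle tokens that themselves contain a vertical bar.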

The Split Word component supports only the TAOBAO_CHN and INTERNET_CHN tokenizers.

You can configure the Split Word component in Designer using the GUI or PAI commands.

Component configuration

You can configure the Split Word component in the following ways.

Method 1: Use the GUI

You can configure the component on the workflow page of Designer.

| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Column Name | The column to tokenize. |
| Parameters Setting | Recognition Options | The content types to detect. Valid values: Detect simple entities, Detect person names, Detect organization names, Detect phone numbers, Detect time, Detect date, and Detect numbers and letters. Default values: Detect simple entities, Detect phone numbers, Detect time, Detect date, and Detect numbers and letters. |
| Parameters Setting | Merge Options | The content types to merge. Valid values: Merge Chinese numbers, Merge Arabic numerals, Merge Chinese dates, and Merge Chinese times. Default value: Merge Arabic numerals. |
| Parameters Setting | Filter | The tokenizer to use. Valid values: TAOBAO_CHN and INTERNET_CHN. Default value: TAOBAO_CHN. |
| Parameters Setting | Pos Tagger | Specifies whether to perform part-of-speech tagging. By default, this feature is enabled. |
| Parameters Setting | Semantic Tagger | Specifies whether to perform semantic tagging. By default, this feature is disabled. |
| Parameters Setting | Filter tokens that are numbers | Specifies whether to filter out tokens that are numbers. By default, this feature is disabled. |
| Parameters Setting | Filter tokens that are all-English words | Specifies whether to filter out tokens that consist of only English letters. By default, this feature is disabled. |
| Parameters Setting | Filter tokens that are punctuation marks | Specifies whether to filter out tokens that are punctuation marks. By default, this feature is disabled. |
| Execution Tuning | Number of cores | The number of cores. By default, the system allocates this value automatically. |
| Execution Tuning | Memory per core | The memory size per core. By default, the system allocates this value automatically. |

Method 2: Use a PAI command

You can also configure the component with a PAI command. To run PAI commands, use the SQL Script component. For more information, see SQL Script.

pai -name split_word
    -project algo_public
    -DinputTableName=pai_aliws_test
    -DselectedColNames=content
    -DoutputTableName=doc_test_split_word
    -Dtokenizer=TAOBAO_CHN
    -DenableDfa=true
    -DenablePersonNameTagger=false
    -DenableOrgnizationTagger=false
    -DenablePosTagger=false
    -DenableTelephoneRetrievalUnit=true
    -DenableTimeRetrievalUnit=true
    -DenableDateRetrievalUnit=true
    -DenableNumberLetterRetrievalUnit=true
    -DenableChnNumMerge=false
    -DenableNumMerge=true
    -DenableChnTimeMerge=false
    -DenableChnDateMerge=false
    -DenableSemanticTagger=true
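If you generate PAI commands from a script, the command text can be assembled from a dictionary of parameters. The following Python sketch shows one way to do this. The helper name build_pai_command is hypothetical, and the table and column names are the ones used in the Example section of this topic. The sketch only builds the command text and does not submit it.

```python
def build_pai_command(name, project, params):
    """Render a PAI command with one -D option per line."""
    lines = ["pai -name " + name, "    -project " + project]
    for key, value in params.items():
        if isinstance(value, bool):
            # Booleans are written in lowercase, as in the command in this topic.
            value = "true" if value else "false"
        lines.append("    -D{}={}".format(key, value))
    return "\n".join(lines)


command = build_pai_command(
    "split_word",
    "algo_public",
    {
        "inputTableName": "pai_aliws_test",
        "selectedColNames": "content",
        "outputTableName": "doc_test_split_word",
        "tokenizer": "TAOBAO_CHN",
        "enablePosTagger": True,
    },
)
print(command)
```

Keeping the parameters in a dictionary makes it easy to toggle individual options, such as enableSemanticTagger, without editing the command text by hand.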

| Parameter Name | Required | Description | Default Value |
| --- | --- | --- | --- |
| inputTableName | Yes | The name of the input table. | None |
| inputTablePartitions | No | The partitions in the input table to tokenize. The format is partition_name=value. For multi-level partitions, use the format name1=value1/name2=value2. Separate multiple partitions with commas (,). | All partitions |
| selectedColNames | Yes | The columns in the input table to tokenize. Separate multiple column names with commas (,). | None |
| dictTableName | No | The name of the custom dictionary table to use. A custom dictionary table has only one column, and each row is a word. | None |
| tokenizer | No | The tokenizer type. Valid values are TAOBAO_CHN and INTERNET_CHN. | TAOBAO_CHN |
| enableDfa | No | Specifies whether to detect simple entities. Valid values: True or False. | True |
| enablePersonNameTagger | No | Specifies whether to detect person names. Valid values: True or False. | False |
| enableOrgnizationTagger | No | Specifies whether to detect organization names. Valid values: True or False. | False |
| enablePosTagger | No | Specifies whether to perform part-of-speech tagging. Valid values: True or False. | False |
| enableTelephoneRetrievalUnit | No | Specifies whether to detect phone numbers. Valid values: True or False. | True |
| enableTimeRetrievalUnit | No | Specifies whether to detect time. Valid values: True or False. | True |
| enableDateRetrievalUnit | No | Specifies whether to detect dates. Valid values: True or False. | True |
| enableNumberLetterRetrievalUnit | No | Specifies whether to detect numbers and letters. Valid values: True or False. | True |
| enableChnNumMerge | No | Specifies whether to merge Chinese numbers into a retrieval unit. Valid values: True or False. | False |
| enableNumMerge | No | Specifies whether to merge standard numbers into a retrieval unit. Valid values: True or False. | True |
| enableChnTimeMerge | No | Specifies whether to merge Chinese time expressions into a semantic unit. Valid values: True or False. | False |
| enableChnDateMerge | No | Specifies whether to merge Chinese date expressions into a semantic unit. Valid values: True or False. | False |
| enableSemanticTagger | No | Specifies whether to perform semantic tagging. Valid values: True or False. | False |
| outputTableName | Yes | The name of the output table. | None |
| outputTablePartition | No | The partition name of the output table. | None |
| coreNum | No | The number of workers. This parameter takes effect only when the memSizePerCore parameter is also set. The value must be a positive integer in the range of [1,9999]. | Automatically allocated by the system |
| memSizePerCore | No | The memory size per core, in MB. The value must be a positive integer in the range of [1024,64×1024]. | Automatically allocated by the system |
| lifecycle | No | The lifecycle of the output table. The value must be a positive integer. | None |
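The inputTablePartitions format is easy to get wrong when a table has multi-level partitions. The following Python sketch builds the value from a list of partitions, joining the levels of one partition with a forward slash (/) and separating partitions with commas (,). The helper name format_partitions and the partition names ds and region are hypothetical.

```python
def format_partitions(partitions):
    """Format a value for inputTablePartitions.

    partitions: a list of partitions. Each partition is an ordered list of
    (name, value) pairs, one pair per partition level.
    """
    specs = []
    for levels in partitions:
        # Levels of one partition are joined with "/": name1=value1/name2=value2
        specs.append("/".join("{}={}".format(name, value) for name, value in levels))
    # Multiple partitions are separated with commas.
    return ",".join(specs)


spec = format_partitions([
    [("ds", "20260301"), ("region", "news")],
    [("ds", "20260302"), ("region", "news")],
])
print(spec)  # ds=20260301/region=news,ds=20260302/region=news
```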

If the input is a standard table, do not set the coreNum and memSizePerCore parameters. The Split Word component automatically calculates the values.

If resources are limited, you can use the following code to calculate the number of workers and the memory per worker.

def CalcCoreNumAndMem(row, col, kOneCoreDataSize=1024):
    """Calculate the number of workers and the memory per worker.

    Args:
        row: The number of rows in the input table.
        col: The number of columns in the input table.
        kOneCoreDataSize: The data volume processed by a single worker, in MB.
            This must be a positive integer. The default value is 1024.
    Returns:
        A tuple (coreNum, memSizePerCore), with memSizePerCore in MB.
    Example:
        coreNum, memSizePerCore = CalcCoreNumAndMem(1000, 99, kOneCoreDataSize=2048)
    """
    kMBytes = 1024.0 * 1024.0
    # Estimate the data volume in MB (the formula counts 1000 bytes per cell),
    # then allocate one worker per kOneCoreDataSize MB, with at least one worker.
    coreNum = max(1, int(row * col * 1000 / kMBytes / kOneCoreDataSize))
    # Memory per worker = twice the data volume per worker, with a minimum of 1024 MB.
    memSizePerCore = max(1024, int(kOneCoreDataSize * 2))
    return coreNum, memSizePerCore
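The function above does not cap its results, while coreNum must fall in the range [1,9999] and memSizePerCore in the range [1024,64×1024]. The following Python sketch clamps a computed pair to those ranges before you pass it to the component. The helper name clamp_resources is hypothetical, and clamping is only one possible policy. You may prefer to fail fast when a value is out of range.

```python
CORE_NUM_RANGE = (1, 9999)              # documented range of coreNum
MEM_PER_CORE_RANGE = (1024, 64 * 1024)  # documented range of memSizePerCore, in MB


def clamp_resources(core_num, mem_size_per_core):
    """Clamp a (coreNum, memSizePerCore) pair to the documented ranges."""
    low, high = CORE_NUM_RANGE
    core_num = min(max(int(core_num), low), high)
    low, high = MEM_PER_CORE_RANGE
    mem_size_per_core = min(max(int(mem_size_per_core), low), high)
    return core_num, mem_size_per_core


print(clamp_resources(20000, 512))  # (9999, 1024)
```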

Example

  • Generate data

    create table pai_aliws_test
    as select
        1 as id,
        'Today is a good day. The weather is nice and sunny.' as content;
  • PAI command

    pai -name split_word
        -project algo_public
        -DinputTableName=pai_aliws_test
        -DselectedColNames=content
        -DoutputTableName=doc_test_split_word
  • Input description

    The input table contains two columns: an ID column and a content column.

    +------------+-----------------------------------------------------+
    | id         | content                                             |
    +------------+-----------------------------------------------------+
    | 1          | Today is a good day. The weather is nice and sunny. |
    +------------+-----------------------------------------------------+
  • Output description

    • The component tokenizes the specified column and leaves the other columns unchanged.

    • If you use a custom dictionary, the system tokenizes text based on both the dictionary and the context. The tokenization may not strictly follow the custom dictionary.