This topic describes the Split Word component provided by Designer.
The Split Word component uses the Alibaba Word Segmenter (AliWS) to tokenize content in a specified column. The resulting tokens are separated by spaces. If you configure part-of-speech (POS) tagging or semantic tagging, the output includes the tokens, POS tags, and semantic tags. POS tags are separated by a forward slash (/). Semantic tags are separated by a vertical bar (|).
The Split Word component supports only the TAOBAO_CHN and INTERNET_CHN tokenizers.
You can configure the Split Word component in Designer using the GUI or PAI commands.
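As a rough illustration of the output format described above, the following Python sketch splits one tokenized string back into tokens and optional tags. The exact layout of a tagged token is not spelled out in this topic, so the sketch assumes that each space-separated item looks like token/POS|SEMANTIC. The function name parse_tokens and the sample tags are hypothetical.

```python
def parse_tokens(line):
    """Split one tokenized output string into (token, pos, semantic) tuples.

    Assumed layout (not confirmed by this topic): items are separated by
    spaces, and each item has the form token[/pos][|semantic].
    """
    parsed = []
    for item in line.split():
        # Peel off the semantic tag first, since it is assumed to come last.
        head, _, semantic = item.partition("|")
        # rpartition splits on the last "/", so a token containing "/" stays whole.
        token, sep, pos = head.rpartition("/")
        if not sep:
            # No "/" found: the whole item is the token and there is no POS tag.
            token, pos = head, ""
        parsed.append((token, pos or None, semantic or None))
    return parsed


if __name__ == "__main__":
    # Sample input with made-up tags, for illustration only.
    for entry in parse_tokens("today/t weather/n|nature good"):
        print(entry)
```

Missing tags are returned as None, so the same function can be used whether or not POS tagging and semantic tagging are enabled.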
Component configuration
You can configure the Split Word component in the following ways.
Method 1: Use the GUI
You can configure the component on the workflow page of Designer.
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Column Name | The column to tokenize. |
| Parameters Setting | Recognition Options | The content types to detect. Default values: Detect simple entities, Detect phone numbers, Detect time, Detect date, and Detect numbers and letters. |
| | Merge Options | The content types to merge. The default value is Merge Arabic numerals. |
| | Filter | The tokenizer type. Valid values: TAOBAO_CHN and INTERNET_CHN. The default value is TAOBAO_CHN. |
| | Pos Tagger | Specifies whether to perform part-of-speech tagging. By default, this feature is enabled. |
| | Semantic Tagger | Specifies whether to perform semantic tagging. By default, this feature is disabled. |
| | Filter tokens that are numbers | Specifies whether to filter out tokens that are numbers. By default, this feature is disabled. |
| | Filter tokens that are all-English words | Specifies whether to filter out tokens that consist of only English letters. By default, this feature is disabled. |
| | Filter tokens that are punctuation marks | Specifies whether to filter out tokens that are punctuation marks. By default, this feature is disabled. |
| Execution Tuning | Number of cores | The number of cores. By default, the system allocates this value automatically. |
| | Memory per core | The memory size of each core. By default, the system allocates this value automatically. |
Method 2: Use a PAI command
You can use a PAI command to configure the component. You can use the SQL Script component to run PAI commands. For more information, see SQL Script.
pai -name split_word_model
-project algo_public
-DoutputModelName=aliws_model
-DcolName=content
-Dtokenizer=TAOBAO_CHN
-DenableDfa=true
-DenablePersonNameTagger=false
-DenableOrgnizationTagger=false
-DenablePosTagger=false
-DenableTelephoneRetrievalUnit=true
-DenableTimeRetrievalUnit=true
-DenableDateRetrievalUnit=true
-DenableNumberLetterRetrievalUnit=true
-DenableChnNumMerge=false
-DenableNumMerge=true
-DenableChnTimeMerge=false
-DenableChnDateMerge=false
-DenableSemanticTagger=true
| Parameter Name | Required | Description | Default Value |
| --- | --- | --- | --- |
| inputTableName | Yes | The name of the input table. | None |
| inputTablePartitions | No | The partitions in the input table to tokenize. | All partitions |
| selectedColNames | Yes | The columns in the input table to tokenize. Separate multiple column names with commas (,). | None |
| dictTableName | No | The name of a custom dictionary table, if one is used. A custom dictionary table has only one column, and each row is a word. | None |
| tokenizer | No | The tokenizer type. Valid values: TAOBAO_CHN and INTERNET_CHN. | TAOBAO_CHN |
| enableDfa | No | Specifies whether to detect simple entities. Valid values: True or False. | True |
| enablePersonNameTagger | No | Specifies whether to detect person names. Valid values: True or False. | False |
| enableOrgnizationTagger | No | Specifies whether to detect organization names. Valid values: True or False. | False |
| enablePosTagger | No | Specifies whether to perform part-of-speech tagging. Valid values: True or False. | False |
| enableTelephoneRetrievalUnit | No | Specifies whether to detect phone numbers. Valid values: True or False. | True |
| enableTimeRetrievalUnit | No | Specifies whether to detect time. Valid values: True or False. | True |
| enableDateRetrievalUnit | No | Specifies whether to detect dates. Valid values: True or False. | True |
| enableNumberLetterRetrievalUnit | No | Specifies whether to detect numbers and letters. Valid values: True or False. | True |
| enableChnNumMerge | No | Specifies whether to merge Chinese numbers into a retrieval unit. Valid values: True or False. | False |
| enableNumMerge | No | Specifies whether to merge standard numbers into a retrieval unit. Valid values: True or False. | True |
| enableChnTimeMerge | No | Specifies whether to merge Chinese time expressions into a semantic unit. Valid values: True or False. | False |
| enableChnDateMerge | No | Specifies whether to merge Chinese date expressions into a semantic unit. Valid values: True or False. | False |
| enableSemanticTagger | No | Specifies whether to perform semantic tagging. Valid values: True or False. | False |
| outputTableName | Yes | The name of the output table. | None |
| outputTablePartition | No | The partition name of the output table. | None |
| coreNum | No | The number of workers. This parameter takes effect only when the memSizePerCore parameter is also set. The value must be a positive integer in the range of [1,9999]. | Automatically allocated by the system |
| memSizePerCore | No | The memory size per core, in MB. The value must be a positive integer in the range of [1024,64×1024]. | Automatically allocated by the system |
| lifecycle | No | The lifecycle of the output table. The value must be a positive integer. | None |
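The parameters in the table above can be assembled into a command string programmatically. The following Python sketch is one way to do that. The helper name build_split_word_command is hypothetical, and the function only formats text and checks a few of the constraints listed in the table. It does not connect to PAI or run the command.

```python
VALID_TOKENIZERS = {"TAOBAO_CHN", "INTERNET_CHN"}


def build_split_word_command(input_table, selected_cols, output_table, **options):
    """Assemble a split_word PAI command string (hypothetical helper).

    The three required parameters are positional. Optional parameters are
    passed as keyword arguments named exactly as in the parameter table.
    """
    tokenizer = options.get("tokenizer", "TAOBAO_CHN")
    if tokenizer not in VALID_TOKENIZERS:
        raise ValueError("tokenizer must be TAOBAO_CHN or INTERNET_CHN")
    # coreNum takes effect only when memSizePerCore is also set.
    if "coreNum" in options and "memSizePerCore" not in options:
        raise ValueError("set memSizePerCore when coreNum is set")

    parts = [
        "pai -name split_word",
        "-project algo_public",
        "-DinputTableName=%s" % input_table,
        "-DselectedColNames=%s" % ",".join(selected_cols),
        "-DoutputTableName=%s" % output_table,
    ]
    # Sort the optional parameters so the generated string is deterministic.
    for key in sorted(options):
        value = options[key]
        # Booleans are written in lowercase, as in the sample command above.
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append("-D%s=%s" % (key, value))
    return " ".join(parts)


if __name__ == "__main__":
    print(build_split_word_command(
        "pai_aliws_test", ["content"], "doc_test_split_word",
        tokenizer="INTERNET_CHN", enablePosTagger=True,
    ))
```

The generated string can then be pasted into an SQL Script component. Optional parameters that are not passed are left to their default values.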
If the input is a standard table, do not set the coreNum and memSizePerCore parameters. The Split Word component automatically calculates the values.
If resources are limited, you can use the following code to calculate the number of workers and the memory per worker.
def CalcCoreNumAndMem(row, col, kOneCoreDataSize=1024):
    """Calculate the number of workers and the memory per worker.

    Args:
        row: The number of rows in the input table.
        col: The number of columns in the input table.
        kOneCoreDataSize: The data volume processed by a single worker, in MB.
            This must be a positive integer. The default value is 1024.

    Returns:
        coreNum, memSizePerCore

    Example:
        coreNum, memSizePerCore = CalcCoreNumAndMem(1000, 99, kOneCoreDataSize=2048)
    """
    kMBytes = 1024.0 * 1024.0
    # Calculate the number of workers based on the data volume.
    coreNum = max(1, int(row * col * 1000 / kMBytes / kOneCoreDataSize))
    # Memory per worker (MB) = twice the data volume per worker, at least 1024.
    memSizePerCore = max(1024, int(kOneCoreDataSize * 2))
    return coreNum, memSizePerCore
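For very large tables or large kOneCoreDataSize values, the results of CalcCoreNumAndMem can fall outside the ranges that the coreNum and memSizePerCore parameters accept. The following sketch repeats the same arithmetic and then clamps the results to [1,9999] and [1024,64×1024]. The function name calc_core_num_and_mem_clamped is hypothetical, and the clamping step is an addition for illustration, not part of the component.

```python
def calc_core_num_and_mem_clamped(row, col, one_core_data_size=1024):
    """Same estimate as CalcCoreNumAndMem, clamped to the accepted ranges."""
    bytes_per_mb = 1024.0 * 1024.0
    # The factor 1000 is taken as-is from the original function; it is read
    # here as an assumed size estimate per table cell, in bytes.
    core_num = max(1, int(row * col * 1000 / bytes_per_mb / one_core_data_size))
    mem_size_per_core = max(1024, int(one_core_data_size * 2))
    # Clamp to coreNum in [1, 9999] and memSizePerCore in [1024, 64 * 1024].
    return min(core_num, 9999), min(mem_size_per_core, 64 * 1024)


if __name__ == "__main__":
    # Small table: the estimate rounds down to 0, so one worker is used.
    print(calc_core_num_and_mem_clamped(1000, 99, one_core_data_size=2048))  # (1, 4096)
    # Large table: about 9313 workers with 2048 MB each.
    print(calc_core_num_and_mem_clamped(10**9, 10))  # (9313, 2048)
    # Very large table: the worker count is capped at 9999.
    print(calc_core_num_and_mem_clamped(10**10, 10))  # (9999, 2048)
```

Both returned values should be passed to the command together, because coreNum takes effect only when memSizePerCore is also set.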
Example
- Generate data
  create table pai_aliws_test as select 1 as id, 'Today is a good day. The weather is nice and sunny.' as content;
- PAI command
  pai -name split_word -project algo_public -DinputTableName=pai_aliws_test -DselectedColNames=content -DoutputTableName=doc_test_split_word
- Input description
  The input table contains two columns: an ID column and a content column.
  +------------+-----------------------------------------------------+
  | id         | content                                             |
  +------------+-----------------------------------------------------+
  | 1          | Today is a good day. The weather is nice and sunny. |
  +------------+-----------------------------------------------------+
- Output description
  - The component tokenizes the specified column and leaves the other columns unchanged.
  - If you use a custom dictionary, the system tokenizes text based on both the dictionary and the context. The tokenization may not strictly follow the custom dictionary.