This topic introduces the Split Word (Generate Model) algorithm component provided by Designer.
The Split Word (Generate Model) algorithm component is based on the Alibaba Word Segmenter (AliWS) lexical analysis system. It generates a word segmentation model based on parameters and a custom dictionary.
The component supports Chinese word segmentation for the Taobao and Internet domains.
Differences from Split Word:
- The Split Word component directly segments the input text.
- The Split Word (Generate Model) component generates a word segmentation model. To segment text, you must first deploy the model, and then make predictions or call the online API.
Component configuration
You can configure the Split Word (Generate Model) component in one of the following ways.
Method 1: Use the GUI
You can configure the component parameters on the Designer workflow page.
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Selected Field Column | The field column used to generate the model. |
| Parameters Setting | Recognized Options | The content types to detect. By default, Detect simple entities, Detect phone numbers, Detect time, Detect dates, and Detect numbers and letters are selected. |
| | Merge Options | The content types to merge. By default, Merge Arabic numerals is selected. |
| | Tokenizer | The tokenizer to use, which selects the segmentation domain. Valid values: TAOBAO_CHN and INTERNET_CHN. Default: TAOBAO_CHN. |
| | Pos Tagger | Specifies whether to perform part-of-speech tagging. By default, this feature is disabled. |
| | Semantic Tagger | Specifies whether to perform semantic tagging. By default, this feature is disabled. |
| | Filter out words that contain only numbers | Specifies whether to filter out segmented words that consist only of numbers. By default, this feature is disabled. |
| | Filter out words that contain only English letters | Specifies whether to filter out segmented words that consist only of English letters. By default, this feature is disabled. |
| | Filter out words that contain only punctuation marks | Specifies whether to filter out segmented words that consist only of punctuation marks. By default, this feature is disabled. |
| Execution Tuning | Number of cores | The number of cores. By default, the system allocates this value automatically. |
| | Memory per core | The memory size of each core. By default, the system allocates this value automatically. |
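The three filter options can be pictured as simple predicates over the segmented words. The sketch below is only an illustration of that idea: the class `TokenFilterSketch`, its method, and the regular expressions are assumptions made for this example, and they do not reproduce the actual AliWS filtering rules.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the three filter options; not the AliWS implementation. */
public class TokenFilterSketch {

    /** Returns the tokens that survive the enabled filters. */
    public static List<String> filter(List<String> tokens,
                                      boolean dropNumbers,
                                      boolean dropEnglish,
                                      boolean dropPunctuation) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            // Assumption: "only numbers" means ASCII digits only.
            if (dropNumbers && token.matches("[0-9]+")) continue;
            // Assumption: "only English letters" means A-Z and a-z only.
            if (dropEnglish && token.matches("[A-Za-z]+")) continue;
            // Assumption: punctuation is any character in the Unicode category P.
            if (dropPunctuation && token.matches("\\p{P}+")) continue;
            kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("PAI", "2024", ",", "v2");
        // Drop pure numbers and pure punctuation, keep English words.
        System.out.println(filter(tokens, true, false, true)); // prints [PAI, v2]
    }
}
```

A mixed token such as `v2` is kept in every case because it matches none of the "only" patterns.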
Method 2: Use PAI commands
You can run PAI commands in the SQL Script component to configure the component. For more information, see SQL Script.
pai -name split_word_model
-project algo_public
-DoutputModelName=aliws_model
-DcolName=content
-Dtokenizer=TAOBAO_CHN
-DenableDfa=true
-DenablePersonNameTagger=false
-DenableOrgnizationTagger=false
-DenablePosTagger=false
-DenableTelephoneRetrievalUnit=true
-DenableTimeRetrievalUnit=true
-DenableDateRetrievalUnit=true
-DenableNumberLetterRetrievalUnit=true
-DenableChnNumMerge=false
-DenableNumMerge=true
-DenableChnTimeMerge=false
-DenableChnDateMerge=false
-DenableSemanticTagger=true
| Parameter Name | Required | Description | Default Value |
| --- | --- | --- | --- |
| userDictTableName | No | The name of a custom dictionary table. A custom dictionary table has only one column, and each row contains one word. | None |
| outputModelName | Yes | The name of the output model. | None |
| colName | No | The name of the column that contains the text for prediction. | context |
| dictTableName | No | The name of a custom dictionary table. A custom dictionary table has only one column, and each row contains one word. | None |
| tokenizer | No | The tokenizer type. Valid values: TAOBAO_CHN and INTERNET_CHN. | TAOBAO_CHN |
| enableDfa | No | Specifies whether to detect simple entities. Valid values: True and False. | True |
| enablePersonNameTagger | No | Specifies whether to detect person names. Valid values: True and False. | False |
| enableOrgnizationTagger | No | Specifies whether to detect organization names. Valid values: True and False. | False |
| enablePosTagger | No | Specifies whether to perform part-of-speech tagging. Valid values: True and False. | False |
| enableTelephoneRetrievalUnit | No | Specifies whether to detect phone numbers. Valid values: True and False. | True |
| enableTimeRetrievalUnit | No | Specifies whether to detect time. Valid values: True and False. | True |
| enableDateRetrievalUnit | No | Specifies whether to detect dates. Valid values: True and False. | True |
| enableNumberLetterRetrievalUnit | No | Specifies whether to detect numbers and letters. Valid values: True and False. | True |
| enableChnNumMerge | No | Specifies whether to merge Chinese numerals into a retrieval unit. Valid values: True and False. | False |
| enableNumMerge | No | Specifies whether to merge regular numbers into a retrieval unit. Valid values: True and False. | True |
| enableChnTimeMerge | No | Specifies whether to merge Chinese time expressions into a semantic unit. Valid values: True and False. | False |
| enableChnDateMerge | No | Specifies whether to merge Chinese date expressions into a semantic unit. Valid values: True and False. | False |
| enableSemanticTagger | No | Specifies whether to perform semantic tagging. Valid values: True and False. | False |
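When the command is generated by a program instead of typed by hand, the rules in the table can be checked before the command is submitted. The following is a minimal sketch under stated assumptions: `SplitWordModelCommand` and `build` are hypothetical names that belong to no PAI SDK, the helper only formats text, and it checks just two rules from the table (outputModelName is required, and tokenizer accepts only TAOBAO_CHN or INTERNET_CHN).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper (not part of any PAI SDK) that formats a split_word_model command. */
public class SplitWordModelCommand {

    // Valid tokenizer values, as listed in the parameter table.
    private static final Set<String> TOKENIZERS = Set.of("TAOBAO_CHN", "INTERNET_CHN");

    /**
     * Builds the command text. Options that are omitted are simply left out,
     * so the component falls back to its documented defaults.
     */
    public static String build(String outputModelName, Map<String, String> options) {
        if (outputModelName == null || outputModelName.isEmpty()) {
            throw new IllegalArgumentException("outputModelName is required");
        }
        String tokenizer = options.get("tokenizer");
        if (tokenizer != null && !TOKENIZERS.contains(tokenizer)) {
            throw new IllegalArgumentException("tokenizer must be TAOBAO_CHN or INTERNET_CHN");
        }
        StringBuilder command =
                new StringBuilder("pai -name split_word_model -project algo_public");
        command.append(" -DoutputModelName=").append(outputModelName);
        // Append the remaining options in insertion order as -Dkey=value pairs.
        for (Map.Entry<String, String> option : options.entrySet()) {
            command.append(" -D").append(option.getKey()).append('=').append(option.getValue());
        }
        return command.toString();
    }

    public static void main(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("colName", "content");
        options.put("tokenizer", "INTERNET_CHN");
        options.put("enablePosTagger", "true");
        System.out.println(build("aliws_model", options));
    }
}
```

A `LinkedHashMap` keeps the options in the order they were added, which makes the generated command easy to compare with a handwritten one.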
Examples
- PAI command
    pai -name split_word_model -project algo_public -DoutputModelName=aliws_model
- Deployment
    create onlinemodel ning_test_aliws_model_2 -offlinemodelName ning_test_aliws_model -instanceNum 1 -cpu 100 -memory 4096;
- Online word segmentation
    // Build a request that contains one row with the text to segment.
    KVJsonRequest request = new KVJsonRequest();
    Map<String, JsonFeatureValue> row = request.addRow();
    row.put(col_name, new JsonFeatureValue("The big data algorithm platform is a new platform"));
    // Send the request to the deployed model and print each segmentation result.
    KVJsonResponse res = predictClient.syncPredict(new JsonPredictRequest(project_name, model_name, request));
    List<ResponseItem> ri = res.getOutputs();
    for (ResponseItem item : ri) {
        System.out.println(item.getOutputLabel());
    }
- Offline word segmentation
    pai -name prediction -DmodelName=ning_test_aliws_model -DinputTableName=ning_test_aliws -DoutputTableName=ning_test_aliws_offline_predict;
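The generation, deployment, and offline prediction steps are linked by the offline model name. The sketch below formats the three commands from one shared name to make that link explicit. `SegmentationWorkflow` and its methods are hypothetical helpers that only build strings. The sketch also assumes, without confirmation from this topic, that the name passed to -DoutputModelName is the one that the deployment and prediction commands must reference. The resource settings (-instanceNum 1 -cpu 100 -memory 4096) are copied from the deployment example, not recommended values.

```java
/** Hypothetical helper that only formats command text; it does not call any PAI API. */
public class SegmentationWorkflow {

    /** Step 1: generate the offline word segmentation model. */
    public static String generateModel(String offlineModel, String colName) {
        return "pai -name split_word_model -project algo_public"
                + " -DoutputModelName=" + offlineModel
                + " -DcolName=" + colName;
    }

    /** Step 2: deploy the offline model as an online model (settings from the example). */
    public static String deploy(String onlineModel, String offlineModel) {
        return "create onlinemodel " + onlineModel
                + " -offlinemodelName " + offlineModel
                + " -instanceNum 1 -cpu 100 -memory 4096;";
    }

    /** Step 3 (alternative to online calls): segment a table offline with the same model. */
    public static String offlinePredict(String offlineModel, String inputTable, String outputTable) {
        return "pai -name prediction"
                + " -DmodelName=" + offlineModel
                + " -DinputTableName=" + inputTable
                + " -DoutputTableName=" + outputTable + ";";
    }

    public static void main(String[] args) {
        String model = "ning_test_aliws_model"; // shared offline model name
        System.out.println(generateModel(model, "content"));
        System.out.println(deploy(model + "_2", model));
        System.out.println(offlinePredict(model, "ning_test_aliws", "ning_test_aliws_offline_predict"));
    }
}
```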