Configure the Split Word (Generate Model) component - Platform for AI

This topic introduces the Split Word (Generate Model) algorithm component provided by Designer.

The Split Word (Generate Model) algorithm component is based on the Alibaba Word Segmenter (AliWS) lexical analysis system. It generates a word segmentation model based on parameters and a custom dictionary.

The Split Word (Generate Model) component supports Chinese word segmentation for the Taobao and Internet domains.

Differences from Split Word:

The Split Word component directly segments the input text.
The Split Word (Generate Model) component generates a word segmentation model. To segment text, you must first deploy the model, and then make predictions or call the online API.

Component configuration

You can configure the Split Word (Generate Model) component in one of the following ways.

Method 1: Use the GUI

You can configure the component parameters on the Designer workflow page.

Tab	Parameter	Description
Fields Setting	Selected Field Column	The field column used to generate the model.
Parameters Setting	Recognized Options	The content types to detect. Valid values: Detect simple entities, Detect person names, Detect organization names, Detect phone numbers, Detect time, Detect dates, and Detect numbers and letters. Default: Detect simple entities, Detect phone numbers, Detect time, Detect dates, and Detect numbers and letters are selected.
	Merge Options	The content types to merge. Valid values: Merge Chinese numerals, Merge Arabic numerals, Merge Chinese dates, and Merge Chinese time. Default: Merge Arabic numerals is selected.
	Tokenizer	The tokenizer type, which selects the segmentation domain (Taobao or Internet). Valid values: TAOBAO_CHN and INTERNET_CHN. Default: TAOBAO_CHN.
	Pos Tagger	Specifies whether to perform part-of-speech tagging. By default, this feature is disabled.
	Semantic Tagger	Specifies whether to perform semantic tagging. By default, this feature is disabled.
	Filter out words that contain only numbers	Specifies whether to filter out segmented words that consist only of numbers. By default, this feature is disabled.
	Filter out words that contain only English letters	Specifies whether to filter out segmented words that consist only of English letters. By default, this feature is disabled.
	Filter out words that contain only punctuation marks	Specifies whether to filter out segmented words that consist only of punctuation marks. By default, this feature is disabled.
Execution Tuning	Number of cores	The number of cores. By default, the system determines this value automatically.
Execution Tuning	Memory per core	The memory size of each core. By default, the system allocates memory automatically.

Method 2: Use PAI commands

You can run PAI commands in the SQL Script component to configure the component. For more information, see SQL Script.

pai -name split_word_model
    -project algo_public
    -DoutputModelName=aliws_model
    -DcolName=content
    -Dtokenizer=TAOBAO_CHN
    -DenableDfa=true
    -DenablePersonNameTagger=false
    -DenableOrgnizationTagger=false
    -DenablePosTagger=false
    -DenableTelephoneRetrievalUnit=true
    -DenableTimeRetrievalUnit=true
    -DenableDateRetrievalUnit=true
    -DenableNumberLetterRetrievalUnit=true
    -DenableChnNumMerge=false
    -DenableNumMerge=true
    -DenableChnTimeMerge=false
    -DenableChnDateMerge=false
    -DenableSemanticTagger=true

Parameter Name	Required	Description	Default Value
userDictTableName	No	The name of a custom dictionary table, if you use one. A custom dictionary table has only one column, and each row contains one word.	None
outputModelName	Yes	The name of the output model.	None
colName	No	The column name of the text for prediction.	context
dictTableName	No	The name of a custom dictionary table, if you use one. A custom dictionary table has only one column, and each row contains one word.	None
tokenizer	No	The tokenizer type, which selects the segmentation domain. Valid values: TAOBAO_CHN and INTERNET_CHN.	TAOBAO_CHN
enableDfa	No	Specifies whether to detect simple entities. Valid values: True and False.	True
enablePersonNameTagger	No	Specifies whether to detect person names. Valid values: True and False.	False
enableOrgnizationTagger	No	Specifies whether to detect organization names. Valid values: True and False.	False
enablePosTagger	No	Specifies whether to perform part-of-speech tagging. Valid values: True and False.	False
enableTelephoneRetrievalUnit	No	Specifies whether to detect phone numbers. Valid values: True and False.	True
enableTimeRetrievalUnit	No	Specifies whether to detect time. Valid values: True and False.	True
enableDateRetrievalUnit	No	Specifies whether to detect dates. Valid values: True and False.	True
enableNumberLetterRetrievalUnit	No	Specifies whether to detect numbers and letters. Valid values: True and False.	True
enableChnNumMerge	No	Specifies whether to merge Chinese numerals into a retrieval unit. Valid values: True and False.	False
enableNumMerge	No	Specifies whether to merge regular numbers into a retrieval unit. Valid values: True and False.	True
enableChnTimeMerge	No	Specifies whether to merge Chinese time expressions into a semantic unit. Valid values: True and False.	False
enableChnDateMerge	No	Specifies whether to merge Chinese date expressions into a semantic unit. Valid values: True and False.	False
enableSemanticTagger	No	Specifies whether to perform semantic tagging. Valid values: True and False.	False
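
The GUI options in Method 1 and the switches in this table appear to describe the same settings. The following Python sketch pairs them by matching their descriptions. The pairing is an inference from the two tables rather than a documented mapping, and the names GUI_TO_PARAM, GUI_DEFAULTS, and flags_for_gui_selection are hypothetical. The sketch only formats text and does not call PAI.

```python
# Pairing of GUI option labels (Method 1) with command parameters (Method 2).
# Assumption: each pair is inferred by matching the descriptions in the two tables.
GUI_TO_PARAM = {
    "Detect simple entities": "enableDfa",
    "Detect person names": "enablePersonNameTagger",
    "Detect organization names": "enableOrgnizationTagger",  # spelled as in the table
    "Detect phone numbers": "enableTelephoneRetrievalUnit",
    "Detect time": "enableTimeRetrievalUnit",
    "Detect dates": "enableDateRetrievalUnit",
    "Detect numbers and letters": "enableNumberLetterRetrievalUnit",
    "Merge Chinese numerals": "enableChnNumMerge",
    "Merge Arabic numerals": "enableNumMerge",
    "Merge Chinese dates": "enableChnDateMerge",
    "Merge Chinese time": "enableChnTimeMerge",
    "Pos Tagger": "enablePosTagger",
    "Semantic Tagger": "enableSemanticTagger",
}

# GUI defaults as listed in the Method 1 table.
GUI_DEFAULTS = {
    "Detect simple entities",
    "Detect phone numbers",
    "Detect time",
    "Detect dates",
    "Detect numbers and letters",
    "Merge Arabic numerals",
}


def flags_for_gui_selection(selected):
    """Return one -D flag per switch, set to true only for the selected options."""
    selected = set(selected)
    unknown = selected - set(GUI_TO_PARAM)
    if unknown:
        raise ValueError(f"Unknown GUI options: {sorted(unknown)}")
    return [
        f"-D{param}={'true' if label in selected else 'false'}"
        for label, param in GUI_TO_PARAM.items()
    ]


if __name__ == "__main__":
    # Print the flags that correspond to the GUI defaults.
    print("\n".join(flags_for_gui_selection(GUI_DEFAULTS)))
```

Under this pairing, the GUI defaults produce the same true and false values as the Default Value column of the table.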

Examples

PAI command

pai -name split_word_model
    -project algo_public
    -DoutputModelName=aliws_model
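
The command above sets only outputModelName, so every other parameter falls back to its default from the parameter table. As a rough illustration, the following Python sketch merges overrides onto those documented defaults and renders the resulting command text. The helper name build_split_word_model_command is hypothetical, and the sketch only builds a string; it does not submit anything to PAI.

```python
# Defaults copied from the parameter table in "Method 2: Use PAI commands".
DEFAULTS = {
    "colName": "context",
    "tokenizer": "TAOBAO_CHN",
    "enableDfa": True,
    "enablePersonNameTagger": False,
    "enableOrgnizationTagger": False,  # spelled as in the table
    "enablePosTagger": False,
    "enableTelephoneRetrievalUnit": True,
    "enableTimeRetrievalUnit": True,
    "enableDateRetrievalUnit": True,
    "enableNumberLetterRetrievalUnit": True,
    "enableChnNumMerge": False,
    "enableNumMerge": True,
    "enableChnTimeMerge": False,
    "enableChnDateMerge": False,
    "enableSemanticTagger": False,
}
VALID_TOKENIZERS = {"TAOBAO_CHN", "INTERNET_CHN"}
# Optional parameters that have no default value in the table.
NO_DEFAULT = {"userDictTableName", "dictTableName"}


def build_split_word_model_command(output_model_name, **overrides):
    """Render a split_word_model command with defaults filled in (hypothetical helper)."""
    if not output_model_name:
        raise ValueError("outputModelName is required")
    unknown = set(overrides) - set(DEFAULTS) - NO_DEFAULT
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")
    params = {**DEFAULTS, **overrides}
    if params["tokenizer"] not in VALID_TOKENIZERS:
        raise ValueError(f"Unsupported tokenizer: {params['tokenizer']}")
    lines = [
        "pai -name split_word_model",
        "    -project algo_public",
        f"    -DoutputModelName={output_model_name}",
    ]
    for name, value in params.items():
        if isinstance(value, bool):
            # The command example writes Boolean values in lowercase.
            value = "true" if value else "false"
        lines.append(f"    -D{name}={value}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_split_word_model_command("aliws_model"))
```

For example, build_split_word_model_command("aliws_model", tokenizer="INTERNET_CHN") renders the same command with only the tokenizer line changed.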

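Deployment

The following statement deploys the offline model ning_test_aliws_model as the online model ning_test_aliws_model_2. Set -offlinemodelName to the model name that you specified for outputModelName when you generated the model.

create onlinemodel ning_test_aliws_model_2 -offlinemodelName ning_test_aliws_model -instanceNum 1 -cpu 100 -memory 4096;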

Online word segmentation

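// Build a request that contains one row of text to segment.
// Assumes that predictClient, project_name, model_name, and col_name are already defined.
KVJsonRequest request = new KVJsonRequest();
Map<String, JsonFeatureValue> row = request.addRow();
row.put(col_name, new JsonFeatureValue("The big data algorithm platform is a new platform"));
// Send the request to the deployed online model and wait for the response.
KVJsonResponse res = predictClient.syncPredict(new JsonPredictRequest(project_name, model_name, request));
// Print the segmentation result of each returned item.
List<ResponseItem> ri = res.getOutputs();
for (ResponseItem item : ri) {
    System.out.println(item.getOutputLabel());
}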

Offline word segmentation

pai -name prediction
    -DmodelName=ning_test_aliws_model
    -DinputTableName=ning_test_aliws
    -DoutputTableName=ning_test_aliws_offline_predict;
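
The deployment and offline prediction examples both refer to the offline model by name, so the name used when the model is generated presumably has to be reused in each later step. The following Python sketch renders the three statements from one shared model name to make that dependency explicit. The helper name workflow_commands is hypothetical, the resource options are copied unchanged from the deployment example, and the sketch only builds strings.

```python
def workflow_commands(offline_model, online_model, input_table, output_table):
    """Render the generate, deploy, and offline-predict statements for one model."""
    # Step 1: generate the word segmentation model (all other parameters use defaults).
    generate = (
        "pai -name split_word_model\n"
        "    -project algo_public\n"
        f"    -DoutputModelName={offline_model}"
    )
    # Step 2: deploy the offline model as an online model.
    # -instanceNum, -cpu, and -memory values are copied from the deployment example.
    deploy = (
        f"create onlinemodel {online_model} "
        f"-offlinemodelName {offline_model} "
        "-instanceNum 1 -cpu 100 -memory 4096;"
    )
    # Step 3: run offline segmentation with the same offline model.
    predict = (
        "pai -name prediction\n"
        f"    -DmodelName={offline_model}\n"
        f"    -DinputTableName={input_table}\n"
        f"    -DoutputTableName={output_table};"
    )
    return {"generate": generate, "deploy": deploy, "predict": predict}


if __name__ == "__main__":
    commands = workflow_commands(
        offline_model="ning_test_aliws_model",
        online_model="ning_test_aliws_model_2",
        input_table="ning_test_aliws",
        output_table="ning_test_aliws_offline_predict",
    )
    for step, text in commands.items():
        print(f"-- {step}")
        print(text)
```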