Platform For AI: Word Splitting (Generate Models)

Last Updated: Nov 22, 2023

This topic describes the Word Splitting (Generate Models) component provided by Machine Learning Designer (formerly known as Machine Learning Studio).

The Word Splitting (Generate Models) component is based on Alibaba Word Segmenter (AliWS). The component is used to generate a word segmentation model based on parameters and custom dictionaries.

The component supports only two types of Chinese word segmentation: Taobao word segmentation (TAOBAO_CHN) and Internet word segmentation (INTERNET_CHN).

The Word Splitting (Generate Models) component differs from the Word Splitting component in the following ways:

  • The Word Splitting component splits text into words directly.

  • The Word Splitting (Generate Models) component only generates a word segmentation model. To split text, you must then deploy the model and make a prediction or call an API operation.

Configure the component

You can use one of the following methods to configure the Word Splitting (Generate Models) component.

Method 1: Configure the component on the pipeline page

You can configure the parameters of the Word Splitting (Generate Models) component on the pipeline page of Machine Learning Designer of Machine Learning Platform for AI (PAI). The following table describes the parameters.

| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Select Columns | The column that is used to generate a model. |
| Parameters Setting | Recognition Options | The types of content to recognize. Valid values: Recognize Simple Entities, Recognize Individual Names, Recognize Organization Names, Recognize Telephone Numbers, Recognize Times, Recognize Dates, and Recognize Alphanumeric Characters. By default, the following options are selected: Recognize Simple Entities, Recognize Telephone Numbers, Recognize Times, Recognize Dates, and Recognize Alphanumeric Characters. |
| Parameters Setting | Merge Options | The types of content to merge. Valid values: Merge Chinese Numbers, Merge Arabic Numerals, Merge Chinese Dates, and Merge Chinese Times. Default value: Merge Arabic Numerals. |
| Parameters Setting | Tokenizer | The type of the tokenizer. Valid values: TAOBAO_CHN and INTERNET_CHN. Default value: TAOBAO_CHN. |
| Parameters Setting | POS Tagger | Specifies whether to perform part-of-speech tagging. By default, part-of-speech tagging is not performed. |
| Parameters Setting | Semantic Tagger | Specifies whether to perform semantic role labeling. By default, semantic role labeling is not performed. |
| Parameters Setting | Filter Out Words That Contain Only Numbers | Specifies whether to filter out segmented words that consist only of numbers. By default, this option is cleared. |
| Parameters Setting | Filter Out Words That Contain Only English Letters | Specifies whether to filter out segmented words that consist only of English letters. By default, this option is cleared. |
| Parameters Setting | Filter Out Words That Contain Only Punctuations | Specifies whether to filter out segmented words that consist only of punctuation marks. By default, this option is cleared. |
| Tuning | Cores | The number of cores. By default, the system determines the value. |
| Tuning | Memory Size per Core | The memory size of each core. By default, the system determines the value. |
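
The Recognition Options and Merge Options check boxes appear to correspond to the boolean flags of the PAI command described in Method 2. The following Java sketch makes that correspondence explicit. The pairing of each check box with a flag is inferred from the parameter descriptions and is not stated by the source, and the `SegmenterOption` enum and its methods are hypothetical names introduced only for illustration.

```java
import java.util.EnumSet;
import java.util.Set;

/**
 * Hypothetical mapping from Designer check boxes to split_word_model flags.
 * The pairing is an assumption inferred from the parameter descriptions.
 */
enum SegmenterOption {
    RECOGNIZE_SIMPLE_ENTITIES("enableDfa"),
    RECOGNIZE_INDIVIDUAL_NAMES("enablePersonNameTagger"),
    RECOGNIZE_ORGANIZATION_NAMES("enableOrgnizationTagger"), // spelled as in the PAI command
    RECOGNIZE_TELEPHONE_NUMBERS("enableTelephoneRetrievalUnit"),
    RECOGNIZE_TIMES("enableTimeRetrievalUnit"),
    RECOGNIZE_DATES("enableDateRetrievalUnit"),
    RECOGNIZE_ALPHANUMERIC_CHARACTERS("enableNumberLetterRetrievalUnit"),
    MERGE_CHINESE_NUMBERS("enableChnNumMerge"),
    MERGE_ARABIC_NUMERALS("enableNumMerge"),
    MERGE_CHINESE_DATES("enableChnDateMerge"),
    MERGE_CHINESE_TIMES("enableChnTimeMerge");

    final String flag;

    SegmenterOption(String flag) {
        this.flag = flag;
    }

    /** Options that the parameter table lists as selected by default. */
    static Set<SegmenterOption> defaults() {
        return EnumSet.of(RECOGNIZE_SIMPLE_ENTITIES, RECOGNIZE_TELEPHONE_NUMBERS,
                RECOGNIZE_TIMES, RECOGNIZE_DATES, RECOGNIZE_ALPHANUMERIC_CHARACTERS,
                MERGE_ARABIC_NUMERALS);
    }

    /** Renders one "-Dflag=true|false" line per option, in declaration order. */
    static String toFlags(Set<SegmenterOption> selected) {
        StringBuilder sb = new StringBuilder();
        for (SegmenterOption option : values()) {
            sb.append("    -D").append(option.flag).append('=')
              .append(selected.contains(option)).append('\n');
        }
        return sb.toString();
    }
}
```

Calling `SegmenterOption.toFlags(SegmenterOption.defaults())` yields the flag lines for the default selection, which can be compared against the defaults listed for the PAI command.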

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

pai -name split_word_model
    -project algo_public
    -DoutputModelName=aliws_model
    -DcolName=content
    -Dtokenizer=TAOBAO_CHN
    -DenableDfa=true
    -DenablePersonNameTagger=false
    -DenableOrgnizationTagger=false
    -DenablePosTagger=false
    -DenableTelephoneRetrievalUnit=true
    -DenableTimeRetrievalUnit=true
    -DenableDateRetrievalUnit=true
    -DenableNumberLetterRetrievalUnit=true
    -DenableChnNumMerge=false
    -DenableNumMerge=true
    -DenableChnTimeMerge=false
    -DenableChnDateMerge=false
    -DenableSemanticTagger=true

| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| userDictTableName | No | The name of a custom dictionary table, if one is used. A custom dictionary has only one column, and each row contains only one word. | No default value |
| outputModelName | Yes | The name of the output model. | No default value |
| colName | No | The column name of the prediction text. | context |
| dictTableName | No | The name of a custom dictionary table, if one is used. A custom dictionary has only one column, and each row contains only one word. | No default value |
| tokenizer | No | The type of the tokenizer. Valid values: TAOBAO_CHN and INTERNET_CHN. | TAOBAO_CHN |
| enableDfa | No | Specifies whether to recognize simple entities. Valid values: True and False. | True |
| enablePersonNameTagger | No | Specifies whether to recognize individual names. Valid values: True and False. | False |
| enableOrgnizationTagger | No | Specifies whether to recognize organization names. Valid values: True and False. | False |
| enablePosTagger | No | Specifies whether to perform part-of-speech tagging. Valid values: True and False. | False |
| enableTelephoneRetrievalUnit | No | Specifies whether to recognize telephone numbers. Valid values: True and False. | True |
| enableTimeRetrievalUnit | No | Specifies whether to recognize time expressions. Valid values: True and False. | True |
| enableDateRetrievalUnit | No | Specifies whether to recognize date expressions. Valid values: True and False. | True |
| enableNumberLetterRetrievalUnit | No | Specifies whether to recognize digits and letters. Valid values: True and False. | True |
| enableChnNumMerge | No | Specifies whether to merge Chinese numbers into a retrieval unit. Valid values: True and False. | False |
| enableNumMerge | No | Specifies whether to merge Arabic numerals into a retrieval unit. Valid values: True and False. | True |
| enableChnTimeMerge | No | Specifies whether to merge Chinese time expressions into a semantic unit. Valid values: True and False. | False |
| enableChnDateMerge | No | Specifies whether to merge Chinese date expressions into a semantic unit. Valid values: True and False. | False |
| enableSemanticTagger | No | Specifies whether to perform semantic role labeling. Valid values: True and False. | False |
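
The constraints in the parameter table (outputModelName is required, tokenizer accepts two values, and the enable* parameters are booleans) can be checked before a command is submitted. The following Java sketch assembles the command text from a map of parameters and rejects values that the table does not allow. The `SplitWordModelCommand` class and its `build` method are hypothetical helpers, not part of any PAI SDK, and the sketch only produces a string; it does not submit anything.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper that validates parameters and renders a split_word_model command. */
final class SplitWordModelCommand {
    private static final Set<String> TOKENIZERS = Set.of("TAOBAO_CHN", "INTERNET_CHN");

    /** Keys are parameter names from the table, without the -D prefix. */
    static String build(Map<String, String> params) {
        String model = params.get("outputModelName");
        if (model == null || model.isBlank()) {
            throw new IllegalArgumentException("outputModelName is required");
        }
        // The table lists TAOBAO_CHN as the default when tokenizer is omitted.
        String tokenizer = params.getOrDefault("tokenizer", "TAOBAO_CHN");
        if (!TOKENIZERS.contains(tokenizer)) {
            throw new IllegalArgumentException("Unsupported tokenizer: " + tokenizer);
        }
        StringBuilder sb = new StringBuilder("pai -name split_word_model\n    -project algo_public");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            String key = entry.getKey();
            String value = entry.getValue();
            if (value == null) {
                throw new IllegalArgumentException(key + " has no value");
            }
            // Every enable* parameter in the table is a boolean switch.
            if (key.startsWith("enable")
                    && !value.equalsIgnoreCase("true") && !value.equalsIgnoreCase("false")) {
                throw new IllegalArgumentException(key + " must be true or false");
            }
            sb.append("\n    -D").append(key).append('=').append(value);
        }
        // Only explicitly supplied parameters are emitted; the rest keep their defaults.
        return sb.append(';').toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("outputModelName", "aliws_model");
        params.put("colName", "content");
        params.put("tokenizer", "INTERNET_CHN");
        params.put("enablePosTagger", "true");
        System.out.println(build(params));
    }
}
```

A `LinkedHashMap` is used in the usage example so that the flags appear in the order in which they were added, which keeps the generated command easy to compare with a handwritten one.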

Examples

  • PAI command

    pai -name split_word_model
        -project algo_public
        -DoutputModelName=aliws_model

  • Model deployment

    create onlinemodel ning_test_aliws_model_2 -offlinemodelName ning_test_aliws_model -instanceNum 1 -cpu 100 -memory 4096;

  • Online word segmentation

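    // Assumes that predictClient, project_name, model_name, and col_name are already initialized.
    KVJsonRequest request = new KVJsonRequest();
    // Add one input row whose text column holds the sentence to split.
    Map<String, JsonFeatureValue> row = request.addRow();
    row.put(col_name, new JsonFeatureValue("The big data algorithm platform is new"));
    // Send the request to the deployed model and wait for the response.
    KVJsonResponse res = predictClient.syncPredict(new JsonPredictRequest(project_name, model_name, request));
    // Print the output label of each returned item.
    List<ResponseItem> ri = res.getOutputs();
    for (ResponseItem item : ri) {
        System.out.println(item.getOutputLabel());
    }

  • Offline word segmentation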

    pai -name prediction
        -DmodelName=ning_test_aliws_model
        -DinputTableName=ning_test_aliws
        -DoutputTableName=ning_test_aliws_offline_predict;
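
The examples above form a sequence: generate a model, then either deploy it for online segmentation or use it in an offline prediction. The following Java sketch renders those commands from shared names so that the model name stays consistent across steps. It assumes that the offline model referenced by the deployment and prediction commands is the one named by outputModelName, which the source does not state explicitly. The `SegmentationWorkflow` class is a hypothetical helper, and the resource values are copied from the deployment example.

```java
/** Hypothetical helper that renders the example commands from shared names. */
final class SegmentationWorkflow {

    /** Step 1: generate the word segmentation model. */
    static String generate(String modelName) {
        return "pai -name split_word_model -project algo_public -DoutputModelName="
                + modelName + ";";
    }

    /** Step 2a: deploy the generated model as an online model. */
    static String deploy(String onlineModelName, String modelName) {
        // Instance count, CPU, and memory values are taken from the example as-is.
        return "create onlinemodel " + onlineModelName
                + " -offlinemodelName " + modelName
                + " -instanceNum 1 -cpu 100 -memory 4096;";
    }

    /** Step 2b: split the text of an input table offline and write the result to a table. */
    static String predictOffline(String modelName, String inputTable, String outputTable) {
        return "pai -name prediction -DmodelName=" + modelName
                + " -DinputTableName=" + inputTable
                + " -DoutputTableName=" + outputTable + ";";
    }

    public static void main(String[] args) {
        String model = "ning_test_aliws_model";
        System.out.println(generate(model));
        System.out.println(deploy("ning_test_aliws_model_2", model));
        System.out.println(predictOffline(model, "ning_test_aliws", "ning_test_aliws_offline_predict"));
    }
}
```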