This topic describes the Word Splitting (Generate Models) component provided by Machine Learning Studio.

The Word Splitting (Generate Models) component is based on Alibaba Word Segmenter (AliWS). The component is used to generate a word segmentation model based on parameters and custom dictionaries.

The component supports only two types of Chinese word segmentation: Taobao word segmentation (TAOBAO_CHN) and Internet word segmentation (INTERNET_CHN).

The Word Splitting (Generate Models) component differs from the Word Splitting component in the following ways:
  • The Word Splitting component splits text into words directly.
  • The Word Splitting (Generate Models) component generates a word segmentation model. To split text, you must deploy the model and then make a prediction or call an API operation.

Configure the component

  • Machine Learning Platform for AI console
    Tab | Parameter | Description
    Fields Setting | Columns | The column that is used to generate a model.
    Parameters Setting | Recognition Options | The types of content to recognize. Valid values:
    • Recognize Simple Entities
    • Recognize Individual Names
    • Recognize Organization Names
    • Recognize Telephone Numbers
    • Recognize Times
    • Recognize Dates
    • Recognize Alphanumeric Characters

    By default, the following options are selected: Recognize Simple Entities, Recognize Telephone Numbers, Recognize Times, Recognize Dates, and Recognize Alphanumeric Characters.

    Parameters Setting | Merge Options | The types of content to merge. Valid values:
    • Merge Chinese Numbers
    • Merge Arabic Numerals
    • Merge Chinese Dates
    • Merge Chinese Times

    Default value: Merge Arabic Numerals.

    Parameters Setting | Tokenizer | The type of the tokenizer. Valid values: TAOBAO_CHN and INTERNET_CHN. Default value: TAOBAO_CHN.
    Parameters Setting | Pos Tagger | Specifies whether to perform part-of-speech tagging. By default, part-of-speech tagging is not performed.
    Parameters Setting | Semantic Tagger | Specifies whether to perform semantic role labeling. By default, semantic role labeling is not performed.
    Parameters Setting | Filter Out Words That Contain Only Numbers | Specifies whether to filter out segmented words that consist only of numbers. By default, this option is cleared.
    Parameters Setting | Filter Out Words That Contain Only English Letters | Specifies whether to filter out segmented words that consist only of English letters. By default, this option is cleared.
    Parameters Setting | Filter Out Words That Contain Only English Punctuations | Specifies whether to filter out segmented words that consist only of punctuation marks. By default, this option is cleared.
    Tuning | Cores | The number of cores used for calculation. The value is automatically allocated.
    Tuning | Memory Size per Core | The size of memory required by each core. The value is automatically allocated.
  • PAI command
    pai -name split_word_model
        -project algo_public
        -DoutputModelName=aliws_model
        -DcolName=content
        -Dtokenizer=TAOBAO_CHN
        -DenableDfa=true
        -DenablePersonNameTagger=false
        -DenableOrgnizationTagger=false
        -DenablePosTagger=false
        -DenableTelephoneRetrievalUnit=true
        -DenableTimeRetrievalUnit=true
        -DenableDateRetrievalUnit=true
        -DenableNumberLetterRetrievalUnit=true
        -DenableChnNumMerge=false
        -DenableNumMerge=true
        -DenableChnTimeMerge=false
        -DenableChnDateMerge=false
        -DenableSemanticTagger=true
    Parameter | Required | Description | Default value
    userDictTableName | No | The name of the custom dictionary table, if one is used. A custom dictionary has only one column, and each row contains only one word. | No default value
    outputModelName | Yes | The name of the output model. | No default value
    colName | No | The name of the column that contains the text to segment. | context
    dictTableName | No | The name of the custom dictionary table, if one is used. A custom dictionary has only one column, and each row contains only one word. | No default value
    tokenizer | No | The type of the tokenizer. Valid values: TAOBAO_CHN and INTERNET_CHN. | TAOBAO_CHN
    enableDfa | No | Specifies whether to recognize simple entities. Valid values: True and False. | True
    enablePersonNameTagger | No | Specifies whether to recognize individual names. Valid values: True and False. | False
    enableOrgnizationTagger | No | Specifies whether to recognize organization names. Valid values: True and False. | False
    enablePosTagger | No | Specifies whether to perform part-of-speech tagging. Valid values: True and False. | False
    enableTelephoneRetrievalUnit | No | Specifies whether to recognize telephone numbers. Valid values: True and False. | True
    enableTimeRetrievalUnit | No | Specifies whether to recognize time expressions. Valid values: True and False. | True
    enableDateRetrievalUnit | No | Specifies whether to recognize date expressions. Valid values: True and False. | True
    enableNumberLetterRetrievalUnit | No | Specifies whether to recognize alphanumeric characters. Valid values: True and False. | True
    enableChnNumMerge | No | Specifies whether to merge Chinese numbers into a retrieval unit. Valid values: True and False. | False
    enableNumMerge | No | Specifies whether to merge Arabic numerals into a retrieval unit. Valid values: True and False. | True
    enableChnTimeMerge | No | Specifies whether to merge Chinese time expressions into a semantic unit. Valid values: True and False. | False
    enableChnDateMerge | No | Specifies whether to merge Chinese date expressions into a semantic unit. Valid values: True and False. | False
    enableSemanticTagger | No | Specifies whether to perform semantic role labeling. Valid values: True and False. | False
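
The parameter rules above (one required parameter, two valid tokenizer values, Boolean switches) can be checked before a command is submitted. The following is a minimal sketch, not part of the product: the class name SplitWordModelCommand and its build method are hypothetical helpers, and joining the command onto a single line that ends with a semicolon is a presentation choice. Only the parameter names and valid values are taken from the table.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: assembles a split_word_model command string and
// rejects values that the parameter table does not allow.
public class SplitWordModelCommand {
    // Boolean switches, spelled exactly as documented (including enableOrgnizationTagger).
    private static final Set<String> BOOLEAN_FLAGS = Set.of(
        "enableDfa", "enablePersonNameTagger", "enableOrgnizationTagger",
        "enablePosTagger", "enableTelephoneRetrievalUnit", "enableTimeRetrievalUnit",
        "enableDateRetrievalUnit", "enableNumberLetterRetrievalUnit",
        "enableChnNumMerge", "enableNumMerge", "enableChnTimeMerge",
        "enableChnDateMerge", "enableSemanticTagger");
    // Free-text optional parameters.
    private static final Set<String> TEXT_PARAMS =
        Set.of("colName", "userDictTableName", "dictTableName");
    private static final Set<String> TOKENIZERS = Set.of("TAOBAO_CHN", "INTERNET_CHN");

    public static String build(String outputModelName, Map<String, String> options) {
        // outputModelName is the only required parameter.
        if (outputModelName == null || outputModelName.isEmpty()) {
            throw new IllegalArgumentException("outputModelName is required");
        }
        StringBuilder cmd = new StringBuilder("pai -name split_word_model -project algo_public");
        cmd.append(" -DoutputModelName=").append(outputModelName);
        for (Map.Entry<String, String> e : options.entrySet()) {
            String key = e.getKey();
            String value = e.getValue();
            if (key.equals("tokenizer")) {
                if (!TOKENIZERS.contains(value)) {
                    throw new IllegalArgumentException("invalid tokenizer: " + value);
                }
            } else if (BOOLEAN_FLAGS.contains(key)) {
                if (!value.equalsIgnoreCase("true") && !value.equalsIgnoreCase("false")) {
                    throw new IllegalArgumentException(key + " must be true or false");
                }
            } else if (!TEXT_PARAMS.contains(key)) {
                throw new IllegalArgumentException("unknown parameter: " + key);
            }
            cmd.append(" -D").append(key).append('=').append(value);
        }
        return cmd.append(';').toString();
    }

    public static void main(String[] args) {
        // Parameters that are left out keep the defaults listed in the table.
        Map<String, String> options = new LinkedHashMap<>();
        options.put("colName", "content");
        options.put("tokenizer", "INTERNET_CHN");
        options.put("enablePosTagger", "true");
        System.out.println(build("aliws_model", options));
    }
}
```

The sketch emits only the parameters that are passed in, so the defaults from the table apply to everything else.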

Examples

  • PAI command
    pai -name split_word_model
        -project algo_public
        -DoutputModelName=aliws_model
  • Model deployment
    create onlinemodel ning_test_aliws_model_2 -offlinemodelName ning_test_aliws_model -instanceNum 1 -cpu 100 -memory 4096;
  • Online word segmentation
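    // Assumes that predictClient, project_name, model_name, and col_name are defined elsewhere.
    // Build a request that contains one row of text to segment.
    KVJsonRequest request = new KVJsonRequest();
    Map<String, JsonFeatureValue> row = request.addRow();
    row.put(col_name, new JsonFeatureValue("The big data algorithm platform is new"));
    // Send the request to the deployed model and wait for the response.
    KVJsonResponse res = predictClient.syncPredict(new JsonPredictRequest(project_name, model_name, request));
    // Print the word segmentation result of each row.
    List<ResponseItem> ri = res.getOutputs();
    for (ResponseItem item : ri) {
        System.out.println(item.getOutputLabel());
    }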
  • Offline word segmentation
    pai -name prediction
        -DmodelName=ning_test_aliws_model
        -DinputTableName=ning_test_aliws
        -DoutputTableName=ning_test_aliws_offline_predict;
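
The deployment and offline prediction commands both refer to the model by the name given in outputModelName when the model was generated. The following sketch makes that dependency explicit by formatting the three commands from one shared model name. The class SegmentationWorkflow and its commands method are hypothetical, and the resource values (-instanceNum 1 -cpu 100 -memory 4096) are simply copied from the deployment example above, not recommendations.

```java
import java.util.List;

// Hypothetical helper: formats the generate, deploy, and offline-predict
// commands so that all three refer to the same offline model name.
public class SegmentationWorkflow {
    public static List<String> commands(String offlineModel, String onlineModel,
                                        String inputTable, String outputTable) {
        // Step 1: generate the word segmentation model.
        String generate = "pai -name split_word_model -project algo_public"
            + " -DoutputModelName=" + offlineModel + ";";
        // Step 2a: deploy the generated model for online word segmentation.
        String deploy = "create onlinemodel " + onlineModel
            + " -offlinemodelName " + offlineModel
            + " -instanceNum 1 -cpu 100 -memory 4096;";
        // Step 2b: alternatively, run offline word segmentation on a table.
        String predict = "pai -name prediction"
            + " -DmodelName=" + offlineModel
            + " -DinputTableName=" + inputTable
            + " -DoutputTableName=" + outputTable + ";";
        return List.of(generate, deploy, predict);
    }

    public static void main(String[] args) {
        for (String command : commands("ning_test_aliws_model", "ning_test_aliws_model_2",
                                       "ning_test_aliws", "ning_test_aliws_offline_predict")) {
            System.out.println(command);
        }
    }
}
```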