This topic describes the Word Splitting component provided by Machine Learning Studio.

This component splits words in the specified columns based on Alibaba Word Segmenter (AliWS). The words after splitting are separated by spaces. If you enable the POS Tagger or Semantic Tagger parameter, the output contains the words after splitting, the part-of-speech (POS) tagging results, and the semantic tagging results. The POS tagging results are separated by forward slashes (/), and the semantic tagging results are separated by vertical bars (|).
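
The separators described above can be illustrated with a small parser. This is a minimal sketch, not the component's own code: it assumes each token has the layout word[/POS][|semantic], which this topic does not spell out, and the names Token, parse_token, and parse_line are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    word: str
    pos: Optional[str]       # POS tag, or None if the token carries none
    semantic: Optional[str]  # semantic tag, or None if the token carries none


def parse_token(raw: str) -> Token:
    """Parse one token, assuming the layout word[/POS][|semantic]."""
    # Everything after the first vertical bar is treated as the semantic tag.
    head, bar, semantic = raw.partition("|")
    # The last forward slash separates the word from its POS tag.
    word, slash, pos = head.rpartition("/")
    if not slash or not word:
        # No POS tag attached, or the word itself is a bare "/".
        word, pos = head, None
    return Token(word, pos, semantic if bar else None)


def parse_line(line: str) -> List[Token]:
    """Split a space-separated output line into tokens."""
    return [parse_token(raw) for raw in line.split()]
```

For example, parse_line("weather/n|topic fine") yields one token with a POS tag and a semantic tag and one token with neither. The tag values n and topic are made up for illustration and are not taken from AliWS.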

The tokenizer can be TAOBAO_CHN or INTERNET_CHN.

You can configure the component by using the Machine Learning Platform for AI console or a PAI command.

Configure the component

  • Machine Learning Platform for AI console
    Tab Parameter Description
    Fields Settings Columns The columns used for word splitting.
    Parameters Setting Recognition Options The content types to be recognized. Valid values:
    • Recognize Simple Entities
    • Recognize Individual Names
    • Recognize Organization Names
    • Recognize Telephone Numbers
    • Recognize Times
    • Recognize Dates
    • Recognize Alphanumeric Characters

    By default, Recognize Simple Entities, Recognize Telephone Numbers, Recognize Times, Recognize Dates, and Recognize Alphanumeric Characters are selected.

    Merge Options The types of content to be merged. Valid values:
    • Merge Chinese Numbers
    • Merge Arabic Numbers
    • Merge Chinese Dates
    • Merge Chinese Times

    Default value: Merge Arabic Numbers.

    Tokenizer The tokenizer. Valid values: TAOBAO_CHN and INTERNET_CHN. Default value: TAOBAO_CHN.
    POS Tagger Specifies whether to enable POS tagging. By default, POS tagging is enabled.
    Semantic Tagger Specifies whether to enable semantic tagging. By default, semantic tagging is disabled.
    Filter Out Words That Contain Only Numbers Specifies whether to filter out words that contain only numbers. By default, such words are not filtered out.
    Filter Out Words That Contain Only English Letters Specifies whether to filter out words that contain only letters. By default, such words are not filtered out.
    Filter Out Words That Contain Only Punctuations Specifies whether to filter out words that contain only punctuation marks. By default, such words are not filtered out.
    Tuning Cores Automatically allocated.
    Memory Size per Core Automatically allocated.
  • PAI command
    pai -name split_word_model
        -project algo_public
        -DoutputModelName=aliws_model
        -DcolName=content
        -Dtokenizer=TAOBAO_CHN
        -DenableDfa=true
        -DenablePersonNameTagger=false
        -DenableOrgnizationTagger=false
        -DenablePosTagger=false
        -DenableTelephoneRetrievalUnit=true
        -DenableTimeRetrievalUnit=true
        -DenableDateRetrievalUnit=true
        -DenableNumberLetterRetrievalUnit=true
        -DenableChnNumMerge=false
        -DenableNumMerge=true
        -DenableChnTimeMerge=false
        -DenableChnDateMerge=false
        -DenableSemanticTagger=true
    Parameter Required Description Default value
    inputTableName Yes The name of the input table. No default value
    inputTablePartitions No The partitions selected from the input table for word splitting. Specify each partition in the partition_name=value format. To specify multi-level partitions, use the name1=value1/name2=value2 format. If multiple partitions are specified, separate them with commas (,). Full table
    selectedColNames Yes The names of the columns selected from the input table for word splitting. If multiple columns are specified, separate them with commas (,). No default value
    dictTableName No The name of the custom dictionary table. Specify this parameter to use a custom dictionary. The table must contain only one column, and each row must contain only one word. No default value
    tokenizer No The tokenizer. Valid values: TAOBAO_CHN and INTERNET_CHN. TAOBAO_CHN
    enableDfa No Specifies whether to recognize simple entities. Valid values: True and False. True
    enablePersonNameTagger No Specifies whether to recognize person names. Valid values: True and False. False
    enableOrgnizationTagger No Specifies whether to recognize organizations. Valid values: True and False. False
    enablePosTagger No Specifies whether to enable POS tagging. Valid values: True and False. False
    enableTelephoneRetrievalUnit No Specifies whether to recognize phone numbers. Valid values: True and False. True
    enableTimeRetrievalUnit No Specifies whether to recognize time. Valid values: True and False. True
    enableDateRetrievalUnit No Specifies whether to recognize dates. Valid values: True and False. True
    enableNumberLetterRetrievalUnit No Specifies whether to recognize digits and letters. Valid values: True and False. True
    enableChnNumMerge No Specifies whether to merge Chinese characters for numbers into retrieval units. Valid values: True and False. False
    enableNumMerge No Specifies whether to merge Arabic numerals into retrieval units. Valid values: True and False. True
    enableChnTimeMerge No Specifies whether to merge Chinese characters for time into semantic units. Valid values: True and False. False
    enableChnDateMerge No Specifies whether to merge Chinese characters for dates into semantic units. Valid values: True and False. False
    enableSemanticTagger No Specifies whether to enable semantic tagging. Valid values: True and False. False
    outputTableName Yes The name of the output table. No default value
    outputTablePartition No The names of the partitions in the output table. No default value
    coreNum No The number of cores. This parameter takes effect only when memSizePerCore is configured. The value must be a positive integer in the range of [1,9999]. Automatically allocated
    memSizePerCore No The memory size for each core. Unit: MB. The value must be a positive integer in the range of [1024,64 × 1024]. Automatically allocated
    lifecycle No The lifecycle of the output table. The value must be a positive integer. No default value

    If you use a regular table, we recommend that you do not set coreNum or memSizePerCore. The Word Splitting component automatically configures the parameters by default.

    If your resources are limited, you can use the following code to calculate the number of cores and the memory size for each core.
    def CalcCoreNumAndMem(row, col, kOneCoreDataSize=1024):
        """Calculates the number of cores and the memory size for each core.
           Args:
               row: the number of rows in the input table.
               col: the number of columns in the input table.
               kOneCoreDataSize: the amount of data that can be computed by each core. Unit: MB. The value must be a positive integer. Default value: 1024.
           Returns:
               coreNum, memSizePerCore
           Example:
               coreNum, memSizePerCore = CalcCoreNumAndMem(1000, 100, kOneCoreDataSize=2048)
        """
        kMBytes = 1024.0 * 1024.0
        # Number of cores = estimated data size in MB / data size per core, at least 1.
        # The estimate treats each cell as about 1000 bytes.
        coreNum = max(1, int(row * col * 1000 / kMBytes / kOneCoreDataSize))
        # Memory size per core (MB) = twice the data amount per core, at least 1024.
        memSizePerCore = max(1024, int(kOneCoreDataSize * 2))
        return coreNum, memSizePerCore
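
The calculation above enforces only lower bounds, whereas the parameter table also gives upper limits: [1,9999] for coreNum and [1024,64 × 1024] MB for memSizePerCore. The following sketch clamps a pair of values into those ranges before they are passed to the command. The helper name clamp_resources is hypothetical and is not part of the component.

```python
# Ranges taken from the coreNum and memSizePerCore parameter descriptions.
CORE_NUM_RANGE = (1, 9999)
MEM_SIZE_RANGE_MB = (1024, 64 * 1024)


def clamp_resources(core_num, mem_size_per_core):
    """Clamp coreNum and memSizePerCore (MB) into the documented ranges."""
    core_num = min(max(int(core_num), CORE_NUM_RANGE[0]), CORE_NUM_RANGE[1])
    mem_size_per_core = min(
        max(int(mem_size_per_core), MEM_SIZE_RANGE_MB[0]), MEM_SIZE_RANGE_MB[1]
    )
    return core_num, mem_size_per_core
```

Values that are already valid pass through unchanged, so the helper can be applied to any computed pair.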

Example

  • Generated data
    create table pai_aliws_test
    as select
        1 as id,
        'Today is a good day. The weather is fine and sunny.' as content
    from  dual;
  • PAI command
    pai -name split_word
        -project algo_public
        -DinputTableName=pai_aliws_test
        -DselectedColNames=content
        -DoutputTableName=doc_test_split_word
  • Input description
    The input table consists of two columns: id and content.
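    +----+-----------------------------------------------------+
    | id | content                                             |
    +----+-----------------------------------------------------+
    | 1  | Today is a good day. The weather is fine and sunny. |
    +----+-----------------------------------------------------+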
  • Output description
    • The words in the selected columns of the input table are split and then returned. The remaining columns are returned without changes.
    • When a custom dictionary is used, the system splits words based on the custom dictionary and context.
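
When many of the optional parameters are set, it can be convenient to assemble the PAI command from a dictionary. The following is a minimal sketch under stated assumptions: build_pai_command is a hypothetical helper that only formats text in the layout used in this topic, and it does not submit anything. The only check it performs is on the tokenizer value, using the two values listed in this topic.

```python
VALID_TOKENIZERS = {"TAOBAO_CHN", "INTERNET_CHN"}


def build_pai_command(name, project, params):
    """Format a PAI command string from a dict of -D parameters."""
    tokenizer = params.get("tokenizer")
    if tokenizer is not None and tokenizer not in VALID_TOKENIZERS:
        raise ValueError(f"unsupported tokenizer: {tokenizer}")
    lines = [f"pai -name {name}", f"    -project {project}"]
    for key, value in params.items():
        if isinstance(value, bool):
            # Write Booleans in lowercase, as in the commands in this topic.
            value = "true" if value else "false"
        lines.append(f"    -D{key}={value}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_pai_command(
        "split_word",
        "algo_public",
        {
            "inputTableName": "pai_aliws_test",
            "selectedColNames": "content",
            "outputTableName": "doc_test_split_word",
        },
    ))
```

Run as a script, this prints the same command as the PAI command in the example above.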