Word frequency calculation counts the total number of words in a set of strings and the number of times that each word appears in those strings. The strings can be entered manually or read from a specified file. Here, the total number of words means the number of distinct words. This topic describes the Word Frequency Statistics component that is provided by Machine Learning Studio.

Word frequency refers to the number of times that a word appears in a corpus. Based on the word segmentation results, this component first outputs the words of each document in their original order. It then counts the number of times that each word appears in the content column (docContent) of each document identified by the document ID column (docId).
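
The two steps above can be illustrated with a minimal Python sketch. This is not the component's implementation: it assumes that the document content is already segmented into space-separated words, and the function name word_stat and the tuple layouts are hypothetical, chosen only to mirror the two output tables.

```python
from collections import Counter


def word_stat(rows):
    """Illustrative only; assumes content is pre-segmented and space-separated.

    rows: iterable of (doc_id, segmented_content) pairs.
    Returns (multi, triple):
      multi  - (doc_id, word) pairs in their original order
      triple - (doc_id, word, count) rows, one per distinct word per document
    """
    multi = []
    counts = {}  # doc_id -> Counter of words in that document
    for doc_id, content in rows:
        for word in content.split():
            multi.append((doc_id, word))
            counts.setdefault(doc_id, Counter())[word] += 1
    triple = [
        (doc_id, word, n)
        for doc_id, counter in counts.items()
        for word, n in counter.items()
    ]
    return multi, triple


# Hypothetical sample input: two documents that are already segmented.
rows = [("doc1", "a b a"), ("doc2", "b c")]
multi, triple = word_stat(rows)
```

In this sketch, multi corresponds to the table named by outputTableNameMulti and triple to the table named by outputTableNameTriple.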

Configure the component

You can configure the component by using one of the following methods:
  • Machine Learning Platform for AI console
    Tab             Parameter                Description
    Fields Setting  Document ID Column       The column that contains the IDs of the specified documents.
    Fields Setting  Document Content Column  The column that contains the content of the specified documents.
    Tuning          Cores                    The number of cores used for calculation.
    Tuning          Memory Size per Core     The memory size of each core. Unit: MB.
  • Machine Learning Platform for AI command
    pai -name doc_word_stat
        -project algo_public
        -DinputTableName=tdl_doc_test_split_word
        -DdocId=docid
        -DdocContent=content
        -DoutputTableNameMulti=doc_test_stat_multi
        -DoutputTableNameTriple=doc_test_stat_triple
        -Dlifecycle=7
    Parameter             Required  Default value   Description
    inputTableName        Yes       N/A             The name of the input table.
    docId                 Yes       N/A             The name of the document ID column. Only one column can be specified.
    docContent            Yes       N/A             The name of the document content column. Only one column can be specified.
    outputTableNameMulti  Yes       N/A             The name of the output table that lists the words in their original order after word segmentation.
    outputTableNameTriple No        N/A             The name of the output table that lists the number of times that each word appears in the documents.
    inputTablePartitions  No        All partitions  The partitions that are selected from the input table for training. The following formats are supported:
                                                      • Partition_name=value
                                                      • name1=value1/name2=value2: multi-level partitions
                                                    Note: Separate multiple partitions with commas (,).
    lifecycle             No        -1              The lifecycle of the output table. The value must be a positive integer.
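
As a rough sketch of how the parameters above fit together, the following Python helper assembles a command string and checks that the required parameters are present. The helper name build_pai_command and the partition names ds and hh are hypothetical; whether partition values need quoting in a given client is not covered here and should be verified before use.

```python
# Parameters marked as required in the table above.
REQUIRED = ("inputTableName", "docId", "docContent", "outputTableNameMulti")


def build_pai_command(params, project="algo_public"):
    """Hypothetical helper: join parameters into a doc_word_stat command string."""
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    parts = ["pai -name doc_word_stat", f"-project {project}"]
    parts += [f"-D{name}={value}" for name, value in params.items()]
    return " ".join(parts)


cmd = build_pai_command({
    "inputTableName": "tdl_doc_test_split_word",
    "docId": "docid",
    "docContent": "content",
    "outputTableNameMulti": "doc_test_stat_multi",
    # Two multi-level partitions separated by a comma (partition names are assumed).
    "inputTablePartitions": "ds=20240101/hh=00,ds=20240102/hh=00",
    "lifecycle": 7,
})
print(cmd)
```

Here outputTableNameTriple is omitted because the table lists it as optional, and lifecycle=7 matches the sample command.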

FAQ

  • The outputTableNameMulti parameter specifies the output table that lists words in their original order in the documents after word segmentation. Word segmentation is performed based on docId and docContent.
  • The outputTableNameTriple parameter specifies the output table that lists the number of times that each word appears in the documents after word segmentation. Word segmentation is performed based on docId and docContent.