Word Frequency Statistics

Word frequency calculation uses a program to count the total number of words in a set of strings and the number of times that each word appears in those strings. The strings can be entered manually or read from a specified file. Here, the total number of words refers to the number of distinct words. This topic describes the Word Frequency Statistics component provided by Machine Learning Designer (formerly known as Machine Learning Studio).

Word frequency is the number of times that a word appears in a corpus. The component takes word segmentation results as input and outputs the words in their original order. It then counts the number of times that each word appears in the content (docContent) of each document identified by the document ID column (docId).
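The computation described above can be sketched in a few lines of Python. This is an illustration only, not the component's implementation: it assumes that the segmented content separates words with whitespace, and the function name word_frequency and the tuple layouts (doc_id, word) and (doc_id, word, count) are hypothetical.

```python
from collections import Counter


def word_frequency(docs):
    """Build the two result sets for (doc_id, segmented_content) pairs.

    Assumption: words in segmented_content are separated by whitespace.
    """
    multi = []   # (doc_id, word) rows, words kept in their original order
    triple = []  # (doc_id, word, count) rows, one row per distinct word per document
    for doc_id, content in docs:
        words = content.split()
        multi.extend((doc_id, word) for word in words)
        counts = Counter(words)  # counts occurrences within this document only
        triple.extend((doc_id, word, count) for word, count in counts.items())
    return multi, triple


docs = [("d1", "the cat saw the dog"), ("d2", "dog dog cat")]
multi, triple = word_frequency(docs)
```

In this sketch, multi holds one row per word occurrence and triple holds one row per distinct word in each document, so "the" produces two rows in multi for d1 but a single row with a count of 2 in triple.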

Configure the component

You can use one of the following methods to configure the Word Frequency Statistics component.

Method 1: Configure the component on the pipeline page

You can configure the parameters of the Word Frequency Statistics component on the pipeline page of Machine Learning Designer in Machine Learning Platform for AI (PAI). The following table describes the parameters.


Tab	Parameter	Description
Fields Setting	Document ID Column	The column that contains the IDs of the specified documents.
Fields Setting	Document Content Column	The column that contains the content of the specified documents.
Tuning	Cores	The number of cores used for calculation.
Tuning	Memory Size per Core	The memory size of each core. Unit: MB.

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

pai -name doc_word_stat
    -project algo_public
    -DinputTableName=tdl_doc_test_split_word
    -DdocId=docid
    -DdocContent=content
    -DoutputTableNameMulti=doc_test_stat_multi
    -DoutputTableNameTriple=doc_test_stat_triple
    -Dlifecycle=7


Parameter	Required	Description	Default value
inputTableName	Yes	The name of the input table.	No default value
docId	Yes	The name of the document ID column. You can specify only one column.	No default value
docContent	Yes	The name of the document content column. You can specify only one column.	No default value
outputTableNameMulti	Yes	The name of the output table that lists the words in their original order after word segmentation.	No default value
outputTableNameTriple	No	The name of the output table that lists the number of times that each word appears in the documents.	No default value
inputTablePartitions	No	The partitions selected from the input table for the calculation. Supported formats: partition_name=value for a single-level partition, and name1=value1/name2=value2 for multi-level partitions. If you specify multiple partitions, separate them with commas (,).	All partitions
lifecycle	No	The lifecycle of the output table. The value must be a positive integer.	-1
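
If you generate the PAI command from a script, a small helper along the following lines may be useful. It is a minimal sketch: the helper names format_partitions and build_doc_word_stat_command are hypothetical, the partition names ds and hh in the usage note are made up, and only the parameter names and formats listed in the preceding table are used.

```python
def format_partitions(partitions):
    """Render partitions as name1=value1/name2=value2, comma-separated.

    partitions: a list of dicts, one dict per partition, mapping each
    partition level name to its value (hypothetical input shape).
    """
    return ",".join(
        "/".join(f"{name}={value}" for name, value in levels.items())
        for levels in partitions
    )


def build_doc_word_stat_command(input_table, doc_id, doc_content, output_multi,
                                output_triple=None, partitions=None, lifecycle=None):
    """Assemble the doc_word_stat command text; optional parameters are skipped when unset."""
    # The parameter table states that lifecycle must be a positive integer.
    if lifecycle is not None and (not isinstance(lifecycle, int) or lifecycle <= 0):
        raise ValueError("lifecycle must be a positive integer")
    params = {
        "inputTableName": input_table,
        "docId": doc_id,
        "docContent": doc_content,
        "outputTableNameMulti": output_multi,
        "outputTableNameTriple": output_triple,
        "inputTablePartitions": format_partitions(partitions) if partitions else None,
        "lifecycle": lifecycle,
    }
    lines = ["pai -name doc_word_stat", "    -project algo_public"]
    lines += [f"    -D{key}={value}" for key, value in params.items() if value is not None]
    return "\n".join(lines)


command = build_doc_word_stat_command(
    "tdl_doc_test_split_word", "docid", "content", "doc_test_stat_multi",
    output_triple="doc_test_stat_triple", lifecycle=7,
)
print(command)
```

With these arguments, the helper reproduces the sample command shown earlier in this section. Passing partitions=[{"ds": "20240101", "hh": "00"}, {"ds": "20240102", "hh": "00"}] would add -DinputTablePartitions=ds=20240101/hh=00,ds=20240102/hh=00 to the command.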

Usage notes

The outputTableNameMulti parameter specifies the output table that lists words in their original order in the documents after word segmentation. Word segmentation is performed based on docId and docContent.
The outputTableNameTriple parameter specifies the output table that lists the number of times that each word appears in the documents after word segmentation. Word segmentation is performed based on docId and docContent.