Platform For AI: Word Frequency Statistics

Last Updated: Nov 28, 2024

Word Frequency Statistics is a fundamental text analysis technique that quantifies text data by tallying the occurrences of each word within the text. These results are crucial for the feature extraction phase, laying the groundwork for further Natural Language Processing tasks, such as text classification, clustering, and information retrieval.

Algorithm description

Word frequency is the number of times a word appears in a given corpus and is commonly used as an indicator of the word's significance in the text. To determine word frequency, the text (docContent) must first be segmented into individual words. For each document, the component then outputs the document ID (docId) together with the document's words in the order in which they were input, and finally counts how often each word occurs in that document. The results expose the lexical structure of the text and provide the data for further text analysis tasks, such as text classification, topic modeling, and information retrieval.
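The steps described above can be sketched in a few lines of Python. This is only an illustration of the counting logic, not the component's implementation. It assumes that docContent already holds space-separated words, and the function name word_frequency and the tuple layouts (doc_id, word) and (doc_id, word, count) are hypothetical choices made for this example.

```python
from collections import Counter


def word_frequency(docs):
    """Count words per document.

    docs: iterable of (doc_id, content) pairs. The content is assumed to be
    already segmented into space-separated words (an assumption of this sketch).
    Returns two lists:
      multi  - (doc_id, word) pairs in the original word order
      triple - (doc_id, word, count) rows, one per distinct word per document
    """
    multi = []
    triple = []
    for doc_id, content in docs:
        words = content.split()
        # Keep every word occurrence in input order.
        multi.extend((doc_id, word) for word in words)
        # Counter keeps first-seen order of distinct words (Python 3.7+).
        counts = Counter(words)
        triple.extend((doc_id, word, count) for word, count in counts.items())
    return multi, triple


docs = [
    ("doc1", "the cat saw the dog"),
    ("doc2", "dog dog cat"),
]
multi, triple = word_frequency(docs)
```

In this sketch, multi preserves every occurrence so that the original word order can be recovered, while triple holds one row per distinct word in each document.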

Input and output

Input port

Split Word

Output port

Configure the component

Method 1: Visualized method

Add a Word Frequency Statistics component on the pipeline page and configure the following parameters:

| Category | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Document ID Column | The column that contains the IDs of the specified documents (docId). |
| Fields Setting | Document Content Column | The column that contains the content of the specified documents (docContent). The text in this column is used for word frequency statistics, which includes segmentation and frequency calculation for each word. |
| Tuning | Cores | The number of cores to use. |
| Tuning | Memory Size per Core | The memory size of each core. Unit: MB. |

Method 2: PAI command method

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

pai -name doc_word_stat
    -project algo_public
    -DinputTableName=tdl_doc_test_split_word
    -DdocId=docid
    -DdocContent=content
    -DoutputTableNameMulti=doc_test_stat_multi
    -DoutputTableNameTriple=doc_test_stat_triple
    -DinputTablePartitions="region=cctv_news"
    -Dlifecycle=7

| Parameter | Required | Default value | Description |
| --- | --- | --- | --- |
| inputTableName | Yes | None | The name of the input table. |
| docId | Yes | None | The name of the document ID column. You can specify only one column. |
| docContent | Yes | None | The name of the document content column. You can specify only one column. |
| outputTableNameMulti | Yes | None | The name of the output table that lists the words in their original order after word segmentation, including the document ID column (docId) and the corresponding document content (docContent). |
| outputTableNameTriple | No | None | The name of the output table that lists the number of times that each word appears in the documents, including the document ID column (docId) and the corresponding document content (docContent). |
| inputTablePartitions | No | All partitions | The partitions selected from the input table for training. The following formats are supported: Partition_name=value, or name1=value1/name2=value2 for multi-level partitions. Note: If you specify multiple partitions, separate them with commas (,). For example, name1=value1,value2. |
| lifecycle | No | -1 | The lifecycle of the output table. The value must be a positive integer. |
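If the command is generated from a script, the parameter rules listed above can be checked before the command is submitted. The following Python sketch is one possible way to do that. The helper name build_doc_word_stat_command is hypothetical, and the sketch only assembles the command text; it does not call any PAI or MaxCompute API.

```python
# Parameter names are taken from the parameter list in this topic.
REQUIRED = ("inputTableName", "docId", "docContent", "outputTableNameMulti")
OPTIONAL = ("outputTableNameTriple", "inputTablePartitions", "lifecycle")


def build_doc_word_stat_command(params):
    """Validate params and return the PAI command as a string.

    Hypothetical helper for illustration: it checks required parameters,
    rejects unknown names, and enforces a positive integer lifecycle
    when one is given.
    """
    missing = [key for key in REQUIRED if not params.get(key)]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))

    unknown = sorted(set(params) - set(REQUIRED) - set(OPTIONAL))
    if unknown:
        raise ValueError("unknown parameters: " + ", ".join(unknown))

    if "lifecycle" in params:
        lifecycle = params["lifecycle"]
        is_int = isinstance(lifecycle, int) and not isinstance(lifecycle, bool)
        if not is_int or lifecycle <= 0:
            raise ValueError("lifecycle must be a positive integer")

    parts = ["pai -name doc_word_stat", "-project algo_public"]
    for key in REQUIRED + OPTIONAL:
        if key in params:
            value = params[key]
            if key == "inputTablePartitions":
                # Quote the partition spec, as in the sample command.
                value = '"{}"'.format(value)
            parts.append("-D{}={}".format(key, value))
    return "\n    ".join(parts)


command = build_doc_word_stat_command({
    "inputTableName": "tdl_doc_test_split_word",
    "docId": "docid",
    "docContent": "content",
    "outputTableNameMulti": "doc_test_stat_multi",
    "outputTableNameTriple": "doc_test_stat_triple",
    "inputTablePartitions": "region=cctv_news",
    "lifecycle": 7,
})
```

The table and column names in the usage example are the ones from the sample command in this topic. Omitting lifecycle leaves the default in effect, so the sketch validates the value only when it is supplied.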