Word Frequency Statistics is a fundamental text analysis technique that quantifies text data by tallying the occurrences of each word within the text. These results are crucial for the feature extraction phase, laying the groundwork for further Natural Language Processing tasks, such as text classification, clustering, and information retrieval.
Algorithm description
Word frequency is the number of times a word appears in a given corpus and is commonly used as an indicator of the word's significance in the text. To compute it, the document content (docContent) is first segmented into individual words. For each document, the component then outputs the document ID (docId) together with the document's words in their original input order. Finally, it counts the occurrences of each word in each document. The resulting counts expose the lexical structure of the text and serve as input for downstream tasks such as text classification, topic modeling, and information retrieval.
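The steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not the component's implementation: it assumes the document content has already been segmented into space-separated words, and the function name word_stats and the tuple layouts of the two results are hypothetical stand-ins for the two output tables.

```python
from collections import Counter


def word_stats(docs):
    """Compute per-document word lists and word counts.

    docs: iterable of (doc_id, content) pairs, where content is assumed
    to be already segmented into space-separated words.
    """
    multi = []   # (doc_id, word) pairs, words kept in their original order
    triple = []  # (doc_id, word, count) rows, one per distinct word per document
    for doc_id, content in docs:
        words = content.split()  # whitespace split stands in for real segmentation
        multi.extend((doc_id, w) for w in words)
        counts = Counter(words)  # preserves first-appearance order (Python 3.7+)
        triple.extend((doc_id, w, c) for w, c in counts.items())
    return multi, triple


if __name__ == "__main__":
    multi, triple = word_stats([("d1", "a b a"), ("d2", "c")])
    print(multi)
    print(triple)
```

Counts are kept per document rather than across the whole corpus, matching the description of calculating the frequency of each word in the specified text.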
Input and output
Input port
Output port
Configure the component
Method 1: Visualized method
Add a Word Frequency Statistics component on the pipeline page and configure the following parameters:
Category | Parameter | Description |
Fields Setting | Document ID Column | The column that contains the IDs of the specified documents (docId). |
Fields Setting | Document Content Column | The column that contains the content of the specified documents (docContent). The text in this column is used for word frequency analysis, which includes word segmentation and a frequency calculation for each word. |
Tuning | Cores | The number of cores to use. |
Tuning | Memory Size per Core | The memory size of each core. Unit: MB. |
Method 2: PAI command method
Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
pai -name doc_word_stat
-project algo_public
-DinputTableName=tdl_doc_test_split_word
-DdocId=docid
-DdocContent=content
-DoutputTableNameMulti=doc_test_stat_multi
-DoutputTableNameTriple=doc_test_stat_triple
-DinputTablePartitions="region=cctv_news"
-Dlifecycle=7
Parameter | Required | Default value | Description |
inputTableName | Yes | None | The name of the input table. |
docId | Yes | None | The name of the document ID column. You can specify only one column. |
docContent | Yes | None | The name of the document content column. You can specify only one column. |
outputTableNameMulti | Yes | None | The name of the output table that lists the words in their original order after word segmentation, including the document ID column (docId) and the corresponding document content (docContent). |
outputTableNameTriple | No | None | The name of the output table that lists the number of times that each word appears in the documents, including the document ID column (docId) and the corresponding document content (docContent). |
inputTablePartitions | No | All partitions | The partitions selected from the input table for training. The following formats are supported:
Note If you specify multiple partitions, separate them with commas (,). For example, name1=value1,value2. |
lifecycle | No | -1 | The lifecycle of the output table. The value must be a positive integer. |
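As a rough illustration of how the parameters in this table combine into a command, the following Python sketch assembles a doc_word_stat command string. The helper name build_pai_command is hypothetical, the required-parameter check simply mirrors the Required column above, and the single-line layout of the result is an assumption; it is a convenience sketch, not part of PAI.

```python
def build_pai_command(params):
    """Assemble a doc_word_stat PAI command string from a parameter dict.

    Hypothetical helper: checks only what the parameter table states.
    """
    required = ["inputTableName", "docId", "docContent", "outputTableNameMulti"]
    missing = [key for key in required if key not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))

    lifecycle = params.get("lifecycle")
    if lifecycle is not None and (not isinstance(lifecycle, int) or lifecycle <= 0):
        raise ValueError("lifecycle must be a positive integer")

    parts = ["pai -name doc_word_stat", "-project algo_public"]
    for key, value in params.items():
        if key == "inputTablePartitions":
            value = f'"{value}"'  # quoted, as in the sample command
        parts.append(f"-D{key}={value}")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_pai_command({
        "inputTableName": "tdl_doc_test_split_word",
        "docId": "docid",
        "docContent": "content",
        "outputTableNameMulti": "doc_test_stat_multi",
        "outputTableNameTriple": "doc_test_stat_triple",
        "inputTablePartitions": "region=cctv_news",
        "lifecycle": 7,
    }))
```

The generated string can then be pasted into an SQL Script component; any statement terminator or line breaks that the script editor expects would need to be added by hand.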