TF-IDF evaluates word importance within documents by combining term frequency and inverse document frequency.
Term Frequency (TF) measures how often a word appears in a document. Inverse Document Frequency (IDF) measures how rare a word is across the corpus. Words that appear in fewer documents have higher IDF values and are therefore better at distinguishing between document categories.
Two properties follow from this definition:
- A word's importance increases in proportion to how often it appears in a document.
- A word's importance decreases as the word becomes more frequent across the corpus.
This component calculates a TF-IDF value for each word in each document. It takes the output of the Word Frequency Statistics algorithm as input, not the original documents.
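To make the definition concrete, the following is a minimal Python sketch that computes TF-IDF from rows of document ID, word, and count. The formulas are an assumption: it uses a common textbook variant (TF as the word count divided by the document's total word count, IDF as the natural log of total documents over documents containing the word). The exact formula, smoothing, and log base used by the component are not stated here and may differ. The function name `tfidf_from_triples` and the sample rows are hypothetical.

```python
import math
from collections import defaultdict


def tfidf_from_triples(triples):
    """Compute TF-IDF from (doc_id, word, count) rows.

    Assumed formulas (textbook variant, not confirmed for this component):
      tf  = count / total word count of the document
      idf = ln(total documents / documents containing the word)
    """
    doc_totals = defaultdict(int)  # doc_id -> total word count in that document
    doc_freq = defaultdict(set)    # word -> set of doc_ids that contain it
    for doc_id, word, count in triples:
        doc_totals[doc_id] += count
        doc_freq[word].add(doc_id)

    n_docs = len(doc_totals)
    scores = {}
    for doc_id, word, count in triples:
        tf = count / doc_totals[doc_id]
        idf = math.log(n_docs / len(doc_freq[word]))
        scores[(doc_id, word)] = tf * idf
    return scores


# Hypothetical sample rows in (id, word, count) form.
rows = [
    (1, "cat", 2), (1, "the", 2),
    (2, "dog", 1), (2, "the", 3),
]
scores = tfidf_from_triples(rows)
```

In this sample, "the" appears in both documents, so its IDF and therefore its TF-IDF score is zero, while "cat" and "dog" each appear in only one document and receive positive scores.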
Usage notes
TF-IDF requires output from the Word Frequency Statistics algorithm. Connect this component downstream of the Word Frequency Statistics component.
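If you prepare the input yourself instead of using the upstream component, the rows need the same shape: one row per document and word, with a count. The sketch below shows one way to produce such rows in Python. It is illustrative only: the `to_triples` helper is hypothetical, and the lowercase whitespace split is an assumption that stands in for a real word splitter.

```python
from collections import Counter


def to_triples(docs):
    """Turn {doc_id: text} into (id, word, count) rows.

    Tokenization is a plain lowercase whitespace split, an assumption for
    illustration; real pipelines usually need a proper word splitter.
    """
    rows = []
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        # Sort words so the output order is deterministic within a document.
        for word, count in sorted(counts.items()):
            rows.append((doc_id, word, count))
    return rows


# Hypothetical sample documents keyed by document ID.
docs = {1: "the cat saw the cat", 2: "the dog"}
triples = to_triples(docs)
```

Each resulting tuple maps to the id, word, and count columns that the TF-IDF component reads.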
Configuration
Method 1: Designer UI
Add the TF-IDF component to your Designer workflow, then configure parameters in the right pane.
| Parameter type | Parameter | Description |
| --- | --- | --- |
| Fields setting | Document ID column | Select the document ID column (the id column) output by the Word Frequency Statistics component, or process original documents into the required format. For details, see the output description in the Word Frequency Statistics example. |
| Fields setting | Word column | Select the word column (the word column) output by the Word Frequency Statistics component, or process original documents into the required format. For details, see the output description in the Word Frequency Statistics example. |
| Fields setting | Word count column | Select the word count column (the count column) output by the Word Frequency Statistics component, or process original documents into the required format. For details, see the output description in the Word Frequency Statistics example. |
| Execution tuning | Number of computing cores | Number of workers. Calculated automatically by default. |
| Execution tuning | Memory per core | Memory size of each worker, in MB. |
Method 2: PAI command
Configure component parameters using a PAI command. Use the SQL Script component to call PAI commands. For details, see SQL Script.
PAI -name tfidf
-project algo_public
-DinputTableName=rgdoc_split_triple_out
-DdocIdCol=id
-DwordCol=word
-DcountCol=count
-DoutputTableName=rg_tfidf_out;
| Parameter | Required | Default value | Description |
| --- | --- | --- | --- |
| inputTableName | Yes | None | Name of the input table. |
| inputTablePartitions | No | All partitions of the input table | Input table partitions to use for training. Use the format |
| docIdCol | Yes | None | Column name that identifies the document ID. Specify only one column. |
| wordCol | Yes | None | Name of the word column. Specify only one column. |
| countCol | Yes | None | Name of the count column. Specify only one column. |
| outputTableName | Yes | None | Name of the output table. |
| lifecycle | No | None | Lifecycle of the output table, in days. Must be a positive integer. |
| coreNum | No | Calculated automatically | Number of cores. Takes effect only when set together with memSizePerCore. |
| memSizePerCore | No | Calculated automatically | Memory size of each core. Takes effect only when set together with coreNum. |
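When the command is generated from a script, the parameter rules above can be checked before submission. The following Python sketch assembles the command text from keyword arguments. The `build_tfidf_command` helper is hypothetical and not part of any PAI SDK, and its column defaults simply mirror the sample command. Note that it raises an error when only one of coreNum and memSizePerCore is set, which is stricter than the documented behavior (the setting merely has no effect).

```python
def build_tfidf_command(input_table, output_table, doc_id_col="id",
                        word_col="word", count_col="count",
                        lifecycle=None, core_num=None, mem_size_per_core=None):
    """Assemble a PAI tfidf command string (hypothetical helper)."""
    # coreNum and memSizePerCore only take effect when set together,
    # so this sketch rejects setting just one of them.
    if (core_num is None) != (mem_size_per_core is None):
        raise ValueError("set coreNum and memSizePerCore together")
    # lifecycle must be a positive integer number of days.
    if lifecycle is not None and (not isinstance(lifecycle, int) or lifecycle <= 0):
        raise ValueError("lifecycle must be a positive integer")

    params = {
        "inputTableName": input_table,
        "docIdCol": doc_id_col,
        "wordCol": word_col,
        "countCol": count_col,
        "outputTableName": output_table,
        "lifecycle": lifecycle,
        "coreNum": core_num,
        "memSizePerCore": mem_size_per_core,
    }
    lines = ["PAI -name tfidf", "-project algo_public"]
    # Optional parameters left as None are omitted from the command.
    lines += [f"-D{key}={value}" for key, value in params.items() if value is not None]
    return "\n".join(lines) + ";"


command = build_tfidf_command("rgdoc_split_triple_out", "rg_tfidf_out")
```

Called with only the two table names, the helper reproduces the sample command shown earlier in this section.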