Configure the TF-IDF component - Platform For AI

TF-IDF evaluates word importance within documents by combining term frequency and inverse document frequency.

Term Frequency (TF) measures how often a word appears in a document. Inverse Document Frequency (IDF) measures how rare a word is across the whole document set. Words that appear in fewer documents have higher IDF values, which means they are better at distinguishing between document categories.
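For reference, one common textbook formulation is sketched below. The source does not state which variant this component implements (for example, the logarithm base, whether TF is normalized by document length, or any smoothing), so the notation is an assumption: $n_{w,d}$ is the count of word $w$ in document $d$, and $N$ is the total number of documents.

```latex
\mathrm{tf}(w,d) = \frac{n_{w,d}}{\sum_{w'} n_{w',d}}, \qquad
\mathrm{idf}(w) = \log\frac{N}{\left|\{\, d : n_{w,d} > 0 \,\}\right|}, \qquad
\mathrm{tfidf}(w,d) = \mathrm{tf}(w,d)\cdot\mathrm{idf}(w)
```

Under this definition, a word that appears in every document has $\mathrm{idf}(w) = \log 1 = 0$, so its TF-IDF value is zero no matter how often it occurs.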

The TF-IDF value of a word reflects two opposing effects:

Word importance increases with the number of times the word appears in a document.
Word importance decreases as the word appears in more documents across the corpus.

This component calculates TF-IDF values for each word in each document using output from the Word Frequency Statistics algorithm, not original documents.
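The calculation can be illustrated with a minimal Python sketch that starts from (id, word, count) rows, the same layout the component reads. It assumes the common textbook definitions (TF as the count divided by the document's total word count, IDF as the natural log of total documents over documents containing the word). The component's exact formula may differ, and the function name `tfidf` and the sample rows are hypothetical.

```python
import math
from collections import defaultdict


def tfidf(triples):
    """Compute TF-IDF from (doc_id, word, count) rows.

    Assumes textbook definitions and positive counts; this is an
    illustration, not the component's actual implementation.
    """
    triples = list(triples)
    doc_totals = defaultdict(int)  # total word count per document
    doc_freq = defaultdict(set)    # set of documents containing each word
    for doc_id, word, count in triples:
        doc_totals[doc_id] += count
        doc_freq[word].add(doc_id)
    n_docs = len(doc_totals)

    rows = []
    for doc_id, word, count in triples:
        tf = count / doc_totals[doc_id]
        idf = math.log(n_docs / len(doc_freq[word]))
        rows.append((doc_id, word, tf, idf, tf * idf))
    return rows


# Hypothetical sample data in the id/word/count layout.
sample = [
    ("d1", "apple", 2), ("d1", "banana", 1),
    ("d2", "apple", 1), ("d2", "cherry", 3),
]
for row in tfidf(sample):
    print(row)
```

In this sample, "apple" occurs in both documents, so its IDF (and therefore its TF-IDF) is zero, while "banana" and "cherry" each occur in only one document and receive positive scores.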

Usage notes

TF-IDF requires output from the Word Frequency Statistics algorithm. Connect this component downstream of the Word Frequency Statistics component.

Configuration

Method 1: Designer UI

Add the TF-IDF component to your Designer workflow, then configure parameters in the right pane.

Parameter type	Parameter	Description
Fields setting	Document ID column	Select the document ID column (id) output by the Word Frequency Statistics component. Alternatively, preprocess the original documents into the same format. For details, see the output description in the Word Frequency Statistics example.
	Word column	Select the word column (word) output by the Word Frequency Statistics component. Alternatively, preprocess the original documents into the same format. For details, see the output description in the Word Frequency Statistics example.
	Word count column	Select the word count column (count) output by the Word Frequency Statistics component. Alternatively, preprocess the original documents into the same format. For details, see the output description in the Word Frequency Statistics example.
Execution tuning	Number of computing cores	Number of workers. Calculated automatically by default.
	Memory per core	Memory size of each worker, in MB. Calculated automatically by default.

Method 2: PAI command

Configure component parameters using a PAI command. Use the SQL Script component to call PAI commands. For details, see SQL Script.

PAI -name tfidf
    -project algo_public
    -DinputTableName=rgdoc_split_triple_out
    -DdocIdCol=id
    -DwordCol=word
    -DcountCol=count
    -DoutputTableName=rg_tfidf_out;

Parameter	Required	Default value	Description
inputTableName	Yes	None	Name of the input table.
inputTablePartitions	No	All partitions of the input table	Input table partitions to use in the calculation. Use the format `partition_name=value`. For multiple partition levels, use `name1=value1/name2=value2`. Separate multiple partitions with commas (,).
docIdCol	Yes	None	Column name that identifies the document ID. Specify only one column.
wordCol	Yes	None	Name of the word column. Specify only one column.
countCol	Yes	None	Name of the count column. Specify only one column.
outputTableName	Yes	None	Name of the output table.
lifecycle	No	None	Lifecycle of the output table, in days. Must be a positive integer.
coreNum	No	Calculated automatically	Number of cores. Takes effect only when set together with memSizePerCore.
memSizePerCore	No	Calculated automatically	Memory size of each core, in MB. Takes effect only when set together with coreNum.
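The `inputTablePartitions` format described above can be easy to get wrong when there are several partitions with several levels. The following is a small Python sketch of a helper that assembles the value. The helper name `partition_spec` and the partition names `ds` and `region` are hypothetical and only illustrate the `name1=value1/name2=value2` layout with comma separators.

```python
def partition_spec(partitions):
    """Format a value for inputTablePartitions.

    Each partition is a list of (name, value) pairs, one pair per
    partition level. Levels are joined with '/', partitions with ','.
    """
    return ",".join(
        "/".join(f"{name}={value}" for name, value in levels)
        for levels in partitions
    )


# Two hypothetical two-level partitions.
spec = partition_spec([
    [("ds", "20240101"), ("region", "cn")],
    [("ds", "20240102"), ("region", "cn")],
])
print(spec)  # ds=20240101/region=cn,ds=20240102/region=cn
```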