Platform For AI: TF-IDF

Last Updated: Mar 06, 2026

TF-IDF evaluates word importance within documents by combining term frequency and inverse document frequency.

Term Frequency (TF) counts how many times a word appears in a document. Inverse Document Frequency (IDF) measures how rare a word is across the whole set of documents: words that appear in fewer documents have higher IDF values and are therefore better at distinguishing between document categories.

The two factors pull in opposite directions:

  • A word's importance increases with the number of times it appears in a document.

  • A word's importance decreases as the number of documents in the corpus that contain it grows.
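
For reference, one common textbook formulation of these two factors is sketched below. The symbols are introduced here for illustration only: n_{w,d} is the count of word w in document d, N is the total number of documents, and df_w is the number of documents that contain w. This page does not state which exact variant (for example, logarithm base or smoothing) the component uses, so treat this as an assumption rather than the component's definition.

```latex
\mathrm{tf}(w,d) = \frac{n_{w,d}}{\sum_{w'} n_{w',d}}, \qquad
\mathrm{idf}(w) = \log\frac{N}{\mathrm{df}_w}, \qquad
\operatorname{tf\text{-}idf}(w,d) = \mathrm{tf}(w,d)\cdot\mathrm{idf}(w)
```

Under this formulation, a word that occurs in every document has an IDF of 0 and therefore a TF-IDF of 0, no matter how often it appears.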

This component calculates TF-IDF values for each word in each document using output from the Word Frequency Statistics algorithm, not original documents.
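
To make the data flow concrete, the calculation can be sketched in plain Python over (document ID, word, count) rows of the kind this component consumes. This is not the component's implementation. It assumes TF is the word count divided by the document's total word count and IDF is the natural logarithm of total documents over documents containing the word. The function name and sample data are hypothetical.

```python
import math
from collections import defaultdict

def tfidf_from_triples(rows):
    """Compute TF-IDF from (doc_id, word, count) triples.

    Assumes one row per (document, word) pair with a positive count.
    Assumed formulas (the real component may differ):
      tf  = count / total word count of the document
      idf = ln(total documents / documents containing the word)
    Returns a dict mapping (doc_id, word) -> tf-idf score.
    """
    rows = list(rows)
    words_per_doc = defaultdict(int)   # doc_id -> total word count
    docs_per_word = defaultdict(set)   # word -> set of doc_ids containing it
    for doc_id, word, count in rows:
        words_per_doc[doc_id] += count
        docs_per_word[word].add(doc_id)

    total_docs = len(words_per_doc)
    scores = {}
    for doc_id, word, count in rows:
        tf = count / words_per_doc[doc_id]
        idf = math.log(total_docs / len(docs_per_word[word]))
        scores[(doc_id, word)] = tf * idf
    return scores

# Made-up sample data: two documents, three distinct words.
triples = [
    ("d1", "apple", 2), ("d1", "pie", 1),
    ("d2", "apple", 1), ("d2", "tree", 3),
]
scores = tfidf_from_triples(triples)
```

In this sample, "apple" appears in both documents, so its score is 0 in each, while "pie" and "tree" each appear in only one document and receive positive scores.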

Usage notes

TF-IDF requires output from the Word Frequency Statistics algorithm. Connect this component downstream of the Word Frequency Statistics component.
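
The expected input is one row per distinct word per document, carrying a document ID, the word, and its count. A minimal sketch of that layout, assuming the documents are already segmented into word lists: the function name and sample data are hypothetical, and the keys id, word, and count simply mirror the column names used in the PAI command example on this page.

```python
from collections import Counter

def to_triples(docs):
    """Turn tokenized documents into (id, word, count) rows.

    docs: dict mapping doc_id -> list of tokens (already segmented).
    Returns a list of dicts, one per distinct word per document.
    """
    rows = []
    for doc_id, tokens in docs.items():
        # Counter tallies occurrences; sorting keeps the output order stable.
        for word, count in sorted(Counter(tokens).items()):
            rows.append({"id": doc_id, "word": word, "count": count})
    return rows

# Made-up sample documents.
docs = {"d1": ["apple", "pie", "apple"], "d2": ["tree"]}
rows = to_triples(docs)
```

Each resulting row corresponds to one line of the input table that the TF-IDF component reads.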

Configuration

Method 1: Designer UI

Add the TF-IDF component to your Designer workflow, then configure parameters in the right pane.

| Parameter type | Parameter | Description |
| --- | --- | --- |
| Fields setting | Document ID column | Select the document ID column (the id column) output by the Word Frequency Statistics component, or the equivalent column of a table that you have preprocessed into the same format. For details, see the output description in the Word Frequency Statistics example. |
| Fields setting | Word column | Select the word column (the word column) output by the Word Frequency Statistics component, or the equivalent column of a table that you have preprocessed into the same format. For details, see the output description in the Word Frequency Statistics example. |
| Fields setting | Word count column | Select the word count column (the count column) output by the Word Frequency Statistics component, or the equivalent column of a table that you have preprocessed into the same format. For details, see the output description in the Word Frequency Statistics example. |
| Execution tuning | Number of computing cores | Number of workers. Calculated automatically by default. |
| Execution tuning | Memory per core | Memory size of each worker, in MB. |

Method 2: PAI command

Configure component parameters using a PAI command. Use the SQL Script component to call PAI commands. For details, see SQL Script.

PAI -name tfidf
    -project algo_public
    -DinputTableName=rgdoc_split_triple_out
    -DdocIdCol=id
    -DwordCol=word
    -DcountCol=count
    -DoutputTableName=rg_tfidf_out;

| Parameter | Required | Default value | Description |
| --- | --- | --- | --- |
| inputTableName | Yes | None | Name of the input table. |
| inputTablePartitions | No | All partitions of the input table | Input table partitions to use for the calculation. Use the format partition_name=value. For multiple partition levels, use name1=value1/name2=value2. Separate multiple partitions with commas (,). |
| docIdCol | Yes | None | Name of the document ID column. Specify only one column. |
| wordCol | Yes | None | Name of the word column. Specify only one column. |
| countCol | Yes | None | Name of the count column. Specify only one column. |
| outputTableName | Yes | None | Name of the output table. |
| lifecycle | No | None | Lifecycle of the output table, in days. Must be a positive integer. |
| coreNum | No | Calculated automatically | Number of cores. Takes effect only when set together with memSizePerCore. |
| memSizePerCore | No | Calculated automatically | Memory size of each core, in MB. Takes effect only when set together with coreNum. |
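
The parameter rules above can be captured in a small helper that assembles the command text before it is pasted into an SQL Script component. This is a sketch under stated assumptions: the helper name, its argument names, and the sample partition values are hypothetical, and any quoting or escaping that partition values may need is not covered on this page.

```python
def build_tfidf_command(input_table, output_table, doc_id_col, word_col, count_col,
                        partitions=None, lifecycle=None,
                        core_num=None, mem_size_per_core=None):
    """Assemble a PAI tfidf command string from the documented parameters."""
    # coreNum and memSizePerCore take effect only when set together.
    if (core_num is None) != (mem_size_per_core is None):
        raise ValueError("set coreNum and memSizePerCore together, or neither")
    # lifecycle must be a positive integer number of days.
    if lifecycle is not None and (not isinstance(lifecycle, int) or lifecycle <= 0):
        raise ValueError("lifecycle must be a positive integer")

    args = [
        "-project algo_public",
        f"-DinputTableName={input_table}",
        f"-DdocIdCol={doc_id_col}",
        f"-DwordCol={word_col}",
        f"-DcountCol={count_col}",
        f"-DoutputTableName={output_table}",
    ]
    if partitions:
        # Each entry looks like name1=value1/name2=value2; entries are comma-separated.
        args.append("-DinputTablePartitions=" + ",".join(partitions))
    if lifecycle is not None:
        args.append(f"-Dlifecycle={lifecycle}")
    if core_num is not None:
        args.append(f"-DcoreNum={core_num}")
        args.append(f"-DmemSizePerCore={mem_size_per_core}")
    return "PAI -name tfidf\n    " + "\n    ".join(args) + ";"

# Required parameters only: reproduces the example command on this page.
basic = build_tfidf_command("rgdoc_split_triple_out", "rg_tfidf_out", "id", "word", "count")

# With optional parameters (partition names and values are made up).
tuned = build_tfidf_command(
    "rgdoc_split_triple_out", "rg_tfidf_out", "id", "word", "count",
    partitions=["ds=20240101/hh=00", "ds=20240102/hh=00"],
    lifecycle=7, core_num=4, mem_size_per_core=2048,
)
```

Rejecting a lone coreNum or memSizePerCore up front avoids the case where the setting is silently ignored at run time.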