Platform For AI:TF-IDF

Last Updated:Dec 14, 2023

Term Frequency-Inverse Document Frequency (TF-IDF) is a commonly used weighting technique for information retrieval and text mining. TF-IDF is used by search engines as a tool in scoring and ranking the relevance of a document for a given search query.

Term frequency (TF) is the number of times that a given word appears in a document. Inverse document frequency (IDF) measures how rare a word is across the corpus: the fewer documents that contain a given word, the higher the IDF score of the word and the better the word distinguishes one document from another.

TF-IDF is a statistical measure used to evaluate the importance of a word to a document in a collection of documents (corpus). It combines two tendencies:

  • The importance of a word increases in proportion to the number of times that it appears in the document.

  • The importance of a word decreases as the number of documents in the corpus that contain it increases.
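
The two tendencies above are commonly formalized as follows. This is the textbook formulation, in which TF is normalized by document length and IDF uses a natural logarithm. The exact variant (normalization, logarithm base, smoothing) that the TF-IDF component applies is not stated on this page, so treat the formula and its symbols as assumptions. Here n(w,d) is the count of word w in document d, and N is the total number of documents.

```latex
\[
\mathrm{tf}(w,d) = \frac{n_{w,d}}{\sum_{w'} n_{w',d}}, \qquad
\mathrm{idf}(w) = \ln\frac{N}{\lvert\{\, d : n_{w,d} > 0 \,\}\rvert}, \qquad
\mathrm{tfidf}(w,d) = \mathrm{tf}(w,d)\cdot\mathrm{idf}(w)
\]
```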

The TF-IDF component calculates the TF-IDF value of each word in a collection of documents from the output of the Word Frequency Statistics component. The calculation uses these per-document word counts and does not read the original documents.
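
To illustrate how TF-IDF values can be derived from word counts alone, the following is a minimal Python sketch. It assumes the input is a list of (document ID, word, count) rows with positive counts, and it assumes the common definitions tf = count / total words in the document and idf = ln(total documents / documents that contain the word). The function name tfidf_from_counts is hypothetical, and the component itself may use a different formula.

```python
import math
from collections import defaultdict

def tfidf_from_counts(rows):
    """Compute (doc_id, word, tf, idf, tfidf) from (doc_id, word, count) rows.

    Assumes one row per distinct (document, word) pair and positive counts.
    """
    rows = list(rows)
    doc_totals = defaultdict(int)      # total number of words per document
    docs_with_word = defaultdict(set)  # IDs of documents that contain each word
    for doc_id, word, count in rows:
        doc_totals[doc_id] += count
        docs_with_word[word].add(doc_id)
    total_docs = len(doc_totals)

    result = []
    for doc_id, word, count in rows:
        tf = count / doc_totals[doc_id]
        idf = math.log(total_docs / len(docs_with_word[word]))
        result.append((doc_id, word, tf, idf, tf * idf))
    return result

# Made-up sample data: "apple" occurs in both documents, so its idf is 0.
sample = [("d1", "apple", 2), ("d1", "pie", 1),
          ("d2", "apple", 1), ("d2", "tree", 3)]
for row in tfidf_from_counts(sample):
    print(row)
```

Under these assumed definitions, a word that occurs in every document scores 0 no matter how often it appears, which matches the second tendency described above.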

Usage notes

The TF-IDF component processes data generated by the Word Frequency Statistics component. As such, you must connect the Word Frequency Statistics component as an upstream node of the TF-IDF component.

Configure the component

You can configure the component by using one of the following methods:

Method 1: Configure the component in Machine Learning Designer

Configure the component on the pipeline configuration tab of Machine Learning Designer in the Machine Learning Platform for AI console.

| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Document ID Column | You can set the parameter to id, which is an output column of the Word Frequency Statistics component. Alternatively, you can process the original document to follow the output format of the Word Frequency Statistics component. For more information, see the sample output in Word Frequency Statistics. |
| Fields Setting | Word Column | You can set the parameter to word, which is an output column of the Word Frequency Statistics component. Alternatively, you can process the original document to follow the output format of the Word Frequency Statistics component. For more information, see the sample output in Word Frequency Statistics. |
| Fields Setting | Word Counting Column | You can set the parameter to count, which is an output column of the Word Frequency Statistics component. Alternatively, you can process the original document to follow the output format of the Word Frequency Statistics component. For more information, see the sample output in Word Frequency Statistics. |
| Tuning | Cores | The number of cores used for calculation. The value is automatically calculated by default. |
| Tuning | Memory Size per Core | The memory size per core. Unit: MB. |
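
If you prepare the input yourself instead of using the Word Frequency Statistics component, the data needs the three columns named above: id, word, and count. The following Python sketch shows one way to build such rows. It assumes that each document is already segmented into words and that the expected layout is one row per distinct (document, word) pair. The function name to_triples and the sample data are hypothetical. Check the sample output in Word Frequency Statistics for the authoritative format.

```python
from collections import Counter

def to_triples(docs):
    """Turn {doc_id: [word, ...]} into sorted (id, word, count) rows."""
    triples = []
    for doc_id, words in docs.items():
        # Counter tallies how many times each word occurs in this document.
        for word, count in sorted(Counter(words).items()):
            triples.append((doc_id, word, count))
    return triples

# Made-up, pre-segmented documents.
docs = {
    "d1": ["apple", "pie", "apple"],
    "d2": ["apple", "tree"],
}
for row in to_triples(docs):
    print(row)
```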

Method 2: Run Machine Learning Platform for AI commands

Configure the component parameters by using a Machine Learning Platform for AI command. You can use the SQL Script component to run Machine Learning Platform for AI commands. For more information, see SQL Script. The following example shows the command, and the table after it describes the command parameters.

PAI -name tfidf
    -project algo_public
    -DinputTableName=rgdoc_split_triple_out
    -DdocIdCol=id
    -DwordCol=word
    -DcountCol=count
    -DoutputTableName=rg_tfidf_out;

| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTableName | Yes | The name of the input table. | None |
| inputTablePartitions | No | The partitions of the input table to use for the calculation. Specify each partition in the partition_name=value format. For multi-level partitions, use the name1=value1/name2=value2 format. Separate multiple partitions with commas (,). | All partitions |
| docIdCol | Yes | The name of the document ID column. You can specify only a single column. | None |
| wordCol | Yes | The name of the word column. You can specify only a single column. | None |
| countCol | Yes | The name of the word counting column. You can specify only a single column. | None |
| outputTableName | Yes | The name of the output table. | None |
| lifecycle | No | The lifecycle of the output table. The value must be a positive integer. Unit: days. | None |
| coreNum | No | The number of cores. This parameter takes effect only when memSizePerCore is also set. | Determined by the system |
| memSizePerCore | No | The memory size of each core. Unit: MB. This parameter takes effect only when coreNum is also set. | Determined by the system |
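
The parameter rules above can be checked before a command is submitted. The following Python sketch assembles the command text from the documented parameters. The helper build_tfidf_command, its argument names, and the partition names ds and hour are hypothetical and are not part of Machine Learning Platform for AI. The sketch also assumes that the order of -D parameters does not matter and that partition values need no quoting, neither of which is stated on this page.

```python
def build_tfidf_command(input_table, output_table, doc_id_col="id",
                        word_col="word", count_col="count", partitions=None,
                        lifecycle=None, core_num=None, mem_size_per_core=None):
    """Return the text of a PAI tfidf command (hypothetical helper)."""
    # coreNum and memSizePerCore only take effect together, so reject a lone
    # value instead of letting it be silently ignored.
    if (core_num is None) != (mem_size_per_core is None):
        raise ValueError("set coreNum and memSizePerCore together")
    if lifecycle is not None and (isinstance(lifecycle, bool)
                                  or not isinstance(lifecycle, int)
                                  or lifecycle <= 0):
        raise ValueError("lifecycle must be a positive integer (days)")

    params = {
        "inputTableName": input_table,
        "docIdCol": doc_id_col,
        "wordCol": word_col,
        "countCol": count_col,
        "outputTableName": output_table,
    }
    if partitions:
        # Each entry looks like "name=value" or "name1=value1/name2=value2".
        params["inputTablePartitions"] = ",".join(partitions)
    if lifecycle is not None:
        params["lifecycle"] = lifecycle
    if core_num is not None:
        params["coreNum"] = core_num
        params["memSizePerCore"] = mem_size_per_core

    lines = ["PAI -name tfidf", "    -project algo_public"]
    lines += [f"    -D{name}={value}" for name, value in params.items()]
    return "\n".join(lines) + ";"

# With only the required tables given, this prints the sample command shown
# under Method 2.
print(build_tfidf_command("rgdoc_split_triple_out", "rg_tfidf_out"))
```

Raising an error when only one of coreNum and memSizePerCore is given is a design choice of this sketch. The component itself is documented only as ignoring the setting in that case.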