Term Frequency-Inverse Document Frequency (TF-IDF) is a commonly used weighting technique for information retrieval and text mining. TF-IDF is used by search engines as a tool in scoring and ranking the relevance of a document for a given search query.

Term frequency (TF) is the number of times that a given word appears in a document. Inverse document frequency (IDF) measures how rare a word is across the corpus: the fewer documents that contain the word, the higher its IDF score and the better the word distinguishes one document from another.

TF-IDF is a statistical measure used to evaluate the importance of a word to a document in a corpus:
  • The importance of a word increases proportionally when the number of times that it appears in the document increases.
  • The importance of a word decreases when the number of times that it appears in the corpus increases.
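
The two properties above can be written as a formula. The following is one common textbook formulation, given here as an assumption; this page does not state the exact normalization or logarithm base that the component uses. The symbols n_{t,d} (the count of word t in document d) and N (the number of documents in the corpus) are introduced only for illustration.

```latex
\mathrm{tf}(t,d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad
\mathrm{idf}(t) = \log \frac{N}{\left|\{\, d : n_{t,d} > 0 \,\}\right|}, \qquad
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)
```

Under this formulation, a word that appears in every document has idf = log(N/N) = 0, so its TF-IDF value is 0 regardless of how often it appears.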

The TF-IDF component calculates the TF-IDF value of each word that appears in a collection of documents. It takes the output of the Word Frequency Statistics component as its input rather than the original documents.
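
To illustrate how TF-IDF values can be derived from word-count rows instead of raw text, here is a minimal Python sketch. It assumes input rows shaped like (id, word, count), matching the column names used later on this page, and it assumes a common formulation (count divided by the document's total count, multiplied by the natural log of total documents over documents containing the word). The function name tfidf and the sample rows are hypothetical, and the component's actual formula and output columns may differ.

```python
import math
from collections import defaultdict


def tfidf(triples):
    """Compute TF-IDF from (doc_id, word, count) rows.

    Assumed formulation (not confirmed for the PAI component):
      tf  = count / total word count of the document
      idf = ln(number of documents / number of documents containing the word)
    Counts are assumed to be positive integers.
    """
    rows = list(triples)
    doc_totals = defaultdict(int)      # total word count per document
    docs_with_word = defaultdict(set)  # documents that contain each word
    for doc_id, word, count in rows:
        doc_totals[doc_id] += count
        docs_with_word[word].add(doc_id)

    n_docs = len(doc_totals)
    result = []
    for doc_id, word, count in rows:
        tf = count / doc_totals[doc_id]
        idf = math.log(n_docs / len(docs_with_word[word]))
        result.append((doc_id, word, count, tf, idf, tf * idf))
    return result


# Hypothetical sample rows in (id, word, count) form.
rows = [
    ("d1", "apple", 2), ("d1", "pie", 1),
    ("d2", "apple", 1), ("d2", "tart", 3),
]

for row in tfidf(rows):
    print(row)
```

In this sample, "apple" appears in both documents, so its IDF and TF-IDF are 0, while "pie" and "tart" each appear in only one document and receive positive scores.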

Configure the component

You can configure the component by using one of the following methods:
  • Machine Learning Platform for AI console
    Tab | Parameter | Description
    Fields Setting | Document ID Column | You can set the parameter to id, which is an output column of the Word Frequency Statistics component. Alternatively, you can process the original document to follow the output format of the Word Frequency Statistics component. For more information, see the sample output in Word Frequency Statistics.
    Fields Setting | Word Column | You can set the parameter to word, which is an output column of the Word Frequency Statistics component. Alternatively, you can process the original document to follow the output format of the Word Frequency Statistics component. For more information, see the sample output in Word Frequency Statistics.
    Fields Setting | Word Counting Column | You can set the parameter to count, which is an output column of the Word Frequency Statistics component. Alternatively, you can process the original document to follow the output format of the Word Frequency Statistics component. For more information, see the sample output in Word Frequency Statistics.
    Tuning | Cores | The number of cores used for calculation. The value is automatically calculated by default.
    Tuning | Memory Size per Core | The memory size of each core. Unit: MB.
  • Machine Learning Platform for AI command
    PAI -name tfidf
        -project algo_public
        -DinputTableName=rgdoc_split_triple_out
        -DdocIdCol=id
        -DwordCol=word
        -DcountCol=count
        -DoutputTableName=rg_tfidf_out;
    Parameter | Required | Description | Default value
    inputTableName | Yes | The name of the input table. | N/A
    inputTablePartitions | No | The partitions of the input table that are used in the calculation. The value must be in the partition_name=value format. To specify multiple levels of partitions, use the following format: name1=value1/name2=value2. Separate multiple partitions with commas (,). | All partitions of the input table
    docIdCol | Yes | The name of the document ID column. Only one column can be specified. | N/A
    wordCol | Yes | The name of the word column. Only one column can be specified. | N/A
    countCol | Yes | The name of the word counting column. Only one column can be specified. | N/A
    outputTableName | Yes | The name of the output table. | N/A
    lifecycle | No | The lifecycle of the output table. The value must be a positive integer. Unit: days. | N/A
    coreNum | No | The number of cores used for calculation. This parameter and the memSizePerCore parameter take effect only when they are both set. | Automatically calculated
    memSizePerCore | No | The memory size of each core. Unit: MB. This parameter and the coreNum parameter take effect only when they are both set. | Automatically calculated