A topic model is a statistical model that is used to discover abstract topics in a collection of documents. In Machine Learning Platform for AI, you can set the Topics parameter of the PLDA component to control how many topics are extracted from the documents.

Latent Dirichlet allocation (LDA) is a topic model that describes the topics of each document as a probability distribution. LDA is an unsupervised learning algorithm, so you do not need to manually annotate a training set. You only need to specify K, the number of topics in the document set. K corresponds to the Topics parameter of the PLDA component.
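
For reference, the standard LDA generative process can be sketched as follows. The symbols \theta_d, \phi_z, D, and N_d are notation introduced here for illustration and do not appear in the component. \alpha and \beta are the Dirichlet priors that correspond to the Alpha and Beta parameters of the component, and symmetric priors are assumed. \theta_d is the per-document topic distribution, written P(z/d) in this topic, and \phi_z is the per-topic word distribution, written P(w/z).

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
\phi_z &\sim \mathrm{Dirichlet}(\beta), & z &= 1, \dots, K \\
z_{d,n} &\sim \mathrm{Multinomial}(\theta_d), & n &= 1, \dots, N_d \\
w_{d,n} &\sim \mathrm{Multinomial}(\phi_{z_{d,n}})
\end{align*}
```

Training estimates \theta_d and \phi_z from the observed words w_{d,n} alone, which is why no annotated training set is needed.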

LDA was proposed by David M. Blei, Andrew Y. Ng, and Michael I. Jordan in 2003. In text mining, it is used for text recognition, text classification, and text similarity calculation.

Configure the component

You can configure the component by using one of the following methods:
  • Machine Learning Platform for AI console
    Table 1. Parameters
    Tab | Parameter | Description
    Fields Setting | Feature Columns | The feature columns that are used for training.
    Parameters Setting | Topics | The number of topics generated by LDA.
    Parameters Setting | Alpha | The prior Dirichlet distribution parameter of P(z/d).
    Parameters Setting | Beta | The prior Dirichlet distribution parameter of P(w/z).
    Parameters Setting | Burn-in Iterations | The number of burn-in iterations. The value of this parameter must be smaller than the total number of iterations. Default value: 100.
    Parameters Setting | Total Iterations | Optional. The total number of iterations. The value must be a positive integer. Default value: 150.
  • Machine Learning Platform for AI command
    pai -name PLDA
        -project algo_public
        -DinputTableName=lda_input
        -DtopicNum=10
        -DtopicWordTableName=lda_output;
    Parameter | Required | Description | Type | Default value
    inputTableName | Yes | The name of the input table. | STRING | N/A
    inputTablePartitions | No | The partitions that are selected from the input table for training. Supported formats: Partition_name=value, or name1=value1/name2=value2 for multi-level partitions. Separate multiple partitions with commas (,). | STRING | All partitions of the input table
    selectedColNames | No | The names of the columns selected from the input table for LDA. | STRING | All columns of the input table
    topicNum | Yes | The number of topics. Valid values: 2 to 500. | Positive integer | N/A
    kvDelimiter | No | The delimiter used to separate keys and values. Valid values: space, comma (,), and colon (:). | STRING | Colon (:)
    itemDelimiter | No | The delimiter used to separate key-value pairs. Valid values: space, comma (,), and colon (:). | STRING | Space
    alpha | No | The prior Dirichlet distribution parameter of P(z/d). Valid values: (0, ∞). | FLOAT | 0.1
    beta | No | The prior Dirichlet distribution parameter of P(w/z). Valid values: (0, ∞). | FLOAT | 0.01
    topicWordTableName | Yes | The name of the topic-word frequency contribution table. | STRING | N/A
    pwzTableName | No | The name of the P(w/z) output table. | STRING | The P(w/z) table is not generated.
    pzwTableName | No | The name of the P(z/w) output table. | STRING | The P(z/w) table is not generated.
    pdzTableName | No | The name of the P(d/z) output table. | STRING | The P(d/z) table is not generated.
    pzdTableName | No | The name of the P(z/d) output table. | STRING | The P(z/d) table is not generated.
    pzTableName | No | The name of the P(z) output table. | STRING | The P(z) table is not generated.
    burnInIterations | No | The number of burn-in iterations. The value of this parameter must be smaller than the value of the totalIterations parameter. | Positive integer | 100
    totalIterations | No | The total number of iterations. | Positive integer | 150
    enableSparse | No | Specifies whether the data in the input table is key-value pairs. The data can be key-value pairs or word segmentation results. Valid values: true (key-value pairs) and false (word segmentation results). | BOOL | true
    coreNum | No | The number of cores. This parameter must be used together with the memSizePerCore parameter. By default, the system calculates the number of cores based on the amount of input data. | Positive integer | -1
    memSizePerCore | No | The memory size of each core. Unit: MB. Valid values: positive integers in the range of [1024, 64 × 1024]. By default, the system automatically calculates the memory size of each node. | Positive integer | -1
    Note: In the table names and descriptions, z indicates the topic, w indicates the word, and d indicates the document.
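
The constraints in the parameter descriptions can be checked before a job is submitted. The following Python sketch is illustrative only: build_plda_command is a hypothetical helper, not part of Machine Learning Platform for AI, and the table name lda_pzd is made up. It assembles a command string in the same shape as the sample command and rejects values that violate the documented ranges. It does not quote values, so a delimiter such as a space would need extra handling.

```python
def build_plda_command(input_table, topic_num, topic_word_table,
                       burn_in_iterations=100, total_iterations=150,
                       **optional):
    """Assemble a PAI command string for PLDA (hypothetical helper)."""
    # topicNum: valid values are 2 to 500.
    if not 2 <= topic_num <= 500:
        raise ValueError("topicNum must be in the range 2 to 500")
    # Both iteration counts must be positive integers.
    if burn_in_iterations <= 0 or total_iterations <= 0:
        raise ValueError("iteration counts must be positive integers")
    # burnInIterations must be smaller than totalIterations.
    if burn_in_iterations >= total_iterations:
        raise ValueError("burnInIterations must be smaller than totalIterations")
    # alpha and beta must lie in (0, infinity) when they are supplied.
    for name in ("alpha", "beta"):
        if name in optional and not optional[name] > 0:
            raise ValueError(f"{name} must be greater than 0")

    params = {
        "inputTableName": input_table,
        "topicNum": topic_num,
        "topicWordTableName": topic_word_table,
        "burnInIterations": burn_in_iterations,
        "totalIterations": total_iterations,
    }
    params.update(optional)  # for example pzdTableName or alpha

    lines = ["pai -name PLDA", "    -project algo_public"]
    lines += [f"    -D{key}={value}" for key, value in params.items()]
    return "\n".join(lines) + ";"


# Example: request the optional P(z/d) output table as well.
command = build_plda_command("lda_input", 10, "lda_output",
                             alpha=0.1, pzdTableName="lda_pzd")
print(command)
```

Passing burn_in_iterations=200 with the default total_iterations=150 raises a ValueError, which mirrors the rule that the number of burn-in iterations must be smaller than the total number of iterations.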

Input and output settings

  • Input

    The data must be in the format of a sparse matrix. You can use the Convert Row, Column, and Value to KV Pair component to convert the data.

    Figure 1 shows the input format.
    Figure 1. Input format
    • Column 1: the ID of a document.
    • Column 2: key-value data of words and word frequencies.
  • Output

    The following tables are generated in sequence: topic-word frequency contribution table, P(w/z) table, P(z/w) table, P(d/z) table, P(z/d) table, and P(z) table.

    Figure 2 shows the output format of the topic-word frequency contribution table.
    Figure 2. Output format
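
The sparse input layout described under Input (a document ID column and a key-value column of words and word frequencies) can be illustrated with a short Python sketch. It assumes the default delimiters: a colon for kvDelimiter and a space for itemDelimiter. The function to_kv and the sample documents are hypothetical, and raw words are used as keys for readability. The exact key format that the component expects is an assumption here.

```python
from collections import Counter


def to_kv(tokens, kv_delimiter=":", item_delimiter=" "):
    """Turn a list of tokens into a 'word:count word:count' string."""
    counts = Counter(tokens)
    # Sort by word so the output is deterministic.
    return item_delimiter.join(
        f"{word}{kv_delimiter}{count}" for word, count in sorted(counts.items())
    )


# Hypothetical word segmentation results, keyed by document ID.
docs = {
    "doc1": ["apple", "banana", "apple"],
    "doc2": ["cat", "dog", "cat", "cat"],
}

# Column 1: document ID. Column 2: key-value data of words and frequencies.
rows = [(doc_id, to_kv(tokens)) for doc_id, tokens in docs.items()]
for doc_id, kv in rows:
    print(doc_id, kv, sep="\t")
```

In a real workflow, the Convert Row, Column, and Value to KV Pair component performs this conversion. The sketch only shows what the resulting rows look like.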