This topic describes the Deprecated Word Filter component provided by Machine Learning Studio.

The Deprecated Word Filter component is a preprocessing method in text analysis. It removes noise words (such as "of", "is", or "oops") from word tokenization results.

The component takes two inputs: an input table and a deprecated word table. The input table holds the tokenized text from which deprecated words are to be removed. The deprecated word table has a single column, with one deprecated word per row.
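The filtering behavior described above can be sketched in a few lines of Python. This is an illustration only, not the component's implementation: the function name `filter_noise`, the representation of rows as dictionaries, and the assumption that tokens in each column are separated by a single space are all hypothetical.

```python
def filter_noise(rows, noise_words, selected_cols, sep=" "):
    """Return copies of rows with deprecated words removed from selected_cols.

    Assumption (not from the component documentation): each selected column
    holds tokens joined by `sep`.
    """
    noise = set(noise_words)  # set lookup keeps the per-token check cheap
    filtered = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        for col in selected_cols:
            tokens = row[col].split(sep)
            new_row[col] = sep.join(t for t in tokens if t and t not in noise)
        filtered.append(new_row)
    return filtered


# Hypothetical input table (one row) and deprecated word table (one column).
rows = [{"id": 1, "words_seg1": "the price of tea is high"}]
noise_table = ["of", "is", "the"]

result = filter_noise(rows, noise_table, ["words_seg1"])
print(result)
```

Columns that are not listed in `selected_cols` (here, `id`) pass through unchanged, which mirrors the idea that only the selected columns are filtered.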

You can configure the component by using the Machine Learning Platform for AI console or a PAI command.

Configure the component

  • Machine Learning Platform for AI console
    Tab            | Parameter            | Description
    Fields Setting | Columns              | The columns to be filtered. Separate multiple columns with commas (,).
    Tuning         | Cores                | Automatically allocated.
    Tuning         | Memory Size per Core | Automatically allocated.
  • PAI command
    PAI -name FilterNoise -project algo_public \
        -DinputTableName="test_input" -DnoiseTableName="noise_input" \
        -DoutputTableName="test_output" \
        -DselectedColNames="words_seg1,words_seg2" \
        -Dlifecycle=30
    Parameter            | Required | Description                                                                  | Default value
    inputTableName       | Yes      | The name of the input table.                                                 | No default value
    inputTablePartitions | No       | The names of the partitions in the input table.                              | Full table
    noiseTableName       | Yes      | The name of the deprecated word table.                                       | No default value
    noiseTablePartitions | No       | The names of the partitions in the deprecated word table.                    | Full table
    outputTableName      | Yes      | The name of the output table.                                                | No default value
    selectedColNames     | Yes      | The columns to be filtered. Separate multiple columns with commas (,).       | No default value
    lifecycle            | No       | The lifecycle of the output table. The value must be a positive integer.     | No default value
    coreNum              | No       | The number of cores involved in computing.                                   | Automatically allocated
    memSizePerCore       | No       | The memory for each core.                                                    | Automatically allocated
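
When the command is generated from a script, the required parameters and the lifecycle constraint listed above can be checked before the command is submitted. The following Python sketch only assembles the command string; the helper name `build_filter_noise_command` is hypothetical, and how the string is submitted (for example, through a MaxCompute client) is outside its scope.

```python
def build_filter_noise_command(input_table, noise_table, output_table,
                               selected_cols, lifecycle=None,
                               project="algo_public"):
    """Assemble a FilterNoise PAI command string (hypothetical helper)."""
    if not selected_cols:
        raise ValueError("selectedColNames is required")
    params = {
        "inputTableName": input_table,
        "noiseTableName": noise_table,
        "outputTableName": output_table,
        # Multiple columns are separated with commas.
        "selectedColNames": ",".join(selected_cols),
    }
    parts = [f"PAI -name FilterNoise -project {project}"]
    parts += [f'-D{name}="{value}"' for name, value in params.items()]
    if lifecycle is not None:
        # lifecycle must be a positive integer (bool is rejected explicitly).
        if isinstance(lifecycle, bool) or not isinstance(lifecycle, int) or lifecycle <= 0:
            raise ValueError("lifecycle must be a positive integer")
        parts.append(f"-Dlifecycle={lifecycle}")
    return " ".join(parts)


command = build_filter_noise_command(
    "test_input", "noise_input", "test_output",
    ["words_seg1", "words_seg2"], lifecycle=30,
)
print(command)
```

Optional parameters such as `inputTablePartitions`, `coreNum`, and `memSizePerCore` are left out of this sketch, so the component falls back to the defaults in the table.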