This topic describes the Deprecated Word Filter component provided by Machine Learning Designer (formerly known as Machine Learning Studio).

The Deprecated Word Filter component is a preprocessing method in text analysis. This component is used to filter noise, such as "of", "is", or "oops", in word tokenization results.

The component takes two inputs: an input table and a deprecated word table. The input table contains the word tokenization results from which you want to remove deprecated words. The deprecated word table has only one column, and each row contains one deprecated word.
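The filtering idea can be illustrated with a minimal Python sketch. This is not the component's implementation. It assumes that each cell of a filtered column holds tokens separated by spaces, and the names `filter_noise`, `noise_words`, and `rows` are hypothetical.

```python
def filter_noise(segmented_text, noise_words):
    """Return the text with every token found in noise_words removed.

    Assumption: tokens are separated by whitespace. The actual delimiter
    used by the component is not stated in this topic.
    """
    kept = [token for token in segmented_text.split() if token not in noise_words]
    return " ".join(kept)


# Stands in for the single-column deprecated word table (one word per row).
noise_words = {"of", "is", "oops"}

# Stands in for one tokenized column of the input table.
rows = ["the color of the sky is blue", "oops that is wrong"]

filtered = [filter_noise(row, noise_words) for row in rows]
print(filtered)
```

A set is used for the deprecated words so that each membership check takes constant time regardless of how many words the table holds.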

You can configure the component by using the Machine Learning Platform for AI (PAI) console or a PAI command.

Configure the component

You can use one of the following methods to configure the Deprecated Word Filter component.

Method 1: Configure the component on the pipeline page

You can configure the parameters of the Deprecated Word Filter component on the pipeline page of Machine Learning Designer in the PAI console. The following table describes the parameters.
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Columns to Filter | The columns to be filtered. Separate multiple columns with commas (,). |
| Tuning | Cores | The number of cores. By default, the system determines the value. |
| Tuning | Memory Size | The memory size of each core. By default, the system determines the value. |

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
PAI -name FilterNoise -project algo_public \
    -DinputTableName="test_input" -DnoiseTableName="noise_input" \
    -DoutputTableName="test_output" \
    -DselectedColNames="words_seg1,words_seg2" \
    -Dlifecycle=30
| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTableName | Yes | The name of the input table. | No default value |
| inputTablePartitions | No | The names of the partitions in the input table. | All partitions |
| noiseTableName | Yes | The name of the deprecated word table. | No default value |
| noiseTablePartitions | No | The names of the partitions in the deprecated word table. | All partitions |
| outputTableName | Yes | The name of the output table. | No default value |
| selectedColNames | Yes | The columns to be filtered. Separate multiple columns with commas (,). | No default value |
| lifecycle | No | The lifecycle of the output table. The value must be a positive integer. | No default value |
| coreNum | No | The number of cores that are used in computing. | Determined by the system |
| memSizePerCore | No | The memory size of each core. | Determined by the system |
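If you generate the command from a script, the required and optional parameters in the table can be checked before the command is submitted. The following Python sketch only assembles the command text. It does not call PAI, and the helper name `build_filter_noise_command` is hypothetical rather than part of any PAI SDK.

```python
# Parameter names are taken from the table above.
REQUIRED = ("inputTableName", "noiseTableName", "outputTableName", "selectedColNames")
KNOWN = (
    "inputTableName", "inputTablePartitions",
    "noiseTableName", "noiseTablePartitions",
    "outputTableName", "selectedColNames",
    "lifecycle", "coreNum", "memSizePerCore",
)


def build_filter_noise_command(params):
    """Assemble a FilterNoise PAI command string from a dict of parameters."""
    unknown = [key for key in params if key not in KNOWN]
    if unknown:
        raise ValueError(f"unknown parameters: {unknown}")
    missing = [key for key in REQUIRED if not params.get(key)]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")

    lifecycle = params.get("lifecycle")
    if lifecycle is not None:
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(lifecycle, bool) or not isinstance(lifecycle, int) or lifecycle <= 0:
            raise ValueError("lifecycle must be a positive integer")

    parts = ["PAI -name FilterNoise -project algo_public"]
    for key in KNOWN:  # fixed order keeps the output deterministic
        if key not in params:
            continue
        value = params[key]
        if isinstance(value, int):
            parts.append(f"-D{key}={value}")
        else:
            parts.append(f'-D{key}="{value}"')
    return " ".join(parts)


# Same values as the sample command in this topic.
command = build_filter_noise_command({
    "inputTableName": "test_input",
    "noiseTableName": "noise_input",
    "outputTableName": "test_output",
    "selectedColNames": "words_seg1,words_seg2",
    "lifecycle": 30,
})
print(command)
```

Optional parameters that are omitted are left out of the command so that the defaults listed in the table apply.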