This topic describes the Deprecated Word Filter component provided by Machine Learning Studio.
The Deprecated Word Filter component is a preprocessing method in text analysis. It filters out noise words (such as "of", "is", or "oops") from word tokenization results.
The component takes two inputs: an input table and a deprecated word table. The input table contains the tokenized text to be filtered, which includes deprecated words. The deprecated word table has only one column, and each row contains one deprecated word.
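The filtering behavior described above can be illustrated with a minimal Python sketch. It is not the component's implementation: the function name `filter_noise`, the row representation (a list of dicts), and the assumption that tokens are separated by single spaces are all hypothetical choices made for illustration.

```python
def filter_noise(rows, noise_words, selected_cols, sep=" "):
    """Remove noise words from the selected columns of each row.

    rows: list of dicts mapping column name -> tokenized text.
    noise_words: iterable of words to drop (one per row of the word table).
    sep: token separator; a single space is an assumption of this sketch.
    """
    noise = set(noise_words)
    result = []
    for row in rows:
        cleaned = dict(row)  # copy so unselected columns pass through unchanged
        for col in selected_cols:
            tokens = row[col].split(sep)
            cleaned[col] = sep.join(t for t in tokens if t and t not in noise)
        result.append(cleaned)
    return result


# Hypothetical input table with two tokenized columns.
rows = [{"id": 1, "words_seg1": "the weather is nice", "words_seg2": "oops end of story"}]
# Hypothetical deprecated word table: one word per row.
noise_table = ["of", "is", "oops"]

filtered = filter_noise(rows, noise_table, ["words_seg1", "words_seg2"])
print(filtered)
# [{'id': 1, 'words_seg1': 'the weather nice', 'words_seg2': 'end story'}]
```

Only the columns named in `selected_cols` are touched, which mirrors the role of the columns-to-filter setting; other columns are copied through as they are.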
You can configure the component by using the Machine Learning Platform for AI console or a PAI command.
Configure the component
- Machine Learning Platform for AI console
  | Tab | Parameter | Description |
  | --- | --- | --- |
  | Fields Setting | Columns | The columns to be filtered. Separate multiple columns with commas (,). |
  | Tuning | Cores | Automatically allocated. |
  | Tuning | Memory Size per Core | Automatically allocated. |
- PAI command
    PAI -name FilterNoise -project algo_public \
        -DinputTableName="test_input" -DnoiseTableName="noise_input" \
        -DoutputTableName="test_output" \
        -DselectedColNames="words_seg1,words_seg2" \
        -Dlifecycle=30
  | Parameter | Required | Description | Default value |
  | --- | --- | --- | --- |
  | inputTableName | Yes | The name of the input table. | No default value |
  | inputTablePartitions | No | The names of the partitions in the input table. | Full table |
  | noiseTableName | Yes | The name of the deprecated word table. | No default value |
  | noiseTablePartitions | No | The names of the partitions in the deprecated word table. | Full table |
  | outputTableName | Yes | The name of the output table. | No default value |
  | selectedColNames | Yes | The columns to be filtered. Separate multiple columns with commas (,). | No default value |
  | lifecycle | No | The lifecycle of the output table. The value must be a positive integer. | No default value |
  | coreNum | No | The number of cores involved in computing. | Automatically allocated |
  | memSizePerCore | No | The memory for each core. | Automatically allocated |
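When the command is generated by a script, it can help to check the parameters before submitting it. The following is a rough sketch only: the helper `build_filter_noise_command` is hypothetical and not part of any PAI SDK, and the quoting rule (strings in double quotes, integers bare) is an assumption inferred from the sample command. It enforces only what the parameter descriptions state: four required parameters and a positive-integer `lifecycle`.

```python
REQUIRED = ("inputTableName", "noiseTableName", "outputTableName", "selectedColNames")
OPTIONAL = ("inputTablePartitions", "noiseTablePartitions", "lifecycle",
            "coreNum", "memSizePerCore")


def build_filter_noise_command(params, project="algo_public"):
    """Return a FilterNoise PAI command string (hypothetical helper)."""
    missing = [name for name in REQUIRED if not params.get(name)]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))

    unknown = [name for name in params if name not in REQUIRED + OPTIONAL]
    if unknown:
        raise ValueError("unknown parameters: " + ", ".join(unknown))

    lifecycle = params.get("lifecycle")
    if lifecycle is not None:
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(lifecycle, bool) or not isinstance(lifecycle, int) or lifecycle <= 0:
            raise ValueError("lifecycle must be a positive integer")

    parts = ["PAI -name FilterNoise -project " + project]
    for name in REQUIRED + OPTIONAL:
        if name in params:
            value = params[name]
            # Assumed quoting rule: bare integers, double-quoted strings.
            rendered = str(value) if isinstance(value, int) else '"%s"' % value
            parts.append("-D%s=%s" % (name, rendered))
    return " ".join(parts)


cmd = build_filter_noise_command({
    "inputTableName": "test_input",
    "noiseTableName": "noise_input",
    "outputTableName": "test_output",
    "selectedColNames": "words_seg1,words_seg2",
    "lifecycle": 30,
})
print(cmd)
```

Parameters left out of the dictionary are omitted from the command, so the defaults listed for them apply.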