This topic describes the Stop Word Filter component in Designer.
The Stop Word Filter component is a preprocessing step in text analytics. It removes noise words (stop words), such as "of", "is", or "a", from tokenization results.
The Stop Word Filter component takes two inputs: an input table and a stop word table. The input table contains the text to filter. The stop word table is a single-column table where each row is a stop word.
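The filtering behavior described above can be sketched in a few lines of Python. This is only an illustration, not the component's actual implementation: it assumes that each tokenized column stores its tokens separated by single spaces, and the function name `filter_stop_words` and the sample rows are hypothetical.

```python
def filter_stop_words(segmented_text, stop_words, sep=" "):
    """Drop every token found in stop_words and keep the rest in order.

    Assumption: tokens in a column are separated by `sep` (a space here).
    """
    tokens = segmented_text.split(sep)
    kept = [t for t in tokens if t and t not in stop_words]
    return sep.join(kept)


# Stop word table: a single column, one stop word per row.
stop_word_rows = ["of", "is", "a"]
stop_words = set(stop_word_rows)

# Input table: hypothetical rows with two tokenized columns.
input_rows = [
    {"words_seg1": "the capital of France is Paris",
     "words_seg2": "a cat is a pet"},
]

# Only the selected columns are filtered; other columns pass through unchanged.
selected_cols = ["words_seg1", "words_seg2"]

output_rows = [
    {col: filter_stop_words(val, stop_words) if col in selected_cols else val
     for col, val in row.items()}
    for row in input_rows
]

print(output_rows)
```

With the sample data, `words_seg1` becomes `the capital France Paris` and `words_seg2` becomes `cat pet`. Loading the stop words into a set keeps each membership check constant-time, which matters when the stop word table is large.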
You can configure the Stop Word Filter component in Designer using the GUI or PAI commands.
Component configuration
You can configure the Stop Word Filter component in one of the following ways.
Method 1: Use the GUI
You can configure the component parameters on the workflow page in Designer.
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Column to Filter | The column to filter. Separate multiple columns with commas (,). |
| Execution Tuning | Number of cores | Automatically allocated by the system. |
| Execution Tuning | Memory size | Automatically allocated by the system. |
Method 2: Use a PAI command
You can also configure the component parameters with a PAI command, which you run from the SQL Script component. For more information, see SQL Script.
PAI -name FilterNoise -project algo_public \
-DinputTableName="test_input" -DnoiseTableName="noise_input" \
-DoutputTableName="test_output" \
-DselectedColNames="words_seg1,words_seg2" \
-Dlifecycle=30
| Parameter name | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTableName | Yes | The name of the input tokenization table. | None |
| inputTablePartitions | No | The partitions of the input tokenization table to use. | All partitions |
| noiseTableName | Yes | The name of the stop word table. | None |
| noiseTablePartitions | No | The partitions of the stop word table to use. | All partitions |
| outputTableName | Yes | The name of the output table. | None |
| selectedColNames | Yes | The columns to filter. Separate multiple columns with commas (,). | None |
| lifecycle | No | The lifecycle of the output table. The value must be a positive integer. | None |
| coreNum | No | The number of cores for the computation. | Automatically allocated by the system. |
| memSizePerCore | No | The memory size for each core. | Automatically allocated by the system. |
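If you generate the command from a script, the parameters in the table above can be assembled and checked before the command is submitted. The following sketch is one possible approach, not part of PAI itself: the helper name `build_filter_noise_command` is hypothetical, and the function only builds the command text. It does not run the command.

```python
# Parameters marked "Yes" in the Required column of the table above.
REQUIRED = ("inputTableName", "noiseTableName",
            "outputTableName", "selectedColNames")


def build_filter_noise_command(params):
    """Build a FilterNoise PAI command string from a parameter dict.

    Hypothetical helper: it validates required parameters and the
    lifecycle value, then renders each entry as -Dname=value.
    """
    missing = [name for name in REQUIRED if not params.get(name)]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))

    lifecycle = params.get("lifecycle")
    if lifecycle is not None and (not isinstance(lifecycle, int) or lifecycle <= 0):
        raise ValueError("lifecycle must be a positive integer")

    parts = ["PAI -name FilterNoise -project algo_public"]
    for name, value in params.items():
        if value is None:
            continue  # Omit optional parameters that were not set.
        if isinstance(value, (list, tuple)):
            value = ",".join(value)  # Multiple columns are comma-separated.
        # Integers are written bare; everything else is double-quoted.
        rendered = str(value) if isinstance(value, int) else '"{}"'.format(value)
        parts.append("-D{}={}".format(name, rendered))
    return " ".join(parts)


cmd = build_filter_noise_command({
    "inputTableName": "test_input",
    "noiseTableName": "noise_input",
    "outputTableName": "test_output",
    "selectedColNames": ["words_seg1", "words_seg2"],
    "lifecycle": 30,
})
print(cmd)
```

The sample call reproduces the example command shown earlier on a single line. Checking the required parameters and the lifecycle value up front surfaces mistakes before the job is submitted.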