Platform For AI: LLM-Sensitive Keywords Filter (MaxCompute)

Last Updated: Jan 06, 2025

You can use the LLM-Sensitive Keywords Filter (MaxCompute) component to preprocess the text data that is used to train large language models (LLMs). The component filters out text samples that contain sensitive keywords.

Supported computing resources

MaxCompute

Algorithm description

The LLM-Sensitive Keywords Filter (MaxCompute) component checks whether each text sample contains sensitive keywords and filters out the samples that do. The component can also return the sensitive keywords that it detects. The default keyword list contains more than 12,000 sensitive keywords.
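The behavior described above can be illustrated with a minimal sketch. It assumes plain, case-sensitive substring matching, which is an assumption: the component's actual matching rules are not documented here. The function names `find_sensitive_keywords` and `filter_samples` and the sample keywords are hypothetical.

```python
from typing import Iterable, List, Tuple


def find_sensitive_keywords(text: str, keywords: Iterable[str]) -> List[str]:
    """Return the keywords that occur in `text` as substrings, sorted.

    Assumption: case-sensitive substring matching. The real component
    may normalize or tokenize text differently.
    """
    return sorted({kw for kw in keywords if kw and kw in text})


def filter_samples(
    samples: Iterable[str], keywords: Iterable[str]
) -> Tuple[List[str], List[Tuple[str, List[str]]]]:
    """Split samples into kept ones and dropped ones with their detected keywords."""
    keyword_list = list(keywords)
    kept: List[str] = []
    dropped: List[Tuple[str, List[str]]] = []
    for sample in samples:
        hits = find_sensitive_keywords(sample, keyword_list)
        if hits:
            # The sample contains at least one sensitive keyword: filter it out.
            dropped.append((sample, hits))
        else:
            kept.append(sample)
    return kept, dropped


# Hypothetical keywords and samples, for illustration only.
kept, dropped = filter_samples(
    ["a clean sentence", "this has badword inside"],
    ["badword", "other"],
)
```

Here `kept` holds the samples that pass the filter, and `dropped` pairs each filtered sample with the keywords that were found in it.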

Configure the component

On the pipeline details page in Machine Learning Designer, add the LLM-Sensitive Keywords Filter (MaxCompute) component to the pipeline and configure the parameters described in the following table.

| Tab | Parameter | Default value | Description |
| --- | --- | --- | --- |
| Fields Setting | Select Target Column | No default value | The columns that you want to process. |
| Fields Setting | Whether to Save the Sensitive Results | No default value | Specifies whether to save the detection results to the output table. If you select this option, two additional parameters specify the output columns that store the detection results. Sensitive bool value saved column name: the name of the BOOL column that indicates whether sensitive keywords are detected. Default value: is_sensitive. Sensitive words saved column name: the name of the column that stores the detected sensitive keywords. Default value: sensitive_words. |
| Fields Setting | SQL Script | No default value | The WHERE clause that specifies the filter condition. You can filter the samples based on the columns named by the Sensitive bool value saved column name and Sensitive words saved column name parameters. If you change those column names, update the WHERE clause in the SQL Script field to use the new names. Default value: where not is_sensitive. |
| Fields Setting | Sensitive Keywords File | Default sensitive keyword file | The path of the sensitive keyword file. If you leave this parameter empty, the default sensitive keyword list is used. The file must contain one sensitive keyword per line, that is, keywords separated by line feeds in the "Sensitive keyword 1\nSensitive keyword 2\n..." format. |
| Fields Setting | Output table lifecycle | 28 | The lifecycle of the output table. The value must be a positive integer. Unit: days. After the lifecycle elapses, the temporary tables generated by the component are recycled. |
| Tuning | Number of CPUs per instance of map task | 100 | The number of CPUs for each instance of a map task. Valid values: 50 to 800. |
| Tuning | The memory size per instance of map task | 1024 | The memory size of each instance of a map task. Unit: MB. Valid values: 256 to 12288. |
| Tuning | The maximum size of input data for a map | 256 | The maximum amount of data that each instance of a map task can process. You can use this parameter to control the input size of each map instance. Unit: MB. Valid values: 1 to Integer.MAX_VALUE. |
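To make the Fields Setting parameters more concrete, the following sketch mimics in plain Python how a keyword file in the line-feed-separated format might be parsed, how the is_sensitive and sensitive_words result columns might be filled, and what the default filter condition "where not is_sensitive" amounts to. It is not the component's implementation. The helper names, the `content` column, the sample keywords, and the choice to store detected keywords as a list are all assumptions made for illustration.

```python
from typing import Dict, List


def parse_keyword_file(content: str) -> List[str]:
    """Parse keyword file content (one keyword per line), skipping blank lines."""
    return [line.strip() for line in content.splitlines() if line.strip()]


def annotate_rows(
    rows: List[Dict],
    target_column: str,
    keywords: List[str],
    bool_column: str = "is_sensitive",      # default name from the parameter table
    words_column: str = "sensitive_words",  # default name from the parameter table
) -> List[Dict]:
    """Add the two result columns to a copy of each row.

    Assumption: substring matching, and detected keywords stored as a list.
    """
    annotated = []
    for row in rows:
        hits = [kw for kw in keywords if kw in row[target_column]]
        new_row = dict(row)
        new_row[bool_column] = bool(hits)
        new_row[words_column] = hits
        annotated.append(new_row)
    return annotated


def apply_default_filter(rows: List[Dict], bool_column: str = "is_sensitive") -> List[Dict]:
    """Python counterpart of the default filter condition: where not is_sensitive."""
    return [row for row in rows if not row[bool_column]]


# Hypothetical keyword file content and input rows.
keywords = parse_keyword_file("keyword_a\nkeyword_b\n")
rows = [{"content": "contains keyword_a here"}, {"content": "harmless text"}]
annotated = annotate_rows(rows, "content", keywords)
kept = apply_default_filter(annotated)
```

If the result columns are renamed through the two column name parameters, the filter has to reference the new names, which is why `bool_column` is passed to both helpers in this sketch.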