You can use the LLM-Sensitive Keywords Filter (MaxCompute) component to preprocess the text data that is used to train large language models (LLMs). The component filters out text samples that contain sensitive keywords.
Supported computing resources
MaxCompute
Algorithm description
The LLM-Sensitive Keywords Filter (MaxCompute) component checks whether each text sample contains sensitive keywords and filters out the samples that do. The component can also return the sensitive keywords that it detects. The default keyword list contains more than 12,000 sensitive keywords.
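The check described above can be sketched in a few lines of Python. This is an illustration only: it assumes plain substring matching, which may differ from how the component actually matches keywords, and the function names, keywords, and samples are hypothetical.

```python
from typing import Iterable, List, Tuple


def find_sensitive_keywords(text: str, keywords: Iterable[str]) -> List[str]:
    """Return the keywords that occur in text (assumption: substring match)."""
    return [kw for kw in keywords if kw and kw in text]


def filter_samples(
    samples: Iterable[str], keywords: Iterable[str]
) -> Tuple[List[str], List[Tuple[str, List[str]]]]:
    """Split samples into kept samples and (sample, detected keywords) pairs."""
    keyword_list = list(keywords)
    kept, flagged = [], []
    for sample in samples:
        hits = find_sensitive_keywords(sample, keyword_list)
        if hits:
            # The sample is filtered out; the detected keywords are recorded.
            flagged.append((sample, hits))
        else:
            kept.append(sample)
    return kept, flagged


# Hypothetical keywords and samples, for illustration only.
kept, flagged = filter_samples(
    ["a harmless sentence", "this one mentions badword"],
    ["badword", "otherword"],
)
```

Here `kept` holds the samples that pass the filter, and `flagged` pairs each removed sample with the keywords found in it, which mirrors the component's option to return the detected keywords.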
Configure the component
On the pipeline details page in Machine Learning Designer, add the LLM-Sensitive Keywords Filter (MaxCompute) component to the pipeline and configure the parameters described in the following table.
Tab | Parameter | Default value | Description |
Fields Setting | Select Target Column | No default value | The columns that you want to process. |
Fields Setting | Whether to Save the Sensitive Results | No default value | Specifies whether to save the detection results to the output table. If you select this option, use the Sensitive bool value saved column name and Sensitive words saved column name parameters to specify the output table columns that store the detection results. |
Fields Setting | SQL Script | No default value | The WHERE clause that specifies the filter condition. You can filter the samples based on the columns specified by the Sensitive bool value saved column name and Sensitive words saved column name parameters. |
Fields Setting | Sensitive Keywords File | Default sensitive keyword file | The path of the sensitive keyword file. If you leave this parameter empty, the default sensitive keyword list is used. The file must contain one sensitive keyword per line, in the "Sensitive keyword 1\nSensitive keyword 2\n..." format. |
Fields Setting | Output table lifecycle | 28 | The lifecycle of the output table. The value must be a positive integer. Unit: days. After the lifecycle elapses, the temporary tables generated by the component are recycled. |
Tuning | Number of CPUs per instance of map task | 100 | The number of CPUs for each instance of a map task. Valid values: 50 to 800. |
Tuning | The memory size per instance of map task | 1024 | The memory size of each instance of a map task. Unit: MB. Valid values: 256 to 12288. |
Tuning | The maximum size of input data for a map | 256 | The maximum amount of data that each instance of a map task can process. Use this parameter to control the input size of each map instance. Unit: MB. Valid values: 1 to Integer.MAX_VALUE. |
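A custom file for the Sensitive Keywords File parameter can be produced and checked with a short script. The following is a sketch only: the file name `sensitive_keywords.txt` and the helper names are hypothetical, UTF-8 encoding is an assumption, and placing the file at a path that the component can read is not shown.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def write_keyword_file(path: Path, keywords: Iterable[str]) -> None:
    """Write one keyword per line, separated by line feeds."""
    cleaned = [kw.strip() for kw in keywords if kw.strip()]
    # newline="\n" keeps line feeds as-is on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write("\n".join(cleaned) + "\n")


def read_keyword_file(path: Path) -> List[str]:
    """Read the file back and drop empty lines."""
    return [line for line in path.read_text(encoding="utf-8").split("\n") if line]


# Hypothetical keywords, written to a temporary directory for illustration.
with tempfile.TemporaryDirectory() as tmp:
    keyword_path = Path(tmp) / "sensitive_keywords.txt"
    write_keyword_file(keyword_path, ["keyword one", " keyword two ", ""])
    loaded = read_keyword_file(keyword_path)
```

Stripping whitespace and dropping empty entries before writing avoids blank lines in the file, which could otherwise be read as empty keywords.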