The LLM-Special Characters Ratio Filter (MaxCompute) component filters text samples based on the ratio of special characters to the length of the text. You can use the component when you preprocess text data for large language models (LLMs).
Limits
The LLM-Special Characters Ratio Filter (MaxCompute) component supports only MaxCompute resources.
Algorithm
The algorithm traverses each character in the text and calculates the ratio of the number of special characters to the length of the text.
Special characters include the following types: punctuation (string.punctuation), digits (string.digits), whitespace characters (string.whitespace), emojis, and other special characters.
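The calculation described above can be sketched in Python as follows. This is a minimal illustration, not the component's actual implementation: the function names `is_special` and `special_char_ratio` are hypothetical, the use of Unicode symbol categories to approximate "emojis and other special characters" is an assumption, and so is returning 0.0 for empty text.

```python
import string
import unicodedata

# Character classes named in the documentation: punctuation, digits, whitespace.
_ASCII_SPECIAL = set(string.punctuation) | set(string.digits) | set(string.whitespace)


def is_special(ch: str) -> bool:
    """Return True if a single character counts as special in this sketch."""
    if ch in _ASCII_SPECIAL:
        return True
    # Assumption: Unicode symbol categories (Sm, Sc, Sk, So) stand in for
    # "emojis and other special characters". The component's real character
    # set is not documented here and may differ.
    return unicodedata.category(ch).startswith("S")


def special_char_ratio(text: str) -> float:
    """Ratio of special characters to the total length of the text."""
    if not text:
        # Assumption: empty text is given a ratio of 0.0 to avoid division by zero.
        return 0.0
    special_count = sum(1 for ch in text if is_special(ch))
    return special_count / len(text)
```

For example, under these assumptions the text `a1!` contains two special characters out of three, so its ratio is 2/3.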
Configure the component
You can configure the parameters of the LLM-Special Characters Ratio Filter (MaxCompute) component in Machine Learning Designer. The following table describes the parameters.
Tab | Parameter | Required | Description | Default value |
Fields Setting | Select Target Column | Yes | The columns that you want to process. You can select multiple columns. | No default value |
Fields Setting | Minimum Ratio | No | If the ratio of the number of special characters to the length of the text is smaller than this value, the text is filtered out. | 0 |
Fields Setting | Maximum Ratio | Yes | If the ratio of the number of special characters to the length of the text is greater than this value, the text is filtered out. | No default value |
Fields Setting | Output table lifecycle | No | The lifecycle of the temporary tables generated by the component. The value is a positive integer. Unit: days. After the lifecycle elapses, the temporary tables are recycled. | 28 |
Tuning | Number of CPUs per instance of map task | No | The number of CPUs for each instance of a map task. Valid values: 50 to 800. | 100 |
Tuning | The memory size per instance of map task | No | The memory size of each instance of a map task. Unit: MB. Valid values: 256 to 12288. | 1024 |
Tuning | The maximum size of input data for a map | No | The maximum amount of data that each instance of a map task can process. You can use this parameter to manage the input of a map. Unit: MB. Valid values: 1 to Integer.MAX_VALUE. | 256 |
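To illustrate how the Minimum Ratio and Maximum Ratio parameters interact, the following Python sketch keeps a sample only when its ratio lies within the closed range between the two values. The helper names (`keep_sample`, `filter_rows`) and the sample rows are hypothetical, and the ratio function is simplified to the three `string` constants. The rule that a row is kept only when every selected column passes is an assumption, because the behavior for multiple target columns is not specified here.

```python
import string

# Simplified special-character set for this sketch (no emoji handling).
SPECIAL = set(string.punctuation + string.digits + string.whitespace)


def special_char_ratio(text: str) -> float:
    """Ratio of special characters to text length; 0.0 for empty text (assumption)."""
    if not text:
        return 0.0
    return sum(1 for ch in text if ch in SPECIAL) / len(text)


def keep_sample(text: str, max_ratio: float, min_ratio: float = 0.0) -> bool:
    """Filter out text whose ratio is below min_ratio or above max_ratio.

    Values equal to either bound are kept, matching the "smaller than" and
    "greater than" wording of the parameter descriptions.
    """
    ratio = special_char_ratio(text)
    return min_ratio <= ratio <= max_ratio


def filter_rows(rows, target_columns, max_ratio, min_ratio=0.0):
    """Keep rows in which every target column passes (assumed semantics)."""
    return [
        row
        for row in rows
        if all(keep_sample(row[col], max_ratio, min_ratio) for col in target_columns)
    ]


if __name__ == "__main__":
    rows = [
        {"text": "hello world"},  # 1 of 11 characters is special
        {"text": "12345!!!"},     # every character is special
        {"text": "abc"},          # no special characters
    ]
    for row in filter_rows(rows, ["text"], max_ratio=0.25):
        print(row["text"])
```

With `max_ratio=0.25` and the default minimum of 0, the sketch keeps `hello world` and `abc` and drops `12345!!!`. Raising `min_ratio` to 0.05 would also drop `abc`.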
References
For more information about Machine Learning Designer, see Overview of Machine Learning Designer.