You can use the LLM-Count Filter (MaxCompute) component to preprocess the text data that is used to train large language models (LLMs). The component filters text samples based on metrics related to letters, digits, or delimiters.
Limits
The LLM-Count Filter (MaxCompute) component supports only MaxCompute resources.
Algorithm
The LLM-Count Filter (MaxCompute) component filters text samples based on the following metrics:
The number of digits or the ratio of digits to total characters
If you specify a delimiter, the component splits the text samples into word lists and calculates the metric values by words.
The number of letters or the ratio of letters to total characters
If you specify a delimiter, the component splits the text samples into word lists and calculates the metric values by words.
The number of alphanumeric characters or the ratio of alphanumeric characters to total characters
If you specify a delimiter, the component splits the text samples into word lists and calculates the metric values by words.
The ratio of letters to total tokens
The component uses the pythia-6.9b-deduped model to split a text sample into tokens and divides the number of letters by the total number of tokens to calculate the ratio.
The number of delimiters
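The metrics above can be approximated in a few lines of Python. The following is a minimal sketch, not the component's implementation. All function names are hypothetical. It assumes that Python's Unicode-aware `str.isdigit`, `str.isalpha`, and `str.isalnum` stand in for the component's definitions of digits, letters, and alphanumeric characters, and that "by words" means a word is counted only if the whole word satisfies the predicate. The `tokenize` parameter is a stand-in for the pythia-6.9b-deduped tokenizer, which this sketch does not load.

```python
from typing import Callable, List, Optional, Tuple


def _units(text: str, separator: Optional[str]) -> List[str]:
    """Return the units to count: characters, or words if a separator is given."""
    if not separator:
        return list(text)
    # Assumption: empty fragments between consecutive separators are dropped.
    return [word for word in text.split(separator) if word]


def count_and_ratio(
    text: str,
    predicate: Callable[[str], bool],
    separator: Optional[str] = None,
) -> Tuple[int, float]:
    """Count the units that satisfy `predicate` and their share of all units."""
    units = _units(text, separator)
    matched = sum(1 for unit in units if predicate(unit))
    ratio = matched / len(units) if units else 0.0
    return matched, ratio


def separator_count(text: str, separator: str) -> int:
    """Count non-overlapping occurrences of the delimiter in the text."""
    return text.count(separator)


def alpha_token_ratio(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Divide the number of letters by the number of tokens.

    `tokenize` is a placeholder for the model tokenizer (assumption).
    """
    tokens = tokenize(text)
    letters = sum(1 for ch in text if ch.isalpha())
    return letters / len(tokens) if tokens else 0.0
```

For example, `count_and_ratio("abc123", str.isdigit)` works on characters, whereas `count_and_ratio("room 42 is open", str.isdigit, " ")` works on the four words produced by the split. Passing `str.split` as `tokenize` gives a rough whitespace approximation for quick experiments.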
Configure the component
You can configure the parameters of the LLM-Count Filter (MaxCompute) component in the Machine Learning Designer module of the Platform for AI (PAI) console. The following table describes the parameters.
| Tab | Parameter | Required | Description | Default value |
| Fields Setting | Select Target Column | Yes | The column that you want to process. You can select multiple columns. | No default value |
| | Text Separator | No | The delimiter that is used to split a text sample into word lists. After the split, the metric value is calculated by words. If you leave this parameter empty, the component calculates the metric value by characters. Enclose the delimiter in double quotation marks (""). | " " |
| | Whether to Filter with Numeric Count or Ratio | No | | No default value |
| | Whether to Filter with Alpha Count or Ratio | No | | No default value |
| | Whether to Filter with AlphaNumeric Count or Ratio | No | | No default value |
| | Whether to Filter with the Ratio of the Number of alpha chars to the Number of Text Tokens | No | | No default value |
| | Whether to Filter with Separator Count | No | | No default value |
| | Output table lifecycle | No | The lifecycle of the output table. The value is a positive integer. Unit: days. After the lifecycle elapses, the temporary tables generated by the component are recycled. | 28 |
| Tuning | Number of CPUs per instance of map task | No | The number of CPUs for each instance of a map task. Valid values: [50,800]. | 100 |
| | The memory size per instance of map task | No | The memory size of each instance of a map task. Unit: MB. Valid values: [256,12288]. | 1024 |
| | The maximum size of input data for a map | No | The maximum amount of data that each instance of a map task can process. You can use this parameter to manage the input of a map. Unit: MB. Valid values: [1,Integer.MAX_VALUE]. | 256 |
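Before you enter values on the Tuning tab, you can check them against the documented ranges. The following is a minimal sketch under stated assumptions: the key names `map_cpu`, `map_memory_mb`, and `map_split_size_mb` and the `validate_tuning` function are hypothetical and are not part of the component. Only the ranges and default values come from the table above.

```python
# Valid ranges and defaults copied from the parameter table.
# The key names are hypothetical labels chosen for this sketch.
TUNING_RANGES = {
    "map_cpu": (50, 800),
    "map_memory_mb": (256, 12288),
    "map_split_size_mb": (1, 2**31 - 1),  # Integer.MAX_VALUE in Java
}
TUNING_DEFAULTS = {
    "map_cpu": 100,
    "map_memory_mb": 1024,
    "map_split_size_mb": 256,
}


def validate_tuning(overrides=None):
    """Merge overrides into the defaults and check each value against its range."""
    settings = {**TUNING_DEFAULTS, **(overrides or {})}
    for name, value in settings.items():
        if name not in TUNING_RANGES:
            raise KeyError(f"unknown tuning parameter: {name}")
        low, high = TUNING_RANGES[name]
        if not isinstance(value, int) or not low <= value <= high:
            raise ValueError(f"{name}={value!r} is outside [{low},{high}]")
    return settings
```

Calling `validate_tuning({"map_memory_mb": 2048})` returns the merged settings, whereas a value such as `128` for `map_memory_mb` raises a `ValueError` because it is below the documented minimum of 256.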
References
For more information about Machine Learning Designer, see Overview of Machine Learning Designer.