You can use the LLM-Length Filter (MaxCompute) component to preprocess text data that is used to train large language models (LLMs). The component filters text samples based on the total text length, the average line length, and the maximum line length. To compute the average line length and the maximum line length, the component by default splits each sample into lines and then counts the length of each line.
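The three criteria can be illustrated with a minimal Python sketch. This is not the component's implementation. It assumes that lengths are counted in characters, that each range is inclusive, and that a sample is kept only if it passes every enabled check. The function and parameter names (`keep_sample`, `text_len`, `avg_line_len`, `max_line_len`) are hypothetical.

```python
from typing import List, Optional, Tuple

# (lower bound, upper bound); either bound may be None to leave it unchecked.
Range = Tuple[Optional[float], Optional[float]]


def line_lengths(text: str) -> List[int]:
    """Split a sample by line and return the character count of each line."""
    # An empty sample is treated as one line of length 0 (assumption).
    return [len(line) for line in text.splitlines()] or [0]


def in_range(value: float, bounds: Range) -> bool:
    """Return True if value lies within the inclusive bounds."""
    lo, hi = bounds
    return (lo is None or value >= lo) and (hi is None or value <= hi)


def keep_sample(
    text: str,
    text_len: Optional[Range] = None,
    avg_line_len: Optional[Range] = None,
    max_line_len: Optional[Range] = None,
) -> bool:
    """Keep a sample only if it passes every filter that is enabled."""
    lengths = line_lengths(text)
    checks = [
        (len(text), text_len),                        # total text length
        (sum(lengths) / len(lengths), avg_line_len),  # average line length
        (max(lengths), max_line_len),                 # longest line
    ]
    # A filter whose range is None is disabled and always passes.
    return all(bounds is None or in_range(value, bounds) for value, bounds in checks)
```

For example, `keep_sample("short line\n" + "x" * 50, max_line_len=(0, 40))` rejects the sample because its longest line has 50 characters.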
Limits
The LLM-Length Filter (MaxCompute) component supports only MaxCompute resources.
Configure the component
You can configure the parameters of the component in the Machine Learning Designer module of the Platform for AI (PAI) console. The following table describes the parameters.
Tab | Parameter | Required | Description | Default value |
Fields Setting | Select Target Column | Yes | The columns that you want to process. You can select multiple columns. | No default value |
Fields Setting | Whether to Filter with Text Length | No | | No default value |
Fields Setting | Whether to Filter with the Average Length of the Sample | No | | No default value |
Fields Setting | Whether to Filter with the Longest Line Length of the Sample. | No | | No default value |
Fields Setting | Output table lifecycle | No | The value is a positive integer. Unit: days. Default value: 28. After the table lifecycle elapses, the temporary tables generated by the component are recycled. | 28 |
Tuning | Number of CPUs per instance of map task | No | The number of CPUs for each instance of a map task. Valid values: [50,800]. | 100 |
Tuning | The memory size per instance of map task | No | The memory size of each instance of a map task. Unit: MB. Valid values: [256,12288]. | 1024 |
Tuning | The maximum size of input data for a map | No | The maximum amount of data that each instance of a map task can process. You can use this parameter to manage the input of a map. Unit: MB. Valid values: [1,Integer.MAX_VALUE]. | 256 |
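The valid ranges of the Tuning parameters can be checked before a job is submitted. The following Python sketch is a hypothetical helper and is not part of PAI or MaxCompute. The key names (`map_cpu`, `map_memory_mb`, `map_split_size_mb`) are made up for illustration. The ranges and default values come from the table above, with `Integer.MAX_VALUE` taken as 2^31 - 1.

```python
# Valid ranges (inclusive) and defaults copied from the parameter table.
# The key names are hypothetical; they are not real component settings.
TUNING_RANGES = {
    "map_cpu": (50, 800),
    "map_memory_mb": (256, 12288),
    "map_split_size_mb": (1, 2**31 - 1),  # Integer.MAX_VALUE
}
TUNING_DEFAULTS = {
    "map_cpu": 100,
    "map_memory_mb": 1024,
    "map_split_size_mb": 256,
}


def validate_tuning(**overrides: int) -> dict:
    """Merge overrides into the defaults and check each value against its range."""
    unknown = set(overrides) - set(TUNING_RANGES)
    if unknown:
        raise KeyError(f"unknown tuning parameter(s): {sorted(unknown)}")

    settings = {**TUNING_DEFAULTS, **overrides}
    for name, value in settings.items():
        lo, hi = TUNING_RANGES[name]
        if not isinstance(value, int) or not lo <= value <= hi:
            raise ValueError(f"{name}={value!r} is outside [{lo},{hi}]")
    return settings
```

For example, `validate_tuning(map_memory_mb=4096)` returns the defaults with the memory size raised to 4096 MB, and `validate_tuning(map_cpu=900)` raises `ValueError`.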
References
For more information about Machine Learning Designer, see Overview of Machine Learning Designer.