Platform for AI: LLM-Length Filter (MaxCompute)

Last Updated: May 31, 2024

You can use the LLM-Length Filter (MaxCompute) component to preprocess text data that is used to train large language models (LLMs). The component filters text samples based on three metrics: the text length, the average line length, and the maximum line length. By default, when samples are filtered based on the average line length or the maximum line length, the component splits each text sample by line and then measures the length of each line.
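
The three metrics can be illustrated with a minimal Python sketch. It is not the component's implementation: the function name `line_length_stats` is hypothetical, and the sketch assumes that lengths are counted in characters and that lines are separated by standard newline characters, neither of which is stated on this page.

```python
from typing import Tuple


def line_length_stats(text: str) -> Tuple[int, float, int]:
    """Return (text length, average line length, maximum line length).

    Assumption: lengths are measured in characters; the unit that the
    component actually uses is not documented here.
    """
    # Split the sample by line; treat an empty sample as one empty line
    # so that the average is always defined.
    lines = text.splitlines() or [""]
    lengths = [len(line) for line in lines]
    return len(text), sum(lengths) / len(lengths), max(lengths)


sample = "short line\na much longer second line\nend"
stats = line_length_stats(sample)
print(stats)
```

In this sketch the text length includes the newline characters, while the per-line lengths do not.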

Limits

The LLM-Length Filter (MaxCompute) component supports only MaxCompute resources.

Configure the component

You can configure the parameters of the component in the Machine Learning Designer module of the Platform for AI (PAI) console. The following table describes the parameters.

| Tab | Parameter | Required | Description | Default value |
| --- | --- | --- | --- | --- |
| Fields Setting | Select Target Column | Yes | The columns that you want to process. You can select multiple columns. | No default value |
| Fields Setting | Whether to Filter with Text Length | No | • Text Separator: the delimiter that is used to split a text sample into a list. After the text sample is split, the component calculates the length of the list. By default, this parameter is left empty. In this case, the component directly calculates the length of a text sample without performing splitting. Enclose the delimiter in double quotation marks ("). • Minimum Length: If the calculated length is less than the value of this parameter, the text sample is filtered out. • Maximal Length: If the calculated length is greater than the value of this parameter, the text sample is filtered out. | No default value |
| Fields Setting | Whether to Filter with the Average Length of the Sample | No | • Minimum average length: If the calculated average length is less than the value of this parameter, the text sample is filtered out. • Maximal average length: If the calculated average length is greater than the value of this parameter, the text sample is filtered out. | No default value |
| Fields Setting | Whether to Filter with the Longest Line Length of the Sample | No | • Minimum length of the Longest Line: If the calculated maximum length is less than the value of this parameter, the text sample is filtered out. • Maximal length of the Longest Line: If the calculated maximum length is greater than the value of this parameter, the text sample is filtered out. | No default value |
| Fields Setting | Output table lifecycle | No | The value is a positive integer. Unit: days. Default value: 28. After the table lifecycle elapses, the temporary tables generated by the component are recycled. | 28 |
| Tuning | Number of CPUs per instance of map task | No | The number of CPUs for each instance of a map task. Valid values: [50,800]. | 100 |
| Tuning | The memory size per instance of map task | No | The memory size of each instance of a map task. Unit: MB. Valid values: [256,12288]. | 1024 |
| Tuning | The maximum size of input data for a map | No | The maximum amount of data that each instance of a map task can process. You can use this parameter to manage the input of a map. Unit: MB. Valid values: [1,Integer.MAX_VALUE]. | 256 |
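
The text length rule described in the table can be sketched in Python as follows. This is an illustrative sketch only, not the component's code: the names `text_length`, `within`, and `keep_sample` are hypothetical, and the sketch assumes that an unset threshold means no limit on that side. In line with the parameter descriptions, a sample is dropped only when its length is strictly less than the minimum or strictly greater than the maximum, so values equal to a threshold are kept.

```python
from typing import Optional


def text_length(text: str, separator: Optional[str] = None) -> int:
    """Length of a sample, mirroring the Text Separator behavior."""
    # Empty separator: measure the text directly, without splitting.
    if not separator:
        return len(text)
    # Otherwise split the text into a list and count its items.
    return len(text.split(separator))


def within(value: float, minimum: Optional[float], maximum: Optional[float]) -> bool:
    """True if value is not below minimum and not above maximum."""
    # Assumption: a threshold left as None imposes no limit.
    if minimum is not None and value < minimum:
        return False
    if maximum is not None and value > maximum:
        return False
    return True


def keep_sample(text: str, separator: Optional[str] = None,
                min_len: Optional[int] = None,
                max_len: Optional[int] = None) -> bool:
    """Decide whether a sample passes the text length filter."""
    return within(text_length(text, separator), min_len, max_len)


samples = ["a b c", "a b c d e f g h", "a"]
kept = [s for s in samples if keep_sample(s, separator=" ", min_len=2, max_len=5)]
print(kept)
```

With a space as the separator, the three samples have lengths 3, 8, and 1, so only the first falls within the range of 2 to 5. The same `within` check would apply to the average line length and the longest line length, using the corresponding thresholds.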

References

For more information about Machine Learning Designer, see Overview of Machine Learning Designer.