Platform for AI: LLM-LaTeX Remove Header (MaxCompute)

Last Updated: Jan 02, 2025

You can use the LLM-LaTeX Remove Header (MaxCompute) component to preprocess TeX text data that is used to train large language models (LLMs). The component removes all content that precedes the first sectioning command that matches the \<section-type>[optional-args]{name} format, such as \section{Introduction}.

Supported computing resources

MaxCompute

Algorithm

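The LLM-LaTeX Remove Header (MaxCompute) component uses the following regular expression to locate sections in a LaTeX text:

r'^(.*?)(\\\bchapter\b\*?(?:\[(.*?)\])?\{(.*?)\}|\\\bpart\b\*?(?:\[(.*?)\])?\{(.*?)\}|\\\bsection\b\*?(?:\[(.*?)\])?\{(.*?)\}|\\\bsubsection\b\*?(?:\[(.*?)\])?\{(.*?)\}|\\\bsubsubsection\b\*?(?:\[(.*?)\])?\{(.*?)\}|\\\bparagraph\b\*?(?:\[(.*?)\])?\{(.*?)\}|\\\bsubparagraph\b\*?(?:\[(.*?)\])?\{(.*?)\})'

The expression contains one match pattern for each of the seven section types (chapter, part, section, subsection, subsubsection, paragraph, and subparagraph). The patterns are separated by vertical bars (|). Each pattern accepts an optional asterisk (*) and an optional bracketed argument before the section name in braces.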

The component removes all the content before the first located section. The section line and the following content are retained. Example:

Before processing: [image]

After processing: [image]
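The removal step described above can be sketched in Python. This is a minimal illustration, not the component's actual implementation: the function name remove_header is hypothetical, the pattern is a compact form of the documented alternation, and the way the whole section line is kept is an assumption based on the statement that the section line is retained.

```python
import re

# Section types taken from the documented regular expression.
SECTION_TYPES = ["chapter", "part", "section", "subsection",
                 "subsubsection", "paragraph", "subparagraph"]

# Compact form of the documented alternation (an assumption, not the
# component's own pattern): backslash, section type, optional "*",
# optional [args], then {name}.
SECTION_RE = re.compile(
    r"\\(?:" + "|".join(SECTION_TYPES) + r")\b\*?(?:\[.*?\])?\{.*?\}"
)


def remove_header(text):
    """Return text starting at the line of the first section command.

    Returns None when no section command is found, so the caller can
    decide whether to keep or drop the sample.
    """
    match = SECTION_RE.search(text)
    if match is None:
        return None
    # Keep the whole line that contains the section command.
    line_start = text.rfind("\n", 0, match.start()) + 1
    return text[line_start:]


if __name__ == "__main__":
    sample = (
        "\\documentclass{article}\n"
        "\\title{Example}\n"
        "\\begin{document}\n"
        "\\section{Introduction}\n"
        "Body text.\n"
    )
    print(remove_header(sample))
```

Returning None for a sample without sections leaves the keep-or-drop decision to the caller, which corresponds to the Whether Remove no Header Sample option of the component.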

Configure the component

Configure the parameters of the LLM-LaTeX Remove Header (MaxCompute) component on the pipeline page of Machine Learning Designer in the Platform for AI (PAI) console. The following table describes the parameters.

| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Select Target Column | The columns that you want to process. You can select multiple columns. |
| Fields Setting | Whether Remove no Header Sample | Specifies whether to delete text samples in which no sections are found. |
| Fields Setting | Output table lifecycle | The lifecycle of the output table. The value must be a positive integer. Unit: days. Default value: 28. After the lifecycle elapses, the temporary tables generated by the component are recycled. |
| Tuning | Number of CPUs per instance of map task | The number of CPUs for each instance of a map task. Valid values: 50 to 800. Default value: 100. |
| Tuning | The memory size per instance of map task | The memory size of each instance of a map task. Valid values: 256 to 12288. Default value: 1024. Unit: MB. |
| Tuning | The maximum size of input data for a map | The maximum amount of data that each instance of a map task can process. Valid values: 1 to Integer.MAX_VALUE. Default value: 256. Unit: MB. You can use this parameter to control the size of the input data. |
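The documented value ranges can be checked before a pipeline is configured. The following Python sketch is only an illustration: the class RemoveHeaderConfig and all of its field names are hypothetical and are not part of any PAI SDK. Only the ranges and default values come from the parameter descriptions above.

```python
from dataclasses import dataclass
from typing import List

INT_MAX = 2**31 - 1  # value of Java's Integer.MAX_VALUE


@dataclass
class RemoveHeaderConfig:
    """Hypothetical container for the component parameters."""
    target_columns: List[str]        # Select Target Column
    remove_no_header_sample: bool    # Whether Remove no Header Sample
    lifecycle_days: int = 28         # Output table lifecycle (days)
    mapper_cpu: int = 100            # Number of CPUs per instance of map task
    mapper_memory_mb: int = 1024     # Memory size per instance of map task (MB)
    mapper_split_size_mb: int = 256  # Maximum size of input data for a map (MB)

    def validate(self):
        """Return a list of messages for values outside the documented ranges."""
        errors = []
        if not self.target_columns:
            errors.append("target_columns must name at least one column")
        if self.lifecycle_days < 1:
            errors.append("lifecycle_days must be a positive integer")
        if not 50 <= self.mapper_cpu <= 800:
            errors.append("mapper_cpu must be between 50 and 800")
        if not 256 <= self.mapper_memory_mb <= 12288:
            errors.append("mapper_memory_mb must be between 256 and 12288")
        if not 1 <= self.mapper_split_size_mb <= INT_MAX:
            errors.append("mapper_split_size_mb must be between 1 and Integer.MAX_VALUE")
        return errors


if __name__ == "__main__":
    config = RemoveHeaderConfig(["content"], True, mapper_cpu=900)
    for message in config.validate():
        print(message)
```

The source does not state a default for Whether Remove no Header Sample, so the sketch requires the caller to set that value explicitly.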