Platform for AI: LLM-LaTeX Expand Macro (MaxCompute)

Last Updated: Jul 10, 2024

You can use the LLM-LaTeX Expand Macro (MaxCompute) component of Platform for AI (PAI) to preprocess TeX text data that is used to train large language models (LLMs). If a macro has no parameters and its name contains only letters and numbers, the component expands the macro inline by replacing the macro name with the macro definition.

Supported computing resources

MaxCompute

Algorithm description

The LLM-LaTeX Expand Macro (MaxCompute) component performs inline expansion on macros that match the following regular expressions:

Parameterless macros defined by using \newcommand

Regular expression: r'\\\bnewcommand\b\*?\{(\\[a-zA-Z0-9]+?)\}\{(.*?)\}$'

Matched macros: \newcommand{\macro_name}{macro_value} and \newcommand*{\macro_name}{macro_value}

Parameterless macros defined by using \def

Regular expression: r'\\def\s*(\\[a-zA-Z0-9]+?)\s*\{(.*?)\}$'

Matched macros: \def\macro_name{macro_value}

Note

macro_name can contain only letters and numbers, and macro_value can contain any characters.
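To make the matching rules concrete, the following minimal sketch applies the two regular expressions above to a few definition lines. It is an illustration only, not the component's implementation: the helper name find_macros and the use of re.MULTILINE (so that `$` matches at the end of each line) are assumptions.

```python
import re

# Patterns copied from this page. re.MULTILINE is an assumption so that `$`
# matches at the end of every line instead of only at the end of the text.
NEWCOMMAND = re.compile(
    r'\\\bnewcommand\b\*?\{(\\[a-zA-Z0-9]+?)\}\{(.*?)\}$', re.MULTILINE)
DEF = re.compile(r'\\def\s*(\\[a-zA-Z0-9]+?)\s*\{(.*?)\}$', re.MULTILINE)


def find_macros(tex):
    """Hypothetical helper: return {macro_name: macro_value} for matches."""
    macros = {}
    for pattern in (NEWCOMMAND, DEF):
        for name, value in pattern.findall(tex):
            macros[name] = value
    return macros


sample = "\n".join([
    r"\newcommand{\R}{\mathbb{R}}",              # matched
    r"\newcommand*{\eps}{\varepsilon}",          # matched (starred form)
    r"\def\half{\frac{1}{2}}",                   # matched
    r"\newcommand{\norm}[1]{\lVert #1 \rVert}",  # has a parameter: skipped
    r"\newcommand{\my_macro}{x}",                # underscore in name: skipped
])

for name, value in find_macros(sample).items():
    print(name, "->", value)
```

The last two definitions are skipped because one takes a parameter and the other has a character in its name that is neither a letter nor a number.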

If a macro definition matches either of the preceding regular expressions, the component replaces each occurrence of \macro_name in the text with macro_value. Example:

Before processing: [image]

After processing: [image]
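Because the example images are not reproduced here, the following sketch shows one way the described expansion could work. It is a simplified approximation, not the component's actual code. The function name expand_macros, the re.MULTILINE flag, the choice to leave definition lines in place, and the lookahead that prevents a short macro name from matching inside a longer control word are all assumptions.

```python
import re

# Patterns copied from this page; re.MULTILINE is an assumption.
NEWCOMMAND = re.compile(
    r'\\\bnewcommand\b\*?\{(\\[a-zA-Z0-9]+?)\}\{(.*?)\}$', re.MULTILINE)
DEF = re.compile(r'\\def\s*(\\[a-zA-Z0-9]+?)\s*\{(.*?)\}$', re.MULTILINE)


def expand_macros(tex):
    """Hypothetical helper: inline-expand parameterless macros in tex."""
    macros = {}
    for pattern in (NEWCOMMAND, DEF):
        for name, value in pattern.findall(tex):
            macros[name] = value
    if not macros:
        return tex

    # Try longer names first, and require that the name is not followed by
    # another letter or digit, so \R is not replaced inside \Rightarrow.
    names = sorted(macros, key=len, reverse=True)
    usage = re.compile(
        "(" + "|".join(re.escape(n) for n in names) + r")(?![a-zA-Z0-9])")

    out = []
    for line in tex.split("\n"):
        if NEWCOMMAND.search(line) or DEF.search(line):
            out.append(line)  # assumption: definition lines are kept as-is
        else:
            # A function replacement avoids backslash-escape processing.
            out.append(usage.sub(lambda m: macros[m.group(1)], line))
    return "\n".join(out)


doc = "\n".join([
    r"\newcommand{\R}{\mathbb{R}}",
    r"\def\half{\frac{1}{2}}",
    r"Let $x \in \R$ and $y = \half x$, so $x \Rightarrow y$.",
])
expanded = expand_macros(doc)
print(expanded)
```

The sketch makes a single pass, so a macro whose value itself contains another macro is not expanded recursively.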

Configure the component

To configure the component in the PAI console, perform the following steps:

1. Log on to the PAI console, go to the Visualized Modeling (Designer) page, and then open a pipeline.
2. On the pipeline page, drag the LLM-LaTeX Expand Macro (MaxCompute) component to the canvas and configure the parameters in the right-side pane.

The following table describes the parameters.

Fields Setting tab

Select Target Column: The column that you want to process. You can select multiple columns.

Output table lifecycle: The lifecycle of the output table. The value must be a positive integer. Unit: days. Default value: 28. By default, the temporary table generated by this component is recycled after 28 days.

Tuning tab

Number of CPUs per instance of map task: The number of CPUs for each instance of a map task. Valid values: 50 to 800. Default value: 100.

The memory size per instance of map task: The memory size of each instance of a map task. Valid values: 256 to 12288. Default value: 1024. Unit: MB.

The maximum size of input data for a map: The maximum amount of data that each instance of a map task can process. Valid values: 1 to Integer.MAX_VALUE. Default value: 256. Unit: MB. You can use this parameter to control the size of the input data for each map instance.
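The documented defaults and valid ranges can be summarized in a small validation sketch. This is not a PAI API: the class name ExpandMacroSettings and its field names are hypothetical, and the requirement that at least one target column is selected is an assumption.

```python
from dataclasses import dataclass
from typing import List

INT_MAX = 2**31 - 1  # value of Java's Integer.MAX_VALUE


@dataclass
class ExpandMacroSettings:
    """Hypothetical container for the parameters described on this page."""
    target_columns: List[str]
    lifecycle_days: int = 28   # Output table lifecycle (days)
    map_cpu: int = 100         # Number of CPUs per instance of map task
    map_memory_mb: int = 1024  # Memory size per instance of map task (MB)
    max_input_mb: int = 256    # Maximum size of input data for a map (MB)

    def validate(self):
        """Return a list of problems; an empty list means all values fit."""
        errors = []
        if not self.target_columns:
            # Assumption: at least one column must be selected.
            errors.append("select at least one target column")
        if self.lifecycle_days < 1:
            errors.append("lifecycle must be a positive integer")
        if not 50 <= self.map_cpu <= 800:
            errors.append("map CPU must be between 50 and 800")
        if not 256 <= self.map_memory_mb <= 12288:
            errors.append("map memory must be between 256 and 12288 MB")
        if not 1 <= self.max_input_mb <= INT_MAX:
            errors.append("max input size must be between 1 and INT_MAX MB")
        return errors


defaults = ExpandMacroSettings(target_columns=["content"])
print(defaults.validate())

too_large = ExpandMacroSettings(["content"], map_cpu=1000, map_memory_mb=128)
print(too_large.validate())
```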