Platform for AI: LLM-LaTeX Expand Macro (DLC)

Last Updated: May 22, 2024

The LLM-LaTeX Expand Macro (DLC) component of Platform for AI (PAI) preprocesses TeX text data that is used to train large language models (LLMs). If a macro takes no parameters and its name contains only letters and digits, the algorithm expands the macro inline by replacing each use of the macro name with the macro definition. The input data file in Object Storage Service (OSS) must be in the JSON Lines format: each line of the file is a valid JSON object, but the file as a whole is not a valid JSON object. For more information, see Example.
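The JSON Lines requirement can be checked locally before a file is uploaded. The following is a minimal sketch, not the component's own validation logic: the helper name `read_json_lines` and the `text` field are hypothetical, and skipping blank lines is an assumption.

```python
import json


def read_json_lines(lines):
    """Parse an iterable of text lines as JSON Lines; return a list of dicts."""
    records = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            # Assumption: blank lines are skipped rather than rejected.
            continue
        record = json.loads(line)  # raises json.JSONDecodeError on invalid JSON
        if not isinstance(record, dict):
            raise ValueError(f"line {number} is not a JSON object")
        records.append(record)
    return records


# Build two sample lines; the "text" field name is hypothetical.
sample_lines = [
    json.dumps({"text": r"\def\R{\mathbb{R}}"}),
    json.dumps({"text": r"Let $x \in \R$."}),
]
records = read_json_lines(sample_lines)

# The file as a whole (all lines joined) is not a single valid JSON document.
try:
    json.loads("\n".join(sample_lines))
    whole_file_is_json = True
except json.JSONDecodeError:
    whole_file_is_json = False

print(len(records), whole_file_is_json)  # prints: 2 False
```

Building the sample lines with `json.dumps` avoids hand-escaping the backslashes that TeX text contains.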

Supported computing resources

Deep Learning Containers (DLC)

Algorithm description

The LLM-LaTeX Expand Macro (DLC) component performs inline expansion on macros that match the following regular expressions:

| Item | Regular expression | Matched macros |
| --- | --- | --- |
| Parameterless macros defined by using \newcommand | r'\\\bnewcommand\b\*?\{(\\[a-zA-Z0-9]+?)\}\{(.*?)\}$' | \newcommand{\macro_name}{macro_value} and \newcommand*{\macro_name}{macro_value} |
| Parameterless macros defined by using \def | r'\\def\s*(\\[a-zA-Z0-9]+?)\s*\{(.*?)\}$' | \def\macro_name{macro_value} |

Note

The macro_name value can contain only letters and digits, whereas macro_value can contain any characters.
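To see which definitions the two regular expressions capture, they can be run with Python's `re` module. This is a sketch under assumptions: the patterns are copied from this page, but the `re.MULTILINE` flag (so that `$` matches at the end of every line) and the helper name `collect_macros` are choices made for this illustration, not confirmed details of the component.

```python
import re

# Patterns as listed on this page; re.MULTILINE is an assumption of this sketch.
NEWCOMMAND_RE = re.compile(
    r'\\\bnewcommand\b\*?\{(\\[a-zA-Z0-9]+?)\}\{(.*?)\}$', re.MULTILINE
)
DEF_RE = re.compile(r'\\def\s*(\\[a-zA-Z0-9]+?)\s*\{(.*?)\}$', re.MULTILINE)


def collect_macros(tex):
    """Return {macro_name: macro_value} for every matched definition."""
    macros = {}
    for pattern in (NEWCOMMAND_RE, DEF_RE):
        for match in pattern.finditer(tex):
            macros[match.group(1)] = match.group(2)
    return macros


tex = "\n".join([
    r"\newcommand{\R}{\mathbb{R}}",
    r"\newcommand*{\eps}{\varepsilon}",
    r"\def\half{\frac{1}{2}}",
    r"\newcommand{\norm}[1]{\lVert #1 \rVert}",  # takes a parameter: not matched
])

macros = collect_macros(tex)
for name, value in macros.items():
    print(name, "->", value)
```

The last definition declares one parameter with `[1]`, so neither pattern matches it and it is left out of the result.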

If a macro definition matches one of the preceding regular expressions, the component replaces each occurrence of macro_name in the text with macro_value. Example:

Before processing: [Image: sample TeX text before processing]

After processing: [Image: sample TeX text after processing]
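The expansion step can be approximated in a few lines of Python. This is a hedged sketch, not the component's implementation: the function name `expand_macros` is hypothetical, `re.MULTILINE` is assumed, definition lines are deliberately left unchanged, and a use of a macro is taken to be the macro name not followed by another letter or digit.

```python
import re

# Patterns as listed on this page; re.MULTILINE is an assumption of this sketch.
PATTERNS = [
    re.compile(r'\\\bnewcommand\b\*?\{(\\[a-zA-Z0-9]+?)\}\{(.*?)\}$', re.MULTILINE),
    re.compile(r'\\def\s*(\\[a-zA-Z0-9]+?)\s*\{(.*?)\}$', re.MULTILINE),
]


def expand_macros(tex):
    """Replace uses of parameterless macros with their definitions."""
    # Step 1: collect macro_name -> macro_value from all matched definitions.
    macros = {}
    for pattern in PATTERNS:
        for match in pattern.finditer(tex):
            macros[match.group(1)] = match.group(2)

    # Step 2: substitute uses line by line.
    expanded = []
    for line in tex.split("\n"):
        if any(pattern.search(line) for pattern in PATTERNS):
            # Assumption: definition lines are kept as they are.
            expanded.append(line)
            continue
        for name, value in macros.items():
            # The lookahead stops \R from matching inside \Rightarrow.
            usage = re.compile(re.escape(name) + r'(?![a-zA-Z0-9])')
            # A function replacement avoids backslash-escape processing.
            line = usage.sub(lambda _match, v=value: v, line)
        expanded.append(line)
    return "\n".join(expanded)


before = "\n".join([
    r"\newcommand{\R}{\mathbb{R}}",
    r"\def\eps{\varepsilon}",
    r"Let $x \in \R$ and $\eps > 0$.",
    r"\Rightarrow is not a use of \R.",
])
after = expand_macros(before)
print(after)
```

In the output, `\R` and `\eps` are replaced in the two body lines, while `\Rightarrow` is untouched because the character after `\R` is a letter.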

Configure the component

On the pipeline page of Machine Learning Designer, configure the parameters of the LLM-LaTeX Expand Macro (DLC) component.

| Tab | Parameter | Required | Description | Default value |
| --- | --- | --- | --- | --- |
| Fields Setting | Target Process Field | Yes | The name of the field that you want to process. | N/A |
| Fields Setting | OSS Directory for Saving OutputData | No | The OSS directory in which the generated data is stored. If you do not specify this parameter, the default path of the workspace is used. | N/A |
| Tuning | Number of Processes | No | The number of processes. | 8 |
| Tuning | Select Resource Group: Public Resource Group | No | The instance type (CPU or GPU), number of instances, and virtual private cloud (VPC) that you want to use. | N/A |
| Tuning | Select Resource Group: Dedicated resource group | No | The number of vCPUs, memory, shared memory, number of GPUs, and number of instances that you want to use. | N/A |
| Tuning | Maximum Running Duration | No | The maximum period of time for which the component can run. If this period of time is exceeded, the job is terminated. | N/A |
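The role of Target Process Field can be pictured as follows: only the named field of each JSON Lines record is transformed, and the other fields pass through. This is an illustrative sketch only. The function `process_records`, the field names `id` and `text`, and the decision to leave records without a string-valued target field unchanged are all assumptions, and the stand-in `transform` is a simple string replacement rather than the component's macro expansion.

```python
import json


def process_records(lines, target_field, transform):
    """Apply transform to target_field of each JSON Lines record."""
    output = []
    for line in lines:
        record = json.loads(line)
        # Assumption: records without a string-valued target field pass through.
        if isinstance(record.get(target_field), str):
            record[target_field] = transform(record[target_field])
        output.append(json.dumps(record, ensure_ascii=False))
    return output


# Hypothetical input: one record with an "id" field and a "text" field.
lines = [json.dumps({"id": 1, "text": r"x \in \R"})]

# Stand-in transformation for illustration only.
result = process_records(
    lines, "text", lambda s: s.replace(r"\R", r"\mathbb{R}")
)
print(result[0])
```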