The LLM-LaTeX Remove Header (DLC) component processes TeX text data by removing all content before the first sectioning command that matches the \<section-type>[optional-args]{name} format. The input data file in Object Storage Service (OSS) must be in the JSON Lines format: each line in the file is a valid JSON object, but the file as a whole is not a valid JSON object.
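To illustrate the JSON Lines requirement, the following minimal sketch builds a two-line input and parses it line by line. The field name `text` and the sample contents are hypothetical and chosen only for illustration; the component does not prescribe them.

```python
import json

# Hypothetical samples; "text" is an assumed field name, not one required by the component.
records = [
    {"text": "\\documentclass{article}\n\\section{Intro}\nHello."},
    {"text": "Plain notes without any sectioning command."},
]

# JSON Lines: one JSON object per line. json.dumps escapes embedded newlines,
# so each record stays on a single physical line.
jsonl = "\n".join(json.dumps(record) for record in records)


def parse_jsonl(data):
    """Parse JSON Lines text into a list of objects, skipping blank lines."""
    return [json.loads(line) for line in data.split("\n") if line.strip()]


parsed = parse_jsonl(jsonl)
```

Passing the whole string to `json.loads` in one call fails, because two objects in a row are not a single valid JSON document.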
Supported computing resources
Algorithm
This component uses the following regular expression to locate sections in a LaTeX text: r'^(.*?)(\\\bchapter\b\*?(?:\[(.*?)\])?\{(.*?)\}|\\\bpart\b\*?(?:\[(.*?)\])?\{(.*?)\}|\\\bsection\b\*?(?:\[(.*?)\])?\{(.*?)\}|\\\bsubsection\b\*?(?:\[(.*?)\])?\{(.*?)\}|\\\bsubsubsection\b\*?(?:\[(.*?)\])?\{(.*?)\}|\\\bparagraph\b\*?(?:\[(.*?)\])?\{(.*?)\}|\\\bsubparagraph\b\*?(?:\[(.*?)\])?\{(.*?)\})'. Multiple match patterns are separated by vertical bars (|).
The component removes all the content before the first located section. The section line and the following content are retained. Example:
Before processing | After processing
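The behavior described above can be sketched in Python as follows. This is an illustration, not the component's actual code: the pattern is a simplified equivalent of the documented regular expression (same seven commands, without the capture groups), and the function name `remove_header`, the `None` return value, and the sample document are assumptions. The sketch cuts at the start of the first sectioning command.

```python
import re

# Sectioning commands listed in the documented pattern.
SECTION_COMMANDS = (
    "chapter", "part", "section", "subsection",
    "subsubsection", "paragraph", "subparagraph",
)

# Simplified equivalent: backslash, command name, optional '*',
# optional [short title], then {title}.
SECTION_RE = re.compile(
    r"\\(?:" + "|".join(SECTION_COMMANDS) + r")\b\*?(?:\[.*?\])?\{.*?\}"
)


def remove_header(tex):
    """Return tex from the first sectioning command onward, or None if none is found."""
    match = SECTION_RE.search(tex)
    if match is None:
        return None
    return tex[match.start():]


# Hypothetical input: a preamble followed by one section.
sample = (
    "\\documentclass{article}\n"
    "\\title{Demo}\n"
    "\\begin{document}\n"
    "\\maketitle\n"
    "\\section{Introduction}\n"
    "Body text.\n"
)
result = remove_header(sample)
```

Here `result` starts at `\section{Introduction}` and keeps the body text that follows. The word boundary after the command name prevents commands such as `\partial` from being treated as `\part`.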
Configure the component
Configure the parameters of the LLM-LaTeX Remove Header (DLC) component on the pipeline page of Machine Learning Designer in the Platform for AI (PAI) console. The following table describes the parameters.
Tab | Parameter | Required | Description | Default value
Fields Setting | Target Process Field | Yes | The name of the field that you want to process. | No default value
Fields Setting | Whether Remove no Header Sample | No | Specifies whether to delete text samples in which no sections are found. | Selected
Fields Setting | OSS Directory for Saving OutputData | No | The OSS directory in which the generated data is stored. If you do not specify this parameter, the default path of the workspace is used. | No default value
Tuning | Number of Processes | No | The number of processes. | 8
Tuning | Select Resource Group: Public Resource Group | No | The instance type (CPU or GPU), number of instances, and a virtual private cloud (VPC) that you want to use. | No default value
Tuning | Select Resource Group: Dedicated resource group | No | The number of vCPUs, memory, shared memory, number of GPUs, and number of instances that you want to use. | No default value
Tuning | Maximum Running Duration (seconds) | No | The maximum period of time the component can run. If this period of time is exceeded, the job is terminated. | No default value
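As a rough sketch of how Target Process Field and Whether Remove no Header Sample might interact, the following Python function processes JSON Lines input one line at a time. The function and argument names are hypothetical, the pattern is a compact stand-in for the documented regular expression, and keeping an unmatched sample unchanged when the option is cleared is an assumption. Parallelism (Number of Processes) and OSS access are omitted.

```python
import json
import re

# Compact stand-in for the documented pattern (same seven commands).
_SECTION = re.compile(
    r"\\(?:chapter|part|(?:sub){0,2}section|(?:sub)?paragraph)"
    r"\b\*?(?:\[.*?\])?\{.*?\}"
)


def process_jsonl(lines, target_field, remove_no_header=True):
    """Yield processed JSON lines for an iterable of JSON Lines strings."""
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        sample = json.loads(line)
        match = _SECTION.search(sample[target_field])
        if match is None:
            if remove_no_header:
                continue  # drop samples in which no section is found
            # Assumption: otherwise the sample is kept unchanged.
        else:
            # Keep the text from the first sectioning command onward.
            sample[target_field] = sample[target_field][match.start():]
        yield json.dumps(sample)


# Hypothetical input with an assumed "text" field.
rows = [
    json.dumps({"text": "preamble\n\\section{A}\nbody"}),
    json.dumps({"text": "no sectioning command"}),
]
kept = [json.loads(row) for row in process_jsonl(rows, "text")]
```

With the default setting, `kept` contains only the first sample, trimmed to start at `\section{A}`.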

