You can use the LLM-LaTeX Remove Bibliography (DLC) component to process TeX text data. The component removes the bibliography at the end of LaTeX text. The input data file in Object Storage Service (OSS) must be in the JSON Lines format: each line in the file is a valid JSON object, but the file as a whole is not a valid JSON document.
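The following minimal sketch illustrates the JSON Lines requirement. The records and the field name "text" are hypothetical and are not taken from the component. The sketch shows that each line parses as JSON on its own, while the file content as a whole does not.

```python
import json

# Hypothetical records; the field name "text" is an assumption for illustration.
records = [
    {"text": r"\section{Intro} First document."},
    {"text": r"\section{Method} Second document."},
]

# JSON Lines: one JSON object per line, separated by newline characters.
jsonl_content = "\n".join(json.dumps(r) for r in records) + "\n"


def parse_jsonl(content):
    """Parse JSON Lines text: one JSON object per non-empty line."""
    return [json.loads(line) for line in content.splitlines() if line.strip()]


def is_valid_json(content):
    """Return True if the whole string parses as a single JSON document."""
    try:
        json.loads(content)
        return True
    except json.JSONDecodeError:
        return False


parsed = parse_jsonl(jsonl_content)
whole_file_is_json = is_valid_json(jsonl_content)
```

Parsing line by line recovers the original records, whereas parsing the whole content at once fails because a second object follows the first.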
Supported computing resources
Algorithm
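This component finds the text that matches the regular expression r'(\\appendix|\\begin\{references\}|\\begin\{REFERENCES\}|\\begin\{thebibliography\}|\\bibliography\{.*\}).*$' and replaces it with an empty string. The alternatives inside the parentheses are separated by vertical bars (|), so a match can start at \appendix, \begin{references}, \begin{REFERENCES}, \begin{thebibliography}, or a \bibliography{...} command. The trailing .*$ extends the match past the marker, so the marker and the text that follows it are removed together.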
Example:
Before processing | After processing
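The algorithm described above can be sketched in Python as follows. The pattern is quoted from the description. The re.DOTALL flag and the sample text are assumptions: the description does not state which flags the component uses, and without a flag such as re.DOTALL the trailing .*$ would not span multiple lines.

```python
import re

# Pattern quoted from the component description.
BIB_PATTERN = (
    r'(\\appendix|\\begin\{references\}|\\begin\{REFERENCES\}|'
    r'\\begin\{thebibliography\}|\\bibliography\{.*\}).*$'
)


def remove_bibliography(text):
    """Remove everything from the first bibliography marker to the end."""
    # Assumption: re.DOTALL lets "." match newlines, so the match runs from
    # the first marker to the end of the text rather than stopping at a line end.
    return re.sub(BIB_PATTERN, '', text, flags=re.DOTALL)


# Hypothetical sample text for illustration.
sample = (
    "\\section{Results}\n"
    "We report accuracy.\n"
    "\\begin{thebibliography}{9}\n"
    "\\bibitem{a} Some reference.\n"
    "\\end{thebibliography}\n"
)

cleaned = remove_bibliography(sample)
unchanged = remove_bibliography("\\section{Intro}\nNo references here.\n")
```

Under these assumptions, cleaned keeps only the two lines before \begin{thebibliography}, and text that contains none of the markers is returned unchanged. Because \appendix is one of the alternatives, an appendix is removed in the same way as a bibliography.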
Configure the component
Configure the parameters of the LLM-LaTeX Remove Bibliography (DLC) component on the pipeline page of Machine Learning Designer in the Platform for AI (PAI) console. The following table describes the parameters.
Tab | Parameter | Required | Description | Default value
Fields Setting | Target Process Field | Yes | The name of the field that you want to process. | No default value
Fields Setting | OSS Directory for Saving OutputData | No | The OSS directory in which the generated data is stored. If you do not specify this parameter, the default path of the workspace is used. | No default value
Tuning | Number of Processes | No | The number of processes. | 8
Tuning | Select Resource Group: Public Resource Group | No | The instance type (CPU or GPU), number of instances, and a virtual private cloud (VPC) that you want to use. | No default value
Tuning | Select Resource Group: Dedicated resource group | No | The number of vCPUs, memory, shared memory, number of GPUs, and number of instances that you want to use. | No default value
Tuning | Maximum Running Duration (seconds) | No | The maximum period of time the component can run. If this period of time is exceeded, the job is terminated. | No default value
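To illustrate how the Target Process Field parameter relates to the input data, the following sketch applies the regular expression to one named field of each JSON Lines record. It is not the component's implementation. The function names, the field name "text", and the re.DOTALL flag are assumptions. The Number of Processes parameter is not modeled here; the records are processed one by one.

```python
import json
import re

# Pattern quoted from the Algorithm section.
BIB_PATTERN = (
    r'(\\appendix|\\begin\{references\}|\\begin\{REFERENCES\}|'
    r'\\begin\{thebibliography\}|\\bibliography\{.*\}).*$'
)


def process_line(line, target_field):
    """Strip the bibliography from one field of a single JSON Lines record."""
    record = json.loads(line)
    value = record.get(target_field)
    if isinstance(value, str):
        # Assumption: re.DOTALL so that the match can span multiple lines.
        record[target_field] = re.sub(BIB_PATTERN, '', value, flags=re.DOTALL)
    return json.dumps(record, ensure_ascii=False)


def process_jsonl(lines, target_field):
    """Process every non-empty line and leave all other fields untouched."""
    return [process_line(line, target_field) for line in lines if line.strip()]


# Hypothetical input with "text" as the target field.
input_lines = [
    json.dumps({"id": 1, "text": "Body.\n\\bibliography{refs}\n"}),
    json.dumps({"id": 2, "text": "No references."}),
]
output_records = [json.loads(line) for line in process_jsonl(input_lines, "text")]
```

Each record is handled independently of the others, so in principle the lines could be distributed across several worker processes.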

