
Platform for AI: Submit an MPIJob training job

Last Updated: Dec 24, 2024

Deep Learning Containers (DLC) of Platform for AI (PAI) is an all-in-one cloud-native deep learning platform that provides a flexible, stable, easy-to-use, and high-performance training environment for machine learning. This topic describes how to use mpirun and DeepSpeed to submit a distributed training job of the MPIJob type in DLC.

Prerequisites

Limits

You can submit an MPIJob training job by using Lingjun resources only in the China (Ulanqab) region.

Submit an MPIJob training job

Perform the following steps to submit a distributed training job of the MPIJob type:

Step 1: Prepare a code source

Use the official DeepSpeed examples to create a code source. Configure the required parameters; the following sections describe the key parameters, and you can use the default settings for the others. For more information, see Code builds.
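The default Mount Path places the code under /root/code/, which the startup commands in this topic rely on. As a hedged sketch (the repository URL and branch are assumptions about your setup, not values confirmed by this topic), the code-source values might look like this:

```shell
# Sketch of a DLC code-source configuration (values are assumptions):
REPO_URL="https://github.com/microsoft/DeepSpeedExamples.git"  # hypothetical repository URL
MOUNT_PATH="/root/code/"   # default Mount Path; the startup commands below depend on it
# With these values, the checked-out tree would sit at:
CODE_DIR="${MOUNT_PATH}DeepSpeedExamples"
echo "$CODE_DIR"   # prints /root/code/DeepSpeedExamples
```

The paths in the startup commands later in this topic, such as /root/code/DeepSpeedExamples/training/cifar/, assume this default mount location.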

Step 2: Submit a distributed training job

You can use one of the following methods to submit a distributed training job:

mpirun

  1. Go to the Create Job page.

    1. Log on to the PAI console. Select a region and a workspace. Then, click Enter Deep Learning Containers (DLC).

    2. On the Deep Learning Containers (DLC) page, click Create Job.

  2. On the Create Job page, configure the key parameters described in the following table. For more information about other parameters, see Submit training jobs.

    Environment Information

    • Node Image: In this example, a test image is provided for submitting a distributed training job of the MPIJob type. Click Image Address and enter registry.cn-wulanchabu.aliyuncs.com/pai-dlc/deepspeed-training:23.08-gpu-py310-cu122-ubuntu22.04 in the field.

    • Startup Command: The command that is run on all nodes of the distributed training job. In this example, the default configurations of system environment variables are used. You can configure environment variables in the command to overwrite the default configurations. For more information, see System environment variables.

      cd /root/code/DeepSpeedExamples/training/cifar/

      # -np 2: start two processes, one on each of the two nodes.
      mpirun -np 2 --allow-run-as-root -bind-to none -map-by slot -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python /root/code/DeepSpeedExamples/training/cifar/cifar10_tutorial.py

    • Code Builds: Select Online Configuration, and select the code source that you created. Retain the default value of Mount Path.

    Resource Information

    • Resource Type: Select Lingjun AI Computing Service.

      Note: This parameter is available only when the workspace supports submitting DLC jobs by using both Lingjun resources and general computing resources.

    • Source: Select Resource Quota.

    • Resource Quota: Select a Lingjun resource quota that you created.

    • Framework: Select MPIJob.

    • Job Resource: Set Nodes to 2, vCPUs to 4, GPUs to 1, Memory (GiB) to 8, and Shared Memory (GiB) to 8.

    • Driver Settings: If you use the preceding test image, we recommend that you set the driver version to 535.54.03.

      Note: Only Lingjun resources support this parameter.

  3. Click Confirm.

DeepSpeed (pdsh)

If you use this method to submit a distributed training job, use the following code for Startup Command. The other parameters are configured in the same way as for the mpirun method.

cd /root/code/DeepSpeedExamples/training/pipeline_parallelism

deepspeed --hostfile /etc/mpi/hostfile train.py --deepspeed_config=ds_config.json -p 2 --steps=200

Note

If you want to use a custom image to run DeepSpeed jobs, you must install the required libraries for MPIJob and DeepSpeed in the image. You can also obtain an official DeepSpeed image from DockerHub. The required libraries for MPIJob and DeepSpeed are pre-installed in the image.

In this example, the default configurations of system environment variables are used. You can also configure environment variables in the startup command to overwrite the default configurations. For more information, see System environment variables.
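The --deepspeed_config flag in the command above points at ds_config.json, which the example's train.py loads. The pipeline_parallelism example directory ships its own config file; the snippet below only illustrates the general shape of such a file, and every value in it is an assumption rather than the shipped configuration:

```json
{
  "train_batch_size": 256,
  "train_micro_batch_size_per_gpu": 8,
  "steps_per_print": 10,
  "optimizer": {
    "type": "Adam",
    "params": { "lr": 0.001 }
  },
  "fp16": { "enabled": true }
}
```

DeepSpeed derives the gradient accumulation steps from train_batch_size, train_micro_batch_size_per_gpu, and the number of data-parallel processes, so these values must be mutually consistent.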

Step 3: View job details and logs

  1. After the job is submitted, click the name of the job on the Deep Learning Containers (DLC) page.

  2. On the job details page, view the basic information and running status of the job.

  3. In the Instance section at the bottom of the job details page, find the instance whose type is launcher, and click Log in the Actions column to view the running status of the job.

System environment variables

MPI distributed training jobs involve two roles, launcher and worker, which need to communicate with each other during training. By default, DLC configures the following environment variables for the launcher role. You can overwrite the default configurations based on your business requirements.

• OMPI_MCA_btl_tcp_if_include

  Description: Specifies the network interface controller (NIC) used for communication between the launcher and worker roles. Separate multiple NICs with commas (,).

  Default value: eth0

  Scenario: Jobs started by using mpirun.

• OMPI_MCA_orte_default_hostfile

  Description: Specifies the host file for the mpirun command. DLC can generate the host file automatically.

  Default value: /etc/mpi/hostfile

  Scenario: Jobs started by using mpirun.

• OMPI_MCA_plm_rsh_agent

  Description: Specifies how the launcher role remotely starts worker jobs.

  Default value: /etc/mpi/kubexec.sh

  Scenario: Jobs started by using mpirun.

• PDSH_RCMD_TYPE

  Description: The remote command type used by PDSH.

  Default value: ssh

  Scenario: Jobs started by using DeepSpeed.
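These defaults can be overwritten at the top of the startup command, before mpirun is invoked. A minimal sketch, assuming the nodes should communicate over two NICs named eth0 and bond0 (the NIC names are assumptions about your environment; inspect yours with ip addr):

```shell
# Overwrite the NIC list used for launcher/worker communication
# (eth0,bond0 are example NIC names, not values from this topic):
export OMPI_MCA_btl_tcp_if_include=eth0,bond0

# The remaining defaults rarely need changes; shown here for reference:
export OMPI_MCA_orte_default_hostfile=/etc/mpi/hostfile
export OMPI_MCA_plm_rsh_agent=/etc/mpi/kubexec.sh
```

The mpirun command would then follow these exports in the same Startup Command; Open MPI reads OMPI_MCA_* environment variables as MCA parameters.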