
Platform for AI: Submit an MPIJob training job

Last Updated: Dec 24, 2024

Deep Learning Containers (DLC) of Platform for AI (PAI) is an all-in-one cloud-native deep learning platform that provides a flexible, stable, easy-to-use, and high-performance training environment for machine learning. This topic describes how to use mpirun and DeepSpeed to submit a distributed training job of the MPIJob type in DLC.

Prerequisites

Limits

You can submit an MPIJob training job by using Lingjun resources only in the China (Ulanqab) region.

Submit an MPIJob training job

Perform the following steps to submit a distributed training job of the MPIJob type:

Step 1: Prepare a code source

Use the official DeepSpeed examples to create a code source. Configure the required parameters; the following sections describe the key parameters, and you can use the default settings for the others. For more information, see Code builds.
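The default Mount Path places the code under /root/code/, which the startup commands in this topic rely on. As a hedged sketch (the repository URL and branch are assumptions about your setup, not values confirmed by this topic), the code-source values might look like this:

```shell
# Sketch of a DLC code-source configuration (values are assumptions):
REPO_URL="https://github.com/microsoft/DeepSpeedExamples.git"  # hypothetical repository URL
MOUNT_PATH="/root/code/"   # default Mount Path; the startup commands below depend on it
# With these values, the checked-out tree would sit at:
CODE_DIR="${MOUNT_PATH}DeepSpeedExamples"
echo "$CODE_DIR"   # prints /root/code/DeepSpeedExamples
```

The paths in the startup commands later in this topic, such as /root/code/DeepSpeedExamples/training/cifar/, assume this default mount location.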

Step 2: Submit a distributed training job

You can use one of the following methods to submit a distributed training job:

mpirun

  1. Go to the Create Job page.

    1. Log on to the PAI console. Select a region and a workspace. Then, click Enter Deep Learning Containers (DLC).

    2. On the Deep Learning Containers (DLC) page, click Create Job.

  2. On the Create Job page, configure the key parameters described in the following table. For more information about other parameters, see Submit training jobs.

    Environment Information

    • Node Image: In this example, a test image is provided for submitting a distributed training job of the MPIJob type. Click Image Address and enter registry.cn-wulanchabu.aliyuncs.com/pai-dlc/deepspeed-training:23.08-gpu-py310-cu122-ubuntu22.04 in the field.

    • Startup Command: The command that is run on all nodes of the distributed training job. In this example, the default configurations of system environment variables are used. You can configure environment variables in the command to overwrite the default configurations. For more information, see System environment variables.

      cd /root/code/DeepSpeedExamples/training/cifar/

      # -np 2: start two processes, one on each of the two nodes.
      mpirun -np 2 --allow-run-as-root -bind-to none -map-by slot -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python /root/code/DeepSpeedExamples/training/cifar/cifar10_tutorial.py

    • Code Builds: Select Online Configuration, and select the code source that you created. Retain the default value of Mount Path.

    Resource Information

    • Resource Type: Select Lingjun AI Computing Service.

      Note: This parameter is available only when the workspace supports submitting DLC jobs by using both Lingjun resources and general computing resources.

    • Source: Select Resource Quota.

    • Resource Quota: Select a Lingjun resource quota that you created.

    • Framework: Select MPIJob.

    • Job Resource: Set Nodes to 2, vCPUs to 4, GPUs to 1, Memory (GiB) to 8, and Shared Memory (GiB) to 8.

    • Driver Settings: If you use the preceding test image, we recommend that you set the driver version to 535.54.03.

      Note: Only Lingjun resources support this parameter.

  3. Click Confirm.

DeepSpeed (pdsh)

If you use this method to submit a distributed training job, use the following code for Startup Command. The other parameters are configured in the same way as for the mpirun method.

cd /root/code/DeepSpeedExamples/training/pipeline_parallelism

deepspeed --hostfile /etc/mpi/hostfile train.py --deepspeed_config=ds_config.json -p 2 --steps=200

Note

If you want to use a custom image to run DeepSpeed jobs, you must install the required libraries for MPIJob and DeepSpeed in the image. You can also obtain an official DeepSpeed image from DockerHub. The required libraries for MPIJob and DeepSpeed are pre-installed in the image.

In this example, the default configurations of system environment variables are used. You can also configure environment variables in the startup command to overwrite the default configurations. For more information, see System environment variables.
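The --deepspeed_config flag in the command above points at ds_config.json, which the example's train.py loads. The pipeline_parallelism example directory ships its own config file; the snippet below only illustrates the general shape of such a file, and every value in it is an assumption rather than the shipped configuration:

```json
{
  "train_batch_size": 256,
  "train_micro_batch_size_per_gpu": 8,
  "steps_per_print": 10,
  "optimizer": {
    "type": "Adam",
    "params": { "lr": 0.001 }
  },
  "fp16": { "enabled": true }
}
```

DeepSpeed derives the gradient accumulation steps from train_batch_size, train_micro_batch_size_per_gpu, and the number of data-parallel processes, so these values must be mutually consistent.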

Step 3: View job details and logs

  1. After the job is submitted, click the name of the job on the Deep Learning Containers (DLC) page.

  2. On the job details page, view the basic information and running status of the job.

  3. In the Instance section at the bottom of the job details page, find the instance whose type is launcher, and click Log in the Actions column to view the running status of the job.

System environment variables

MPI distributed training jobs involve two roles, launcher and worker, which need to communicate with each other during training. By default, DLC configures the following environment variables for the launcher role. You can overwrite the default configurations based on your business requirements.

• OMPI_MCA_btl_tcp_if_include

  Description: Specifies the network interface controller (NIC) used for communication between the launcher and worker roles. Separate multiple NICs with commas (,).

  Default value: eth0

  Scenario: Jobs started by using mpirun.

• OMPI_MCA_orte_default_hostfile

  Description: Specifies the host file for the mpirun command. DLC can generate the host file automatically.

  Default value: /etc/mpi/hostfile

  Scenario: Jobs started by using mpirun.

• OMPI_MCA_plm_rsh_agent

  Description: Specifies how the launcher role remotely starts worker jobs.

  Default value: /etc/mpi/kubexec.sh

  Scenario: Jobs started by using mpirun.

• PDSH_RCMD_TYPE

  Description: The remote command type used by PDSH.

  Default value: ssh

  Scenario: Jobs started by using DeepSpeed.
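These defaults can be overwritten at the top of the startup command, before mpirun is invoked. A minimal sketch, assuming the nodes should communicate over two NICs named eth0 and bond0 (the NIC names are assumptions about your environment; inspect yours with ip addr):

```shell
# Overwrite the NIC list used for launcher/worker communication
# (eth0,bond0 are example NIC names, not values from this topic):
export OMPI_MCA_btl_tcp_if_include=eth0,bond0

# The remaining defaults rarely need changes; shown here for reference:
export OMPI_MCA_orte_default_hostfile=/etc/mpi/hostfile
export OMPI_MCA_plm_rsh_agent=/etc/mpi/kubexec.sh
```

The mpirun command would then follow these exports in the same Startup Command; Open MPI reads OMPI_MCA_* environment variables as MCA parameters.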