Platform for AI: General environment variables

Last Updated: May 08, 2025

When you submit a training job in Deep Learning Containers (DLC) of Platform for AI (PAI), the system automatically injects multiple general environment variables that you can use in the code. This topic describes the environment variables that are provided in DLC.

Common environment variables

For information about the environment variables that are used for Lingjun AI Computing Service (Lingjun), see the "Configure high-performance network variables" section in the RDMA: high-performance networks for distributed training topic.

PyTorch environment variables

In distributed PyTorch training jobs, the master and worker nodes play different roles. You need to establish a connection between the nodes to allow communication. DLC provides environment variables to communicate necessary information, such as the address and port number of the master node. The following table describes the general environment variables for PyTorch training jobs in DLC.

| Environment variable | Description |
| --- | --- |
| MASTER_ADDR | The service address of the master node. Example: dlc18isgeayd****-master-0. |
| MASTER_PORT | The port of the master node. Example: 23456. |
| WORLD_SIZE | The total number of nodes in the distributed training job. For example, if you submit a job that contains one master node and one worker node, WORLD_SIZE is set to 2. |
| RANK | The index of the node. For example, if you submit a job that contains one master node and two worker nodes, RANK is set to 0 on the master node, 1 on worker node-0, and 2 on worker node-1. |
| NPROC_PER_NODE | The number of GPUs for each worker node. For example, if the GPU specification of a worker node contains 8 GPUs of the GU7E type, the value of this variable is 8. |
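As an illustration of how training code might consume these variables, the following minimal sketch reads them from the environment and derives two commonly needed values. The helper names (`DlcPytorchEnv`, `read_dlc_pytorch_env`, `init_method_url`, `total_processes`) and the fallback defaults are hypothetical and are not part of DLC. The process count assumes that you launch one process per GPU on every node, which may not match your job.

```python
import os
from dataclasses import dataclass


@dataclass
class DlcPytorchEnv:
    master_addr: str
    master_port: int
    world_size: int      # number of nodes, as described in the table above
    rank: int            # index of this node, as described in the table above
    nproc_per_node: int  # number of GPUs per node


def read_dlc_pytorch_env(env=None):
    """Read the injected variables from a mapping (os.environ by default).

    The fallback values are assumptions that only make local,
    single-node testing convenient; DLC sets the real values.
    """
    env = os.environ if env is None else env
    return DlcPytorchEnv(
        master_addr=env.get("MASTER_ADDR", "127.0.0.1"),
        master_port=int(env.get("MASTER_PORT", "23456")),
        world_size=int(env.get("WORLD_SIZE", "1")),
        rank=int(env.get("RANK", "0")),
        nproc_per_node=int(env.get("NPROC_PER_NODE", "1")),
    )


def init_method_url(cfg):
    """Build a TCP rendezvous URL that points at the master node."""
    return f"tcp://{cfg.master_addr}:{cfg.master_port}"


def total_processes(cfg):
    """Total process count, assuming one process per GPU on every node."""
    return cfg.world_size * cfg.nproc_per_node


if __name__ == "__main__":
    cfg = read_dlc_pytorch_env()
    print(cfg, init_method_url(cfg), total_processes(cfg))
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise outside DLC, for example in a unit test.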

TensorFlow environment variables

Distributed TensorFlow training jobs use the TF_CONFIG environment variable to build a distributed network topology. The following table describes the general environment variables for TensorFlow training jobs in DLC.

| Environment variable | Description |
| --- | --- |
| TF_CONFIG | The distributed network topology of the TensorFlow training job. |

Example value of TF_CONFIG:

{
  "cluster": {
    "worker": [
      "dlc1y3madghd****-worker-0.t1612285282502324.svc:2222",
      "dlc1y3madghd****-worker-1.t1612285282502324.svc:2222"
    ]
  },
  "task": {
    "type": "worker",
    "index": 0
  },
  "environment": "cloud"
}
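
Because TF_CONFIG is a JSON string, training code can inspect it with the standard library. The following minimal sketch extracts the task type, the task index, and the address of the current task from a value shaped like the example above. The function name `parse_tf_config` and the keys of the returned dictionary are hypothetical and chosen only for illustration.

```python
import json
import os


def parse_tf_config(raw=None):
    """Parse TF_CONFIG (from the environment by default) into a small summary."""
    if raw is None:
        # Fall back to an empty object when the variable is not set,
        # for example when the script runs outside DLC.
        raw = os.environ.get("TF_CONFIG", "{}")
    config = json.loads(raw)

    cluster = config.get("cluster", {})
    task = config.get("task", {})
    task_type = task.get("type")
    task_index = task.get("index")

    # Addresses of all tasks that share this task's type, e.g. all workers.
    peers = cluster.get(task_type, []) if task_type is not None else []
    own_address = None
    if task_index is not None and 0 <= task_index < len(peers):
        own_address = peers[task_index]

    return {
        "task_type": task_type,
        "task_index": task_index,
        "own_address": own_address,
        "num_peers": len(peers),
    }


if __name__ == "__main__":
    print(parse_tf_config())
```

With the example value shown above, the summary reports task type `worker`, index 0, two peers, and the first worker address as the address of the current task.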