Platform for AI: General environment variables

Last Updated: Mar 19, 2024

When you submit a training job in Deep Learning Containers (DLC) of Platform for AI (PAI), the system automatically injects multiple general environment variables that you can use in the code. This topic describes the environment variables that are provided in DLC.

Common environment variables

For information about the environment variables that are used for Lingjun AI Computing Service (Lingjun), see the "Configure high-performance network variables" section in the RDMA: high-performance networks for distributed training topic.

PyTorch environment variables

In distributed PyTorch training jobs, the master and worker nodes play different roles, and the nodes must connect to one another before they can communicate. DLC injects environment variables that carry the information needed to establish these connections, such as the address and port number of the master node. The following variables are available for PyTorch training jobs in DLC; a minimal usage sketch follows the list.

MASTER_ADDR: The service address of the master node. Example: dlc18isgeayd****-master-0.

MASTER_PORT: The port of the master node. Example: 23456.

WORLD_SIZE: The total number of nodes in the distributed training job. For example, if you submit a job that contains one master node and one worker node, WORLD_SIZE is set to 2.

RANK: The index of the node. For example, if you submit a job that contains one master node and two worker nodes, RANK is set to 0 on the master node, 1 on worker node-0, and 2 on worker node-1.
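
The following is a minimal sketch, written against the standard torch.distributed API, of one way to consume these variables in training code. The gloo backend and the assumption of one process per node are illustrative choices, not DLC requirements.

import os

import torch.distributed as dist

# A minimal sketch that reads the variables DLC injects and initializes
# the default process group. Assumes one process per node, so RANK and
# WORLD_SIZE map directly onto init_process_group arguments.
master_addr = os.environ["MASTER_ADDR"]    # e.g. dlc18isgeayd****-master-0
master_port = os.environ["MASTER_PORT"]    # e.g. 23456
rank = int(os.environ["RANK"])             # e.g. 0 on the master node
world_size = int(os.environ["WORLD_SIZE"])

dist.init_process_group(
    backend="gloo",  # assumption: use "nccl" for GPU training
    init_method=f"tcp://{master_addr}:{master_port}",
    rank=rank,
    world_size=world_size,
)

Because the variable names match PyTorch's defaults, passing init_method="env://" (or omitting init_method) also works: torch.distributed then reads MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE from the environment itself.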

TensorFlow environment variables

Distributed TensorFlow training jobs use the TF_CONFIG environment variable to build the distributed network topology. The following describes the TF_CONFIG variable that DLC sets for TensorFlow training jobs; a short parsing sketch follows the example.

TF_CONFIG: The distributed network topology of the TensorFlow training job. Example:

{
  "cluster": {
    "worker": [
      "dlc1y3madghd****-worker-0.t1612285282502324.svc:2222",
      "dlc1y3madghd****-worker-1.t1612285282502324.svc:2222"
    ]
  },
  "task": {
    "type": "worker",
    "index": 0
  },
  "environment": "cloud"
}
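
The following is a minimal sketch of consuming TF_CONFIG in training code. tf.distribute.MultiWorkerMirroredStrategy reads TF_CONFIG from the environment automatically; the manual json.loads parsing and the toy Keras model are illustrative assumptions, not DLC requirements.

import json
import os

import tensorflow as tf

# TF_CONFIG is a JSON document; parse it to find this node's role, for
# example to run checkpointing only on worker 0.
tf_config = json.loads(os.environ["TF_CONFIG"])
task_type = tf_config["task"]["type"]    # e.g. "worker"
task_index = tf_config["task"]["index"]  # e.g. 0
is_chief = task_type == "worker" and task_index == 0

# MultiWorkerMirroredStrategy itself reads TF_CONFIG from the
# environment to build the cluster; no arguments are needed here.
strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    # Any model built inside the scope is replicated across workers.
    # This toy model is an illustrative assumption.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="sgd", loss="mse")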