Built-in environment variables

When you submit a training job in Deep Learning Containers (DLC) of Platform for AI (PAI), DLC automatically injects environment variables that you can read in your training code.

PyTorch environment variables

In a distributed PyTorch training job, every node first connects to the master node to join the job. DLC injects the following variables so that each node can discover the master's address and determine its own position in the cluster.

Variable	Description	Example
`MASTER_ADDR`	Service address of the master node	`dlc18isgeayd****-master-0`
`MASTER_PORT`	Port of the master node	`23456`
`WORLD_SIZE`	Total number of nodes in the job	`2` (1 master + 1 worker)
`RANK`	Index of this node across the entire job	`0` for master; `1`, `2` for worker-0, worker-1 (1 master + 2 workers)
`NPROC_PER_NODE`	Number of GPUs for each worker node	`8` for a GU7E node with 8 GPUs
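
The following Python sketch shows one way these variables might be consumed. It reads the five variables from the table and assembles a `torchrun` launch command from them. The helper names `read_dlc_pytorch_env` and `build_torchrun_command`, the script name `train.py`, and the fallback of `1` when `NPROC_PER_NODE` is missing are illustrative assumptions, not part of DLC. Because `WORLD_SIZE` and `RANK` count nodes here, they are mapped to the `--nnodes` and `--node_rank` options rather than to a per-process world size.

```python
import os


def read_dlc_pytorch_env(env=None):
    """Collect the DLC-injected PyTorch variables into a dict with typed values."""
    env = os.environ if env is None else env
    return {
        "master_addr": env["MASTER_ADDR"],
        "master_port": int(env["MASTER_PORT"]),
        # Per the table above, WORLD_SIZE counts nodes and RANK is the node index.
        "num_nodes": int(env["WORLD_SIZE"]),
        "node_rank": int(env["RANK"]),
        # Assumption: fall back to one process per node if the variable is absent.
        "nproc_per_node": int(env.get("NPROC_PER_NODE", "1")),
    }


def build_torchrun_command(cfg, script="train.py"):
    """Build a torchrun argument list; `script` is a hypothetical entry point."""
    return [
        "torchrun",
        f"--nnodes={cfg['num_nodes']}",
        f"--node_rank={cfg['node_rank']}",
        f"--nproc_per_node={cfg['nproc_per_node']}",
        f"--master_addr={cfg['master_addr']}",
        f"--master_port={cfg['master_port']}",
        script,
    ]
```

Under these assumptions, a job with 2 nodes and 8 GPUs per node would start 16 training processes in total (`num_nodes * nproc_per_node`), one per GPU. The returned list can be passed to `subprocess.run` or joined into a shell command.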

TensorFlow environment variables

Distributed TensorFlow training uses `TF_CONFIG` to describe the full cluster topology and identify the current task. DLC sets this variable on every node automatically.

Variable	Description
`TF_CONFIG`	JSON string describing the distributed network topology, including the cluster worker list and the task identity of the current node

Example value (for worker-0 in a two-worker job):

{
  "cluster": {
    "worker": [
      "dlc1y3madghd****-worker-0.t1612285282502324.svc:2222",
      "dlc1y3madghd****-worker-1.t1612285282502324.svc:2222"
    ]
  },
  "task": {
    "type": "worker",
    "index": 0
  },
  "environment": "cloud"
}

The `cluster.worker` array lists all workers in the job. The `task` object identifies this node: `type` is its role and `index` is its zero-based position in the worker list.
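
As a minimal sketch of how training code might inspect this value, the Python function below parses `TF_CONFIG` with the standard `json` module and extracts the task identity, the worker count, and the current node's own address. The function name `parse_tf_config` and the keys of the returned dict are illustrative assumptions. The sketch handles only the `worker` role shown in the example above.

```python
import json
import os


def parse_tf_config(raw=None):
    """Parse TF_CONFIG and summarize this node's place in the cluster."""
    # Read from the environment unless a JSON string is passed in directly.
    raw = os.environ.get("TF_CONFIG", "{}") if raw is None else raw
    config = json.loads(raw)

    workers = config.get("cluster", {}).get("worker", [])
    task = config.get("task", {})
    task_type = task.get("type")
    task_index = int(task.get("index", 0))

    # For a worker task, index is the zero-based position in cluster.worker.
    own_address = None
    if task_type == "worker" and task_index < len(workers):
        own_address = workers[task_index]

    return {
        "task_type": task_type,
        "task_index": task_index,
        "num_workers": len(workers),
        "own_address": own_address,
    }
```

For the example value above, this returns a task type of `worker`, a task index of `0`, a worker count of `2`, and the first entry of the worker list as the node's own address.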

Lingjun high-performance network variables

For environment variables used with Lingjun AI Computing Service (Lingjun), see the "Configure high-performance network variables" section in RDMA: high-performance networks for distributed training.