When you submit a training job in Deep Learning Containers (DLC) of Platform for AI (PAI), DLC automatically injects environment variables that your training code can read.
PyTorch environment variables
In a distributed PyTorch training job, every node connects to the master node to join the job before training starts. DLC injects the following variables so that each node can discover the master's address and determine its own position in the cluster.
| Variable | Description | Example |
|---|---|---|
| MASTER_ADDR | Service address of the master node | dlc18isgeayd****-master-0 |
| MASTER_PORT | Port of the master node | 23456 |
| WORLD_SIZE | Total number of nodes in the job | 2 (1 master + 1 worker) |
| RANK | Index of this node across the entire job | 0 for master; 1, 2 for worker-0, worker-1 (1 master + 2 workers) |
| NPROC_PER_NODE | Number of GPUs for each worker node | 8 for a GU7E node with 8 GPUs |
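The variables in the table above can be read with ordinary environment lookups. The following is a minimal sketch, not an official DLC helper: the names `DlcPytorchEnv`, `read_dlc_pytorch_env`, and `torchrun_command`, the script name `train.py`, and the fallback of 1 when `NPROC_PER_NODE` is absent are all assumptions made for illustration. Because the table describes `WORLD_SIZE` and `RANK` as node-level values, the sketch maps them to the `--nnodes` and `--node_rank` options of `torchrun`.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class DlcPytorchEnv:
    """Hypothetical container for the PyTorch variables injected by DLC."""
    master_addr: str
    master_port: int
    num_nodes: int       # WORLD_SIZE: total number of nodes in the job
    node_rank: int       # RANK: index of this node across the job
    nproc_per_node: int  # NPROC_PER_NODE: GPUs on each node


def read_dlc_pytorch_env(env: Mapping[str, str] = os.environ) -> DlcPytorchEnv:
    """Read the injected variables from a mapping (os.environ by default)."""
    return DlcPytorchEnv(
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
        num_nodes=int(env["WORLD_SIZE"]),
        node_rank=int(env["RANK"]),
        # Assumption: fall back to one process per node if the variable is missing.
        nproc_per_node=int(env.get("NPROC_PER_NODE", "1")),
    )


def torchrun_command(cfg: DlcPytorchEnv, script: str) -> List[str]:
    """Build a torchrun argument list from the node-level settings."""
    return [
        "torchrun",
        f"--nnodes={cfg.num_nodes}",
        f"--node_rank={cfg.node_rank}",
        f"--nproc_per_node={cfg.nproc_per_node}",
        f"--master_addr={cfg.master_addr}",
        f"--master_port={cfg.master_port}",
        script,
    ]


if __name__ == "__main__":
    # Sample values taken from the table; a real job would use os.environ.
    sample_env = {
        "MASTER_ADDR": "dlc18isgeayd****-master-0",
        "MASTER_PORT": "23456",
        "WORLD_SIZE": "2",
        "RANK": "1",
        "NPROC_PER_NODE": "8",
    }
    cfg = read_dlc_pytorch_env(sample_env)
    print(" ".join(torchrun_command(cfg, "train.py")))
    print("total processes:", cfg.num_nodes * cfg.nproc_per_node)
```

With these sample values, the job would run one process per GPU on each of the two nodes. Inside a real job, calling `read_dlc_pytorch_env()` without arguments reads the values that DLC injected.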
TensorFlow environment variables
Distributed TensorFlow training uses TF_CONFIG to describe the full cluster topology and identify the current task. DLC sets this variable on every node automatically.
| Variable | Description |
|---|---|
| TF_CONFIG | JSON string describing the distributed network topology, including the cluster worker list and the task identity of the current node |
Example value (for worker-0 in a two-worker job):

    {
      "cluster": {
        "worker": [
          "dlc1y3madghd****-worker-0.t1612285282502324.svc:2222",
          "dlc1y3madghd****-worker-1.t1612285282502324.svc:2222"
        ]
      },
      "task": {
        "type": "worker",
        "index": 0
      },
      "environment": "cloud"
    }

The cluster.worker array lists all workers in the job. The task object identifies this node: type is its role and index is its zero-based position in the worker list.
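To show how the fields described above fit together, here is a minimal sketch that parses a `TF_CONFIG` string with the standard `json` module. The names `TaskInfo` and `parse_tf_config` are hypothetical, and the worker addresses in the demo are shortened placeholders rather than real DLC hostnames. The sketch only inspects the value; it is not how TensorFlow itself consumes the variable.

```python
import json
from typing import List, NamedTuple, Optional


class TaskInfo(NamedTuple):
    """Hypothetical summary of a TF_CONFIG value."""
    workers: List[str]          # every address in cluster.worker
    task_type: str              # role of this node, e.g. "worker"
    task_index: int             # zero-based position within its role list
    own_address: Optional[str]  # this node's address, if it can be resolved


def parse_tf_config(raw: str) -> TaskInfo:
    """Parse a TF_CONFIG JSON string into its cluster and task parts."""
    config = json.loads(raw)
    cluster = config.get("cluster", {})
    task = config["task"]
    task_type = task["type"]
    task_index = int(task["index"])
    # Look up this node's own address in the list that matches its role.
    peers = cluster.get(task_type, [])
    own = peers[task_index] if 0 <= task_index < len(peers) else None
    return TaskInfo(cluster.get("worker", []), task_type, task_index, own)


if __name__ == "__main__":
    # Placeholder addresses; inside a job, pass os.environ["TF_CONFIG"] instead.
    sample = json.dumps({
        "cluster": {"worker": ["worker-0.example:2222", "worker-1.example:2222"]},
        "task": {"type": "worker", "index": 0},
        "environment": "cloud",
    })
    info = parse_tf_config(sample)
    print(f"{info.task_type} {info.task_index} of {len(info.workers)} workers")
    print("own address:", info.own_address)
```

Parsing the value by hand is mainly useful for logging or for choosing which node writes checkpoints. TensorFlow's multi-worker strategies, such as `tf.distribute.MultiWorkerMirroredStrategy`, read `TF_CONFIG` from the environment on their own.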
Lingjun high-performance network variables
For environment variables used with Lingjun AI Computing Service (Lingjun), see the "Configure high-performance network variables" section in RDMA: high-performance networks for distributed training.