You can enable gRPC++-based distributed training to accelerate deep learning training jobs in Deep Learning Containers (DLC) of Platform for AI (PAI). This topic describes how to perform distributed training based on the distributed communication framework gRPC++.
To support larger-scale training and provide better performance, gRPC++ uses multiple optimization technologies to reduce end-to-end (E2E) communication latency and improve server throughput. These technologies include the Shared-Nothing architecture, the busy-polling mechanism, user-mode zero-copy, and Send/Recv integration. In typical business scenarios, gRPC++ delivers training performance several times higher than that of native TensorFlow.
GPU-accelerated servers will be phased out. You can submit TensorFlow tasks that run on CPU servers. If you want to use GPU-accelerated instances for model training, go to Deep Learning Containers (DLC) to submit jobs. For more information, see Submit training jobs.
Enable gRPC++-based distributed training
To use gRPC++ for distributed training, set protocol="grpc++" when you create the tf.train.Server instance:
import tensorflow as tf

# Define the cluster, which consists of parameter servers (ps) and workers.
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
# Set protocol="grpc++" to enable gRPC++-based communication for this server.
server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index,
                         protocol="grpc++")
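The snippet above assumes that ps_hosts, worker_hosts, and FLAGS are already defined. In a typical TensorFlow distributed training script, they are derived from command-line flags. The following sketch shows one common way to obtain them; the flag names mirror the FLAGS fields used above, and argparse is used here only to keep the example self-contained (the sample host addresses are illustrative):

```python
import argparse

# Illustrative flags; names mirror the FLAGS fields used in the snippet above.
parser = argparse.ArgumentParser()
parser.add_argument("--ps_hosts", type=str, default="",
                    help="Comma-separated list of parameter server host:port pairs")
parser.add_argument("--worker_hosts", type=str, default="",
                    help="Comma-separated list of worker host:port pairs")
parser.add_argument("--job_name", type=str, default="worker",
                    help='Role of this process: "ps" or "worker"')
parser.add_argument("--task_index", type=int, default=0,
                    help="Index of this task within its job")

# Sample arguments for illustration; in a real job these come from the launcher.
FLAGS = parser.parse_args(["--ps_hosts=ps0:2222",
                           "--worker_hosts=worker0:2222,worker1:2222",
                           "--job_name=worker",
                           "--task_index=1"])

# Split the comma-separated host strings into the lists that
# tf.train.ClusterSpec expects.
ps_hosts = FLAGS.ps_hosts.split(",")
worker_hosts = FLAGS.worker_hosts.split(",")
```

Each ps and worker process runs the same script but is launched with a different --job_name and --task_index, so that every process knows its own role within the cluster.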