This topic describes how to perform distributed training by using the StarServer distributed training framework.

StarServer replaces the Send/Recv semantics of native TensorFlow with Pull/Push semantics and enables lock-free graph execution, which makes concurrent subgraph execution more efficient. These optimizations allow PAI-TensorFlow to support large-scale training with high performance. In typical business scenarios, PAI-TensorFlow trains several times faster than native TensorFlow, and in tests with 3,000 workers it achieves linear scalability.

Enable StarServer-based distributed training

To use StarServer for distributed training, you must pass protocol="star_server" to the tf.train.Server constructor, as shown in the following code:
import tensorflow as tf

# Describe the cluster topology: parameter server (ps) and worker hosts.
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

# protocol="star_server" switches this server to StarServer-based communication.
server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index,
                         protocol="star_server")
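The following is a minimal sketch of how the snippet above can fit into a complete between-graph replicated training script. The flag names, host lists, and the toy linear model are illustrative assumptions rather than part of StarServer itself; the only StarServer-specific setting is protocol="star_server".

import tensorflow as tf

# Illustrative command-line flags; adjust names and defaults to your setup.
tf.app.flags.DEFINE_string("job_name", "", "Either 'ps' or 'worker'.")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of the task within its job.")
tf.app.flags.DEFINE_string("ps_hosts", "", "Comma-separated ps host:port pairs.")
tf.app.flags.DEFINE_string("worker_hosts", "", "Comma-separated worker host:port pairs.")
FLAGS = tf.app.flags.FLAGS

def main(_):
    ps_hosts = FLAGS.ps_hosts.split(",")
    worker_hosts = FLAGS.worker_hosts.split(",")
    cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

    # protocol="star_server" enables StarServer-based communication.
    server = tf.train.Server(cluster,
                             job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index,
                             protocol="star_server")

    if FLAGS.job_name == "ps":
        # Parameter servers host variables and block here to serve requests.
        server.join()
    else:
        # Place variables on ps tasks and computation on this worker.
        with tf.device(tf.train.replica_device_setter(
                worker_device="/job:worker/task:%d" % FLAGS.task_index,
                cluster=cluster)):
            global_step = tf.train.get_or_create_global_step()
            # Toy model for illustration only: fit y = 3x with a single weight.
            x = tf.random_normal([32, 1])
            y = 3.0 * x
            w = tf.get_variable("w", shape=[1, 1])
            loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))
            train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
                loss, global_step=global_step)

        # Between-graph replication: each worker runs its own training loop.
        with tf.train.MonitoredTrainingSession(
                master=server.target,
                is_chief=(FLAGS.task_index == 0),
                hooks=[tf.train.StopAtStepHook(last_step=1000)]) as sess:
            while not sess.should_stop():
                sess.run(train_op)

if __name__ == "__main__":
    tf.app.run()

Apart from the protocol argument, the script follows the standard TensorFlow 1.x parameter-server pattern, so an existing distributed training script usually needs no other changes to adopt StarServer.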