This topic describes how to perform distributed training based on the StarServer framework.
StarServer replaces the Send/Recv semantics of native TensorFlow with Pull/Push semantics and supports lock-free graph execution, which makes concurrent subgraph execution more efficient. PAI-TensorFlow supports large-scale training and delivers high training performance. In typical business scenarios, the training performance of PAI-TensorFlow is several times that of native TensorFlow. In tests with 3,000 workers, PAI-TensorFlow achieves linear scalability.
Enable StarServer-based distributed training
To enable StarServer-based distributed training, pass protocol="star_server" when you construct tf.train.Server, as shown in the following example:
import tensorflow as tf

# ps_hosts and worker_hosts are lists of "host:port" strings for the
# parameter server and worker tasks in the cluster.
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
# protocol="star_server" switches the server from the default gRPC protocol to StarServer.
server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index,
                         protocol="star_server")
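For context, the following is a minimal sketch of a driver script that uses such a server in a standard parameter-server setup. It assumes between-graph replication and command-line flags parsed with tf.app.flags; the flag names and overall structure are illustrative assumptions, not part of the PAI-TensorFlow API.

import tensorflow as tf

# Hypothetical flag names; adjust them to match how your jobs are launched.
flags = tf.app.flags
flags.DEFINE_string("job_name", "", "Either 'ps' or 'worker'.")
flags.DEFINE_integer("task_index", 0, "Index of the task within its job.")
flags.DEFINE_string("ps_hosts", "", "Comma-separated list of ps host:port pairs.")
flags.DEFINE_string("worker_hosts", "", "Comma-separated list of worker host:port pairs.")
FLAGS = flags.FLAGS

def main(_):
    ps_hosts = FLAGS.ps_hosts.split(",")
    worker_hosts = FLAGS.worker_hosts.split(",")
    cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
    # Create the server with the StarServer protocol instead of the default gRPC protocol.
    server = tf.train.Server(cluster,
                             job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index,
                             protocol="star_server")
    if FLAGS.job_name == "ps":
        # Parameter servers block here and serve variables to the workers.
        server.join()
    else:
        # Workers place variables on ps tasks and build the model and training ops here.
        with tf.device(tf.train.replica_device_setter(
                worker_device="/job:worker/task:%d" % FLAGS.task_index,
                cluster=cluster)):
            pass  # build the model and run the training loop

if __name__ == "__main__":
    tf.app.run()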