Platform For AI: StarServer: A distributed training framework

Last Updated: Apr 02, 2024

Distributed training is a technique that splits the model training process across multiple compute nodes. This technique is used in deep learning and large-scale machine learning tasks to accelerate model training, process large amounts of data, and improve system stability and resource utilization. This topic describes how to use the StarServer framework for distributed training.

StarServer improves the efficiency of concurrent subgraph execution by replacing the Send/Recv semantics of native TensorFlow with Pull/Push semantics and by enabling lock-free graph execution. In common business scenarios, PAI-TensorFlow outperforms native TensorFlow by several times. For example, PAI-TensorFlow can achieve near-linear scalability with up to 3,000 workers.

Warning

GPU-accelerated servers will be phased out. You can submit TensorFlow tasks that run on CPU servers. If you want to use GPU-accelerated instances for model training, go to Deep Learning Containers (DLC) to submit jobs. For more information, see Submit training jobs.

Sample code

To use StarServer for distributed training, pass protocol="star_server" when you create the tf.train.Server instance.

import tensorflow as tf

# Define the cluster topology from the parameter server and worker host lists.
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

# Specify protocol="star_server" to enable the StarServer communication protocol.
server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index,
                         protocol="star_server")