
Container Service for Kubernetes: Overview of running model training jobs on Kubernetes

Last Updated: Mar 06, 2025

Run standalone TensorFlow training jobs

You can run standalone TensorFlow training jobs in Container Service for Kubernetes (ACK) clusters. ACK provides resource management capabilities that allow you to quickly deploy and run these jobs. This topic describes how to create training jobs, configure resources, and run them, so that you can easily get started with standalone TensorFlow training. For more information, see Use Arena to submit standalone TensorFlow training jobs in a Kubernetes cluster.
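As a minimal sketch, a standalone TensorFlow job can be submitted with the Arena CLI mentioned above. The job name, image, and script path here are illustrative placeholders, not values from this topic:

```shell
# Hypothetical example: submit a single-node TensorFlow training job with Arena.
# --gpus requests GPUs for the job; the trailing string is the training command.
arena submit tf \
    --name=tf-standalone-demo \
    --gpus=1 \
    --image=tensorflow/tensorflow:1.15.5-gpu \
    "python /app/train.py"

# Check the job status and view training logs.
arena get tf-standalone-demo
arena logs tf-standalone-demo
```

These commands assume Arena is installed and configured against your ACK cluster.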

Run distributed TensorFlow training jobs

You can run distributed TensorFlow training jobs in ACK clusters. Distributed training parallelizes computation across multiple nodes in the cluster to improve training speed and efficiency. This topic introduces the basic terms related to distributed model training, and describes how to configure a cluster for distributed training and how to run distributed TensorFlow training jobs in ACK clusters. You can refer to this topic to optimize the performance of distributed TensorFlow training jobs. For more information, see Use Arena to submit distributed TensorFlow training jobs in a Kubernetes cluster.
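A distributed TensorFlow job in the parameter server pattern can be sketched with Arena as follows; the job name, images, worker counts, and script path are illustrative assumptions:

```shell
# Hypothetical example: submit a distributed TensorFlow job with
# 1 parameter server (PS) and 2 GPU workers.
arena submit tf \
    --name=tf-dist-demo \
    --workers=2 \
    --worker-image=tensorflow/tensorflow:1.15.5-gpu \
    --gpus=1 \
    --ps=1 \
    --ps-image=tensorflow/tensorflow:1.15.5 \
    "python /app/train_dist.py"
```

Separate images can be used for PS and worker roles because parameter servers typically do not need GPUs.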

Use Arena to submit standalone PyTorch training jobs

Arena is a tool that simplifies the submission of machine learning (ML) jobs. You can use Arena to submit standalone PyTorch training jobs on Kubernetes. This topic describes how to install and configure Arena and how to use it to submit and manage standalone PyTorch training jobs with simple commands, which helps you improve training efficiency. For more information, see Use Arena to submit standalone PyTorch training jobs.
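A standalone PyTorch submission can be sketched in the same way; the job name, image, and script path are placeholders:

```shell
# Hypothetical example: submit a single-node PyTorch training job with Arena.
arena submit pytorch \
    --name=pytorch-standalone-demo \
    --gpus=1 \
    --image=pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime \
    "python /app/main.py"

# List submitted jobs and their status.
arena list
```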

Use Arena to submit distributed PyTorch training jobs

You can use Arena to submit distributed PyTorch training jobs on Kubernetes. This topic describes how to use Arena to submit a distributed PyTorch training job that runs on multiple nodes in a Kubernetes cluster. You can modify the parameters of a training job to implement parallel model training in a distributed environment. This helps you improve training efficiency and train larger models. For more information, see Use Arena to submit distributed PyTorch training jobs.
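A multi-node PyTorch job can be sketched by adding a worker count to the submission; the values below are illustrative assumptions:

```shell
# Hypothetical example: submit a distributed PyTorch job that runs
# across multiple instances (the count set by --workers).
arena submit pytorch \
    --name=pytorch-dist-demo \
    --gpus=1 \
    --workers=2 \
    --image=pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime \
    "python /app/main.py --backend gloo"
```

The training script itself must initialize a PyTorch distributed backend (for example gloo or nccl) to use the replicas.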

Elastic model training

ACK allows you to enable elastic model training based on scalable computing resources. You can dynamically adjust the amount of computing resources allocated to your training jobs based on actual workloads. This topic describes the benefits of elastic model training, including on-demand scaling, improved resource utilization, and cost optimization. You can configure elastic training policies to manage and utilize computing resources in a flexible and efficient manner. For more information, see Elastic model training on Kubernetes.
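One possible way to express an elastic training policy, assuming the Kubeflow training operator is installed in the cluster, is the `elasticPolicy` field of a PyTorchJob. This is a hedged sketch with illustrative values, not the specific mechanism this topic documents:

```yaml
# Hypothetical sketch: an elastic PyTorchJob that scales between 1 and 4 workers.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: elastic-demo
spec:
  elasticPolicy:
    rdzvBackend: c10d   # rendezvous backend used by torchrun for elastic workers
    minReplicas: 1
    maxReplicas: 4
  pytorchReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime
              command: ["torchrun", "--nnodes=1:4", "/app/train.py"]
```

With such a policy, workers can join or leave the job as cluster capacity changes, which is what enables the on-demand scaling described above.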

Run DeepSpeed distributed training jobs

DeepSpeed is a deep learning optimization framework. You can run DeepSpeed distributed training jobs on Kubernetes. This topic introduces the core features of DeepSpeed, including automatic mixed precision training, model sharding, and optimized model optimizers, and describes how to use DeepSpeed to submit distributed training jobs in an ACK cluster. You can refer to this topic to improve model training efficiency and train large-scale models. For more information, see DeepSpeed distributed training.
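The DeepSpeed features named above are typically switched on through a JSON configuration file passed to the training job. The following fragment is a minimal sketch with illustrative values; batch sizes and learning rate are assumptions, not recommendations:

```json
{
  "train_batch_size": 64,
  "gradient_accumulation_steps": 4,
  "fp16": { "enabled": true },
  "zero_optimization": { "stage": 2 },
  "optimizer": {
    "type": "AdamW",
    "params": { "lr": 3e-5 }
  }
}
```

Here `fp16` enables automatic mixed precision training, and `zero_optimization` stage 2 shards optimizer states and gradients across workers to reduce per-GPU memory usage.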

Summary

  • Standalone TensorFlow training: provides guidance for running standalone TensorFlow training jobs on Kubernetes.

  • Distributed TensorFlow training: provides guidance for running distributed TensorFlow training jobs on Kubernetes.

  • Arena: provides guidance for using Arena to submit standalone and distributed PyTorch training jobs. Arena simplifies the deployment and management of training jobs.

  • Elastic model training: provides guidance for enabling elastic model training based on the scaling capability of Kubernetes to improve resource utilization and reduce costs.

  • DeepSpeed distributed training: provides guidance for using DeepSpeed to optimize distributed training and train large-scale models.

The preceding features and tools provide comprehensive support for running ML and deep learning jobs on Kubernetes, and help you improve training efficiency, optimize resource utilization, and reduce operational costs.