PyTorch is a widely used machine learning framework that helps model developers implement multi-machine, multi-GPU distributed training jobs. In Kubernetes, you can use PyTorchJob to submit training jobs that use the PyTorch framework. This topic describes how to use Kube Queue to manage jobs on a Fleet instance and how to use the gang scheduling feature when resources are assigned on a Fleet instance.
Architecture
To implement multi-machine and multi-GPU distributed training and ensure that the pods of a training job can run as expected, the workloads must comply with gang scheduling semantics during scheduling: either all workloads of the training job run, or none of them run. The Fleet instance uses Kube Queue and the ACK scheduler to schedule PyTorchJobs across multiple clusters and to ensure that the gang semantics declared in the PyTorchJob are maintained during scheduling.
Prerequisites
The cloud-native AI suite is installed in the sub-clusters. You need to deploy only the Arena component.
The Resource Access Management (RAM) policy AliyunAdcpFullAccess is attached to a RAM user. For more information, see Grant permissions to RAM users.
The AMC command-line tool is installed. For more information, see Use AMC.
(Optional) If resource reservation is enabled for a sub-cluster, the Fleet instance uses resource reservation to ensure that its scheduling result is the same as the scheduling result of the sub-cluster. If resource reservation is disabled for a sub-cluster, the Fleet instance evaluates resources and schedules jobs by comparing the total amount of remaining resources on all nodes in the sub-cluster with the total amount of resources requested by the job. The following steps describe how to enable resource reservation.
Note: To enable resource reservation, the Kubernetes version of the cluster must be 1.28 or later and the scheduler version must be 6.8.0 or later.
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Operations > Add-ons.
On the Add-ons page, find the Kube Scheduler component, click Configuration to go to the Kube Scheduler Parameters page, set the enableReservation parameter to true, and then click OK.
(Optional) Step 1: Use ack-kube-queue to queue jobs and manage quotas
In a multi-cluster Fleet, you can use the ElasticQuotaTree to manage quotas and configure queues. This helps you queue a large number of jobs in multi-user scenarios.
Cluster administrators can submit an ElasticQuotaTree on the Fleet instance to configure job queues. In this example, a quota is configured for the default namespace, which includes a total of 10000 CPUs, 10000 GiB of memory, 10000 GPUs, and 1 job that can be dequeued at a time. The following sample code provides an example of the ElasticQuotaTree:

apiVersion: scheduling.sigs.k8s.io/v1beta1
kind: ElasticQuotaTree
metadata:
  name: elasticquotatree # Only a single ElasticQuotaTree is supported.
  namespace: kube-system # The elastic quota group takes effect only if you create it in the kube-system namespace.
spec:
  root:
    name: root # The maximum amount of resources for the root must be equal to the minimum amount of resources for the root.
    max:
      cpu: 999900
      memory: 400000Gi
      kube-queue/max-jobs: 10000000000
      nvidia.com/gpu: 100000
    min:
      cpu: 9999
      memory: 40Gi
    children:
    - name: child-2
      max:
        # Only one job can be dequeued at a time.
        kube-queue/max-jobs: 1
        cpu: 10000
        nvidia.com/gpu: 10000
        memory: 10000Gi
      namespaces: # Configure the namespace.
      - default
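If you save the preceding manifest as a file, for example, elasticquotatree.yaml (the file name is only an assumption for this example), you can submit it on the Fleet instance by running the following command:

kubectl apply -f elasticquotatree.yaml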
Run the following command on the Fleet instance to view the ElasticQuotaTree and the queues created by Kube Queue:
kubectl get queue -n kube-queue
Expected output:
NAME                 AGE
root-child-2-v5zxz   15d
root-kdzw7           15d
Step 2: Submit a PyTorchJob for multi-cluster scheduling
Submit a PropagationPolicy on the Fleet instance and set the custom scheduling type to Gang.
Use only the Gang scheduling policy
When you use the gang scheduling feature, you must specify the customSchedulingType=Gang parameter in the submitted PropagationPolicy.

apiVersion: policy.one.alibabacloud.com/v1alpha1
kind: PropagationPolicy
metadata:
  name: example-policy # The default namespace is `default`.
spec:
  propagateDeps: true
  # Reschedule the job when the job fails to run.
  failover:
    application:
      decisionConditions:
        tolerationSeconds: 30
      purgeMode: Immediately
  placement:
    replicaScheduling:
      replicaSchedulingType: Divided
      # Use gang scheduling.
      customSchedulingType: Gang
  resourceSelectors:
  - apiVersion: kubeflow.org/v1
    kind: PyTorchJob
Use ElasticQuotaTree and gang scheduling
You must specify the customSchedulingType=Gang parameter in the submitted PropagationPolicy and set .Spec.Suspension.Scheduling to true. This way, jobs are added to the queue and wait to be scheduled.

apiVersion: policy.one.alibabacloud.com/v1alpha1
kind: PropagationPolicy
metadata:
  name: example-policy # The default namespace is `default`.
spec:
  propagateDeps: true
  # Suspend scheduling so that the job first enters the queue.
  suspension:
    scheduling: true
  # Reschedule the job when the job fails to run.
  failover:
    application:
      decisionConditions:
        tolerationSeconds: 30
      purgeMode: Immediately
  placement:
    replicaScheduling:
      replicaSchedulingType: Divided
      # Use gang scheduling.
      customSchedulingType: Gang
  resourceSelectors:
  - apiVersion: kubeflow.org/v1
    kind: PyTorchJob
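Whichever variant you use, save the PropagationPolicy to a file, for example, example-policy.yaml (the file name is only an assumption), and submit it on the Fleet instance:

kubectl apply -f example-policy.yaml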
The following sample code provides an example on how to submit a PyTorchJob on the Fleet instance.
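The sample below is a minimal sketch that is consistent with the scheduling results shown in this topic: the job is named pytorch-test and declares 1 master and 2 workers. The container image, command, and resource requests are placeholders; replace them with your own training workload.

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-test
  namespace: default # Use a namespace that is covered by the ElasticQuotaTree.
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: registry.example.com/pytorch-train:demo # Placeholder image. Replace it with your training image.
            command: ["python", "/workspace/train.py"]     # Placeholder command. Replace it with your training script.
            resources:
              requests:
                cpu: "4"
                memory: 8Gi
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: registry.example.com/pytorch-train:demo # Placeholder image. Replace it with your training image.
            command: ["python", "/workspace/train.py"]     # Placeholder command. Replace it with your training script.
            resources:
              requests:
                cpu: "4"
                memory: 8Gi

Assuming the manifest is saved as pytorch-test.yaml, run the following command on the Fleet instance to submit the PyTorchJob:

kubectl apply -f pytorch-test.yaml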
Step 3: View the job status
Run the following command on the Fleet instance to query the status of the PyTorchJob:
kubectl get pytorchjob
Expected output:
NAME           STATE     AGE
pytorch-test   Created   3m44s
Run the following command on the Fleet instance to check which associated cluster the PyTorchJob is scheduled to:
kubectl describe pytorchjob pytorch-test
Expected output:
Normal ScheduleBindingSucceed 4m59s default-scheduler Binding has been scheduled successfully. Result: {cfxxxxxx:0,[{master 1} {worker 2}]}
Run the following command on the Fleet instance to query the status of the PyTorchJob in the associated cluster:
kubectl amc get pytorchjob -M
Expected output:
NAME           CLUSTER    STATE     AGE     ADOPTION
pytorch-test   cfxxxxxx   Running   6m23s   Y
Run the following command on the Fleet instance to query the status of the pod:
kubectl amc get pod -M
Expected output:
NAME                    CLUSTER    READY   STATUS    RESTARTS   AGE
pytorch-test-master-0   cfxxxxxx   1/1     Running   0          7m16s
pytorch-test-worker-0   cfxxxxxx   1/1     Running   0          7m16s
pytorch-test-worker-1   cfxxxxxx   1/1     Running   0          7m16s
Run the following command on the Fleet instance to view the details of the PyTorchJob in the associated cluster:
kubectl amc get pytorchjob pytorch-test -m ${member-cluster-id} -o yaml