Ray is a distributed computing framework for scaling AI and Python workloads on Container Service for Kubernetes (ACK). This topic walks you through submitting a job from inside the head node of an ACK-hosted Ray cluster to run distributed tasks such as model training, data processing, and model evaluation.
How it works
Submitting a job in a local Ray cluster follows these steps:
1. Connect to the head node Pod using kubectl exec.
2. Place your Python script on the head node.
3. Run the script. Ray distributes the work across the cluster automatically.
This approach runs the job from inside the cluster. For remote submission from your local machine, see Ray Client and the Ray Jobs CLI quickstart.
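As a sketch of the remote alternative, you can forward the Ray Dashboard port and use the Ray Jobs CLI from your local machine. The service name myfirst-ray-cluster-head-svc below is an assumption based on KubeRay's default naming; verify it against your own cluster.

```shell
# Forward the Ray Dashboard port (8265 by default) to your machine.
# The service name myfirst-ray-cluster-head-svc is an assumption; list
# services with: kubectl get svc -n ${RAY_CLUSTER_NS}
kubectl port-forward -n ${RAY_CLUSTER_NS} svc/myfirst-ray-cluster-head-svc 8265:8265 &

# Submit my_script.py from the current directory as a Ray job.
ray job submit --address http://127.0.0.1:8265 --working-dir . -- python my_script.py
```

Unlike kubectl exec, this keeps your script on your local machine; Ray uploads the working directory to the cluster for you.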
Prerequisites
Before you begin, ensure that you have:
- A Ray cluster created on ACK. See Create a Ray cluster on ACK.
- kubectl configured and connected to your ACK cluster.
- The ${RAY_CLUSTER_NS} environment variable set to the namespace where your Ray cluster is deployed.
Submit a Ray job
Step 1: Find the head node Pod
Run the following command to list the Pods in your Ray cluster namespace:
kubectl get pod -n ${RAY_CLUSTER_NS}
Expected output:
NAME                             READY   STATUS    RESTARTS   AGE
myfirst-ray-cluster-head-v7pbw   2/2     Running   0          39m
Note the head node Pod name. You need it in the next step.
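If you prefer not to copy the name by hand, a sketch that assumes a KubeRay-managed cluster (KubeRay labels the head Pod with ray.io/node-type=head) and a single Ray cluster in the namespace:

```shell
# Select the head Pod by its KubeRay label and store its name.
# Assumes one Ray cluster in ${RAY_CLUSTER_NS}.
HEAD_POD=$(kubectl get pod -n ${RAY_CLUSTER_NS} \
  -l ray.io/node-type=head \
  -o jsonpath='{.items[0].metadata.name}')
echo "${HEAD_POD}"
```

You can then use ${HEAD_POD} in place of the literal Pod name in the next step.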
Step 2: Connect to the head node Pod
Open a Bash shell inside the head node Pod. Replace myfirst-ray-cluster-head-v7pbw with your actual Pod name.
kubectl exec -it -n ${RAY_CLUSTER_NS} myfirst-ray-cluster-head-v7pbw -- bash
Step 3: Create the job script
Inside the Pod, use echo or cat to save the following script as my_script.py:
import ray
import os

# Connect to a local or remote Ray cluster
ray.init()

# Define a remote actor that runs on 1 CPU
@ray.remote(num_cpus=1)
class Counter:
    def __init__(self):
        self.name = "test_counter"
        self.counter = 0

    def increment(self):
        self.counter += 1

    def get_counter(self):
        return "{} got {}".format(self.name, self.counter)

counter = Counter.remote()

# Run 10,000 increments across the cluster, printing the
# counter value at each step
for _ in range(10000):
    print(ray.get(counter.get_counter.remote()))
    counter.increment.remote()
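For example, a sketch using a heredoc from the shell you opened in step 2; quoting 'EOF' keeps the shell from expanding anything inside the script:

```shell
cat <<'EOF' > my_script.py
import ray
import os

# Connect to a local or remote Ray cluster
ray.init()

# Define a remote actor that runs on 1 CPU
@ray.remote(num_cpus=1)
class Counter:
    def __init__(self):
        self.name = "test_counter"
        self.counter = 0

    def increment(self):
        self.counter += 1

    def get_counter(self):
        return "{} got {}".format(self.name, self.counter)

counter = Counter.remote()

# Run 10,000 increments across the cluster, printing the
# counter value at each step
for _ in range(10000):
    print(ray.get(counter.get_counter.remote()))
    counter.increment.remote()
EOF
```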
Step 4: Run the job
Run the script from inside the head node Pod:
python my_script.py
Expected output:
2024-01-24 04:25:27,286 INFO worker.py:1329 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2024-01-24 04:25:27,286 INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 172.16.0.236:6379...
2024-01-24 04:25:27,295 INFO worker.py:1633 -- Connected to Ray cluster. View the dashboard at http://172.16.0.236:8265
test_counter got 0
test_counter got 1
test_counter got 2
test_counter got 3
...
What's next
- Monitor the job: Access Ray Dashboard to view job status, resource utilization, and logs. For access from your local machine, see Access Ray Dashboard from the local network.
- Scale the cluster automatically: Use the Ray autoscaler with the ACK autoscaler to add or remove Elastic Compute Service (ECS) nodes based on workload. See Elastic scaling based on the Ray autoscaler and ACK autoscaler.
- Scale Elastic Container Instance (ECI) nodes: See Elastic scaling of Elastic Container Instance nodes based on the Ray autoscaler.