
Container Service for Kubernetes: Submit a job in a Ray cluster

Last Updated: Mar 26, 2026

Ray is a distributed computing framework for scaling AI and Python workloads on Container Service for Kubernetes (ACK). This topic walks you through submitting a job from inside the head node of an ACK-hosted Ray cluster to run distributed tasks such as model training, data processing, and model evaluation.

How it works

Submitting a job from the head node of a Ray cluster involves three steps:

  1. Connect to the head node Pod using kubectl exec.

  2. Place your Python script on the head node.

  3. Run the script. Ray distributes the work across the cluster automatically.

This approach runs the job from inside the cluster. For remote submission from your local machine, see Ray Client and the Ray Jobs CLI quickstart.
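As a rough sketch of the remote alternative, the Ray Jobs CLI can submit the same script from your local machine, provided the Ray dashboard port (8265) is reachable. The service name below follows the KubeRay convention of `<cluster-name>-head-svc` and is an assumption; verify it with `kubectl get svc -n ${RAY_CLUSTER_NS}`.

```shell
# Assumption: the head service is named myfirst-ray-cluster-head-svc.
# Forward the dashboard port to your local machine first:
kubectl port-forward -n ${RAY_CLUSTER_NS} service/myfirst-ray-cluster-head-svc 8265:8265 &

# Then submit the job; --working-dir uploads the current directory
# (including my_script.py) to the cluster before running it.
ray job submit --address http://127.0.0.1:8265 --working-dir . -- python my_script.py
```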

Prerequisites

Before you begin, ensure that you have:

  • A Ray cluster created on ACK. See Create a Ray cluster on ACK.

  • kubectl configured and connected to your ACK cluster.

  • The ${RAY_CLUSTER_NS} environment variable set to the namespace where your Ray cluster is deployed.
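For example, if your Ray cluster lives in a namespace named `ray-system` (the name here is only an illustration; substitute your own namespace):

```shell
# Set the namespace variable used by the kubectl commands in this topic.
# "ray-system" is an example value, not a requirement.
export RAY_CLUSTER_NS=ray-system
```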

Submit a Ray job

Step 1: Find the head node Pod

Run the following command to list the Pods in your Ray cluster namespace:

kubectl get pod -n ${RAY_CLUSTER_NS}

Expected output:

NAME                                           READY   STATUS    RESTARTS   AGE
myfirst-ray-cluster-head-v7pbw                 2/2     Running   0          39m

Note the head node Pod name. You need it in the next step.
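Instead of copying the name by hand, you can capture it in a variable. This sketch assumes your cluster is managed by KubeRay, which labels the head Pod with `ray.io/node-type=head`:

```shell
# Assumption: KubeRay's ray.io/node-type=head label is present on the head Pod.
HEAD_POD=$(kubectl get pod -n ${RAY_CLUSTER_NS} \
  -l ray.io/node-type=head -o jsonpath='{.items[0].metadata.name}')
echo ${HEAD_POD}
```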

Step 2: Connect to the head node Pod

Open a Bash shell inside the head node Pod. Replace myfirst-ray-cluster-head-v7pbw with your actual Pod name.

kubectl exec -it -n ${RAY_CLUSTER_NS} myfirst-ray-cluster-head-v7pbw -- bash

Step 3: Create the job script

Inside the Pod, use echo or cat to save the following script as my_script.py:

import ray
import os

# Connect to the existing Ray cluster (the head node sets RAY_ADDRESS)
ray.init()

# Define a remote actor that runs on 1 CPU
@ray.remote(num_cpus=1)
class Counter:
    def __init__(self):
        self.name = "test_counter"
        self.counter = 0

    def increment(self):
        self.counter += 1

    def get_counter(self):
        return "{} got {}".format(self.name, self.counter)

counter = Counter.remote()

# Submit 10,000 increments; method calls on an actor execute sequentially
# on the actor process, and ray.get() blocks until each result is ready
for _ in range(10000):
    counter.increment.remote()
    print(ray.get(counter.get_counter.remote()))
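One convenient way to create the file inside the Pod is a quoted heredoc, which writes the text verbatim without shell expansion:

```shell
# Write the file verbatim; the quoted 'EOF' delimiter prevents the shell
# from expanding anything inside the heredoc body.
cat <<'EOF' > my_script.py
# ... paste the script above here ...
EOF
```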

Step 4: Run the job

python my_script.py

Expected output:

2024-01-24 04:25:27,286	INFO worker.py:1329 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2024-01-24 04:25:27,286	INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 172.16.0.236:6379...
2024-01-24 04:25:27,295	INFO worker.py:1633 -- Connected to Ray cluster. View the dashboard at http://172.16.0.236:8265
test_counter got 1
test_counter got 2
test_counter got 3
test_counter got 4
...
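If you prefer not to keep an interactive shell open, the same job can be launched in one shot from outside the Pod. This assumes `my_script.py` was saved in the container's working directory; adjust the path and Pod name to match your setup.

```shell
# Assumption: my_script.py is in the default working directory of the
# ray-head container. Replace the Pod name with your own.
kubectl exec -n ${RAY_CLUSTER_NS} myfirst-ray-cluster-head-v7pbw -- python my_script.py
```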

What's next