This document shows how to run a typical reinforcement learning job on an ACK cluster by using the VeRL framework and the Qwen2.5-3B-Instruct model, including environment preparation, image building, job submission, resource monitoring, and best practices.
Container Service for Kubernetes (ACK) provides an efficient, elastic, and scalable containerized platform for enterprises. Reinforcement learning (RL), a key branch of artificial intelligence, often involves substantial computing resources, distributed training, and complex environment simulations. With ACK, you can easily deploy, manage, and scale RL training jobs by using the scheduling capabilities of Kubernetes and the elastic infrastructure of Alibaba Cloud. The following figure shows the component architecture for this job.

Prerequisites
You have created an ACK managed cluster.
We recommend using GPU instances to accelerate training. This example uses one Lingjun node with eight GU8TF GPUs.
You have obtained the cluster kubeconfig and connected to the cluster by using kubectl.
You have installed the KubeRay Operator component.
(Optional) You have activated Object Storage Service (OSS) to persist model checkpoints, logs, and training data.
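Before you continue, you can quickly verify these prerequisites from the command line. A minimal sketch; the KubeRay Operator's namespace and pod labels depend on how the component was installed:

# Confirm that GPU nodes report allocatable nvidia.com/gpu resources
kubectl describe nodes | grep -A 5 "Allocatable" | grep nvidia.com/gpu
# Confirm that the KubeRay Operator pod is running (namespace and name may vary with your installation)
kubectl get pods -A | grep -i kuberay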
Step 1: Prepare the training image
This example uses the VeRL framework to run a reinforcement learning job. You can use the official VeRL image or build your own. If you build your own image, ensure that it includes all required dependencies, such as VeRL, vLLM, SGLang, and Ray. Here is an example Dockerfile:
FROM verl/verl:vllm012.latest
WORKDIR /home/verl
COPY . .
RUN apt update && apt install -y openssh-server vim
RUN apt remove python3-blinker -y; pip install -e .
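After preparing the Dockerfile, build the image and push it to a registry that your cluster can pull from. A minimal sketch, assuming a hypothetical Container Registry (ACR) repository; substitute your own registry address, namespace, and tag:

# Build the training image from the Dockerfile above
docker build -t registry.cn-hangzhou.cr.aliyuncs.com/<namespace>/verl:custom .
# Log in and push to your image registry (address and credentials are placeholders)
docker login registry.cn-hangzhou.cr.aliyuncs.com
docker push registry.cn-hangzhou.cr.aliyuncs.com/<namespace>/verl:custom

Step 2: Configure MCP Server and ACK Sandbox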
Install the MCP Server and Sandbox components.
Open-source version
# Clone the code repository
git clone https://github.com/openkruise/agents
cd agents
# Generate the deployment YAML for the agents operator
kubectl kustomize config/default > operator-install.yaml
# Modify the configuration in operator-install.yaml as needed, then apply it
kubectl apply -f operator-install.yaml
# Generate the deployment YAML for the test sandbox-manager. For more information, see
# https://github.com/openkruise/agents/blob/master/config/sandbox-manager/README.md
kubectl kustomize config/sandbox-manager > sandbox-manager.yaml
# The MCP code has not been merged yet, so you must manually change the sandbox-manager image to:
# baicun-business-registry.cn-beijing.cr.aliyuncs.com/baicun-dev/sandbox:sandbox-manager-v12
kubectl apply -f sandbox-manager.yaml
# Verify that the management pods are running correctly
kubectl get pod -l "app.kubernetes.io/name=sandbox-manager" -A
kubectl get pod -l "app.kubernetes.io/name=sandbox-controller-manager" -A

Marketplace version
On the Clusters page, click the name of the target cluster. In the left navigation pane, choose Add-ons.
Install the Ingress Controller and Sandbox-related components.
Install ack-agent-sandbox-controller with the default configuration.
Install ack-sandbox-manager:
Prepare an E2B domain name.
For detailed instructions on preparing a domain name, configuring DNS resolution, and applying for a certificate, see Use in a production environment.
Configure the component parameters.
Set className to alb (this example uses an installed ALB Ingress Controller), set domain to your actual domain name, and set adminApiKey to a custom API key. Keep other settings at their default values. After installation, an Ingress named sandbox-manager is created in the sandbox-system namespace. If you are using the ALB Ingress Controller, you must also add an HTTPS:443 listener configuration for both the ALB instance and the Ingress.
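To confirm the result, you can check the generated Ingress:

# Verify that the sandbox-manager Ingress exists and has an address assigned
kubectl get ingress sandbox-manager -n sandbox-system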
Save the following content as sandbox.yaml and run kubectl apply -f sandbox.yaml to deploy the Sandbox definition. The SandboxSet creates a warm pool of size 3. During reinforcement learning, the SandboxManager continuously consumes Sandboxes from this warm pool.

---
apiVersion: v1
kind: Service
metadata:
  name: mcp-sandbox
spec:
  selector:
    app.kubernetes.io/instance: release-name
    app.kubernetes.io/name: ack-sandbox-manager
    component: sandbox-manager
  type: ClusterIP
  sessionAffinity: None
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
  ports:
    - name: vllm
      protocol: TCP
      port: 8000
      targetPort: 18082
---
apiVersion: agents.kruise.io/v1alpha1
kind: SandboxSet
metadata:
  annotations:
    # Enable the Envd initialization capability of SandboxManager.
    e2b.agents.kruise.io/should-init-envd: "true"
  name: code-interpreter
  namespace: default
spec:
  # The size of the warm pool. We recommend setting this slightly larger than the estimated request burst.
  replicas: 3
  template:
    spec:
      initContainers:
        - name: init
          image: registry-cn-hangzhou.ack.aliyuncs.com/acs/agent-runtime:v0.0.1
          imagePullPolicy: IfNotPresent
          terminationMessagePolicy: File
          volumeMounts:
            - name: envd-volume
              mountPath: /mnt/envd
          env:
            - name: ENVD_DIR
              value: /mnt/envd
          restartPolicy: Always
      containers:
        - name: sandbox
          image: acs-image-test-01-registry.cn-hangzhou.cr.aliyuncs.com/e2b/code-interpreter:v1.6
          imagePullPolicy: IfNotPresent
          terminationMessagePolicy: File
          env:
            - name: ENVD_DIR
              value: /mnt/envd
          volumeMounts:
            - name: envd-volume
              mountPath: /mnt/envd
          lifecycle:
            postStart:
              exec:
                command:
                  - bash
                  - /mnt/envd/envd-run.sh
          startupProbe:
            failureThreshold: 20
            successThreshold: 1
            httpGet:
              path: /health
              port: 49999
              scheme: HTTP
            initialDelaySeconds: 1
            periodSeconds: 2
            timeoutSeconds: 1
      # Ensure fast container termination to increase the probability of reuse.
      terminationGracePeriodSeconds: 1
      restartPolicy: Always
      dnsPolicy: ClusterFirst
      volumes:
        - name: envd-volume
          emptyDir: {}
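After the manifest is applied, you can check that the warm pool has been created. A minimal sketch, assuming the sandbox pods run in the default namespace (the exact labels placed on sandbox pods depend on the controller version):

# Check the SandboxSet and its warm pool of sandbox pods
kubectl get sandboxset code-interpreter -n default
kubectl get pods -n default | grep code-interpreter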
(Optional) Step 3: Prepare the dataset
In VeRL, you can download datasets from a remote source by specifying data.train_files. However, because datasets are often large and require preprocessing, we recommend using a preprocessing job to download, preprocess, and upload the data to cloud storage in a production environment.
Save the following content as data.yaml and run kubectl apply -f data.yaml to download data from Hugging Face, preprocess it, and upload it to an OSS bucket. Note that the preprocessing script below expects GSM8K-style question and answer columns; adjust it if you download a different dataset.

---
apiVersion: v1
kind: Secret
metadata:
  name: hf-oss-credentials
  namespace: default
type: Opaque
stringData:
  # Hugging Face token
  HF_TOKEN: "hf_xxxxx"
  # Alibaba Cloud OSS credentials (the alibabacloud-oss-v2 SDK uses environment variables for authentication)
  akId: "xxx"
  akSecret: "xxx"
  OSS_REGION: "xxx"
  OSS_BUCKET: "xxx"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: preprocess-script
  namespace: default
data:
  preprocess.py: |
    #!/usr/bin/env python3
    """
    Example dataset preprocessing script
    """
    import os
    import re
    from datasets import load_from_disk

    # Dataset identifier recorded in each sample. Defaults to the GSM8K repository.
    data_source = os.environ.get("DATASET_NAME", "openai/gsm8k")

    def extract_solution(solution_str):
        """Extract the final numeric answer after `####` from a GSM8K-style solution."""
        match = re.search(r"#### (\-?[0-9\.\,]+)", solution_str)
        assert match is not None, f"Cannot parse answer from: {solution_str}"
        return match.group(1).replace(",", "")

    def preprocess_dataset(input_dir, output_dir):
        """Preprocess the dataset"""
        print(f"Loading dataset from {input_dir}")
        dataset = load_from_disk(input_dir)
        train_dataset = dataset["train"]
        test_dataset = dataset["test"]

        instruction_following = "Let's think step by step and output the final answer after `####`."

        # Add a row to each data item that represents a unique ID.
        def make_map_fn(split):
            def process_fn(example, idx):
                question_raw = example.pop("question")
                question = question_raw + " " + instruction_following
                answer_raw = example.pop("answer")
                solution = extract_solution(answer_raw)
                data = {
                    "data_source": data_source,
                    "agent_name": "tool_agent",
                    "prompt": [
                        {
                            "role": "system",
                            "content": (
                                "You are a math expert. You are given a question and you need to solve it step by step. "
                                "Reasoning step by step before any tool call. "
                                "You should use the `calc_gsm8k_reward` tool after step by step solving the question, "
                                "before generate final answer at least once and refine your answer if necessary. "
                                "Put your final answer in the format of `#### <answer>`."
                            ),
                        },
                        {
                            "role": "user",
                            "content": question,
                        },
                    ],
                    "ability": "math",
                    "reward_model": {"style": "rule", "ground_truth": solution},
                    "extra_info": {
                        "split": split,
                        "index": idx,
                        "answer": answer_raw,
                        "question": question_raw,
                        "need_tools_kwargs": True,
                        "tools_kwargs": {
                            "calc_gsm8k_reward": {
                                "create_kwargs": {"ground_truth": solution},
                                # "execute_kwargs": {},
                                # "calc_reward_kwargs": {},
                                # "release_kwargs": {},
                            },
                        },
                        "interaction_kwargs": {
                            "query": question,
                            "ground_truth": solution,
                        },
                    },
                }
                return data

            return process_fn

        train_dataset = train_dataset.map(function=make_map_fn("train"), with_indices=True, num_proc=8)
        test_dataset = test_dataset.map(function=make_map_fn("test"), with_indices=True, num_proc=8)

        # Save the processed dataset
        os.makedirs(output_dir, exist_ok=True)
        train_dataset.to_parquet(os.path.join(output_dir, "train.parquet"))
        test_dataset.to_parquet(os.path.join(output_dir, "test.parquet"))
        print(f"Processed dataset saved to {output_dir}")
        return output_dir

    if __name__ == "__main__":
        input_path = os.environ.get("INPUT_PATH", "/data/raw")
        output_path = os.environ.get("OUTPUT_PATH", "/data/processed")
        preprocess_dataset(input_path, output_path)
---
apiVersion: batch/v1
kind: Job
metadata:
  name: dataset-pipeline
  namespace: default
  labels:
    app: dataset-pipeline
spec:
  backoffLimit: 3
  template:
    metadata:
      labels:
        app: dataset-pipeline
    spec:
      restartPolicy: OnFailure
      volumes:
        # Preprocessing script
        - name: scripts
          configMap:
            name: preprocess-script
            defaultMode: 0755
      containers:
        - name: dataset-pipeline
          image: python:3.10-slim
          command:
            - /bin/bash
            - -c
            - |
              set -e

              #==========================================
              # Step 1: Install all dependencies
              #==========================================
              echo "=== Installing dependencies ==="
              pip install --no-cache-dir datasets huggingface_hub pandas numpy alibabacloud-oss-v2 Pillow

              #==========================================
              # Step 2: Download the dataset from Hugging Face
              #==========================================
              echo "=== Downloading dataset from Hugging Face ==="
              python3 << 'EOF'
              import os
              from datasets import load_dataset
              from huggingface_hub import login

              # Log in to Hugging Face (required for private datasets)
              hf_token = os.environ.get("HF_TOKEN")
              if hf_token:
                  login(token=hf_token)

              # Download the dataset
              dataset_name = os.environ.get("DATASET_NAME", "hiyouga/geometry3k")
              dataset_config = os.environ.get("DATASET_CONFIG", None)
              print(f"Downloading dataset: {dataset_name}")
              dataset = load_dataset(dataset_name, dataset_config)

              # Save locally
              output_path = "/data/raw"
              dataset.save_to_disk(output_path)
              print(f"Dataset saved to {output_path}")
              EOF
              echo "=== Download completed ==="

              #==========================================
              # Step 3: Run the preprocessing script
              #==========================================
              echo "=== Running preprocessing script ==="
              python3 /scripts/preprocess.py
              echo "=== Preprocessing completed ==="

              #==========================================
              # Step 4: Upload to OSS (using the alibabacloud-oss-v2 SDK)
              #==========================================
              echo "=== Uploading to OSS ==="
              python3 << 'EOF'
              import os
              from pathlib import Path
              import alibabacloud_oss_v2 as oss

              # OSS configuration
              bucket_name = os.environ["OSS_BUCKET"]
              region = os.environ["OSS_REGION"]
              oss_prefix = os.environ.get("OSS_PREFIX", "data/geo3k-processed/")
              local_path = os.environ.get("OUTPUT_PATH", "/data/processed")

              # Use the environment variable credentials provider (automatically reads OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET)
              credentials_provider = oss.credentials.EnvironmentVariableCredentialsProvider()

              # Load the default configuration and set the credentials provider
              cfg = oss.config.load_default()
              cfg.credentials_provider = credentials_provider
              cfg.region = region

              # Create an OSS client
              client = oss.Client(cfg)

              def upload_directory(local_dir, oss_prefix):
                  """Recursively upload a directory to OSS"""
                  local_path = Path(local_dir)
                  uploaded_count = 0
                  failed_count = 0
                  for file_path in local_path.rglob("*"):
                      if file_path.is_file():
                          relative_path = file_path.relative_to(local_path)
                          oss_key = f"{oss_prefix}{relative_path}"
                          try:
                              # Read file content
                              with open(file_path, 'rb') as f:
                                  data = f.read()
                              # Upload to OSS
                              result = client.put_object(oss.PutObjectRequest(
                                  bucket=bucket_name,
                                  key=oss_key,
                                  body=data,
                              ))
                              print(f"Uploaded: {file_path} -> {oss_key} (status: {result.status_code})")
                              uploaded_count += 1
                          except Exception as e:
                              print(f"Failed to upload {file_path}: {e}")
                              failed_count += 1
                  return uploaded_count, failed_count

              uploaded, failed = upload_directory(local_path, oss_prefix)
              print(f"=== Upload completed: {uploaded} files uploaded, {failed} files failed ===")
              if failed > 0:
                  raise Exception(f"{failed} files failed to upload")
              EOF
              echo "=== Pipeline completed successfully ==="
          env:
            # Hugging Face configuration
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-oss-credentials
                  key: HF_TOKEN
            - name: DATASET_NAME
              value: "hiyouga/geometry3k"
            - name: HF_HOME
              value: "/tmp/huggingface"
            # Preprocessing configuration
            - name: INPUT_PATH
              value: "/data/raw"
            - name: OUTPUT_PATH
              value: "/data/processed"
            # OSS configuration
            - name: OSS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: hf-oss-credentials
                  key: akId
            - name: OSS_ACCESS_KEY_SECRET
              valueFrom:
                secretKeyRef:
                  name: hf-oss-credentials
                  key: akSecret
            - name: OSS_REGION
              valueFrom:
                secretKeyRef:
                  name: hf-oss-credentials
                  key: OSS_REGION
            - name: OSS_BUCKET
              valueFrom:
                secretKeyRef:
                  name: hf-oss-credentials
                  key: OSS_BUCKET
            - name: OSS_PREFIX
              value: "data/geo3k-processed/"
          volumeMounts:
            - name: scripts
              mountPath: /scripts
          resources:
            requests:
              memory: "2Gi"
              cpu: "1"
            limits:
              memory: "16Gi"
              cpu: "4"
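You can follow the pipeline's progress and confirm completion with standard Job commands:

# Stream the pipeline logs and check the Job status
kubectl logs -f job/dataset-pipeline -n default
kubectl get job dataset-pipeline -n default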
Step 4: Apply job configuration
Save the following content as pvpvc.yaml and run kubectl apply -f pvpvc.yaml to statically provision the OSS bucket by creating a PersistentVolume and a PersistentVolumeClaim. The following example uses an AccessKey pair for authentication. For RRSA authentication, see Use a static PersistentVolume with ossfs 2.0.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: ym-dataset
  labels:
    alicloud-pvname: ym-dataset
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: ym-dataset # Must be the same as the PV name.
    nodePublishSecretRef:
      name: hf-oss-credentials
      namespace: default
    volumeAttributes:
      bucket: "xxxx" # Replace with your actual bucket name.
      url: "oss-ap-southeast-1-internal.aliyuncs.com" # Replace with your actual OSS endpoint.
      otherOpts: "-o umask=022 -o max_stat_cache_size=100000 -o allow_other -o dbglevel=debug -o curldbg"
      path: "/"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ym-dataset
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 20Gi
  selector:
    matchLabels:
      alicloud-pvname: ym-dataset
# (Optional) The model can be downloaded on demand by specifying a Hugging Face repository path.
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ym-models
  labels:
    alicloud-pvname: ym-models
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: ym-models # Must be the same as the PV name.
    nodePublishSecretRef:
      name: hf-oss-credentials
      namespace: default
    volumeAttributes:
      bucket: "xxxx" # Replace with your actual bucket name.
      url: "oss-ap-southeast-1-internal.aliyuncs.com" # Replace with your actual OSS endpoint.
      otherOpts: "-o umask=022 -o max_stat_cache_size=100000 -o allow_other -o dbglevel=debug -o curldbg"
      path: "/"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ym-models
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 20Gi
  selector:
    matchLabels:
      alicloud-pvname: ym-models
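Before moving on, confirm that both claims are bound:

# Both PVCs should report STATUS Bound once the OSS volumes are matched
kubectl get pvc ym-dataset ym-models -n default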
Save the following content as configs.yaml and run kubectl apply -f configs.yaml to apply the job-related configurations. Replace the url value in mcp_server.json with the Sandbox MCP Ingress endpoint and set api_key to your key (if an API key is needed, you can add an NGINX container to the Ray cluster as a proxy). Because this file must be valid JSON, set the values directly instead of annotating them with comments.

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: gsm8k-configs
  namespace: default
data:
  gsm8k_multiturn_grpo.yaml: |
    hydra:
      searchpath:
        - file://verl/trainer/config
    defaults:
      - ppo_trainer
      - _self_
    data:
      max_prompt_length: 1024
      max_response_length: 1024
      train_batch_size: 256
      return_raw_chat: True
    actor_rollout_ref:
      hybrid_engine: True
      rollout:
        name: vllm
        multi_turn:
          enable: True
          max_assistant_turns: 5
  mcp_server.json: |
    {
      "mcpServers": {
        "Tavily Expert": {
          "url": "xxxxx",
          "api_key": "xxxxx"
        }
      }
    }
  gsm8k_mcp_tool_config.yaml: |
    tools:
      - class_name: verl.tools.mcp_search_tool.MCPSearchTool
        config:
          rate_limit: 120
          timeout: 120
          type: mcp
          mcp:
            mcp_servers_config_path: /var/configs/mcp_server.json
            tool_selected_list:
              - run_code_once
      - class_name: "verl.tools.gsm8k_tool.Gsm8kTool"
        config:
          type: native
        tool_schema:
          type: "function"
          function:
            name: "calc_gsm8k_reward"
            description: "A tool for calculating the reward of gsm8k. (1.0 if parsed answer is correct, 0.0 if parsed answer is incorrect or not correctly parsed)"
            parameters:
              type: "object"
              properties:
                answer:
                  type: "string"
                  description: "The model's answer to the GSM8K math problem, must be digits"
              required: ["answer"]
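You can confirm that the ConfigMap was created and inspect the rendered files:

# List the configuration files stored in the ConfigMap
kubectl get configmap gsm8k-configs -n default -o yaml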
Step 5: Submit the job
In VeRL, you can use the MCPSearchTool to query tools provided by the MCP Server. At the start of each rollout, an AgentLoop connects to the MCP Server and calls tools during the multi-turn conversation.
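Before submitting the job, you can sanity-check that the MCP endpoint is reachable from inside the cluster. A minimal sketch using the mcp-sandbox Service created in Step 2; the exact HTTP path and response depend on your MCP server implementation:

# Launch a one-off pod and probe the MCP service port
kubectl run mcp-probe --rm -it --restart=Never --image=curlimages/curl -- \
  curl -v http://mcp-sandbox.default.svc.cluster.local:8000/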
Save the following content as rayjob.yaml and run kubectl apply -f rayjob.yaml to submit the reinforcement learning job.

---
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-example
  namespace: default
spec:
  shutdownAfterJobFinishes: false
  # ttlSecondsAfterFinished: 300
  runtimeEnvYAML: |
    working_dir: /home/verl
  submissionMode: SidecarMode
  entrypoint: |
    python3 -m verl.trainer.main_ppo \
      --config-path=/var/configs \
      --config-name='gsm8k_multiturn_grpo' \
      algorithm.adv_estimator=grpo \
      data.train_batch_size=16 \
      data.max_prompt_length=1024 \
      data.max_response_length=1024 \
      data.filter_overlong_prompts=True \
      data.truncation='error' \
      data.return_raw_chat=True \
      actor_rollout_ref.model.path=/var/model/Qwen2.5-3B-Instruct \
      actor_rollout_ref.actor.optim.lr=1e-6 \
      actor_rollout_ref.model.use_remove_padding=True \
      actor_rollout_ref.actor.ppo_mini_batch_size=8 \
      actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
      actor_rollout_ref.actor.use_kl_loss=True \
      actor_rollout_ref.actor.kl_loss_coef=0.001 \
      actor_rollout_ref.actor.kl_loss_type=low_var_kl \
      actor_rollout_ref.actor.entropy_coeff=0 \
      actor_rollout_ref.model.enable_gradient_checkpointing=True \
      actor_rollout_ref.actor.fsdp_config.param_offload=False \
      actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
      actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
      actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
      actor_rollout_ref.rollout.name=vllm \
      actor_rollout_ref.rollout.mode=async \
      actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
      actor_rollout_ref.rollout.n=16 \
      actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
      actor_rollout_ref.ref.fsdp_config.param_offload=True \
      actor_rollout_ref.rollout.trace.backend=mlflow \
      actor_rollout_ref.rollout.trace.token2text=True \
      algorithm.use_kl_in_reward=False \
      trainer.critic_warmup=0 \
      trainer.logger='["console","mlflow"]' \
      trainer.project_name='gsm8k_tool-agent' \
      trainer.experiment_name='qwen2.5-3b_function_rm-gsm8k-vllm-tool-agent-verify-n16' \
      trainer.n_gpus_per_node=8 \
      trainer.nnodes=1 \
      trainer.save_freq=1 \
      trainer.test_freq=20 \
      trainer.total_training_steps=1 \
      data.train_files=/var/model-dataset/processed-gsm8k/train20.parquet \
      data.val_files=/var/model-dataset/processed-gsm8k/test100.parquet \
      actor_rollout_ref.rollout.multi_turn.tool_config_path="/var/configs/gsm8k_mcp_tool_config.yaml" \
      actor_rollout_ref.actor.checkpoint.save_contents='["hf_model", "model"]' \
      trainer.total_epochs=1
  rayClusterSpec:
    headGroupSpec:
      rayStartParams:
        dashboard-host: 0.0.0.0
      serviceType: ClusterIP
      template:
        metadata:
          annotations: {}
          labels: {}
        spec:
          affinity: {}
          tolerations:
            # Tolerate the Lingjun node taint.
            - key: node-role.alibabacloud.com/lingjun
              operator: Exists
          containers:
            - name: ray-head
              image: registry-ap-southeast-1.ack.aliyuncs.com/dev/verl:vllm012.latest.43dc9a44
              imagePullPolicy: IfNotPresent
              env:
                - name: VERL_ROOT
                  value: /home/verl
              resources:
                limits:
                  cpu: "100"
                  memory: 500Gi
                  nvidia.com/gpu: "8"
              securityContext:
                runAsUser: 0
              volumeMounts:
                - mountPath: /var/configs
                  name: configs
                - mountPath: /var/model
                  name: model
                - mountPath: /var/model-dataset
                  name: model-dataset
          imagePullSecrets:
            - name: regcred-hangzhou
            - name: regcred-ap-southeast
          volumes:
            - name: configs
              configMap:
                name: gsm8k-configs
            - name: model
              persistentVolumeClaim:
                claimName: ym-models
            - name: model-dataset
              persistentVolumeClaim:
                claimName: ym-dataset
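After submission, you can monitor the job and open the Ray dashboard. A minimal sketch; the head pod and service names are generated by KubeRay, so look them up first:

# Watch the RayJob status until it reaches RUNNING / SUCCEEDED
kubectl get rayjob rayjob-example -n default -w

# Find the Ray head service created for this job, then port-forward the dashboard
kubectl get svc -n default | grep rayjob-example
kubectl port-forward svc/<head-service-name> 8265:8265 -n default

# Tail the training logs from the head pod
kubectl logs -f -n default -l ray.io/node-type=head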
