How to create and manage a RAY resource group

Important

The RAY resource group is currently in invitational preview. To request access, contact Lindorm technical support (DingTalk ID: s0s3eg3).

The RAY resource group for the Lindorm compute engine provides distributed computing for end-to-end AI workload processing. It is compatible with the Ray computing model and programming interfaces, and integrates with Lindorm's multi-model storage engine to handle data preprocessing, training, and inference tasks.

Limitations

A RAY resource group cannot currently be modified or restarted after it is created.

Prerequisites

Before you begin, ensure that you have:

Enabled LindormTable
Enabled the Lindorm compute engine

Billing

RAY resource groups run in persistent mode (the Resident running mode). Fees consist of two parts:

Persistent resource fees: Charged in compute units (CUs) based on the persistent resources configured for the head and worker nodes.
Elastic resource fees: Worker nodes scale elastically with the workload. Worker nodes added by elastic scaling are charged in CUs for the time they are in use.
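
The two fee components can be illustrated with a small calculation. This is only a sketch: this page does not state how node specifications map to CUs, the unit price of a CU, or the billing time unit, so the CU counts are passed in as inputs, hours are assumed as the time unit, and all names (`ElasticUsage`, `total_cu_hours`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ElasticUsage:
    """One period during which extra worker nodes were scaled out."""
    cus: float    # CUs consumed per hour by the extra nodes (hypothetical input)
    hours: float  # how long those nodes were in use


def total_cu_hours(persistent_cus: float, hours: float,
                   elastic: list[ElasticUsage]) -> float:
    """Persistent CUs accrue for the whole period; elastic CUs only while in use."""
    persistent_part = persistent_cus * hours
    elastic_part = sum(u.cus * u.hours for u in elastic)
    return persistent_part + elastic_part


# Hypothetical numbers: 12 persistent CUs over 24 hours, plus two scale-out periods.
usage = [ElasticUsage(cus=8, hours=2.5), ElasticUsage(cus=16, hours=1)]
print(total_cu_hours(12, 24, usage))  # 288 persistent + 36 elastic = 324.0
```

Multiplying the result by the CU unit price for your region would give an estimated fee. Refer to the Lindorm billing documentation for the actual rates.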

Create a RAY resource group

Note

Creation takes about 20 minutes to complete.

Log on to the Lindorm console. In the upper-left corner, select the region of your instance. On the Instances page, click the instance ID or click View Instance Details in the Actions column.
On the Instance Details page, in the Configurations section, click Resource Groups in the Operations column for Compute Engine.

On the Resource Group Details page, click Create Resource Group and configure the following parameters:

Parameter	Description
Resource group type	Select RAY.
Resource group name	Enter a name of up to 63 characters that contains only lowercase letters and digits. Example: `raycg`.
Running mode	Defaults to Resident. In Resident mode, the Ray cluster is always running. When no jobs are running, the cluster operates with minimal resources. When a job is submitted, the cluster dynamically requests resources based on the job's requirements.

Configure the head node and worker groups:

Head node

Parameter	Description
Head resource type	Select CPU or GPU. For GPU resources, contact Lindorm technical support (DingTalk ID: s0s3eg3). GPU resources are subject to machine type and inventory limitations.
Head resource specifications	For CPU: select a quota such as 4 cores 8 GB, 4 cores 16 GB, or 8 cores 32 GB. Default: 4 cores 16 GB. For GPU: contact Lindorm technical support (DingTalk ID: s0s3eg3).
Head disk size	Disk space for storing logs, memory overflow files, and job resource files. Default: 30 GB.

Worker groups

Select one or more worker groups. Each worker group can have different resource specifications.

Parameter	Description
Worker resource type	Select CPU or GPU. For GPU resources, contact Lindorm technical support (DingTalk ID: s0s3eg3). GPU resources are subject to machine type and inventory limitations.
Worker resource specifications	For CPU: select a quota such as 4 cores 8 GB, 4 cores 16 GB, or 8 cores 32 GB. Default: 4 cores 16 GB. For GPU: contact Lindorm technical support (DingTalk ID: s0s3eg3).
Worker disk space	Disk space for storing logs, memory overflow files, and job resource files. Default: 30 GB.
Minimum number of workers	The minimum number of replicas in the worker group. The cluster maintains this count when no jobs are running.
Maximum number of workers	The maximum number of replicas that can be provisioned when jobs are running.

Click OK.
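
Before entering values in the console, it can help to check a planned configuration against the rules in the tables above. The following is a minimal sketch, not part of any Lindorm SDK: the function names are hypothetical, the rejection of negative worker counts is an assumption, and the clamping in `target_workers` only illustrates how the minimum and maximum bound the worker count, not the actual scaling algorithm.

```python
import re

# Rule from the parameter table: lowercase letters and digits only, at most 63 characters.
NAME_PATTERN = re.compile(r"[a-z0-9]{1,63}")


def is_valid_group_name(name: str) -> bool:
    """Return True if the name satisfies the documented naming rule."""
    return NAME_PATTERN.fullmatch(name) is not None


def check_worker_bounds(min_workers: int, max_workers: int) -> None:
    """Reject bounds that cannot describe a valid worker group (assumed rules)."""
    if min_workers < 0:
        raise ValueError("minimum number of workers cannot be negative")
    if max_workers < min_workers:
        raise ValueError("maximum number of workers must be >= minimum")


def target_workers(requested: int, min_workers: int, max_workers: int) -> int:
    """Clamp a requested worker count into [min_workers, max_workers]."""
    check_worker_bounds(min_workers, max_workers)
    return max(min_workers, min(requested, max_workers))


print(is_valid_group_name("raycg"))    # True
print(is_valid_group_name("Ray-CG"))   # False: uppercase letters and a hyphen
print(target_workers(0, 1, 4))         # 1: idle cluster keeps the minimum
print(target_workers(10, 1, 4))        # 4: demand is capped at the maximum
```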

Manage a RAY resource group

After a RAY resource group is created, use the built-in WebUI to monitor its running status and manage jobs.

Log on to the Lindorm console. In the upper-left corner, select the region of your instance. On the Instances page, click the instance ID or click View Instance Details in the Actions column.
On the Instance Details page, in the Configurations section, click Resource Groups in the Operations column for Compute Engine.
On the Resource Group Details page, hover over WebUI in the Actions column for the RAY resource group to get its address. For example: `http://alb-57k7r581oht8rd****.cn-hangzhou.alb.aliyuncsslb.com/ray/raycg/dashboard/`.

Open the WebUI address in a browser. The WebUI provides four tabs:

Tab	What you can see
Jobs	All submitted jobs and their status
Cluster	Resource usage for all nodes, including CPU, memory, GPU, and Object Store
Actors	Active actors in the cluster
Logs	Cluster logs
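
If you work with several resource groups, a script can confirm which group a WebUI address belongs to. The sketch below assumes that the path always follows the `/ray/<resource group name>/dashboard/` layout seen in the single example address above; that layout is inferred, not documented, and `parse_webui_address` is a hypothetical helper.

```python
from urllib.parse import urlparse


def parse_webui_address(address: str) -> dict:
    """Split a WebUI address into its host and resource group name."""
    parsed = urlparse(address)
    parts = [p for p in parsed.path.split("/") if p]
    # Assumed layout: /ray/<resource group name>/dashboard/
    if len(parts) != 3 or parts[0] != "ray" or parts[2] != "dashboard":
        raise ValueError(f"unexpected WebUI path: {parsed.path!r}")
    return {"host": parsed.hostname, "resource_group": parts[1]}


# The masked example address from this page, used as-is.
info = parse_webui_address(
    "http://alb-57k7r581oht8rd****.cn-hangzhou.alb.aliyuncsslb.com/ray/raycg/dashboard/"
)
print(info["resource_group"])  # raycg
```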

(Optional) To delete a resource group, click Delete in the Actions column on the Resource Group Details page.