PyTorch image classification batch inference on a Ray cluster - Elastic High Performance Computing

This topic describes the deployment process of a Ray cluster environment based on the E-HPC custom Ray cluster solution. It also introduces the basic practice of PyTorch-based image classification batch inference on the deployed Ray cluster. The Ray distributed computing framework is suitable for distributed training, simulation evaluation, and policy service execution tasks in AI scenarios, along with big data and various large-scale batch computing scenarios. It enables users to flexibly customize and develop functions according to specific business requirements.

Background information

Ray is a general-purpose open-source distributed computing framework, particularly suitable for machine learning, reinforcement learning, and other compute-intensive tasks. It provides a set of simple, flexible, efficient, and universal APIs to help developers easily build scalable distributed applications.

This topic uses E-HPC custom clusters and templates to deploy Ray clusters efficiently. The solution supports scale-out and scale-in of cloud resources such as CPUs and GPUs, which reduces learning and maintenance costs for AI application developers and improves research efficiency.

Preparations

Choose one of the following methods to create a Ray cluster.
Important
Currently, Ray clusters can be created in the following regions: Hangzhou, Shanghai, and Beijing.
1. Create a cluster by using a template. For more information, see Template creation.
2. Manually create a cluster. For more information, see Create a standard cluster.

In this example, the following configurations are used for the cluster:

Configuration item	Configuration
Series	Standard Edition
Deployment mode	Custom cluster
Cluster type	CUSTOM
Node configuration	Includes 1 logon node and 1 compute node. Logon node: ecs.c8a.xlarge instance type (4 vCPUs, 8 GiB memory). Compute node: ecs.c8a.xlarge instance type (4 vCPUs, 8 GiB memory). Important: After you create the cluster, scale out the Ray cluster compute nodes. For more information, see Create a node. Note: Adjust the instance types of the logon and compute nodes to match your business requirements.
Instance security group	Allow port 5901 (VNC access port) and port 8265 (Ray Dashboard access port). For more information, see Manage security group rules.
Image	Logon node/Compute node: ray_ubuntu20.04_v1.0. Note: Select ray_ubuntu20.04_v1.0 from Community Image.
System	Ubuntu 20.04 64-bit

Step 1: View the cluster status

Connect to the Ray cluster logon node remotely through Workbench. For more information, see Use Workbench to connect to a Linux instance over SSH.
Run the ray status command to view the cluster status.
```
ray status
```
View the Ray Dashboard.
1. Connect through VNC to open the GNOME graphical desktop, then enter the address IP:8265 in a browser to view the Ray Dashboard. IP is the logon node IP. For more information, see Connect to an instance by using VNC.
2. (Optional) If you need public network access, click Instance ID > Attach Elastic IP Address to configure an Elastic IP Address.
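Before opening the dashboard, it can help to confirm that the ports allowed in the security group are actually reachable from the machine you are on. The following is a minimal sketch that uses only the Python standard library. The function name `is_port_open` and the `HOST` value are illustrative assumptions and are not part of E-HPC or Ray.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Assumption: the check runs on the logon node itself.
    # From another machine, replace HOST with the logon node IP.
    HOST = "127.0.0.1"
    for port in (8265, 5901):  # Ray Dashboard and VNC ports from the table above
        state = "reachable" if is_port_open(HOST, port, timeout=1.0) else "not reachable"
        print(f"port {port}: {state}")
```

A "not reachable" result usually points to a missing security group rule or to a service that is not running, but the check cannot tell the two apart.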

Step 2: PyTorch image classification batch inference

There are two main ways to submit PyTorch inference tasks based on the created Ray cluster.

Submit through Ray job (recommended)
Note
The Ray job method is suitable for batch processing and for scenarios that do not require a persistent connection. You can submit jobs from the Ray cluster head node or from any remote machine that can reach port 8265 on the head node.
1. Execute a single task.
  1. Connect to the Ray cluster logon node remotely through Workbench. For more information, see Use Workbench to connect to a Linux instance over SSH.
  2. Download the test file images.tar and extract it to the /home/test directory.
```
# GNU tar detects compression automatically on extraction, so -z is not required.
tar -xvf images.tar -C /home/test/
```
  3. Download the ray_image_classify.py script to the /home/test directory.
  4. View the Ray cluster basic information.
```
python -c "import ray; ray.init(); print(ray.cluster_resources())"
```
  5. Submit a single Ray job, replacing IP with the logon node IP. Specify /home/test/images as the input file path and /home/test/images_prediction as the output path.
```
ray job submit --address http://IP:8265  --working-dir . -- /usr/local/fce/Python-3.11.9env/bin/python ray_image_classify.py /home/test/images /home/test/images_prediction
```
  6. After the job succeeds, check the classification prediction data and results.
    images is the input image data, images_prediction is the output data, and the predicted image classifications include tench, bittern, and coho.
  7. Connect through VNC and check the corresponding output directory. For more information, see Connect to an instance by using VNC.
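The same single-job submission can also be driven from Python through the Ray Jobs SDK. The sketch below is an illustration, not the method used in this topic: the helper names `build_entrypoint` and `submit_and_wait` are hypothetical, and it assumes that the `ray` package is installed on the submitting machine and that `http://IP:8265` is reachable from it.

```python
import shlex
import time

# Interpreter path used by the ray job submit command in this topic.
PYTHON_BIN = "/usr/local/fce/Python-3.11.9env/bin/python"


def build_entrypoint(script: str, input_dir: str, output_dir: str) -> str:
    """Build the shell command that Ray runs as the job entrypoint."""
    parts = (PYTHON_BIN, script, input_dir, output_dir)
    return " ".join(shlex.quote(part) for part in parts)


def submit_and_wait(address: str, entrypoint: str, poll_seconds: float = 5.0) -> str:
    """Submit one job through the Ray Jobs SDK and block until it finishes."""
    # Imported lazily so that build_entrypoint works without Ray installed.
    from ray.job_submission import JobStatus, JobSubmissionClient

    client = JobSubmissionClient(address)
    job_id = client.submit_job(
        entrypoint=entrypoint,
        runtime_env={"working_dir": "."},  # same role as --working-dir .
    )
    terminal = {JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.STOPPED}
    while client.get_job_status(job_id) not in terminal:
        time.sleep(poll_seconds)
    print(client.get_job_logs(job_id))
    return job_id


if __name__ == "__main__":
    cmd = build_entrypoint(
        "ray_image_classify.py", "/home/test/images", "/home/test/images_prediction"
    )
    print(cmd)
    # Uncomment on a machine that can reach the dashboard port,
    # and replace IP with the logon node IP.
    # submit_and_wait("http://IP:8265", cmd)
```

Quoting each argument with `shlex.quote` keeps the entrypoint valid when a directory name contains spaces.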
2. Execute batch tasks.
  For example, to classify all images in the images directory, submit one Ray job per subdirectory in batch and write the classification prediction results for each subdirectory.
  1. Create a ray_jobs_batch.sh script in the /home/test/ directory. In the export RAY_ADDRESS line, replace IP with the logon node IP.
```
#!/bin/bash

# Input and output directories
input_dir=$1
output_dir=$2

export RAY_ADDRESS="http://IP:8265"

# Check if the input directory exists
if [ ! -d "$input_dir" ]; then
  echo "Input directory does not exist: $input_dir"
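  # Use return when the script is sourced; fall back to exit when it is executed.
  return 1 2>/dev/null || exit 1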
fi

# Check if the output directory exists
if [ ! -d "$output_dir" ]; then
  echo "Output directory does not exist, creating: $output_dir"
  mkdir -p "$output_dir"
fi

# Traverse all subdirectories under input_dir
for subdir in "$input_dir"/*; do
  if [ -d "$subdir" ]; then
    subdir_name=$(basename "$subdir")
    input_subdir="${input_dir}/${subdir_name}"
    output_subdir="${output_dir}/${subdir_name}_prediction"
    
    # Submit Ray task
    echo "Submitting Ray job for directory: $input_subdir"
    ray job submit --no-wait --working-dir . -- /usr/local/fce/Python-3.11.9env/bin/python ray_image_classify.py "$input_subdir" "$output_subdir"
  fi
done

echo "All Ray jobs have been submitted."
```
  2. Execute the ray_jobs_batch.sh script.
```
source ray_jobs_batch.sh  /home/test/images  /home/test/images_prediction
```
  3. Run the ray job list command to view the task list.
```
ray job list
```
    You can also view the jobs in the Ray Dashboard, which lists both in-progress and completed jobs.
  4. The classification prediction data and results are as follows:
    images_input1/images_input2/images_input3 are the input image data, and images_input1_prediction/images_input2_prediction/images_input3_prediction are the output data. The predicted image classifications include tench, barracouta, and coho.
    For example, images_input1_prediction contains the images predicted as carp.
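The directory mapping that ray_jobs_batch.sh performs can be previewed in Python before any job is submitted. The sketch below mirrors the loop in the shell script: one job per subdirectory, with the output written to a sibling directory named `<subdirectory>_prediction`. The function name `plan_jobs` is hypothetical, and the sketch only computes paths; it does not contact the Ray cluster.

```python
import sys
from pathlib import Path
from typing import List, Tuple


def plan_jobs(input_dir: str, output_dir: str) -> List[Tuple[str, str]]:
    """Return (input_subdir, output_subdir) pairs, one per subdirectory."""
    root = Path(input_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Input directory does not exist: {input_dir}")
    pairs = []
    # Skip hidden directories, as the shell glob "$input_dir"/* does.
    subdirs = sorted(
        p for p in root.iterdir() if p.is_dir() and not p.name.startswith(".")
    )
    for sub in subdirs:
        target = Path(output_dir) / f"{sub.name}_prediction"
        pairs.append((str(sub), str(target)))
    return pairs


if __name__ == "__main__":
    # Usage: python plan_jobs.py /home/test/images /home/test/images_prediction
    if len(sys.argv) == 3:
        for src, dst in plan_jobs(sys.argv[1], sys.argv[2]):
            print(f"{src} -> {dst}")
```

Each printed pair corresponds to one `ray job submit` call in the shell script, so the number of lines should match the number of jobs shown by `ray job list`.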
Submit through Ray Client
Note
The Ray Client method is mainly suitable for interactive development and debugging task scenarios that require a persistent client connection.
1. Create a raytask.py script in the /home/test/ directory with the following content.
```
import ray

ray.init(address='auto')

# Define the task.
@ray.remote
def square(x):
    return x * x

# Start four parallel tasks.
futures = [square.remote(i) for i in range(4)]

# Get the results.
print(ray.get(futures))
# -> [0, 1, 4, 9]
```
2. Execute the raytask.py script.
```
python raytask.py
```
