Elastic High Performance Computing: PyTorch image classification batch inference on a custom Ray cluster

Last Updated: Jun 16, 2025

This topic describes how to deploy a Ray cluster environment by using the E-HPC custom Ray cluster solution, and then walks through basic PyTorch-based image classification batch inference on the deployed cluster. The Ray distributed computing framework is suited to distributed training, simulation and evaluation, and policy serving in AI scenarios, as well as big data and other large-scale batch computing workloads. You can customize and extend it to meet specific business requirements.

Background information

Ray is a general-purpose open-source distributed computing framework, particularly suitable for machine learning, reinforcement learning, and other compute-intensive tasks. It provides a set of simple, flexible, efficient, and universal APIs to help developers easily build scalable distributed applications.

This topic combines E-HPC custom clusters and templates to provide an efficient way to deploy Ray clusters. The solution also supports scaling cloud resources such as CPUs and GPUs out and in, which lowers the learning and maintenance costs for AI application developers and helps improve research efficiency.

Preparations

  1. Choose one of the following methods to create a Ray cluster.

    Important

    Currently, Ray clusters can be created in the following regions: Hangzhou, Shanghai, and Beijing.

    1. Create a cluster by using a template. For more information, see Template creation.

      image

    2. Manually create a cluster. For more information, see Create a standard cluster.

  2. In this example, the following configurations are used for the cluster:

    Series: Standard Edition

    Deployment mode: Custom cluster

    Cluster type: CUSTOM

    Node configuration: 1 logon node and 1 compute node with the following specifications:

    • Logon node: ecs.c8a.xlarge instance type with 4 vCPUs and 8 GiB of memory.

    • Compute node: ecs.c8a.xlarge instance type with 4 vCPUs and 8 GiB of memory.

      Important

      After you create the cluster, scale out the compute nodes of the Ray cluster. For more information, see Create a node.

      Note

      Adjust the instance types of the logon and compute nodes to increase or decrease resources according to your business requirements.

    Instance security group: Allow port 5901 (VNC access port) and port 8265 (Ray Dashboard access port). For more information, see Manage security group rules.

    Image: ray_ubuntu20.04_v1.0 for both the logon node and the compute node.

      Note

      Select ray_ubuntu20.04_v1.0 from Community Image.

      image

    System: Ubuntu 20.04 64-bit

Step 1: View the cluster status

  1. Connect to the Ray cluster logon node remotely through Workbench. For more information, see Use Workbench to connect to a Linux instance over SSH.

    Run the ray status command to view the cluster status.

    ray status

    image.png

  2. View the Ray Dashboard.

    1. Connect to the GNOME graphical desktop through VNC, and then enter the address IP:8265 in a browser to view the Ray Dashboard. For more information, see Connect to an instance by using VNC.

      image

    2. (Optional) If you need public network access, click Instance ID > Attach Elastic IP Address to configure an Elastic IP Address.

      image
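Before you open the dashboard or submit jobs from another machine, it can help to confirm that the ports allowed in the security group are actually reachable. The following is a minimal sketch that uses only the Python standard library. The helper name port_is_open and the placeholder address are assumptions for illustration and are not part of E-HPC or Ray.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unreachable: treat all of these as "not open".
        return False


if __name__ == "__main__":
    head_ip = "127.0.0.1"  # assumption: replace with the logon node IP
    for name, port in (("VNC", 5901), ("Ray Dashboard", 8265)):
        state = "reachable" if port_is_open(head_ip, port) else "not reachable"
        print(f"{name} port {port}: {state}")
```

A result of "not reachable" only shows that the TCP connection failed. The cause may be a security group rule, a service that is not running, or a network path issue.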

Step 2: PyTorch image classification batch inference

You can submit PyTorch inference tasks to the created Ray cluster in two main ways.

  1. Submit through Ray job (recommended)

    Note

    The Ray job method is suitable for batch processing and for scenarios that do not require a persistent connection. You can submit jobs from the head node of the Ray cluster or from a remote machine that can reach port 8265 on the head node.

    1. Execute a single task.

      1. Connect to the Ray cluster logon node remotely through Workbench. For more information, see Use Workbench to connect to a Linux instance over SSH.

      2. Download the test file images.tar and extract it to the /home/test directory.

        tar -xvf images.tar -C /home/test/
      3. Download the ray_image_classify.py script to the /home/test directory.

      4. View the Ray cluster basic information.

        python -c "import ray; ray.init(address='auto'); print(ray.cluster_resources())"
      5. Submit a single Ray job, replacing IP with the logon node IP. Specify /home/test/images as the input file path and /home/test/images_prediction as the output path.

        ray job submit --address http://IP:8265  --working-dir . -- /usr/local/fce/Python-3.11.9env/bin/python ray_image_classify.py /home/test/images /home/test/images_prediction
      6. The successful execution output is as follows:

        image

        The classification prediction data and results are as follows: images is the input image data, images_prediction is the output data, and the predicted image classifications include tench, bittern, and coho.

        image

      7. Connect through VNC and check the corresponding output directory. For more information, see Connect to an instance by using VNC.

        image
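The single-job submission above can also be assembled from Python, which is convenient when the address or paths come from configuration. The following is a minimal sketch: the helper name build_submit_command is hypothetical, and it only builds the same ray job submit command line used in this topic (--address, --working-dir, and --no-wait are the flags that appear in the commands here). It does not call any Ray API.

```python
import shlex

# Interpreter path taken from the commands in this topic.
CLUSTER_PYTHON = "/usr/local/fce/Python-3.11.9env/bin/python"


def build_submit_command(address, script, script_args, working_dir=".", no_wait=False):
    """Build the argv list for `ray job submit` (hypothetical helper)."""
    cmd = ["ray", "job", "submit", "--address", address, "--working-dir", working_dir]
    if no_wait:
        cmd.append("--no-wait")
    # Everything after "--" is the entrypoint that runs on the cluster.
    cmd += ["--", CLUSTER_PYTHON, script, *script_args]
    return cmd


if __name__ == "__main__":
    cmd = build_submit_command(
        "http://IP:8265",  # replace IP with the logon node IP
        "ray_image_classify.py",
        ["/home/test/images", "/home/test/images_prediction"],
    )
    print(shlex.join(cmd))
    # To actually submit, run the command on a machine that has the ray CLI:
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems if a path contains spaces.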

    2. Execute batch tasks.

      The following example classifies all images in the images directory by submitting one Ray job per subdirectory and writing the classification predictions for each job to its own output directory.

      1. Create a ray_jobs_batch.sh script in the /home/test/ directory. Replace IP in the export RAY_ADDRESS line with the logon node IP.

        #!/bin/bash
        
        # Input and output directories
        input_dir=$1
        output_dir=$2
        
        export RAY_ADDRESS="http://IP:8265"
        
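        # Check if the input directory exists
        if [ ! -d "$input_dir" ]; then
          echo "Input directory does not exist: $input_dir"
          # Return when sourced, exit when executed, so that a sourced run does not close the shell
          return 1 2>/dev/null || exit 1
        fi
        
        # Create the output directory if it does not exist
        if [ ! -d "$output_dir" ]; then
          echo "Output directory does not exist, creating: $output_dir"
          mkdir -p "$output_dir"
        fi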
        
        # Traverse all subdirectories under input_dir
        for subdir in "$input_dir"/*; do
          if [ -d "$subdir" ]; then
            subdir_name=$(basename "$subdir")
            input_subdir="${input_dir}/${subdir_name}"
            output_subdir="${output_dir}/${subdir_name}_prediction"
            
            # Submit Ray task
            echo "Submitting Ray job for directory: $input_subdir"
            ray job submit --no-wait --working-dir . -- /usr/local/fce/Python-3.11.9env/bin/python ray_image_classify.py "$input_subdir" "$output_subdir"
          fi
        done
        
        echo "All Ray jobs have been submitted."
      2. Execute the ray_jobs_batch.sh script.

        source ray_jobs_batch.sh  /home/test/images  /home/test/images_prediction
      3. Run the ray job list command to view the task list.

        ray job list

        image

        You can also view the jobs on the Ray Dashboard. The jobs in progress are shown below:

        image

        The completed jobs are shown below:

        image

      4. The classification prediction data and results are as follows:

        images_input1/images_input2/images_input3 are the input image data, and images_input1_prediction/images_input2_prediction/images_input3_prediction are the output data. The predicted image classifications include tench, barracouta, and coho.

        image

        An example of the predicted carp classification image in images_input1_prediction is shown below:

        image.png
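The directory mapping that ray_jobs_batch.sh applies (each input subdirectory is paired with an output directory named with a _prediction suffix) can be previewed in Python before any job is submitted. The following is a minimal sketch: the function name plan_batch_jobs is hypothetical, and the demo runs against a temporary directory rather than /home/test so that it works on any machine.

```python
import tempfile
from pathlib import Path


def plan_batch_jobs(input_dir, output_dir):
    """Return (input_subdir, output_subdir) pairs, one per subdirectory."""
    input_path = Path(input_dir)
    if not input_path.is_dir():
        raise FileNotFoundError(f"Input directory does not exist: {input_dir}")
    jobs = []
    # Skip hidden entries to mirror the shell glob "$input_dir"/* in the script.
    subdirs = sorted(
        p for p in input_path.iterdir() if p.is_dir() and not p.name.startswith(".")
    )
    for subdir in subdirs:
        jobs.append((str(subdir), str(Path(output_dir) / f"{subdir.name}_prediction")))
    return jobs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Demo layout only; on the cluster the input would be /home/test/images.
        for name in ("images_input1", "images_input2", "images_input3"):
            (Path(tmp) / "images" / name).mkdir(parents=True)
        plan = plan_batch_jobs(Path(tmp) / "images", Path(tmp) / "images_prediction")
        for src, dst in plan:
            print(f"{src} -> {dst}")
```

Each pair corresponds to one ray job submit call in the shell script, so the length of the returned list is the number of jobs that would be submitted.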

  2. Submit through Ray Client

    Note

    The Ray Client method is mainly suitable for interactive development and debugging scenarios that require a persistent client connection.

    1. Create a raytask.py script in the /home/test/ directory with the following content.

      import ray
      
      ray.init(address='auto')
      
      # Define the task.
      @ray.remote
      def square(x):
          return x * x
      
      # Start four parallel tasks.
      futures = [square.remote(i) for i in range(4)]
      
      # Get the results.
      print(ray.get(futures))
      # -> [0, 1, 4, 9]
    2. Execute the raytask.py script.

      python raytask.py

      image
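The raytask.py script uses address='auto', which attaches to a Ray instance that is already running on the machine where the script is executed. To connect from a machine outside the cluster, Ray Client uses an address of the form ray://HOST:PORT, and 10001 is the default Ray Client server port. The following is a minimal sketch of how the address could be chosen. The helper name ray_init_address is hypothetical, and whether port 10001 is enabled on this image and allowed in the security group is an assumption, because only ports 5901 and 8265 are listed in this topic.

```python
def ray_init_address(head_ip=None, client_port=10001):
    """Pick the address string to pass to ray.init() (hypothetical helper)."""
    if head_ip is None:
        # Running on a cluster node: attach to the Ray instance on this machine.
        return "auto"
    # Running elsewhere: go through the Ray Client server on the head node.
    # Assumption: the client port is open and reachable from this machine.
    return f"ray://{head_ip}:{client_port}"


if __name__ == "__main__":
    print(ray_init_address())              # auto
    print(ray_init_address("192.0.2.10"))  # ray://192.0.2.10:10001
    # With Ray installed, the result would be used as:
    #   import ray
    #   ray.init(address=ray_init_address("192.0.2.10"))
```

The address 192.0.2.10 is a documentation placeholder. Replace it with the logon node IP.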