Elastic High Performance Computing: Scheduling container jobs using a Standard Edition SLURM cluster

Last Updated: May 07, 2025

This topic describes the entire process, from cluster creation to job scheduling, for users who need to run containerized workloads in a SLURM cluster.

Step 1: Create a cluster

  1. Create a Standard Edition SLURM cluster.

    The following list describes the example cluster configuration used in this topic. Configure other parameters as needed.

    Cluster Configuration

    • Series: Standard Edition

    • Deployment Mode: Public Cloud Cluster

    • Cluster Type: SLURM

    • Control Plane Node:

      • Instance Type: ecs.r7.xlarge (4 vCPUs, 32 GiB of memory)

      • Image: centos_7_6_x64_20G_alibase_20211130.vhd

    Compute Nodes And Queues

    • Number Of Queue Nodes: 1 initial node

    • Node Interconnection: eRDMA network

    • Instance Type Group:

      • Instance Type: ecs.gn7i-c56g1.14xlarge (56 vCPUs, 346 GiB of memory)

        Important

        You must use an instance type that supports GPUs. For more information, see Instance family.

      • Image: centos_7_6_x64_20G_alibase_20211130.vhd

    Shared File Storage

    • /home Cluster Mount Directory and /opt Cluster Mount Directory: By default, the /home and /opt directories of the control plane node are mounted to the file system as shared storage directories.

    Software And Service Components

    • Software To Install: Select docker.

    • Service Components Available For Installation: Logon Node

      • Instance Type: ecs.r7.xlarge (4 vCPUs, 32 GiB of memory)

      • Image: centos_7_6_x64_20G_alibase_20211130.vhd

  2. Create a cluster user.

    In this topic, the usertest user is used as an example.
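    To confirm that the user is available, you can run a quick check on the logon node (a minimal sketch; it assumes the user's home directory is created under the shared /home directory):

    # Verify that the cluster user exists and has a home directory on the shared storage
    id usertest
    ls -ld /home/usertest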

Step 2: Set up the basic software environment

  1. Associate an elastic IP address (EIP) with the compute node. For more information, see Elastic IP Address.

  2. Connect to the compute node remotely.
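    For example, you can connect over SSH by using the EIP associated in the previous step (a sketch; replace the placeholder with your own EIP):

    # Log on to the compute node as root over SSH
    ssh root@<EIP-of-compute-node>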

  3. Download and install CUDA.

    1. Download the CUDA installation package.

      cd /opt
      wget https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux.run
    2. Install CUDA.

      yum install -y git
      sh /opt/cuda_12.4.1_550.54.15_linux.run

      If the installer finishes and prints its installation summary without errors, CUDA is installed.

    3. Configure the environment variables.

      echo 'export PATH=/usr/local/cuda-12.4/bin:$PATH' >> ~/.bashrc
      echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
      
      source ~/.bashrc
    4. View the installation status and version information of the NVIDIA CUDA toolkit and GPU driver.

      # Version information of the NVIDIA CUDA compiler driver
      nvcc --version
      
      # Detailed status information of the GPU
      nvidia-smi

      If both commands return version and GPU status information without errors, CUDA and the GPU driver are working properly.

  4. Install and configure NVIDIA Container Toolkit.

    # Add the NVIDIA Container Toolkit yum repository, install the toolkit, and restart Docker
    curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
    sudo yum install -y nvidia-container-toolkit
    sudo systemctl restart docker
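    To confirm that Docker can access the GPU, you can run nvidia-smi in a disposable container (a minimal check; it assumes the PyTorch image pulled in a later step, but any CUDA-enabled image works):

    # The toolkit injects the host driver utilities, so nvidia-smi is available inside the container
    docker run --rm --gpus all ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/pytorch:2.4.0-cuda12.1.1-py310-alinux3.2104 nvidia-smi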
  5. Download and install Singularity.

    Singularity is a containerization tool that allows you to run containers without changing the user environment. It is commonly used in HPC environments.

    cd /opt
    wget https://public-ehs.oss-cn-hangzhou.aliyuncs.com/softwares/packages/CentOS_7.2_64/singularity-3.8.3-1.el7.x86_64.rpm
    yum install -y /opt/singularity-3.8.3-1.el7.x86_64.rpm
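    To confirm the installation, print the Singularity version:

    # Verify the Singularity installation
    singularity --version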
  6. Create job dependency data.

    1. Pull the PyTorch container image.

      docker pull ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/pytorch:2.4.0-cuda12.1.1-py310-alinux3.2104
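      After the pull completes, you can confirm that the image is available locally; its repository name and tag are also needed for the job scripts later in this topic:

      # List the local PyTorch image
      docker images | grep pytorch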
    2. Use the usertest user to create a main.py file.

      vim /home/usertest/main.py
    3. The script content of the main.py file is as follows.

      # -*- coding: utf-8 -*-
      
      import torch
      import torchvision
      import torchvision.transforms as transforms
      from torch import nn
      from torch.utils.data import DataLoader
      from torch.optim import SGD
      
      class SimpleNet(nn.Module):
          def __init__(self):
              super(SimpleNet, self).__init__()
              self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
              self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
              self.fc1 = nn.Linear(64 * 8 * 8, 128)
              self.fc2 = nn.Linear(128, 10)
              self.pool = nn.MaxPool2d(2, 2)
              self.relu = nn.ReLU()
              self.dropout = nn.Dropout(0.5)
      
          def forward(self, x):
              x = self.pool(self.relu(self.conv1(x)))
              x = self.pool(self.relu(self.conv2(x)))
              x = x.view(-1, 64 * 8 * 8)
              x = self.relu(self.fc1(x))
              x = self.dropout(x)
              x = self.fc2(x)
              return x
      
      device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
      print(f"Using device: {device}")
      
      transform = transforms.Compose([
          transforms.ToTensor(),
          transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
      ])
      
      train_dataset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
      test_dataset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
      
      train_loader = DataLoader(dataset=train_dataset, batch_size=64, shuffle=True)
      test_loader = DataLoader(dataset=test_dataset, batch_size=64, shuffle=False)
      
      model = SimpleNet().to(device)
      criterion = nn.CrossEntropyLoss()
      optimizer = SGD(model.parameters(), lr=0.001, momentum=0.9)
      
      num_epochs = 10
      for epoch in range(num_epochs):
          model.train()
          running_loss = 0.0
          for i, data in enumerate(train_loader, 0):
              inputs, labels = data
              inputs, labels = inputs.to(device), labels.to(device)
      
              optimizer.zero_grad()
              outputs = model(inputs)
              loss = criterion(outputs, labels)
              loss.backward()
              optimizer.step()
      
              running_loss += loss.item()
              if i % 100 == 99: 
                  print(f"[{epoch + 1}, {i + 1}] loss: {running_loss / 100:.3f}")
                  running_loss = 0.0
      
          model.eval()
          correct = 0
          total = 0
          with torch.no_grad():
              for data in test_loader:
                  images, labels = data
                  images, labels = images.to(device), labels.to(device)
                  outputs = model(images)
                  _, predicted = torch.max(outputs.data, 1)
                  total += labels.size(0)
                  correct += (predicted == labels).sum().item()
      
          print(f"Accuracy on test set: {100 * correct / total:.2f}%")
      
      print("Training finished.")

Step 3: Schedule jobs

Schedule Docker jobs

Docker script introduction

#!/bin/bash
#SBATCH --job-name=your_job_name
#SBATCH --output=your_output
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --ntasks=2

# args
iamges="your_image"
run_cmd="your_command"
work_dir="your_workdir"
share_dir="/ehpcdata/:/mnt"
devices="device=$CUDA_VISIBLE_DEVICES"

# cleanup docker handle
function cleanup {
    echo "Caught signal, stopping Docker container: " $SLURM_JOB_NAME
    docker ps -q --filter label=$SLURM_JOB_NAME | xargs -r docker stop
    docker ps -qa --filter label=$SLURM_JOB_NAME | xargs -r docker rm
}
trap cleanup SIGINT SIGTERM

# start docker
docker pull $image
srun docker run \
  --label $SLURM_JOB_NAME \
  --rm --net=host \
  --gpus '"'$devices'"' \
  -v $share_dir \
  $image \
  /bin/bash -c "$run_cmd" &

# wait to complete
wait
cleanup
Note

Script description:

  1. Docker runtime resources are allocated by the Slurm scheduler.

  2. The GPU IDs (CUDA_VISIBLE_DEVICES) are allocated by Slurm and passed to the Docker runtime through the --gpus parameter. This approach provides physical GPU isolation, and the GPU indexes are remapped inside the container (nvidia-smi numbering starts from 0).

  3. Commands inside the Docker container run as the root user by default. Slurm environment variables are not automatically passed into the container; they must be passed explicitly in the docker run command (see the sketch after this list). This is different from Singularity.

  4. Because Docker containers do not exit automatically when a job is canceled with scancel, the script uses a signal handler to ensure that containers started by the Slurm job are stopped when the job completes or exits. The container is labeled with the job name, and when the job exits, the handler filters by that label to stop and remove the corresponding container.

  5. The job script includes the following parts:

    1. Slurm scheduling parameters, including resource requirements, job name (required), input/output information, etc.

    2. Environment variable settings, including run commands and image address (repository or local). These are used as Docker startup parameters.

    3. Cleanup handler, which is fixed and does not need to be modified.

    4. Docker startup command.

    5. Exit (wait to complete).

  6. Docker run command:

    1. --label: Labels the container with the Slurm job name so that the cleanup handler can find and stop it.

    2. --gpus: GPU ID used by the container, allocated by the Slurm scheduler.

    3. -v: Specifies the container shared directory. It is recommended to map the working directory including code and models to the container. Note: The container uses root privileges internally, and files created by default belong to root.

    4. Image & command: Sets the image name and execution command.

Submit jobs through E-HPC Portal

  1. Log on to E-HPC Portal.

  2. Submit a job.

    1. In the top navigation bar, select Job Management, click submitter at the top of the page, and on the Create Job page, set Number Of Compute Nodes to 1, Number Of Tasks to 2, and Number Of GPUs to 1.

    2. The job script content is as follows.

      Use the docker images command to obtain the image name and version number, and replace your_image in the third line.

      #!/bin/bash
      
      image="your_image"
      run_cmd="python main.py"
      share_dir="/home/usertest/:/root"
      
      # cleanup docker handle
      function cleanup {
          echo "Caught signal, stopping Docker container: " $SLURM_JOB_NAME
          docker ps -q --filter label=$SLURM_JOB_NAME | xargs -r docker stop
          docker ps -qa --filter label=$SLURM_JOB_NAME | xargs -r docker rm
      }
      trap cleanup SIGINT SIGTERM
      cleanup
      
      # start docker
      # docker pull $image
      
      docker run \
        --label $SLURM_JOB_NAME \
        --gpus "device=0" \
        -v $share_dir \
        $image \
        /bin/bash -c "$run_cmd" &
      
      # wait to complete
      wait
      cleanup
  3. Query jobs.

    Go to the Job Management page to view the job list, which includes job status, job operations, and more. For more information, see Query jobs.

Submit jobs through the command line

  1. Submit jobs through the command line. For more information, see SLURM.

  2. The job script content is as follows.

    Use the docker images command to obtain the image name and version number, and replace your_image in the thirteenth line.

    #!/bin/bash
    
    #SBATCH --job-name=tf_sample_job
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --gpus-per-task=1
    #SBATCH --time=01:00:00
    #SBATCH --partition=comp
    #SBATCH --output=tf_sample_job_%j.out
    #SBATCH --error=tf_sample_job_%j.err
    
    # Define variables
    image="your_image"
    run_cmd="python main.py"
    share_dir="/home/usertest/:/root"
    
    # Clean up Docker processes
    function cleanup {
        echo "Caught signal, stopping Docker container: " $SLURM_JOB_NAME
        docker ps -q --filter label=$SLURM_JOB_NAME | xargs -r docker stop
        docker ps -qa --filter label=$SLURM_JOB_NAME | xargs -r docker rm
    }
    trap cleanup SIGINT SIGTERM
    cleanup
    
    # Start Docker container
    docker pull $image
    docker run \
      --label $SLURM_JOB_NAME \
      --gpus "device=$CUDA_VISIBLE_DEVICES" \
      -v $share_dir \
      $image \
      /bin/bash -c "$run_cmd" &
    
    # Wait for task completion
    wait
    cleanup
  3. Query jobs through Slurm.

    Use the squeue command to query the list of jobs that are currently running or queued.

    squeue

    Use the sacct command to query job history, including completed jobs.

    sacct
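    For a more readable job history, sacct also accepts a --format option (the column list below is only an example):

    # Show job ID, name, partition, state, elapsed time, and exit code
    sacct --format=JobID,JobName,Partition,State,Elapsed,ExitCode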

Schedule Singularity jobs

  1. Convert a Docker image to a Singularity image (Docker2Singularity).

    To simplify Singularity image management, you can reuse Docker image repositories in the cloud. Follow these steps to convert image formats.

    # Method 1: Convert to sif image through local docker package
    [root@compute006 opt]# docker images
    REPOSITORY                                               TAG               IMAGE ID       CREATED       SIZE
    ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/pytorch   2.4.0-cuda12.1.1-py310-alinux3.2104   19301a07d7fd   4 months ago   6.33GB
    [root@compute006 opt]# docker save -o docker.tar 19301a07d7fd 
    [root@compute006 opt]# ll docker.tar 
    -rw------- 1 root root 3021202432 Feb  12 15:03 docker.tar
    [root@compute006 opt]# singularity build pytorch.sif docker-archive:///opt/docker.tar
    
    
    # Method 2: Build sif image through docker container repository
    singularity build xx.sif docker://ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/pytorch:2.4.0-cuda12.1.1-py310-alinux3.2104
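    Before submitting jobs, you can verify that the converted image works with the GPU (a minimal check; it assumes the pytorch.sif file built above):

    # Run a short PyTorch GPU check inside the converted image
    singularity exec --nv /opt/pytorch.sif python -c "import torch; print(torch.cuda.is_available())"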

Singularity script introduction

#!/bin/bash
#SBATCH --job-name=my_job_name
#SBATCH --output=output.log
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --ntasks=2

# args
image=your_image.sif
run_cmd="your cmd"
share_dir="/ehpcdata/:/mnt"

singularity exec --nv --bind $share_dir $image $run_cmd
Note

Script description:

  1. Singularity supports both root and non-root users. The user context remains unchanged when the container starts, and the user environment is the same inside and outside the container. This is different from Docker, where Slurm environment variables must be explicitly passed into the container.

  2. Singularity does not provide physical GPU isolation. The application must select GPUs itself based on the GPU IDs allocated by the scheduler (CUDA_VISIBLE_DEVICES), for example: "CUDA_VISIBLE_DEVICES=0 ./deviceQuery" (see the sketch after this list).

  3. The job script includes the following parts:

    1. Slurm scheduling parameters, including resource requirements, job name (required), input/output information, etc.

    2. Environment variable settings, including run commands and image address (repository or local). These are used as Singularity startup parameters.

    3. Singularity startup command.

  4. Singularity exec command:

    1. --nv: Used to enable support for NVIDIA GPUs.

    2. --bind: Specifies the container shared directory. It is recommended to map the working directory including code and models to the container. Note: The container uses root privileges internally, and files created by default belong to root.

    3. Image & command: Sets the image name and execution command.

Submit jobs through E-HPC Portal

  1. Log on to E-HPC Portal.

  2. Submit a job.

    1. In the top navigation bar, select Job Management, click submitter at the top of the page, and on the Create Job page, set Number Of Compute Nodes to 1, Number Of Tasks to 2, and Number Of GPUs to 1.

    2. The job script content is as follows.

      image=/opt/pytorch.sif
      run_cmd="python main.py"
      share_dir="/home/usertest/:/root"
      
      singularity exec --nv --bind $share_dir $image $run_cmd
  3. Query jobs.

    Go to the Job Management page to view the job list, which includes job status, job operations, and more. For more information, see Query jobs.

Submit jobs through the command line

  1. Submit jobs through the command line. For more information, see SLURM.

  2. The job script content is as follows.

    #!/bin/bash
    #SBATCH --job-name=singularity_TF
    #SBATCH --output=output.log
    #SBATCH --nodes=1
    #SBATCH --gres=gpu:1
    #SBATCH --partition=container
    #SBATCH --ntasks=2
    
    
    image=/opt/pytorch.sif
    run_cmd="python main.py"
    share_dir="/home/usertest/:/root"
    
    singularity exec --nv --bind $share_dir $image $run_cmd
  3. Query jobs through Slurm.

    Use the squeue command to query the list of jobs that are currently running or queued.

    squeue

    Use the sacct command to query job history, including completed jobs.

    sacct