This topic describes the entire process from cluster creation to job scheduling for users who need to run containerized tasks in a SLURM cluster.
Step 1: Create a cluster
Create a standard SLURM cluster.
The following list describes the example cluster configuration used in this topic. Configure other parameters as needed.
- Cluster Configuration:
  - Series: Standard Edition
  - Deployment Mode: Public cloud cluster
  - Cluster Type: SLURM
- Management Node:
  - Instance Type: ecs.r7.xlarge (4 vCPUs and 32 GiB of memory)
  - Image: centos_7_6_x64_20G_alibase_20211130.vhd
- Compute Node and Queue:
  - Queue: Compute Nodes
  - Initial nodes: 1
  - Inter-node Interconnection: VPC Network
  - Instance Type Group:
    - Instance Type: ecs.gn7i-c56g1.14xlarge (56 vCPUs and 346 GiB of memory)
      Important: You must use an instance type that supports GPUs. For more information, see Instance family.
    - Image: centos_7_6_x64_20G_alibase_20211130.vhd
- Shared File Storage:
  - /home Cluster Mount Directory and /opt Cluster Mount Directory: by default, the /home and /opt directories of the control plane node are mounted to the file system as shared storage directories.
- Software and Service Component:
  - Software To Be Installed: select docker.
  - Installable Service Components: configure as needed.
- Logon Node:
  - Instance Type: ecs.r7.xlarge (4 vCPUs and 32 GiB of memory)
  - Image: centos_7_6_x64_20G_alibase_20211130.vhd
In this topic, the usertest user is used as an example.
Step 2: Set up the basic software environment
Attach an Elastic IP Address (EIP) to the compute node so that it can access the Internet to download installation packages.
Download and install CUDA.
Download the CUDA installation package.
```bash
cd /opt
wget https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux.run
```
Install CUDA.
```bash
yum install -y git
sh /opt/cuda_12.4.1_550.54.15_linux.run
```
CUDA is installed when the installer completes and prints its installation summary.
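The runfile installer is interactive by default. If you prefer an unattended installation, the NVIDIA runfile also supports a silent mode; this variant is based on the standard runfile options and is not part of the original steps:

```bash
# Non-interactive alternative: install the driver and toolkit without prompts
sh /opt/cuda_12.4.1_550.54.15_linux.run --silent --driver --toolkit
```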

Configure the environment variables.
```bash
echo 'export PATH=/usr/local/cuda-12.4/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
```
View the installation status and version information of the NVIDIA CUDA toolkit and GPU driver.
```bash
# Version information of the NVIDIA CUDA compiler driver
nvcc --version
# Detailed status information of the GPU
nvidia-smi
```
If both commands return version and status information as expected, CUDA and the GPU driver are working correctly.

Install and configure NVIDIA Container Toolkit.
```bash
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo yum install -y nvidia-container-toolkit
sudo systemctl restart docker
```
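Depending on your toolkit version, you may also need to register the NVIDIA runtime with Docker (sudo nvidia-ctk runtime configure --runtime=docker) before restarting Docker; this depends on your environment and is not part of the original steps. To confirm that Docker can access the GPU, you can run nvidia-smi inside any CUDA-enabled container; the sketch below assumes the PyTorch image pulled later in this topic is available:

```bash
# Optional sanity check: nvidia-smi should list the node's GPU from inside the container
sudo docker run --rm --gpus all \
    ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/pytorch:2.4.0-cuda12.1.1-py310-alinux3.2104 \
    nvidia-smi
```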
Download and install Singularity.
Singularity is a containerization tool that allows you to run containers without changing the user environment. It is commonly used in HPC environments.
```bash
cd /opt
wget https://public-ehs.oss-cn-hangzhou.aliyuncs.com/softwares/packages/CentOS_7.2_64/singularity-3.8.3-1.el7.x86_64.rpm
yum install -y /opt/singularity-3.8.3-1.el7.x86_64.rpm
```
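Optionally, confirm the installation with a quick check (not part of the original steps):

```bash
# Print the installed Singularity version
singularity --version
```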
Create job dependency data.
Pull the PyTorch container image. You can run the following command as is.
```bash
docker pull ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/pytorch:2.4.0-cuda12.1.1-py310-alinux3.2104
```
Use the usertest user to create a main.py file.
```bash
vim /home/usertest/main.py
```
The script content of the main.py file is as follows:
```python
# -*- coding: utf-8 -*-
import torch
import torchvision
import torchvision.transforms as transforms
from torch import nn
from torch.utils.data import DataLoader
from torch.optim import SGD

class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(64 * 8 * 8, 128)
        self.fc2 = nn.Linear(128, 10)
        self.pool = nn.MaxPool2d(2, 2)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.5)

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))
        x = self.pool(self.relu(self.conv2(x)))
        x = x.view(-1, 64 * 8 * 8)
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

train_dataset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
test_dataset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
train_loader = DataLoader(dataset=train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(dataset=test_dataset, batch_size=64, shuffle=False)

model = SimpleNet().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = SGD(model.parameters(), lr=0.001, momentum=0.9)

num_epochs = 10
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for i, data in enumerate(train_loader, 0):
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        if i % 100 == 99:
            print(f"[{epoch + 1}, {i + 1}] loss: {running_loss / 100:.3f}")
            running_loss = 0.0

model.eval()
correct = 0
total = 0
with torch.no_grad():
    for data in test_loader:
        images, labels = data
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f"Accuracy on test set: {100 * correct / total:.2f}%")
print("Training finished.")
```
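Optionally, before involving the scheduler, you can check that the container runtime, GPU passthrough, and the shared directory all line up. This one-liner is an illustrative check, not part of the original procedure:

```bash
# Sanity check: the container should report that CUDA is available (True)
docker run --rm --gpus all \
    -v /home/usertest/:/root \
    ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/pytorch:2.4.0-cuda12.1.1-py310-alinux3.2104 \
    python -c "import torch; print(torch.cuda.is_available())"
```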
Step 3: Schedule jobs
Schedule Docker jobs
Submit jobs through E-HPC Portal
Submit a training job.
In the top navigation bar, select Task Management, click Submitter at the top of the page, and on the Create Job page, set Number Of Nodes to 1, Number Of Tasks (per node) to 2, and Number Of GPUs to 1.
The job script content is as follows.
Use the docker images command to obtain the image name and version number, and use them to replace your_image in the script.
```bash
#!/bin/bash
image="your_image"
run_cmd="python main.py"
share_dir="/home/usertest/:/root"

# Clean up Docker containers that belong to this job
function cleanup {
    echo "Caught signal, stopping Docker container: " $SLURM_JOB_NAME
    docker ps -q --filter label=$SLURM_JOB_NAME | xargs -r docker stop
    docker ps -qa --filter label=$SLURM_JOB_NAME | xargs -r docker rm
}
trap cleanup SIGINT SIGTERM
cleanup

# Start the Docker container
# docker pull $image
docker run \
    --label $SLURM_JOB_NAME \
    --gpus "device=0" \
    -v $share_dir \
    $image \
    /bin/bash -c "$run_cmd" &

# Wait for the container to complete
wait
cleanup
```
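For reference, with the image pulled earlier in this topic, the assignment would look like the following; the exact value depends on your docker images output:

```bash
image="ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/pytorch:2.4.0-cuda12.1.1-py310-alinux3.2104"
```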
Query jobs.
Go to the Task Management page to view the job list with job status and actions. For more information, see Query jobs.
Submit jobs through the command line
Submit jobs through the command line. For more information, see SLURM.
The job script content is as follows. Use the docker images command to obtain the image name and version number, and use them to replace your_image in the script.
```bash
#!/bin/bash
#SBATCH --job-name=tf_sample_job
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --gpus-per-task=1
#SBATCH --time=01:00:00
#SBATCH --partition=comp
#SBATCH --output=tf_sample_job_%j.out
#SBATCH --error=tf_sample_job_%j.err

# Define variables
image="your_image"
run_cmd="python main.py"
share_dir="/home/usertest/:/root"

# Clean up Docker containers that belong to this job
function cleanup {
    echo "Caught signal, stopping Docker container: " $SLURM_JOB_NAME
    docker ps -q --filter label=$SLURM_JOB_NAME | xargs -r docker stop
    docker ps -qa --filter label=$SLURM_JOB_NAME | xargs -r docker rm
}
trap cleanup SIGINT SIGTERM
cleanup

# Start the Docker container
docker pull $image
docker run \
    --label $SLURM_JOB_NAME \
    --gpus "device=$CUDA_VISIBLE_DEVICES" \
    -v $share_dir \
    $image \
    /bin/bash -c "$run_cmd" &

# Wait for the container to complete
wait
cleanup
```
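Assuming the script is saved as docker_job.sh (the file name is arbitrary), submit it with sbatch:

```bash
sbatch docker_job.sh
```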
Query jobs through Slurm.
Use the squeue command to query the list of jobs that are currently running or queued.
```bash
squeue
```
Use the sacct command to query job history, including completed jobs.
```bash
sacct
```
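sacct also accepts a --format option to select output fields; the fields below are standard sacct field names, shown as an illustrative example:

```bash
# Show a concise history of recent jobs
sacct --format=JobID,JobName,Partition,State,Elapsed,ExitCode
```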
Schedule Singularity jobs
Convert the Docker image to a Singularity image (Docker2Singularity).
To simplify Singularity image management, you can reuse Docker image repositories in the cloud. Use one of the following methods to convert the image format.
```bash
# Method 1: Convert a local Docker image to a SIF image
[root@compute006 opt]# docker images
REPOSITORY                                             TAG                                   IMAGE ID       CREATED        SIZE
ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/pytorch   2.4.0-cuda12.1.1-py310-alinux3.2104   19301a07d7fd   4 months ago   6.33GB
[root@compute006 opt]# docker save -o docker.tar 19301a07d7fd
[root@compute006 opt]# ll docker.tar
-rw------- 1 root root 3021202432 Feb 12 15:03 docker.tar
[root@compute006 opt]# singularity build pytorch.sif docker-archive:///opt/docker.tar

# Method 2: Build a SIF image directly from a Docker container repository
singularity build xx.sif docker://ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/pytorch:2.4.0-cuda12.1.1-py310-alinux3.2104
```
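Optionally, confirm that the converted image runs with GPU support (an extra check, not part of the original steps):

```bash
# Run nvidia-smi inside the converted SIF image
singularity exec --nv /opt/pytorch.sif nvidia-smi
```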
Submit jobs through E-HPC Portal
Submit a training job.
In the top navigation bar, select Task Management, click Submitter at the top of the page, and on the Create Job page, set Number Of Nodes to 1, Number Of Tasks (per node) to 2, and Number Of GPUs to 1.
The job script content is as follows:
```bash
image=/opt/pytorch.sif
run_cmd="python main.py"
share_dir="/home/usertest/:/root"

singularity exec --nv --bind $share_dir $image $run_cmd
```
The --nv flag enables NVIDIA GPU support inside the container, and --bind mounts the shared directory into it.
Query jobs.
Go to the Task Management page to view the job list with job status and actions. For more information, see Query jobs.
Submit jobs through the command line
Submit jobs through the command line. For more information, see SLURM.
The job script content is as follows:
```bash
#!/bin/bash
#SBATCH --job-name=singularity_TF
#SBATCH --output=output.log
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --partition=container
#SBATCH --ntasks=2

image=/opt/pytorch.sif
run_cmd="python main.py"
share_dir="/home/usertest/:/root"

singularity exec --nv --bind $share_dir $image $run_cmd
```
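Assuming the script is saved as singularity_job.sh (an arbitrary name), submit it with sbatch; for a quick interactive test you could instead run the same command under srun, assuming you launch it from /home/usertest so that main.py is in the working directory:

```bash
# Submit the batch script
sbatch singularity_job.sh

# Or run the same command interactively on one GPU
srun --partition=container --gres=gpu:1 singularity exec --nv --bind /home/usertest/:/root /opt/pytorch.sif python main.py
```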
Query jobs through Slurm.
Use the squeue command to query the list of jobs that are currently running or queued.
```bash
squeue
```
Use the sacct command to query job history, including completed jobs.
```bash
sacct
```