This topic describes the entire process from cluster creation to job scheduling for users who need to run containerized tasks in a SLURM cluster.
Step 1: Create a cluster
Create a Standard Edition SLURM cluster.
The following table describes the example cluster configuration used in this topic. Configure other parameters as needed.
Configuration item: Configuration

Cluster Configuration
    Series: Standard Edition
    Deployment Mode: Public Cloud Cluster
    Cluster Type: SLURM
    Control Plane Node:
        Instance Type: Use the ecs.r7.xlarge instance type, which has 4 vCPUs and 32 GiB of memory.
        Image: centos_7_6_x64_20G_alibase_20211130.vhd

Compute Nodes And Queues
    Number Of Queue Nodes: Initial nodes: 1
    Node Interconnection: ERDMA Network
    Instance Type Group:
        Instance Type: Use the ecs.gn7i-c56g1.14xlarge instance type, which has 56 vCPUs and 346 GiB of memory.
        Important: You must use an instance type that supports GPUs. For more information, see Instance family.
        Image: centos_7_6_x64_20G_alibase_20211130.vhd

Shared File Storage
    /home Cluster Mount Directory and /opt Cluster Mount Directory: By default, the /home and /opt directories of the control plane node are mounted to the file system as shared storage directories.

Software And Service Components
    Software To Install: Select docker.
    Service Components Available For Installation: Logon Node
        Instance Type: Use the ecs.r7.xlarge instance type, which has 4 vCPUs and 32 GiB of memory.
        Image: centos_7_6_x64_20G_alibase_20211130.vhd
In this topic, the usertest user is used as an example.
Step 2: Set up the basic software environment
Attach an Elastic IP Address to the compute node. For more information, see Elastic IP Address.
Download and install CUDA.
Download the CUDA installation package.
    cd /opt
    wget https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux.run
Install CUDA.
    yum install -y git
    sh /opt/cuda_12.4.1_550.54.15_linux.run
When the following figure appears, CUDA is installed.
Configure the environment variables.
    echo 'export PATH=/usr/local/cuda-12.4/bin:$PATH' >> ~/.bashrc
    echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
    source ~/.bashrc
View the installation status and version information of the NVIDIA CUDA toolkit and GPU driver.
    # Version information of the NVIDIA CUDA compiler driver
    nvcc --version
    # Detailed status information of the GPU
    nvidia-smi
When the following figure appears, CUDA and the GPU driver are working properly.
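As a complement to reading the output by eye, the release reported by nvcc can be checked in a script. The following is a minimal sketch, assuming the output contains a line of the form "release 12.4". The sample text and the helper names (parse_cuda_release, matches_expected) are illustrative and not part of E-HPC or CUDA.

```python
import re
from typing import Optional

# Illustrative sample of the last lines printed by `nvcc --version`.
# The exact build string on your node may differ.
SAMPLE_NVCC_OUTPUT = """\
nvcc: NVIDIA (R) Cuda compiler driver
Cuda compilation tools, release 12.4, V12.4.131
"""


def parse_cuda_release(nvcc_output: str) -> Optional[str]:
    """Return the 'major.minor' release string, or None if it is not found."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def matches_expected(nvcc_output: str, expected: str = "12.4") -> bool:
    """True when the reported release equals the version installed in this topic."""
    return parse_cuda_release(nvcc_output) == expected


if __name__ == "__main__":
    print(parse_cuda_release(SAMPLE_NVCC_OUTPUT))
```

On a compute node, the text passed to parse_cuda_release would come from running nvcc --version, for example through subprocess.run.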
Install and configure NVIDIA Container Toolkit.
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
    sudo yum install -y nvidia-container-toolkit
    sudo systemctl restart docker
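The first command above sources /etc/os-release and concatenates the ID and VERSION_ID fields. The sketch below reproduces that evaluation so the resulting value is easy to inspect. The sample file content is an assumption about a CentOS 7 node, and the function names are hypothetical.

```python
import shlex

# Assumed excerpt of /etc/os-release on a CentOS 7 node (illustrative only).
SAMPLE_OS_RELEASE = (
    'NAME="CentOS Linux"\n'
    'VERSION="7 (Core)"\n'
    'ID="centos"\n'
    'VERSION_ID="7"\n'
)


def parse_os_release(text: str) -> dict:
    """Parse KEY=value lines, removing optional shell-style quotes."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        parts = shlex.split(raw)  # strips the surrounding quotes, if any
        values[key] = parts[0] if parts else ""
    return values


def distribution_id(text: str) -> str:
    """Equivalent of `. /etc/os-release; echo $ID$VERSION_ID`."""
    info = parse_os_release(text)
    return info.get("ID", "") + info.get("VERSION_ID", "")


if __name__ == "__main__":
    print(distribution_id(SAMPLE_OS_RELEASE))
```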
Download and install Singularity.
Singularity is a containerization tool that allows you to run containers without changing the user environment. It is commonly used in HPC environments.
    cd /opt
    wget https://public-ehs.oss-cn-hangzhou.aliyuncs.com/softwares/packages/CentOS_7.2_64/singularity-3.8.3-1.el7.x86_64.rpm
    yum install -y /opt/singularity-3.8.3-1.el7.x86_64.rpm
Create job dependency data.
Pull the PyTorch container image.
    docker pull ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/pytorch:2.4.0-cuda12.1.1-py310-alinux3.2104
Use the usertest user to create a main.py file.
    vim /home/usertest/main.py
The script content of the main.py file is as follows.

    # -*- coding: utf-8 -*-
    import torch
    import torchvision
    import torchvision.transforms as transforms
    from torch import nn
    from torch.utils.data import DataLoader
    from torch.optim import SGD


    class SimpleNet(nn.Module):
        def __init__(self):
            super(SimpleNet, self).__init__()
            self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
            self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
            self.fc1 = nn.Linear(64 * 8 * 8, 128)
            self.fc2 = nn.Linear(128, 10)
            self.pool = nn.MaxPool2d(2, 2)
            self.relu = nn.ReLU()
            self.dropout = nn.Dropout(0.5)

        def forward(self, x):
            x = self.pool(self.relu(self.conv1(x)))
            x = self.pool(self.relu(self.conv2(x)))
            x = x.view(-1, 64 * 8 * 8)
            x = self.relu(self.fc1(x))
            x = self.dropout(x)
            x = self.fc2(x)
            return x


    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

    train_dataset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
    test_dataset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)

    train_loader = DataLoader(dataset=train_dataset, batch_size=64, shuffle=True)
    test_loader = DataLoader(dataset=test_dataset, batch_size=64, shuffle=False)

    model = SimpleNet().to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = SGD(model.parameters(), lr=0.001, momentum=0.9)

    num_epochs = 10
    for epoch in range(num_epochs):
        model.train()
        running_loss = 0.0
        for i, data in enumerate(train_loader, 0):
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            if i % 100 == 99:
                print(f"[{epoch + 1}, {i + 1}] loss: {running_loss / 100:.3f}")
                running_loss = 0.0

        model.eval()
        correct = 0
        total = 0
        with torch.no_grad():
            for data in test_loader:
                images, labels = data
                images, labels = images.to(device), labels.to(device)
                outputs = model(images)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

        print(f"Accuracy on test set: {100 * correct / total:.2f}%")

    print("Training finished.")
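The input size of fc1 in main.py, 64 * 8 * 8, follows from the layer settings and the 32x32 size of CIFAR-10 images. The sketch below works through that arithmetic in plain Python without importing torch. The helper names are illustrative, and the formulas are the standard output-size rules for convolution and pooling with dilation 1.

```python
def conv2d_out(size: int, kernel: int, padding: int = 0, stride: int = 1) -> int:
    """Spatial output size of a convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size: int, kernel: int, stride: int) -> int:
    """Spatial output size of a max-pooling layer without padding."""
    return (size - kernel) // stride + 1


def flattened_features(input_size: int = 32, channels: int = 64) -> int:
    """Feature count entering fc1 for a square input of the given size."""
    size = conv2d_out(input_size, kernel=3, padding=1)  # conv1 keeps the size
    size = pool_out(size, kernel=2, stride=2)           # first pool halves it
    size = conv2d_out(size, kernel=3, padding=1)        # conv2 keeps the size
    size = pool_out(size, kernel=2, stride=2)           # second pool halves it
    return channels * size * size


if __name__ == "__main__":
    print(flattened_features())
```

If the input resolution changes, the value passed to nn.Linear and to x.view in main.py has to change accordingly.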
Step 3: Schedule jobs
Schedule Docker jobs
Submit jobs through E-HPC Portal
Submit an NCCL job.
In the top navigation bar, select Job Management, click submitter at the top of the page, and on the Create Job page, set Number Of Compute Nodes to 1, Number Of Tasks to 2, and Number Of GPUs to 1.
The job script content is as follows.
Use the docker images command to obtain the image name and version number, and replace your_image in the third line.

    #!/bin/bash

    image="your_image"
    run_cmd="python main.py"
    share_dir="/home/usertest/:/root"

    # cleanup docker handle
    function cleanup {
        echo "Caught signal, stopping Docker container: " $SLURM_JOB_NAME
        docker ps -q --filter label=$SLURM_JOB_NAME | xargs -r docker stop
        docker ps -qa --filter label=$SLURM_JOB_NAME | xargs -r docker rm
    }
    trap cleanup SIGINT SIGTERM
    cleanup

    # start docker
    # docker pull $image
    docker run \
        --label $SLURM_JOB_NAME \
        --gpus "device=0" \
        -v $share_dir \
        $image \
        /bin/bash -c "$run_cmd" &

    # wait to complete
    wait
    cleanup
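To make the structure of the docker run invocation in the script easier to see, the sketch below assembles the same arguments as a list. It is an illustration only: docker_run_argv is a hypothetical helper, and the job name "demo_job" stands in for the value that Slurm places in SLURM_JOB_NAME.

```python
import shlex
from typing import List


def docker_run_argv(image: str, run_cmd: str, share_dir: str,
                    job_name: str, gpu_device: str = "0") -> List[str]:
    """Build the argument list used by the job script's `docker run` call."""
    return [
        "docker", "run",
        "--label", job_name,              # lets cleanup find the container later
        "--gpus", f"device={gpu_device}",  # GPU made visible to the container
        "-v", share_dir,                   # host_dir:container_dir mount
        image,
        "/bin/bash", "-c", run_cmd,        # command executed inside the container
    ]


if __name__ == "__main__":
    argv = docker_run_argv(
        image="your_image",
        run_cmd="python main.py",
        share_dir="/home/usertest/:/root",
        job_name="demo_job",  # hypothetical stand-in for $SLURM_JOB_NAME
    )
    # shlex.join shows how the list would look when typed in a shell.
    print(shlex.join(argv))
```

Keeping the label equal to the job name is what allows the cleanup function to stop and remove only the containers that belong to this job.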
Query jobs.
Go to the Job Management page to view the job list, which includes job status, job operations, and more. For more information, see Query jobs.
Submit jobs through the command line
Submit jobs through the command line. For more information, see SLURM.
The job script content is as follows.
Use the docker images command to obtain the image name and version number, and replace your_image in the thirteenth line.

    #!/bin/bash

    #SBATCH --job-name=tf_sample_job
    #SBATCH --nodes=1
    #SBATCH --nt
    #SBATCH --gpus-per-task=1
    #SBATCH --time=01:00:00
    #SBATCH --partition=comp
    #SBATCH --output=tf_sample_job_%j.out
    #SBATCH --error=tf_sample_job_%j.err

    # Define variables
    image="your_image"
    run_cmd="python main.py"
    share_dir="/home/usertest/:/root"

    # Clean up Docker processes
    function cleanup {
        echo "Caught signal, stopping Docker container: " $SLURM_JOB_NAME
        docker ps -q --filter label=$SLURM_JOB_NAME | xargs -r docker stop
        docker ps -qa --filter label=$SLURM_JOB_NAME | xargs -r docker rm
    }
    trap cleanup SIGINT SIGTERM
    cleanup

    # Start Docker container
    docker pull $image
    docker run \
        --label $SLURM_JOB_NAME \
        --gpus "device=$CUDA_VISIBLE_DEVICES" \
        -v $share_dir \
        $image \
        /bin/bash -c "$run_cmd" &

    # Wait for task completion
    wait
    cleanup
Query jobs through Slurm.
Use the squeue command to query the list of jobs that are currently running or queued.
    squeue
Use the sacct command to query job history, including completed jobs.
    sacct
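When job status needs to be consumed by a script rather than read in a terminal, squeue can be asked for a compact, delimiter-separated format. The sketch below assumes the --noheader and --format options of squeue with the %i (job ID) and %T (state) fields. The sample output and job IDs are made up for illustration.

```python
import subprocess
from typing import List, Tuple

# Job ID and extended state, separated by "|", without a header row.
SQUEUE_CMD = ["squeue", "--noheader", "--format=%i|%T"]


def parse_squeue(output: str) -> List[Tuple[str, str]]:
    """Turn 'jobid|STATE' lines into (job_id, state) tuples."""
    jobs = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        job_id, _, state = line.partition("|")
        jobs.append((job_id.strip(), state.strip()))
    return jobs


def list_jobs() -> List[Tuple[str, str]]:
    """Run squeue on a cluster node and parse its output."""
    result = subprocess.run(SQUEUE_CMD, capture_output=True, text=True, check=True)
    return parse_squeue(result.stdout)


if __name__ == "__main__":
    # Hypothetical output, used here so the example runs without Slurm.
    sample = "12|RUNNING\n13|PENDING\n"
    for job_id, state in parse_squeue(sample):
        print(job_id, state)
```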
Schedule Singularity jobs
Docker2Singularity.
To simplify Singularity image management, you can reuse Docker image repositories in the cloud. Follow these steps to convert image formats.
    # Method 1: Convert to sif image through local docker package
    [root@compute006 opt]# docker images
    REPOSITORY                                             TAG                                   IMAGE ID       CREATED        SIZE
    ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/pytorch   2.4.0-cuda12.1.1-py310-alinux3.2104   19301a07d7fd   4 months ago   6.33GB
    [root@compute006 opt]# docker save -o docker.tar 19301a07d7fd
    [root@compute006 opt]# ll docker.tar
    -rw------- 1 root root 3021202432 Feb 12 15:03 docker.tar
    [root@compute006 opt]# singularity build pytorch.sif docker-archive:///opt/docker.tar

    # Method 2: Build sif image through docker container repository
    singularity build xx.sif docker://ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/pytorch:2.4.0-cuda12.1.1-py310-alinux3.2104
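The two methods differ only in the source URI passed to singularity build: docker-archive:// followed by the path of a saved tar file, or docker:// followed by a registry image reference. The sketch below builds both command lines. The function names are hypothetical, and requiring an absolute tar path is a choice made for this example, mirroring the /opt/docker.tar path used above.

```python
from pathlib import PurePosixPath
from typing import List


def build_from_archive(sif_path: str, tar_path: str) -> List[str]:
    """Method 1: build a SIF image from a `docker save` tar file."""
    tar = PurePosixPath(tar_path)
    if not tar.is_absolute():
        raise ValueError("use an absolute path to the tar file")
    return ["singularity", "build", sif_path, f"docker-archive://{tar}"]


def build_from_registry(sif_path: str, image_ref: str) -> List[str]:
    """Method 2: build a SIF image directly from a container registry."""
    return ["singularity", "build", sif_path, f"docker://{image_ref}"]


if __name__ == "__main__":
    print(" ".join(build_from_archive("pytorch.sif", "/opt/docker.tar")))
    print(" ".join(build_from_registry(
        "xx.sif",
        "ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/pytorch:"
        "2.4.0-cuda12.1.1-py310-alinux3.2104",
    )))
```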
Submit jobs through E-HPC Portal
Submit an NCCL job.
In the top navigation bar, select Job Management, click submitter at the top of the page, and on the Create Job page, set Number Of Compute Nodes to 1, Number Of Tasks to 2, and Number Of GPUs to 1.
The job script content is as follows.
    image=/opt/pytorch.sif
    run_cmd="python main.py"
    share_dir="/home/usertest/:/root"
    singularity exec --nv --bind $share_dir $image $run_cmd
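The script relies on two conventions: the bind specification has the form source:destination, and the unquoted $run_cmd is split into separate words by the shell. The sketch below mirrors both in Python. The helper names are hypothetical, and only the source:destination form used in this topic is handled.

```python
import shlex
from typing import List, Tuple


def split_bind(spec: str) -> Tuple[str, str]:
    """Split 'host_dir:container_dir' into its two parts."""
    src, sep, dest = spec.partition(":")
    if not sep or not src or not dest:
        raise ValueError(f"expected 'source:destination', got {spec!r}")
    return src, dest


def singularity_exec_argv(image: str, run_cmd: str, share_dir: str) -> List[str]:
    """Build the argument list for `singularity exec --nv --bind ...`."""
    split_bind(share_dir)  # validate the bind specification first
    return [
        "singularity", "exec",
        "--nv",                  # expose the host NVIDIA driver and GPUs
        "--bind", share_dir,
        image,
        *shlex.split(run_cmd),   # same word splitting as the unquoted $run_cmd
    ]


if __name__ == "__main__":
    print(split_bind("/home/usertest/:/root"))
    print(singularity_exec_argv("/opt/pytorch.sif", "python main.py",
                                "/home/usertest/:/root"))
```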
Query jobs.
Go to the Job Management page to view the job list, which includes job status, job operations, and more. For more information, see Query jobs.
Submit jobs through the command line
Submit jobs through the command line. For more information, see SLURM.
The job script content is as follows.
    #!/bin/bash
    #SBATCH --job-name=singularity_TF
    #SBATCH --output=output.log
    #SBATCH --nodes=1
    #SBATCH --gres=gpu:1
    #SBATCH --partition=container
    #SBATCH --ntasks=2

    image=/opt/pytorch.sif
    run_cmd="python main.py"
    share_dir="/home/usertest/:/root"
    singularity exec --nv --bind $share_dir $image $run_cmd
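Before submitting a script, it can help to list the resource requests it carries. The sketch below extracts the long-form #SBATCH directives from the script text into a dictionary. parse_sbatch_directives is a hypothetical helper that handles only the --key=value form used above, not short options.

```python
from typing import Dict

# The directive header of the Singularity job script in this topic.
JOB_SCRIPT = """\
#!/bin/bash
#SBATCH --job-name=singularity_TF
#SBATCH --output=output.log
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --partition=container
#SBATCH --ntasks=2
"""


def parse_sbatch_directives(script: str) -> Dict[str, str]:
    """Collect '#SBATCH --key=value' lines as {key: value}."""
    directives = {}
    for line in script.splitlines():
        line = line.strip()
        if not line.startswith("#SBATCH"):
            continue
        option = line[len("#SBATCH"):].strip()
        if not option.startswith("--"):
            continue  # short options such as -N are not handled in this sketch
        key, _, value = option[2:].partition("=")
        directives[key] = value
    return directives


if __name__ == "__main__":
    for key, value in parse_sbatch_directives(JOB_SCRIPT).items():
        print(f"{key} = {value}")
```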
Query jobs through Slurm.
Use the squeue command to query the list of jobs that are currently running or queued.
    squeue
Use the sacct command to query job history, including completed jobs.
    sacct
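For job history, sacct can emit pipe-delimited rows that are straightforward to process. The sketch below assumes the --parsable2, --noheader, and --format options of sacct with the JobID, JobName, State, and ExitCode fields. The sample rows are invented for illustration.

```python
import subprocess
from typing import Dict, List

FIELDS = ["JobID", "JobName", "State", "ExitCode"]
SACCT_CMD = ["sacct", "--parsable2", "--noheader", "--format=" + ",".join(FIELDS)]


def parse_sacct(output: str) -> List[Dict[str, str]]:
    """Turn pipe-delimited sacct rows into dictionaries keyed by field name."""
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue
        rows.append(dict(zip(FIELDS, line.split("|"))))
    return rows


def top_level_jobs(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop job steps such as '15.batch' and keep only whole-job records."""
    return [row for row in rows if "." not in row["JobID"]]


def job_history() -> List[Dict[str, str]]:
    """Run sacct on a cluster node and return whole-job records."""
    result = subprocess.run(SACCT_CMD, capture_output=True, text=True, check=True)
    return top_level_jobs(parse_sacct(result.stdout))


if __name__ == "__main__":
    # Hypothetical rows, used here so the example runs without Slurm.
    sample = "15|singularity_TF|COMPLETED|0:0\n15.batch|batch|COMPLETED|0:0\n"
    print(top_level_jobs(parse_sacct(sample)))
```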