An Object Storage Service (OSS) accelerator can significantly speed up model training by accelerating data loading. This topic compares the performance of data loading with and without an OSS accelerator. The comparisons suggest that data loading efficiency is crucial for model training, particularly when GPUs have not yet reached their performance bottleneck. This topic also shows how to use an OSS accelerator on Elastic GPU Service to accelerate the fine-tuning of a pretrained ResNet-18 model on the ImageNet ILSVRC dataset.
Acceleration performance
Compared with standard OSS access, an OSS accelerator provides a noticeable improvement in performance. An OSS accelerator reduces latency and delivers high throughput with a small number of workers. Performance tests demonstrate that OSS accelerators achieve a performance improvement of 40% to 400% in model training. They also significantly decrease computing resource consumption, reduce costs, and provide a more cost-effective solution.
Solution overview
The following flowchart illustrates the process of training the model on Elastic GPU Service when an OSS accelerator is used.
Model training acceleration by using an OSS accelerator is a three-task procedure:
Create a GPU-accelerated instance in Elastic GPU Service. You need to create a GPU-accelerated instance that meets your model training requirements.
Create an OSS bucket and an OSS accelerator for the bucket. After you create the bucket and accelerator, record the internal endpoint of the bucket and the accelerated endpoint, which will be used in model training.
Train the model. Pre-process the datasets and upload the pre-processed datasets to the bucket. When you train the model, use the OSS accelerator to load the datasets to the local device.
Procedure
Task 1: Create a GPU-accelerated instance on Elastic GPU Service
The following steps show how to create and connect to a GPU-accelerated instance for model training. In this task, the instance type is ecs.gn6i-c4g1.xlarge, the operating system is Ubuntu 22.04, and the CUDA version is 12.4.1. When you use custom instance specifications, make sure that you use the latest CUDA version.
1. Create a GPU-accelerated instance
Click the Custom Launch tab.
Configure parameters for the instance based on your business requirements. The parameters include Billing Method, Region, Network and Zone, Instance Type, and Image. Complete the creation. For more information about the settings, see Parameter descriptions.
Important: The OSS accelerator feature is in public preview in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Ulanqab), China (Shenzhen), and Singapore. Make sure that your GPU-accelerated instance resides in one of these regions. In this example, the GPU-accelerated instance is located in the China (Hangzhou) region.
In this example, the instance type used is ecs.gn6i-c4g1.xlarge.

In this example, the OS image is Ubuntu 22.04, the Auto-install GPU Driver check box is selected, and the selected CUDA version is 12.4.1. When the instance starts, CUDA is automatically installed.

2. Connect to the GPU-accelerated instance
On the Instances page in the ECS console, find the ECS instance that you created based on its region and ID. Then, click Connect in the Actions column.
In the Remote connection dialog box, click Sign in now in the Workbench section.
In the Instance Login dialog box, set Authentication to the authentication method that you selected when you created the GPU-accelerated instance, provide the required authentication information, and click Log On. For example, if you selected Key Pair for Logon Credential when you created the instance, you can select SSH Key Authentication as the authentication method, and upload the private key file or enter the content of the private key file.
Note: The private key file was automatically downloaded to your on-premises computer when you created the key pair. Check the download history of your browser to find the private key file in the .pem format.
If a page similar to the following one appears, you have logged on to the ECS instance and the CUDA driver is being automatically installed. Wait for the installation to complete.

Task 2: Create an OSS bucket and an OSS accelerator for the bucket
The following steps show how to create an OSS bucket in the same region as the GPU-accelerated instance for storing datasets, and create an OSS accelerator to accelerate dataset access. If the GPU-accelerated instance and the bucket reside in the same region and the internal endpoint is used for data access, no traffic fees are incurred.
Create a bucket and obtain the internal endpoint of the bucket
Important: The OSS accelerator feature is in public preview in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Ulanqab), China (Shenzhen), and Singapore. Make sure that the bucket is located in the same region as the GPU-accelerated instance. In the previous task, the GPU-accelerated instance was created in the China (Hangzhou) region. Therefore, the bucket must also be located in the China (Hangzhou) region.
On the Buckets page of the OSS console, click Create Bucket.
In the Create Bucket panel, follow the on-screen instructions to complete the bucket creation.
Go to the Overview page of the bucket. In the Port section, record the endpoint for Access from ECS over the VPC (internal network). This endpoint is used to upload datasets and checkpoints during model training.

Create an OSS accelerator and record the name of the accelerator
On the Buckets page of the OSS console, click the name of the bucket. In the left-side navigation tree, choose .
Click Create Accelerator, and in the Create Accelerator panel, set the capacity (500 GB in this example), then click Next.
Select Paths for Acceleration Policy and add the directory of the dataset in the bucket to the accelerated paths. Click OK, and follow the on-screen information to complete the creation process.

Record the accelerated endpoint, which will be used to download datasets from the bucket during model training.

Task 3: Train the model
The following steps cover the model training process, including environment configuration, dataset upload, and acceleration with the OSS accelerator.
For the complete sample code, see demo.tar.gz.
All subsequent steps must be performed as the root user. Make sure that you switch to the root user before you proceed.
Prepare the environment for model training
Prepare the conda environment and configure dependencies.
Run the following command to install conda:
curl -L https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o /tmp/miniconda.sh && bash /tmp/miniconda.sh -b -p /opt/conda/ && rm /tmp/miniconda.sh && /opt/conda/bin/conda clean -tipy && export PATH=/opt/conda/bin:$PATH && conda init bash && source ~/.bashrc && conda update conda

Run the vim environment.yaml command to create and open an environment configuration file named environment.yaml. Add the following configurations to the environment configuration file and save the file:

name: py312
channels:
  - defaults
  - conda-forge
  - pytorch
dependencies:
  - python=3.12
  - pytorch>=2.5.0
  - torchvision
  - torchaudio
  - transformers
  - torchdata
  - oss2

Run the following command to create a conda environment named py312 based on the environment configuration file:

conda env create -f environment.yaml

Run the conda activate py312 command to activate the py312 environment. The following figure shows that the environment is activated.
Important: Perform the following steps in the activated conda environment.
Configure environment variables.
Run the following commands to configure environment variables. Replace <ACCESS_KEY_ID> and <ACCESS_KEY_SECRET> with the AccessKey ID and AccessKey secret of the RAM user that you want to use. For information about how to create an AccessKey ID and AccessKey secret, see Create an AccessKey pair.

export OSS_ACCESS_KEY_ID=<ACCESS_KEY_ID>
export OSS_ACCESS_KEY_SECRET=<ACCESS_KEY_SECRET>

Install and configure the OSS connector.

Run the following command to install the OSS connector:

pip install osstorchconnector

Run the following command to create a credentials file:

mkdir -p /root/.alibabacloud && touch /root/.alibabacloud/credentials

Run the vim /root/.alibabacloud/credentials command to open the credentials file. Add the following configurations to the file, and then save the file. For more information about how to configure the OSS connector, see Configure OSS Connector for AI/ML. Replace the example AccessKey ID and AccessKey secret with your actual information. For more information about how to create an AccessKey ID and AccessKey secret, see Create an AccessKey pair.

{
    "AccessKeyId": "LTAI************************",
    "AccessKeySecret": "At32************************"
}

Run the following command to make the credentials file read-only:

chmod 400 /root/.alibabacloud/credentials

Run the following command to create a configuration file for the OSS connector:

mkdir -p /etc/oss-connector/ && touch /etc/oss-connector/config.json

Run the vim /etc/oss-connector/config.json command to open the configuration file. Add the following configurations to the configuration file and save the file. In most cases, you can use the default configurations.

{
    "logLevel": 1,
    "logPath": "/var/log/oss-connector/connector.log",
    "auditPath": "/var/log/oss-connector/audit.log",
    "datasetConfig": {
        "prefetchConcurrency": 24,
        "prefetchWorker": 2
    },
    "checkpointConfig": {
        "prefetchConcurrency": 24,
        "prefetchWorker": 4,
        "uploadConcurrency": 64
    }
}
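Before training, it can help to confirm that the configuration file parses and contains the sections the connector reads. The following Python snippet is a minimal sanity-check sketch; validate_connector_config is a hypothetical helper written for this topic, not part of the OSS connector, and the embedded JSON is the example configuration above.

```python
import json

def validate_connector_config(raw: str) -> dict:
    """Parse an OSS connector config and check the sections used in this topic."""
    cfg = json.loads(raw)
    for section in ("datasetConfig", "checkpointConfig"):
        if section not in cfg:
            raise ValueError(f"missing section: {section}")
    # Prefetch and upload settings must be positive integers.
    assert cfg["datasetConfig"]["prefetchWorker"] > 0
    assert cfg["checkpointConfig"]["uploadConcurrency"] > 0
    return cfg

# The example configuration from this topic, embedded as a string for the check.
example = """
{
    "logLevel": 1,
    "logPath": "/var/log/oss-connector/connector.log",
    "auditPath": "/var/log/oss-connector/audit.log",
    "datasetConfig": {"prefetchConcurrency": 24, "prefetchWorker": 2},
    "checkpointConfig": {"prefetchConcurrency": 24, "prefetchWorker": 4, "uploadConcurrency": 64}
}
"""
cfg = validate_connector_config(example)
```

To check the file you actually wrote, read /etc/oss-connector/config.json and pass its contents to the same helper.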
Prepare data
Upload the training set and validation set to the bucket.
Run the following commands to download the training set and validation set to the ECS instance. Note that the data used in this training task is only a portion of the entire dataset.

wget https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241216/jsnenr/n04487081.tar
wget https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241218/dxrciv/n10148035.tar
wget https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241216/senwji/val.tar

Run the following commands to extract the datasets from the downloaded packages into a dataset directory created in the current path:

tar -zxvf n10148035.tar && tar -zxvf n04487081.tar && tar -zxvf val.tar
mkdir dataset && mkdir ./dataset/train && mkdir ./dataset/val
mv n04487081 ./dataset/train/ && mv n10148035 ./dataset/train/ && mv IL*.JPEG ./dataset/val/

Run the python3 upload_dataset.py command to run the following script, which pre-processes the images and uploads the resulting datasets to the bucket:

# upload_dataset.py
from torchvision import transforms
from PIL import Image
import oss2
import os
from oss2.credentials import EnvironmentVariableCredentialsProvider

# In this example, the internal endpoint for the China (Hangzhou) region is used.
OSS_ENDPOINT = "oss-cn-hangzhou-internal.aliyuncs.com"  # The internal OSS endpoint.
OSS_BUCKET_NAME = "<YourBucketName>"  # The name of the bucket.
BUCKET_REGION = "cn-hangzhou"  # The ID of the region in which the bucket is located.
# Specify a custom prefix for the names of the datasets in the bucket.
OSS_URI_BASE = "dataset/imagenet/ILSVRC/Data"

def to_tensor(img_path):
    IMG_DIM_224 = 224
    compose = transforms.Compose([
        transforms.RandomResizedCrop(IMG_DIM_224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ])
    img = Image.open(img_path).convert('RGB')
    img_tensor = compose(img)
    numpy_data = img_tensor.numpy()
    binary_data = numpy_data.tobytes()
    return binary_data

def list_dir(directory):
    for root, _, files in os.walk(directory):
        rel_root = os.path.relpath(root, start=directory)
        for file in files:
            rel_filepath = os.path.join(rel_root, file) if rel_root != '.' else file
            yield rel_filepath

IMG_DIR_BASE = "./dataset"
"""
IMG_DIR_BASE stores the local path of the images. You can specify the local path by using an absolute or relative path. The structure of the local path must be consistent with that of the datasets:
{IMG_DIR_BASE}/
    train/
        n10148035/
            n10148035_10034.JPEG
            n10148035_10217.JPEG
            ...
        n11879895/
            n11879895_10016.JPEG
            n11879895_10019.JPEG
            ...
        ...
    val/
        ILSVRC2012_val_00000001.JPEG
        ILSVRC2012_val_00000002.JPEG
        ...
"""
bucket_api = oss2.Bucket(oss2.ProviderAuthV4(EnvironmentVariableCredentialsProvider()), OSS_ENDPOINT, OSS_BUCKET_NAME, region=BUCKET_REGION)
for phase in ["val", "train"]:
    IMG_DIR = "%s/%s" % (IMG_DIR_BASE, phase)
    for _, img_relative_path in enumerate(list_dir(IMG_DIR)):
        img_bin_name = img_relative_path.replace(".JPEG", ".pt")
        object_key = "%s/%s/%s" % (OSS_URI_BASE, phase, img_bin_name)
        bucket_api.put_object(object_key, to_tensor("%s/%s" % (IMG_DIR, img_relative_path)))
Download the files that store image data labels. The files are used to establish dataset mapping.
wget https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241220/izpskr/imagenet_class_index.json
wget https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241220/lfilrp/ILSVRC2012_val_labels.json
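These files map WordNet IDs (the n-prefixed directory names) and validation file names to class labels. The following sketch shows how such a mapping can be built. The JSON excerpt is hand-written in the format of imagenet_class_index.json, and the indexes shown are illustrative, not the real ImageNet class indexes:

```python
import json

# Hand-written excerpt in the imagenet_class_index.json format:
# class index -> [wnid, human-readable label]. Indexes here are illustrative.
class_index_json = """
{
    "0": ["n01440764", "tench"],
    "1": ["n04487081", "trolleybus"],
    "2": ["n10148035", "groom"]
}
"""

class_index = json.loads(class_index_json)
# Invert the mapping so a training image's parent directory (a wnid) yields its class index.
wnid_to_idx = {wnid: int(idx) for idx, (wnid, _label) in class_index.items()}

def label_for_train_object(object_key: str) -> int:
    # For example: "dataset/imagenet/ILSVRC/Data/train/n10148035/n10148035_10034.pt".
    wnid = object_key.split("/")[-2]
    return wnid_to_idx[wnid]
```

Validation labels work the same way, except that ILSVRC2012_val_labels.json is keyed by file name instead of by directory.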
Start model training
Create a module that is used to process ImageNet datasets. The module uses the accelerated endpoint to download datasets from the cache and creates a data loader.
Create a module to initialize a pretrained ResNet18 model.
Create a module to train a ResNet model. This module trains a given model based on the specified data loaders and number of epochs.
Create a script file that integrates model training processes.
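The training module described above amounts to a standard PyTorch training loop. The following is a minimal, self-contained sketch that uses a tiny stand-in model and random tensors in place of the pretrained ResNet-18 and the OSS-backed data loaders from demo.tar.gz:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-ins for the real components: in the actual task the model is a pretrained
# ResNet-18 and the loaders read pre-processed tensors through the OSS accelerator.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 2))
images = torch.randn(32, 3, 8, 8)
labels = torch.randint(0, 2, (32,))
loader = DataLoader(TensorDataset(images, labels), batch_size=8, shuffle=True)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

def train(model, loader, epochs):
    """Train for the given number of epochs and return the per-epoch mean loss."""
    model.train()
    losses = []
    for _ in range(epochs):
        epoch_loss = 0.0
        for x, y in loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item() * x.size(0)
        losses.append(epoch_loss / len(loader.dataset))
    return losses

losses = train(model, loader, epochs=5)
```

In the real script, the loop additionally moves batches to the GPU, runs a validation pass per epoch, and writes the checkpoint (resnet18.pt) back to the bucket through the OSS connector.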
Run the python3 main.py command to start the training. The following figure shows that the training has started.
Verify the result
On the Buckets page, check whether the checkpoints directory contains the resnet18.pt object. The following figure shows that the checkpoints are uploaded to OSS.
