An Object Storage Service (OSS) accelerator can significantly speed up model training by accelerating data loading. This topic compares the performance of data loading with and without an OSS accelerator. The comparisons suggest that data loading efficiency is crucial for model training, particularly when GPUs have not yet reached their performance bottleneck. This topic also shows how to use an OSS accelerator on Elastic GPU Service to accelerate the fine-tuning of a pretrained ResNet-18 model on the ImageNet ILSVRC dataset.
Acceleration performance
Compared with standard OSS access, an OSS accelerator provides a noticeable improvement in performance. An OSS accelerator reduces latency and delivers high throughput with a small number of workers. Performance tests demonstrate that OSS accelerators achieve a performance improvement of 40% to 400% in model training. They also significantly decrease computing resource consumption, reduce costs, and provide a more cost-effective solution.
Solution overview
The following flowchart illustrates the process of training the model on Elastic GPU Service when an OSS accelerator is used.
Model training acceleration by using an OSS accelerator is a three-task procedure:
Create a GPU-accelerated instance in Elastic GPU Service. You need to create a GPU-accelerated instance that meets your model training requirements.
Create an OSS bucket and an OSS accelerator for the bucket. After you create the bucket and accelerator, record the internal endpoint of the bucket and the accelerated endpoint, which will be used in model training.
Train the model. Pre-process the datasets and upload the pre-processed datasets to the bucket. When you train the model, use the OSS accelerator to load the datasets to the local device.
Procedure
Task 1: Create a GPU-accelerated instance on Elastic GPU Service
The following steps show how to create and connect to a GPU-accelerated instance for model training. In this task, the instance type is ecs.gn6i-c4g1.xlarge, the operating system is Ubuntu 22.04, and the CUDA version is 12.4.1. When you use custom instance specifications, make sure that you use the latest CUDA version.
1. Create a GPU-accelerated instance
Click the Custom Launch tab.
Configure parameters for the instance based on your business requirements. The parameters include Billing Method, Region, Network and Zone, Instance Type, and Image. Complete the creation. For more information about the settings, see Parameter descriptions.
Important: The OSS accelerator feature is in public preview in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Ulanqab), China (Shenzhen), and Singapore. Make sure that your GPU-accelerated instance resides in one of these regions. In this example, the GPU-accelerated instance is located in the China (Hangzhou) region.
In this example, the instance type used is ecs.gn6i-c4g1.xlarge.

In this example, the OS image is Ubuntu 22.04, the Auto-install GPU Driver check box is selected, and the selected CUDA version is 12.4.1. When the instance starts, CUDA is automatically installed.

2. Connect to the GPU-accelerated instance
On the Instances page in the ECS console, find the ECS instance that you created based on its region and ID. Then, click Connect in the Actions column.
In the Remote connection dialog box, click Sign in now in the Workbench section.
In the Instance Login dialog box, set Authentication to the authentication method that you selected when you created the GPU-accelerated instance, provide the required authentication information, and click Log On. For example, if you selected Key Pair for Logon Credential when you created the instance, you can select SSH Key Authentication as the authentication method, and upload the private key file or enter the content of the private key file.
Note: The private key file was automatically downloaded to your on-premises computer when you created the key pair. Check the download history of your browser to find the private key file in the .pem format.
If a page similar to the following one appears, you have logged on to the ECS instance and the CUDA driver is being automatically installed. Wait for the installation to complete.

Task 2: Create an OSS bucket and an OSS accelerator for the bucket
The following steps show how to create an OSS bucket in the same region as the GPU-accelerated instance for storing datasets, and create an OSS accelerator to accelerate dataset access. If the GPU-accelerated instance and the bucket reside in the same region and the internal endpoint is used for data access, no traffic fees are incurred.
Create a bucket and obtain the internal endpoint of the bucket
Important: The OSS accelerator feature is in public preview in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Ulanqab), China (Shenzhen), and Singapore. Make sure that the bucket is located in the same region as the GPU-accelerated instance. In the previous task, the GPU-accelerated instance was created in the China (Hangzhou) region. Therefore, the bucket must also be located in the China (Hangzhou) region.
On the Buckets page of the OSS console, click Create Bucket.
In the Create Bucket panel, follow the on-screen instructions to complete the bucket creation.
Go to the Overview page of the bucket. In the Port section, record the endpoint for Access from ECS over the VPC (internal network). This endpoint is used to upload datasets and checkpoints during model training.

Create an OSS accelerator and record the name of the accelerator
On the Buckets page of the OSS console, click the name of the bucket. In the left-side navigation tree, choose .
Click Create Accelerator, and in the Create Accelerator panel, set the capacity (500 GB in this example), then click Next.
Select Paths for Acceleration Policy and add the directory of the dataset in the bucket to the accelerated paths. Click OK, and follow the on-screen information to complete the creation process.

Record the accelerated endpoint, which will be used to download datasets from the bucket during model training.

Task 3: Train the model
The following steps cover the model training process, including environment configuration, dataset upload, and acceleration with the OSS accelerator.
For the complete sample code, see demo.tar.gz.
All subsequent steps must be performed as the root user. Make sure that you switch to the root user before you proceed.
Prepare the environment for model training
Prepare the conda environment and configure dependencies.
Run the following command to install conda:
curl -L https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o /tmp/miniconda.sh && bash /tmp/miniconda.sh -b -p /opt/conda/ && rm /tmp/miniconda.sh && /opt/conda/bin/conda clean -tipy && export PATH=/opt/conda/bin:$PATH && conda init bash && source ~/.bashrc && conda update conda

Run the vim environment.yaml command to create and open an environment configuration file named environment.yaml. Add the following configurations to the environment configuration file and save the file:

name: py312
channels:
  - defaults
  - conda-forge
  - pytorch
dependencies:
  - python=3.12
  - pytorch>=2.5.0
  - torchvision
  - torchaudio
  - transformers
  - torchdata
  - oss2

Run the following command to create a conda environment named py312 based on the environment configuration file:

conda env create -f environment.yaml

Run the conda activate py312 command to activate the py312 environment. The following figure shows that the environment is activated.
Important: Perform the following steps in the activated conda environment.
Configure environment variables.
Run the following commands to configure environment variables. Replace <ACCESS_KEY_ID> and <ACCESS_KEY_SECRET> with the AccessKey ID and AccessKey secret of the RAM user that you want to use. For information about how to create an AccessKey ID and AccessKey secret, see Create an AccessKey pair.

export OSS_ACCESS_KEY_ID=<ACCESS_KEY_ID>
export OSS_ACCESS_KEY_SECRET=<ACCESS_KEY_SECRET>

Install and configure the OSS connector.

Run the following command to install the OSS connector:

pip install osstorchconnector

Run the following command to create a credentials file:

mkdir -p /root/.alibabacloud && touch /root/.alibabacloud/credentials

Run the vim /root/.alibabacloud/credentials command to open the credentials file. Add the following configurations to the file, and then save the file. For more information about how to configure the OSS connector, see Configure OSS Connector for AI/ML. Replace the example AccessKey ID and AccessKey secret with your actual information. For more information about how to create an AccessKey ID and AccessKey secret, see Create an AccessKey pair.

{
    "AccessKeyId": "LTAI************************",
    "AccessKeySecret": "At32************************"
}

Run the following command to make the credentials file read-only:

chmod 400 /root/.alibabacloud/credentials

Run the following command to create a configuration file for the OSS connector:

mkdir -p /etc/oss-connector/ && touch /etc/oss-connector/config.json

Run the vim /etc/oss-connector/config.json command to open the configuration file. Add the following configurations to the configuration file and save the file. In most cases, you can use the default configurations.

{
    "logLevel": 1,
    "logPath": "/var/log/oss-connector/connector.log",
    "auditPath": "/var/log/oss-connector/audit.log",
    "datasetConfig": {
        "prefetchConcurrency": 24,
        "prefetchWorker": 2
    },
    "checkpointConfig": {
        "prefetchConcurrency": 24,
        "prefetchWorker": 4,
        "uploadConcurrency": 64
    }
}
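Before training, it can help to confirm that the configuration file parses and contains the sections the connector reads. The following Python snippet is a minimal sanity-check sketch; validate_connector_config is a hypothetical helper written for this topic, not part of the OSS connector, and the embedded JSON is the example configuration above.

```python
import json

def validate_connector_config(raw: str) -> dict:
    """Parse an OSS connector config and check the sections used in this topic."""
    cfg = json.loads(raw)
    for section in ("datasetConfig", "checkpointConfig"):
        if section not in cfg:
            raise ValueError(f"missing section: {section}")
    # Prefetch and upload settings must be positive integers.
    assert cfg["datasetConfig"]["prefetchWorker"] > 0
    assert cfg["checkpointConfig"]["uploadConcurrency"] > 0
    return cfg

# The example configuration from this topic, embedded as a string for the check.
example = """
{
    "logLevel": 1,
    "logPath": "/var/log/oss-connector/connector.log",
    "auditPath": "/var/log/oss-connector/audit.log",
    "datasetConfig": {"prefetchConcurrency": 24, "prefetchWorker": 2},
    "checkpointConfig": {"prefetchConcurrency": 24, "prefetchWorker": 4, "uploadConcurrency": 64}
}
"""
cfg = validate_connector_config(example)
```

To check the file you actually wrote, read /etc/oss-connector/config.json and pass its contents to the same helper.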
Prepare data
Upload the training set and validation set to the bucket.
Run the following commands to download the training set and validation set to the ECS instance. Note that the data used in this training task is only a portion of the entire dataset.

wget https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241216/jsnenr/n04487081.tar
wget https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241218/dxrciv/n10148035.tar
wget https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241216/senwji/val.tar

Run the following commands to extract the datasets from the downloaded packages into a dataset directory created in the current path:

tar -zxvf n10148035.tar && tar -zxvf n04487081.tar && tar -zxvf val.tar
mkdir dataset && mkdir ./dataset/train && mkdir ./dataset/val
mv n04487081 ./dataset/train/ && mv n10148035 ./dataset/train/ && mv IL*.JPEG ./dataset/val/

Run the python3 upload_dataset.py command to run the following script, which pre-processes the images and uploads the resulting datasets to the bucket:

# upload_dataset.py
from torchvision import transforms
from PIL import Image
import oss2
import os
from oss2.credentials import EnvironmentVariableCredentialsProvider

# In this example, the internal endpoint for the China (Hangzhou) region is used.
OSS_ENDPOINT = "oss-cn-hangzhou-internal.aliyuncs.com"  # The internal OSS endpoint.
OSS_BUCKET_NAME = "<YourBucketName>"  # The name of the bucket.
BUCKET_REGION = "cn-hangzhou"  # The ID of the region in which the bucket is located.
# Specify a custom prefix for the names of the datasets in the bucket.
OSS_URI_BASE = "dataset/imagenet/ILSVRC/Data"

def to_tensor(img_path):
    IMG_DIM_224 = 224
    compose = transforms.Compose([
        transforms.RandomResizedCrop(IMG_DIM_224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ])
    img = Image.open(img_path).convert('RGB')
    img_tensor = compose(img)
    numpy_data = img_tensor.numpy()
    binary_data = numpy_data.tobytes()
    return binary_data

def list_dir(directory):
    for root, _, files in os.walk(directory):
        rel_root = os.path.relpath(root, start=directory)
        for file in files:
            rel_filepath = os.path.join(rel_root, file) if rel_root != '.' else file
            yield rel_filepath

IMG_DIR_BASE = "./dataset"
"""
IMG_DIR_BASE stores the local path of the images. You can specify the local path by using an absolute or relative path. The structure of the local path must be consistent with that of the datasets:
{IMG_DIR_BASE}/
    train/
        n10148035/
            n10148035_10034.JPEG
            n10148035_10217.JPEG
            ...
        n11879895/
            n11879895_10016.JPEG
            n11879895_10019.JPEG
            ...
        ...
    val/
        ILSVRC2012_val_00000001.JPEG
        ILSVRC2012_val_00000002.JPEG
        ...
"""
bucket_api = oss2.Bucket(oss2.ProviderAuthV4(EnvironmentVariableCredentialsProvider()), OSS_ENDPOINT, OSS_BUCKET_NAME, region=BUCKET_REGION)
for phase in ["val", "train"]:
    IMG_DIR = "%s/%s" % (IMG_DIR_BASE, phase)
    for _, img_relative_path in enumerate(list_dir(IMG_DIR)):
        img_bin_name = img_relative_path.replace(".JPEG", ".pt")
        object_key = "%s/%s/%s" % (OSS_URI_BASE, phase, img_bin_name)
        bucket_api.put_object(object_key, to_tensor("%s/%s" % (IMG_DIR, img_relative_path)))
Download the files that store image data labels. The files are used to establish dataset mapping.
wget https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241220/izpskr/imagenet_class_index.json
wget https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241220/lfilrp/ILSVRC2012_val_labels.json
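These files map WordNet IDs (the n-prefixed directory names) and validation file names to class labels. The following sketch shows how such a mapping can be built. The JSON excerpt is hand-written in the format of imagenet_class_index.json, and the indexes shown are illustrative, not the real ImageNet class indexes:

```python
import json

# Hand-written excerpt in the imagenet_class_index.json format:
# class index -> [wnid, human-readable label]. Indexes here are illustrative.
class_index_json = """
{
    "0": ["n01440764", "tench"],
    "1": ["n04487081", "trolleybus"],
    "2": ["n10148035", "groom"]
}
"""

class_index = json.loads(class_index_json)
# Invert the mapping so a training image's parent directory (a wnid) yields its class index.
wnid_to_idx = {wnid: int(idx) for idx, (wnid, _label) in class_index.items()}

def label_for_train_object(object_key: str) -> int:
    # For example: "dataset/imagenet/ILSVRC/Data/train/n10148035/n10148035_10034.pt".
    wnid = object_key.split("/")[-2]
    return wnid_to_idx[wnid]
```

Validation labels work the same way, except that ILSVRC2012_val_labels.json is keyed by file name instead of by directory.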
Start model training
Create a module that is used to process ImageNet datasets. The module uses the accelerated endpoint to download datasets from the cache and creates a data loader.
Create a module to initialize a pretrained ResNet18 model.
Create a module to train a ResNet model. This module trains a given model based on the specified data loaders and number of epochs.
Create a script file that integrates model training processes.
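The training module described above amounts to a standard PyTorch training loop. The following is a minimal, self-contained sketch that uses a tiny stand-in model and random tensors in place of the pretrained ResNet-18 and the OSS-backed data loaders from demo.tar.gz:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-ins for the real components: in the actual task the model is a pretrained
# ResNet-18 and the loaders read pre-processed tensors through the OSS accelerator.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 2))
images = torch.randn(32, 3, 8, 8)
labels = torch.randint(0, 2, (32,))
loader = DataLoader(TensorDataset(images, labels), batch_size=8, shuffle=True)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

def train(model, loader, epochs):
    """Train for the given number of epochs and return the per-epoch mean loss."""
    model.train()
    losses = []
    for _ in range(epochs):
        epoch_loss = 0.0
        for x, y in loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item() * x.size(0)
        losses.append(epoch_loss / len(loader.dataset))
    return losses

losses = train(model, loader, epochs=5)
```

In the real script, the loop additionally moves batches to the GPU, runs a validation pass per epoch, and writes the checkpoint (resnet18.pt) back to the bucket through the OSS connector.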
Run the python3 main.py command to start the training. The following figure shows that the training has started.
Verify the result
On the Buckets page, check whether the checkpoints directory contains the resnet18.pt object. The following figure shows that the checkpoints are uploaded to OSS.
