Random I/O on millions of small files — such as ImageNet images — is a leading bottleneck for GPU utilization during AI/ML training. By combining OSS Connector for AI/ML with OSS Accelerator, you can significantly boost data loading speed.
How the acceleration works
OSS Connector for AI/ML and OSS Accelerator optimize at different layers (application and storage, respectively) and produce a compounding effect when combined:

- OSS Connector for AI/ML (application layer): Uses asynchronous I/O and multi-threaded prefetching to convert serial file requests into highly concurrent parallel streams. The next batch's data is prefetched in the background, eliminating CPU/GPU idle time spent waiting on I/O.
- OSS Accelerator (storage layer): Caches hot data on high-performance storage media using a cold/warm cache model. On the first read (cold cache), data is loaded from the OSS origin into the cache, so performance is comparable to standard OSS access. On subsequent reads (warm cache), data is served directly from the cache, with roughly 2.8x lower P50 latency than the OSS origin (approximately 12 ms vs. 35 ms). In model training, where the same dataset is read across multiple epochs, acceleration takes effect starting from the second epoch.
- Combined effect: The high volume of concurrent requests from the Connector is absorbed by the Accelerator at millisecond speed, removing the OSS origin latency as a bottleneck. Traffic bursts during training startup or epoch transitions are smoothed out by the cache, ensuring consistently high throughput. The sketch below puts rough numbers on this.
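As a rough sketch of why the two layers compound: in a latency-bound small-file pipeline, throughput scales approximately with concurrency divided by per-request latency (Little's law). The snippet below applies that model to the example figures quoted above. It is an illustration, not a benchmark; real throughput also depends on bandwidth, queueing, and the file-size mix.

```python
# Rough model (Little's law): files/s ≈ in-flight requests / per-request latency.
# Latencies are the example P50 figures quoted above; the concurrency matches
# the prefetchConcurrency value recommended later in this guide.
CONCURRENCY = 16
AVG_FILE_MB = 0.115  # ~115 KB average file size (ImageNet)

for name, p50_s in [("OSS origin", 0.035), ("warm accelerator cache", 0.012)]:
    files_per_s = CONCURRENCY / p50_s
    print(f"{name}: ~{files_per_s:.0f} files/s, ~{files_per_s * AVG_FILE_MB:.0f} MB/s")
```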
Prerequisites
- An OSS bucket with your training dataset uploaded.
- An AccessKey ID and AccessKey Secret. For more information, see Create an AccessKey pair.
- An ECS instance. Compute-optimized or network-optimized instances (such as ecs.g7.32xlarge) are recommended. Alibaba Cloud Linux 3/4 is the recommended operating system. For more information, see Custom launch ECS instances.
Enable OSS Accelerator for your bucket
1. Log on to the OSS console and click the target bucket name.

2. In the left-side navigation pane, choose Bucket Settings > OSS Accelerator.

3. Click Create OSS Accelerator and configure the following parameters:

   | Parameter | Description |
   | --- | --- |
   | Zone | Select the same zone as your ECS instance. In this example, China (Beijing) Zone H is used. |
   | Capacity | Must be greater than or equal to the total size of your dataset. In this example, 20 TB is used. |
   | Acceleration Policy | Select Accelerate Specified Path and enter the dataset prefix, or select Accelerate Entire Bucket. |

   Important: The ECS instance and OSS Accelerator should be in the same zone. Cross-zone access introduces additional network latency that degrades acceleration performance.

4. After creation, the endpoint to use with OSS Connector for AI/ML follows the format `oss-cache-<zone>.aliyuncs.com`. For example, if the zone is `cn-beijing-h`, the corresponding endpoint is `oss-cache-cn-beijing-h.aliyuncs.com`.
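   To make the format concrete, here is a throwaway helper (the zone ID is just this guide's example):

   ```python
   # Illustrative only: derive the accelerator endpoint from a zone ID.
   def accelerator_endpoint(zone: str) -> str:
       return f"oss-cache-{zone}.aliyuncs.com"

   print(accelerator_endpoint("cn-beijing-h"))  # oss-cache-cn-beijing-h.aliyuncs.com
   ```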
Install and configure OSS Connector for AI/ML
1. Log on to the ECS instance and install OSS Connector for AI/ML (PyTorch edition).

   ```bash
   pip install osstorchconnector
   ```

2. Configure access credentials. Replace `<yourAccessKeyId>` and `<yourAccessKeySecret>` with your actual AccessKey information.

   ```bash
   mkdir -p /root/.alibabacloud
   cat > /root/.alibabacloud/credentials << 'EOF'
   {
     "AccessKeyId": "<yourAccessKeyId>",
     "AccessKeySecret": "<yourAccessKeySecret>"
   }
   EOF
   ```

   The credentials file must use JSON format. For more configuration options, see Configure OSS Connector for AI/ML.
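   If you want to verify the file before training, here is a quick optional check (a sketch using only the Python standard library):

   ```python
   # Optional sanity check: the credentials file must parse as JSON and
   # contain the two keys shown above.
   import json

   with open("/root/.alibabacloud/credentials") as f:
       creds = json.load(f)  # raises an error on malformed JSON

   missing = {"AccessKeyId", "AccessKeySecret"} - creds.keys()
   assert not missing, f"credentials file is missing: {missing}"
   print("credentials file looks OK")
   ```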
3. Create a configuration file for OSS Connector to tune prefetch parameters for small file reads.

   ```bash
   mkdir -p /etc/oss-connector/
   cat > /etc/oss-connector/config.json << 'EOF'
   {
       "logLevel": 1,
       "logPath": "/var/log/oss-connector/connector.log",
       "auditPath": "/var/log/oss-connector/audit.log",
       "datasetConfig": {
           "prefetchMB": 1024,
           "prefetchConcurrency": 16,
           "prefetchWorker": 2,
           "prefetchUnitMB": 1,
           "timeoutMs": 10000
       },
       "checkpointConfig": {
           "prefetchConcurrency": 24,
           "prefetchWorker": 4,
           "uploadConcurrency": 64
       }
   }
   EOF
   ```

   Key parameters:
   | Parameter | Recommended value | Description |
   | --- | --- | --- |
   | prefetchMB | 1024 | Prefetch buffer size in MB. A 1 GB buffer can hold approximately 8,700 files at 115 KB each. Increase this value for larger files. |
   | prefetchConcurrency | 16 | Number of concurrent prefetch operations. Takes full advantage of high-bandwidth instances. |
   | prefetchUnitMB | 1 | Size of each prefetch unit in MB. Set to 1 MB for small files. Increase to match the file size for larger files. |
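   The "approximately 8,700 files" figure is simple arithmetic you can rerun for your own dataset (the rounding of 1 GB to 10^6 KB follows the guide's figure):

   ```python
   # How many average-size files fit in the prefetch buffer?
   buffer_kb = 1_000_000   # 1 GB buffer (prefetchMB = 1024), rounded to 10^6 KB
   avg_file_kb = 115       # average ImageNet file size used in this guide
   print(f"~{buffer_kb // avg_file_kb:,} files per buffer")  # ~8,695, i.e. the ~8,700 above
   ```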
Run a performance comparison test
Create a test script named `test_accelerator.py` that reads the dataset using both the OSS internal endpoint and the OSS Accelerator endpoint, then compares the results.
```python
from osstorchconnector import OssMapDataset
import torch
from torch.utils.data import DataLoader
import time
import numpy as np

# === Configuration ===
# First run: use the OSS internal endpoint. Second run: switch to the OSS Accelerator endpoint.
ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
# ENDPOINT = "http://oss-cache-cn-beijing-h.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://<yourBucketName>/<yourDatasetPrefix>/"
REGION = "cn-beijing"


def collate_fn(batch):
    # Read every object in the batch and record its size and per-file read time
    results = []
    for item in batch:
        start = time.perf_counter()
        content = item.read()
        read_time = time.perf_counter() - start
        results.append({'size': item.size, 'read_time': read_time})
    return results


dataset = OssMapDataset.from_prefix(
    OSS_URI,
    endpoint=ENDPOINT,
    cred_path=CRED_PATH,
    config_path=CONFIG_PATH,
    region=REGION
)

NUM_WORKERS = 8
BATCH_SIZE = 256

dataloader = DataLoader(
    dataset,
    batch_size=BATCH_SIZE,
    num_workers=NUM_WORKERS,
    collate_fn=collate_fn,
    pin_memory=False,
    shuffle=True
)

# === Run test ===
all_batch_durations = []
total_size = 0
file_count = 0
start_time = time.perf_counter()
last_receive_time = start_time

print(f"Starting test: {NUM_WORKERS} workers, batch_size={BATCH_SIZE}")
print(f"Endpoint: {ENDPOINT}\n")

try:
    for batch in dataloader:
        current_receive_time = time.perf_counter()
        batch_duration = current_receive_time - last_receive_time
        all_batch_durations.append(batch_duration)
        last_receive_time = current_receive_time
        batch_total_size = sum(data['size'] for data in batch)
        total_size += batch_total_size
        file_count += len(batch)
        # Report progress roughly every 10,000 files (every 40 full batches)
        if file_count % (BATCH_SIZE * 40) == 0:
            elapsed = current_receive_time - start_time
            throughput = total_size / elapsed / (1024 ** 2)
            print(f"  {file_count:,} files processed | Throughput: {throughput:.2f} MB/s")
except KeyboardInterrupt:
    print("\nTest interrupted")

# === Results ===
end_time = time.perf_counter()
total_elapsed = end_time - start_time
num_batches = len(all_batch_durations)

if num_batches > 0:
    durations_ms = np.array(all_batch_durations) * 1000
    print(f"\n{'=' * 50}")
    print("Test Results")
    print(f"{'=' * 50}")
    print(f"Total files: {file_count:,}")
    print(f"Total data: {total_size / (1024 ** 2):,.2f} MB")
    print(f"Total time: {total_elapsed:.2f} s")
    print(f"Avg throughput: {total_size / total_elapsed / (1024 ** 2):,.2f} MB/s")
    print(f"\nBatch latency (ms): Avg={np.mean(durations_ms):.2f} "
          f"P50={np.percentile(durations_ms, 50):.2f} "
          f"P95={np.percentile(durations_ms, 95):.2f}")
else:
    print("No batches received")
```
Replace `OSS_URI` and `REGION` in the script with your actual values, then follow the steps below.
1. Run with the OSS internal endpoint (baseline)

Make sure `ENDPOINT` in the script is set to the OSS internal address, then run the test and record the total time:

```bash
python test_accelerator.py
```
2. Switch to the OSS Accelerator endpoint
Change `ENDPOINT` in the script to the accelerator address:

```python
ENDPOINT = "http://oss-cache-cn-beijing-h.aliyuncs.com"
```
The first read through the accelerator loads data from the OSS origin into the cache (cold cache), so performance will be similar to direct OSS access. You need to run the test twice: the first run warms the cache, and the second run shows the actual acceleration.
```bash
# First run: warm up the cache
python test_accelerator.py

# Second run: cache hits, acceleration takes effect
python test_accelerator.py
```
Compare the total time of the baseline run against the second accelerator run to see the speedup for small file workloads.
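If you want the speedup as a single number, divide the baseline total time by the warm-cache total time. A minimal sketch, using the sample measurements reported in the next section:

```python
# Speedup = baseline total time / warm-cache total time.
baseline_s = 246.89     # internal endpoint (no accelerator)
warm_cache_s = 101.49   # second accelerator run (cache hits)
print(f"Speedup: {baseline_s / warm_cache_s:.2f}x")  # Speedup: 2.43x
```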
Performance comparison results
The following results were measured on an ecs.g7.32xlarge instance using the ImageNet dataset (1.28 million files, 137 GB total, ~115 KB average file size):
| Metric | Without Accelerator | With Accelerator |
| --- | --- | --- |
| Total time | 246.89 s | 101.49 s |
| Average throughput | 567.44 MB/s | 1,380.41 MB/s |
| Speedup | — | 2.43x |
Actual performance varies depending on instance type, dataset size, and file size. We recommend running your own tests for accurate benchmarks.