OssIterableDataset streams objects from OSS sequentially, making it well-suited for training jobs where memory is limited or datasets are too large to load in full. Unlike map-style datasets, it does not support random access or parallel shuffling across workers.
Prerequisites
Before you begin, make sure you have:
OSS Connector for AI/ML installed. See Install OSS Connector for AI/ML.
OSS Connector for AI/ML configured. See Configure OSS Connector for AI/ML.
Choose a build method
OssIterableDataset provides three factory methods. Pick the one that matches your data layout:
| Method | Best for | When to use |
|---|---|---|
| from_prefix | Objects under a common path prefix | Paths follow a uniform naming pattern |
| from_objects | A known, scattered list of objects | Paths are explicit but spread across the bucket |
| from_manifest_file | Very large datasets (tens of millions of objects) | Data indexing is enabled on the bucket and the dataset is loaded frequently; avoids repeated list API calls and reduces associated fees |
Build a dataset
Build a dataset using a URI prefix
Use from_prefix when all your training objects share a common path prefix in OSS.
from osstorchconnector import OssIterableDataset
ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/"
# Build the dataset from a URI prefix. All objects under the prefix are included.
iterable_dataset = OssIterableDataset.from_prefix(
    OSS_URI,
    endpoint=ENDPOINT,
    cred_path=CRED_PATH,
    config_path=CONFIG_PATH
)
# Iterate over the dataset. Each item is a DataObject with key, size, and read().
for item in iterable_dataset:
    print(item.key)   # object key (path within the bucket)
    print(item.size)  # object size in bytes
    content = item.read()
    print(len(content))
Build a dataset from a list of URIs
Use from_objects when you have a specific list of object URIs, even if they are scattered across different paths.
from osstorchconnector import OssIterableDataset
ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
# uris is an iterable of strings; each string is a full OSS URI.
uris = [
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00001.png",
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00002.png",
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00003.png"
]
# Build the dataset from the URI list.
iterable_dataset = OssIterableDataset.from_objects(
    uris,
    endpoint=ENDPOINT,
    cred_path=CRED_PATH,
    config_path=CONFIG_PATH
)
# Iterate over the dataset. Each item is a DataObject with key, size, and read().
for item in iterable_dataset:
    print(item.key)
    print(item.size)
    content = item.read()
    print(len(content))
Build a dataset from a manifest file
Use from_manifest_file for large datasets with data indexing enabled. This method reads the object list from a pre-built manifest file instead of calling list APIs at runtime, which reduces both latency and fees.
Step 1: Create a manifest file.
Run touch manifest_file and populate it with one of the following formats.
Object names only:
Img/BadImag/Bmp/Sample001/img001-00001.png
Img/BadImag/Bmp/Sample001/img001-00002.png
Img/BadImag/Bmp/Sample001/img001-00003.png
Object names with labels (separate the name and the label with a tab character):
Img/BadImag/Bmp/Sample001/img001-00001.png	label1
Img/BadImag/Bmp/Sample001/img001-00002.png	label2
Img/BadImag/Bmp/Sample001/img001-00003.png	label3
Step 2: Build the dataset.
The following examples use imagenet_manifest_parser, the built-in parser for the manifest file. Both a local path and an OSS URI are supported for manifest_file_path.
import io
from typing import Iterable, Tuple, Union
from osstorchconnector import OssIterableDataset
from osstorchconnector import imagenet_manifest_parser
ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_BASE_URI = "oss://ai-testset/EnglistImg/"
# Option 1: Load the manifest file from a local path.
# manifest_file_path: local path to the manifest file
# manifest_parser: parses each line into (oss_uri, label) tuples
# oss_base_uri: prepended to relative paths in the manifest to form full OSS URIs
MANIFEST_FILE_LOCAL = "/path/to/manifest_file.txt"
iterable_dataset = OssIterableDataset.from_manifest_file(
    manifest_file_path=MANIFEST_FILE_LOCAL,
    manifest_parser=imagenet_manifest_parser,
    oss_base_uri=OSS_BASE_URI,
    endpoint=ENDPOINT,
    cred_path=CRED_PATH,
    config_path=CONFIG_PATH
)
for item in iterable_dataset:
    print(item.key)
    print(item.size)
    print(item.label)
    content = item.read()
    print(len(content))
# Option 2: Load the manifest file from an OSS URI.
MANIFEST_FILE_URI = "oss://ai-testset/EnglistImg/manifest_file"
iterable_dataset = OssIterableDataset.from_manifest_file(
    manifest_file_path=MANIFEST_FILE_URI,
    manifest_parser=imagenet_manifest_parser,
    oss_base_uri=OSS_BASE_URI,
    endpoint=ENDPOINT,
    cred_path=CRED_PATH,
    config_path=CONFIG_PATH
)
Create a PyTorch DataLoader
Pass the dataset to torch.utils.data.DataLoader to use it in a training loop.
import torch
from osstorchconnector import OssIterableDataset
ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/"
def transform(obj):
    data = obj.read()
    return obj.key, obj.label
iterable_dataset = OssIterableDataset.from_prefix(
    OSS_URI,
    endpoint=ENDPOINT,
    transform=transform,
    cred_path=CRED_PATH,
    config_path=CONFIG_PATH
)
# Create a DataLoader backed by the iterable dataset.
loader = torch.utils.data.DataLoader(
    iterable_dataset,
    batch_size=256,
    num_workers=32,
    prefetch_factor=2
)
# Use the loader in the training loop.
for batch in loader:
    # Perform training operations.
    ...
Parameters
The following tables list all parameters for OssIterableDataset. Parameters marked as common apply to all three factory methods.
Common parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| endpoint | string | Yes | The OSS endpoint used to access the bucket. See Regions and endpoints. |
| cred_path | string | Yes | Path to the credentials file. Default: /root/.alibabacloud/credentials. See Configure access credentials. |
| config_path | string | Yes | Path to the OSS Connector for AI/ML configuration file. Default: /etc/oss-connector/config.json. See Configure OSS Connector for AI/ML. |
| transform | object | No | A callable applied to each DataObject before it is returned. If not specified, the default identity function is used, which returns obj.copy(). |
Do not return the DataObject directly from a transform function — the iterator may fail. Return a copy using obj.copy() or extract the data you need (for example, raw bytes or a tensor).
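The note above can be illustrated with a minimal sketch. FakeDataObject is a hypothetical stand-in, defined here only so the example runs without the connector installed; it mimics the key, label, size, and read() members described on this page and is not part of the library. The transform returns plain Python values rather than the object itself.

```python
import io


class FakeDataObject:
    """Hypothetical stand-in for the connector's DataObject (illustration only)."""

    def __init__(self, key: str, payload: bytes, label: str = ""):
        self.key = key
        self.label = label
        self.size = len(payload)
        self._stream = io.BytesIO(payload)

    def read(self) -> bytes:
        # Return the full object body, like DataObject.read() in the examples.
        return self._stream.read()


def bytes_transform(obj):
    """Extract what training needs and return plain values, not the object."""
    data = obj.read()
    return obj.key, data, obj.label


sample = FakeDataObject("Img/img001-00001.png", b"\x89PNG", "label1")
result = bytes_transform(sample)
print(result)
```

With the real connector, a function shaped like bytes_transform would be passed as transform=bytes_transform to one of the factory methods.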
Method-specific parameters
| Parameter | Method | Type | Required | Description |
|---|---|---|---|---|
| oss_uri | from_prefix | string | Yes | OSS URI prefix used to select objects. Must start with oss://. |
| object_uris | from_objects | string | Yes | One or more OSS URIs identifying specific objects. Must start with oss://. |
| manifest_file_path | from_manifest_file | string | Yes | Path to the manifest file. Accepts a local file path or an OSS URI starting with oss://. |
| manifest_parser | from_manifest_file | Callable Object | Yes | A function that parses the manifest file and returns an iterator of (oss_uri, label) tuples. Use the built-in imagenet_manifest_parser or provide a custom implementation. See manifest_parser. |
| oss_base_uri | from_manifest_file | string | Yes | Base OSS URI prepended to relative paths in the manifest to form complete OSS URIs. Pass "" if the manifest already contains full URIs. |
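As a rough illustration of how oss_base_uri relates to manifest entries, the sketch below joins a base URI and an object name by plain string prefixing. The helper to_full_uri is hypothetical, and the joining rule is an assumption based on the description in the table, not the connector's actual implementation.

```python
def to_full_uri(oss_base_uri: str, key: str) -> str:
    """Hypothetical helper: prepend the base URI to a manifest entry.

    Assumes simple string prefixing, so an empty base leaves the entry unchanged.
    """
    return oss_base_uri + key


# Manifest holds relative names: the base URI supplies the bucket and leading path.
relative = to_full_uri(
    "oss://ai-testset/EnglistImg/",
    "Img/BadImag/Bmp/Sample001/img001-00001.png",
)

# Manifest already holds full URIs: pass an empty base.
absolute = to_full_uri(
    "",
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00001.png",
)

print(relative)
print(relative == absolute)
```

Under this assumption, a trailing slash on the base URI matters, because nothing is inserted between the base and the object name.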
Built-in methods
transform
When a dataset is iterated, each DataObject is passed through the transform function before being returned.
Default method
If no transform is specified, the following identity function is used:
# Default transform: returns a copy of the DataObject.
def identity(obj: DataObject) -> DataObject:
    if obj is not None:
        return obj.copy()
    else:
        return None
Custom method
To apply preprocessing — for example, decoding an image and normalizing pixel values — pass a custom function:
import sys
import io
import torchvision.transforms as transforms
from PIL import Image
from osstorchconnector import OssIterableDataset
ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/"
# Define the image preprocessing pipeline.
trans = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
# Custom transform: decode the image and apply preprocessing.
# Returns a (tensor, label) tuple, not the DataObject itself.
def transform(obj):
    try:
        img = Image.open(io.BytesIO(obj.read())).convert('RGB')
        val = trans(img)
    except Exception as e:
        raise e
    return val, obj.label
iterable_dataset = OssIterableDataset.from_prefix(
    OSS_URI,
    endpoint=ENDPOINT,
    transform=transform,
    cred_path=CRED_PATH,
    config_path=CONFIG_PATH
)
manifest_parser
manifest_parser parses the manifest file and returns an iterator of (oss_uri, label) tuples consumed by from_manifest_file.
Import the built-in parser:
from osstorchconnector import imagenet_manifest_parser
The built-in implementation reads tab-separated lines. Lines with two or more fields yield (key, label); lines with one field yield (key, '').
import io
import logging
from typing import Iterable, Tuple
def imagenet_manifest_parser(reader: io.IOBase) -> Iterable[Tuple[str, str]]:
    # Read the whole manifest, decode it, and process it line by line.
    lines = reader.read().decode("utf-8").strip().split("\n")
    for i, line in enumerate(lines):
        try:
            items = line.strip().split('\t')
            if len(items) >= 2:
                key = items[0]
                label = items[1]
                yield (key, label)
            elif len(items) == 1:
                key = items[0]
                yield (key, '')
            else:
                raise ValueError("format error")
        except ValueError as e:
            logging.error(f"Error: {e} for line {i}: {line}")
To support a different manifest format, implement a function with the same signature: accept an open file-like object and return an iterable of (oss_uri, label) tuples.
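As one possible shape for such a custom parser, the sketch below handles a comma-separated manifest (key, then an optional label). The name csv_manifest_parser and the CSV layout are hypothetical choices for illustration. Like the built-in parser, it assumes reader.read() returns bytes.

```python
import csv
import io
from typing import Iterable, Tuple


def csv_manifest_parser(reader: io.IOBase) -> Iterable[Tuple[str, str]]:
    """Hypothetical parser for a comma-separated manifest: key[,label]."""
    text = reader.read().decode("utf-8")
    for row in csv.reader(io.StringIO(text)):
        if not row or not row[0].strip():
            continue  # skip blank lines
        key = row[0].strip()
        label = row[1].strip() if len(row) >= 2 else ""
        yield (key, label)


# Exercise the parser on an in-memory manifest instead of a real file.
sample = io.BytesIO(
    b"Img/BadImag/Bmp/Sample001/img001-00001.png,label1\n"
    b"\n"
    b"Img/BadImag/Bmp/Sample001/img001-00002.png\n"
)
parsed = list(csv_manifest_parser(sample))
print(parsed)
```

A parser like this would be supplied as manifest_parser=csv_manifest_parser when calling from_manifest_file.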
What's next
For details on the properties and I/O methods available on each dataset item, see Data type in OSS Connector for AI/ML.
To use OSS Connector for AI/ML in a containerized training environment, see Build a Docker image that contains an OSS Connector for AI/ML environment.