Object Storage Service: Use data in OSS to build an iterable dataset for sequential streaming reads

Last Updated: Mar 20, 2026

OssIterableDataset streams objects from OSS sequentially, making it well-suited for training jobs where memory is limited or datasets are too large to load in full. Unlike map-style datasets, it does not support random access or parallel shuffling across workers.
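
To make the difference in access pattern concrete, the following plain-Python sketch contrasts the two styles. It is an illustration only and not the connector's implementation; MapStyleDataset and StreamingDataset are hypothetical names, and the in-memory list stands in for objects that would really be fetched from OSS.

```python
from typing import Iterator, List


class MapStyleDataset:
    """Map-style access: items are addressable by index (random access)."""

    def __init__(self, items: List[bytes]) -> None:
        self._items = items

    def __len__(self) -> int:
        return len(self._items)

    def __getitem__(self, index: int) -> bytes:
        return self._items[index]


class StreamingDataset:
    """Iterable-style access: items are produced one at a time, in order."""

    def __init__(self, items: List[bytes]) -> None:
        # Hypothetical in-memory source; a real stream would fetch lazily.
        self._items = items

    def __iter__(self) -> Iterator[bytes]:
        for item in self._items:
            yield item


samples = [b"a", b"bb", b"ccc"]
streamed_sizes = [len(item) for item in StreamingDataset(samples)]
```

The streaming class deliberately has no __getitem__ or __len__, so a consumer can only walk it from start to end.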

Prerequisites

Before you begin, make sure you have:

Choose a build method

OssIterableDataset provides three factory methods. Pick the one that matches your data layout:

| Method | Best for | When to use |
| --- | --- | --- |
| from_prefix | Objects under a common path prefix | Paths follow a uniform naming pattern |
| from_objects | A known, scattered list of objects | Paths are explicit but spread across the bucket |
| from_manifest_file | Very large datasets (tens of millions of objects) | Data indexing is enabled on the bucket and the dataset is loaded frequently; avoids repeated list API calls and reduces associated fees |

Build a dataset

Build a dataset using a URI prefix

Use from_prefix when all your training objects share a common path prefix in OSS.

from osstorchconnector import OssIterableDataset

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/"

# Build the dataset from a URI prefix. All objects under the prefix are included.
iterable_dataset = OssIterableDataset.from_prefix(
    OSS_URI,
    endpoint=ENDPOINT,
    cred_path=CRED_PATH,
    config_path=CONFIG_PATH
)

# Iterate over the dataset. Each item is a DataObject with key, size, and read().
for item in iterable_dataset:
    print(item.key)    # object key (path within the bucket)
    print(item.size)   # object size in bytes
    content = item.read()
    print(len(content))

Build a dataset from a list of URIs

Use from_objects when you have a specific list of object URIs, even if they are scattered across different paths.

from osstorchconnector import OssIterableDataset

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"

# uris is an iterable of strings, each one a full OSS URI.
uris = [
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00001.png",
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00002.png",
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00003.png"
]

# Build the dataset from the URI list.
iterable_dataset = OssIterableDataset.from_objects(
    uris,
    endpoint=ENDPOINT,
    cred_path=CRED_PATH,
    config_path=CONFIG_PATH
)

# Iterate over the dataset. Each item is a DataObject with key, size, and read().
for item in iterable_dataset:
    print(item.key)
    print(item.size)
    content = item.read()
    print(len(content))
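
When the object names follow a numbering scheme, the URI list can be generated rather than typed out. The sketch below is one way to do that; sample_uris is a hypothetical helper, and the bucket, path, and file-name pattern are taken from the sample URIs above. The result is materialized as a list before it would be passed to from_objects.

```python
from typing import Iterator

BUCKET = "ai-testset"
PREFIX = "EnglistImg/Img/BadImag/Bmp/Sample001"


def sample_uris(first: int, last: int) -> Iterator[str]:
    """Yield oss:// URIs for img001-<index>.png with index in [first, last]."""
    for index in range(first, last + 1):
        # Zero-pad the index to five digits to match the sample object names.
        yield f"oss://{BUCKET}/{PREFIX}/img001-{index:05d}.png"


uris = list(sample_uris(1, 3))
```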

Build a dataset from a manifest file

Use from_manifest_file for large datasets with data indexing enabled. This method reads the object list from a pre-built manifest file instead of calling list APIs at runtime, which reduces both latency and fees.

Step 1: Create a manifest file.

Run touch manifest_file and populate it with one of the following formats.

Object names only:

Img/BadImag/Bmp/Sample001/img001-00001.png
Img/BadImag/Bmp/Sample001/img001-00002.png
Img/BadImag/Bmp/Sample001/img001-00003.png

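Object names with labels (separate the name and the label with a tab character, which is what the built-in imagenet_manifest_parser splits on):

Img/BadImag/Bmp/Sample001/img001-00001.png	label1
Img/BadImag/Bmp/Sample001/img001-00002.png	label2
Img/BadImag/Bmp/Sample001/img001-00003.png	label3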
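
For more than a handful of objects, the manifest is easier to generate with a script. The sketch below writes one entry per line and separates the name from the label with a tab, because the built-in parser splits on tabs. write_manifest is a hypothetical helper, and the output goes to a temporary directory only to keep the example self-contained.

```python
import os
import tempfile
from typing import Iterable, Optional, Tuple


def write_manifest(path: str, entries: Iterable[Tuple[str, Optional[str]]]) -> int:
    """Write one entry per line; return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for key, label in entries:
            # Entries without a label are written as the object name only.
            line = key if label is None else f"{key}\t{label}"
            f.write(line + "\n")
            count += 1
    return count


entries = [
    ("Img/BadImag/Bmp/Sample001/img001-00001.png", "label1"),
    ("Img/BadImag/Bmp/Sample001/img001-00002.png", "label2"),
    ("Img/BadImag/Bmp/Sample001/img001-00003.png", None),
]
manifest_path = os.path.join(tempfile.mkdtemp(), "manifest_file")
written = write_manifest(manifest_path, entries)
```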

Step 2: Build the dataset.

The following examples use imagenet_manifest_parser, the built-in parser for the manifest file. Both a local path and an OSS URI are supported for manifest_file_path.

import io
from typing import Iterable, Tuple, Union
from osstorchconnector import OssIterableDataset
from osstorchconnector import imagenet_manifest_parser

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_BASE_URI = "oss://ai-testset/EnglistImg/"

# Option 1: Load the manifest file from a local path.
# manifest_file_path: local path to the manifest file
# manifest_parser: parses each line into (oss_uri, label) tuples
# oss_base_uri: prepended to relative paths in the manifest to form full OSS URIs
MANIFEST_FILE_LOCAL = "/path/to/manifest_file.txt"
iterable_dataset = OssIterableDataset.from_manifest_file(
    manifest_file_path=MANIFEST_FILE_LOCAL,
    manifest_parser=imagenet_manifest_parser,
    oss_base_uri=OSS_BASE_URI,
    endpoint=ENDPOINT,
    cred_path=CRED_PATH,
    config_path=CONFIG_PATH
)
for item in iterable_dataset:
    print(item.key)
    print(item.size)
    print(item.label)
    content = item.read()
    print(len(content))

# Option 2: Load the manifest file from an OSS URI.
MANIFEST_FILE_URI = "oss://ai-testset/EnglistImg/manifest_file"
iterable_dataset = OssIterableDataset.from_manifest_file(
    manifest_file_path=MANIFEST_FILE_URI,
    manifest_parser=imagenet_manifest_parser,
    oss_base_uri=OSS_BASE_URI,
    endpoint=ENDPOINT,
    cred_path=CRED_PATH,
    config_path=CONFIG_PATH
)

Create a PyTorch DataLoader

Pass the dataset to torch.utils.data.DataLoader to use it in a training loop.

import torch
from osstorchconnector import OssIterableDataset

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/"

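# transform runs on each DataObject. Return plain values, not the DataObject itself.
def transform(obj):
    data = obj.read()  # raw object bytes; decode or convert to a tensor here as needed
    return obj.key, obj.label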

iterable_dataset = OssIterableDataset.from_prefix(
    OSS_URI,
    endpoint=ENDPOINT,
    transform=transform,
    cred_path=CRED_PATH,
    config_path=CONFIG_PATH
)

# Create a DataLoader backed by the iterable dataset.
loader = torch.utils.data.DataLoader(
    iterable_dataset,
    batch_size=256,
    num_workers=32,
    prefetch_factor=2
)

# Use the loader in the training loop.
for batch in loader:
    # Perform training operations.
    ...

Parameters

The following tables list all parameters for OssIterableDataset. Parameters marked as common apply to all three factory methods.

Common parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| endpoint | string | Yes | The OSS endpoint used to access the bucket. See Regions and endpoints. |
| cred_path | string | Yes | Path to the credentials file. Default: /root/.alibabacloud/credentials. See Configure access credentials. |
| config_path | string | Yes | Path to the OSS Connector for AI/ML configuration file. Default: /etc/oss-connector/config.json. See Configure OSS Connector for AI/ML. |
| transform | object | No | A callable applied to each DataObject before it is returned. If not specified, the default identity function is used, which returns obj.copy(). |

Important

Do not return the DataObject directly from a transform function — the iterator may fail. Return a copy using obj.copy() or extract the data you need (for example, raw bytes or a tensor).
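
A minimal sketch of a transform that follows this rule by returning plain Python values is shown below. FakeDataObject is a hypothetical stand-in used only so the function can be exercised locally; it assumes nothing about the real DataObject beyond the key attribute and read() method used elsewhere on this page.

```python
from typing import Tuple


def to_bytes(obj) -> Tuple[str, bytes]:
    """Transform that extracts the key and raw bytes from the object."""
    return obj.key, obj.read()


class FakeDataObject:
    """Hypothetical stand-in for a DataObject, for local testing only."""

    def __init__(self, key: str, payload: bytes) -> None:
        self.key = key
        self.size = len(payload)
        self._payload = payload

    def read(self) -> bytes:
        return self._payload


key, data = to_bytes(FakeDataObject("Sample001/img001-00001.png", b"\x89PNG"))
```

A function like to_bytes would be passed as transform=to_bytes to any of the three factory methods.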

Method-specific parameters

| Parameter | Method | Type | Required | Description |
| --- | --- | --- | --- | --- |
| oss_uri | from_prefix | string | Yes | OSS URI prefix used to select objects. Must start with oss://. |
| object_uris | from_objects | string | Yes | One or more OSS URIs identifying specific objects. Must start with oss://. |
| manifest_file_path | from_manifest_file | string | Yes | Path to the manifest file. Accepts a local file path or an OSS URI starting with oss://. |
| manifest_parser | from_manifest_file | Callable Object | Yes | A function that parses the manifest file and returns an iterator of (oss_uri, label) tuples. Use the built-in imagenet_manifest_parser or provide a custom implementation. See manifest_parser. |
| oss_base_uri | from_manifest_file | string | Yes | Base OSS URI prepended to relative paths in the manifest to form complete OSS URIs. Pass "" if the manifest already contains full URIs. |

Built-in methods

transform

When a dataset is iterated, each DataObject is passed through the transform function before being returned.

Default method

If no transform is specified, the following identity function is used:

# Default transform: returns a copy of the DataObject.
def identity(obj: DataObject) -> DataObject:
    if obj is not None:
        return obj.copy()
    else:
        return None

Custom method

To apply preprocessing — for example, decoding an image and normalizing pixel values — pass a custom function:

import io
import torchvision.transforms as transforms
from PIL import Image
from osstorchconnector import OssIterableDataset

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/"

# Define the image preprocessing pipeline.
trans = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Custom transform: decode the image and apply preprocessing.
# Returns a (tensor, label) tuple, not the DataObject itself.
# Decoding errors propagate to the caller unchanged.
def transform(obj):
    img = Image.open(io.BytesIO(obj.read())).convert('RGB')
    val = trans(img)
    return val, obj.label

iterable_dataset = OssIterableDataset.from_prefix(
    OSS_URI,
    endpoint=ENDPOINT,
    transform=transform,
    cred_path=CRED_PATH,
    config_path=CONFIG_PATH
)

manifest_parser

manifest_parser parses the manifest file and returns an iterator of (oss_uri, label) tuples consumed by from_manifest_file.

Import the built-in parser:

from osstorchconnector import imagenet_manifest_parser

The built-in implementation reads tab-separated lines. Lines with two or more fields yield (key, label); lines with one field yield (key, '').

import io
import logging
from typing import Iterable, Tuple

def imagenet_manifest_parser(reader: io.IOBase) -> Iterable[Tuple[str, str]]:
    lines = reader.read().decode("utf-8").strip().split("\n")
    for i, line in enumerate(lines):
        try:
            items = line.strip().split('\t')
            if len(items) >= 2:
                key = items[0]
                label = items[1]
                yield (key, label)
            elif len(items) == 1:
                key = items[0]
                yield (key, '')
            else:
                raise ValueError("format error")
        except ValueError as e:
            logging.error(f"Error: {e} for line {i}: {line}")

To support a different manifest format, implement a function with the same signature: accept an open file-like object and return an iterable of (oss_uri, label) tuples.
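
As an illustration, the sketch below parses a comma-separated manifest instead of a tab-separated one. csv_manifest_parser is a hypothetical name, and the sketch assumes, as the built-in parser above does, that reader.read() returns UTF-8 encoded bytes.

```python
import csv
import io
from typing import Iterable, Tuple


def csv_manifest_parser(reader: io.IOBase) -> Iterable[Tuple[str, str]]:
    """Parse a manifest with one `key,label` or `key` entry per line."""
    text = reader.read().decode("utf-8")
    for row in csv.reader(io.StringIO(text)):
        if not row or not row[0].strip():
            continue  # skip blank lines
        key = row[0].strip()
        # Entries without a label yield an empty string, like the built-in parser.
        label = row[1].strip() if len(row) >= 2 else ""
        yield (key, label)


sample = b"Img/a.png,cat\nImg/b.png\n\nImg/c.png,dog\n"
parsed = list(csv_manifest_parser(io.BytesIO(sample)))
```

A parser like this would be passed as manifest_parser=csv_manifest_parser to from_manifest_file.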

What's next